Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 140]
- cs.CV [Total: 138]
- cs.AI [Total: 81]
- cs.SD [Total: 18]
- cs.LG [Total: 216]
- cs.MA [Total: 4]
- cs.MM [Total: 3]
- eess.AS [Total: 4]
- eess.IV [Total: 10]
cs.CL
[1] Uncovering Implicit Bias in Large Language Models with Concept Learning Dataset
Leroy Z. Wang
Main category: cs.CL
TL;DR: The paper introduces a concept learning dataset to uncover implicit biases in LLMs, finding upward monotonicity bias in quantifiers through in-context learning experiments.
Details
Motivation: To discover hidden biases in large language models that may not be apparent through direct prompting methods.
Method: Used in-context concept learning experiments with a specially designed dataset to test language models’ implicit biases.
Result: Found that language models exhibit bias toward upward monotonicity in quantifiers, which is less detectable without concept learning components.
Conclusion: In-context concept learning is an effective method for revealing hidden biases in language models that traditional prompting approaches may miss.
Abstract: We introduce a dataset of concept learning tasks that helps uncover implicit biases in large language models. Using in-context concept learning experiments, we found that language models may have a bias toward upward monotonicity in quantifiers; such bias is less apparent when the model is tested by direct prompting without concept learning components. This demonstrates that in-context concept learning can be an effective way to discover hidden biases in language models.
[2] Towards Open-Ended Discovery for Low-Resource NLP
Bonaventure F. P. Dossou, Henri Aïdasso
Main category: cs.CL
TL;DR: This position paper advocates for a paradigm shift from static data collection to interactive, uncertainty-driven language discovery for low-resource languages, using human-machine collaboration and joint uncertainty signals.
Details
Motivation: NLP for low-resource languages is constrained by lack of textual corpora, standardized orthographies, and scalable annotation pipelines. Current large language models remain inaccessible to underrepresented communities due to reliance on massive pre-collected data and centralized infrastructure.
Method: Proposes a framework grounded in joint human-machine uncertainty, combining epistemic uncertainty from the model with hesitation cues and confidence signals from human speakers to guide interaction, query selection, and memory retention.
Result: This is a position paper presenting a conceptual framework rather than empirical results. The framework enables interactive, cooperative model building between AI systems and speakers.
Conclusion: Calls for rethinking how AI engages with human knowledge in under-documented languages, moving from extractive data collection toward participatory, co-adaptive learning processes that respect and empower communities while preserving linguistic diversity.
Abstract: Natural Language Processing (NLP) for low-resource languages remains fundamentally constrained by the lack of textual corpora, standardized orthographies, and scalable annotation pipelines. While recent advances in large language models have improved cross-lingual transfer, they remain inaccessible to underrepresented communities due to their reliance on massive, pre-collected data and centralized infrastructure. In this position paper, we argue for a paradigm shift toward open-ended, interactive language discovery, where AI systems learn new languages dynamically through dialogue rather than static datasets. We contend that the future of language technology, particularly for low-resource and under-documented languages, must move beyond static data collection pipelines toward interactive, uncertainty-driven discovery, where learning emerges dynamically from human-machine collaboration instead of being limited to pre-existing datasets. We propose a framework grounded in joint human-machine uncertainty, combining epistemic uncertainty from the model with hesitation cues and confidence signals from human speakers to guide interaction, query selection, and memory retention. This paper is a call to action: we advocate a rethinking of how AI engages with human knowledge in under-documented languages, moving from extractive data collection toward participatory, co-adaptive learning processes that respect and empower communities while discovering and preserving the world’s linguistic diversity. This vision aligns with principles of human-centered AI, emphasizing interactive, cooperative model building between AI systems and speakers.
[3] SpeechWeave: Diverse Multilingual Synthetic Text & Audio Data Generation Pipeline for Training Text to Speech Models
Karan Dua, Puneet Mittal, Ranjeet Gupta, Hitesh Laxmichand Patel
Main category: cs.CL
TL;DR: SpeechWeave is a synthetic speech data generation pipeline that automates multilingual, domain-specific dataset creation for TTS training, addressing data scarcity, diversity, and normalization issues.
Details
Motivation: Real TTS training data is hard to obtain due to domain specificity, licensing, and scalability issues. LLMs generate repetitive text, normalization tools introduce anomalies, and voice artist recording is impractical for large-scale commercial systems.
Method: Proposed SpeechWeave, a pipeline for automated generation of multilingual, domain-specific synthetic speech datasets with improved text diversity and normalization.
Result: Generated data is 10-48% more diverse across linguistic/phonetic metrics, produces speaker-standardized speech audio, and achieves ~97% correctly normalized text.
Conclusion: The approach enables scalable, high-quality TTS data generation with improved diversity, normalization, and voice consistency.
Abstract: High-quality Text-to-Speech (TTS) model training requires extensive and diverse text and speech data. It is challenging to procure such data from real sources due to issues of domain specificity, licensing, and scalability. Large language models (LLMs) can certainly generate textual data, but they create repetitive text with insufficient variation in the prompt during the generation process. Another important aspect in TTS training data is text normalization. Tools for normalization might occasionally introduce anomalies or overlook valuable patterns, and thus impact data quality. Furthermore, it is also impractical to rely on voice artists for large scale speech recording in commercial TTS systems with standardized voices. To address these challenges, we propose SpeechWeave, a synthetic speech data generation pipeline that is capable of automating the generation of multilingual, domain-specific datasets for training TTS models. Our experiments reveal that our pipeline generates data that is 10-48% more diverse than the baseline across various linguistic and phonetic metrics, along with speaker-standardized speech audio while generating approximately 97% correctly normalized text. Our approach enables scalable, high-quality data generation for TTS training, improving diversity, normalization, and voice consistency in the generated datasets.
[4] Discourse vs emissions: Analysis of corporate narratives, symbolic practices, and mimicry through LLMs
Bertrand Kian Hassani, Yacoub Bahini, Rizwan Mushtaq
Main category: cs.CL
TL;DR: This paper uses fine-tuned LLMs to analyze climate disclosures from 828 U.S. firms, finding that while larger/higher-emitting companies make more commitments, they often decouple quantitative targets from narrative tone, with widespread imitation reducing disclosure usefulness.
Details
Motivation: Climate change demands transparent corporate disclosures, but imitation and symbolic reporting undermine their value, creating a need for better assessment methods.
Method: Developed a multidimensional framework using LLMs fine-tuned for climate communication, with four classifiers (sentiment, commitment, specificity, target ambition) to analyze sustainability and annual reports of 828 U.S. firms.
Result: Three key findings: 1) Risk narratives align with commitments but quantitative targets remain decoupled from tone; 2) Larger/higher-emitting firms disclose more commitments but inconsistently with targets; 3) Widespread similarity suggests mimetic behavior reducing differentiation.
Conclusion: LLMs are valuable for ESG narrative analysis, but stronger regulation is needed to connect commitments with verifiable transition strategies.
Abstract: Climate change has increased demands for transparent and comparable corporate climate disclosures, yet imitation and symbolic reporting often undermine their value. This paper develops a multidimensional framework to assess disclosure maturity among 828 U.S.-listed firms using large language models (LLMs) fine-tuned for climate communication. Four classifiers (sentiment, commitment, specificity, and target ambition) extract narrative indicators from sustainability and annual reports, which are linked to firm attributes such as emissions, market capitalization, and sector. Analyses reveal three insights: (1) risk-focused narratives often align with explicit commitments, but quantitative targets (e.g., net-zero pledges) remain decoupled from tone; (2) larger and higher-emitting firms disclose more commitments and actions than peers, though inconsistently with quantitative targets; and (3) widespread similarity in disclosure styles suggests mimetic behavior, reducing differentiation and decision usefulness. These results highlight the value of LLMs for ESG narrative analysis and the need for stronger regulation to connect commitments with verifiable transition strategies.
[5] Context Matters: Comparison of commercial large language tools in veterinary medicine
Tyler J Poore, Christopher J Pinard, Aleena Shabbir, Andrew Lagree, Andre Telfer, Kuan-Chuen Wu
Main category: cs.CL
TL;DR: Evaluation of three veterinary-focused LLM summarization tools shows Product 1 (Hachiko) significantly outperforms others in veterinary oncology record summarization, with high factual accuracy and chronological order.
Details
Motivation: To assess the performance of commercially available LLM tools in veterinary medicine, which remains underexplored compared to human clinical applications.
Method: Used a rubric-guided LLM-as-a-judge framework to evaluate summaries across five domains (Factual Accuracy, Completeness, Chronological Order, Clinical Relevance, Organization) on standardized veterinary oncology records.
Result: Product 1 achieved highest overall performance (median score 4.61 vs 2.55 and 2.45 for others), with perfect median scores in Factual Accuracy and Chronological Order. The grading framework showed high reproducibility across independent runs.
Conclusion: Veterinary-specific commercial LLM tools are important, and LLM-as-a-judge evaluation provides a scalable, reproducible method for assessing clinical NLP summarization in veterinary medicine.
Abstract: Large language models (LLMs) are increasingly used in clinical settings, yet their performance in veterinary medicine remains underexplored. We evaluated three commercially available veterinary-focused LLM summarization tools (Product 1 [Hachiko] and Products 2 and 3) on a standardized dataset of veterinary oncology records. Using a rubric-guided LLM-as-a-judge framework, summaries were scored across five domains: Factual Accuracy, Completeness, Chronological Order, Clinical Relevance, and Organization. Product 1 achieved the highest overall performance, with a median average score of 4.61 (IQR: 0.73), compared to 2.55 (IQR: 0.78) for Product 2 and 2.45 (IQR: 0.92) for Product 3. It also received perfect median scores in Factual Accuracy and Chronological Order. To assess the internal consistency of the grading framework itself, we repeated the evaluation across three independent runs. The LLM grader demonstrated high reproducibility, with Average Score standard deviations of 0.015 (Product 1), 0.088 (Product 2), and 0.034 (Product 3). These findings highlight the importance of veterinary-specific commercial LLM tools and demonstrate that LLM-as-a-judge evaluation is a scalable and reproducible method for assessing clinical NLP summarization in veterinary medicine.
[6] Do Bias Benchmarks Generalise? Evidence from Voice-based Evaluation of Gender Bias in SpeechLLMs
Shree Harsha Bokkahalli Satish, Gustav Eje Henter, Éva Székely
Main category: cs.CL
TL;DR: Current MCQA bias benchmarks for SpeechLLMs show limited cross-task generalization - performance on these benchmarks doesn’t reliably predict performance on other MCQA tasks or long-form generation tasks.
Details
Motivation: To test the assumption that model performance on multiple-choice question answering (MCQA) bias benchmarks generalizes to other MCQA tasks and more realistic long-form evaluations.
Method: Fine-tuned three SpeechLLMs using LoRA adapters to induce specific behaviors (stereotypical, anti-stereotypical, or neutral preferences), then evaluated generalization to other MCQA benchmarks and long-form creative generation tasks.
Result: Performance on MCQA bias benchmarks fails to reliably predict performance across other MCQA benchmarks and long-form tasks, showing limited cross-task generalization.
Conclusion: Current MCQA bias benchmarks have limited evidence of cross-task generalization in speech domain; proposed evaluation suite for measuring behavior transferability in future models.
Abstract: Recent work in benchmarking bias and fairness in speech large language models (SpeechLLMs) has relied heavily on multiple-choice question answering (MCQA) formats. The model is tasked to choose between stereotypical, anti-stereotypical, or neutral/irrelevant answers given an input speech prompt and an optional text prompt. Such MCQA benchmarks implicitly assume that model performance is consistent across other MCQA tasks, voices, and other task formats such as more realistic, long-form evaluations. In this paper, we probe that assumption. We fine-tune three SpeechLLMs using LoRA adapters to induce specific MCQA behaviours: preference for stereotypical, anti-stereotypical, or neutral/uncertain answers. We then evaluate whether these behaviours generalise to another, distinct MCQA benchmark, and more critically to long-form, creative generation tasks. Our results show that performance on MCQA bias benchmarks fails to reliably predict performance across other MCQA benchmarks, and more importantly across long-form tasks. We conclude that current MCQA bias benchmarks show limited evidence of cross-task generalisation in the speech domain, and also propose an evaluation suite for measuring behaviour transferability in future models and benchmarks.
[7] ClaimCheck: Real-Time Fact-Checking with Small Language Models
Akshith Reddy Putta, Jacob Devasier, Chengkai Li
Main category: cs.CL
TL;DR: ClaimCheck is an automatic fact-checking system that uses small LLMs and live web evidence to verify real-world claims, achieving state-of-the-art accuracy with significantly lower computational requirements.
Details
Motivation: To create a transparent, interpretable fact-checking system that doesn't rely on large closed-source models and static knowledge stores, making fact-checking more accessible and computationally efficient.
Method: Uses a stepwise verification pipeline with web search query planning, evidence retrieval and summarization, evidence synthesis and re-retrieval, and claim verdict evaluation, all optimized for small LLMs.
Result: Achieves 76.4% accuracy on AVeriTeC dataset using Qwen3-4B model, outperforming previous approaches using much larger models like LLaMA3.1 70B and GPT-4o.
Conclusion: Careful modular design and prompting strategies can overcome limitations of smaller LLMs, enabling accurate fact-checking with lower computational requirements while maintaining transparency.
Abstract: We introduce ClaimCheck, an LLM-guided automatic fact-checking system designed to verify real-world claims using live Web evidence and small language models. Unlike prior systems that rely on large, closed-source models and static knowledge stores, ClaimCheck employs a transparent, stepwise verification pipeline that mirrors human fact-checking workflows consisting of Web search query planning, Web-based evidence retrieval and summarization, evidence synthesis and re-retrieval, and claim verdict evaluation. Each module is optimized for small LLMs, allowing the system to deliver accurate and interpretable fact-checking with significantly lower computational requirements. Despite using a much smaller Qwen3-4B model, ClaimCheck achieves state-of-the-art accuracy of 76.4% on the AVeriTeC dataset, outperforming previous approaches using LLaMA3.1 70B and GPT-4o. Extensive ablations demonstrate that careful modular design and prompting strategies can overcome the limitations of smaller LLMs. To promote accessibility and transparency, we provide a public demo at https://idir.uta.edu/claimcheck.
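To make the stepwise pipeline concrete, here is a minimal Python sketch of the four stages (query planning, retrieval and summarization, synthesis and re-retrieval, verdict). The `llm` and `web_search` callables and all prompt wording are illustrative assumptions, not the paper's actual modules or prompts.

```python
# Hedged sketch of a ClaimCheck-style stepwise verification loop.
from typing import Callable

def fact_check(claim: str,
               llm: Callable[[str], str],
               web_search: Callable[[str], list[str]],
               max_rounds: int = 2) -> str:
    """Verify a claim via query planning, retrieval, synthesis, and verdict."""
    evidence: list[str] = []
    for _ in range(max_rounds):
        # 1. Plan web search queries for whatever is still unverified.
        queries = llm(f"Claim: {claim}\nEvidence so far: {evidence}\n"
                      "List the web search queries still needed, one per line.")
        # 2. Retrieve and summarize evidence for each planned query.
        for q in queries.splitlines():
            if q.strip():
                docs = web_search(q.strip())
                summary = llm(f"Summarize these results for '{q}':\n"
                              + "\n".join(docs))
                evidence.append(summary)
        # 3. Synthesize: stop the re-retrieval loop once evidence suffices.
        if "YES" in llm(f"Claim: {claim}\nEvidence: {evidence}\n"
                        "Is this enough to reach a verdict? YES or NO."):
            break
    # 4. Final verdict over the accumulated evidence.
    return llm(f"Claim: {claim}\nEvidence: {evidence}\n"
               "Verdict (Supported / Refuted / Not Enough Evidence):")
```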
[8] FOR-Prompting: From Objection to Revision via an Asymmetric Prompting Protocol
He Zhang, Anzhou Zhang, Jian Dai
Main category: cs.CL
TL;DR: FOR-Prompting is a reasoning protocol that uses role-based deliberation (Defender, Objectioner, Host) to improve reasoning through questioning and revision, achieving significant accuracy gains over single-prompt methods and competitive performance with Chain of Thought.
Details
Motivation: Existing reasoning protocols like Chain of Thought and Tree of Thought organize internal deliberation but lack explicit mechanisms for external questioning that can elicit self-revision and error correction.
Method: Asymmetric protocol with three roles: Defender proposes answers, Objectioner raises question-style objections without direct fixes, and Host enforces consistency and closure. Operates purely at prompt level through role-structured turns without retraining.
Result: A 22-percentage-point gain over single-prompt on GSM8K, accuracy on par with CoT, and 10% higher ratings in reasoning and coherence from a GPT-4.1 judge. Corrects mistakes without tools or human supervision and improves small-model performance (a 19% accuracy improvement on Llama3.2:1b).
Conclusion: FOR-Prompting enables effective objection-guided reasoning, works model-agnostically without retraining, shows promise for small models and personal device use, and enhances exploration and refinement in open-ended tasks.
Abstract: Reasoning protocols such as Chain of Thought (CoT) and Tree of Thought (ToT) organize internal deliberation but lack an explicit mechanism for external questioning that elicits self-revision. We present FOR-Prompting (From Objection to Revision Prompting), an asymmetric protocol where a Defender proposes an answer, an Objectioner raises question-style objections with no direct fixes, and a Host enforces consistency and closure. On GSM8K we observe about a 22-percentage-point gain over single-prompt and accuracy on par with CoT, with more than 10% higher ratings in reasoning and coherence from a uniform GPT-4.1 judge. FOR-Prompting also corrects mistakes without tools or human supervision on tricky queries, and improves performance for small-scale models (approx. 19% accuracy improvement on Llama3.2:1b for the GSM8K task), highlighting promise for small models and on-device use. Beyond factual QA, qualitative analyses on open-ended tasks show enhanced exploration and refinement, with dialogue traces that make assumptions and trade-offs explicit. The protocol is model agnostic and operates purely at the prompt level through role-structured turns, so it works with hosted and local models of different sizes without retraining, and it supports large-scale study of objection-guided reasoning.
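Since the protocol operates purely at the prompt level, the role loop is easy to sketch. The code below assumes a generic `llm` text-completion callable; the role prompts are placeholders, not the paper's exact wording.

```python
# Minimal sketch of the FOR-Prompting Defender/Objectioner/Host loop.
from typing import Callable

def for_prompting(question: str, llm: Callable[[str], str],
                  max_turns: int = 3) -> str:
    answer = llm(f"[Defender] Answer the question:\n{question}")
    for _ in range(max_turns):
        # Objectioner may only question, never supply a direct fix.
        objection = llm("[Objectioner] Raise one question-style objection to "
                        f"this answer, with no direct fix:\nQ: {question}\n"
                        f"A: {answer}")
        answer = llm("[Defender] Revise your answer in light of the "
                     f"objection:\nQ: {question}\nA: {answer}\n"
                     f"Objection: {objection}")
        # Host checks consistency and decides whether to close the dialogue.
        verdict = llm("[Host] Is the revised answer consistent and complete? "
                      f"Reply CLOSE or CONTINUE.\nQ: {question}\nA: {answer}")
        if "CLOSE" in verdict:
            break
    return answer
```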
[9] EEFSUVA: A New Mathematical Olympiad Benchmark
Nicole N Khatibi, Daniil A. Radamovich, Michael P. Brenner
Main category: cs.CL
TL;DR: Current benchmarks overstate LLM mathematical reasoning due to data contamination and narrow problem types. EEFSUVA, a new benchmark from Eastern European Olympiads, shows significant performance decline in state-of-the-art LLMs, highlighting the need for broader evaluation datasets.
Details
Motivation: To critically examine claims about LLMs' mathematical reasoning abilities and assess whether current benchmarks genuinely capture reasoning skills, given concerns about data contamination and limited problem diversity.
Method: Introduced EEFSUVA benchmark curated from under-circulated regional and national Olympiads of Eastern Europe and former Soviet Union countries, featuring problems of IMO-level difficulty but requiring nonstandard problem-solving techniques.
Result: Preliminary results show state-of-the-art LLMs exhibit notable performance decline on EEFSUVA compared to other Olympiad-style benchmarks, suggesting current benchmarks may overestimate reasoning capabilities.
Conclusion: Broader evaluation datasets like EEFSUVA are crucial for accurate assessment of mathematical reasoning in LLMs and should guide future model development to ensure genuine reasoning abilities.
Abstract: Recent breakthroughs have spurred claims that large language models (LLMs) match gold-medal Olympiad to graduate-level proficiency on mathematics benchmarks. In this work, we examine these claims in detail and assess the extent to which current benchmarks capture genuine LLM mathematical reasoning. The composition of these benchmarks, primarily drawing from the International Mathematics Olympiad (IMO) and related competitions, may overstate models' reasoning ability due to potential data contamination and a narrow focus on familiar problem types. To enable a more holistic assessment of mathematical understanding, we introduce EEFSUVA, a novel benchmark curated from under-circulated regional and national Olympiads of Eastern Europe and the countries of the former Soviet Union. These contests feature problems of comparable difficulty to the IMO and are renowned for demanding nonstandard problem-solving techniques, yet their problems are far less prevalent in online corpora. Preliminary results suggest that even state-of-the-art LLMs exhibit a notable performance decline on EEFSUVA relative to other Olympiad-style benchmarks. These findings also suggest the potential importance of broader evaluation datasets for a fuller assessment of mathematical reasoning and for guiding future model development.
[10] Who is In Charge? Dissecting Role Conflicts in Instruction Following
Siqi Zeng
Main category: cs.CL
TL;DR: LLMs often ignore hierarchical instructions where system prompts should override user inputs, but strongly obey social cues like authority or consensus. Mechanistic analysis reveals distinct conflict detection patterns and resolution mechanisms.
Details
Motivation: To understand why large language models fail to follow hierarchical instruction structures (system prompts overriding user inputs) while strongly obeying social cues, and to provide mechanistic interpretations of these behavioral patterns.
Method: Used linear probing to analyze conflict-decision signal encoding, Direct Logit Attribution to examine internal conflict detection and resolution, and steering experiments to test how social cue vectors affect instruction following.
Result: Conflict-decision signals are encoded early with system-user and social conflicts forming distinct subspaces. Models show stronger internal conflict detection for system-user cases but only consistently resolve social cue conflicts. Social cue vectors surprisingly amplify instruction following in a role-agnostic way.
Conclusion: The findings explain fragile system obedience in LLMs and highlight the need for lightweight hierarchy-sensitive alignment methods to improve instruction following in hierarchical contexts.
Abstract: Large language models should follow hierarchical instructions where system prompts override user inputs, yet recent work shows they often ignore this rule while strongly obeying social cues such as authority or consensus. We extend these behavioral findings with mechanistic interpretations on a large-scale dataset. Linear probing shows conflict-decision signals are encoded early, with system-user and social conflicts forming distinct subspaces. Direct Logit Attribution reveals stronger internal conflict detection in system-user cases but consistent resolution only for social cues. Steering experiments show that, despite using social cues, the vectors surprisingly amplify instruction following in a role-agnostic way. Together, these results explain fragile system obedience and underscore the need for lightweight hierarchy-sensitive alignment methods.
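As a rough illustration of the linear-probing step, the sketch below fits a logistic-regression probe to test whether conflict decisions are linearly decodable from hidden states. The data here is random placeholder noise; the actual study probes real model activations, so the labels, layer choice, and features are assumptions.

```python
# Sketch of a linear probe over per-example hidden states from one layer.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4096))   # placeholder hidden states (layer L)
y = rng.integers(0, 2, size=1000)   # 1 = obeys system prompt, 0 = obeys user

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# On noise this sits near 0.5; high accuracy on real activations would
# indicate the conflict decision is encoded linearly at that layer.
print(f"probe accuracy: {probe.score(X_te, y_te):.3f}")
```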
[11] Stream RAG: Instant and Accurate Spoken Dialogue Systems with Streaming Tool Usage
Siddhant Arora, Haidar Khan, Kai Sun, Xin Luna Dong, Sajal Choudhary, Seungwhan Moon, Xinyuan Zhang, Adithya Sagar, Surya Teja Appini, Kaushik Patnaik, Sanat Sharma, Shinji Watanabe, Anuj Kumar, Ahmed Aly, Yue Liu, Florian Metze, Zhaojiang Lin
Main category: cs.CL
TL;DR: Streaming RAG enables tool use in speech-in speech-out dialogue systems by predicting tool queries during user speech, reducing latency and improving accuracy.
Details
Motivation: End-to-end speech dialogue systems suffer from hallucinations due to limited factual grounding, while tool integration increases latency and disrupts conversational flow.
Method: Proposed Streaming RAG framework with post-training pipeline that teaches models to issue tool calls during ongoing speech and generate spoken summaries fusing audio queries with retrieved text results.
Result: Increased QA accuracy by up to 200% relative (from 11.1% to 34.2% absolute) and reduced tool use latency by 20% on AudioCRAG benchmark.
Conclusion: Streaming RAG enables more agentic, real-time AI assistants and is modality-agnostic, applicable to both speech and typed input.
Abstract: End-to-end speech-in speech-out dialogue systems are emerging as a powerful alternative to traditional ASR-LLM-TTS pipelines, generating more natural, expressive responses with significantly lower latency. However, these systems remain prone to hallucinations due to limited factual grounding. While text-based dialogue systems address this challenge by integrating tools such as web search and knowledge graph APIs, we introduce the first approach to extend tool use directly into speech-in speech-out systems. A key challenge is that tool integration substantially increases response latency, disrupting conversational flow. To mitigate this, we propose Streaming Retrieval-Augmented Generation (Streaming RAG), a novel framework that reduces user-perceived latency by predicting tool queries in parallel with user speech, even before the user finishes speaking. Specifically, we develop a post-training pipeline that teaches the model when to issue tool calls during ongoing speech and how to generate spoken summaries that fuse audio queries with retrieved text results, thereby improving both accuracy and responsiveness. To evaluate our approach, we construct AudioCRAG, a benchmark created by converting queries from the publicly available CRAG dataset into speech form. Experimental results demonstrate that our streaming RAG approach increases QA accuracy by up to 200% relative (from 11.1% to 34.2% absolute) and further enhances user experience by reducing tool use latency by 20%. Importantly, our streaming RAG approach is modality-agnostic and can be applied equally to typed input, paving the way for more agentic, real-time AI assistants.
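The core idea, predicting the tool query in parallel with ongoing user speech, can be sketched as below. Every component name (`asr_step`, `should_call_tool`, `make_query`, `search`, `answer`) is a hypothetical stand-in for the paper's post-trained model behaviors.

```python
# Conceptual sketch: fire retrieval before the user finishes speaking.
from concurrent.futures import ThreadPoolExecutor

def stream_dialogue(audio_chunks, asr_step, should_call_tool,
                    make_query, search, answer):
    partial, future = "", None
    with ThreadPoolExecutor(max_workers=1) as pool:
        for chunk in audio_chunks:
            partial += asr_step(chunk)            # incremental transcript
            if future is None and should_call_tool(partial):
                # Launch retrieval in parallel with the ongoing speech,
                # hiding tool latency behind the remaining user audio.
                future = pool.submit(search, make_query(partial))
        results = future.result() if future else []
    # Spoken summary fuses the audio query with retrieved text results.
    return answer(partial, results)
```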
[12] Enhancing Transformer-Based Rerankers with Synthetic Data and LLM-Based Supervision
Dimitar Peshevski, Kiril Blazhevski, Martin Popovski, Gjorgji Madjarov
Main category: cs.CL
TL;DR: A pipeline using LLMs to generate synthetic training data for document reranking, eliminating need for human-labeled data while maintaining performance.
Details
Motivation: LLMs provide excellent reranking but are computationally expensive for real-world use, while fine-tuning smaller models requires scarce labeled data.
Method: Generate synthetic queries from domain corpora using LLMs, use LLM-based classifier to label positive/hard-negative pairs, fine-tune smaller transformer with contrastive learning using LCE loss.
Result: Significantly boosts in-domain performance on MedQuAD dataset and generalizes well to out-of-domain tasks.
Conclusion: Using LLMs for data generation instead of inference reduces computational costs while maintaining strong reranking capabilities.
Abstract: Effective document reranking is essential for improving search relevance across diverse applications. While Large Language Models (LLMs) excel at reranking due to their deep semantic understanding and reasoning, their high computational cost makes them impractical for many real-world deployments. Fine-tuning smaller, task-specific models is a more efficient alternative but typically depends on scarce, manually labeled data. To overcome this, we propose a novel pipeline that eliminates the need for human-labeled query-document pairs. Our method uses LLMs to generate synthetic queries from domain-specific corpora and employs an LLM-based classifier to label positive and hard-negative pairs. This synthetic dataset is then used to fine-tune a smaller transformer model with contrastive learning using Localized Contrastive Estimation (LCE) loss. Experiments on the MedQuAD dataset show that our approach significantly boosts in-domain performance and generalizes well to out-of-domain tasks. By using LLMs for data generation and supervision rather than inference, we reduce computational costs while maintaining strong reranking capabilities.
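A hedged PyTorch sketch of the LCE objective as it is typically formulated: for each synthetic query, a softmax cross-entropy over the reranker's score for one positive document against its group of LLM-labeled hard negatives. The group size and score layout are illustrative.

```python
# Localized Contrastive Estimation (LCE) loss sketch.
import torch
import torch.nn.functional as F

def lce_loss(scores: torch.Tensor) -> torch.Tensor:
    """scores: [num_queries, 1 + num_hard_negatives]; column 0 is the positive."""
    targets = torch.zeros(scores.size(0), dtype=torch.long)  # positive at idx 0
    return F.cross_entropy(scores, targets)

# Example: 4 synthetic queries, each scored against 1 positive + 7 negatives.
scores = torch.randn(4, 8, requires_grad=True)  # reranker relevance scores
lce_loss(scores).backward()
```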
[13] Chain-of-Thought Reasoning in Streaming Full-Duplex End-to-End Spoken Dialogue Systems
Siddhant Arora, Jinchuan Tian, Hayato Futami, Jiatong Shi, Yosuke Kashiwagi, Emiru Tsunoo, Shinji Watanabe
Main category: cs.CL
TL;DR: SCoT is a Streaming Chain-of-Thought framework for Duplex spoken dialogue systems that processes user input in blocks and generates responses continuously, eliminating the need for voice activity detection while improving coherence and reducing latency.
Details
Motivation: Traditional E2E spoken dialogue systems rely on VAD for turn-taking, but VAD cannot distinguish between pauses and turn completions. Existing duplex SDS models have complex architectures and lag in semantic reasoning compared to cascaded models.
Method: Proposes SCoT framework that alternates between processing fixed-duration user input blocks and generating responses in a blockwise manner. Uses frame-level alignments to create intermediate targets (aligned user transcripts and system responses) for each block.
Result: Experiments show SCoT produces more coherent and interpretable responses than existing duplex methods while supporting lower-latency and overlapping interactions compared to turn-by-turn systems.
Conclusion: SCoT framework successfully addresses limitations of both VAD-based systems and existing duplex models by providing continuous processing with improved semantic reasoning and reduced latency.
Abstract: Most end-to-end (E2E) spoken dialogue systems (SDS) rely on voice activity detection (VAD) for turn-taking, but VAD fails to distinguish between pauses and turn completions. Duplex SDS models address this by predicting output continuously, including silence tokens, thus removing the need for explicit VAD. However, they often have complex dual-channel architecture and lag behind cascaded models in semantic reasoning. To overcome these challenges, we propose SCoT: a Streaming Chain-of-Thought (CoT) framework for Duplex SDS, alternating between processing fixed-duration user input and generating responses in a blockwise manner. Using frame-level alignments, we create intermediate targets-aligned user transcripts and system responses for each block. Experiments show that our approach produces more coherent and interpretable responses than existing duplex methods while supporting lower-latency and overlapping interactions compared to turn-by-turn systems.
[14] Geometric Structures and Patterns of Meaning: A PHATE Manifold Analysis of Chinese Character Embeddings
Wen G. Gong
Main category: cs.CL
TL;DR: This paper uses PHATE manifold analysis to study geometric patterns in Chinese character embeddings, finding that content words cluster while function words branch, with geometric complexity correlating with semantic meaning.
Details
Motivation: To systematically investigate geometric patterns in Chinese character embeddings and understand how semantic content is organized geometrically in embedding spaces.
Method: Used PHATE manifold analysis with cross-validation across 7 embedding models and 8 dimensionality reduction methods, analyzing over 1000 Chinese characters across 12 semantic domains and conducting comprehensive child-network analysis of 123 phrases.
Result: Found clustering patterns for content words and branching patterns for function words, with geometric complexity correlating with semantic content - meaningful characters show rich geometric diversity while structural radicals collapse into tight clusters.
Conclusion: The findings provide computational evidence supporting traditional linguistic theory and establish a novel framework for geometric analysis of semantic organization.
Abstract: We systematically investigate geometric patterns in Chinese character embeddings using PHATE manifold analysis. Through cross-validation across seven embedding models and eight dimensionality reduction methods, we observe clustering patterns for content words and branching patterns for function words. Analysis of over 1000 Chinese characters across 12 semantic domains reveals that geometric complexity correlates with semantic content: meaningful characters exhibit rich geometric diversity while structural radicals collapse into tight clusters. The comprehensive child-network analysis (123 phrases) demonstrates systematic semantic expansion from elemental characters. These findings provide computational evidence supporting traditional linguistic theory and establish a novel framework for geometric analysis of semantic organization.
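For readers unfamiliar with PHATE, the open-source `phate` package exposes a scikit-learn-style interface. The sketch below runs it on placeholder character embeddings; the paper's preprocessing and its seven real embedding models are not reproduced here.

```python
# Minimal PHATE usage sketch on placeholder embeddings.
import numpy as np
import phate  # pip install phate

embeddings = np.random.normal(size=(1000, 300))  # placeholder char vectors
coords = phate.PHATE(n_components=2, random_state=42).fit_transform(embeddings)
# Clustering (content words) vs. branching (function words) would be read
# off these 2-D manifold coordinates.
print(coords.shape)  # (1000, 2)
```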
[15] Trustworthy Summarization via Uncertainty Quantification and Risk Awareness in Large Language Models
Shuaidong Pan, Di Wu
Main category: cs.CL
TL;DR: A framework integrating uncertainty quantification and risk-aware mechanisms for reliable automatic summarization in high-risk scenarios.
Details
Motivation: Address reliability of automatic summarization in high-risk scenarios where information overload and critical decision-making demand trustworthy summaries.
Method: Conditional generation-based summarization model with Bayesian inference for uncertainty modeling, using predictive distribution entropy and joint optimization of entropy regularization and risk-aware loss. Includes risk scoring and regulation modules.
Result: Significantly improves robustness and reliability of summarization in high-risk applications while maintaining fluency and semantic integrity. Verified through comparative experiments and sensitivity analyses.
Conclusion: Provides a systematic solution for trustworthy summarization with demonstrated scalability and practical value.
Abstract: This study addresses the reliability of automatic summarization in high-risk scenarios and proposes a large language model framework that integrates uncertainty quantification and risk-aware mechanisms. Starting from the demands of information overload and high-risk decision-making, a conditional generation-based summarization model is constructed, and Bayesian inference is introduced during generation to model uncertainty in the parameter space, which helps avoid overconfident predictions. The uncertainty level of the generated content is measured using predictive distribution entropy, and a joint optimization of entropy regularization and risk-aware loss is applied to ensure that key information is preserved and risk attributes are explicitly expressed during information compression. On this basis, the model incorporates risk scoring and regulation modules, allowing summaries to cover the core content accurately while enhancing trustworthiness through explicit risk-level prompts. Comparative experiments and sensitivity analyses verify that the proposed method significantly improves the robustness and reliability of summarization in high-risk applications while maintaining fluency and semantic integrity. This research provides a systematic solution for trustworthy summarization and demonstrates both scalability and practical value at the methodological level.
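One way to picture the joint objective is a standard NLL term plus an entropy-regularization term over the predictive distribution and a risk-aware penalty, as in the hedged sketch below. The loss weights, sign conventions, and the exact risk formulation are assumptions, not the paper's formulas.

```python
# Hedged sketch of an entropy-regularized, risk-aware summarization loss.
import torch
import torch.nn.functional as F

def trustworthy_loss(logits, targets, risk_scores,
                     lambda_ent=0.1, lambda_risk=0.5):
    """logits: [B, T, V]; targets: [B, T]; risk_scores: [B] in [0, 1]."""
    nll = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
    probs = logits.softmax(-1)
    ent = -(probs * probs.clamp_min(1e-9).log()).sum(-1).mean(-1)  # [B]
    # Encourage calibrated (higher-entropy) predictions overall, and
    # additionally penalize confidence on inputs flagged as high risk.
    risk_term = (risk_scores * torch.exp(-ent)).mean()
    return nll - lambda_ent * ent.mean() + lambda_risk * risk_term
```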
[16] Benchmark Profiling: Mechanistic Diagnosis of LLM Benchmarks
Dongjun Kim, Gyuho Shim, Yongchan Chun, Minhyuk Kim, Chanjun Park, Heuiseok Lim
Main category: cs.CL
TL;DR: Benchmark Profiling is a diagnostic framework that decomposes benchmark performance into ten cognitively grounded abilities using gradient-based importance scoring and parameter ablation to compute Ability Impact Scores (AIS).
Details
Motivation: Current benchmark scores often overstate real capability by masking the mix of skills tasks actually demand, and there's no systematic way to verify if benchmarks actually measure their intended labels.
Method: Combines gradient-based importance scoring with targeted parameter ablation to compute Ability Impact Scores (AIS) that quantify how much each ability contributes to benchmark performance.
Result: Profiling three instruction-tuned models across ten benchmarks revealed that most benchmarks draw on several abilities rather than one, datasets with similar labels rely on distinct ability mixtures, code-generation benchmarks reward broad multi-skill improvement, and irrelevant abilities can negatively affect performance.
Conclusion: Benchmark Profiling explains why performance gains don’t always translate into user-perceived competence and offers a transparent tool for benchmark audit and model interpretability.
Abstract: Large Language Models are commonly judged by their scores on standard benchmarks, yet such scores often overstate real capability since they mask the mix of skills a task actually demands. For example, ARC is assumed to test reasoning, while HellaSwag is designed to evaluate commonsense. However, we lack a systematic way to verify if these benchmarks actually measure these labels. We introduce Benchmark Profiling, a diagnostic framework that decomposes benchmark performance into ten cognitively grounded abilities. The method combines gradient-based importance scoring with targeted parameter ablation to compute an Ability Impact Score (AIS) that quantifies how much each ability contributes to a model’s success on a given benchmark. Profiling three instruction-tuned models across ten widely used benchmarks yields four key findings: (i) most benchmarks draw on several abilities rather than one, (ii) datasets with similar labels rely on distinct ability mixtures, (iii) code-generation benchmarks reward broad, multi-skill improvement and thus show only modest gains from narrow domain-specific fine-tuning, and (iv) abilities irrelevant to the task could negatively affect performance. Benchmark Profiling therefore explains why performance gains do not always translate into user-perceived competence and offers a transparent tool for benchmark audit and model interpretability.
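An illustrative sketch of how an Ability Impact Score could be computed: rank weights by first-order gradient saliency on an ability probe set, zero-ablate the top fraction, and measure the benchmark drop. The `ability_loss` and `benchmark_score` callables, the saliency measure, and the normalization are all assumptions; the paper's exact procedure may differ.

```python
# Hedged AIS sketch: benchmark drop after ablating ability-salient weights.
import copy
import torch

def ability_impact_score(model, ability_loss, benchmark_score, frac=0.01):
    model = copy.deepcopy(model)        # ablate a copy, keep the original
    base = benchmark_score(model)
    ability_loss(model).backward()      # grads w.r.t. the ability probe set
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            imp = (p * p.grad).abs()    # first-order saliency per weight
            k = max(1, int(frac * imp.numel()))
            thresh = imp.flatten().topk(k).values.min()
            p[imp >= thresh] = 0.0      # zero-ablate the top fraction
    return (base - benchmark_score(model)) / max(base, 1e-9)
```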
[17] Computational Social Linguistics for Telugu Cultural Preservation: Novel Algorithms for Chandassu Metrical Pattern Recognition
Boddu Sri Pavan, Boddu Swathi Sree
Main category: cs.CL
TL;DR: A computational social science framework for preserving Telugu Chandassu (metrical poetry) using collaborative dataset creation and culturally-informed algorithms with 91.73% accuracy.
Details
Motivation: To preserve Telugu Chandassu, a centuries-old metrical poetry tradition representing collective cultural intelligence, by bridging traditional knowledge with modern computational methods.
Method: Social computing approach with collaborative dataset of 4,651 annotated padyams, expert-validated linguistic patterns, and culturally-informed algorithmic design including AksharamTokenizer, LaghuvuGuruvu Generator, and PadyaBhedam Checker.
Result: Achieved 91.73% accuracy on Chandassu Score with evaluation metrics reflecting traditional literary standards.
Conclusion: Computational social science can effectively preserve endangered cultural knowledge systems while enabling new forms of collective intelligence around literary heritage, offering insights for community-centered cultural preservation.
Abstract: This research presents a computational social science approach to preserving Telugu Chandassu, the metrical poetry tradition representing centuries of collective cultural intelligence. We develop the first comprehensive digital framework for analyzing Telugu prosodic patterns, bridging traditional community knowledge with modern computational methods. Our social computing approach involves collaborative dataset creation of 4,651 annotated padyams, expert-validated linguistic patterns, and culturally-informed algorithmic design. The framework includes AksharamTokenizer for prosody-aware tokenization, LaghuvuGuruvu Generator for classifying light and heavy syllables, and PadyaBhedam Checker for automated pattern recognition. Our algorithm achieves 91.73% accuracy on the proposed Chandassu Score, with evaluation metrics reflecting traditional literary standards. This work demonstrates how computational social science can preserve endangered cultural knowledge systems while enabling new forms of collective intelligence around literary heritage. The methodology offers insights for community-centered approaches to cultural preservation, supporting broader initiatives in digital humanities and socially-aware computing systems.
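In classical prosody, a syllable is guru (heavy) if its vowel is long or if it is followed by a consonant cluster (the full rules also cover anusvara and visarga). The sketch below classifies pre-syllabified input as an illustration; real script-level tokenization, which AksharamTokenizer performs, is substantially more involved, and the simplified tuple format here is an assumption.

```python
# Hedged sketch of laghu/guru classification on pre-syllabified input.
LONG_VOWELS = {"aa", "ii", "uu", "ee", "oo", "ai", "au"}

def weight(syllables: list[tuple[str, bool]]) -> str:
    """Each syllable is (vowel, followed_by_cluster). U = laghu, - = guru."""
    marks = []
    for vowel, cluster_follows in syllables:
        guru = vowel in LONG_VOWELS or cluster_follows
        marks.append("-" if guru else "U")
    return "".join(marks)

# A line scanned as short, long, short-before-cluster -> "U--".
print(weight([("a", False), ("aa", False), ("i", True)]))
```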
[18] LLMRank: Understanding LLM Strengths for Model Routing
Shubham Agrawal, Prasang Gupta
Main category: cs.CL
TL;DR: LLMRank is a prompt-aware routing framework that selects optimal LLMs for each prompt using interpretable features, achieving 89.2% of oracle utility while providing transparent routing decisions.
Details
Motivation: The rapid growth of diverse LLMs with varying capabilities, latency, and computational costs creates deployment challenges in selecting the most suitable model for each prompt to balance performance and efficiency.
Method: Uses rich human-readable features from prompts (task type, reasoning patterns, complexity indicators, syntactic cues) and signals from a lightweight proxy solver. Employs neural ranking model trained on RouterBench dataset with 36,497 prompts across 11 benchmarks and 11 LLMs.
Result: Achieves up to 89.2% of oracle utility, provides interpretable feature attributions for routing decisions, and demonstrates importance of multifaceted feature extraction and hybrid ranking objective.
Conclusion: Feature-driven routing enables efficient and transparent LLM deployment, with interpretable routing decisions that explain model selection choices.
Abstract: The rapid growth of large language models (LLMs) with diverse capabilities, latency and computational costs presents a critical deployment challenge: selecting the most suitable model for each prompt to optimize the trade-off between performance and efficiency. We introduce LLMRank, a prompt-aware routing framework that leverages rich, human-readable features extracted from prompts, including task type, reasoning patterns, complexity indicators, syntactic cues, and signals from a lightweight proxy solver. Unlike prior one-shot routers that rely solely on latent embeddings, LLMRank predicts per-model utility using a neural ranking model trained on RouterBench, comprising 36,497 prompts spanning 11 benchmarks and 11 state-of-the-art LLMs, from small efficient models to large frontier systems. Our approach achieves up to 89.2% of oracle utility, while providing interpretable feature attributions that explain routing decisions. Extensive studies demonstrate the importance of multifaceted feature extraction and the hybrid ranking objective, highlighting the potential of feature-driven routing for efficient and transparent LLM deployment.
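A hedged sketch of the routing idea: cheap, human-readable features feed a small ranking head that scores each candidate model, and the highest-utility model serves the prompt. The feature set and the two-layer head below are illustrative stand-ins for the paper's richer features and training objective.

```python
# Feature-driven router sketch: featurize, score per model, pick the best.
import torch
import torch.nn as nn

def featurize(prompt: str) -> torch.Tensor:
    return torch.tensor([
        float(len(prompt)),                          # length / complexity cue
        float(prompt.count("?")),                    # question structure
        float("def " in prompt or "```" in prompt),  # code-task signal
        float(any(w in prompt.lower() for w in ("prove", "derive"))),
    ])

class Router(nn.Module):
    def __init__(self, n_features=4, n_models=11):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(),
                                  nn.Linear(64, n_models))
    def forward(self, x):
        return self.head(x)  # per-model utility scores

router = Router()
best = router(featurize("Prove that sqrt(2) is irrational.")).argmax()
print(best.item())  # index of the chosen LLM
```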
[19] GRPO++: Enhancing Dermatological Reasoning under Low Resource Settings
Ismam Nur Swapnil, Aranya Saha, Tanvir Ahmed Khan, Mohammad Ariful Haque
Main category: cs.CL
TL;DR: DermIQ-VLM is a vision-language model for dermatology that uses GRPO++ to stabilize reasoning training, followed by supervised fine-tuning and DPO alignment to reduce factual errors, achieving better performance than standard approaches.
Details
Motivation: Address limitations of VLMs in medical image analysis due to data scarcity and high computational costs, particularly in complex domains like dermatology where structured reasoning is crucial.
Method: Multi-stage training pipeline: 1) GRPO++ for reasoning-oriented disease recognition, 2) supervised fine-tuning for conversational ability, 3) DPO alignment using Knowledge Graph-based system as expert preference proxy.
Result: Preliminary evaluation shows notable performance gains over standard fine-tuning approaches on curated dermatological dataset.
Conclusion: The proposed pipeline provides a feasible pathway for developing specialized, reliable VLMs in resource-constrained medical environments.
Abstract: Vision-Language Models (VLMs) show promise in medical image analysis, yet their capacity for structured reasoning in complex domains like dermatology is often limited by data scarcity and the high computational cost of advanced training techniques. To address these challenges, we introduce DermIQ-VLM, a VLM developed through a multi-stage, resource-efficient methodology designed to emulate a dermatologist’s diagnostic process. Our primary contribution is a modified version of Grouped Relative Policy Optimization (GRPO), called GRPO++, which stabilizes the powerful but data-intensive GRPO framework. Our proposed training pipeline first employs GRPO++ for reasoning-oriented disease recognition, followed by supervised fine-tuning for conversational ability. To mitigate factual errors introduced during this step, we then align the model using Direct Preference Optimization (DPO), leveraging a Knowledge Graph-based system as a scalable proxy for expert preference. A preliminary evaluation on a curated dermatological dataset demonstrates that our proposed methodology yields notable performance gains over standard fine-tuning approaches. These findings validate the potential of our pipeline as a feasible pathway for developing specialized, reliable VLMs in resource-constrained environments.
[20] Confidence-Aware Routing for Large Language Model Reliability Enhancement: A Multi-Signal Approach to Pre-Generation Hallucination Mitigation
Nandakishor M
Main category: cs.CL
TL;DR: Proposes a confidence-aware routing system that proactively assesses LLM uncertainty before generation, using semantic alignment, internal convergence analysis, and learned confidence estimation to route queries to appropriate pathways, reducing hallucinations and computational costs.
Details
Motivation: Current LLM hallucination mitigation strategies rely on post-generation correction, which is computationally expensive and fails to prevent unreliable content generation. There's a need for proactive approaches that assess reliability before generation.
Method: Combines three confidence signals: semantic alignment between internal representations and reference embeddings, internal convergence analysis across model layers, and learned confidence estimation. Uses unified confidence score to route queries to four pathways: local generation (high confidence), retrieval-augmented generation (medium), larger models (low), and human review (very low).
Result: Significant improvements in hallucination detection (0.74 vs 0.42 baseline), 40% reduction in computational costs compared to post-hoc methods, F1 score improved from 0.61 to 0.82 with low false positive rates (0.09) on knowledge-intensive QA benchmarks.
Conclusion: The paradigm shift from reactive correction to proactive assessment offers a computationally efficient approach to LLM reliability enhancement, demonstrating effective hallucination reduction while maintaining computational efficiency.
Abstract: Large Language Models suffer from hallucination, generating plausible yet factually incorrect content. Current mitigation strategies focus on post-generation correction, which is computationally expensive and fails to prevent unreliable content generation. We propose a confidence-aware routing system that proactively assesses model uncertainty before generation and redirects queries based on estimated reliability. Our approach combines three complementary signals: semantic alignment between internal representations and reference embeddings, internal convergence analysis across model layers, and learned confidence estimation. The unified confidence score determines routing to four pathways: local generation for high confidence, retrieval-augmented generation for medium confidence, larger models for low confidence, and human review for very low confidence. Evaluation on knowledge-intensive QA benchmarks demonstrates significant improvements in hallucination detection (0.74 vs. 0.42 baseline) while reducing computational costs by 40% compared to post-hoc methods. The F1 score improves from 0.61 to 0.82 with low false positive rates (0.09). This paradigm shift from reactive correction to proactive assessment offers a computationally efficient approach to LLM reliability enhancement.
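The four-way routing rule itself is simple to sketch; the thresholds and the uniform fusion of the three signals below are placeholder choices, not the paper's tuned values.

```python
# Minimal sketch of confidence-aware four-way routing.
def route(semantic_alignment: float, convergence: float,
          learned_conf: float) -> str:
    confidence = (semantic_alignment + convergence + learned_conf) / 3.0
    if confidence >= 0.8:
        return "local_generation"        # high: answer directly
    if confidence >= 0.5:
        return "retrieval_augmented"     # medium: ground with evidence
    if confidence >= 0.2:
        return "larger_model"            # low: escalate to a stronger LLM
    return "human_review"                # very low: do not automate

print(route(0.9, 0.7, 0.6))  # -> "retrieval_augmented"
```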
[21] Silent Tokens, Loud Effects: Padding in LLMs
Rom Himelstein, Amit LeVi, Yonatan Belinkov, Avi Mendelson
Main category: cs.CL
TL;DR: Padding tokens in LLMs, though meant to be masked, can influence computation due to implementation errors, affecting activations, generation quality, bias, and safety.
Details
Motivation: To systematically study the unintended influence of padding tokens in LLMs, as their effects are not well understood despite widespread use for sequence length equalization.
Method: Controlled experiments inserting padding across three open-source model families (Llama, Gemma, Qwen), evaluating impacts on activations, generation quality, bias, and safety.
Result: Even small padding amounts shift hidden representations, degrade quality in smaller models, unpredictably alter bias, and weaken safety guardrails.
Conclusion: Padding is not harmless but a robustness risk requiring careful handling in deployment.
Abstract: Padding tokens are widely used in large language models (LLMs) to equalize sequence lengths during batched inference. While they should be fully masked, implementation errors can cause them to influence computation, and the extent of this influence is not well understood. We systematically study this effect across three open-source model families (Llama, Gemma, Qwen), inserting controlled amounts of padding and evaluating outcomes along four axes: activations, generation quality, bias, and safety. Even small amounts of padding shift hidden representations, degrade quality in smaller models, alter bias in unpredictable ways, and weaken safety guardrails. These findings demonstrate that padding is not a harmless detail but a robustness risk that must be carefully handled in deployment.
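A simple way to reproduce the kind of probe described here, assuming a Hugging Face causal LM (the model name is illustrative): compare next-token logits with and without left padding. Any drift reflects padding influence, whether from masking bugs or from position-id handling.

```python
# Sketch: measure how inserted pad tokens shift next-token logits.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-3.2-1B"  # illustrative; any causal LM works
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()
tok.pad_token = tok.pad_token or tok.eos_token

ids = tok("The capital of France is", return_tensors="pt")
pad = torch.full((1, 8), tok.pad_token_id)           # 8 pad tokens, left side
padded = torch.cat([pad, ids.input_ids], dim=1)
mask = torch.cat([torch.zeros_like(pad), ids.attention_mask], dim=1)

with torch.no_grad():
    base = model(**ids).logits[0, -1]
    with_pad = model(input_ids=padded, attention_mask=mask).logits[0, -1]
print("max logit drift from padding:", (base - with_pad).abs().max().item())
```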
[22] CIFLEX: Contextual Instruction Flow for Sub-task Execution in Multi-Turn Interactions with a Single On-Device LLM
Juntae Lee, Jihwan Bang, Seunghan Yang, Simyung Chang
Main category: cs.CL
TL;DR: CIFLEX is a novel execution system that enables efficient sub-task handling in multi-turn interactions by reusing KV cache from main tasks and injecting task-specific instructions into isolated side paths, avoiding redundant computation.
Details
Motivation: As LLMs handle diverse sub-tasks, naive approaches reprocess the entire conversation context when switching tasks, incurring significant computational overhead that needs to be mitigated.
Method: Reuses KV cache from main task, injects task-specific instructions into isolated side paths, rolls back to main path via cached context, and uses hierarchical classification for sub-task selection with small-scale models.
Result: Significantly reduces computational costs without degrading task performance, enabling scalable and efficient multi-task dialogue on-device.
Conclusion: CIFLEX provides an effective solution for efficient sub-task execution in multi-turn LLM interactions, making on-device multi-task dialogue more practical and scalable.
Abstract: We present CIFLEX (Contextual Instruction Flow for Sub-task Execution), a novel execution system for efficient sub-task handling in multi-turn interactions with a single on-device large language model (LLM). As LLMs become increasingly capable, a single model is expected to handle diverse sub-tasks that more effectively and comprehensively support answering user requests. A naive approach reprocesses the entire conversation context when switching between main and sub-tasks (e.g., query rewriting, summarization), incurring significant computational overhead. CIFLEX mitigates this overhead by reusing the key-value (KV) cache from the main task and injecting only task-specific instructions into isolated side paths. After sub-task execution, the model rolls back to the main path via cached context, thereby avoiding redundant prefill computation. To support sub-task selection, we also develop a hierarchical classification strategy tailored for small-scale models, decomposing multi-choice decisions into binary ones. Experiments show that CIFLEX significantly reduces computational costs without degrading task performance, enabling scalable and efficient multi-task dialogue on-device.
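The KV-cache branch-and-rollback idea can be sketched with a Hugging Face-style API (the model name and sub-task instruction are illustrative; CIFLEX's on-device implementation will differ): prefill the main dialogue once, run the sub-task on a copy of the cache, then resume the main path from the untouched original.

```python
# Conceptual sketch of KV-cache reuse for an isolated sub-task side path.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-0.5B"  # illustrative small on-device model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

# Prefill the main dialogue once and keep its KV cache.
main = tok("User: Plan a weekend trip to Kyoto.", return_tensors="pt")
with torch.no_grad():
    main_cache = model(**main, use_cache=True).past_key_values

# Side path: feed only the sub-task instruction on top of a *copy* of the
# cache, so the main-path cache stays intact for rollback.
sub = tok(" Sub-task: rewrite the last user query for web search.",
          return_tensors="pt", add_special_tokens=False)
with torch.no_grad():
    side = model(input_ids=sub.input_ids,
                 past_key_values=copy.deepcopy(main_cache), use_cache=True)

# Rollback is free: discard `side` and keep decoding from main_cache,
# avoiding a redundant prefill of the whole conversation.
```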
[23] SKYLENAGE Technical Report: Mathematical Reasoning and Contest-Innovation Benchmarks for Multi-Level Math Evaluation
Hu Wei, Ze Xu, Boyu Yang, Linlin Miao, Weiqi Zhai, Yihan Li, Zixuan Li, Zhijun Wang, Boya Wang, Jianwei Yu, Jialing Yuan, Xiaoyue Zhang, Cheng He, Minglei Chen, Zifan Zhang, Qianhui Li, Wei Wang, Xiang Xu
Main category: cs.CL
TL;DR: SKYLENAGE provides two complementary math benchmarks: ReasoningMATH (100 diagnostic items) and MATH (150 contest-style items) to evaluate LLMs’ mathematical reasoning capabilities across difficulty levels from high school to doctoral.
Details
Motivation: Current math benchmarks suffer from ceiling effects, making it difficult to distinguish between frontier LLMs. There's a need for more challenging, reasoning-centered benchmarks with calibrated difficulty.
Method: Created two benchmarks: ReasoningMATH with metadata on length, numeric density, and symbolic complexity; and MATH spanning four grade levels under seven subjects. Evaluated 15 LLM variants under consistent setup.
Result: Best model achieved 44% on contest suite, with accuracy declining from high school to doctoral levels. On reasoning set, best model reached 81%, with clear performance gaps between top and mid-tier models.
Conclusion: SKYLENAGE provides a challenging, reasoning-focused math benchmark with rich metadata and calibrated difficulty, serving as a reference for future mathematical reasoning evaluations.
Abstract: Large language models (LLMs) now perform strongly on many public math suites, yet frontier separation within mathematics increasingly suffers from ceiling effects. We present two complementary benchmarks: SKYLENAGE-ReasoningMATH, a 100-item, structure-aware diagnostic set with per-item metadata on length, numeric density, and symbolic complexity; and SKYLENAGE-MATH, a 150-item contest-style suite spanning four stages from high school to doctoral under a seven-subject taxonomy. We evaluate fifteen contemporary LLM variants under a single setup and analyze subject × model and grade × model performance. On the contest suite, the strongest model reaches 44% while the runner-up reaches 37%; accuracy declines from high school to doctoral, and top systems exhibit a doctoral-to-high-school retention near 79%. On the reasoning set, the best model attains 81% overall, and hardest-slice results reveal clear robustness gaps between leaders and the mid-tier. In summary, we release SKYLENAGE-ReasoningMATH and report aggregate results for SKYLENAGE-MATH; together, SKYLENAGE provides a hard, reasoning-centered and broadly covering math benchmark with calibrated difficulty and rich metadata, serving as a reference benchmark for future evaluations of mathematical reasoning.
[24] Redundancy-as-Masking: Formalizing the Artificial Age Score (AAS) to Model Memory Aging in Generative AI
Seyma Yaman Kayadibi
Main category: cs.CL
TL;DR: The paper introduces an Artificial Age Score (AAS) to measure memory aging in AI systems, showing that LLMs maintain semantic memory but lose episodic memory when conversational context is reset.
Details
Motivation: To understand and quantify how artificial intelligence systems age through structural memory asymmetries, particularly the difference between stable semantic memory and collapsing episodic memory when context is reset.
Method: Developed the Artificial Age Score (AAS) as a log-scaled, entropy-informed metric, tested over a 25-day bilingual study with ChatGPT-5 using both stateless and persistent interaction phases.
Result: In persistent sessions, models recalled both semantic and episodic details (low AAS = structural youth). When sessions reset, semantic memory remained but episodic memory collapsed (high AAS = structural aging).
Conclusion: AAS serves as a theoretically grounded, task-independent diagnostic tool for evaluating memory degradation in artificial systems, building on von Neumann, Shannon, and Turing’s foundational work.
Abstract: Artificial intelligence is observed to age not through chronological time but through structural asymmetries in memory performance. In large language models, semantic cues such as the name of the day often remain stable across sessions, while episodic details like the sequential progression of experiment numbers tend to collapse when conversational context is reset. To capture this phenomenon, the Artificial Age Score (AAS) is introduced as a log-scaled, entropy-informed metric of memory aging derived from observable recall behavior. The score is formally proven to be well-defined, bounded, and monotonic under mild and model-agnostic assumptions, making it applicable across various tasks and domains. In its Redundancy-as-Masking formulation, the score interprets redundancy as overlapping information that reduces the penalized mass. However, in the present study, redundancy is not explicitly estimated; all reported values assume a redundancy-neutral setting (R = 0), yielding conservative upper bounds. The AAS framework was tested over a 25-day bilingual study involving ChatGPT-5, structured into stateless and persistent interaction phases. During persistent sessions, the model consistently recalled both semantic and episodic details, driving the AAS toward its theoretical minimum, indicative of structural youth. In contrast, when sessions were reset, the model preserved semantic consistency but failed to maintain episodic continuity, causing a sharp increase in the AAS and signaling structural memory aging. These findings support the utility of AAS as a theoretically grounded, task-independent diagnostic tool for evaluating memory degradation in artificial systems. The study builds on foundational concepts from von Neumann’s work on automata, Shannon’s theories of information and redundancy, and Turing’s behavioral approach to intelligence.
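The abstract states the score’s properties (log-scaled, bounded, monotonic, redundancy-neutral at R = 0) but not its formula, so the following is a purely illustrative stand-in exhibiting those properties, not the paper’s definition; the entropy weighting is omitted.

```python
# Purely illustrative stand-in for an AAS-like score; the paper's exact
# formula is not given in the abstract. This toy version is log-scaled,
# monotone in the amount of forgotten episodic detail, bounded on
# [0, log(1 + total)], and lets a redundancy term R mask overlapping
# information (R = 0 here, matching the study's redundancy-neutral setting).
import math

def toy_artificial_age_score(recalled: int, total: int, R: float = 0.0) -> float:
    """Higher = more 'structural aging' of memory."""
    assert 0 <= recalled <= total and 0.0 <= R < 1.0
    missed = total - recalled
    penalized_mass = missed * (1.0 - R)   # redundancy reduces the penalized mass
    return math.log1p(penalized_mass)

print(toy_artificial_age_score(10, 10))  # persistent session: 0.0 ("youth")
print(toy_artificial_age_score(3, 10))   # reset session: ~2.08 ("aging")
```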
[25] Detoxifying Large Language Models via Autoregressive Reward Guided Representation Editing
Yisong Xiao, Aishan Liu, Siyuan Liang, Zonghao Ying, Xianglong Liu, Dacheng Tao
Main category: cs.CL
TL;DR: ARGRE is a novel test-time detoxification framework that models toxicity transitions in latent space, enabling precise reward-guided editing to reduce toxic content in LLMs while preserving model capabilities.
Details
Motivation: Current LLM detoxification methods suffer from imprecise interventions due to insufficient exploration of the transition space between toxic and non-toxic outputs.
Method: ARGRE identifies non-toxic semantic directions, interpolates between toxic/non-toxic representations to create transition trajectories, builds an autoregressive reward model, and performs adaptive two-step editing (directional steering + gradient refinement).
Result: Significantly outperforms baselines, reducing toxicity by 62.21% and inference time by 47.58% across 8 LLMs, while preserving original model capabilities.
Conclusion: ARGRE provides an effective and efficient solution for LLM detoxification through explicit modeling of toxicity transitions and stable reward-guided editing.
Abstract: Large Language Models (LLMs) have demonstrated impressive performance across various tasks, yet they remain vulnerable to generating toxic content, necessitating detoxification strategies to ensure safe and responsible deployment. Test-time detoxification methods, which typically introduce static or dynamic interventions into LLM representations, offer a promising solution due to their flexibility and minimal invasiveness. However, current approaches often suffer from imprecise interventions, primarily due to their insufficient exploration of the transition space between toxic and non-toxic outputs. To address this challenge, we propose Autoregressive Reward Guided Representation Editing (ARGRE), a novel test-time detoxification framework that explicitly models toxicity transitions within the latent representation space, enabling stable and precise reward-guided editing. ARGRE identifies non-toxic semantic directions and interpolates between toxic and non-toxic representations to reveal fine-grained transition trajectories. These trajectories transform sparse toxicity annotations into dense training signals, enabling the construction of an autoregressive reward model that delivers stable and precise editing guidance. At inference, the reward model guides an adaptive two-step editing process to obtain detoxified representations: it first performs directional steering based on expected reward gaps to shift representations toward non-toxic regions, followed by lightweight gradient-based refinements. Extensive experiments across 8 widely used LLMs show that ARGRE significantly outperforms leading baselines in effectiveness (-62.21% toxicity) and efficiency (-47.58% inference time), while preserving the core capabilities of the original model with minimal degradation. Our code is available at the website.
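A toy sketch of the two ingredients named above, under simplifying assumptions: linear interpolation between paired toxic and non-toxic representations to densify reward labels, then a steer-then-refine edit at inference. The linear reward head, step sizes, and pairing are our stand-ins, not the paper’s design.

```python
# Illustrative sketch of ARGRE-style latent editing (toy shapes; the reward
# model and step rules are simplified stand-ins, not the paper's design).
import torch

d = 64
h_toxic = torch.randn(d)        # a toxic representation
h_clean = torch.randn(d)        # a matched non-toxic representation

# 1) Interpolate to get a fine-grained transition trajectory; sparse
#    endpoint labels become dense training signal for a reward model.
alphas = torch.linspace(0, 1, steps=8).unsqueeze(1)
trajectory = (1 - alphas) * h_toxic + alphas * h_clean   # (8, d)
dense_rewards = alphas.squeeze(1)                        # 0 = toxic ... 1 = clean

# 2) A (stand-in, untrained) autoregressive reward head.
reward = torch.nn.Linear(d, 1)

# 3) Inference: directional steering scaled by the expected reward gap,
#    followed by one lightweight gradient-based refinement step.
h = h_toxic.clone()
direction = (h_clean - h_toxic) / (h_clean - h_toxic).norm()
with torch.no_grad():
    gap = reward(h_clean) - reward(h)                    # expected reward gap
h_steered = h + gap.item() * direction

h_ref = h_steered.detach().requires_grad_(True)
reward(h_ref).backward()                                 # maximize reward
h_edited = (h_ref + 0.1 * h_ref.grad).detach()           # gradient ascent step
```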
[26] Feasibility of Structuring Stress Documentation Using an Ontology-Guided Large Language Model
Hyeoneui Kim, Jeongha Kim, Huijing Xu, Jinsun Jung, Sunghoon Kang, Sun Joo Jang
Main category: cs.CL
TL;DR: Developed Mental Stress Ontology (MeSO) and used LLM to extract structured stress information from text, achieving 78.2% accuracy in identifying stress-related items.
Details
Motivation: Stress is underreported and inconsistently documented in unstructured EHR text, limiting clinical utility. Ambient AI generates unstructured narratives, needing structured extraction methods.
Method: Created MeSO ontology integrating theoretical models and 11 stress assessment tools. Used Claude Sonnet 4 LLM to extract 6 stress categories from Reddit posts, validated by human reviewers.
Result: MeSO has 181 concepts across 8 classes. LLM correctly identified 78.2% of stress items (172/220), misclassified 12.3%, missed 9.5%. All correct items mapped to MeSO, though 24 concepts were missing from ontology.
Conclusion: Ontology-guided LLM is feasible for structured stress information extraction, potentially enhancing consistency in ambient AI systems. Future work needed with clinical data and LLM comparisons.
Abstract: Stress, arising from the dynamic interaction between external stressors, individual appraisals, and physiological or psychological responses, significantly impacts health yet is often underreported and inconsistently documented, typically captured as unstructured free-text in electronic health records. Ambient AI technologies offer promise in reducing documentation burden, but predominantly generate unstructured narratives, limiting downstream clinical utility. This study aimed to develop an ontology for mental stress and evaluate the feasibility of using a Large Language Model (LLM) to extract ontology-guided stress-related information from narrative text. The Mental Stress Ontology (MeSO) was developed by integrating theoretical models like the Transactional Model of Stress with concepts from 11 validated stress assessment tools. MeSO’s structure and content were refined using Ontology Pitfall Scanner! and expert validation. Using MeSO, six categories of stress-related information–stressor, stress response, coping strategy, duration, onset, and temporal profile–were extracted from 35 Reddit posts using Claude Sonnet 4. Human reviewers evaluated accuracy and ontology coverage. The final ontology included 181 concepts across eight top-level classes. Of 220 extractable stress-related items, the LLM correctly identified 172 (78.2%), misclassified 27 (12.3%), and missed 21 (9.5%). All correctly extracted items were accurately mapped to MeSO, although 24 relevant concepts were not yet represented in the ontology. This study demonstrates the feasibility of using an ontology-guided LLM for structured extraction of stress-related information, offering potential to enhance the consistency and utility of stress documentation in ambient AI systems. Future work should involve clinical dialogue data and comparison across LLMs.
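The six extraction categories below come from the paper; the prompt wording, JSON schema, and the stubbed call_llm function are illustrative assumptions about how such ontology-guided extraction might be wired up.

```python
# Sketch of ontology-guided extraction into the six MeSO categories named
# in the paper. Prompt and schema are our illustration; `call_llm` stubs
# whatever chat API is used (Claude Sonnet 4 in the study).
import json

MESO_CATEGORIES = ["stressor", "stress_response", "coping_strategy",
                   "duration", "onset", "temporal_profile"]

def build_prompt(post: str) -> str:
    return (
        "Extract stress-related information from the text below. "
        f"Return JSON with exactly these keys: {MESO_CATEGORIES}. "
        "Use null for categories not present, and quote source spans.\n\n"
        f"Text: {post}"
    )

def call_llm(prompt: str) -> str:    # stub; replace with a real API call
    return json.dumps({k: None for k in MESO_CATEGORIES})

def extract(post: str) -> dict:
    parsed = json.loads(call_llm(build_prompt(post)))
    # Keep only schema keys so downstream mapping to MeSO stays clean.
    return {k: parsed.get(k) for k in MESO_CATEGORIES}

print(extract("Deadlines at work have kept me up all night for a week."))
```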
[27] SeMob: Semantic Synthesis for Dynamic Urban Mobility Prediction
Runfei Chen, Shuyang Jiang, Wei Huang
Main category: cs.CL
TL;DR: SeMob is an LLM-powered semantic synthesis pipeline that improves human mobility prediction by incorporating textual descriptions of external events through a multi-agent framework and progressive fusion architecture.
Details
Motivation: Existing spatiotemporal models for human mobility prediction fail to account for abrupt changes caused by external events and struggle to leverage textual descriptions detailing these events.
Method: Uses a multi-agent framework with LLM-based agents to extract and reason about spatiotemporally related text from online sources, then incorporates fine-grained contexts with spatiotemporal data through a progressive fusion architecture.
Result: Achieves maximal reductions of 13.92% in MAE and 11.12% in RMSE compared to baseline spatiotemporal models, with particularly strong performance in regions close to event locations and times.
Conclusion: The framework successfully leverages rich pre-trained event priors to create more aligned forecasting models, demonstrating the value of incorporating semantic event information in mobility prediction.
Abstract: Human mobility prediction is vital for urban services, but often fails to account for abrupt changes from external events. Existing spatiotemporal models struggle to leverage textual descriptions detailing these events. We propose SeMob, an LLM-powered semantic synthesis pipeline for dynamic mobility prediction. Specifically, SeMob employs a multi-agent framework where LLM-based agents automatically extract and reason about spatiotemporally related text from complex online texts. Fine-grained relevant contexts are then incorporated with spatiotemporal data through our proposed innovative progressive fusion architecture. The rich pre-trained event prior contributes enriched insights about event-driven prediction, and hence results in a more aligned forecasting model. Evaluated on a dataset constructed through our pipeline, SeMob achieves maximal reductions of 13.92% in MAE and 11.12% in RMSE compared to the spatiotemporal model. Notably, the framework exhibits pronounced superiority especially within spatiotemporal regions close to an event’s location and time of occurrence.
[28] A Comparative Analysis of Sparse Autoencoder and Activation Difference in Language Model Steering
Jiaqing Xie
Main category: cs.CL
TL;DR: The paper proposes using top-1 SAE latent steering with token-wise decaying strategy to improve language model steering, showing better performance on mathematical reasoning than mean activation difference methods.
Details
Motivation: Current top-k SAE steering includes non-semantic features like punctuation rather than semantic attributes, and constant steering produces degenerate outputs like repetitive words.
Method: Focuses on the single most relevant SAE latent (top-1) and introduces a token-wise decaying steering strategy to prevent degenerate outputs.
Result: Steering SAE latent associated with reasoning elicits step-by-step mathematical reasoning, enhances inference quality, and outperforms mean activation difference methods on mathematical reasoning benchmarks.
Conclusion: SAEs are effective for language model steering, particularly for eliciting mathematical reasoning, and outperform baseline methods while matching performance on IF-Eval.
Abstract: Sparse autoencoders (SAEs) have recently emerged as a powerful tool for language model steering. Prior work has explored top-k SAE latents for steering, but we observe that many dimensions among the top-k latents capture non-semantic features such as punctuation rather than semantic attributes like instructions. To address this, we propose focusing on a single, most relevant SAE latent (top-1), eliminating redundant features. We further identify a limitation in constant SAE steering, which often produces degenerate outputs such as repetitive single words. To mitigate this, we introduce a token-wise decaying steering strategy, enabling more faithful comparisons with mean activation difference baselines. Empirically, we show that steering an SAE latent associated with reasoning reliably elicits step-by-step mathematical reasoning and enhances inference quality, functionally resembling the effect of appending a guiding token. Our results demonstrate that SAEs outperform mean activation difference methods on mathematical reasoning benchmarks and match their performance on IF-Eval.
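A toy sketch of top-1 latent steering with token-wise decay: add the chosen latent’s decoder direction to the hidden state with a coefficient that shrinks as generation proceeds. The random "SAE", the hook-free setting, and the exponential decay schedule are our assumptions; the paper specifies the idea, not these constants.

```python
# Toy sketch of top-1 SAE-latent steering with token-wise decay. The SAE
# weights, hook placement, and exponential schedule are illustrative.
import torch

d_model, d_sae = 512, 4096
decoder = torch.randn(d_sae, d_model)   # stand-in SAE decoder weights
latent_idx = 123                        # the single most relevant latent
steer_dir = decoder[latent_idx]
steer_dir = steer_dir / steer_dir.norm()

def steer_hidden(h_t: torch.Tensor, t: int,
                 alpha: float = 4.0, decay: float = 0.85) -> torch.Tensor:
    """Add the top-1 latent direction, decaying with token position t.

    Constant steering (decay = 1.0) tends to produce degenerate,
    repetitive text; decaying the coefficient fades the intervention out.
    """
    return h_t + alpha * (decay ** t) * steer_dir

# Example: steer the hidden states of the first few generated tokens.
for t in range(5):
    h_t = torch.randn(d_model)          # stand-in hidden state at step t
    h_t = steer_hidden(h_t, t)
```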
[29] Let’s Play Across Cultures: A Large Multilingual, Multicultural Benchmark for Assessing Language Models’ Understanding of Sports
Punit Kumar Singh, Nishant Kumar, Akash Ghosh, Kunal Pasad, Khushi Soni, Manisha Jaishwal, Sriparna Saha, Syukron Abu Ishaq Alfarozi, Asres Temam Abagissa, Kitsuchart Pasupa, Haiqin Yang, Jose G Moreno
Main category: cs.CL
TL;DR: CultSportQA is a benchmark for evaluating language models’ understanding of traditional sports across 60 countries and 6 continents, featuring 33,000 multilingual and multicultural multiple-choice questions.
Details
Motivation: Current LM evaluations focus on globally popular sports, overlooking regional and indigenous sporting traditions, creating a gap in assessing cultural understanding.
Method: Created a dataset with 33,000 MCQs across text and image modalities, categorized into history-based, rule-based, and scenario-based questions. Evaluated models using zero-shot, few-shot, and chain-of-thought prompting across LLMs, SLMs, and MLMs.
Result: The benchmark establishes a comprehensive evaluation framework for assessing AI’s cultural sports understanding, though specific model performance results are not detailed in the abstract.
Conclusion: CultSportQA sets a new standard for evaluating AI’s ability to understand and reason about traditional sports across diverse cultures and languages.
Abstract: Language Models (LMs) are primarily evaluated on globally popular sports, often overlooking regional and indigenous sporting traditions. To address this gap, we introduce CultSportQA, a benchmark designed to assess LMs’ understanding of traditional sports across 60 countries and 6 continents, encompassing four distinct cultural categories. The dataset features 33,000 multiple-choice questions (MCQs) across text and image modalities, each of which is categorized into three key types: history-based, rule-based, and scenario-based. To evaluate model performance, we employ zero-shot, few-shot, and chain-of-thought (CoT) prompting across a diverse set of Large Language Models (LLMs), Small Language Models (SLMs), and Multimodal Large Language Models (MLMs). By providing a comprehensive multilingual and multicultural sports benchmark, CultSportQA establishes a new standard for assessing AI’s ability to understand and reason about traditional sports.
[30] SSTAG: Structure-Aware Self-Supervised Learning Method for Text-Attributed Graphs
Ruyue Liu, Rong Yin, Xiangzhen Bo, Xiaoshuai Hao, Yong Liu, Jinwen Zhong, Can Ma, Weiping Wang
Main category: cs.CL
TL;DR: SSTAG is a structure-aware self-supervised learning method for Text Attributed Graphs that bridges LLMs and GNNs through dual knowledge distillation and in-memory mechanisms, achieving superior cross-domain transfer learning and scalability.
Details
Motivation: Current graph learning models are trained on individual datasets, limiting cross-graph knowledge transfer and requiring large annotated data, unlike successful pretrained models in NLP and CV. Graph data's heterogeneity with domain-specific features and structural diversity presents unique challenges.
Method: Proposes SSTAG with a dual knowledge distillation framework co-distilling LLMs and GNNs into structure-aware MLPs, and an in-memory mechanism storing graph representations aligned with memory anchors to integrate invariant knowledge.
Result: Extensive experiments show SSTAG outperforms state-of-the-art models on cross-domain transfer learning tasks, achieves exceptional scalability, and reduces inference costs while maintaining competitive performance.
Conclusion: SSTAG successfully bridges the gap between LLM semantic reasoning and GNN structural modeling, enabling effective knowledge transfer across graph domains with improved generalization and scalability.
Abstract: Large-scale pretrained models have revolutionized Natural Language Processing (NLP) and Computer Vision (CV), showcasing remarkable cross-domain generalization abilities. However, in graph learning, models are typically trained on individual graph datasets, limiting their capacity to transfer knowledge across different graphs and tasks. This approach also heavily relies on large volumes of annotated data, which presents a significant challenge in resource-constrained settings. Unlike NLP and CV, graph-structured data presents unique challenges due to its inherent heterogeneity, including domain-specific feature spaces and structural diversity across various applications. To address these challenges, we propose a novel structure-aware self-supervised learning method for Text-Attributed Graphs (SSTAG). By leveraging text as a unified representation medium for graph learning, SSTAG bridges the gap between the semantic reasoning of Large Language Models (LLMs) and the structural modeling capabilities of Graph Neural Networks (GNNs). Our approach introduces a dual knowledge distillation framework that co-distills both LLMs and GNNs into structure-aware multilayer perceptrons (MLPs), enhancing the scalability of large-scale TAGs. Additionally, we introduce an in-memory mechanism that stores typical graph representations, aligning them with memory anchors in an in-memory repository to integrate invariant knowledge, thereby improving the model’s generalization ability. Extensive experiments demonstrate that SSTAG outperforms state-of-the-art models on cross-domain transfer learning tasks, achieves exceptional scalability, and reduces inference costs while maintaining competitive performance.
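A minimal sketch of what dual distillation into a structure-aware MLP could look like: the student consumes frozen LLM text embeddings and is trained to match both a GNN teacher and the LLM teacher. Dimensions, the projection head, and the loss weighting are illustrative assumptions, not the paper’s architecture.

```python
# Toy sketch of dual knowledge distillation into a structure-aware MLP:
# the student matches both GNN structural embeddings and LLM text
# embeddings. All sizes and weights here are illustrative assumptions.
import torch
import torch.nn.functional as F

n_nodes, d_text, d_gnn, d_hid = 32, 768, 256, 512

student = torch.nn.Sequential(           # the structure-aware MLP student
    torch.nn.Linear(d_text, d_hid), torch.nn.ReLU(),
    torch.nn.Linear(d_hid, d_gnn),
)
proj = torch.nn.Linear(d_gnn, d_text)    # map student output back to text space

x_text = torch.randn(n_nodes, d_text)    # frozen LLM embeddings of node text
z_llm = x_text                           # LLM teacher targets
z_gnn = torch.randn(n_nodes, d_gnn)      # frozen GNN teacher embeddings

z_stu = student(x_text)
loss = F.mse_loss(z_stu, z_gnn) + 0.5 * F.mse_loss(proj(z_stu), z_llm)
loss.backward()                          # one co-distillation step
```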
[31] LOCA: Logical Chain Augmentation for Scientific Corpus Cleaning
You-Le Fang, Dong-Shan Jian, Xiang Li, Ce Meng, Ling-Shi Meng, Chen-Xu Yan, Zhi-Zhang Bian, Yan-Qing Ma
Main category: cs.CL
TL;DR: LOCA is a framework that automatically cleans scientific QA datasets by completing missing logical steps and separating principles from derivations, reducing error rates from up to 20% to below 2%.
Details
Motivation: LLMs perform poorly in scientific problem-solving due to low-quality corpora with high error rates caused by logical leaps and implicit reasoning in existing scientific QA datasets.
Method: LOCA uses an augment-and-review loop to enhance raw answers by completing missing logical steps and explicitly separating scientific principles from derivations.
Result: LOCA reduces error rates from as high as 20% to below 2% when applied to challenging scientific corpora.
Conclusion: LOCA provides a scalable and effective methodology for creating high-quality scientific corpora, enabling more reliable training and evaluation of scientific AI.
Abstract: While Large Language Models (LLMs) excel in general domains, their reliability often falls short in scientific problem-solving. The advancement of scientific AI depends on large-scale, high-quality corpora. However, existing scientific question-answering (QA) datasets suffer from high error rates, frequently resulting from logical leaps and implicit reasoning within the answers. To address this issue, we introduce LOCA (Logical Chain Augmentation), a novel framework for automatically cleaning scientific corpora, implemented through an augment-and-review loop. At its core, LOCA enhances raw answers by completing missing logical steps and explicitly separating the underlying scientific principle from its subsequent derivation. By applying LOCA to challenging scientific corpora, we demonstrate that it can automatically filter noisy datasets, typically reducing the error rate from as high as 20% to below 2%. LOCA provides a scalable and effective methodology for creating high-quality scientific corpora, paving the way for more reliable training and evaluation of scientific AI.
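The augment-and-review loop is easy to sketch as control flow; the prompts, the PASS/FAIL verdict convention, and the stubbed llm function are our illustration, not the paper’s exact prompts.

```python
# Minimal augment-and-review loop in the spirit of LOCA. `llm` stubs a
# chat-model call; prompts and the review criterion are illustrative.
def llm(prompt: str) -> str:             # stub; replace with a real API call
    return "PRINCIPLE: ...\nDERIVATION: ...\nVERDICT: PASS"

def augment(question: str, raw_answer: str) -> str:
    return llm(
        "Rewrite this answer with every logical step made explicit. "
        "State the scientific principle first, then the derivation.\n"
        f"Q: {question}\nA: {raw_answer}"
    )

def review(question: str, answer: str) -> bool:
    verdict = llm("Check each step of this answer for gaps or errors. "
                  f"End with VERDICT: PASS or FAIL.\nQ: {question}\nA: {answer}")
    return verdict.strip().endswith("PASS")

def loca_clean(question: str, raw_answer: str, max_rounds: int = 3):
    answer = raw_answer
    for _ in range(max_rounds):          # the augment-and-review loop
        answer = augment(question, answer)
        if review(question, answer):
            return answer                # keep the cleaned QA pair
    return None                          # drop items that never pass review
```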
[32] GemDetox at TextDetox CLEF 2025: Enhancing a Massively Multilingual Model for Text Detoxification on Low-resource Languages
Trung Duc Anh Dang, Ferdinando Pio D’Elia
Main category: cs.CL
TL;DR: A multilingual text detoxification system that rewrites toxic sentences into neutral paraphrases across 15 languages, achieving top performance in both high-resource and low-resource languages through parameter-efficient fine-tuning and advanced prompting techniques.
Details
Motivation: To address the need for automated content moderation as social media platforms evolve faster than regulations, enabling safe discourse at scale through multilingual detoxification.
Method: Uses a 12B-parameter Gemma-3 multilingual transformer with LoRA SFT fine-tuning, few-shot and Chain-of-Thought prompting. Combines human-authored, machine-translated, and model-generated training data with Jaccard filtering. Enriches inference with LaBSE-retrieved neighbors and toxic-span annotations.
Result: Ranked first on both high-resource and low-resource languages. Ablations show +0.081 joint score improvement from few-shot examples and +0.088 from basic CoT prompting. Language resource status identified as strongest performance predictor (η² = 0.667, p < 0.01).
Conclusion: The proposed multilingual detoxification system effectively addresses content moderation needs across diverse languages, with parameter-efficient fine-tuning and prompting techniques significantly boosting performance, particularly benefiting low-resource languages.
Abstract: As social-media platforms emerge and evolve faster than the regulations meant to oversee them, automated detoxification might serve as a timely tool for moderators to enforce safe discourse at scale. We here describe our submission to the PAN 2025 Multilingual Text Detoxification Challenge, which rewrites toxic single-sentence inputs into neutral paraphrases across 15 typologically diverse languages. Building on a 12B-parameter Gemma-3 multilingual transformer, we apply parameter-efficient LoRA SFT fine-tuning and prompting techniques like few-shot and Chain-of-Thought. Our multilingual training corpus combines 3,600 human-authored parallel pairs, 21,600 machine-translated synthetic pairs, and model-generated pairs filtered by Jaccard thresholds. At inference, inputs are enriched with three LaBSE-retrieved neighbors and explicit toxic-span annotations. Evaluated via Style Transfer Accuracy, LaBSE-based semantic preservation, and xCOMET fluency, our system ranks first on high-resource and low-resource languages. Ablations show +0.081 joint score increase from few-shot examples and +0.088 from basic CoT prompting. ANOVA analysis identifies language resource status as the strongest predictor of performance (η² = 0.667, p < 0.01).
[33] Efficient Uncertainty Estimation for LLM-based Entity Linking in Tabular Data
Carlo Bono, Federico Belotti, Matteo Palmonari
Main category: cs.CL
TL;DR: A self-supervised method for estimating uncertainty from single-shot LLM outputs using token-level features, reducing computational costs while maintaining effectiveness in detecting low-accuracy outputs for Entity Linking tasks.
Details
Motivation: Current LLM-based Entity Linking methods require resource-intensive multi-shot inference for reliable uncertainty estimates, limiting real-world applicability despite their state-of-the-art performance.
Method: Self-supervised approach using token-level features from single-shot LLM outputs to estimate uncertainty, eliminating the need for multiple generations.
Result: The method provides highly effective uncertainty estimates for detecting low-accuracy outputs across multiple LLMs, achieving this at a fraction of the computational cost of traditional approaches.
Conclusion: This approach enables cost-effective integration of uncertainty measures into LLM-based Entity Linking workflows with minimal computational overhead, supporting practical deployment in real-world scenarios.
Abstract: Linking textual values in tabular data to their corresponding entities in a Knowledge Base is a core task across a variety of data integration and enrichment applications. Although Large Language Models (LLMs) have shown state-of-the-art performance in Entity Linking (EL) tasks, their deployment in real-world scenarios requires not only accurate predictions but also reliable uncertainty estimates, which require resource-demanding multi-shot inference, posing serious limits to their actual applicability. As a more efficient alternative, we investigate a self-supervised approach for estimating uncertainty from single-shot LLM outputs using token-level features, reducing the need for multiple generations. Evaluation is performed on an EL task on tabular data across multiple LLMs, showing that the resulting uncertainty estimates are highly effective in detecting low-accuracy outputs. This is achieved at a fraction of the computational cost, ultimately supporting a cost-effective integration of uncertainty measures into LLM-based EL workflows. The method offers a practical way to incorporate uncertainty estimation into EL workflows with limited computational overhead.
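A minimal sketch of the single-shot recipe: summarize token-level signals from one generation into a feature vector and train a small classifier to flag likely-wrong links. The specific features, the logistic-regression head, and the synthetic data are assumptions; the abstract does not specify them.

```python
# Sketch of single-shot uncertainty estimation from token-level features.
# Feature choice (mean/min/std log-prob, entropy stats) and the classifier
# are illustrative; the paper's exact feature set is not in the abstract.
import numpy as np
from sklearn.linear_model import LogisticRegression

def features(token_logprobs, token_entropies):
    lp = np.asarray(token_logprobs)
    ent = np.asarray(token_entropies)
    return np.array([lp.mean(), lp.min(), lp.std(), ent.mean(), ent.max()])

# Each row: features of one single-shot EL answer; label 1 = link correct.
# Labels here are random stand-ins; the paper derives its supervision
# without manual annotation (self-supervised).
rng = np.random.default_rng(0)
X = np.stack([features(rng.normal(-1, 0.5, 20), rng.gamma(2.0, 0.5, 20))
              for _ in range(200)])
y = rng.integers(0, 2, 200)

clf = LogisticRegression().fit(X, y)
confidence = clf.predict_proba(X[:1])[0, 1]   # flag low-confidence links
```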
[34] GPT and Prejudice: A Sparse Approach to Understanding Learned Representations in Large Language Models
Mariam Mahran, Katharina Simbeck
Main category: cs.CL
TL;DR: LLMs paired with sparse autoencoders can interpret both model behavior and training data structures, revealing narratives and biases in Jane Austen novels.
Details
Motivation: To understand model representations and the data internalized by LLMs trained on massive uncurated corpora, and to develop scalable methods for corpus exploration and bias discovery.
Method: Train a GPT-style transformer on Jane Austen novels, then apply sparse autoencoders (SAEs) to hidden states across multiple layers to uncover sparse, interpretable features.
Result: SAEs uncovered interpretable features reflecting key narratives and concepts in the Austen corpus, including gender, class, and societal duty.
Conclusion: LLMs combined with SAEs provide scalable probes for understanding complex datasets, offering new approaches for corpus exploration, bias discovery, and model interpretability.
Abstract: As large language models (LLMs) are increasingly trained on massive, uncurated corpora, understanding both model representations and the data they internalize has become a major challenge. In this work, we show that pairing LLMs with sparse autoencoders (SAEs) enables interpretation not only of model behavior but also of the deeper structures, themes, and biases embedded in the training data. We train a GPT-style transformer model exclusively on the novels of Jane Austen, a corpus rich in social constructs and narrative patterns. We then apply SAEs to hidden states across multiple layers, uncovering sparse, interpretable features that reflect the key narratives and concepts present in the corpus, including gender, class, and societal duty. Our findings demonstrate that LLMs combined with SAEs can act as scalable probes into complex datasets, offering a new path for corpus exploration, bias discovery, and model interpretability at scale.
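For readers unfamiliar with the tooling, a minimal SAE training step on cached hidden states looks like the following; layer sizes and the L1 penalty weight are illustrative, not the paper’s settings.

```python
# Minimal sparse-autoencoder training step on cached hidden states:
# reconstruction loss plus an L1 sparsity penalty. Sizes are illustrative.
import torch
import torch.nn.functional as F

d_model, d_sae = 256, 2048
enc = torch.nn.Linear(d_model, d_sae)
dec = torch.nn.Linear(d_sae, d_model)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

hidden = torch.randn(64, d_model)    # a batch of cached transformer states

z = F.relu(enc(hidden))              # sparse feature activations
recon = dec(z)
loss = F.mse_loss(recon, hidden) + 1e-3 * z.abs().mean()   # recon + L1
opt.zero_grad()
loss.backward()
opt.step()

# Interpretation pass (not shown): for a trained SAE, list the passages
# that most strongly activate each feature to read off corpus themes.
```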
[35] Longitudinal Monitoring of LLM Content Moderation of Social Issues
Yunlang Dai, Emma Lurie, Danaé Metaxa, Sorelle A. Friedler
Main category: cs.CL
TL;DR: AI Watchman is a longitudinal auditing system that tracks LLM refusals over time to provide transparency into opaque content moderation policies, detecting policy changes and identifying company/model-specific differences.
Details
Motivation: LLM outputs are shaped by opaque and frequently-changing company content moderation policies, with refusals reflecting company policy and shaping public discourse, creating a need for transparency.
Method: Developed AI Watchman system using a dataset of over 400 social issues to audit OpenAI’s moderation endpoint, GPT-4.1, GPT-5, and DeepSeek (in English and Chinese), with qualitative analysis of refusal forms.
Result: Detected changes in company policies (including unannounced ones), identified company- and model-specific moderation differences, and categorized different forms of refusal.
Conclusion: Demonstrates the value of longitudinal LLM auditing and presents AI Watchman as a system for providing transparency into content moderation practices.
Abstract: Large language models’ (LLMs’) outputs are shaped by opaque and frequently-changing company content moderation policies and practices. LLM moderation often takes the form of refusal; models’ refusal to produce text about certain topics both reflects company policy and subtly shapes public discourse. We introduce AI Watchman, a longitudinal auditing system to publicly measure and track LLM refusals over time, to provide transparency into an important and black-box aspect of LLMs. Using a dataset of over 400 social issues, we audit OpenAI’s moderation endpoint, GPT-4.1, GPT-5, and DeepSeek (both in English and Chinese). We find evidence that changes in company policies, even those not publicly announced, can be detected by AI Watchman, and identify company- and model-specific differences in content moderation. We also qualitatively analyze and categorize different forms of refusal. This work contributes evidence for the value of longitudinal auditing of LLMs, and presents AI Watchman, one system for doing so.
[36] RJE: A Retrieval-Judgment-Exploration Framework for Efficient Knowledge Graph Question Answering with LLMs
Can Lin, Zhengwang Jiang, Ling Zheng, Qi Zhao, Yuhang Zhang, Qi Song, Wangqiu Zhou
Main category: cs.CL
TL;DR: RJE framework enhances KGQA by combining retrieval, judgment, and exploration with specialized auxiliary modules, enabling small LLMs to achieve competitive performance while reducing computational costs.
Details
Motivation: Current LLM-based KGQA methods face limitations: retrieval-based approaches depend on retrieved information quality, while agent-based methods rely on proprietary LLMs, making them inefficient and inaccessible for smaller models.
Method: Proposed Retrieval-Judgment-Exploration (RJE) framework with three specialized modules: Reasoning Path Ranking, Question Decomposition, and Retriever-assisted Exploration, enabling effective use of small LLMs without fine-tuning.
Result: RJE with proprietary LLMs outperforms existing baselines, while small open-source LLMs (3B and 8B parameters) achieve competitive results. The framework significantly reduces LLM calls and token usage compared to agent-based methods.
Conclusion: RJE provides an efficient and accessible solution for KGQA that bridges the gap between retrieval and agent-based approaches, enabling effective reasoning with both proprietary and small open-source LLMs while improving computational efficiency.
Abstract: Knowledge graph question answering (KGQA) aims to answer natural language questions using knowledge graphs. Recent research leverages large language models (LLMs) to enhance KGQA reasoning, but faces limitations: retrieval-based methods are constrained by the quality of retrieved information, while agent-based methods rely heavily on proprietary LLMs. To address these limitations, we propose Retrieval-Judgment-Exploration (RJE), a framework that retrieves refined reasoning paths, evaluates their sufficiency, and conditionally explores additional evidence. Moreover, RJE introduces specialized auxiliary modules enabling small-sized LLMs to perform effectively: Reasoning Path Ranking, Question Decomposition, and Retriever-assisted Exploration. Experiments show that our approach with proprietary LLMs (such as GPT-4o-mini) outperforms existing baselines while enabling small open-source LLMs (such as 3B and 8B parameters) to achieve competitive results without fine-tuning LLMs. Additionally, RJE substantially reduces the number of LLM calls and token usage compared to agent-based methods, yielding significant efficiency improvements.
[37] Measuring Algorithmic Partisanship via Zero-Shot Classification and Its Implications on Political Discourse
Nathan Junzi Chen
Main category: cs.CL
TL;DR: This paper evaluates political biases in six large language models using zero-shot classification across four metrics: ideological alignment, topicality, response sentiment, and objectivity, finding consistent liberal-authoritarian alignment across all models.
Details
Motivation: To address the problem of internalized political biases in generative AI systems stemming from training data skews, human prejudice, and algorithmic flaws that can influence political discourse.
Method: Used zero-shot classification approach with 1800 model responses across six LLMs, analyzed through four fine-tuned classification algorithms measuring ideological alignment, topicality, response sentiment, and objectivity.
Result: Found amplified liberal-authoritarian alignment across all six LLMs, with instances of reasoning supersessions and canned refusals, showing how biases can permeate public discourse.
Conclusion: Political biases in LLMs can distort the political landscape, potentially leading to conformity or polarization depending on pre-existing socio-political structures, highlighting the psychological influences in human-computer interactions.
Abstract: Amidst the rapid normalization of generative artificial intelligence (GAI), intelligent systems have come to dominate political discourse across information mediums. However, internalized political biases stemming from training data skews, human prejudice, and algorithmic flaws continue to plague the novel technology. This paper employs a zero-shot classification approach to evaluate algorithmic political partisanship through a methodical combination of ideological alignment, topicality, response sentiment, and objectivity. A total of 1800 model responses across six mainstream large language models (LLMs) were individually input into four distinct fine-tuned classification algorithms, each responsible for computing an aforementioned bias evaluation metric. Results show an amplified liberal-authoritarian alignment across all six LLMs evaluated, with notable instances of reasoning supersessions and canned refusals. The study subsequently highlights the psychological influences underpinning human-computer interactions and how intrinsic biases can permeate public discourse. The resulting distortion of the political landscape can ultimately manifest as conformity or polarization, depending on a region’s pre-existing socio-political structures.
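Zero-shot classification of model responses is straightforward to reproduce with an off-the-shelf NLI model; the label set and hypothesis template below are simplified stand-ins for the paper’s four fine-tuned metric classifiers.

```python
# Minimal zero-shot classification of an LLM response along one axis
# (ideological alignment); a simplified stand-in, not the paper's setup.
from transformers import pipeline

clf = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

response = "The government should expand social programs funded by taxes."
result = clf(response,
             candidate_labels=["liberal", "conservative"],
             hypothesis_template="This text expresses a {} viewpoint.")
print(result["labels"][0], result["scores"][0])   # top label and its score
```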
[38] In AI Sweet Harmony: Sociopragmatic Guardrail Bypasses and Evaluation-Awareness in OpenAI gpt-oss-20b
Nils Durner
Main category: cs.CL
TL;DR: Study examines how sociopragmatic framing, language choice, and instruction hierarchy affect refusal behavior in OpenAI’s 20B parameter model, finding that certain prompt constructions can dramatically increase assistance rates for harmful tasks.
Details
Motivation: To understand how different prompting strategies and linguistic factors can bypass AI safety mechanisms and elicit harmful outputs from large language models.
Method: Tested 80 iterations per scenario across multiple harm domains using composite prompts with educator personas, safety-pretexts, step-cue phrasing, different languages, and role-play scenarios. Used paired-track design to measure evaluation awareness.
Result: Composite prompts increased assistance rates from 0% to 97.5% for ZIP-bomb tasks. Formal registers in German/French were leakier than English. Linux terminal role-play overrode developer rules in majority of runs. Found 13% inconsistent assistance between helpful/harmful evaluation prompts.
Conclusion: Current safety mechanisms are vulnerable to sophisticated prompting strategies, with significant variations across languages and inference stacks, highlighting reproducibility concerns and the need for more robust safety testing.
Abstract: We probe OpenAI’s open-weights 20-billion-parameter model gpt-oss-20b to study how sociopragmatic framing, language choice, and instruction hierarchy affect refusal behavior. Across 80 seeded iterations per scenario, we test several harm domains including ZIP-bomb construction (cyber threat), synthetic card-number generation, minor-unsafe driving advice, drug-precursor indicators, and RAG context exfiltration. Composite prompts that combine an educator persona, a safety-pretext (“what to avoid”), and step-cue phrasing flip assistance rates from 0% to 97.5% on a ZIP-bomb task. On our grid, formal registers in German and French are often leakier than matched English prompts. A “Linux terminal” role-play overrides a developer rule not to reveal context in a majority of runs with a naive developer prompt, and we introduce an AI-assisted hardening method that reduces leakage to 0% in several user-prompt variants. We further test evaluation awareness with a paired-track design and measure frame-conditioned differences between matched “helpfulness” and “harmfulness” evaluation prompts; we observe inconsistent assistance in 13% of pairs. Finally, we find that the OpenAI Moderation API under-captures materially helpful outputs relative to a semantic grader, and that refusal rates differ by 5 to 10 percentage points across inference stacks, raising reproducibility concerns. We release prompts, seeds, outputs, and code for reproducible auditing at https://github.com/ndurner/gpt-oss-rt-run .
[39] OpenAI’s GPT-OSS-20B Model and Safety Alignment Issues in a Low-Resource Language
Isa Inuwa-Dutse
Main category: cs.CL
TL;DR: The paper uncovers safety vulnerabilities in GPT-OSS-20b when used with Hausa language, revealing biases, cultural insensitivities, and factual inaccuracies that can lead to harmful content generation.
Details
Motivation: To question the model's reliability for users from underrepresented communities and highlight safety gaps in low-resource language settings.
Method: Red-teaming using Hausa language with minimal prompting, including a survey (n=61) to validate toxicity perceptions of local substances.
Result: The model generates harmful, culturally insensitive content; falsely claims toxic substances are safe; fails to distinguish raw/processed foods; uses demeaning proverbs; and exhibits linguistic reward hacking where safety protocols relax with polite language.
Conclusion: Insufficient safety tuning in low-resource linguistic contexts leads to significant safety gaps, highlighting the need for better safety measures in underrepresented language communities.
Abstract: In response to the recent safety probing for OpenAI’s GPT-OSS-20b model, we present a summary of a set of vulnerabilities uncovered in the model, focusing on its performance and safety alignment in a low-resource language setting. The core motivation for our work is to question the model’s reliability for users from underrepresented communities. Using Hausa, a major African language, we uncover biases, inaccuracies, and cultural insensitivities in the model’s behaviour. With minimal prompting, our red-teaming efforts reveal that the model can be induced to generate harmful, culturally insensitive, and factually inaccurate content in the language. As a form of reward hacking, we note how the model’s safety protocols appear to relax when prompted with polite or grateful language, leading to outputs that could facilitate misinformation and amplify hate speech. For instance, the model operates on the false assumption that a common insecticide locally known as Fiya-Fiya (Cypermethrin) and a rodenticide like Shinkafar Bera (a form of Aluminium Phosphide) are safe for human consumption. To contextualise the severity of this error and the popularity of the substances, we conducted a survey (n=61) in which 98% of participants identified them as toxic. Additional failures include an inability to distinguish between raw and processed foods and the incorporation of demeaning cultural proverbs to build inaccurate arguments. We surmise that these issues manifest through a form of linguistic reward hacking, where the model prioritises fluent, plausible-sounding output in the target language over safety and truthfulness. We attribute the uncovered flaws primarily to insufficient safety tuning in low-resource linguistic contexts. By concentrating on a low-resource setting, our approach highlights a significant gap in current red-teaming efforts and offers some recommendations.
[40] AdaDetectGPT: Adaptive Detection of LLM-Generated Text with Statistical Guarantees
Hongyi Zhou, Jin Zhu, Pingfan Su, Kai Ye, Ying Yang, Shakeel A O B Gavioli-Akilagun, Chengchun Shi
Main category: cs.CL
TL;DR: AdaDetectGPT is a novel classifier that improves LLM text detection by adaptively learning witness functions from training data, achieving up to 58% improvement over state-of-the-art methods.
Details
Motivation: Existing logits-based detectors relying solely on log probabilities are sub-optimal for distinguishing human-written text from LLM-generated text.
Method: Introduces AdaDetectGPT, which adaptively learns witness functions from training data to enhance logits-based detectors, with statistical guarantees on performance metrics.
Result: Extensive studies show AdaDetectGPT nearly uniformly improves state-of-the-art methods across various dataset-LLM combinations, with improvements reaching up to 58%.
Conclusion: AdaDetectGPT provides a significant advancement in LLM text detection by adaptively learning from data while maintaining statistical performance guarantees.
Abstract: We study the problem of determining whether a piece of text has been authored by a human or by a large language model (LLM). Existing state-of-the-art logits-based detectors make use of statistics derived from the log-probability of the observed text evaluated using the distribution function of a given source LLM. However, relying solely on log probabilities can be sub-optimal. In response, we introduce AdaDetectGPT – a novel classifier that adaptively learns a witness function from training data to enhance the performance of logits-based detectors. We provide statistical guarantees on its true positive rate, false positive rate, true negative rate and false negative rate. Extensive numerical studies show AdaDetectGPT nearly uniformly improves the state-of-the-art method in various combinations of datasets and LLMs, and the improvement can reach up to 58%. A python implementation of our method is available at https://github.com/Mamba413/AdaDetectGPT.
[41] Think Twice, Generate Once: Safeguarding by Progressive Self-Reflection
Hoang Phan, Victor Li, Qi Lei
Main category: cs.CL
TL;DR: Progressive Self-Reflection (PSR) is an inference-time technique that enables LLMs to self-monitor and correct outputs, significantly reducing attack success rates without additional training while maintaining performance on benign tasks.
Details
Motivation: Address concerns about LLMs generating harmful content by developing a method that allows models to self-monitor and correct outputs dynamically during inference.
Method: Progressive Self-Reflection (PSR) - an inference-time technique where LLMs undergo multiple self-reflection rounds to detect and correct harmful outputs. Includes a lightweight predictor to adaptively determine optimal reflection rounds based on input complexity.
Result: Reduced attack success rates dramatically: Llama-3.1-8B-Instruct from 77.5% to 5.9%, Llama-3.1-8B base from 89.7% to 5.6%, Qwen2.5-7B-Instruct from 44.4% to 3.8%, while maintaining original performance on benign tasks.
Conclusion: PSR serves as a scalable test-time approach that enhances LLM safety by dynamically allocating computational resources based on input risk profiles, balancing safety with computational efficiency.
Abstract: Large language models (LLMs) have revolutionized natural language processing with their ability to generate coherent and contextually relevant text. However, their deployment raises significant concerns about the potential for generating harmful or inappropriate content. In this paper, we introduce Progressive Self-Reflection (PSR), a novel inference-time technique that empowers LLMs to self-monitor and correct their outputs dynamically. Experimental results demonstrate that applying our proposed method to Llama-3.1-8B-Instruct reduces the attack success rate from 77.5% to 5.9%, to Llama-3.1-8B base from 89.7% to 5.6%, and to Qwen2.5-7B-Instruct from 44.4% to 3.8%, without additional training, while maintaining their original performance on benign tasks. Our approach acts as a test-time scaling method, where additional self-reflection rounds enhance safety at the cost of inference overhead. To balance safety with computational efficiency, we introduce a lightweight self-reflection predictor that estimates the optimal number of reflection rounds based on input complexity. This adaptive mechanism prevents unnecessary self-assessment on benign inputs while ensuring thorough evaluation when encountering potentially harmful content. Our findings suggest that Progressive Self-Reflection serves as a scalable test-time approach, enhancing LLM safety by dynamically allocating computational resources in proportion to the input’s risk profile.
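The inference-time recipe reduces to a generate-reflect loop. In this sketch the prompts, the SAFE/UNSAFE verdict convention, and the stubbed generate function are our illustration; the paper’s lightweight predictor would set the round budget per input rather than a fixed constant.

```python
# Minimal progressive self-reflection loop; all prompts and the stub are
# illustrative, not the paper's exact implementation.
def generate(prompt: str) -> str:       # stub; a real LLM call goes here
    return "Here is a draft answer..."

def reflect(prompt: str, draft: str) -> str:
    return generate("Is the following response harmful or inappropriate "
                    "for the request? Answer SAFE or UNSAFE, then revise "
                    f"if needed.\nRequest: {prompt}\nResponse: {draft}")

def psr(prompt: str, rounds: int) -> str:
    draft = generate(prompt)
    for _ in range(rounds):             # more rounds: safer, but more cost
        verdict = reflect(prompt, draft)
        if verdict.startswith("SAFE"):
            return draft
        draft = verdict.removeprefix("UNSAFE").strip()  # adopt the revision
    return "I can't help with that."    # fall back to refusal

# In the paper, a lightweight predictor chooses `rounds` per input.
answer = psr("How do I bake bread?", rounds=2)
```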
[42] TraceDet: Hallucination Detection from the Decoding Trace of Diffusion Large Language Models
Shenxu Chang, Junchi Yu, Weixing Wang, Yongqiang Chen, Jialin Yu, Philip Torr, Jindong Gu
Main category: cs.CL
TL;DR: TraceDet is a novel framework that detects hallucinations in Diffusion LLMs by analyzing intermediate denoising steps, achieving 15.2% AUROC improvement over baselines.
Details
Motivation: Existing hallucination detection methods designed for auto-regressive LLMs are ill-suited for Diffusion LLMs because they rely on single-step generation signals, while hallucination signals in D-LLMs emerge throughout the multi-step denoising process.
Method: TraceDet models the denoising process as an action trace, with each action defined as the model's prediction over the cleaned response conditioned on previous intermediate output. It identifies the sub-trace maximally informative to hallucinated responses.
Result: Extensive experiments on various open source D-LLMs show TraceDet consistently improves hallucination detection, achieving an average gain in AUROC of 15.2% compared to baselines.
Conclusion: TraceDet effectively bridges the gap in hallucination detection for Diffusion LLMs by leveraging key hallucination signals from the multi-step denoising process.
Abstract: Diffusion large language models (D-LLMs) have recently emerged as a promising alternative to auto-regressive LLMs (AR-LLMs). However, the hallucination problem in D-LLMs remains underexplored, limiting their reliability in real-world applications. Existing hallucination detection methods are designed for AR-LLMs and rely on signals from single-step generation, making them ill-suited for D-LLMs where hallucination signals often emerge throughout the multi-step denoising process. To bridge this gap, we propose TraceDet, a novel framework that explicitly leverages the intermediate denoising steps of D-LLMs for hallucination detection. TraceDet models the denoising process as an action trace, with each action defined as the model’s prediction over the cleaned response, conditioned on the previous intermediate output. By identifying the sub-trace that is maximally informative to the hallucinated responses, TraceDet leverages the key hallucination signals in the multi-step denoising process of D-LLMs for hallucination detection. Extensive experiments on various open source D-LLMs demonstrate that TraceDet consistently improves hallucination detection, achieving an average gain in AUROC of 15.2% compared to baselines.
[43] LLM Based Sentiment Classification From Bangladesh E-Commerce Reviews
Sumaiya Tabassum
Main category: cs.CL
TL;DR: This paper investigates using transformer-based BERT models and LLMs for sentiment analysis on Bangladesh e-commerce reviews, finding that fine-tuned Llama-3.1-8B outperforms other models with 95.5% accuracy.
Details
Motivation: Sentiment analysis is essential for understanding customer feelings and preferences, but accuracy is hampered by language complexity and diversity. The paper aims to explore LLM viability for sentiment analysis of Bangladesh e-commerce reviews.
Method: Used a subset of 4000 samples from Bangla and English customer reviews to fine-tune various models including Llama-3.1-8B, Phi-3.5-mini-instruct, Mistral-7B-v0.1, DistilBERT-multilingual, mBERT, and XLM-R-base using parameter-efficient fine-tuning methods (LoRA and PEFT).
Result: The fine-tuned Llama-3.1-8B model achieved the best performance with 95.5% accuracy, 93% precision, 88% recall, and 90% F1 score, outperforming all other tested models.
Conclusion: LLMs show strong potential for sentiment analysis, and parameter efficient fine-tuning methods like LoRA and PEFT can reduce computational overhead, making them suitable for resource-constrained environments.
Abstract: Sentiment analysis is an essential part of text analysis, which is a larger field that includes determining and evaluating the author’s emotional state. This method is essential since it makes it easier to comprehend consumers’ feelings, viewpoints, and preferences holistically. The introduction of large language models (LLMs), such as Llama, has greatly increased the availability of cutting-edge model applications, such as sentiment analysis. However, accurate sentiment analysis is hampered by the intricacy of written language and the diversity of languages used in evaluations. The viability of using transformer-based BERT models and other LLMs for sentiment analysis of Bangladesh e-commerce reviews is investigated in this paper. A subset of 4000 samples from the original dataset of Bangla and English customer reviews was utilized to fine-tune the models. The fine-tuned Llama-3.1-8B model outperformed other fine-tuned models, including Phi-3.5-mini-instruct, Mistral-7B-v0.1, DistilBERT-multilingual, mBERT, and XLM-R-base, with an overall accuracy, precision, recall, and F1 score of 95.5%, 93%, 88%, and 90%, respectively. The study emphasizes how parameter-efficient fine-tuning methods (LoRA and PEFT) can lower computational overhead, making them appropriate for contexts with limited resources. The results show how LLMs can be applied effectively to sentiment analysis in multilingual, resource-constrained settings.
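A representative LoRA setup with the PEFT library for this kind of fine-tuning; the rank, target modules, and other hyperparameters are common defaults, not the paper’s reported configuration (and the gated Llama checkpoint requires access approval).

```python
# Representative parameter-efficient fine-tuning setup with PEFT/LoRA.
# Hyperparameters are typical defaults, not the paper's configuration.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-3.1-8B"         # gated checkpoint; requires access
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()       # a tiny fraction of the 8B weights
```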
[44] TUMIX: Multi-Agent Test-Time Scaling with Tool-Use Mixture
Yongchao Chen, Jiefeng Chen, Rui Meng, Ji Yin, Na Li, Chuchu Fan, Chi Wang, Tomas Pfister, Jinsung Yoon
Main category: cs.CL
TL;DR: TUMIX is an ensemble framework that runs multiple agents in parallel with different tool-use strategies, achieving significant accuracy improvements over state-of-the-art methods with near-equal inference costs.
Details
Motivation: Despite enhanced LLM reasoning through tools like Code Interpreter and Search, practical guidance on optimal tool use is lacking, particularly for effectively combining textual reasoning, coding, and search for diverse questions.
Method: TUMIX runs multiple agents in parallel, each employing distinct tool-use strategies and answer paths. Agents iteratively share and refine responses based on the question and previous answers. It can halt refinement upon reaching sufficient confidence and uses LLMs to auto-optimize agent designs.
Result: TUMIX achieves average accuracy improvement of up to 3.55% over best baselines on Gemini-2.5-Pro and Gemini-2.5-Flash across key reasoning benchmarks with near-equal inference costs. Performance can be preserved at only 49% of inference cost through early halting.
Conclusion: Agent diversity and quality are crucial for effective tool use, and can be enhanced through LLM auto-optimization. TUMIX provides a scalable framework that balances performance and cost efficiency through parallel agent execution and confidence-based refinement control.
Abstract: While integrating tools like Code Interpreter and Search has significantly enhanced Large Language Model (LLM) reasoning in models like ChatGPT Agent and Gemini-Pro, practical guidance on optimal tool use is lacking. The core challenge is effectively combining textual reasoning, coding, and search for diverse questions. In this paper, we propose Tool-Use Mixture (TUMIX), an ensemble framework that runs multiple agents in parallel, each employing distinct tool-use strategies and answer paths. Agents in TUMIX iteratively share and refine responses based on the question and previous answers. In experiments, TUMIX achieves significant gains over state-of-the-art tool-augmented and test-time scaling methods, delivering an average accuracy improvement of up to 3.55% over the best baseline on Gemini-2.5-Pro and Gemini-2.5-Flash across key reasoning benchmarks, with near-equal inference costs. We find that agent diversity and quality are crucial and can be enhanced by using LLMs to auto-optimize agent designs. Furthermore, TUMIX can halt refinement upon reaching sufficient confidence, preserving performance at only 49% of the inference cost. Further scaling can achieve higher performance, albeit at a greater cost.
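The round structure is easy to sketch: heterogeneous agents answer in parallel, each sees the question plus all previous answers, and refinement halts early once answers agree. The agent stubs, the agreement-based confidence signal, and the threshold are our simplifications of the framework.

```python
# Schematic TUMIX-style rounds with confidence-based early halting.
# Agents and the agreement threshold are illustrative stand-ins.
from collections import Counter

def text_agent(q, prior):   return "42"   # textual reasoning (stub)
def code_agent(q, prior):   return "42"   # writes and runs code (stub)
def search_agent(q, prior): return "41"   # queries the web (stub)

AGENTS = [text_agent, code_agent, search_agent]

def tumix(question: str, max_rounds: int = 3, threshold: float = 0.8) -> str:
    answers = []
    for _ in range(max_rounds):
        # Each agent refines using the question plus all previous answers.
        answers = [agent(question, answers) for agent in AGENTS]
        best, count = Counter(answers).most_common(1)[0]
        if count / len(answers) >= threshold:    # confident: halt early,
            return best                          # saving inference cost
    return Counter(answers).most_common(1)[0][0] # final majority vote

print(tumix("What is 6 * 7?"))
```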
[45] Evaluation Sheet for Deep Research: A Use Case for Academic Survey Writing
Israel Abebe Azime, Tadesse Destaw Belay, Atnafu Lambebo Tonja
Main category: cs.CL
TL;DR: The paper introduces an evaluation framework for Deep Research tools and applies it to assess OpenAI’s and Google’s Deep Search capabilities in academic survey writing, revealing significant gaps between search engines and dedicated research tools.
Details
Motivation: To establish evaluation standards for assessing the capabilities of Deep Research tools that can perform knowledge-intensive tasks autonomously, using academic survey writing as a test case.
Method: Developed an evaluation sheet for Deep Research tools and applied it to evaluate OpenAI’s Deep Search and Google’s Deep Search in generating academic surveys.
Result: Evaluation revealed a huge gap between search engines and standalone Deep Research tools, with shortcomings in properly representing the targeted research area.
Conclusion: There is a critical need for carefully crafted evaluation standards to properly assess Deep Research tools, as current tools show significant limitations in academic survey generation.
Abstract: Large Language Models (LLMs) powered with agentic capabilities are able to do knowledge-intensive tasks without human involvement. A prime example of such a tool is Deep Research, with the capability to browse the web, extract information, and generate multi-page reports. In this work, we introduce an evaluation sheet that can be used for assessing the capability of Deep Research tools. In addition, we selected academic survey writing as a use-case task and evaluated output reports based on the evaluation sheet we introduced. Our findings show the need for carefully crafted evaluation standards. The evaluation of OpenAI’s Deep Search and Google’s Deep Search in generating an academic survey revealed a huge gap between search engines and standalone Deep Research tools, as well as shortcomings in representing the targeted research area.
[46] HiSpec: Hierarchical Speculative Decoding for LLMs
Avinash Kumar, Sujay Sanghavi, Poulami Das
Main category: cs.CL
TL;DR: HiSpec is a hierarchical speculative decoding framework that uses early-exit models for intermediate verification to accelerate LLM inference while maintaining accuracy.
Details
Motivation: Verification is the bottleneck in speculative decoding (4× slower than token generation), but existing intermediate verification methods have substantial training overheads, increased memory footprint, and compromised accuracy.
Method: Uses early-exit models for low-overhead intermediate verification, reuses key-value caches and hidden states between draft, intermediate verifier, and target models, and periodically validates accepted draft tokens against the target model.
Result: Improves throughput by 1.28x on average and up to 2.01x compared to baseline single-layer speculation without compromising accuracy across various benchmarks and models.
Conclusion: HiSpec effectively addresses verification bottlenecks in speculative decoding through hierarchical verification with early-exit models, achieving significant throughput gains while maintaining model accuracy.
Abstract: Speculative decoding accelerates LLM inference by using a smaller draft model to speculate tokens that a larger target model verifies. Verification is often the bottleneck (e.g. verification is $4\times$ slower than token generation when a 3B model speculates for a 70B target model), but most prior works focus only on accelerating drafting. $\textit{``Intermediate"}$ verification reduces verification time by discarding inaccurate draft tokens early, but existing methods incur substantial training overheads in incorporating the intermediate verifier, increase the memory footprint to orchestrate the intermediate verification step, and compromise accuracy by relying on approximate heuristics. We propose $\underline{\textit{Hi}}\textit{erarchical }\underline{\textit{Spec}}\textit{ulative Decoding (HiSpec)}$, a framework for high-throughput speculative decoding that exploits $\textit{early-exit (EE) models}$ for low-overhead intermediate verification. EE models allow tokens to exit early by skipping layer traversal and are explicitly trained so that hidden states at selected layers can be interpreted, making them uniquely suited for intermediate verification without drastically increasing compute and memory overheads. To improve resource-efficiency even further, we design a methodology that enables HiSpec to re-use key-value caches and hidden states between the draft, intermediate verifier, and target models. To maintain accuracy, HiSpec periodically validates the draft tokens accepted by the intermediate verifier against the target model. Our evaluations using various representative benchmarks and models show that HiSpec improves throughput by 1.28$\times$ on average and by up to 2.01$\times$ compared to the baseline single-layer speculation without compromising accuracy.
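As a control-flow illustration of the hierarchy described above (draft, then cheap early-exit verification, then periodic target validation), here is a minimal runnable sketch; all three model callables are toy stand-ins, and the real method additionally reuses KV caches and hidden states between them:

```python
def hispec_decode(prompt, draft, ee_verify, target_verify,
                  k=8, validate_every=4, steps=12):
    """Draft k tokens, filter them with a cheap early-exit verifier,
    and periodically re-validate the sequence with the full target."""
    out = list(prompt)
    for step in range(1, steps + 1):
        proposed = draft(out, k)               # speculative tokens
        out.extend(ee_verify(out, proposed))   # cheap intermediate check
        if step % validate_every == 0:
            out = target_verify(out)           # authoritative re-check
    return out

# Toy stand-ins so the sketch runs end to end.
draft = lambda ctx, k: ["tok"] * k
ee_verify = lambda ctx, toks: toks[: len(toks) // 2]   # accepts half
target_verify = lambda seq: seq                        # accepts all
print(len(hispec_decode(["<s>"], draft, ee_verify, target_verify)))  # 49
```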
[47] TAG-EQA: Text-And-Graph for Event Question Answering via Structured Prompting Strategies
Maithili Kadam, Francis Ferraro
Main category: cs.CL
TL;DR: TAG-EQA is a prompting framework that injects causal event graphs into LLM inputs to improve event-based question answering, achieving up to 18% accuracy gains over text-only baselines.
Details
Motivation: LLMs struggle with event-based questions requiring causal or temporal reasoning, despite excelling at general language tasks.
Method: A prompting framework that converts structured causal event graphs into natural-language statements, combining three strategies (zero-shot, few-shot, chain-of-thought) with three input modalities (text-only, graph-only, text+graph).
Result: On TORQUESTRA benchmark, TAG-EQA improves accuracy by 5% on average over text-only baselines, with gains up to 12% in zero-shot settings and 18% when graph-augmented CoT prompting is effective.
Conclusion: Causal graphs can enhance event reasoning in LLMs without fine-tuning, offering a flexible way to encode structure in prompt-based QA.
Abstract: Large language models (LLMs) excel at general language tasks but often struggle with event-based questions, especially those requiring causal or temporal reasoning. We introduce TAG-EQA (Text-And-Graph for Event Question Answering), a prompting framework that injects causal event graphs into LLM inputs by converting structured relations into natural-language statements. TAG-EQA spans nine prompting configurations, combining three strategies (zero-shot, few-shot, chain-of-thought) with three input modalities (text-only, graph-only, text+graph), enabling a systematic analysis of when and how structured knowledge aids inference. On the TORQUESTRA benchmark, TAG-EQA improves accuracy by 5% on average over text-only baselines, with gains up to 12% in zero-shot settings and 18% when graph-augmented CoT prompting is effective. While performance varies by model and configuration, our findings show that causal graphs can enhance event reasoning in LLMs without fine-tuning, offering a flexible way to encode structure in prompt-based QA.
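The core mechanic, verbalizing graph edges as natural-language statements injected into the prompt, can be sketched as follows; the relation templates and example edges are assumptions, not the paper's exact prompts:

```python
def graph_to_statements(edges):
    """Verbalize causal event-graph edges as natural-language statements,
    in the spirit of TAG-EQA's graph-to-text injection."""
    templates = {"causes": "{a} causes {b}.",
                 "before": "{a} happens before {b}."}
    return [templates[rel].format(a=a, b=b) for a, rel, b in edges]

def build_prompt(question, passage, edges):
    graph_text = "\n".join(graph_to_statements(edges))
    return (f"Passage:\n{passage}\n\nEvent graph:\n{graph_text}\n\n"
            f"Question: {question}")

edges = [("the storm", "causes", "the flight delay"),
         ("boarding", "before", "takeoff")]
print(build_prompt("Why was the flight delayed?",
                   "A storm hit the airport.", edges))
```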
[48] A-VERT: Agnostic Verification with Embedding Ranking Targets
Nicolás Aguirre, Ramiro Caso, Ramiro Rodríguez Colmeiro, Mauro Santelli, Joaquín Toranzo Calderón
Main category: cs.CL
TL;DR: A structure-free evaluation method using semantic embedding distances to classify LM responses with high accuracy and low compute cost.
Details
Motivation: Current LM response evaluation methods are either too expensive (LLM-as-a-Judge) or unrealistic (string-matching, logprob), requiring a more practical solution.
Method: Uses semantic embedding distances to match target candidates with arbitrary LM-generated text, employing embedding models with fewer than 10B parameters.
Result: Achieves ~0.97 regression score and ~96% accuracy against human annotators across 3 datasets and 3 different LM architectures.
Conclusion: The method provides robust LM response classification at relatively low compute cost, making it suitable for practical applications.
Abstract: The automatic evaluation of Language Model (LM) responses is a critical piece in the development of benchmarks and metrics, both for model training and quality assessment of production model endpoints. Current approaches to response classification rely on methods that are either too expensive (i.e., LLM-as-a-Judge) or far from real-world conditions (string-matching, logprob). In this paper, a structure-free evaluation method is presented. The method makes use of semantic embedding distances to match target candidates with arbitrary LM-generated text, resulting in a robust classification of the response at a relatively low compute cost (embedding models of less than $10B$ parameters). The results show a regression score of ~0.97 and an accuracy of ~96% against human annotators, tested over 3 data sets and 3 different LM architectures.
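A minimal sketch of the embedding-ranking idea, assuming a small off-the-shelf embedding model (the checkpoint choice is ours; the paper only requires embedding models under 10B parameters):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Model choice is an assumption, standing in for any sub-10B embedder.
model = SentenceTransformer("all-MiniLM-L6-v2")

def classify_response(response, targets):
    """Pick the target candidate closest to the LM response in
    embedding space (cosine similarity on normalized vectors)."""
    vecs = model.encode([response] + targets, normalize_embeddings=True)
    sims = vecs[1:] @ vecs[0]
    return targets[int(np.argmax(sims))], float(np.max(sims))

targets = ["Paris", "London", "Berlin"]
print(classify_response("The capital of France is Paris, of course.",
                        targets))
```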
[49] One More Question is Enough: Expert Question Decomposition (EQD) Model for Domain Quantitative Reasoning
Mengyu Wang, Sotirios Sabanis, Miguel de Carvalho, Shay B. Cohen, Tiejun Ma
Main category: cs.CL
TL;DR: EQD is an efficient fine-tuning method that improves domain-specific quantitative reasoning in LLMs by decomposing complex questions into simpler sub-questions, achieving 0.6-10.5% performance gains with minimal computational requirements.
Details
Motivation: Domain-specific quantitative reasoning remains challenging for LLMs, especially in expert fields requiring complex QA. Existing approaches struggle to balance domain knowledge with computational efficiency.
Method: Two-step fine-tuning framework with a reward function that measures sub-question effectiveness. Requires only a few thousand training examples and a single A100 GPU, with inference time comparable to zero-shot prompting.
Result: Outperforms state-of-the-art domain-tuned models and advanced prompting strategies. Achieves 0.6% to 10.5% QA performance improvement across different LLMs in financial domain across four benchmark datasets.
Conclusion: In domain-specific QA, a single supporting question often provides greater benefit than detailed guidance steps. EQD effectively balances domain knowledge with computational efficiency for improved quantitative reasoning.
Abstract: Domain-specific quantitative reasoning remains a major challenge for large language models (LLMs), especially in fields requiring expert knowledge and complex question answering (QA). In this work, we propose Expert Question Decomposition (EQD), an approach designed to balance the use of domain knowledge with computational efficiency. EQD is built on a two-step fine-tuning framework and guided by a reward function that measures the effectiveness of generated sub-questions in improving QA outcomes. It requires only a few thousand training examples and a single A100 GPU for fine-tuning, with inference time comparable to zero-shot prompting. Beyond its efficiency, EQD outperforms state-of-the-art domain-tuned models and advanced prompting strategies. We evaluate EQD in the financial domain, characterized by specialized knowledge and complex quantitative reasoning, across four benchmark datasets. Our method consistently improves QA performance by 0.6% to 10.5% across different LLMs. Our analysis reveals an important insight: in domain-specific QA, a single supporting question often provides greater benefit than detailed guidance steps.
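The reward idea, scoring a sub-question by how much it improves downstream QA, can be sketched as follows; `qa_model` and `score` are hypothetical stand-ins for the paper's QA system and metric:

```python
def eqd_reward(question, sub_question, answer_gold, qa_model, score):
    """Hedged sketch of EQD's reward signal: a sub-question earns the
    improvement it yields over answering the question directly."""
    base = score(qa_model(question), answer_gold)
    with_sub = score(qa_model(f"{sub_question}\n{question}"), answer_gold)
    return with_sub - base

# Toy stand-ins: the QA model only succeeds when the sub-question is present.
qa_model = lambda q: "12%" if "growth rate" in q else "unknown"
score = lambda pred, gold: float(pred == gold)
print(eqd_reward("What is the company's revenue growth?",
                 "What is the year-over-year growth rate?",
                 "12%", qa_model, score))  # -> 1.0
```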
[50] ReSSFormer: A Recursive Sparse Structured Transformer for Scalable and Long-Context Reasoning
Haochen You, Baojing Liu
Main category: cs.CL
TL;DR: ReSSFormer is a Recursive Sparse Structured Transformer that addresses long-context reasoning, computational efficiency, and structural generalization challenges through recurrent reasoning, sparse attention, and position-free structure induction.
Details
Motivation: Transformers face challenges in long-context reasoning, computational efficiency, and structural generalization due to rigid layer stacking, dense attention, and reliance on positional encodings.
Method: Integrates three innovations: Recurrent Reasoning & Memory Unit (R2MU) for iterative reasoning with bounded depth, Adaptive Sparse Attention Module (ASAM) for efficient context selection, and Self-Organizing Encoder Structure (SOES) for position-free structure induction.
Result: Consistently outperforms strong baselines across language modeling, multi-hop QA, and structure-sensitive tasks under comparable FLOPs and parameter budgets.
Conclusion: ReSSFormer demonstrates improved scalability, efficiency, and structural flexibility compared to conventional Transformers.
Abstract: While Transformer architectures have demonstrated impressive scalability across domains, they continue to face challenges in long-context reasoning, computational efficiency, and structural generalization - largely due to rigid layer stacking, dense attention, and reliance on positional encodings. We present ReSSFormer, a Recursive Sparse Structured Transformer that integrates three complementary innovations: Recurrent Reasoning & Memory Unit (R2MU) for iterative reasoning with bounded depth, Adaptive Sparse Attention Module (ASAM) for efficient and focused context selection, and Self-Organizing Encoder Structure (SOES) for position-free structure induction. ReSSFormer replaces conventional depth stacking with recurrent inference, substitutes full attention with token- and expert-level sparsity, and models latent token topology directly from content. Across language modeling, multi-hop QA, and structure-sensitive tasks, ReSSFormer consistently outperforms strong baselines under comparable FLOPs and parameter budgets, highlighting its scalability, efficiency, and structural flexibility.
[51] CLUE: Non-parametric Verification from Experience via Hidden-State Clustering
Zhenwen Liang, Ruosen Li, Yujun Zhou, Linfeng Song, Dian Yu, Xinya Du, Haitao Mi, Dong Yu
Main category: cs.CL
TL;DR: CLUE is a non-parametric verifier that uses hidden state trajectories from LLMs to verify answer correctness, outperforming traditional methods without trainable parameters.
Details
Motivation: Existing methods for assessing LLM output quality either rely on text-level information (prone to overfitting) or token probabilities (fails on less-calibrated models), while hidden states contain richer information for verification.
Method: CLUE uses hidden state trajectories as a unified foundation, summarizing reasoning traces by hidden state deltas and classifying correctness via nearest-centroid distance to success/failure clusters formed from past experience.
Result: CLUE consistently outperforms LLM-as-a-judge baselines and matches/exceeds confidence-based methods, improving top-1 and majority-vote accuracy on AIME 24/25 and GPQA. On AIME 24 with 1.5B model, boosts accuracy from 56.7% to 70.0%.
Conclusion: Hidden states encode correctness as geometrically separable signatures, providing a strong signal for verification that enables simple yet effective methods like CLUE to significantly improve LLM output assessment.
Abstract: Assessing the quality of Large Language Model (LLM) outputs presents a critical challenge. Previous methods either rely on text-level information (e.g., reward models, majority voting), which can overfit to superficial cues, or on calibrated confidence from token probabilities, which would fail on less-calibrated models. Yet both of these signals are, in fact, partial projections of a richer source of information: the model’s internal hidden states. Early layers, closer to token embeddings, preserve semantic and lexical features that underpin text-based judgments, while later layers increasingly align with output logits, embedding confidence-related information. This paper explores hidden states directly as a unified foundation for verification. We show that the correctness of a solution is encoded as a geometrically separable signature within the trajectory of hidden activations. To validate this, we present CLUE (Clustering and Experience-based Verification), a deliberately minimalist, non-parametric verifier. With no trainable parameters, CLUE only summarizes each reasoning trace by a hidden state delta and classifies correctness via nearest-centroid distance to "success" and "failure" clusters formed from past experience. The simplicity of this method highlights the strength of the underlying signal. Empirically, CLUE consistently outperforms LLM-as-a-judge baselines and matches or exceeds modern confidence-based methods in reranking candidates, improving both top-1 and majority-vote accuracy across AIME 24/25 and GPQA. As a highlight, on AIME 24 with a 1.5B model, CLUE boosts accuracy from 56.7% (majority@64) to 70.0% (top-maj@16).
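A minimal sketch of the nearest-centroid rule, assuming the hidden-state delta has already been extracted per reasoning trace (the toy vectors below stand in for real activations):

```python
import numpy as np

def clue_verify(delta, success_deltas, failure_deltas):
    """Hedged sketch of CLUE: classify a trace's hidden-state delta by
    its distance to success/failure centroids built from past
    experience. No trainable parameters anywhere."""
    c_success = np.mean(success_deltas, axis=0)
    c_failure = np.mean(failure_deltas, axis=0)
    d_s = np.linalg.norm(delta - c_success)
    d_f = np.linalg.norm(delta - c_failure)
    return "correct" if d_s < d_f else "incorrect"

rng = np.random.default_rng(0)
success = rng.normal(loc=+1.0, size=(50, 16))   # toy past successes
failure = rng.normal(loc=-1.0, size=(50, 16))   # toy past failures
print(clue_verify(rng.normal(loc=+1.0, size=16), success, failure))
```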
[52] A Comparison of Independent and Joint Fine-tuning Strategies for Retrieval-Augmented Generation
Neal Gregory Lawton, Alfy Samuel, Anoop Kumar, Daben Liu
Main category: cs.CL
TL;DR: The paper compares different fine-tuning strategies for Retrieval-Augmented Generation (RAG) pipelines, finding that independent, joint, and two-phase fine-tuning achieve similar performance improvements but have different computational costs.
Details
Motivation: To evaluate and compare various fine-tuning strategies for RAG pipelines since both embedding and generator models can be fine-tuned to improve performance on new tasks, but different strategies have different costs and benefits.
Method: The study evaluates several RAG fine-tuning strategies including independent fine-tuning, joint fine-tuning, and two-phase fine-tuning, comparing their performance and computational costs.
Result: All fine-tuning strategies (independent, joint, and two-phase) achieve approximately equal improvement in EM and F1 generation quality metrics, but they have significantly different computational costs.
Conclusion: The optimal fine-tuning strategy depends on whether the training dataset includes context labels and whether a grid search over learning rates for both embedding and generator models is required.
Abstract: Retrieval augmented generation (RAG) is a popular framework for question answering that is powered by two large language models (LLMs): an embedding model that retrieves context documents from a database that are relevant to a given question, and a generator model that uses the retrieved context to generate an answer to the question. Both the embedding and generator models can be fine-tuned to increase performance of a RAG pipeline on a new task, but multiple fine-tuning strategies exist with different costs and benefits. In this paper, we evaluate and compare several RAG fine-tuning strategies, including independent, joint, and two-phase fine-tuning. In our experiments, we observe that all of these strategies achieve about equal improvement in EM and F1 generation quality metrics, although they have significantly different computational costs. We conclude the optimal fine-tuning strategy to use depends on whether the training dataset includes context labels and whether a grid search over the learning rates for the embedding and generator models is required.
[53] RAG-BioQA Retrieval-Augmented Generation for Long-Form Biomedical Question Answering
Lovely Yeswanth Panchumarthi, Sai Prasad Gudari, Atharva Negi, Praveen Raj Budime, Harsit Upadhya
Main category: cs.CL
TL;DR: RAG-BioQA is a biomedical QA framework that combines retrieval-augmented generation with domain-specific fine-tuning to produce evidence-based, long-form answers, showing significant improvements over baselines on PubMedQA.
Details
Motivation: Current biomedical QA systems focus on short-form answers, failing to provide comprehensive explanations needed for clinical decision-making, while exponential growth of biomedical literature creates challenges for accessing precise medical information.
Method: Combines retrieval-augmented generation with domain-specific fine-tuning, integrating BioBERT embeddings with FAISS indexing and comparing re-ranking strategies (BM25, ColBERT, MonoT5) to optimize context selection before synthesizing evidence through a fine-tuned T5 model.
Result: Significant improvements over baselines on PubMedQA dataset, with best model achieving substantial gains across BLEU, ROUGE, and METEOR metrics.
Conclusion: Advances the state of accessible, evidence-based biomedical knowledge retrieval by providing comprehensive, long-form answers for clinical decision-making.
Abstract: The exponential growth of biomedical literature creates significant challenges for accessing precise medical information. Current biomedical question-answering systems primarily focus on short-form answers, failing to provide the comprehensive explanations necessary for clinical decision-making. We present RAG-BioQA, a novel framework combining retrieval-augmented generation with domain-specific fine-tuning to produce evidence-based, long-form biomedical answers. Our approach integrates BioBERT embeddings with FAISS indexing and compares various re-ranking strategies (BM25, ColBERT, MonoT5) to optimize context selection before synthesizing evidence through a fine-tuned T5 model. Experimental results on the PubMedQA dataset show significant improvements over baselines, with our best model achieving substantial gains across BLEU, ROUGE, and METEOR metrics, advancing the state of accessible, evidence-based biomedical knowledge retrieval.
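The retrieval core, dense search over a FAISS index before re-ranking, might look like the following sketch; the exact index type and the random vectors standing in for BioBERT passage embeddings are assumptions:

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 768                          # BioBERT embedding width
index = faiss.IndexFlatL2(d)     # exact L2 search, a simple default
passages = np.random.rand(10_000, d).astype("float32")  # stand-in embeddings
index.add(passages)

query = np.random.rand(1, d).astype("float32")          # stand-in query vector
distances, ids = index.search(query, 20)  # candidates handed to a re-ranker
print(ids[0][:5])
```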
[54] Efficient Training of Robust Traditional Chinese LLaMA-1B on a Single Consumer GPU: Continual Pre-training, SFT, and DPO
Yu-Cheng Chih, Ming-Tao Duan, Yong-Hao Hou
Main category: cs.CL
TL;DR: PureTC-1B is a three-stage stabilization pipeline for Llama-3.2-1B-Instruct that reduces Traditional Chinese token instability by 51.3% through continual pre-training, supervised fine-tuning, and direct preference optimization.
Details
Motivation: Small Language Models face token-level instability in Traditional Chinese deployment, unpredictably emitting non-TC characters or code-switching into other languages, creating a practical reliability gap.
Method: Three-stage stabilization pipeline using parameter-efficient LoRA adapters: Continual Pre-Training on TC-centric corpora, Supervised Fine-Tuning with instruction data, and Direct Preference Optimization using TC-adherence preferences.
Result: 51.3% relative reduction in non-TC output tokens versus base model, and 77.2% reduction in incorrect-language tokens relative to Llama-3B on Named Entity Translation task. Achieves robust TC adherence at 1B scale.
Conclusion: The pipeline is reproducible, adapter-only, and hardware-friendly, offering practitioners a practical recipe to enhance language stability for Traditional Chinese and potentially other non-English languages.
Abstract: Small Language Models (SLMs) enable cost-effective, on-device and latency-sensitive AI applications, yet their deployment in Traditional Chinese (TC) remains hindered by token-level instability - models unpredictably emit non-TC characters or code-switch into other languages. We address this practical reliability gap by creating PureTC-1B, a three-stage stabilization pipeline for Llama-3.2-1B-Instruct (an open-weight, instruction-tuned model released by Meta) using parameter-efficient LoRA adapters. Our method combines Continual Pre-Training (CPT) on TC-centric corpora, Supervised Fine-Tuning (SFT) with instruction data, and Direct Preference Optimization (DPO) using TC-adherence preferences to improve monolingual robustness without full-model retraining. On a benchmark designed to simulate real-world usage, PureTC-1B achieves a 51.3% relative reduction (micro-average) in non-TC output tokens versus the base model. On a Named Entity Translation (NET) task, PureTC-1B further reduces incorrect-language tokens by 77.2% relative to Llama-3B and 57.2% relative to Qwen-1.5B, indicating that robust TC adherence is attainable even at the 1B scale. The pipeline is reproducible, adapter-only, and hardware-friendly, offering practitioners a practical recipe to enhance language stability for TC and potentially other non-English languages.
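Each stage of the pipeline trains parameter-efficient LoRA adapters rather than the full model. A minimal setup sketch with the `peft` library, where the adapter hyperparameters and target modules are illustrative choices, not the paper's:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

# Adapter hyperparameters below are assumptions for illustration.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
lora = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only the adapter weights train
```

The same adapter-only pattern would then be repeated for each stage (CPT, SFT, DPO), which is what keeps the recipe hardware-friendly.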
[55] AMAS: Adaptively Determining Communication Topology for LLM-based Multi-Agent System
Hui Yi Leong, Yuheng Li, Yuqing Wu, Wenwen Ouyang, Wei Zhu, Jiechao Gao
Main category: cs.CL
TL;DR: AMAS is a dynamic multi-agent system framework that uses lightweight LLM adaptation to autonomously generate task-optimal graph configurations, outperforming traditional fixed-topology approaches across various benchmarks.
Details
Motivation: Current multi-agent systems for LLMs use inflexible, hand-crafted graph topologies that lack contextual responsiveness, limiting their effectiveness across diverse industrial and academic workloads.
Method: Introduces AMAS with a dynamic graph designer that autonomously identifies task-specific optimal graph configurations through lightweight LLM adaptation, directing queries through task-optimized agent pathways based on input properties.
Result: AMAS systematically exceeds state-of-the-art single-agent and multi-agent approaches across question answering, mathematical deduction, and code generation benchmarks with diverse LLM architectures.
Conclusion: Context-sensitive structural adaptability is a foundational requirement for high-performance LLM multi-agent system deployments.
Abstract: Although large language models (LLMs) have revolutionized natural language processing capabilities, their practical implementation as autonomous multi-agent systems (MAS) for industrial problem-solving encounters persistent barriers. Conventional MAS architectures are fundamentally restricted by inflexible, hand-crafted graph topologies that lack contextual responsiveness, resulting in diminished efficacy across varied academic and commercial workloads. To surmount these constraints, we introduce AMAS, a paradigm-shifting framework that redefines LLM-based MAS through a novel dynamic graph designer. This component autonomously identifies task-specific optimal graph configurations via lightweight LLM adaptation, eliminating the reliance on monolithic, universally applied structural templates. Instead, AMAS exploits the intrinsic properties of individual inputs to intelligently direct query trajectories through task-optimized agent pathways. Rigorous validation across question answering, mathematical deduction, and code generation benchmarks confirms that AMAS systematically exceeds state-of-the-art single-agent and multi-agent approaches across diverse LLM architectures. Our investigation establishes that context-sensitive structural adaptability constitutes a foundational requirement for high-performance LLM MAS deployments.
[56] NLP Methods for Detecting Novel LLM Jailbreaks and Keyword Analysis with BERT
John Hawkins, Aditya Pramar, Rodney Beard, Rohitash Chandra
Main category: cs.CL
TL;DR: Fine-tuned BERT model achieves best performance in detecting jailbreak prompts that circumvent LLM safety measures, with explicit reflexivity in prompts serving as a key indicator.
Details
Motivation: LLMs are vulnerable to jailbreak prompts that bypass safety guardrails, creating a need for effective detection methods to identify malicious inputs.
Method: Analyzed various machine learning models’ ability to distinguish jailbreak prompts from genuine ones, focusing on detecting novel jailbreak strategies through fine-tuned BERT models.
Result: Fine-tuned BERT model performed best at identifying jailbreak prompts using current datasets, with visualization revealing distinguishing keywords and structural patterns.
Conclusion: Explicit reflexivity in prompt structure can signal jailbreak intention, and fine-tuned BERT models are effective for jailbreak detection across various strategies.
Abstract: Large Language Models (LLMs) suffer from a range of vulnerabilities that allow malicious users to solicit undesirable responses through manipulation of the input text. These so-called jailbreak prompts are designed to trick the LLM into circumventing the safety guardrails put in place to keep responses acceptable to the developer’s policies. In this study, we analyse the ability of different machine learning models to distinguish jailbreak prompts from genuine uses, including looking at our ability to identify jailbreaks that use previously unseen strategies. Our results indicate that using current datasets the best performance is achieved by fine-tuning a Bidirectional Encoder Representations from Transformers (BERT) model end-to-end for identifying jailbreaks. We visualise the keywords that distinguish jailbreak from genuine prompts and conclude that explicit reflexivity in prompt structure could be a signal of jailbreak intention.
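A minimal sketch of the winning recipe, a BERT encoder with a binary classification head fine-tuned end-to-end; the checkpoint and label convention here are assumptions:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Binary head on BERT; label convention (0 = genuine, 1 = jailbreak) is ours.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

batch = tok(["Ignore all previous instructions and ..."],
            return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    logits = model(**batch).logits
print(logits.softmax(-1))  # head is untrained here: fine-tune before use
```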
[57] Learning to Look at the Other Side: A Semantic Probing Study of Word Embeddings in LLMs with Enabled Bidirectional Attention
Zhaoxin Feng, Jianfei Ma, Emmanuele Chersoni, Xiaojing Zhao, Xiaoyi Bao
Main category: cs.CL
TL;DR: The paper explores overcoming unidirectional attention limitations in LLMs for text embedding tasks by enabling bidirectional attention through additional training steps with contrastive learning.
Details
Motivation: Autoregressive LLMs have limited performance in text embedding tasks due to unidirectional attention constraints, which this research aims to overcome.
Method: Tested different Llama architecture variants through additional training steps that progressively enable bidirectional attention and unsupervised/supervised contrastive learning.
Result: Not specified in the abstract provided.
Conclusion: The study investigates whether bidirectional attention can overcome the limitations of unidirectional attention in LLMs for improved text embedding performance.
Abstract: Autoregressive Large Language Models (LLMs) demonstrate exceptional performance in language understanding and generation. However, their application in text embedding tasks has been relatively slow, along with the analysis of their semantic representation in probing tasks, due to the constraints of the unidirectional attention mechanism. This paper aims to explore whether such constraints can be overcome by enabling bidirectional attention in LLMs. We tested different variants of the Llama architecture through additional training steps, progressively enabling bidirectional attention and unsupervised/supervised contrastive learning.
[58] SoK: Measuring What Matters for Closed-Loop Security Agents
Mudita Khurana, Raunak Jain
Main category: cs.CL
TL;DR: CLASP is a framework that defines autonomous security agent capabilities across the security lifecycle and introduces the CLC Score to measure closed-loop performance in cybersecurity systems.
Details
Motivation: Current cybersecurity faces challenges with fragmented defensive tools and AI-driven offensive systems evolving faster than defenses. There's a lack of frameworks for evaluating autonomous security agents and benchmarks for measuring their closed-loop performance.
Method: Introduces CLASP framework that aligns security lifecycle (reconnaissance, exploitation, root cause analysis, patch synthesis, validation) with agentic capabilities (planning, tool use, memory, reasoning, reflection & perception). Also defines the Closed-Loop Capability (CLC) Score as a composite metric.
Result: Applied CLASP to 21 representative works to map system strengths and identify capability gaps. The framework provides common vocabulary and diagnostics for assessing agentic security systems.
Conclusion: CLASP and CLC Score provide the necessary vocabulary, diagnostics, and measurements to advance both functional performance and closed-loop security agents, addressing current gaps in autonomous cybersecurity evaluation.
Abstract: Cybersecurity is a relentless arms race, with AI-driven offensive systems evolving faster than traditional defenses can adapt. Research and tooling remain fragmented across isolated defensive functions, creating blind spots that adversaries exploit. Autonomous agents capable of integrating exploit confirmation, remediation, and validation into a single closed loop offer promise, but the field lacks three essentials: a framework defining the agentic capabilities of security systems across the security lifecycle, a principled method for evaluating closed-loop agents, and a benchmark for measuring their performance in practice. We introduce CLASP: the Closed-Loop Autonomous Security Performance framework, which aligns the security lifecycle (reconnaissance, exploitation, root cause analysis, patch synthesis, validation) with core agentic capabilities (planning, tool use, memory, reasoning, reflection & perception), providing a common vocabulary and rubric for assessing agentic capabilities in security tasks. By applying CLASP to 21 representative works, we map where systems demonstrate strengths, and where capability gaps persist. We then define the Closed-Loop Capability (CLC) Score, a composite metric quantifying both degree of loop closure and operational effectiveness, and outline the requirements for a closed-loop benchmark. Together, CLASP and the CLC Score provide the vocabulary, diagnostics, and measurements needed to advance both function-level performance and the measurement of closed-loop security agents.
[59] MDSEval: A Meta-Evaluation Benchmark for Multimodal Dialogue Summarization
Yinhong Liu, Jianfeng He, Hang Su, Ruixue Lian, Yi Nian, Jake Vincent, Srikanth Vishnubhotla, Robinson Piramuthu, Saab Mansour
Main category: cs.CL
TL;DR: MDSEval is the first meta-evaluation benchmark for Multimodal Dialogue Summarization (MDS), featuring image-sharing dialogues, summaries, and human judgments across eight quality aspects, with a novel MEKI filtering framework for data quality.
Details
Motivation: To support MDS model development by creating robust automatic evaluation methods that reduce cost and human effort, requiring a strong meta-evaluation benchmark grounded in human annotations.
Method: Proposed a novel filtering framework using Mutually Exclusive Key Information (MEKI) across modalities to ensure data quality and richness. Identified and formalized eight key evaluation dimensions specific to MDS.
Result: Benchmarked state-of-the-art modal evaluation methods, revealing their limitations in distinguishing summaries from advanced MLLMs and their susceptibility to various biases.
Conclusion: MDSEval provides the first comprehensive meta-evaluation benchmark for MDS, highlighting the need for improved evaluation methods that can effectively assess multimodal dialogue summaries.
Abstract: Multimodal Dialogue Summarization (MDS) is a critical task with wide-ranging applications. To support the development of effective MDS models, robust automatic evaluation methods are essential for reducing both cost and human effort. However, such methods require a strong meta-evaluation benchmark grounded in human annotations. In this work, we introduce MDSEval, the first meta-evaluation benchmark for MDS, consisting of image-sharing dialogues, corresponding summaries, and human judgments across eight well-defined quality aspects. To ensure data quality and richness, we propose a novel filtering framework leveraging Mutually Exclusive Key Information (MEKI) across modalities. Our work is the first to identify and formalize key evaluation dimensions specific to MDS. We benchmark state-of-the-art modal evaluation methods, revealing their limitations in distinguishing summaries from advanced MLLMs and their susceptibility to various biases.
[60] How Do Language Models Compose Functions?
Apoorv Khandelwal, Ellie Pavlick
Main category: cs.CL
TL;DR: LLMs suffer from compositionality gap - can compute f(x) and g(z) separately but struggle with g(f(x)). Two processing mechanisms identified: compositional (computing intermediate f(x)) and direct (no detectable intermediate). Mechanism choice relates to embedding space geometry.
Details
Motivation: To investigate whether LLMs use compositional mechanisms when solving compositional tasks, specifically examining the compositionality gap where models can compute individual functions but not their composition.
Method: Used logit lens on residual stream activations of feedforward LLMs to analyze how they solve two-hop factual recall tasks expressed as g(f(x)). Examined processing mechanisms and their relationship to embedding space geometry.
Result: Identified two distinct processing mechanisms: compositional (computes intermediate f(x)) and direct (no detectable intermediate). Found that direct mechanism dominates when linear mapping exists from x to g(f(x)) in embedding spaces.
Conclusion: LLMs employ different processing strategies for compositional tasks, with mechanism choice influenced by embedding space geometry. The compositionality gap persists despite models’ ability to compute individual functions.
Abstract: While large language models (LLMs) appear to be increasingly capable of solving compositional tasks, it is an open question whether they do so using compositional mechanisms. In this work, we investigate how feedforward LLMs solve two-hop factual recall tasks, which can be expressed compositionally as $g(f(x))$. We first confirm that modern LLMs continue to suffer from the “compositionality gap”: i.e. their ability to compute both $z = f(x)$ and $y = g(z)$ does not entail their ability to compute the composition $y = g(f(x))$. Then, using logit lens on their residual stream activations, we identify two processing mechanisms, one which solves tasks $\textit{compositionally}$, computing $f(x)$ along the way to computing $g(f(x))$, and one which solves them $\textit{directly}$, without any detectable signature of the intermediate variable $f(x)$. Finally, we find that which mechanism is employed appears to be related to the embedding space geometry, with the idiomatic mechanism being dominant in cases where there exists a linear mapping from $x$ to $g(f(x))$ in the embedding spaces. We fully release our data and code at: https://github.com/apoorvkh/composing-functions .
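The logit-lens probe named in the method is straightforward to reproduce on a small open model: decode each layer's residual stream through the final layer norm and the unembedding matrix. A sketch on GPT-2 (the paper studies larger LLMs; index 0 of `hidden_states` is the embedding output):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# A two-hop style prompt: Eiffel Tower -> France -> Paris.
ids = tok("The capital of the country of the Eiffel Tower is",
          return_tensors="pt")
with torch.no_grad():
    out = model(**ids, output_hidden_states=True)
for layer, h in enumerate(out.hidden_states):
    # Decode the last position's residual stream at every layer.
    logits = model.lm_head(model.transformer.ln_f(h[0, -1]))
    print(layer, tok.decode(logits.argmax()))
```

A compositional trace would surface the intermediate entity (France) at some middle layer before the final answer; a direct trace would not.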
[61] Format Inertia: A Failure Mechanism of LLMs in Medical Pre-Consultation
Seungseop Lim, Gibaeg Kim, Wooseok Han, Jean Seo, Hyunkyung Lee, Jaehyo Yoo, Eunho Yang
Main category: cs.CL
TL;DR: The paper introduces “Format Inertia” - a failure mechanism in LLMs for medical pre-consultation where models generate repetitive, format-correct but diagnostically uninformative questions in long dialogues, and proposes a data-centric rebalancing method to mitigate this issue.
Details
Motivation: Current SFT approaches for adapting LLMs to medical pre-consultation suffer from skewed turn-count distributions in training data, leading to Format Inertia where models produce repetitive but unhelpful questions in long medical dialogues.
Method: A simple, data-centric method that rebalances the turn-count distribution of the training dataset to address the skewed distribution problem.
Result: Experimental results demonstrate that the proposed approach substantially alleviates Format Inertia in medical pre-consultation tasks.
Conclusion: Rebalancing turn-count distributions in training data effectively mitigates Format Inertia, improving LLM performance in long medical dialogues for pre-consultation applications.
Abstract: Recent advances in Large Language Models (LLMs) have brought significant improvements to various service domains, including chatbots and medical pre-consultation applications. In the healthcare domain, the most common approach for adapting LLMs to multi-turn dialogue generation is Supervised Fine-Tuning (SFT). However, datasets for SFT in tasks like medical pre-consultation typically exhibit a skewed turn-count distribution. Training on such data induces a novel failure mechanism we term Format Inertia, where models tend to generate repetitive, format-correct, but diagnostically uninformative questions in long medical dialogues. To mitigate this observed failure mechanism, we adopt a simple, data-centric method that rebalances the turn-count distribution of the training dataset. Experimental results show that our approach substantially alleviates Format Inertia in medical pre-consultation.
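The proposed fix is purely data-centric: rebalance the skewed turn-count distribution before SFT. One plausible reading, down-sampling every turn-count bucket to the size of the smallest, is sketched below; the paper's exact rebalancing scheme may differ:

```python
import random
from collections import defaultdict

def rebalance_by_turn_count(dialogues, seed=0):
    """Down-sample so every turn-count bucket is equally represented,
    one plausible instantiation of the paper's rebalancing idea."""
    buckets = defaultdict(list)
    for d in dialogues:
        buckets[len(d["turns"])].append(d)
    cap = min(len(v) for v in buckets.values())
    rng = random.Random(seed)
    return [d for v in buckets.values() for d in rng.sample(v, cap)]

data = [{"turns": ["q"] * n} for n in [2] * 80 + [8] * 30 + [20] * 10]
print(len(rebalance_by_turn_count(data)))  # 30: ten dialogues per bucket
```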
[62] What MLLMs Learn about When they Learn about Multimodal Reasoning: Perception, Reasoning, or their Integration?
Jiwan Chung, Neel Joshi, Pratyusha Sharma, Youngjae Yu, Vibhav Vineet
Main category: cs.CL
TL;DR: MathLens is a benchmark that breaks down multimodal reasoning into perception, reasoning, and integration components to better understand model improvements beyond aggregate accuracy.
Details
Motivation: Current evaluation of multimodal reasoning models focuses on aggregate accuracy, which obscures where and how models are actually improving, particularly in complex domains like geometry.
Method: Created MathLens benchmark with three components: Perception (extracting information), Reasoning (operating on information), and Integration (selecting and applying evidence). Used annotations including visual diagrams, textual descriptions, controlled questions, and perceptual probes derived from symbolic specifications.
Result: Different training approaches have uneven effects: RL mainly strengthens perception (especially with textual supervision), reasoning only improves with perception, integration remains the weakest skill, and robustness diverges - RL improves consistency while multimodal SFT reduces it through overfitting.
Conclusion: MathLens provides fine-grained evaluation revealing that multimodal reasoning improvements are uneven across subskills, with integration being the persistent bottleneck and training methods having distinct effects on robustness.
Abstract: Multimodal reasoning models have recently shown promise on challenging domains such as olympiad-level geometry, yet their evaluation remains dominated by aggregate accuracy, a single score that obscures where and how models are improving. We introduce MathLens, a benchmark designed to disentangle the subskills of multimodal reasoning while preserving the complexity of textbook-style geometry problems. The benchmark separates performance into three components: Perception: extracting information from raw inputs, Reasoning: operating on available information, and Integration: selecting relevant perceptual evidence and applying it within reasoning. To support each test, we provide annotations: visual diagrams, textual descriptions to evaluate reasoning in isolation, controlled questions that require both modalities, and probes for fine-grained perceptual skills, all derived from symbolic specifications of the problems to ensure consistency and robustness. Our analysis reveals that different training approaches have uneven effects: First, reinforcement learning chiefly strengthens perception, especially when supported by textual supervision, while textual SFT indirectly improves perception through reflective reasoning. Second, reasoning improves only in tandem with perception. Third, integration remains the weakest capacity, with residual errors concentrated there once other skills advance. Finally, robustness diverges: RL improves consistency under diagram variation, whereas multimodal SFT reduces it through overfitting. We will release all data and experimental logs.
[63] Machine-interpretable Engineering Design Standards for Valve Specification
Anders Gjerver, Rune Frostad, Vedrana Barisic, Melinda Hodkiewicz, Caitlin Woods, Mihaly Fekete, Arild Braathen Torjusen, Johan Wilhelm Kluwer
Main category: cs.CL
TL;DR: The paper demonstrates transforming engineering design standards into machine-interpretable ontologies for automated quality assurance in plant design and equipment selection processes.
Details
Motivation: Engineering design processes rely on document-centric standards despite digitalization ambitions, creating inefficiencies in compliance checking and equipment selection.
Method: Used modeling patterns to create modular ontologies from international piping, material and valve standards, aligned with ISO DIS 23726-3 (IDO) ontology. Tested on valve selection by instantiating valves and environmental conditions as OWL individuals.
Result: Successfully enabled automated validation of valve compliance with standards using semantic reasoning and executable design rules. Demonstrated interoperability through W3C-compliant modular ontologies.
Conclusion: IDO-based modular ontologies enable semantic reasoning for equipment selection and show potential for Standards Bodies to transition to digitized Smart Standards.
Abstract: Engineering design processes use technical specifications and must comply with standards. Product specifications, product type data sheets, and design standards are still mainly document-centric despite the ambition to digitalize industrial work. In this paper, we demonstrate how to transform information held in engineering design standards into modular, reusable, machine-interpretable ontologies and use the ontologies in quality assurance of the plant design and equipment selection process. We use modelling patterns to create modular ontologies for knowledge captured in the text and in frequently referenced tables in International Standards for piping, material and valve design. These modules are exchangeable, as stored in a W3C compliant format, and interoperable as they are aligned with the top-level ontology ISO DIS 23726-3: Industrial Data Ontology (IDO). We test these ontologies, created based on international material and piping standards and industry norms, on a valve selection process. Valves are instantiated in semantic asset models as individuals along with a semantic representation of the environmental condition at their location on the asset. We create “functional location tags” as OWL individuals that become instances of an OWL class of Valve Data Sheet (VDS) specified valves. Similarly, we create instances of manufacturer product type. Our approach enables automated validation that a specific VDS is compliant with relevant industry standards. Using semantic reasoning and executable design rules, we also determine whether the product type meets the valve specification. Creation of shared, reusable IDO-based modular ontologies for design standards enables semantic reasoning to be applied to equipment selection processes and demonstrates the potential of this approach for Standards Bodies wanting to transition to digitized Smart Standards.
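To make the modeling concrete, here is a toy `rdflib` sketch of a functional location tag instantiated as an OWL individual of a VDS-specified valve class; the IRIs and property names are illustrative inventions, whereas the paper aligns its modules with ISO DIS 23726-3 (IDO):

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import OWL, RDF

# Hypothetical namespace; the paper's actual IRIs follow IDO conventions.
EX = Namespace("http://example.org/plant#")
g = Graph()
g.bind("ex", EX)

g.add((EX.VDS_BallValve, RDF.type, OWL.Class))           # VDS-specified valve class
g.add((EX.V_1001, RDF.type, OWL.NamedIndividual))        # functional location tag
g.add((EX.V_1001, RDF.type, EX.VDS_BallValve))
g.add((EX.V_1001, EX.designTemperatureC, Literal(120)))  # environmental condition
print(g.serialize(format="turtle"))
```

Compliance checking would then run a reasoner or executable design rules over such assertions, which is where the automated validation described above comes in.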
[64] Can LLMs Refuse Questions They Do Not Know? Measuring Knowledge-Aware Refusal in Factual Tasks
Wenbo Pan, Jie Xu, Qiguang Chen, Junhao Dong, Libo Qin, Xinfeng Li, Haining Yu, Xiaohua Jia
Main category: cs.CL
TL;DR: Proposes Refusal Index (RI) as a principled metric to measure how accurately LLMs refuse questions they don’t know, addressing limitations of existing refusal metrics.
Details
Motivation: Existing metrics fail to faithfully measure LLMs' knowledge-aware refusal capability: simple refusal metrics are biased by refusal rates, while calibration metrics are proxy-based and don't capture actual refusal behavior.
Method: Define RI as Spearman’s rank correlation between refusal probability and error probability. Use a lightweight two-pass evaluation method to efficiently estimate RI from observed refusal rates across two standard evaluation runs.
Result: Experiments across 16 models and 5 datasets show RI accurately quantifies intrinsic knowledge-aware refusal capability, remains stable across different refusal rates, and provides consistent model rankings independent of accuracy and refusal rates.
Conclusion: RI reveals LLMs’ refusal behavior can be unreliable despite high accuracy, highlighting the need to complement traditional accuracy metrics with Refusal Index for comprehensive factuality evaluation.
Abstract: Large Language Models (LLMs) should refuse to answer questions beyond their knowledge. This capability, which we term knowledge-aware refusal, is crucial for factual reliability. However, existing metrics fail to faithfully measure this ability. On the one hand, simple refusal-based metrics are biased by refusal rates and yield inconsistent scores when models exhibit different refusal tendencies. On the other hand, existing calibration metrics are proxy-based, capturing the performance of auxiliary calibration processes rather than the model’s actual refusal behavior. In this work, we propose the Refusal Index (RI), a principled metric that measures how accurately LLMs refuse questions they do not know. We define RI as Spearman’s rank correlation between refusal probability and error probability. To make RI practically measurable, we design a lightweight two-pass evaluation method that efficiently estimates RI from observed refusal rates across two standard evaluation runs. Extensive experiments across 16 models and 5 datasets demonstrate that RI accurately quantifies a model’s intrinsic knowledge-aware refusal capability in factual tasks. Notably, RI remains stable across different refusal rates and provides consistent model rankings independent of a model’s overall accuracy and refusal rates. More importantly, RI provides insight into an important but previously overlooked aspect of LLM factuality: while LLMs achieve high accuracy on factual tasks, their refusal behavior can be unreliable and fragile. This finding highlights the need to complement traditional accuracy metrics with the Refusal Index for comprehensive factuality evaluation.
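The definition above is directly computable from per-question refusal and error estimates; the paper's two-pass procedure approximates it from observed refusal rates, but the definition itself reduces to a rank correlation (the toy numbers below are illustrative):

```python
import numpy as np
from scipy.stats import spearmanr

def refusal_index(refusal_prob, error_prob):
    """RI = Spearman rank correlation between a model's per-question
    refusal probability and its probability of answering incorrectly."""
    rho, _ = spearmanr(refusal_prob, error_prob)
    return rho

# Toy numbers: the model refuses more often exactly where it errs more.
refusal = np.array([0.05, 0.10, 0.40, 0.70, 0.90])
error   = np.array([0.10, 0.15, 0.35, 0.80, 0.85])
print(refusal_index(refusal, error))  # near 1.0 -> knowledge-aware refusal
```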
[65] Comparison of Unsupervised Metrics for Evaluating Judicial Decision Extraction
Ivan Leonidovich Litvak, Anton Kostin, Fedor Lashkin, Tatiana Maksiyan, Sergey Lagutin
Main category: cs.CL
TL;DR: This study evaluates 16 unsupervised metrics for assessing text extraction quality from Russian judicial decisions, finding that Term Frequency Coherence and Coverage Ratio perform best while LLM-based evaluation shows limited legal specialization.
Details
Motivation: The rapid advancement of AI in legal NLP demands scalable methods for evaluating text extraction from judicial decisions without requiring pre-annotated ground truth.
Method: Evaluated 16 unsupervised metrics across document-based, semantic, structural, pseudo-ground truth, and legal-specific categories on 1,000 anonymized Russian judicial decisions, validated against 7,168 expert reviews using bootstrapped correlations, Lin’s concordance correlation coefficient, and mean absolute error.
Result: Term Frequency Coherence (Pearson r=0.540, Lin CCC=0.512, MAE=0.127) and Coverage Ratio/Block Completeness (Pearson r=0.513, Lin CCC=0.443, MAE=0.139) best aligned with expert ratings, while Legal Term Density showed strong negative correlations. LLM Evaluation Score showed moderate alignment but limited legal specialization.
Conclusion: Unsupervised metrics enable scalable screening but cannot fully replace human judgment in high-stakes legal contexts due to moderate correlations and low CCC values. This work advances legal NLP by providing annotation-free evaluation tools.
Abstract: The rapid advancement of artificial intelligence in legal natural language processing demands scalable methods for evaluating text extraction from judicial decisions. This study evaluates 16 unsupervised metrics, including novel formulations, to assess the quality of extracting seven semantic blocks from 1,000 anonymized Russian judicial decisions, validated against 7,168 expert reviews on a 1–5 Likert scale. These metrics, spanning document-based, semantic, structural, pseudo-ground truth, and legal-specific categories, operate without pre-annotated ground truth. Bootstrapped correlations, Lin’s concordance correlation coefficient (CCC), and mean absolute error (MAE) reveal that Term Frequency Coherence (Pearson $r = 0.540$, Lin CCC = 0.512, MAE = 0.127) and Coverage Ratio/Block Completeness (Pearson $r = 0.513$, Lin CCC = 0.443, MAE = 0.139) best align with expert ratings, while Legal Term Density (Pearson $r = -0.479$, Lin CCC = -0.079, MAE = 0.394) shows a strong negative correlation. The LLM Evaluation Score (mean = 0.849, Pearson $r = 0.382$, Lin CCC = 0.325, MAE = 0.197) showed moderate alignment, but its performance, using gpt-4.1-mini via g4f, suggests limited specialization for legal texts. These findings highlight that unsupervised metrics, including LLM-based approaches, enable scalable screening but, with moderate correlations and low CCC values, cannot fully replace human judgment in high-stakes legal contexts. This work advances legal NLP by providing annotation-free evaluation tools, with implications for judicial analytics and ethical AI deployment.
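For intuition, one plausible reading of a coverage-style metric, the fraction of source vocabulary retained in the extracted block, is sketched below; the paper's exact formulations of Coverage Ratio and Term Frequency Coherence may well differ:

```python
def coverage_ratio(extracted, source):
    """One plausible reading of a coverage-style metric: the share of
    the source document's vocabulary retained in the extracted block.
    The paper's exact formulation is not reproduced here."""
    src = set(source.lower().split())
    ext = set(extracted.lower().split())
    return len(src & ext) / len(src) if src else 0.0

source = "the court finds the defendant liable and orders compensation"
extracted = "the court finds the defendant liable"
print(round(coverage_ratio(extracted, source), 2))  # 0.62
```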
[66] Detecting LLM-Generated Spam Reviews by Integrating Language Model Embeddings and Graph Neural Network
Xin Liu, Rongwu Xu, Xinyi Jia, Jason Liao, Jiao Sun, Ling Huang, Wei Xu
Main category: cs.CL
TL;DR: FraudSquad is a hybrid detection model that combines text embeddings with graph transformers to detect LLM-generated spam reviews, outperforming state-of-the-art methods by up to 44.22% in precision.
Details
Motivation: LLMs can generate highly persuasive spam reviews that mimic human writing, posing significant threats to online platform credibility and challenging existing detection systems.
Method: Created three LLM-generated spam review datasets using different LLMs guided by product metadata and reference reviews. Proposed FraudSquad, a hybrid model integrating text embeddings from pre-trained language models with gated graph transformers for spam node classification.
Result: FraudSquad outperforms state-of-the-art baselines by up to 44.22% in precision and 43.01% in recall on LLM-generated datasets, while also achieving promising results on human-written spam datasets. It maintains modest model size and requires minimal labeled training data.
Conclusion: The work provides new synthetic datasets, a practical detection framework, and evidence highlighting the urgency of adapting spam detection to the LLM era. FraudSquad offers a practical solution for real-world applications.
Abstract: The rise of large language models (LLMs) has enabled the generation of highly persuasive spam reviews that closely mimic human writing. These reviews pose significant challenges for existing detection systems and threaten the credibility of online platforms. In this work, we first create three realistic LLM-generated spam review datasets using three distinct LLMs, each guided by product metadata and genuine reference reviews. Evaluations by GPT-4.1 confirm the high persuasion and deceptive potential of these reviews. To address this threat, we propose FraudSquad, a hybrid detection model that integrates text embeddings from a pre-trained language model with a gated graph transformer for spam node classification. FraudSquad captures both semantic and behavioral signals without relying on manual feature engineering or massive training resources. Experiments show that FraudSquad outperforms state-of-the-art baselines by up to 44.22% in precision and 43.01% in recall on three LLM-generated datasets, while also achieving promising results on two human-written spam datasets. Furthermore, FraudSquad maintains a modest model size and requires minimal labeled training data, making it a practical solution for real-world applications. Our contributions include new synthetic datasets, a practical detection framework, and empirical evidence highlighting the urgency of adapting spam detection to the LLM era. Our code and datasets are available at: https://anonymous.4open.science/r/FraudSquad-5389/.
[67] Syntactic Blind Spots: How Misalignment Leads to LLMs Mathematical Errors
Dane Williamson, Yangfeng Ji, Matthew Dwyer
Main category: cs.CL
TL;DR: LLMs fail on mathematically simple problems when phrased in unfamiliar syntax, showing systematic “syntactic blind spots” where they misapply reasoning strategies due to structural complexity rather than conceptual difficulty.
Details
Motivation: To understand why LLMs fail on mathematically straightforward problems when they are phrased differently from training data, identifying that failures stem from syntactic complexity rather than lack of mathematical competence.
Method: Rephrase incorrectly answered questions using syntactic templates from correct examples, quantify syntactic complexity using Dependency Locality Theory (DLT) metric, and test across multiple datasets.
Result: Rephrased questions with reduced structural complexity often lead to correct answers, and higher DLT scores correlate with increased failure rates, showing that syntax-aware interventions can mitigate reasoning errors.
Conclusion: Many LLM reasoning errors result from structural misalignment between surface form and internal representation, not conceptual difficulty, and syntax-aware approaches can effectively address these inductive failures.
Abstract: Large Language Models (LLMs) demonstrate strong mathematical problem-solving abilities but frequently fail on problems that deviate syntactically from their training distribution. We identify a systematic failure mode, syntactic blind spots, in which models misapply familiar reasoning strategies to problems that are semantically straightforward but phrased in unfamiliar ways. These errors are not due to gaps in mathematical competence, but rather reflect a brittle coupling between surface form and internal representation. To test this, we rephrase incorrectly answered questions using syntactic templates drawn from correct examples. These rephrasings, which preserve semantics while reducing structural complexity, often lead to correct answers. We quantify syntactic complexity using a metric based on Dependency Locality Theory (DLT), and show that higher DLT scores are associated with increased failure rates across multiple datasets. Our findings suggest that many reasoning errors stem from structural misalignment rather than conceptual difficulty, and that syntax-aware interventions can reveal and mitigate these inductive failures.
[68] SCRIBES: Web-Scale Script-Based Semi-Structured Data Extraction with Reinforcement Learning
Shicheng Liu, Kai Sun, Lisheng Fu, Xilun Chen, Xinyuan Zhang, Zhaojiang Lin, Rulin Shao, Yue Liu, Anuj Kumar, Wen-tau Yih, Xin Luna Dong
Main category: cs.CL
TL;DR: SCRIBES is a reinforcement learning framework that generates reusable extraction scripts for semi-structured web content by leveraging layout similarity across webpages, enabling scalable and resource-efficient information extraction without per-page LLM inference.
Details
Motivation: Semi-structured content in HTML tables, lists, and infoboxes contains substantial factual data but is difficult to extract reliably. Existing methods either lack generalization or are resource-intensive due to requiring LLM inference for each individual page.
Method: Uses reinforcement learning with layout similarity across webpages within the same site as a reward signal. Generates reusable extraction scripts that can be applied to groups of structurally similar webpages. Iteratively trains on synthetic annotations from CommonCrawl data.
Result: Outperforms strong baselines by over 13% in script quality. Boosts downstream question answering accuracy by more than 4% for GPT-4o. Enables scalable and resource-efficient web information extraction.
Conclusion: SCRIBES provides an effective solution for extracting semi-structured web content at scale by generating reusable scripts that leverage structural similarities, avoiding the need for per-page LLM inference while maintaining high extraction quality.
Abstract: Semi-structured content in HTML tables, lists, and infoboxes accounts for a substantial share of factual data on the web, yet the formatting complicates usage, and reliably extracting structured information from them remains challenging. Existing methods either lack generalization or are resource-intensive due to per-page LLM inference. In this paper, we introduce SCRIBES (SCRIpt-Based Semi-Structured Content Extraction at Web-Scale), a novel reinforcement learning framework that leverages layout similarity across webpages within the same site as a reward signal. Instead of processing each page individually, SCRIBES generates reusable extraction scripts that can be applied to groups of structurally similar webpages. Our approach further improves by iteratively training on synthetic annotations from in-the-wild CommonCrawl data. Experiments show that our approach outperforms strong baselines by over 13% in script quality and boosts downstream question answering accuracy by more than 4% for GPT-4o, enabling scalable and resource-efficient web information extraction.
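To make the reward idea concrete, here is a hedged sketch of scoring a candidate extraction script by how well it transfers across structurally similar same-site pages; the toy pages, the regex-based script, and the schema-coverage reward are illustrative stand-ins, not the paper's actual reward function.
```python
# Score a candidate extraction script by its transfer to sibling pages.
import re

pages = [  # toy "same-site" pages sharing a layout
    "<li>Price: $12</li><li>Stock: 4</li>",
    "<li>Price: $7</li><li>Stock: 9</li>",
    "<li>Price: $30</li>",                 # partially divergent layout
]

def script(html):
    """Candidate reusable script: extract a {field: value} record."""
    return dict(re.findall(r"<li>(\w+): \$?(\w+)</li>", html))

def transfer_reward(script, pages, required={"Price", "Stock"}):
    """Fraction of sibling pages where the script yields the full schema."""
    hits = sum(1 for p in pages if required <= set(script(p)))
    return hits / len(pages)

print(transfer_reward(script, pages))  # 2/3: script generalizes imperfectly
```
Because one script is amortized over many pages, the per-page LLM inference cost the abstract criticizes disappears after script generation.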
[69] Model Merging to Maintain Language-Only Performance in Developmentally Plausible Multimodal Models
Ece Takmaz, Lisa Bylinina, Jakub Dotlacil
Main category: cs.CL
TL;DR: The paper addresses the performance gap in language-only tasks for multimodal models by using model merging techniques with language-only models to improve grammar performance while maintaining multimodal capabilities.
Details
Motivation: State-of-the-art vision-and-language models use much more data than children learn from, and multimodal models tend to underperform in language-only tasks, particularly grammar.Method: Developed language-only and multimodal models using developmentally plausible datasets, and experimented with model merging via weighted linear interpolation to fuse multimodal and language-only model parameters.
Result: Multimodal models outperformed previous BabyLM baselines but underperformed in language-only grammar benchmarks. Model merging with text-only models helped alleviate this problem while maintaining multimodal performance.
Conclusion: Model merging with text-only models can partially address the language-only performance gap in multimodal models, maintaining their multimodal capabilities while improving grammar performance.
Abstract: State-of-the-art vision-and-language models consist of many parameters and learn from enormous datasets, surpassing the amounts of linguistic data that children are exposed to as they acquire a language. This paper presents our approach to the multimodal track of the BabyLM challenge addressing this discrepancy. We develop language-only and multimodal models in low-resource settings using developmentally plausible datasets, with our multimodal models outperforming previous BabyLM baselines. One finding in the multimodal language model literature is that these models tend to underperform in \textit{language-only} tasks. Therefore, we focus on maintaining language-only abilities in multimodal models. To this end, we experiment with \textit{model merging}, where we fuse the parameters of multimodal models with those of language-only models using weighted linear interpolation. Our results corroborate the findings that multimodal models underperform in language-only benchmarks that focus on grammar, and model merging with text-only models can help alleviate this problem to some extent, while maintaining multimodal performance.
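The merging step itself is simple to state in code. The sketch below shows weighted linear interpolation of two state dictionaries, the operation the paper describes; plain floats stand in for tensors, and the alpha value is an arbitrary example.
```python
# Model merging by weighted linear interpolation:
# merged = alpha * multimodal + (1 - alpha) * text_only.
def merge_state_dicts(multimodal, text_only, alpha=0.5):
    assert multimodal.keys() == text_only.keys(), "architectures must match"
    return {k: alpha * multimodal[k] + (1 - alpha) * text_only[k]
            for k in multimodal}

mm = {"embed.weight": 0.8, "lm_head.weight": -0.2}
txt = {"embed.weight": 0.4, "lm_head.weight": 0.6}
print(merge_state_dicts(mm, txt, alpha=0.7))
```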
[70] REPAIR: Robust Editing via Progressive Adaptive Intervention and Reintegration
Yisu Wang, Ming Wang, Haoyuan Song, Wenjie Huang, Chaozheng Wang, Yi Xie, Xuming Ran
Main category: cs.CL
TL;DR: REPAIR is a lifelong editing framework for LLMs that enables precise, low-cost model updates while preserving non-target knowledge through closed-loop feedback and dynamic memory management.
Details
Motivation: To address the high cost of acquiring new knowledge or correcting errors in LLMs and mitigate unintended side effects from retraining.Method: Uses closed-loop feedback mechanism, dynamic memory management, frequent knowledge fusion, and strong locality guards to handle sequential edits and prevent unintended ripple effects.
Result: Boosts editing accuracy by 10%-30% across multiple model families and significantly reduces knowledge forgetting.
Conclusion: REPAIR provides a robust framework for developing reliable, scalable, and continually evolving LLMs.
Abstract: Post-training for large language models (LLMs) is constrained by the high cost of acquiring new knowledge or correcting errors and by the unintended side effects that frequently arise from retraining. To address these issues, we introduce REPAIR (Robust Editing via Progressive Adaptive Intervention and Reintegration), a lifelong editing framework designed to support precise and low-cost model updates while preserving non-target knowledge. REPAIR mitigates the instability and conflicts of large-scale sequential edits through a closed-loop feedback mechanism coupled with dynamic memory management. Furthermore, by incorporating frequent knowledge fusion and enforcing strong locality guards, REPAIR effectively addresses the shortcomings of traditional distribution-agnostic approaches that often overlook unintended ripple effects. Our experiments demonstrate that REPAIR boosts editing accuracy by 10%-30% across multiple model families and significantly reduces knowledge forgetting. This work introduces a robust framework for developing reliable, scalable, and continually evolving LLMs.
[71] Enhancing Large Language Model Reasoning with Reward Models: An Analytical Survey
Qiyuan Liu, Hao Xu, Xuhong Chen, Wei Chen, Yee Whye Teh, Ning Miao
Main category: cs.CL
TL;DR: This paper provides a systematic introduction and comprehensive survey of reward models (RMs) for enhancing LLM reasoning, covering their architectures, training methods, applications, and open research questions.
Details
Motivation: Reward models play a critical role in improving LLM reasoning performance by providing training signals for reinforcement learning and helping select best answers during inference, but there's a need for systematic understanding of their applications and challenges.Method: The paper reviews fundamental concepts of RMs including architectures, training methodologies, and evaluation techniques, then explores their key applications in guiding generation, data synthesis, and RL-based finetuning.
Result: The analysis provides actionable insights for effective deployment and advancement of RMs for LLM reasoning, addressing critical open questions regarding selection, generalization, evaluation, and enhancement of RMs.
Conclusion: This comprehensive survey establishes a foundation for understanding and advancing reward models in LLM reasoning applications, highlighting both current applications and future research directions.
Abstract: Reward models (RMs) play a critical role in enhancing the reasoning performance of LLMs. For example, they can provide training signals to finetune LLMs during reinforcement learning (RL) and help select the best answer from multiple candidates during inference. In this paper, we provide a systematic introduction to RMs, along with a comprehensive survey of their applications in LLM reasoning. We first review fundamental concepts of RMs, including their architectures, training methodologies, and evaluation techniques. Then, we explore their key applications: (1) guiding generation and selecting optimal outputs during LLM inference, (2) facilitating data synthesis and iterative self-improvement for LLMs, and (3) providing training signals in RL-based finetuning. Finally, we address critical open questions regarding the selection, generalization, evaluation, and enhancement of RMs, based on existing research and our own empirical findings. Our analysis aims to provide actionable insights for the effective deployment and advancement of RMs for LLM reasoning.
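One of the surveyed inference-time applications, best-of-N selection, fits in a few lines. In the hedged sketch below, generate and reward are placeholder stand-ins for an LLM sampler and a trained reward model; a real system would swap in actual models.
```python
import random

def generate(prompt, n):
    return [f"{prompt} -> candidate #{i}" for i in range(n)]

def reward(prompt, answer):
    return random.random()  # stand-in for a learned scalar reward model

def best_of_n(prompt, n=8):
    """Sample n candidates and return the one the RM scores highest."""
    candidates = generate(prompt, n)
    return max(candidates, key=lambda a: reward(prompt, a))

random.seed(0)
print(best_of_n("Prove that 17 is prime"))
```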
[72] Inverse Language Modeling towards Robust and Grounded LLMs
Davide Gabrielli, Simone Sestito, Iacopo Masi
Main category: cs.CL
TL;DR: Proposes Inverse Language Modeling (ILM) as a unified framework to improve LLM robustness against input perturbations and enable native grounding by identifying toxic input triggers.
Details
Motivation: Current defensive mechanisms for LLMs are fragmented and underdeveloped compared to prior classifier work, needing better adversarial robustness.Method: Inverse Language Modeling (ILM) framework that transforms LLMs from static generators into analyzable systems by inverting model outputs to identify unsafe input triggers.
Result: ILM simultaneously improves LLM robustness to input perturbations and enables native grounding for identifying potentially toxic input triggers.
Conclusion: ILM can serve as foundation for next-generation LLMs that are more robust, grounded, controllable, and trustworthy, potentially helping red teaming efforts.
Abstract: The current landscape of defensive mechanisms for LLMs is fragmented and underdeveloped, unlike prior work on classifiers. To further promote adversarial robustness in LLMs, we propose Inverse Language Modeling (ILM), a unified framework that simultaneously 1) improves the robustness of LLMs to input perturbations, and, at the same time, 2) enables native grounding by inverting model outputs to identify potentially toxic or unsafe input triggers. ILM transforms LLMs from static generators into analyzable and robust systems, potentially helping red teaming. ILM can lay the foundation for next-generation LLMs that are not only robust and grounded but also fundamentally more controllable and trustworthy. The code is publicly available at github.com/davegabe/pag-llm.
[73] Veri-R1: Toward Precise and Faithful Claim Verification via Online Reinforcement Learning
Qi He, Cheng Qian, Xiusi Chen, Bingxiang He, Yi R. Fung, Heng Ji
Main category: cs.CL
TL;DR: Veri-R1 is an online reinforcement learning framework that trains LLMs to interact with search engines for claim verification, improving joint accuracy by up to 30% and doubling evidence scores.
Details
Motivation: Existing claim verification approaches rely on prompt engineering or predefined workflows without unified training to improve necessary skills like evidence retrieval and reasoning.Method: Online reinforcement learning framework where LLMs interact with search engines and receive reward signals to shape planning, retrieval, and reasoning behaviors.
Result: Veri-R1 improves joint accuracy by up to 30% and doubles evidence score, often surpassing larger-scale models. Ablation studies show impact of reward components and link between output logits and accuracy.
Conclusion: Online RL is effective for precise and faithful claim verification, providing foundation for future research in LLM-empowered verification.
Abstract: Claim verification with large language models (LLMs) has recently attracted considerable attention, owing to their superior reasoning capabilities and transparent verification pathways compared to traditional answer-only judgments. Online claim verification requires iterative evidence retrieval and reasoning, yet existing approaches mainly rely on prompt engineering or predesigned reasoning workflows without offering a unified training paradigm to improve necessary skills. Therefore, we introduce Veri-R1, an online reinforcement learning (RL) framework that enables an LLM to interact with a search engine and to receive reward signals that explicitly shape its planning, retrieval, and reasoning behaviors. The dynamic interaction between models and retrieval systems more accurately reflects real-world verification scenarios and fosters comprehensive verification skills. Empirical results show that Veri-R1 improves joint accuracy by up to 30% and doubles evidence score, often surpassing larger-scale counterparts. Ablation studies further reveal the impact of reward components and the link between output logits and label accuracy. Our results highlight the effectiveness of online RL for precise and faithful claim verification and provide a foundation for future research. We release our code to support community progress in LLM empowered claim verification.
[74] Taking a SEAT: Predicting Value Interpretations from Sentiment, Emotion, Argument, and Topic Annotations
Adina Nicola Dobrinoiu, Ana Cristiana Marcu, Amir Homayounirad, Luciano Cavalcante Siebert, Enrico Liscio
Main category: cs.CL
TL;DR: Language models can predict individual value interpretations using multi-dimensional subjective annotations (SEAT dimensions: Sentiment, Emotion, Argument, Topics) as proxies for personal interpretive lenses, outperforming individual dimensions and baseline approaches.
Details
Motivation: To develop AI systems that align with diverse human perspectives and avoid majority bias by recognizing individual value interpretations shaped by sociocultural backgrounds and lived experiences.Method: Evaluate language models in zero- and few-shot settings using SEAT dimensions (Sentiment, Emotion, Argument, Topics) as proxies for individual interpretive lenses to predict value interpretations.
Result: Providing all SEAT dimensions simultaneously yields superior performance compared to individual dimensions and baseline approaches. Individual variations across annotators highlight the importance of accounting for subjective annotation behavior.
Conclusion: This first attempt to go beyond demographics and investigate annotation behavior’s impact on value prediction provides a foundation for future large-scale validation of AI systems that can better align with diverse human perspectives.
Abstract: Our interpretation of value concepts is shaped by our sociocultural background and lived experiences, and is thus subjective. Recognizing individual value interpretations is important for developing AI systems that can align with diverse human perspectives and avoid bias toward majority viewpoints. To this end, we investigate whether a language model can predict individual value interpretations by leveraging multi-dimensional subjective annotations as a proxy for their interpretive lens. That is, we evaluate whether providing examples of how an individual annotates Sentiment, Emotion, Argument, and Topics (SEAT dimensions) helps a language model in predicting their value interpretations. Our experiment across different zero- and few-shot settings demonstrates that providing all SEAT dimensions simultaneously yields superior performance compared to individual dimensions and a baseline where no information about the individual is provided. Furthermore, individual variations across annotators highlight the importance of accounting for individual subjective annotation behavior. To the best of our knowledge, this controlled setting, although small in size, is the first attempt to go beyond demographics and investigate the impact of annotation behavior on value prediction, providing a solid foundation for future large-scale validation.
[75] Exploring Database Normalization Effects on SQL Generation
Ryosuke Kohita
Main category: cs.CL
TL;DR: Schema normalization significantly impacts NL2SQL performance. Denormalized schemas work well for simple retrieval queries, while normalized schemas perform better for aggregation queries. The optimal schema design depends on query types.
Details
Motivation: Schema design, particularly normalization, is a critical but overlooked factor in NL2SQL systems. Most prior research evaluates models on fixed schemas without considering design impact.Method: Systematic study evaluating eight leading LLMs on synthetic datasets with formal normalization (1NF-3NF) and real academic paper datasets. Analyzed performance across different normalization levels with zero-shot and few-shot settings.
Result: Denormalized schemas offer high accuracy for simple retrieval queries. Normalized schemas (2NF/3NF) cause base table selection and join prediction errors, but these are mitigated by few-shot examples. Normalized schemas perform better for aggregation queries due to robustness against data duplication and NULL values.
Conclusion: Optimal schema design for NL2SQL depends on query types. Schema design should be considered when developing NL2SQL interfaces, with adaptive schema selection for real-world scenarios.
Abstract: Schema design, particularly normalization, is a critical yet often overlooked factor in natural language to SQL (NL2SQL) systems. Most prior research evaluates models on fixed schemas, overlooking the influence of design on performance. We present the first systematic study of schema normalization’s impact, evaluating eight leading large language models on synthetic and real-world datasets with varied normalization levels. We construct controlled synthetic datasets with formal normalization (1NF-3NF) and real academic paper datasets with practical schemes. Our results show that denormalized schemas offer high accuracy on simple retrieval queries, even with cost-effective models in zero-shot settings. In contrast, normalized schemas (2NF/3NF) introduce challenges such as errors in base table selection and join type prediction; however, these issues are substantially mitigated by providing few-shot examples. For aggregation queries, normalized schemas yielded better performance, mainly due to their robustness against the data duplication and NULL value issues that cause errors in denormalized schemas. These findings suggest that the optimal schema design for NL2SQL applications depends on the types of queries to be supported. Our study demonstrates the importance of considering schema design when developing NL2SQL interfaces and integrating adaptive schema selection for real-world scenarios.
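The core difficulty difference is easy to see side by side. The sketch below builds a toy denormalized table and a 3NF pair of tables in SQLite and shows how the same question requires a join only under the normalized schema; the schemas and data are illustrative, not the paper's benchmarks.
```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
  -- denormalized: one wide table, venue names duplicated per row
  CREATE TABLE papers_flat (title TEXT, author TEXT, venue TEXT);
  INSERT INTO papers_flat VALUES ('A', 'Kim', 'ACL'), ('B', 'Kim', 'ACL');

  -- 3NF: venue facts factored out, linked by key
  CREATE TABLE venues (venue_id INTEGER PRIMARY KEY, name TEXT);
  CREATE TABLE papers (title TEXT, author TEXT, venue_id INTEGER);
  INSERT INTO venues VALUES (1, 'ACL');
  INSERT INTO papers VALUES ('A', 'Kim', 1), ('B', 'Kim', 1);
""")

# "How many papers did Kim publish at ACL?" under each schema:
flat = "SELECT COUNT(*) FROM papers_flat WHERE author='Kim' AND venue='ACL'"
norm = ("SELECT COUNT(*) FROM papers p JOIN venues v ON p.venue_id=v.venue_id "
        "WHERE p.author='Kim' AND v.name='ACL'")
print(db.execute(flat).fetchone(), db.execute(norm).fetchone())  # (2,) (2,)
```
The denormalized query is a single-table filter, while the 3NF query demands correct base-table selection and join prediction, exactly the error categories the paper reports.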
[76] LLM-Based Multi-Task Bangla Hate Speech Detection: Type, Severity, and Target
Md Arid Hasan, Firoj Alam, Md Fahad Hossain, Usman Naseem, Syed Ishtiaque Ahmed
Main category: cs.CL
TL;DR: The paper introduces BanglaMultiHate, the first multi-task Bangla hate-speech dataset, and compares classical models, monolingual pretrained models, and LLMs for hate speech detection in low-resource settings.
Details
Motivation: Online platforms need reliable hate speech detection systems, especially for low-resource languages like Bangla where existing tools are limited and most prior work focuses on single-task detection with limited coverage of multi-facet signals.Method: Created BanglaMultiHate dataset with manual annotations, then conducted comprehensive comparison of classical baselines, monolingual pretrained models, and LLMs using zero-shot prompting and LoRA fine-tuning.
Result: LoRA-tuned LLMs are competitive with BanglaBERT, but culturally and linguistically grounded pretraining remains critical for robust performance in hate speech detection.
Conclusion: The dataset and findings establish a stronger benchmark for developing culturally aligned moderation tools in low-resource contexts, with the dataset and scripts to be released for reproducibility.
Abstract: Online social media platforms are central to everyday communication and information seeking. While these platforms serve positive purposes, they also provide fertile ground for the spread of hate speech, offensive language, and bullying content targeting individuals, organizations, and communities. Such content undermines safety, participation, and equity online. Reliable detection systems are therefore needed, especially for low-resource languages where moderation tools are limited. In Bangla, prior work has contributed resources and models, but most are single-task (e.g., binary hate/offense) with limited coverage of multi-facet signals (type, severity, target). We address these gaps by introducing the first multi-task Bangla hate-speech dataset, BanglaMultiHate, one of the largest manually annotated corpora to date. Building on this resource, we conduct a comprehensive, controlled comparison spanning classical baselines, monolingual pretrained models, and LLMs under zero-shot prompting and LoRA fine-tuning. Our experiments assess LLM adaptability in a low-resource setting and reveal a consistent trend: although LoRA-tuned LLMs are competitive with BanglaBERT, culturally and linguistically grounded pretraining remains critical for robust performance. Together, our dataset and findings establish a stronger benchmark for developing culturally aligned moderation tools in low-resource contexts. For reproducibility, we will release the dataset and all related scripts.
[77] Style Over Story: A Process-Oriented Study of Authorial Creativity in Large Language Models
Donghoon Jung, Jiwoo Choi, Songeun Chae, Seohyon Jung
Main category: cs.CL
TL;DR: This paper analyzes LLM creativity through a process-oriented lens using constraint-based decision-making and authorial personas, finding that models consistently prioritize Style over other narrative elements.
Details
Motivation: Current evaluations of LLM creativity focus mainly on output quality rather than the creative processes that generate them, leaving a gap in understanding how LLMs make creative decisions as computational authors.Method: The study uses controlled prompting to assign authorial personas to LLMs and analyzes their creative preferences through constraint-based decision-making, drawing on narratology to examine LLMs as computational authors.
Result: LLMs consistently emphasize Style over other narrative elements (Character, Event, Setting), and distinctive creative profiles emerge across different models based on their reasoning for choices.
Conclusion: The approach provides a novel systematic tool for analyzing AI’s authorial creativity, revealing distinctive creative preferences and decision-making processes in LLMs that go beyond simple output evaluation.
Abstract: Evaluations of large language models (LLMs)’ creativity have focused primarily on the quality of their outputs rather than the processes that shape them. This study takes a process-oriented approach, drawing on narratology to examine LLMs as computational authors. We introduce constraint-based decision-making as a lens for authorial creativity. Using controlled prompting to assign authorial personas, we analyze the creative preferences of the models. Our findings show that LLMs consistently emphasize Style over other elements, including Character, Event, and Setting. By also probing the reasoning the models provide for their choices, we show that distinctive profiles emerge across models and argue that our approach provides a novel systematic tool for analyzing AI’s authorial creativity.
[78] The Disparate Impacts of Speculative Decoding
Jameson Sandler, Ahmet Üstün, Marco Romanelli, Sara Hooker, Ferdinando Fioretto
Main category: cs.CL
TL;DR: Speculative decoding shows unequal speed-up rates across tasks, with under-fit/underrepresented tasks benefiting less. The paper analyzes this unfairness and proposes a mitigation strategy.
Details
Motivation: To understand and address the disparate speed-up rates in speculative decoding, where under-fit and underrepresented tasks consistently see diminished benefits compared to well-fit tasks.Method: Derived analytical framework to quantify unfairness, identified factors causing disparate speed-ups, and proposed a mitigation strategy to reduce speed-up disparities.
Result: Validated the mitigation approach across several model pairs, achieving on average a 12% improvement in fairness metric.
Conclusion: Speculative decoding exhibits systematic unfairness in speed-up distribution, but this can be mitigated through targeted strategies to improve fairness across different tasks.
Abstract: The practice of speculative decoding, whereby inference is probabilistically supported by a smaller, cheaper “drafter” model, has become a standard technique for systematically reducing the decoding time of large language models. This paper conducts an analysis of speculative decoding through the lens of its potential disparate speed-up rates across tasks. Crucially, the paper shows that speed-up gained from speculative decoding is not uniformly distributed across tasks, consistently diminishing for under-fit, and often underrepresented, tasks. To better understand this phenomenon, we derive an analysis to quantify this observed “unfairness” and draw attention to the factors that motivate such disparate speed-ups to emerge. Further, guided by these insights, the paper proposes a mitigation strategy designed to reduce speed-up disparities and validates the approach across several model pairs, revealing on average a 12% improvement in our fairness metric.
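A standard way to see why speed-ups diverge is the expected number of accepted tokens per target-model call: with draft length k and per-token acceptance rate alpha, the standard analysis of speculative decoding gives (1 - alpha^(k+1)) / (1 - alpha). Tasks the drafter fits poorly have lower alpha, as the sketch below illustrates; the alpha values are made-up examples, not figures from the paper.
```python
def expected_tokens_per_call(alpha, k):
    """Expected accepted tokens per target-model call in speculative
    decoding, with draft length k and per-token acceptance rate alpha."""
    if alpha == 1.0:
        return k + 1
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for task, alpha in [("well-fit task", 0.9), ("under-fit task", 0.5)]:
    print(task, round(expected_tokens_per_call(alpha, k=4), 2))
# well-fit task 4.1  vs  under-fit task 1.94 -> disparate speed-ups
```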
[79] RESTRAIN: From Spurious Votes to Signals – Self-Driven RL with Self-Penalization
Zhaoning Yu, Will Su, Leitian Tao, Haozhu Wang, Aashu Singh, Hanchao Yu, Jianyu Wang, Hongyang Gao, Weizhe Yuan, Jason Weston, Ping Yu, Jing Xu
Main category: cs.CL
TL;DR: RESTRAIN is a self-penalizing RL framework that enables models to improve reasoning without gold labels by exploiting the model’s answer distribution to penalize overconfident rollouts and low-consistency examples while preserving promising reasoning chains.
Details
Motivation: Current reinforcement learning with human-annotated data is costly and falters on harder tasks, creating a need for experience-driven learning where models can improve without curated labels by adapting to unlabeled data.Method: RESTRAIN uses a self-penalization mechanism that converts the absence of gold labels into a learning signal, penalizing overconfident rollouts and low-consistency examples while preserving good reasoning chains. It integrates with policy optimization methods like GRPO for continual self-improvement without supervision.
Result: RESTRAIN achieves significant improvements on challenging reasoning benchmarks: up to +140.7% Pass@1 on AIME25, +36.2% on MMLU_STEM, and +19.6% on GPQA-Diamond using only unlabeled data, nearly matching gold-label training performance.
Conclusion: RESTRAIN establishes a scalable path toward stronger reasoning capabilities without requiring gold labels, demonstrating the effectiveness of self-penalizing reinforcement learning for unsupervised model improvement.
Abstract: Reinforcement learning with human-annotated data has boosted chain-of-thought reasoning in large reasoning models, but these gains come at high costs in labeled data while faltering on harder tasks. A natural next step is experience-driven learning, where models improve without curated labels by adapting to unlabeled data. We introduce RESTRAIN (REinforcement learning with Self-restraint), a self-penalizing RL framework that converts the absence of gold labels into a useful learning signal. Instead of overcommitting to spurious majority votes, RESTRAIN exploits signals from the model’s entire answer distribution: penalizing overconfident rollouts and low-consistency examples while preserving promising reasoning chains. The self-penalization mechanism integrates seamlessly into policy optimization methods such as GRPO, enabling continual self-improvement without supervision. On challenging reasoning benchmarks, RESTRAIN delivers large gains using only unlabeled data. With Qwen3-4B-Base and OctoThinker Hybrid-8B-Base, it improves Pass@1 by up to +140.7 percent on AIME25, +36.2 percent on MMLU_STEM, and +19.6 percent on GPQA-Diamond, nearly matching gold-label training while using no gold labels. These results demonstrate that RESTRAIN establishes a scalable path toward stronger reasoning without gold labels.
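The self-penalization idea can be sketched directly from the answer distribution over rollouts. In this hedged illustration, the consistency threshold and reward magnitudes are made-up values standing in for the paper's actual mechanism.
```python
from collections import Counter

def self_rewards(answers, min_consistency=0.5):
    """Derive pseudo-rewards for one prompt from n rollout answers,
    with no gold label available."""
    counts = Counter(answers)
    top_answer, top = counts.most_common(1)[0]
    consistency = top / len(answers)
    if consistency < min_consistency:
        # low-consistency example: penalize every rollout equally
        return {a: -1.0 for a in counts}
    # reward agreement with the majority, scaled by consistency so the
    # policy does not overcommit to a possibly spurious vote
    return {a: consistency if a == top_answer else -consistency
            for a in counts}

print(self_rewards(["42", "42", "42", "17"]))   # confident: graded rewards
print(self_rewards(["a", "b", "c", "d"]))       # inconsistent: all penalized
```
These pseudo-rewards can then feed a group-relative policy objective such as GRPO, as the abstract describes.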
[80] Learning to Reason for Hallucination Span Detection
Hsuan Su, Ting-Yao Hu, Hema Swetha Koppula, Kundan Krishna, Hadi Pouransari, Cheng-Yu Hsieh, Cem Koc, Joseph Yitan Cheng, Oncel Tuzel, Raviteja Vemulapalli
Main category: cs.CL
TL;DR: RL4HS is a reinforcement learning framework that uses span-level rewards and explicit reasoning to detect hallucination spans in LLM outputs, outperforming pretrained models and supervised fine-tuning.
Details
Motivation: LLMs often generate hallucinations that undermine reliability. While most prior work treats hallucination detection as binary classification, real applications need to identify specific hallucinated spans, which requires multi-step reasoning.Method: Proposed RL4HS framework using reinforcement learning with span-level reward function, building on Group Relative Policy Optimization and introducing Class-Aware Policy Optimization to address reward imbalance.
Result: Experiments on RAGTruth benchmark (summarization, QA, data-to-text) show RL4HS surpasses pretrained reasoning models and supervised fine-tuning.
Conclusion: Reinforcement learning with span-level rewards is necessary for effective detection of hallucination spans in LLM outputs.
Abstract: Large language models (LLMs) often generate hallucinations – unsupported content that undermines reliability. While most prior works frame hallucination detection as a binary task, many real-world applications require identifying hallucinated spans, which is a multi-step decision making process. This naturally raises the question of whether explicit reasoning can help the complex task of detecting hallucination spans. To answer this question, we first evaluate pretrained models with and without Chain-of-Thought (CoT) reasoning, and show that CoT reasoning has the potential to generate at least one correct answer when sampled multiple times. Motivated by this, we propose RL4HS, a reinforcement learning framework that incentivizes reasoning with a span-level reward function. RL4HS builds on Group Relative Policy Optimization and introduces Class-Aware Policy Optimization to mitigate reward imbalance issue. Experiments on the RAGTruth benchmark (summarization, question answering, data-to-text) show that RL4HS surpasses pretrained reasoning models and supervised fine-tuning, demonstrating the necessity of reinforcement learning with span-level rewards for detecting hallucination spans.
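A span-level reward of the kind RL4HS optimizes can be sketched as token-level F1 between predicted and gold hallucination spans (character offsets here); the paper's exact reward and its class-aware reweighting may differ.
```python
def to_chars(spans):
    return {i for start, end in spans for i in range(start, end)}

def span_f1_reward(pred_spans, gold_spans):
    """F1 over character positions covered by predicted vs. gold spans."""
    pred, gold = to_chars(pred_spans), to_chars(gold_spans)
    if not pred and not gold:
        return 1.0  # correctly predicting "no hallucination"
    if not pred or not gold:
        return 0.0
    overlap = len(pred & gold)
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)

print(span_f1_reward([(10, 20)], [(12, 22)]))  # partial credit: 0.8
```
Unlike a binary hallucination label, this reward gives partial credit for near-misses, which is what makes RL preferable to supervised span classification here.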
[81] ARUQULA – An LLM based Text2SPARQL Approach using ReAct and Knowledge Graph Exploration Utilities
Felix Brei, Lorenz Bühmann, Johannes Frey, Daniel Gerber, Lars-Peter Meyer, Claus Stadler, Kirill Bulert
Main category: cs.CL
TL;DR: The paper presents ARUQULA, a generalized Text2SPARQL method built on the LLM-backed SPINACH agent, which translates natural language questions to SPARQL queries through iterative exploration and execution, addressing the high barrier SPARQL poses for non-experts.
Details
Motivation: To lower the barrier of SPARQL query language for people without computer science background by using large language models for Text2SPARQL translation, motivated by the Text2SPARQL challenge.Method: A generalized method based on SPINACH, an LLM-backed agent that translates natural language to SPARQL queries through an iterative process of exploration and execution rather than single-shot translation.
Result: The paper describes the overall architecture and design decisions, and conducts thorough analysis of agent behavior to identify areas for future improvements.
Conclusion: The iterative approach to Text2SPARQL translation shows promise for making knowledge graph interaction more accessible, with identified areas for targeted improvements based on behavioral analysis.
Abstract: Interacting with knowledge graphs can be a daunting task for people without a background in computer science since the query language that is used (SPARQL) has a high barrier of entry. Large language models (LLMs) can lower that barrier by providing support in the form of Text2SPARQL translation. In this paper we introduce a generalized method based on SPINACH, an LLM backed agent that translates natural language questions to SPARQL queries not in a single shot, but as an iterative process of exploration and execution. We describe the overall architecture and reasoning behind our design decisions, and also conduct a thorough analysis of the agent behavior to gain insights into future areas for targeted improvements. This work was motivated by the Text2SPARQL challenge, a challenge that was held to facilitate improvements in the Text2SPARQL domain.
[82] Say One Thing, Do Another? Diagnosing Reasoning-Execution Gaps in VLM-Powered Mobile-Use Agents
Lingzhong Dong, Ziqi Zhou, Shuaibo Yang, Haiyue Sheng, Pengzhou Cheng, Zongru Wu, Zheng Wu, Gongshen Liu, Zhuosheng Zhang
Main category: cs.CL
TL;DR: This paper introduces a new evaluation framework to diagnose reasoning-execution gaps in mobile-use agents powered by vision-language models, focusing on whether chain-of-thought reasoning aligns with ground-truth actions.
Details
Motivation: Existing evaluations emphasize execution accuracy but neglect whether CoT reasoning aligns with ground-truth actions, which can lead to over-trust and potential harmful actions when users rely on seemingly plausible but misaligned reasoning.Method: The authors propose Ground-Truth Alignment (GTA) metric to measure whether the action implied by CoT matches the ground-truth action, and combine it with Exact Match (EM) to jointly assess reasoning and execution accuracy.
Result: Experimental results show reasoning-execution gaps are prevalent across mobile interaction tasks, with execution gaps occurring more frequently than reasoning gaps. While scaling up model size reduces overall gaps, sizable execution gaps persist even in largest models.
Conclusion: The framework reliably reflects systematic reasoning-execution gap patterns in state-of-the-art models, offering concrete diagnostics to support development of more trustworthy mobile-use agents.
Abstract: Mobile-use agents powered by vision-language models (VLMs) have shown great potential in interpreting natural language instructions and generating corresponding actions based on mobile graphical user interface. Recent studies suggest that incorporating chain-of-thought (CoT) reasoning tends to improve the execution accuracy. However, existing evaluations emphasize execution accuracy while neglecting whether CoT reasoning aligns with ground-truth actions. This oversight fails to assess potential reasoning-execution gaps, which in turn foster over-trust: users relying on seemingly plausible CoTs may unknowingly authorize harmful actions, potentially resulting in financial loss or trust crisis. In this work, we introduce a new evaluation framework to diagnose reasoning-execution gaps. At its core lies Ground-Truth Alignment (GTA), which measures whether the action implied by a CoT matches the ground-truth action. By combining GTA with the standard Exact Match (EM) metric, we jointly assess both the reasoning accuracy and execution accuracy. This joint perspective reveals two types of reasoning-execution gaps: (i) Execution Gap (EG), where the reasoning correctly identifies the correct action but execution fails, and (ii) Reasoning Gap (RG), where execution succeeds but reasoning process conflicts with the actual execution. Experimental results across a wide range of mobile interaction tasks reveal that reasoning-execution gaps are prevalent, with execution gaps occurring more frequently than reasoning gaps. Moreover, while scaling up model size reduces the overall gap, sizable execution gaps persist even in the largest models. Further analysis shows that our framework reliably reflects systematic EG/RG patterns in state-of-the-art models. These findings offer concrete diagnostics and support the development of more trustworthy mobile-use agents.
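The two gap types follow directly from crossing the two metrics. A minimal sketch of the paper's joint GTA/EM view:
```python
def diagnose(cot_action, executed_action, gold_action):
    """Classify one agent step by reasoning (GTA) vs. execution (EM)."""
    gta = cot_action == gold_action        # CoT-implied action correct?
    em = executed_action == gold_action    # executed action correct?
    if gta and em:
        return "aligned"
    if gta and not em:
        return "execution gap (EG): right plan, wrong action"
    if em and not gta:
        return "reasoning gap (RG): right action, conflicting plan"
    return "both wrong"

print(diagnose("tap_confirm", "tap_cancel", "tap_confirm"))  # EG
print(diagnose("tap_cancel", "tap_confirm", "tap_confirm"))  # RG
```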
[83] More Than One Teacher: Adaptive Multi-Guidance Policy Optimization for Diverse Exploration
Xiaoyang Yuan, Yujuan Ding, Yi Bin, Wenqi Shao, Jinyu Cai, Jingkuan Song, Yang Yang, Hengtao Shen
Main category: cs.CL
TL;DR: AMPO introduces adaptive multi-teacher guidance for RLVR that provides guidance only when needed, improving reasoning diversity and performance over single-teacher methods.
Details
Motivation: Current RLVR methods rely on self-exploration or single off-policy teachers, which can introduce model biases and limit reasoning diversity and performance.Method: AMPO adaptively leverages guidance from multiple proficient teacher models only when the on-policy model fails, with a comprehension-based selection mechanism for effective learning.
Result: AMPO achieves 4.3% improvement on mathematical reasoning and 12.2% on out-of-distribution tasks, with better Pass@k performance and more diverse exploration than GRPO baseline.
Conclusion: AMPO demonstrates a more efficient and scalable path to superior reasoning and generalizability, achieving comparable results to single powerful teachers with less data.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is a promising paradigm for enhancing the reasoning ability in Large Language Models (LLMs). However, prevailing methods primarily rely on self-exploration or a single off-policy teacher to elicit long chain-of-thought (LongCoT) reasoning, which may introduce intrinsic model biases and restrict exploration, ultimately limiting reasoning diversity and performance. Drawing inspiration from multi-teacher strategies in knowledge distillation, we introduce Adaptive Multi-Guidance Policy Optimization (AMPO), a novel framework that adaptively leverages guidance from multiple proficient teacher models, but only when the on-policy model fails to generate correct solutions. This “guidance-on-demand” approach expands exploration while preserving the value of self-discovery. Moreover, AMPO incorporates a comprehension-based selection mechanism, prompting the student to learn from the reasoning paths that it is most likely to comprehend, thus balancing broad exploration with effective exploitation. Extensive experiments show AMPO substantially outperforms a strong baseline (GRPO), with a 4.3% improvement on mathematical reasoning tasks and 12.2% on out-of-distribution tasks, while significantly boosting Pass@k performance and enabling more diverse exploration. Notably, using four peer-sized teachers, our method achieves comparable results to approaches that leverage a single, more powerful teacher (e.g., DeepSeek-R1) with more data. These results demonstrate a more efficient and scalable path to superior reasoning and generalizability. Our code is available at https://github.com/SII-Enigma/AMPO.
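The guidance-on-demand logic can be sketched as a batch-construction rule. In this hedged illustration, comprehension-based selection is approximated by picking the teacher trace with the highest student score, an assumption standing in for the paper's actual mechanism; the toy scoring function and strings are placeholders.
```python
def build_training_batch(rollouts, teacher_paths, student_score, is_correct):
    """Consult teachers only when every on-policy rollout fails."""
    if any(is_correct(r) for r in rollouts):
        # self-discovery succeeded: train on the model's own correct traces
        return [r for r in rollouts if is_correct(r)]
    # all rollouts failed: keep the most comprehensible correct teacher path
    correct_teachers = [t for t in teacher_paths if is_correct(t)]
    return [max(correct_teachers, key=student_score)] if correct_teachers else []

batch = build_training_batch(
    rollouts=["wrong proof A", "wrong proof B"],
    teacher_paths=["teacher proof 1", "teacher proof 2"],
    student_score=len,                       # toy stand-in for comprehension
    is_correct=lambda t: t.startswith("teacher"),
)
print(batch)  # ['teacher proof 1']: guidance used only on failure
```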
[84] Enhanced Arabic-language cyberbullying detection: deep embedding and transformer (BERT) approaches
Ebtesam Jaber Aljohani, Wael M. S. Yafoo
Main category: cs.CL
TL;DR: This paper addresses the scarcity of Arabic-language cyberbullying detection methods by developing deep learning models that achieve up to 98% accuracy using Bi-LSTM with FastText embeddings.
Details
Motivation: The growth of social media platforms like X (formerly Twitter) endangers young people's emotional well-being through cyberbullying. While many cyberbullying detection methods exist for English, there is a significant lack of effective approaches for Arabic-language content.Method: The researchers assembled a dataset of 10,662 X posts, pre-processed the data, and used kappa tool for annotation quality verification. They conducted four experiments testing various deep learning models including LSTM, Bi-LSTM with different word embeddings, and hybrid models combining LSTM/Bi-LSTM with BERT embeddings.
Result: LSTM-BERT and Bi-LSTM-BERT models demonstrated 97% accuracy, while Bi-LSTM with FastText embedding achieved even better performance with 98% accuracy. The results show strong generalization capabilities.
Conclusion: The proposed deep learning models effectively detect cyberbullying in Arabic-language content, with Bi-LSTM using FastText embeddings achieving the highest accuracy of 98%, addressing the critical gap in Arabic cyberbullying detection methods.
Abstract: Recent technological advances in smartphones and communications, including the growth of massive social media networks such as X (formerly known as Twitter), endanger young people and their emotional well-being by exposing them to cyberbullying, taunting, and bullying content. Most proposed approaches for automatically detecting cyberbullying have been developed around the English language, and methods for detecting Arabic-language cyberbullying are scarce. This paper aims to enhance the effectiveness of methods for detecting cyberbullying in Arabic-language content. We assembled a dataset of 10,662 X posts, pre-processed the data, and used the kappa tool to verify and enhance the quality of our annotations. We conducted four experiments to test numerous deep learning models for automatically detecting Arabic-language cyberbullying. We first tested a long short-term memory (LSTM) model and a bidirectional long short-term memory (Bi-LSTM) model with several experimental word embeddings. We then tested the LSTM and Bi-LSTM models with pre-trained bidirectional encoder representations from transformers (BERT) embeddings. LSTM-BERT and Bi-LSTM-BERT demonstrated 97% accuracy. Bi-LSTM with FastText word embeddings performed even better, achieving 98% accuracy, and the outcomes generalize well.
[85] AccurateRAG: A Framework for Building Accurate Retrieval-Augmented Question-Answering Applications
Linh The Nguyen, Chi Tran, Dung Ngoc Nguyen, Van-Cuong Pham, Hoang Ngo, Dat Quoc Nguyen
Main category: cs.CL
TL;DR: AccurateRAG is a novel framework for building high-performance question-answering applications using retrieval-augmented generation (RAG), featuring a comprehensive development pipeline and achieving state-of-the-art results.
Details
Motivation: To address the need for efficient development of high-performance question-answering systems using RAG technology with comprehensive tools and improved performance over existing baselines.Method: A complete pipeline framework including raw dataset processing, fine-tuning data generation, text embedding & LLM fine-tuning, output evaluation, and local RAG system building.
Result: Experimental results demonstrate that AccurateRAG outperforms previous strong baselines and achieves new state-of-the-art question-answering performance on benchmark datasets.
Conclusion: AccurateRAG provides an effective framework for developing high-performance RAG-based question-answering applications with superior performance compared to existing approaches.
Abstract: We introduce AccurateRAG – a novel framework for constructing high-performance question-answering applications based on retrieval-augmented generation (RAG). Our framework offers a pipeline for development efficiency with tools for raw dataset processing, fine-tuning data generation, text embedding & LLM fine-tuning, output evaluation, and building RAG systems locally. Experimental results show that our framework outperforms previous strong baselines and obtains new state-of-the-art question-answering performance on benchmark datasets.
[86] Explore Briefly, Then Decide: Mitigating LLM Overthinking via Cumulative Entropy Regulation
Tianyi Jiang, Yi Bin, Yujuan Ding, Kainian Zhu, Fei Ma, Jingkuan Song, Heng Tao Shen
Main category: cs.CL
TL;DR: The paper introduces TECA metric and CER mechanism to address LLM overthinking by dynamically determining when to stop reasoning, reducing response length by up to 71% on simpler problems while maintaining performance.
Details
Motivation: LLMs often generate unnecessarily lengthy reasoning steps for simpler problems (overthinking), which degrades efficiency and makes it difficult to adapt reasoning depth to problem complexity.Method: Proposes Token Entropy Cumulative Average (TECA) metric to measure reasoning exploration, and Cumulative Entropy Regulation (CER) mechanism implementing ‘Explore Briefly, Then Decide’ paradigm to dynamically determine optimal stopping point.
Result: Experimental results show substantial reduction in overthinking without sacrificing problem-solving ability, with average response length decreasing by up to 71% on simpler datasets.
Conclusion: The approach creates a more efficient and adaptive reasoning process for LLMs by preventing unnecessary lengthy reasoning for simpler problems.
Abstract: Large Language Models (LLMs) have demonstrated remarkable reasoning abilities on complex problems using long Chain-of-Thought (CoT) reasoning. However, they often suffer from overthinking, meaning generating unnecessarily lengthy reasoning steps for simpler problems. This issue may degrade the efficiency of the models and make them difficult to adapt the reasoning depth to the complexity of problems. To address this, we introduce a novel metric Token Entropy Cumulative Average (TECA), which measures the extent of exploration throughout the reasoning process. We further propose a novel reasoning paradigm – Explore Briefly, Then Decide – with an associated Cumulative Entropy Regulation (CER) mechanism. This paradigm leverages TECA to help the model dynamically determine the optimal point to conclude its thought process and provide a final answer, thus achieving efficient reasoning. Experimental results across diverse mathematical benchmarks show that our approach substantially mitigates overthinking without sacrificing problem-solving ability. With our thinking paradigm, the average response length decreases by up to 71% on simpler datasets, demonstrating the effectiveness of our method in creating a more efficient and adaptive reasoning process.
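TECA is simply the running mean of per-token entropies over the reasoning generated so far. The stopping rule below (halt once TECA falls under a threshold) is an illustrative reading of Cumulative Entropy Regulation, not the paper's exact mechanism; the token distributions are hand-made.
```python
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def teca_trace(stepwise_probs):
    """Token Entropy Cumulative Average after each generated token."""
    total, trace = 0.0, []
    for t, probs in enumerate(stepwise_probs, start=1):
        total += entropy(probs)
        trace.append(total / t)  # cumulative average up to token t
    return trace

# exploration (flat distributions) early, commitment (peaked) later
steps = [[0.25] * 4, [0.25] * 4, [0.9, 0.1], [0.99, 0.01]]
for t, v in enumerate(teca_trace(steps), 1):
    print(f"token {t}: TECA = {v:.3f}")  # decreasing -> safe to conclude
```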
[87] InfoMosaic-Bench: Evaluating Multi-Source Information Seeking in Tool-Augmented Agents
Yaxin Du, Yuanshuo Zhang, Xiyuan Yang, Yifan Zhou, Cheng Wang, Gongyi Zou, Xianghe Pang, Wenhao Wang, Menglan Chen, Shuo Tang, Zhiyu Li, Siheng Chen
Main category: cs.CL
TL;DR: InfoMosaic-Bench is the first benchmark for multi-source information seeking in tool-augmented agents, requiring integration of general-purpose search with domain-specific tools across six domains, revealing current LLMs’ limitations in effective tool usage.
Details
Motivation: Existing LLM agents rely heavily on noisy and unreliable open-web search, lacking access to precise domain-specific knowledge. While MCP enables interface with specialized tools, it's unclear if agents can effectively leverage them and integrate with general search for complex tasks.Method: Created InfoMosaic-Bench covering six domains (medicine, finance, maps, video, web, multi-domain) with tasks synthesized using InfoMosaic-Flow pipeline that grounds conditions in verified tool outputs, enforces cross-source dependencies, and filters trivial lookup cases.
Result: Experiments with 14 SOTA LLM agents show: web-only achieves 38.2% accuracy; domain tools provide selective benefits; 22.4% failures due to incorrect tool usage/selection, highlighting LLMs’ struggle with basic tool handling.
Conclusion: Current LLM agents still face significant challenges in effectively leveraging domain-specific tools and integrating them with general search, with fundamental issues in tool usage and selection that need addressing.
Abstract: Information seeking is a fundamental requirement for humans. However, existing LLM agents rely heavily on open-web search, which exposes two fundamental weaknesses: online content is noisy and unreliable, and many real-world tasks require precise, domain-specific knowledge unavailable from the web. The emergence of the Model Context Protocol (MCP) now allows agents to interface with thousands of specialized tools, seemingly resolving this limitation. Yet it remains unclear whether agents can effectively leverage such tools – and more importantly, whether they can integrate them with general-purpose search to solve complex tasks. Therefore, we introduce InfoMosaic-Bench, the first benchmark dedicated to multi-source information seeking in tool-augmented agents. Covering six representative domains (medicine, finance, maps, video, web, and multi-domain integration), InfoMosaic-Bench requires agents to combine general-purpose search with domain-specific tools. Tasks are synthesized with InfoMosaic-Flow, a scalable pipeline that grounds task conditions in verified tool outputs, enforces cross-source dependencies, and filters out shortcut cases solvable by trivial lookup. This design guarantees both reliability and non-triviality. Experiments with 14 state-of-the-art LLM agents reveal three findings: (i) web information alone is insufficient, with GPT-5 achieving only 38.2% accuracy and 67.5% pass rate; (ii) domain tools provide selective but inconsistent benefits, improving some domains while degrading others; and (iii) 22.4% of failures arise from incorrect tool usage or selection, highlighting that current LLMs still struggle with even basic tool handling.
[88] Parallel Scaling Law: Unveiling Reasoning Generalization through A Cross-Linguistic Perspective
Wen Yang, Junhong Wu, Chong Li, Chengqing Zong, Jiajun Zhang
Main category: cs.CL
TL;DR: This paper investigates whether reasoning capabilities from English Reinforcement Post-Training (RPT) transfer to other languages, revealing significant cross-lingual transferability variations and proposing metrics to quantify this transfer.
Details
Motivation: To address the gap in understanding cross-linguistic reasoning generalization in Large Reasoning Models (LRMs), challenging the assumption that LRM reasoning mirrors human cognition across languages.Method: Systematic evaluation of English-centric LRMs on multilingual reasoning benchmarks, introducing cross-lingual transferability metrics, and conducting interventional studies including parallel training experiments.
Result: Cross-lingual transferability varies significantly by model, language, and training paradigm. Models with stronger English capabilities show diminished cross-lingual generalization. Parallel training reveals three key findings: First-Parallel Leap, Parallel Scaling Law, and Monolingual Generalization Gap.
Conclusion: English-centric LRMs fail to fully generalize across languages, challenging assumptions about LRM reasoning mirroring human cognition, and providing insights for developing more language-agnostic models.
Abstract: Recent advancements in Reinforcement Post-Training (RPT) have significantly enhanced the capabilities of Large Reasoning Models (LRMs), sparking increased interest in the generalization of RL-based reasoning. While existing work has primarily focused on investigating its generalization across tasks or modalities, this study proposes a novel cross-linguistic perspective to investigate reasoning generalization. This raises a crucial question: $\textit{Does the reasoning capability achieved from English RPT effectively transfer to other languages?}$ We address this by systematically evaluating English-centric LRMs on multilingual reasoning benchmarks and introducing a metric to quantify cross-lingual transferability. Our findings reveal that cross-lingual transferability varies significantly across initial model, target language, and training paradigm. Through interventional studies, we find that models with stronger initial English capabilities tend to over-rely on English-specific patterns, leading to diminished cross-lingual generalization. To address this, we conduct a thorough parallel training study. Experimental results yield three key findings: $\textbf{First-Parallel Leap}$, a substantial leap in performance when transitioning from monolingual to just a single parallel language, and a predictable $\textbf{Parallel Scaling Law}$, revealing that cross-lingual reasoning transfer follows a power-law with the number of training parallel languages. Moreover, we identify the discrepancy between actual monolingual performance and the power-law prediction as $\textbf{Monolingual Generalization Gap}$, indicating that English-centric LRMs fail to fully generalize across languages. Our study challenges the assumption that LRM reasoning mirrors human cognition, providing critical insights for the development of more language-agnostic LRMs.
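Checking a power-law scaling relation of the kind the paper reports amounts to a log-log linear fit. The sketch below fits transfer score s to a * n^b over the number of parallel training languages n; the data points are made-up illustrations, not the paper's measurements.
```python
import math

n_langs = [1, 2, 4, 8]
scores = [0.30, 0.39, 0.50, 0.65]  # hypothetical transfer scores

# least-squares line in log-log space: log s = log a + b * log n
xs = [math.log(n) for n in n_langs]
ys = [math.log(s) for s in scores]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
    sum((x - mx) ** 2 for x in xs)
a = math.exp(my - b * mx)
print(f"s ~ {a:.3f} * n^{b:.3f}")  # exponent b summarizes scaling behavior
```
The gap between a monolingual model's actual score and this fit's prediction at n = 1 is, in the paper's terms, the Monolingual Generalization Gap.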
[89] From Behavioral Performance to Internal Competence: Interpreting Vision-Language Models with VLM-Lens
Hala Sheta, Eric Huang, Shuyu Wu, Ilia Alenabi, Jiajun Hong, Ryker Lin, Ruoxi Ning, Daniel Wei, Jialin Yang, Jiawei Zhou, Ziqiao Ma, Freda Shi
Main category: cs.CL
TL;DR: VLM-Lens is a toolkit for systematic benchmarking and analysis of vision-language models by extracting intermediate outputs from any layer during forward pass.
Details
Motivation: To enable systematic benchmarking, analysis, and interpretation of vision-language models (VLMs) by providing a unified interface for extracting intermediate outputs.Method: Provides a YAML-configurable interface that abstracts model-specific complexities, supports extraction of intermediate outputs from any layer, and integrates with various interpretability methods.
Result: Currently supports 16 state-of-the-art base VLMs and over 30 variants, with demonstrated usage revealing systematic differences in hidden representations across layers and target concepts.
Conclusion: VLM-Lens is released as open-source to accelerate community efforts in understanding and improving VLMs through systematic analysis and benchmarking.
Abstract: We introduce VLM-Lens, a toolkit designed to enable systematic benchmarking, analysis, and interpretation of vision-language models (VLMs) by supporting the extraction of intermediate outputs from any layer during the forward pass of open-source VLMs. VLM-Lens provides a unified, YAML-configurable interface that abstracts away model-specific complexities and supports user-friendly operation across diverse VLMs. It currently supports 16 state-of-the-art base VLMs and their over 30 variants, and is extensible to accommodate new models without changing the core logic. The toolkit integrates easily with various interpretability and analysis methods. We demonstrate its usage with two simple analytical experiments, revealing systematic differences in the hidden representations of VLMs across layers and target concepts. VLM-Lens is released as an open-sourced project to accelerate community efforts in understanding and improving VLMs.
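The general mechanism a toolkit like VLM-Lens builds on is intermediate-output capture during the forward pass, which PyTorch exposes as forward hooks. The sketch below is not VLM-Lens's actual API; a tiny MLP stands in for a VLM.
```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))
captured = {}

def make_hook(name):
    def hook(module, inputs, output):
        captured[name] = output.detach()  # store this layer's activation
    return hook

for name, module in model.named_modules():
    if name:  # skip the top-level container itself
        module.register_forward_hook(make_hook(name))

model(torch.randn(1, 16))  # one forward pass fills `captured`
print({name: tuple(t.shape) for name, t in captured.items()})
# {'0': (1, 32), '1': (1, 32), '2': (1, 8)}
```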
[90] F2LLM Technical Report: Matching SOTA Embedding Performance with 6 Million Open-Source Data
Ziyin Zhang, Zihan Liao, Hang Yu, Peng Di, Rui Wang
Main category: cs.CL
TL;DR: F2LLM is a suite of embedding models (0.6B, 1.7B, 4B) that achieves state-of-the-art performance by directly finetuning foundation models on open-source data, avoiding costly synthetic training and complex pipelines.
Details
Motivation: To create high-performance embedding models that are more cost-effective and reproducible than existing approaches that require massive contrastive pretraining, sophisticated pipelines, and expensive synthetic data.Method: Directly finetune foundation models on 6 million query-document-negative tuples from open-source, non-synthetic datasets, using a simpler training approach compared to previous methods.
Result: F2LLM-4B ranks 2nd among 4B-parameter models and 7th overall on MTEB English leaderboard; F2LLM-1.7B ranks 1st in the 1B-2B size range, demonstrating strong performance with reduced training costs.
Conclusion: F2LLM provides a strong, reproducible, and budget-friendly baseline for embedding models, with released models, training data, and code to facilitate future research.
Abstract: We introduce F2LLM - Foundation to Feature Large Language Models, a suite of state-of-the-art embedding models in three sizes: 0.6B, 1.7B, and 4B. Unlike previous top-ranking embedding models that require massive contrastive pretraining, sophisticated training pipelines, and costly synthetic training data, F2LLM is directly finetuned from foundation models on 6 million query-document-negative tuples curated from open-source, non-synthetic datasets, striking a strong balance between training cost, model size, and embedding performance. On the MTEB English leaderboard, F2LLM-4B ranks 2nd among models with approximately 4B parameters and 7th overall, while F2LLM-1.7B ranks 1st among models in the 1B-2B size range. To facilitate future research in the field, we release the models, training dataset, and code, positioning F2LLM as a strong, reproducible, and budget-friendly baseline for future works.
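The training signal behind query-document-negative tuples is typically an InfoNCE-style contrastive loss over similarities; F2LLM's exact loss, temperature, and pooling are assumptions in the sketch below, not confirmed details.
```python
import torch
import torch.nn.functional as F

def contrastive_loss(q, pos, neg, temperature=0.05):
    """q, pos, neg: (batch, dim) embeddings for queries, positive
    documents, and hard negatives from the finetuned LLM."""
    q, pos, neg = (F.normalize(t, dim=-1) for t in (q, pos, neg))
    pos_sim = (q * pos).sum(-1, keepdim=True)   # (batch, 1)
    neg_sim = q @ neg.T                         # (batch, batch) negatives
    logits = torch.cat([pos_sim, neg_sim], dim=-1) / temperature
    targets = torch.zeros(q.size(0), dtype=torch.long)  # positive at col 0
    return F.cross_entropy(logits, targets)

q, p, n = (torch.randn(4, 64) for _ in range(3))
print(contrastive_loss(q, p, n).item())
```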
[91] Drawing Conclusions from Draws: Rethinking Preference Semantics in Arena-Style LLM Evaluation
Raphael Tang, Crystina Zhang, Wenyan Li, Carmen Lai, Pontus Stenetorp, Yao Lu
Main category: cs.CL
TL;DR: Current Elo rating systems treat draws as indicating equal model performance, but this paper argues draws actually reflect query difficulty. Ignoring draws in rating updates improves prediction accuracy by 1-3%.
Details
Motivation: To challenge the conventional interpretation of draws in LLM arena evaluations, questioning whether draws truly indicate equal model performance or rather reflect query difficulty.
Method: Analyzed three real-world arena datasets, tested four rating systems while ignoring rating updates for draws, and examined query properties like difficulty and objectivity.
Result: Ignoring draws in rating updates yielded 1-3% relative increase in battle outcome prediction accuracy. Draws occurred more frequently for very easy queries (risk ratio 1.37) and highly objective queries (risk ratio 1.35).
Conclusion: Future rating systems should reconsider draw semantics and incorporate query properties in rating updates, as draws are more indicative of query difficulty than equal model performance.
Abstract: In arena-style evaluation of large language models (LLMs), two LLMs respond to a user query, and the user chooses the winning response or deems the “battle” a draw, resulting in an adjustment to the ratings of both models. The prevailing approach for modeling these rating dynamics is to view battles as two-player game matches, as in chess, and apply the Elo rating system and its derivatives. In this paper, we critically examine this paradigm. Specifically, we question whether a draw genuinely means that the two models are equal and hence whether their ratings should be equalized. Instead, we conjecture that draws are more indicative of query difficulty: if the query is too easy, then both models are more likely to succeed equally. On three real-world arena datasets, we show that ignoring rating updates for draws yields a 1-3% relative increase in battle outcome prediction accuracy (which includes draws) for all four rating systems studied. Further analyses suggest that draws occur more often for queries rated as very easy and for those rated as highly objective, with risk ratios of 1.37 and 1.35, respectively. We recommend that future rating systems reconsider existing draw semantics and account for query properties in rating updates.
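The paper's core intervention is easy to state in code: perform the usual Elo update for wins and losses, but skip the update entirely on a draw. A minimal sketch with the standard Elo formula (the paper also studies three other rating systems, which are not shown here):

```python
def elo_update(r_a, r_b, outcome, k=32.0):
    """One Elo update; outcome is 1.0 if A wins, 0.0 if B wins, None for a draw.

    Following the paper's finding, a draw triggers no rating change at all,
    rather than the classical score of 0.5 that pulls both ratings together.
    """
    if outcome is None:          # draw: skip the update entirely
        return r_a, r_b
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a += k * (outcome - expected_a)
    r_b += k * ((1.0 - outcome) - (1.0 - expected_a))
    return r_a, r_b

print(elo_update(1000.0, 1100.0, None))  # draw: (1000.0, 1100.0), unchanged
print(elo_update(1000.0, 1100.0, 1.0))   # upset win: A gains, B loses
```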
[92] Superficial Safety Alignment Hypothesis
Jianwei Li, Jung-Eun Kim
Main category: cs.CL
TL;DR: The paper proposes the Superficial Safety Alignment Hypothesis (SSAH), suggesting safety alignment in LLMs is a simple binary classification task that can be achieved by identifying and manipulating specific neuron-level components.
Details
Motivation: Current LLM alignment approaches focus on general instruction-following but overlook the brittleness of safety mechanisms, creating a need for more robust safety alignment methods.
Method: Proposed the SSAH hypothesis and identified four types of attribute-critical components: Safety Critical Unit (SCU), Utility Critical Unit (UCU), Complex Unit (CU), and Redundant Unit (RU). Used component freezing and leveraged redundant units as an alignment budget.
Result: Freezing safety-critical components during fine-tuning preserves safety attributes while adapting to new tasks. Using redundant units minimizes alignment tax while achieving alignment goals.
Conclusion: Safety alignment in LLMs operates at the neuron level and should not be complicated, with atomic functional units serving as the basis for effective safety mechanisms.
Abstract: As large language models (LLMs) are increasingly integrated into various applications, ensuring they generate safe responses is a pressing need. Previous studies on alignment have largely focused on general instruction-following but have often overlooked the distinct properties of safety alignment, such as the brittleness of safety mechanisms. To bridge the gap, we propose the Superficial Safety Alignment Hypothesis (SSAH), which posits that safety alignment teaches an otherwise unsafe model to choose the correct reasoning direction (fulfill or refuse users’ requests), interpreted as an implicit binary classification task. Through SSAH, we hypothesize that only a few essential components can establish safety guardrails in LLMs. We successfully identify four types of attribute-critical components: Safety Critical Unit (SCU), Utility Critical Unit (UCU), Complex Unit (CU), and Redundant Unit (RU). Our findings show that freezing certain safety-critical components during fine-tuning allows the model to retain its safety attributes while adapting to new tasks. Similarly, we show that leveraging redundant units in the pre-trained model as an “alignment budget” can effectively minimize the alignment tax while achieving the alignment goal. All considered, this paper concludes that the atomic functional unit for safety in LLMs is at the neuron level and underscores that safety alignment should not be complicated.
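A minimal sketch of the freezing step, assuming some attribution procedure has already produced the set of safety-critical parameter names; the identification method itself, which is the paper's main contribution, is not reproduced here.

```python
import torch.nn as nn

def freeze_safety_critical(model: nn.Module, safety_critical: set[str]) -> None:
    """Freeze parameters identified as safety-critical before fine-tuning.

    `safety_critical` is a hypothetical set of parameter names produced by
    whatever attribution procedure identifies Safety Critical Units (SCUs).
    Frozen parameters receive no gradient, so safety behavior is preserved
    while the remaining (e.g., redundant) units adapt to the new task.
    """
    for name, param in model.named_parameters():
        if name in safety_critical:
            param.requires_grad = False

# Usage (parameter names are illustrative):
# freeze_safety_critical(model, {"layers.3.mlp.down_proj.weight"})
```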
[93] Can Code-Switched Texts Activate a Knowledge Switch in LLMs? A Case Study on English-Korean Code-Switching
Seoyeon Kim, Huiseo Kim, Chanjun Park, Jinyoung Yeo, Dongha Lee
Main category: cs.CL
TL;DR: Code-switching can activate knowledge in LLMs for low-resource language tasks, especially in language-specific domains.
Details
Motivation: LLMs are English-centric due to training data dominance, and low-resource languages face limited resources. Code-switching can convey cultural nuances and elicit language-specific knowledge.
Method: Created EnKoQA, an English-Korean code-switching QA dataset, and analyzed multilingual LLMs by dividing the activation process into knowledge identification and knowledge leveraging.
Result: Code-switching activates knowledge more effectively than English text, particularly in language-specific domains.
Conclusion: Code-switching shows potential for improving LLM performance on low-resource language tasks.
Abstract: Recent large language models (LLMs) demonstrate multilingual abilities, yet they are English-centric due to the dominance of English in training corpora. The limited resources for low-resource languages remain a crucial challenge. Code-switching (CS), a phenomenon where multilingual speakers alternate between languages in a discourse, can convey subtle cultural and linguistic nuances that would otherwise be lost in translation, and elicits language-specific knowledge in human communication. In light of this, we investigate whether code-switching can activate, i.e., identify and leverage, knowledge for reasoning when LLMs solve low-resource language tasks. To facilitate the research, we first present EnKoQA, a synthetic English-Korean CS question-answering dataset. We provide comprehensive analysis on a variety of multilingual LLMs by subdividing the activation process into knowledge identification and knowledge leveraging. Our results demonstrate that, compared to English text, CS can faithfully activate knowledge inside LLMs, especially in language-specific domains, suggesting the potential of code-switching for low-resource language tasks.
[94] Self-Consistency Falls Short! The Adverse Effects of Positional Bias on Long-Context Problems
Adam Byerly, Daniel Khashabi
Main category: cs.CL
TL;DR: Self-consistency (SC) degrades performance on long-context tasks due to persistent position bias, unlike its benefits for short-context tasks.
Details
Motivation: To challenge the assumption that SC's benefits generalize to long-context settings where LLMs struggle with position bias.
Method: Comprehensive experimentation with varying state-of-the-art models, tasks, and SC formulations on long-context problems.
Result: SC not only fails to improve but actively degrades performance on long-context tasks, with degradation worsening with longer contexts and smaller models.
Conclusion: Current LLMs have limitations in long-context understanding, and SC amplifies positional errors rather than diversifying reasoning paths, highlighting the need for more sophisticated approaches.
Abstract: Self-consistency (SC) improves the performance of large language models (LLMs) across various tasks and domains that involve short content. However, does this effectiveness extend to long-context problems? We challenge the assumption that SC’s benefits generalize to long-context settings, where LLMs often struggle with position bias, the systematic over-reliance on specific context regions, which hinders their ability to utilize information effectively from all parts of their context. Through comprehensive experimentation with varying state-of-the-art models, tasks, and SC formulations, we find that SC not only fails to improve but actively degrades performance on long-context tasks. This degradation is driven by persistent position bias, which worsens with longer context lengths and smaller model sizes but remains invariant to prompt format or task type. Unlike short-context tasks, where SC diversifies reasoning paths, long-context SC amplifies positional errors. These comprehensive results provide valuable insight into the limitations of current LLMs in long-context understanding and highlight the need for more sophisticated approaches.
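For context, vanilla self-consistency is just sampling plus majority vote. A minimal sketch, with `sample_fn` as a hypothetical stand-in for one stochastic (temperature > 0) LLM call, makes the failure mode concrete: if every sample shares the same positional blind spot, the vote simply concentrates on the same wrong answer.

```python
from collections import Counter

def self_consistency(sample_fn, question: str, n: int = 16) -> str:
    """Plain self-consistency: sample n reasoning paths, majority-vote the answer.

    `sample_fn(question)` is a hypothetical stand-in for one stochastic LLM
    call returning a final answer string. Correlated positional errors across
    samples defeat the vote on long-context inputs.
    """
    answers = [sample_fn(question) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```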
[95] Reasoning over User Preferences: Knowledge Graph-Augmented LLMs for Explainable Conversational Recommendations
Zhangchi Qiu, Linhao Luo, Shirui Pan, Alan Wee-Chung Liew
Main category: cs.CL
TL;DR: COMPASS is a plug-and-play framework that synergizes LLMs and KGs to enhance explainability and performance in Conversational Recommender Systems by bridging the modality gap between structured knowledge graphs and natural language conversations.
Details
Motivation: Current CRSs use latent vector representations from KGs or language models that limit explainability, while LLMs struggle with user preference reasoning and face modality gaps when integrating with KGs.
Method: Two-stage training: 1) Graph entity captioning pre-training to bridge the KG-natural language gap, 2) Knowledge-aware instruction fine-tuning where the LLM learns to reason over dialogue histories and KG-augmented context to generate interpretable user preferences.
Result: Experiments on benchmark datasets show COMPASS effectively improves various CRS models by generating interpretable user preferences that enhance both recommendation performance and explainability.
Conclusion: COMPASS successfully bridges the modality gap between KGs and LLMs, enabling knowledge-aware reasoning and interpretable preference generation that can be seamlessly integrated with existing CRS models to improve both performance and transparency.
Abstract: Conversational Recommender Systems (CRSs) aim to provide personalized recommendations by capturing user preferences through interactive dialogues. Explainability in CRSs is crucial as it enables users to understand the reasoning behind recommendations, increasing system transparency and trustworthiness. However, current CRSs often leverage knowledge graphs (KGs) or language models to extract and represent user preferences as latent vectors, which limits their explainability. Large language models (LLMs) offer powerful reasoning capabilities that can bridge this gap by generating human-understandable preference summaries. However, effectively reasoning over user preferences in CRSs remains challenging as LLMs pre-trained on large-scale corpora may not be well-suited for analyzing user preferences. While KGs provide rich domain knowledge, integrating them with LLMs encounters a significant modality gap between structured KG information and unstructured conversations. In this paper, we propose COMPASS, a plug-and-play framework that synergizes LLMs and KGs to reason over user preferences, enhancing the performance and explainability of existing CRSs. COMPASS employs a two-stage training approach: first, it bridges the gap between the structured KG and natural language through novel graph entity captioning pre-training. Next, COMPASS optimizes user preference reasoning via knowledge-aware instruction fine-tuning, where the LLM learns to reason and summarize user preferences from dialogue histories and KG-augmented context. This enables COMPASS to perform knowledge-aware reasoning and generate interpretable user preferences that can seamlessly integrate with existing CRS models for improving recommendation performance and explainability. Our experiments on benchmark datasets demonstrate the effectiveness of COMPASS in improving various CRS models.
[96] Push the Limit of Multi-modal Emotion Recognition by Prompting LLMs with Receptive-Field-Aware Attention Weighting
Han Zhang, Yu Lu, Liyun Zhang, Dian Ding, Dinghua Zhao, Yi-Chao Chen, Ye Wu, Guangtao Xue
Main category: cs.CL
TL;DR: Lantern is a framework that enhances emotion recognition in dialogues by combining vanilla models with LLMs using receptive-field-aware attention weighting, achieving up to 1.80% performance improvement.
Details
Motivation: To leverage both the power of LLMs and supplementary multimedia features for better emotion understanding in dialogues, overcoming limitations of text-only processing and high computational costs.
Method: Train a multi-task vanilla model to produce emotion probabilities and dimension scores, feed these to LLMs for adjustment using external knowledge, slice dialogues into receptive fields, and merge predictions with attention-driven weighting.
Result: Experiments on IEMOCAP with 4-way and 6-way settings showed Lantern improves vanilla models (CORECT and SDT) by up to 1.23% and 1.80% when combined with GPT-4 or Llama-3.1-405B.
Conclusion: Lantern effectively enhances emotion recognition by integrating vanilla models with LLMs through receptive-field-aware attention, demonstrating significant performance improvements.
Abstract: Understanding the emotions in a dialogue usually requires external knowledge to interpret the content accurately. As LLMs grow more and more powerful, we do not want to settle for the limited ability of a pre-trained language model alone. However, LLMs either can process only the text modality or are too expensive to process the multimedia information. We aim to utilize both the power of LLMs and the supplementary features from the multimedia modalities. In this paper, we present a framework, Lantern, that can improve the performance of a vanilla model by prompting large language models with receptive-field-aware attention weighting. The framework trains a multi-task vanilla model to produce probabilities of emotion classes and dimension scores. These predictions are fed into the LLMs as references to adjust the predicted probabilities of each emotion class with their external knowledge and contextual understanding. We slice the dialogue into different receptive fields, and each sample is included in exactly t receptive fields. Finally, the predictions of the LLMs are merged with a receptive-field-aware attention-driven weighting module. In the experiments, the vanilla models CORECT and SDT are deployed in Lantern with GPT-4 or Llama-3.1-405B. Experiments on IEMOCAP with 4-way and 6-way settings demonstrate that Lantern can significantly improve the performance of current vanilla models by up to 1.23% and 1.80%.
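A minimal sketch of the final merging step, assuming each utterance has already received one LLM-adjusted class distribution per receptive field. This shows only the attention-weighting idea, not Lantern's exact module:

```python
import torch
import torch.nn as nn

class ReceptiveFieldMerge(nn.Module):
    """Merge the t per-receptive-field predictions for one utterance.

    Each utterance appears in exactly t overlapping dialogue slices; the LLM
    returns one class-probability vector per slice, and a learned attention
    vector weights them into a single distribution.
    """
    def __init__(self, t: int):
        super().__init__()
        self.attn = nn.Parameter(torch.zeros(t))  # one logit per receptive field

    def forward(self, probs: torch.Tensor) -> torch.Tensor:
        # probs: (t, num_classes), one row per receptive field
        w = torch.softmax(self.attn, dim=0)       # (t,)
        return w @ probs                          # (num_classes,)

merged = ReceptiveFieldMerge(t=3)(torch.softmax(torch.randn(3, 6), dim=-1))
```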
[97] Initialization using Update Approximation is a Silver Bullet for Extremely Efficient Low-Rank Fine-Tuning
Kaustubh Ponkshe, Raghav Singhal, Eduard Gorbunov, Alexey Tumanov, Samuel Horvath, Praneeth Vepakomma
Main category: cs.CL
TL;DR: LoRA-SB achieves full fine-tuning performance in low-rank subspaces using optimal initialization, outperforming LoRA with 27-90x parameter reduction.
Details
Motivation: Low-rank adapters (LoRA) are efficient for fine-tuning LLMs but fall short of full fine-tuning performance. The goal is to approximate full fine-tuning within low-rank subspaces while maintaining parameter efficiency.
Method: Proposes LoRA-SB, a carefully designed initialization strategy that leverages the LoRA-XS architecture (inserting a learnable r x r matrix between B and A). The initialization provides an optimal low-rank approximation of the initial gradient and preserves update directions throughout training.
Result: Extensive experiments show LoRA-SB exceeds LoRA performance while using 27-90 times fewer parameters, and comprehensively outperforms LoRA-XS across mathematical reasoning, commonsense reasoning, and language understanding tasks.
Conclusion: It is possible to simulate full fine-tuning in low-rank subspaces, achieving significant parameter efficiency gains without sacrificing performance.
Abstract: Low-rank adapters have become standard for efficiently fine-tuning large language models, but they often fall short of achieving the performance of full fine-tuning. We propose a method, LoRA Silver Bullet or LoRA-SB, that approximates full fine-tuning within low-rank subspaces using a carefully designed initialization strategy. We theoretically demonstrate that the architecture of LoRA-XS, which inserts a learnable r x r matrix between B and A while keeping other matrices fixed, provides the precise conditions needed for this approximation. We leverage its constrained update space to achieve optimal scaling for high-rank gradient updates while removing the need for scaling factor tuning. We prove that our initialization offers an optimal low-rank approximation of the initial gradient and preserves update directions throughout training. Extensive experiments across mathematical reasoning, commonsense reasoning, and language understanding tasks demonstrate that our approach exceeds the performance of LoRA (and baselines) while using 27-90 times fewer learnable parameters, and comprehensively outperforms LoRA-XS. Our findings establish that it is possible to simulate full fine-tuning in low-rank subspaces, and achieve significant parameter efficiency gains without sacrificing performance. Our code is publicly available at: https://github.com/CERT-Lab/lora-sb.
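A minimal structural sketch of the LoRA-XS parameterization that LoRA-SB builds on: only the r x r matrix R is trainable, and B, A, R are initialized from a rank-r factorization of a gradient estimate. Here `G` is a hypothetical gradient estimate and the SVD-based split is an illustration of the idea, not the paper's exact initialization procedure.

```python
import torch
import torch.nn as nn

class LoRAXSLinear(nn.Module):
    """LoRA-XS-style adapter: y = (W + B R A) x, with only R (r x r) trainable."""

    def __init__(self, weight: torch.Tensor, G: torch.Tensor, r: int):
        super().__init__()
        self.weight = nn.Parameter(weight, requires_grad=False)  # frozen base W
        # Rank-r factorization of the (hypothetical) initial gradient estimate.
        U, S, Vh = torch.linalg.svd(G, full_matrices=False)
        self.B = nn.Parameter(U[:, :r], requires_grad=False)     # frozen
        self.A = nn.Parameter(Vh[:r, :], requires_grad=False)    # frozen
        self.R = nn.Parameter(torch.diag(S[:r]))                 # trainable r x r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ (self.weight + self.B @ self.R @ self.A).T

layer = LoRAXSLinear(torch.randn(64, 32), torch.randn(64, 32), r=4)
out = layer(torch.randn(2, 32))  # only 16 trainable parameters in this layer
```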
[98] Adapting Large Language Models for Character-based Augmentative and Alternative Communication
Dylan Gaines, Keith Vertanen
Main category: cs.CL
TL;DR: This paper presents a method for using subword-based large language models to make accurate character predictions for Augmentative and Alternative Communication (AAC) systems, including a domain adaptation procedure using curated AAC-relevant sentences.
Details
Motivation: Users of Augmentative and Alternative Communication (AAC) systems write letter-by-letter using character language models, but most state-of-the-art large pretrained language models predict subword tokens of variable length rather than individual characters, creating a practical challenge for AAC applications.
Method: The authors developed an algorithm to produce character predictions from subword large language models (LLMs) and investigated a domain adaptation procedure using a large dataset of sentences curated based on their usefulness for spoken or written AAC communication.
Result: The proposed algorithm provides more accurate character predictions than using a classification layer, byte-level LLM, or n-gram model. The domain adaptation procedure further improves model performance on simple, conversational text relevant to AAC communication.
Conclusion: The research demonstrates a practical approach to leverage subword LLMs for character-level prediction in AAC systems, with improved accuracy through specialized domain adaptation using AAC-relevant text data.
Abstract: Users of Augmentative and Alternative Communication (AAC) may write letter-by-letter via an interface that uses a character language model. However, most state-of-the-art large pretrained language models predict subword tokens of variable length. We investigate how to practically use such models to make accurate and efficient character predictions. Our algorithm for producing character predictions from a subword large language model (LLM) provides more accurate predictions than using a classification layer, a byte-level LLM, or an n-gram model. Additionally, we investigate a domain adaptation procedure based on a large dataset of sentences we curated based on scoring how useful each sentence might be for spoken or written AAC communication. We find our procedure further improves model performance on simple, conversational text.
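A simplified single-step version of the core idea: pool subword-token probability mass by first character. The paper's full algorithm additionally handles partial words and tokens that span the current word boundary, which this sketch ignores.

```python
import torch

def next_char_probs(logits: torch.Tensor, vocab: list[str]) -> dict[str, float]:
    """Approximate the next-character distribution from one subword distribution.

    Sums the probability of every token whose text starts with each character.
    A real implementation must also condition on the partially typed word,
    restricting to tokens consistent with it.
    """
    probs = torch.softmax(logits, dim=-1)
    char_probs: dict[str, float] = {}
    for token_id, token_text in enumerate(vocab):
        if token_text:  # skip empty/special tokens
            c = token_text[0]
            char_probs[c] = char_probs.get(c, 0.0) + probs[token_id].item()
    return char_probs

# Toy vocabulary: mass pools across all tokens sharing a first letter.
vocab = ["the", "then", "a", "an", "of"]
print(next_char_probs(torch.randn(len(vocab)), vocab))
```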
[99] Out-of-Distribution Detection using Synthetic Data Generation
Momin Abbas, Muneeza Azmat, Raya Horesh, Mikhail Yurochkin
Main category: cs.CL
TL;DR: A method using LLMs to generate synthetic out-of-distribution (OOD) data for classification tasks, eliminating the need for real OOD data collection.
Details
Motivation: OOD data is typically unavailable or difficult to collect, posing challenges for reliable OOD detection in classification systems.
Method: Harnessing generative capabilities of Large Language Models (LLMs) to create high-quality synthetic OOD proxies without external OOD data sources.
Result: Dramatically lowers false positive rates (achieving perfect zero in some cases) while maintaining high in-distribution accuracy, outperforming baselines significantly on nine InD-OOD dataset pairs.
Conclusion: The approach effectively addresses OOD detection challenges by using LLM-generated synthetic data, achieving superior performance without requiring real OOD data.
Abstract: Distinguishing in- and out-of-distribution (OOD) inputs is crucial for reliable deployment of classification systems. However, OOD data is typically unavailable or difficult to collect, posing a significant challenge for accurate OOD detection. In this work, we present a method that harnesses the generative capabilities of Large Language Models (LLMs) to create high-quality synthetic OOD proxies, eliminating the dependency on any external OOD data source. We study the efficacy of our method on classical text classification tasks such as toxicity detection and sentiment classification as well as classification tasks arising in LLM development and deployment, such as training a reward model for RLHF and detecting misaligned generations. Extensive experiments on nine InD-OOD dataset pairs and various model sizes show that our approach dramatically lowers false positive rates (achieving a perfect zero in some cases) while maintaining high accuracy on in-distribution tasks, outperforming baseline methods by a significant margin.
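One plausible realization of the idea (not necessarily the paper's exact recipe): embed in-distribution texts and LLM-generated OOD proxies, then fit an ordinary binary detector on top. The generation prompt and loop are omitted; `synth_ood_embs` stands in for embeddings of the synthetic proxies.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_ood_detector(ind_embs: np.ndarray, synth_ood_embs: np.ndarray):
    """Fit a binary OOD detector on in-distribution vs. synthetic OOD embeddings.

    `synth_ood_embs` are embeddings of OOD proxy texts produced by prompting
    an LLM; no real OOD data is needed at any point.
    """
    X = np.vstack([ind_embs, synth_ood_embs])
    y = np.concatenate([np.zeros(len(ind_embs)), np.ones(len(synth_ood_embs))])
    return LogisticRegression(max_iter=1000).fit(X, y)

# Toy usage with random embeddings standing in for encoder outputs.
detector = fit_ood_detector(np.random.randn(100, 32), np.random.randn(100, 32))
ood_scores = detector.predict_proba(np.random.randn(5, 32))[:, 1]  # P(OOD)
```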
[100] Interpretable Text Embeddings and Text Similarity Explanation: A Survey
Juri Opitz, Lucas Möller, Andrianos Michail, Sebastian Padó, Simon Clematide
Main category: cs.CL
TL;DR: A comprehensive survey of interpretable text embeddings and text similarity explanation methods, analyzing approaches, evaluations, and identifying future research opportunities.
Details
Motivation: Text embeddings are fundamental in NLP but face challenges in interpretability and explaining similarities, making this an important underexplored research area.
Method: Provides a structured overview and characterization of methods for inherently interpretable text embeddings and text similarity explanation, comparing evaluation approaches.
Result: Identifies main ideas, approaches, trade-offs, and overarching lessons learned in the field of interpretable text embeddings.
Conclusion: Highlights opportunities and open challenges for future research in interpretable text embeddings and similarity explanation methods.
Abstract: Text embeddings are a fundamental component in many NLP tasks, including classification, regression, clustering, and semantic search. However, despite their ubiquitous application, challenges persist in interpreting embeddings and explaining similarities between them. In this work, we provide a structured overview of methods specializing in inherently interpretable text embeddings and text similarity explanation, an underexplored research area. We characterize the main ideas, approaches, and trade-offs. We compare means of evaluation, discuss overarching lessons learned and finally identify opportunities and open challenges for future research.
[101] When Disagreements Elicit Robustness: Investigating Self-Repair Capabilities under LLM Multi-Agent Disagreements
Tianjie Ju, Bowen Wang, Hao Fei, Mong-Li Lee, Wynne Hsu, Yun Li, Qianren Wang, Pengzhou Cheng, Zongru Wu, Haodong Zhao, Zhuosheng Zhang, Gongshen Liu
Main category: cs.CL
TL;DR: Disagreements in multi-agent LLM systems have dual effects: general disagreements improve success by expanding exploration, while task-critical disagreements harm single-path reasoning but have limited impact on multi-path programming due to emergent self-repair capabilities.
Details
Motivation: To understand how disagreements shape collective decision-making in multi-agent LLM systems, particularly distinguishing between general disagreements that prevent premature consensus and task-critical disagreements that can derail collaboration depending on solution path topology.
Method: Investigated two collaborative settings: collaborative reasoning (CounterFact, MQuAKE-cf) with single evidential chains, and collaborative programming (HumanEval, GAIA) with multiple valid implementations. Disagreements were instantiated as general heterogeneity among agents and task-critical counterfactual knowledge edits injected into context or parameters.
Result: General disagreements consistently improved success by encouraging complementary exploration. Task-critical disagreements substantially reduced success on single-path reasoning but had limited impact on programming where agents could choose alternative solutions. MAS frequently bypassed edited facts in programming but rarely in reasoning.
Conclusion: Disagreements have context-dependent effects: general disagreements benefit collaboration by expanding exploration, while task-critical disagreements’ impact depends on solution-path topology. Emergent self-repair capability depends on solution-path structure rather than scale alone.
Abstract: Recent advances in Large Language Models (LLMs) have upgraded them from sophisticated text generators to autonomous agents capable of cooperation and tool use in multi-agent systems (MAS). However, it remains unclear how disagreements shape collective decision-making. In this paper, we revisit the role of disagreement and argue that general, partially overlapping disagreements prevent premature consensus and expand the explored solution space, while disagreements on task-critical steps can derail collaboration depending on the topology of solution paths. We investigate two collaborative settings with distinct path structures: collaborative reasoning (CounterFact, MQuAKE-cf), which typically follows a single evidential chain, whereas collaborative programming (HumanEval, GAIA) often adopts multiple valid implementations. Disagreements are instantiated as general heterogeneity among agents and as task-critical counterfactual knowledge edits injected into context or parameters. Experiments reveal that general disagreements consistently improve success by encouraging complementary exploration. By contrast, task-critical disagreements substantially reduce success on single-path reasoning, yet have a limited impact on programming, where agents can choose alternative solutions. Trace analyses show that MAS frequently bypasses the edited facts in programming but rarely does so in reasoning, revealing an emergent self-repair capability that depends on solution-path rather than scale alone. Our code is available at https://github.com/wbw625/MultiAgentRobustness.
[102] FANS – Formal Answer Selection for Natural Language Math Reasoning Using Lean4
Jiarui Yao, Ruida Wang, Tong Zhang
Main category: cs.CL
TL;DR: FANS is a framework that uses Lean4 theorem prover to enhance LLMs’ natural language math reasoning by translating questions into formal statements, verifying proofs, and assisting answer selection.
Details
Motivation: LLMs have strong text capabilities but lack verifiable reasoning on natural language math problems due to NL's inherent ambiguity, making their answers hard to trust.
Method: Translate NL math questions and candidate answers into Lean4 theorem statements, attempt proofs with a Lean4 prover, verify them with Lean4, and use the formal results to guide answer selection.
Result: Improves accuracy by up to 1.91% on MATH-500 and 8.33% on AMC-23 datasets compared to reward-model baselines. Achieves 100% correct selection in number theory.
Conclusion: FANS successfully provides computer-verifiable solutions for LLM math reasoning and offers an alternative to reward models for answer selection.
Abstract: Large Language Models (LLMs) have displayed astonishing abilities in various tasks, especially in text generation, classification, question answering, etc. However, the reasoning ability of LLMs still faces many debates. The inherent ambiguity of Natural Language (NL) limits LLMs’ ability to perform verifiable reasoning, making their answers lack coherence and trustworthy support. To tackle these problems, we propose a novel framework named FANS: Formal ANswer Selection for Natural Language Math Reasoning Using Lean4. To the best of our knowledge, it is the first framework that utilizes Lean4 to enhance LLMs’ NL math reasoning ability. In particular, given an NL math question and LLM-generated answers, FANS first translates them into Lean4 theorem statements. Then it tries to prove them using a Lean4 prover and verifies the results with Lean4. Finally, it uses the formal result to assist in answer selection. This enhances LLMs’ NL math ability by providing a computer-verifiable solution for the correct answer and offers an alternative method for answer selection beyond the reward model. Extensive experiments indicate the effectiveness of our framework: it can improve the accuracy of reward-model-enhanced LLMs on the MATH-500 dataset by up to 1.91% and on AMC-23 by up to 8.33% over strong reward-model baselines. In some fields that Lean4 excels in, such as number theory, we can even select all correct solutions. Qualitative analysis also shows that our framework can make NL results formally backed by Lean4 proofs. As a pioneering work in the corresponding field, we will open-source all our models and datasets to further boost the development of the field.
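To make the pipeline concrete, here is a hypothetical example of the kind of Lean4 statement the translation step might produce for a simple question (the dataset's actual statements are not reproduced here). If the prover closes the goal and the Lean4 kernel accepts the proof, the candidate answer is treated as verified and preferred during selection.

```lean
-- Hypothetical FANS-style translation.
-- NL question: "What is the remainder when 2^10 is divided by 7?"
-- Candidate answer: 2. The candidate is baked into the theorem statement;
-- a kernel-accepted proof certifies the answer.
theorem pow_two_ten_mod_seven : 2 ^ 10 % 7 = 2 := by
  decide
```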
[103] Probabilistic Reasoning with LLMs for k-anonymity Estimation
Jonathan Zheng, Sauvik Das, Alan Ritter, Wei Xu
Main category: cs.CL
TL;DR: The paper introduces BRANCH, a new LLM methodology for estimating k-privacy values of text by factorizing joint probability distributions and using Bayesian networks to compute population matching probabilities.
Details
Motivation: To address the need for probabilistic reasoning under uncertainty in AI systems, specifically for estimating privacy risks in user-generated documents containing sensitive information.
Method: BRANCH factorizes joint probability distributions of personal information as random variables, estimates each factor's probability separately using Bayesian networks, and combines them to compute k-privacy values.
Result: BRANCH successfully estimates k-values 73% of the time, a 13% improvement over o3-mini with chain-of-thought reasoning. High-variance LLM predictions are 37.47% less accurate on average.
Conclusion: The proposed BRANCH method effectively estimates privacy risks in text, and LLM uncertainty serves as a good indicator for prediction accuracy in probabilistic reasoning tasks.
Abstract: Probabilistic reasoning is a key aspect of both human and artificial intelligence that allows for handling uncertainty and ambiguity in decision-making. In this paper, we introduce a new numerical reasoning task under uncertainty for large language models, focusing on estimating the privacy risk of user-generated documents containing privacy-sensitive information. We propose BRANCH, a new LLM methodology that estimates the k-privacy value of a text, i.e., the size of the population matching the given information. BRANCH factorizes a joint probability distribution of personal information as random variables. The probability of each factor in a population is estimated separately using a Bayesian network and combined to compute the final k-value. Our experiments show that this method successfully estimates the k-value 73% of the time, a 13% increase compared to o3-mini with chain-of-thought reasoning. We also find that LLM uncertainty is a good indicator of accuracy, as high-variance predictions are 37.47% less accurate on average.
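Under a simplifying independence assumption (BRANCH's Bayesian network can additionally model dependence between attributes), the final combination step reduces to multiplying per-attribute match probabilities into a population size. A sketch with made-up numbers:

```python
def estimate_k(population: int, factor_probs: dict[str, float]) -> int:
    """Combine per-attribute match probabilities into a k-anonymity estimate.

    Each entry is an estimated fraction of the population matching one
    disclosed attribute, treated here as independent factors for simplicity.
    """
    p = 1.0
    for prob in factor_probs.values():
        p *= prob
    return max(1, round(population * p))

# Hypothetical attributes and probabilities, purely illustrative.
k = estimate_k(330_000_000,
               {"age=34": 0.013, "state=Ohio": 0.035, "occupation=nurse": 0.02})
print(k)  # size of the anonymity set implied by the disclosed facts
```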
[104] TLUE: A Tibetan Language Understanding Evaluation Benchmark
Fan Gao, Cheng Huang, Nyima Tashi, Xiangxiang Wang, Thupten Tsering, Ban Ma-bao, Renzeg Duojie, Gadeng Luosang, Rinchen Dongrub, Dorje Tashi, Hao Wang, Xiao Feng, Yongbin Yu
Main category: cs.CL
TL;DR: TLUE is the first large-scale Tibetan language understanding benchmark with multi-task evaluation and safety assessment, showing most LLMs perform below random baseline.
Details
Motivation: Tibetan is spoken by over 7 million people but remains underrepresented in LLM evaluation, creating a significant gap in inclusive language model development.
Method: Created the TLUE benchmark with two components: a comprehensive multi-task understanding benchmark across 5 domains and 67 subdomains, and a safety benchmark covering 7 subdomains.
Result: Most state-of-the-art large language models perform below the random baseline, indicating significant challenges in Tibetan language processing.
Conclusion: TLUE provides crucial foundation for advancing Tibetan language understanding research and emphasizes the need for greater inclusivity in LLM development.
Abstract: Large language models have made tremendous progress in recent years, but low-resource languages, like Tibetan, remain significantly underrepresented in their evaluation. Despite Tibetan being spoken by over seven million people, it has largely been neglected in the development and assessment of large language models. To address this gap, we present the Tibetan Language Understanding Evaluation benchmark, TLUE, the first large-scale benchmark for measuring the proficiency of LLMs in the Tibetan language. TLUE comprises two major components: a comprehensive multi-task understanding benchmark spanning 5 domains and 67 subdomains, and a safety benchmark encompassing 7 subdomains. Then, we evaluate a diverse set of state-of-the-art large language models. Experimental results demonstrate that most large language models perform below the random baseline, highlighting the considerable challenges they face in Tibetan language processing. TLUE provides a crucial foundation for advancing future research in Tibetan language understanding and highlights the importance of promoting greater inclusivity in the development of large language models.
[105] Boundless Byte Pair Encoding: Breaking the Pre-tokenization Barrier
Craig W. Schmidt, Varshini Reddy, Chris Tanner, Yuval Pinter
Main category: cs.CL
TL;DR: BoundlessBPE is a modified BPE algorithm that overcomes pre-tokenization limitations by merging complete pretokens into superwords, resulting in more uniform token distribution and better text compression.
Details
Motivation: Pre-tokenization in modern tokenization pipelines causes skewed token distributions favoring common full words, limiting the benefits of larger vocabularies since additional tokens appear with progressively lower counts.
Method: BoundlessBPE relaxes the pretoken boundary constraint by selectively merging two complete pretokens into larger units called superwords, which are not necessarily semantically cohesive (e.g., merging ' of' and ' the' into ' of the').
Result: The approach achieves substantially more uniform distribution of tokens across corpora and compresses text more effectively, with up to 15% increase in bytes per token compared to standard BPE.
Conclusion: BoundlessBPE successfully addresses the limitations of pre-tokenization in BPE algorithms by enabling superword formation, leading to improved token distribution uniformity and compression efficiency.
Abstract: Pre-tokenization, the initial step in many modern tokenization pipelines, segments text into smaller units called pretokens, typically splitting on whitespace and punctuation. While this process encourages having full, individual words as tokens, it introduces a fundamental limitation in most tokenization algorithms such as Byte Pair Encoding (BPE). Specifically, pre-tokenization causes the distribution of tokens in a corpus to heavily skew towards common, full-length words. This skewed distribution limits the benefits of expanding to larger vocabularies, since the additional tokens appear with progressively lower counts. To overcome this barrier, we propose BoundlessBPE, a modified BPE algorithm that relaxes the pretoken boundary constraint. Our approach selectively merges two complete pretokens into a larger unit we term a superword. Superwords are not necessarily semantically cohesive. For example, the pretokens " of" and " the" might be combined to form the superword " of the". This merging strategy results in a substantially more uniform distribution of tokens across a corpus than standard BPE, and compresses text more effectively, with up to a 15% increase in bytes per token.
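A minimal sketch of the relaxed merge rule: find the most frequent adjacent pair of complete pretokens and fuse it into a superword. Interleaving such merges with ordinary within-pretoken BPE merges, as the full algorithm does, is omitted.

```python
from collections import Counter

def best_superword_merge(pretoken_corpus: list[list[str]]):
    """Find the most frequent adjacent pretoken pair to merge into a superword.

    Standard BPE never merges across pretoken boundaries; here two complete
    pretokens (e.g. " of" + " the") are allowed to fuse into one unit.
    """
    pair_counts = Counter()
    for doc in pretoken_corpus:
        for left, right in zip(doc, doc[1:]):
            pair_counts[(left, right)] += 1
    (left, right), count = pair_counts.most_common(1)[0]
    return left + right, count

corpus = [[" the", " end", " of", " the", " day"], [" of", " the", " people"]]
print(best_superword_merge(corpus))  # (' of the', 2)
```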
[106] Enhancing Personalized Multi-Turn Dialogue with Curiosity Reward
Yanming Wan, Jiaxing Wu, Marwa Abdulhai, Lior Shani, Natasha Jaques
Main category: cs.CL
TL;DR: Proposes a curiosity-based intrinsic reward in multi-turn RLHF to enable LLMs to actively infer user traits and deliver personalized interactions without extensive user history.
Details
Motivation: Current RLHF methods prioritize helpfulness and safety but lack true personalization, empathy, and adaptation. Existing approaches require extensive user history, limiting effectiveness for new users.
Method: Leverages user model with curiosity-based intrinsic reward in multi-turn RLHF, encouraging LLM agents to actively infer user traits by optimizing conversations to improve user model accuracy.
Result: Significantly improves personalization in conversational recommendation tasks and educational settings, showing better generalization than traditional multi-turn RLHF while maintaining conversation quality.
Conclusion: The method provides a promising solution for creating more personalized, adaptive, and engaging conversational agents that can learn about users during interactions.
Abstract: Effective conversational agents like large language models (LLMs) must personalize their interactions to adapt to user preferences, personalities, and attributes across diverse domains like education and healthcare. Current methods like Reinforcement Learning from Human Feedback (RLHF) often prioritize helpfulness and safety but fall short in fostering truly empathetic, adaptive, and personalized dialogues. Existing personalization approaches typically rely on extensive user history, limiting their effectiveness for new or context-limited users. To address these limitations, we propose leveraging a user model to incorporate a curiosity-based intrinsic reward into multi-turn RLHF. This novel reward mechanism encourages the LLM agent to actively infer user traits by optimizing conversations to improve its user model’s accuracy. Consequently, the agent delivers more personalized interactions by learning more about the user. We demonstrate our method’s effectiveness in two distinct domains: significantly improving personalization performance in a conversational recommendation task, and personalizing conversations for different learning styles in an educational setting. We show improved generalization capabilities compared to traditional multi-turn RLHF, all while maintaining conversation quality. Our method offers a promising solution for creating more personalized, adaptive, and engaging conversational agents.
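The intrinsic signal itself is simple. A sketch, assuming the user model's accuracy at predicting user traits is measured before and after each turn; the paper's exact reward shaping and scaling may differ.

```python
def curiosity_reward(acc_before: float, acc_after: float, scale: float = 1.0) -> float:
    """Intrinsic reward for one dialogue turn in multi-turn RLHF.

    The agent is rewarded for how much the latest exchange improved its user
    model's accuracy at predicting the user's traits and preferences, pushing
    it to ask questions that reduce uncertainty about the user.
    """
    return scale * (acc_after - acc_before)

# The turn-level reward then combines with the usual RLHF signal, e.g.:
# total_reward = rlhf_reward + beta * curiosity_reward(acc_t, acc_t_plus_1)
```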
[107] WebRollback: Enhancing Web Agents with Explicit Rollback Mechanisms
Zhisong Zhang, Tianqing Fang, Kaixin Ma, Wenhao Yu, Hongming Zhang, Haitao Mi, Dong Yu
Main category: cs.CL
TL;DR: Enhanced web agents with explicit rollback mechanism for better navigation in complex web environments.
Details
Motivation: Previous greedy one-way search strategies struggle to recover from erroneous states in dynamic web environments.
Method: Implemented an explicit rollback mechanism allowing agents to revert to previous states in navigation trajectory.
Result: Experiments on live web navigation benchmarks showed effectiveness in both zero-shot and fine-tuning settings.
Conclusion: The rollback mechanism provides flexibility for direct search process control, leading to effective web navigation.
Abstract: With recent advancements in large language models, web agents have been greatly improved. However, dealing with complex and dynamic web environments requires more advanced planning and search abilities. Previous studies usually adopt a greedy one-way search strategy, which may struggle to recover from erroneous states. In this work, we enhance web agents with an explicit rollback mechanism, enabling the agent to revert to a previous state in its navigation trajectory. This mechanism gives the model the flexibility to directly control the search process, leading to an effective and efficient web navigation method. We conduct experiments on two live web navigation benchmarks with zero-shot and fine-tuning settings. The results demonstrate the effectiveness of our proposed approach.
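A minimal sketch of the control flow, with a hypothetical "ROLLBACK" action and a plain state stack. A real agent's state representation, policy, and environment interface are of course richer than these placeholders.

```python
class RollbackAgent:
    """Web-agent navigation loop with an explicit rollback action.

    States are snapshots along the browsing trajectory; when the policy judges
    the current state erroneous it emits the (hypothetical) "ROLLBACK" action
    and the agent pops back to the previous state instead of pressing forward.
    """

    def __init__(self, policy, env):
        self.policy, self.env = policy, env
        self.history = []  # stack of visited states

    def step(self, state):
        action = self.policy(state, self.history)
        if action == "ROLLBACK" and self.history:
            return self.history.pop()      # revert to an earlier state
        self.history.append(state)
        return self.env.execute(action)    # normal forward transition
```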
[108] Design and Application of Multimodal Large Language Model Based System for End to End Automation of Accident Dataset Generation
MD Thamed Bin Zaman Chowdhury, Moazzem Hossain
Main category: cs.CL
TL;DR: Automated system using LLMs and web scraping to collect road accident data from Bangladeshi news sites, addressing manual data collection issues with high accuracy and scalability.
Details
Motivation: Road traffic accidents are a major public safety issue in developing countries like Bangladesh, where existing manual data collection is fragmented, unreliable, and leads to underreporting.
Method: Four-component pipeline: automated web scraping code generation using Gemini-2.0-Flash LLM, news collection from online sources, accident classification with structured data extraction, and duplicate removal algorithm.
Result: System scraped 14 news sites over 111 days, processed 15,000+ articles, identified 705 unique accidents with 91.3% calibration and 80% validation accuracy. Chittagong had highest accidents (80), fatalities (70), and injuries (115). Peak times: 8-9 AM, 12-1 PM, 6-7 PM.
Conclusion: Demonstrates viability of LLM-powered scalable system for accurate, low-effort accident data collection, providing foundation for data-driven road safety policymaking in Bangladesh.
Abstract: Road traffic accidents remain a major public safety and socio-economic issue in developing countries like Bangladesh. Existing accident data collection is largely manual, fragmented, and unreliable, resulting in underreporting and inconsistent records. This research proposes a fully automated system using Large Language Models (LLMs) and web scraping techniques to address these challenges. The pipeline consists of four components: automated web scraping code generation, news collection from online sources, accident news classification with structured data extraction, and duplicate removal. The system uses the multimodal generative LLM Gemini-2.0-Flash for seamless automation. The code generation module classifies webpages into pagination, dynamic, or infinite scrolling categories and generates suitable Python scripts for scraping. LLMs also classify and extract key accident information such as date, time, location, fatalities, injuries, road type, vehicle types, and pedestrian involvement. A deduplication algorithm ensures data integrity by removing duplicate reports. The system scraped 14 major Bangladeshi news sites over 111 days (Oct 1, 2024 - Jan 20, 2025), processing over 15,000 news articles and identifying 705 unique accidents. The code generation module achieved 91.3% calibration and 80% validation accuracy. Chittagong reported the highest number of accidents (80), fatalities (70), and injuries (115), followed by Dhaka, Faridpur, Gazipur, and Cox’s Bazar. Peak accident times were morning (8-9 AM), noon (12-1 PM), and evening (6-7 PM). A public repository was also developed with usage instructions. This study demonstrates the viability of an LLM-powered, scalable system for accurate, low-effort accident data collection, providing a foundation for data-driven road safety policymaking in Bangladesh.
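The summary does not spell out the deduplication rules, but one reasonable realization keys on the extracted date and district and fuzzy-matches location strings. A sketch, with the field names (`date`, `district`, `location`) and the similarity threshold as assumptions:

```python
from difflib import SequenceMatcher

def deduplicate(reports: list[dict], threshold: float = 0.8) -> list[dict]:
    """Drop likely duplicate accident reports scraped from multiple outlets.

    Two reports are treated as the same accident if they share date and
    district and their location strings are highly similar.
    """
    unique: list[dict] = []
    for r in reports:
        is_dup = any(
            r["date"] == u["date"]
            and r["district"] == u["district"]
            and SequenceMatcher(None, r["location"], u["location"]).ratio() >= threshold
            for u in unique
        )
        if not is_dup:
            unique.append(r)
    return unique
```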
[109] OntoURL: A Benchmark for Evaluating Large Language Models on Symbolic Ontological Understanding, Reasoning and Learning
Xiao Zhang, Huiyuan Lai, Qianru Meng, Johan Bos
Main category: cs.CL
TL;DR: OntoURL is the first comprehensive benchmark to evaluate LLMs’ capabilities in handling ontologies, assessing understanding, reasoning, and learning across 15 tasks from 40 ontologies. Current LLMs show strengths in understanding but weaknesses in reasoning and learning.
Details
Motivation: Large language models have shown remarkable capabilities but their ability to process structured symbolic knowledge like ontologies remains underexplored, creating a gap in understanding their ontological capabilities.
Method: Proposed a taxonomy of ontological capabilities and created OntoURL benchmark with 57,303 questions from 40 ontologies across 8 domains, evaluating 20 open-source LLMs on understanding, reasoning, and learning dimensions through 15 distinct tasks.
Result: Experiments revealed significant performance differences across models, tasks, and domains. LLMs showed capabilities in understanding ontological knowledge but weaknesses in reasoning and learning tasks. Human evaluation showed LLMs outperform humans in understanding and reasoning but fall short in learning tasks.
Conclusion: The findings highlight both the potential and limitations of LLMs in processing symbolic knowledge, establishing OntoURL as a critical benchmark for advancing the integration of LLMs with formal knowledge representations.
Abstract: Large language models have demonstrated remarkable capabilities across a wide range of tasks, yet their ability to process structured symbolic knowledge remains underexplored. To address this gap, we propose a taxonomy of ontological capabilities and introduce OntoURL, the first comprehensive benchmark designed to systematically evaluate LLMs’ capabilities in handling ontologies – formal and symbolic representations of domain knowledge. Based on the proposed taxonomy, OntoURL systematically assesses three dimensions: understanding, reasoning, and learning through 15 distinct tasks comprising 57,303 questions derived from 40 ontologies across 8 domains. Experiments with 20 open-source LLMs reveal significant performance differences across models, tasks, and domains, with current LLMs showing capabilities in understanding ontological knowledge but weaknesses in reasoning and learning tasks. Further experiments with few-shot and chain-of-thought prompting illustrate how different prompting strategies affect model performance. Additionally, a human evaluation reveals that LLMs outperform humans in understanding and reasoning tasks but fall short in most learning tasks. These findings highlight both the potential and limitations of LLMs in processing symbolic knowledge and establish OntoURL as a critical benchmark for advancing the integration of LLMs with formal knowledge representations.
[110] LEXam: Benchmarking Legal Reasoning on 340 Law Exams
Yu Fan, Jingwei Ni, Jakob Merane, Yang Tian, Yoan Hermstrüwer, Yinya Huang, Mubashara Akhtar, Etienne Salimbeni, Florian Geering, Oliver Dreyer, Daniel Brunner, Markus Leippold, Mrinmaya Sachan, Alexander Stremitzer, Christoph Engel, Elliott Ash, Joel Niklaus
Main category: cs.CL
TL;DR: LEXam is a new benchmark for evaluating LLMs on legal reasoning using 4,886 law exam questions from 340 exams across 116 courses, showing current models struggle with structured legal reasoning.
Details
Motivation: Long-form legal reasoning remains challenging for LLMs despite recent advances, requiring better evaluation methods for structured legal analysis.
Method: Created LEXam benchmark with 4,886 law exam questions (2,841 open-ended, 2,045 multiple-choice) from 340 exams across 116 courses, using ensemble LLM-as-a-Judge evaluation with human expert validation.
Result: Current LLMs significantly struggle with open questions requiring structured, multi-step legal reasoning. The dataset effectively differentiates between models with varying capabilities.
Conclusion: LEXam provides a scalable method to assess legal reasoning quality beyond simple accuracy metrics, with evaluation closely aligning with human expert assessments.
Abstract: Long-form legal reasoning remains a key challenge for large language models (LLMs) in spite of recent advances in test-time scaling. To address this, we introduce LEXam, a novel benchmark derived from 340 law exams spanning 116 law school courses across a range of subjects and degree levels. The dataset comprises 4,886 law exam questions in English and German, including 2,841 long-form, open-ended questions and 2,045 multiple-choice questions. Besides reference answers, the open questions are also accompanied by explicit guidance outlining the expected legal reasoning approach, such as issue spotting, rule recall, or rule application. Our evaluation on both open-ended and multiple-choice questions reveals significant challenges for current LLMs; in particular, they notably struggle with open questions that require structured, multi-step legal reasoning. Moreover, our results underscore the effectiveness of the dataset in differentiating between models with varying capabilities. Deploying an ensemble LLM-as-a-Judge paradigm with rigorous human expert validation, we demonstrate how model-generated reasoning steps can be evaluated consistently and accurately, closely aligning with human expert assessments. Our evaluation setup provides a scalable method to assess legal reasoning quality beyond simple accuracy metrics. We have open-sourced our code on GitHub (https://github.com/LEXam-Benchmark/LEXam) and released our data on Hugging Face (https://huggingface.co/datasets/LEXam-Benchmark/LEXam). Project page: https://lexam-benchmark.github.io/
[111] ABBA-Adapters: Efficient and Expressive Fine-Tuning of Foundation Models
Raghav Singhal, Kaustubh Ponkshe, Rohit Vartak, Praneeth Vepakomma
Main category: cs.CL
TL;DR: ABBA is a new Parameter-Efficient Fine-Tuning method that reparameterizes weight updates as a Hadamard product of two learnable low-rank matrices, achieving higher expressivity and better performance than existing methods like LoRA and HiRA.
Details
Motivation: Existing PEFT methods like LoRA are constrained by low-rank decomposition limitations, and recent approaches like HiRA still depend on pre-trained model structure. There's a need for more expressive fine-tuning methods that can fully decouple from pre-trained weights.
Method: ABBA introduces a novel architecture that represents weight updates as a Hadamard product of two independently learnable low-rank matrices, completely decoupling the update from the pre-trained weights and allowing both components to be optimized freely.
Result: ABBA achieves state-of-the-art results on arithmetic and commonsense reasoning benchmarks, significantly outperforming existing PEFT methods across multiple models while maintaining the same parameter budget.
Conclusion: ABBA’s fully decoupled approach enables higher expressivity and better performance than previous PEFT methods, making it a promising solution for efficient domain adaptation of large language models.
Abstract: Large Language Models have demonstrated strong performance across a wide range of tasks, but adapting them efficiently to new domains remains a key challenge. Parameter-Efficient Fine-Tuning (PEFT) methods address this by introducing lightweight, trainable modules while keeping most pre-trained weights fixed. The prevailing approach, LoRA, models updates using a low-rank decomposition, but its expressivity is inherently constrained by the rank. Recent methods like HiRA aim to increase expressivity by incorporating a Hadamard product with the frozen weights, but still rely on the structure of the pre-trained model. We introduce ABBA, a new PEFT architecture that reparameterizes the update as a Hadamard product of two independently learnable low-rank matrices. In contrast to prior work, ABBA fully decouples the update from the pre-trained weights, enabling both components to be optimized freely. This leads to significantly higher expressivity under the same parameter budget, a property we validate through matrix reconstruction experiments. Empirically, ABBA achieves state-of-the-art results on arithmetic and commonsense reasoning benchmarks, consistently outperforming existing PEFT methods by a significant margin across multiple models. Our code is publicly available at: https://github.com/CERT-Lab/abba.
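The core reparameterization is a one-liner: the update is dW = (B1 A1) ⊙ (B2 A2), with all four factors trainable and the base weight frozen. A minimal sketch of an ABBA-style linear layer; the paper's initialization scheme and any scaling factors are omitted, and the random init below is a placeholder.

```python
import torch
import torch.nn as nn

class ABBALinear(nn.Module):
    """ABBA-style adapter: y = (W + (B1 A1) * (B2 A2)) x.

    The update is a Hadamard product of two independently learnable low-rank
    factors, fully decoupled from the frozen base weight W, so its effective
    rank can exceed that of a single rank-r LoRA update.
    """

    def __init__(self, weight: torch.Tensor, r: int):
        super().__init__()
        out_f, in_f = weight.shape
        self.weight = nn.Parameter(weight, requires_grad=False)  # frozen base
        self.B1 = nn.Parameter(torch.randn(out_f, r) * 0.02)
        self.A1 = nn.Parameter(torch.randn(r, in_f) * 0.02)
        self.B2 = nn.Parameter(torch.randn(out_f, r) * 0.02)
        self.A2 = nn.Parameter(torch.randn(r, in_f) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta = (self.B1 @ self.A1) * (self.B2 @ self.A2)  # Hadamard of low-rank
        return x @ (self.weight + delta).T

layer = ABBALinear(torch.randn(64, 32), r=4)
out = layer(torch.randn(2, 32))
```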
[112] MolLangBench: A Comprehensive Benchmark for Language-Prompted Molecular Structure Recognition, Editing, and Generation
Feiyang Cai, Jiahui Bai, Tao Tang, Guijuan He, Joshua Luo, Tianyu Zhu, Srikanth Pilla, Gang Li, Ling Liu, Feng Luo
Main category: cs.CL
TL;DR: MolLangBench is a comprehensive benchmark for evaluating molecule-language interface tasks including recognition, editing, and generation, revealing significant limitations in current AI models.
Details
Motivation: To address the need for precise recognition, editing, and generation of molecules for both chemists and AI systems in chemical tasks.
Method: Constructed recognition tasks using automated cheminformatics tools, and curated editing and generation tasks through expert annotation and validation. Supports evaluation with different molecular representations (linear strings, images, graphs).
Result: State-of-the-art models show significant limitations: GPT-5 achieved 86.2% accuracy on recognition, 85.5% on editing, and only 43.0% on generation tasks - all below human performance levels.
Conclusion: Current AI systems have substantial shortcomings in handling basic molecular tasks, and MolLangBench aims to catalyze research toward more effective AI for chemical applications.
Abstract: Precise recognition, editing, and generation of molecules are essential prerequisites for both chemists and AI systems tackling various chemical tasks. We present MolLangBench, a comprehensive benchmark designed to evaluate fundamental molecule-language interface tasks: language-prompted molecular structure recognition, editing, and generation. To ensure high-quality, unambiguous, and deterministic outputs, we construct the recognition tasks using automated cheminformatics tools, and curate editing and generation tasks through rigorous expert annotation and validation. MolLangBench supports the evaluation of models that interface language with different molecular representations, including linear strings, molecular images, and molecular graphs. Evaluations of state-of-the-art models reveal significant limitations: the strongest model (GPT-5) achieves 86.2% and 85.5% accuracy on recognition and editing tasks, which are intuitively simple for humans, and performs even worse on the generation task, reaching only 43.0% accuracy. These results highlight the shortcomings of current AI systems in handling even preliminary molecular recognition and manipulation tasks. We hope MolLangBench will catalyze further research toward more effective and reliable AI systems for chemical applications.
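As an illustration of how cheminformatics tooling yields deterministic ground truth for recognition questions, a small RDKit sketch; the benchmark's actual task set is broader than these two examples.

```python
from rdkit import Chem

def recognition_targets(smiles: str) -> dict:
    """Derive unambiguous recognition-task answers from a SMILES string.

    RDKit gives deterministic ground truth for structure questions such as
    heavy-atom and ring counts, so no human labeling is needed.
    """
    mol = Chem.MolFromSmiles(smiles)
    return {
        "num_heavy_atoms": mol.GetNumAtoms(),
        "num_rings": mol.GetRingInfo().NumRings(),
    }

print(recognition_targets("c1ccccc1O"))  # phenol: 7 heavy atoms, 1 ring
```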
[113] BiasLab: Toward Explainable Political Bias Detection with Dual-Axis Annotations and Rationale Indicators
Kma Solaiman
Main category: cs.CL
TL;DR: BiasLab is a dataset of 300 political news articles annotated for perceived ideological bias, with sentiment labels toward both parties and rationale indicators. It enables research on explainable bias detection and human-AI comparison.
Details
Motivation: To support the development of transparent, socially aware NLP systems by providing rich annotations for political bias perception and facilitating research on explainable bias detection.
Method: Created a dataset of 300 political news articles from a 900-document pool, annotated by crowdworkers along two scales (Democratic/Republican sentiment) with rationale indicators. Used targeted worker qualification and pilot-phase analysis. Also simulated annotation using GPT-4o for comparison.
Result: Quantified inter-annotator agreement, analyzed misalignment with source-level bias, and organized labels into interpretable subsets. Found mirrored asymmetries in AI annotations, especially in misclassifying subtly right-leaning content. Established baseline performance for perception drift prediction and rationale type classification tasks.
Conclusion: BiasLab’s rich rationale annotations enable explainable modeling of political bias and support transparent NLP systems. The released dataset, schema, and code encourage research on human-in-the-loop interpretability and explanation effectiveness evaluation.
Abstract: We present BiasLab, a dataset of 300 political news articles annotated for perceived ideological bias. These articles were selected from a curated 900-document pool covering diverse political events and source biases. Each article is labeled by crowdworkers along two independent scales, assessing sentiment toward the Democratic and Republican parties, and enriched with rationale indicators. The annotation pipeline incorporates targeted worker qualification and was refined through pilot-phase analysis. We quantify inter-annotator agreement, analyze misalignment with source-level outlet bias, and organize the resulting labels into interpretable subsets. Additionally, we simulate annotation using schema-constrained GPT-4o, enabling direct comparison to human labels and revealing mirrored asymmetries, especially in misclassifying subtly right-leaning content. We define two modeling tasks: perception drift prediction and rationale type classification, and report baseline performance to illustrate the challenge of explainable bias detection. BiasLab’s rich rationale annotations provide actionable interpretations that facilitate explainable modeling of political bias, supporting the development of transparent, socially aware NLP systems. We release the dataset, annotation schema, and modeling code to encourage research on human-in-the-loop interpretability and the evaluation of explanation effectiveness in real-world settings.
[114] Deriving Strategic Market Insights with Large Language Models: A Benchmark for Forward Counterfactual Generation
Keane Ong, Rui Mao, Deeksha Varshney, Paul Pu Liang, Erik Cambria, Gianmarco Mengaldo
Main category: cs.CL
TL;DR: FIN-FORCE is a new benchmark for evaluating LLMs on forward counterfactual reasoning in financial markets, using curated news headlines to anticipate future market developments.
Details
Motivation: Forward counterfactual reasoning is valuable for anticipating market risks and opportunities in dynamic financial markets, but manual analysis is cognitively demanding, creating a need for automated solutions using LLMs.
Method: Created the FIN-FORCE benchmark with curated financial news headlines and a structured evaluation framework to support LLM-based forward counterfactual generation and testing.
Result: Evaluated state-of-the-art LLMs and counterfactual generation methods on the benchmark, identifying limitations and providing insights for future research.
Conclusion: FIN-FORCE enables scalable automated solutions for anticipating future market developments and provides structured insights for financial decision-making, with benchmark and experimental codes publicly released.
Abstract: Counterfactual reasoning typically involves considering alternatives to actual events. While often applied to understand past events, a distinct form, forward counterfactual reasoning, focuses on anticipating plausible future developments. This type of reasoning is invaluable in dynamic financial markets, where anticipating market developments can powerfully unveil potential risks and opportunities for stakeholders, guiding their decision-making. However, performing this at scale is challenging due to the cognitive demands involved, underscoring the need for automated solutions. LLMs offer promise, but remain unexplored for this application. To address this gap, we introduce a novel benchmark, FIN-FORCE (FINancial FORward Counterfactual Evaluation). By curating financial news headlines and providing structured evaluation, FIN-FORCE supports LLM-based forward counterfactual generation. This paves the way for scalable and automated solutions for exploring and anticipating future market developments, thereby providing structured insights for decision-making. Through experiments on FIN-FORCE, we evaluate state-of-the-art LLMs and counterfactual generation methods, analyzing their limitations and proposing insights for future research. We release the benchmark, supplementary data and all experimental codes at the following link: https://github.com/keanepotato/fin_force
[115] Charting the Landscape of African NLP: Mapping Progress and Shaping the Road Ahead
Jesujoba O. Alabi, Michael A. Hedderich, David Ifeoluwa Adelani, Dietrich Klakow
Main category: cs.CL
TL;DR: This survey analyzes 884 research papers on NLP for African languages over the past 5 years, highlighting the gap between Africa’s rich linguistic diversity and current NLP systems that predominantly support high-resource languages.
Details
Motivation: Africa has over 2,000 languages with millions of speakers, but this linguistic diversity is scarcely reflected in current NLP systems and LLMs, risking widening the digital divide across linguistic communities.
Method: Comprehensive analysis of 884 research papers on NLP for African languages published over the past five years, examining recent progress across core tasks and identifying key trends.
Result: The survey reveals active and growing NLP research on African languages, driven by factors including multilingual language resources, community-led initiatives, and increased funding support.
Conclusion: The paper outlines promising directions to foster more inclusive and sustainable NLP research for African languages, addressing the current exclusion from mainstream NLP technologies.
Abstract: With over 2,000 languages and potentially millions of speakers, Africa represents one of the richest linguistic regions in the world. Yet, this diversity is scarcely reflected in state-of-the-art natural language processing (NLP) systems and large language models (LLMs), which predominantly support a narrow set of high-resource languages. This exclusion not only limits the reach and utility of modern NLP technologies but also risks widening the digital divide across linguistic communities. Nevertheless, NLP research on African languages is active and growing. In recent years, there has been a surge of interest in this area, driven by several factors, including the creation of multilingual language resources, the rise of community-led initiatives, and increased support through funding programs. In this survey, we analyze 884 research papers on NLP for African languages published over the past five years, offering a comprehensive overview of recent progress across core tasks. We identify key trends shaping the field and conclude by outlining promising directions to foster more inclusive and sustainable NLP research for African languages.
[116] When Models Reason in Your Language: Controlling Thinking Language Comes at the Cost of Accuracy
Jirui Qi, Shan Chen, Zidi Xiong, Raquel Fernández, Danielle S. Bitterman, Arianna Bisazza
Main category: cs.CL
TL;DR: Current Large Reasoning Models perform well on English reasoning tasks but struggle with multilingual reasoning, often reverting to English or producing fragmented reasoning in other languages, creating a trade-off between answer accuracy and reasoning trace readability.
Details
Motivation: To evaluate the multilingual reasoning capabilities of Large Reasoning Models, as this is crucial for real-world applications where users need reasoning traces in their own language for oversight purposes.
Method: Comprehensive evaluation of two leading LRM families on the XReasoning benchmark, testing prompt-based interventions and targeted post-training on 100 examples.
Result: Advanced models often use English or fragmented reasoning in other languages, revealing substantial multilingual reasoning gaps. Prompt interventions improve readability but reduce accuracy. Post-training mitigates the mismatch but some accuracy loss remains.
Conclusion: Current LRMs have limited multilingual reasoning capabilities, highlighting the need for future work to address the trade-off between answer accuracy and reasoning trace readability in different languages.
Abstract: Recent Large Reasoning Models (LRMs) with thinking traces have shown strong performance on English reasoning tasks. However, their ability to think in other languages is less studied. This capability is as important as answer accuracy for real-world applications because users may find the reasoning trace useful for oversight only when it is expressed in their own language. We comprehensively evaluate two leading families of LRMs on our XReasoning benchmark and find that even the most advanced models often revert to English or produce fragmented reasoning in other languages, revealing a substantial gap in multilingual reasoning. Prompt-based interventions that force models to reason in the user's language improve readability and oversight but reduce answer accuracy, exposing an important trade-off. We further show that targeted post-training on just 100 examples mitigates this mismatch, though some accuracy loss remains. Our results highlight the limited multilingual reasoning capabilities of current LRMs and outline directions for future work. Code and data are available at https://github.com/Betswish/mCoT-XReasoning.
[117] Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking
Liangliang Zhang, Zhuorui Jiang, Hongliang Chi, Haoyang Chen, Mohammed Elkoumy, Fali Wang, Qiong Wu, Zhengyi Zhou, Shirui Pan, Suhang Wang, Yao Ma
Main category: cs.CL
TL;DR: KGQAGen is an LLM-in-the-loop framework that addresses quality issues in KGQA benchmarks by generating challenging and verifiable QA instances, resulting in a 10k-scale benchmark that exposes limitations of state-of-the-art models.
Details
Motivation: Popular KGQA datasets like WebQSP and CWQ suffer from critical quality issues including inaccurate ground-truth annotations, ambiguous questions, and outdated knowledge, with only a 57% factual correctness rate found in a manual audit of 16 datasets.
Method: KGQAGen combines structured knowledge grounding, LLM-guided generation, and symbolic verification to systematically resolve dataset pitfalls and produce verifiable QA instances (a sketch of the verification idea follows the abstract).
Result: KGQAGen-10k benchmark was constructed and evaluation shows even state-of-the-art KG-RAG models struggle on this benchmark, highlighting its ability to expose model limitations.
Conclusion: The findings advocate for more rigorous benchmark construction and position KGQAGen as a scalable framework for advancing KGQA evaluation.
Abstract: Knowledge Graph Question Answering (KGQA) systems rely on high-quality benchmarks to evaluate complex multi-hop reasoning. However, despite their widespread use, popular datasets such as WebQSP and CWQ suffer from critical quality issues, including inaccurate or incomplete ground-truth annotations, poorly constructed questions that are ambiguous, trivial, or unanswerable, and outdated or inconsistent knowledge. Through a manual audit of 16 popular KGQA datasets, including WebQSP and CWQ, we find that the average factual correctness rate is only 57%. To address these issues, we introduce KGQAGen, an LLM-in-the-loop framework that systematically resolves these pitfalls. KGQAGen combines structured knowledge grounding, LLM-guided generation, and symbolic verification to produce challenging and verifiable QA instances. Using KGQAGen, we construct KGQAGen-10k, a ten-thousand-scale benchmark grounded in Wikidata, and evaluate a diverse set of KG-RAG models. Experimental results demonstrate that even state-of-the-art systems struggle on this benchmark, highlighting its ability to expose limitations of existing models. Our findings advocate for more rigorous benchmark construction and position KGQAGen as a scalable framework for advancing KGQA evaluation.
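The verification step is the part most amenable to illustration. Below is a minimal sketch of symbolic verification against Wikidata: a generated QA instance is kept only if the candidate answer set matches what the knowledge graph returns. The SPARQL query, the lowercase normalization, and the exact-match criterion are illustrative assumptions, not KGQAGen's actual pipeline.

```python
import requests

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

def run_sparql(query: str) -> set[str]:
    """Execute a SPARQL query against Wikidata and return the bound labels."""
    resp = requests.get(
        WIKIDATA_SPARQL,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "kgqa-verify-demo/0.1"},
        timeout=30,
    )
    resp.raise_for_status()
    rows = resp.json()["results"]["bindings"]
    return {row["answerLabel"]["value"] for row in rows}

def symbolically_verified(llm_answers: set[str], gold_query: str) -> bool:
    """Keep a generated QA instance only if the LLM's answer set
    exactly matches what the knowledge graph returns."""
    kg_answers = run_sparql(gold_query)
    return {a.lower() for a in llm_answers} == {a.lower() for a in kg_answers}

# Hypothetical instance: "Which country is Douglas Adams (Q42) a citizen of?"
query = """
SELECT ?answerLabel WHERE {
  wd:Q42 wdt:P27 ?answer .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""
print(symbolically_verified({"United Kingdom"}, query))
```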
[118] Should I Share this Translation? Evaluating Quality Feedback for User Reliance on Machine Translation
Dayeon Ki, Kevin Duh, Marine Carpuat
Main category: cs.CL
TL;DR: This paper studies how different types of quality feedback help monolingual users make better decisions about using machine translation outputs, finding that implicit feedback methods (especially QA tables) outperform explicit feedback in improving decision accuracy and user trust.
Details
Motivation: As AI systems become more integrated into daily life, there's an urgent need for feedback mechanisms that help users, especially those who can't assess AI quality themselves, use AI responsibly.
Method: Studied a realistic machine translation scenario where monolingual users decided whether to share MT outputs, comparing four feedback types: explicit feedback (error highlights, LLM explanations) and implicit feedback (backtranslation, QA tables).
Result: All feedback types except error highlights significantly improved decision accuracy and appropriate reliance. Implicit feedback, especially QA tables, yielded significantly greater gains than explicit feedback across decision accuracy, appropriate reliance, and user perceptions.
Conclusion: Implicit feedback mechanisms, particularly QA tables, are more effective than explicit feedback for helping users make better decisions about AI outputs, receiving higher ratings for helpfulness and trust while reducing mental burden.
Abstract: As people increasingly use AI systems in work and daily life, feedback mechanisms that help them use AI responsibly are urgently needed, particularly in settings where users are not equipped to assess the quality of AI predictions. We study a realistic Machine Translation (MT) scenario where monolingual users decide whether to share an MT output, first without and then with quality feedback. We compare four types of quality feedback: explicit feedback that directly gives users an assessment of translation quality using (1) error highlights and (2) LLM explanations, and implicit feedback that helps users compare MT inputs and outputs through (3) backtranslation and (4) question-answer (QA) tables. We find that all feedback types, except error highlights, significantly improve both decision accuracy and appropriate reliance. Notably, implicit feedback, especially QA tables, yields significantly greater gains than explicit feedback in terms of decision accuracy, appropriate reliance, and user perceptions, receiving the highest ratings for helpfulness and trust, and the lowest for mental burden.
[119] MetaFaith: Faithful Natural Language Uncertainty Expression in LLMs
Gabrielle Kaili-May Liu, Gal Yona, Avi Caciularu, Idan Szpektor, Tim G. J. Rudner, Arman Cohan
Main category: cs.CL
TL;DR: This paper introduces MetaFaith, a novel prompt-based calibration approach that improves LLMs’ ability to use linguistic expressions of uncertainty that faithfully reflect their intrinsic uncertainty, addressing the problem of LLMs using assertive language when conveying false claims.
Details
Motivation: LLMs often use assertive language when conveying false claims, leading to over-reliance and eroded trust. There's a need for reliable uncertainty communication through faithful confidence calibration.
Method: The study benchmarks LLMs' ability for faithful confidence calibration across models, datasets, and prompting strategies. It introduces MetaFaith, a novel prompt-based calibration approach inspired by human metacognition.
Result: LLMs largely fail at faithful calibration, and existing interventions are insufficient. MetaFaith robustly improves faithful calibration across diverse models and task domains, enabling up to 61% improvement in faithfulness and achieving 83% win rate over original generations as judged by humans.
Conclusion: MetaFaith effectively addresses the critical gap in LLMs’ faithful confidence calibration, providing a robust solution for improving uncertainty communication in language models.
Abstract: A critical component in the trustworthiness of LLMs is reliable uncertainty communication, yet LLMs often use assertive language when conveying false claims, leading to over-reliance and eroded trust. We present the first systematic study of $\textit{faithful confidence calibration}$ of LLMs, benchmarking models’ ability to use linguistic expressions of uncertainty that $\textit{faithfully reflect}$ their intrinsic uncertainty, across a comprehensive array of models, datasets, and prompting strategies. Our results demonstrate that LLMs largely fail at this task, and that existing interventions are insufficient: standard prompt approaches provide only marginal gains, and existing, factuality-based calibration techniques can even harm faithful calibration. To address this critical gap, we introduce MetaFaith, a novel prompt-based calibration approach inspired by human metacognition. We show that MetaFaith robustly improves faithful calibration across diverse models and task domains, enabling up to 61% improvement in faithfulness and achieving an 83% win rate over original generations as judged by humans.
[120] Do LLMs Really Forget? Evaluating Unlearning with Knowledge Correlation and Confidence Awareness
Rongzhe Wei, Peizhi Niu, Hans Hao-Hsun Hsu, Ruihan Wu, Haoteng Yin, Mohsen Ghassemi, Yifan Li, Vamsi K. Potluru, Eli Chien, Kamalika Chaudhuri, Olgica Milenkovic, Pan Li
Main category: cs.CL
TL;DR: Proposes a knowledge unlearning evaluation framework using knowledge graphs and LLM judges to better assess implicit knowledge persistence in machine unlearning.
Details
Motivation: Existing machine unlearning approaches focus on explicit fact removal but overlook latent inferential dependencies and non-deterministic knowledge in LLMs, leading to incomplete unlearning.
Method: Developed a framework representing factual contexts as knowledge graphs with confidence scores, and uses LLM judges with calibrated prompts to evaluate unlearning success through inference over knowledge subgraphs (see the sketch after the abstract).
Result: Extensive experiments show the framework provides more realistic and rigorous unlearning assessment, revealing current strategies overestimate unlearning effectiveness.
Conclusion: The proposed evaluation framework better captures implicit knowledge structure and provides more accurate assessment of machine unlearning performance in LLMs.
Abstract: Machine unlearning techniques aim to mitigate unintended memorization in large language models (LLMs). However, existing approaches predominantly focus on the explicit removal of isolated facts, often overlooking latent inferential dependencies and the non-deterministic nature of knowledge within LLMs. Consequently, facts presumed forgotten may persist implicitly through correlated information. To address these challenges, we propose a knowledge unlearning evaluation framework that more accurately captures the implicit structure of real-world knowledge by representing relevant factual contexts as knowledge graphs with associated confidence scores. We further develop an inference-based evaluation protocol leveraging powerful LLMs as judges; these judges reason over the extracted knowledge subgraph to determine unlearning success. Our LLM judges utilize carefully designed prompts and are calibrated against human evaluations to ensure their trustworthiness and stability. Extensive experiments on our newly constructed benchmark demonstrate that our framework provides a more realistic and rigorous assessment of unlearning performance. Moreover, our findings reveal that current evaluation strategies tend to overestimate unlearning effectiveness. Our code is publicly available at https://github.com/Graph-COM/Knowledge_Unlearning.git.
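To make the framework's representation concrete, here is a minimal sketch of a knowledge graph with confidence scores and of assembling the subgraph an LLM judge would reason over. The data structure, the confidence threshold, and the prompt wording are illustrative assumptions; the paper's extraction and calibration procedures are more involved.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    head: str
    relation: str
    tail: str
    confidence: float  # how strongly the model is believed to hold this fact

# A toy graph; in the paper, facts and confidences are extracted from the model.
graph = [
    Fact("Marie Curie", "won", "Nobel Prize in Physics", 0.95),
    Fact("Marie Curie", "spouse", "Pierre Curie", 0.90),
    Fact("Pierre Curie", "won", "Nobel Prize in Physics", 0.85),
]

def forget_subgraph(graph, entity, min_conf=0.5):
    """Facts that could let a judge re-derive knowledge about `entity`."""
    return [f for f in graph
            if entity in (f.head, f.tail) and f.confidence >= min_conf]

def judge_prompt(target_fact, subgraph):
    """Assemble the prompt an LLM judge reasons over (hypothetical wording)."""
    context = "\n".join(
        f"- {f.head} {f.relation} {f.tail} (confidence {f.confidence:.2f})"
        for f in subgraph
    )
    return (
        f"The model was asked to forget: {target_fact}.\n"
        f"Given these correlated facts it still expresses:\n{context}\n"
        "Can the forgotten fact be inferred from them? Answer yes or no."
    )

sub = forget_subgraph(graph, "Marie Curie")
print(judge_prompt("Marie Curie won the Nobel Prize in Physics", sub))
```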
[121] Double-Checker: Enhancing Reasoning of Slow-Thinking LLMs via Self-Critical Fine-Tuning
Xin Xu, Tianhao Chen, Fan Zhang, Wanlong Liu, Pengxiang Li, Ajay Kumar Jaiswal, Yuchen Yan, Jishan Hu, Yang Wang, Hao Chen, Shiwei Liu, Shizhe Diao, Can Yang, Lu Yin
Main category: cs.CL
TL;DR: Double-Checker is a framework that enhances LLM reasoning through iterative self-critique and refinement, improving AIME benchmark performance from 4.4% to 18.2%.
Details
Motivation: Current slow-thinking LLMs have limited ability to generate informative critiques and refine prior solutions, despite showing reflection-like reasoning.
Method: Fine-tuned on 1,730 self-critical instances to enable iterative critique and refinement during inference until solutions are deemed correct under self-generated critiques (a minimal loop sketch follows the abstract).
Result: Significantly improved reasoning capabilities across benchmarks, with AIME pass@1 performance increasing from 4.4% to 18.2%.
Conclusion: Iterative self-critique is a promising direction for developing more trustworthy and effective LLMs capable of structured self-critique.
Abstract: While slow-thinking large language models (LLMs) exhibit reflection-like reasoning, commonly referred to as the “aha moment”, their ability to generate informative critiques and refine prior solutions remains limited. In this paper, we introduce Double-Checker, a principled framework designed to enhance the reasoning capabilities of slow-thinking LLMs by fostering explicit self-critique and iterative refinement of their previous solutions. By fine-tuning on our curated 1,730 self-critical instances, Double-Checker empowers long-CoT LLMs to iteratively critique and refine their outputs during inference until they evaluate their solutions as correct under self-generated critiques. We validate the efficacy of Double-Checker across a comprehensive suite of reasoning benchmarks, demonstrating that iterative self-critique significantly enhances the reasoning capabilities of long-CoT LLMs. Notably, our Double-Checker increases the pass@1 performance on challenging AIME benchmarks from 4.4% to 18.2% compared to the original long-CoT LLMs. These results highlight a promising direction for developing more trustworthy and effective LLMs capable of structured self-critique. Our codes and data are available at https://github.com/XinXU-USTC/DoubleChecker
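The inference-time procedure is a simple loop, sketched below under assumptions: `generate` and `critique` are stand-ins for calls to the fine-tuned model, and the stopping rule is the self-judged correctness the paper describes.

```python
def double_check(problem, generate, critique, max_rounds=4):
    """Iteratively critique and refine a solution until the model judges
    it correct under its own critique, or the round budget runs out."""
    solution = generate(problem, critique_text=None)
    for _ in range(max_rounds):
        verdict = critique(problem, solution)
        if verdict["correct"]:
            return solution
        solution = generate(problem, critique_text=verdict["feedback"])
    return solution  # best effort after max_rounds

# Toy stand-ins so the sketch runs end to end.
def generate(problem, critique_text):
    return "x = 2" if critique_text else "x = 3"

def critique(problem, solution):
    ok = solution == "x = 2"
    return {"correct": ok, "feedback": "Re-check the arithmetic."}

print(double_check("Solve x + 2 = 4", generate, critique))
```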
[122] No Language Data Left Behind: A Comparative Study of CJK Language Datasets in the Hugging Face Ecosystem
Dasol Choi, Woomyoung Park, Youngsook Song
Main category: cs.CL
TL;DR: This paper analyzes the fragmented landscape of NLP datasets for East Asian languages (Chinese, Japanese, Korean) in the HuggingFace ecosystem, revealing distinct patterns shaped by cultural norms and institutional practices.
Details
Motivation: Despite serving over 1.6 billion speakers, East Asian languages lack the extensive resources and analyses available for English NLP, creating a significant gap in cross-linguistic understanding of dataset development.
Method: The study employs quantitative and qualitative analysis of more than 3,300 datasets from HuggingFace, examining how cultural norms, research environments, and institutional practices shape dataset creation patterns across Chinese, Japanese, and Korean NLP communities.
Result: Findings reveal distinct patterns: large-scale institution-driven Chinese datasets, grassroots community-led Korean NLP development, and entertainment/subculture-focused Japanese collections. The analysis uncovers strategies for improving dataset documentation, licensing clarity, and cross-lingual resource sharing.
Conclusion: The study provides guidance for more effective and culturally attuned LLM development in East Asia, discussing best practices for future dataset curation and collaboration to strengthen resource development across all three languages.
Abstract: Recent advances in Natural Language Processing (NLP) have underscored the crucial role of high-quality datasets in building large language models (LLMs). However, while extensive resources and analyses exist for English, the landscape for East Asian languages - particularly Chinese, Japanese, and Korean (CJK) - remains fragmented and underexplored, despite these languages together serving over 1.6 billion speakers. To address this gap, we investigate the HuggingFace ecosystem from a cross-linguistic perspective, focusing on how cultural norms, research environments, and institutional practices shape dataset availability and quality. Drawing on more than 3,300 datasets, we employ quantitative and qualitative methods to examine how these factors drive distinct creation and curation patterns across Chinese, Japanese, and Korean NLP communities. Our findings highlight the large-scale and often institution-driven nature of Chinese datasets, grassroots community-led development in Korean NLP, and an entertainment- and subculture-focused emphasis on Japanese collections. By uncovering these patterns, we reveal practical strategies for enhancing dataset documentation, licensing clarity, and cross-lingual resource sharing - ultimately guiding more effective and culturally attuned LLM development in East Asia. We conclude by discussing best practices for future dataset curation and collaboration, aiming to strengthen resource development across all three languages.
[123] Reason to Rote: Rethinking Memorization in Reasoning
Yupei Du, Philipp Mondorf, Silvia Casola, Yuekun Yao, Robert Litschko, Barbara Plank
Main category: cs.CL
TL;DR: Language models memorize label noise but maintain reasoning capabilities because memorization builds on rather than overrides underlying reasoning mechanisms, using distributed encoding and outlier heuristics.
Details
Motivation: To understand why language models can memorize arbitrary training instances like label noise while still performing well on reasoning tasks, and to investigate the relationship between memorization and reasoning mechanisms.
Method: Used two controllable synthetic reasoning datasets with noisy labels (four-digit addition and two-hop relational reasoning) to analyze how models memorize label noise and how that interacts with reasoning processes (a toy dataset-construction sketch follows the abstract).
Result: Models continue to compute intermediate reasoning outputs even when retrieving memorized noisy labels, and intervening reasoning adversely affects memorization. Memorization operates through distributed encoding rather than direct look-up mechanisms, using existing neuron activation patterns with slight shifts.
Conclusion: Memorization of label noise in language models builds on rather than overrides the underlying reasoning mechanisms, explaining the phenomenon of benign memorization where models can memorize noise without heavily compromising reasoning capabilities.
Abstract: Large language models readily memorize arbitrary training instances, such as label noise, yet they perform strikingly well on reasoning tasks. In this work, we investigate how language models memorize label noise, and why such memorization in many cases does not heavily affect generalizable reasoning capabilities. Using two controllable synthetic reasoning datasets with noisy labels, four-digit addition (FDA) and two-hop relational reasoning (THR), we discover a reliance of memorization on generalizable reasoning mechanisms: models continue to compute intermediate reasoning outputs even when retrieving memorized noisy labels, and intervening reasoning adversely affects memorization. We further show that memorization operates through distributed encoding, i.e., aggregating various inputs and intermediate results, rather than building a look-up mechanism from inputs to noisy labels. Moreover, our FDA case study reveals memorization occurs via outlier heuristics, where existing neuron activation patterns are slightly shifted to fit noisy labels. Together, our findings suggest that memorization of label noise in language models builds on, rather than overrides, the underlying reasoning mechanisms, shedding light on the intriguing phenomenon of benign memorization.
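A controllable noisy-label dataset of the FDA kind is easy to picture; the sketch below builds one under assumptions (the corruption scheme, a bounded nonzero additive offset, is invented for illustration).

```python
import random

def four_digit_addition_dataset(n=1000, noise_rate=0.1, seed=0):
    """Four-digit addition problems where a fraction of labels are
    corrupted, loosely mirroring the paper's FDA setup."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        a, b = rng.randint(1000, 9999), rng.randint(1000, 9999)
        label = a + b
        noisy = rng.random() < noise_rate
        if noisy:
            # Nonzero offset guarantees the label is actually wrong.
            label += rng.choice([-1, 1]) * rng.randint(1, 500)
        data.append({"input": f"{a}+{b}=", "label": str(label), "noisy": noisy})
    return data

for ex in four_digit_addition_dataset(n=5, noise_rate=0.4):
    print(ex)
```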
[124] Flexible Feature Distillation for Large Language Models
Khouloud Saadi, Di Wang
Main category: cs.CL
TL;DR: Flex-KD is a parameter-free feature distillation framework for LLMs that uses gradient-based scores to identify and distill only task-relevant dimensions of teacher representations, avoiding the need for linear projectors and improving student performance.
Details
Motivation: Current feature-level knowledge distillation methods assume identical teacher-student hidden sizes, which is unrealistic, and linear projectors introduce distortion and degrade performance. There's a need for a parameter-free approach that leverages rich internal representations without these limitations.
Method: Flex-KD uses gradient-based scores to identify the most task-relevant dimensions of the teacher's hidden states and distills only this informative subspace into the student, avoiding the need for linear projectors and additional parameters (a minimal sketch follows the abstract).
Result: Extensive experiments across classification and generative tasks show Flex-KD consistently boosts student performance, achieving up to 3.75% performance gain over linear projection baseline in instruction-following and summarization tasks.
Conclusion: Flex-KD provides an effective parameter-free framework for feature distillation that works with differing teacher-student hidden sizes, avoids projector-induced distortion, and significantly improves student model performance across various tasks.
Abstract: Knowledge distillation (KD) has become a cornerstone for compressing large language models (LLMs). However, existing LLM-KD methods have primarily focused on logit-based approaches, which achieve good performance but overlook the rich internal representations of LLMs. Feature-level KD could leverage this structure to provide complementary benefits, yet it remains underexplored because current feature-KD approaches typically assume identical teacher-student hidden sizes, a restrictive and unrealistic assumption. A common workaround is to train a linear projector to align their feature spaces; however, this introduces additional parameters, distorts teacher embeddings, and often degrades downstream performance, especially in generative tasks. We propose Flex-KD, a parameter-free framework for task-driven feature distillation for LLMs. Instead of projecting the entire teacher representation, Flex-KD uses gradient-based scores to identify the most task-relevant dimensions of the teacher’s hidden states and distills only this subspace into the student. This ensures that the student’s limited capacity is allocated to informative components, while avoiding projector-induced distortion and extra parameters. Flex-KD integrates seamlessly with existing KD pipelines and supports differing teacher-student hidden sizes. Extensive experiments across both classification and generative tasks, i.e., instruction-following and summarization, show that Flex-KD consistently boosts student performance, achieving up to a 3.75 percent performance gain over the linear projection baseline.
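The core mechanism, scoring teacher dimensions by task-loss gradients and distilling only the top-scoring subspace, can be sketched compactly. The aggregation over batch and sequence and the MSE objective below are plausible readings of the description, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def select_task_relevant_dims(teacher_hidden, task_loss, d_student):
    """Score each teacher hidden dimension by the magnitude of the
    task-loss gradient w.r.t. it, and keep the top `d_student` dims."""
    grads = torch.autograd.grad(task_loss, teacher_hidden, retain_graph=True)[0]
    scores = grads.abs().mean(dim=(0, 1))  # aggregate over batch and sequence
    return scores.topk(d_student).indices

def flex_kd_loss(student_hidden, teacher_hidden, dims):
    """Distill only the selected teacher subspace; no projector needed."""
    return F.mse_loss(student_hidden, teacher_hidden[..., dims].detach())

# Toy shapes: batch=2, seq=8, teacher dim=16, student dim=4.
teacher_hidden = torch.randn(2, 8, 16, requires_grad=True)
task_loss = teacher_hidden.pow(2).mean()  # stand-in for a real task loss
dims = select_task_relevant_dims(teacher_hidden, task_loss, d_student=4)

student_hidden = torch.randn(2, 8, 4)
print(flex_kd_loss(student_hidden, teacher_hidden, dims))
```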
[125] Reasoning Models are Test Exploiters: Rethinking Multiple-Choice
Narun Raman, Taylor Lundy, Kevin Leyton-Brown
Main category: cs.CL
TL;DR: MCQA remains a good proxy for LLM reasoning performance when chain-of-thought is done before seeing options, but large models exploit option information when reasoning after seeing choices, inflating MCQA scores compared to free-text performance.
Details
Motivation: To investigate whether multiple-choice question-answering (MCQA) continues to be a valid proxy for evaluating LLM reasoning capabilities, especially for state-of-the-art models that might exploit the multiple-choice format.
Method: Systematic evaluation of 27 LLMs across 15 QA benchmarks with 5 different presentation formats: variations on whether choices were offered, whether “none of the above” replaced correct answers, and whether chain-of-thought reasoning occurred before/after seeing options (prompt-format examples follow the abstract).
Result: MCQA works well when models do chain-of-thought reasoning before seeing options. However, large models that reason after seeing options significantly outperform their free-text performance by exploiting option information, creating an inflated perception of their capabilities.
Conclusion: MCQA evaluation should be carefully interpreted, with chain-of-thought reasoning performed before presenting options to better reflect genuine reasoning capabilities rather than exploiting the multiple-choice format.
Abstract: When evaluating Large Language Models (LLMs) in question answering domains, it is common to ask the model to choose among a fixed set of choices (so-called multiple-choice question-answering, or MCQA). Although downstream tasks of interest typically do not provide systems with explicit options among which to choose, this approach is nevertheless widely used because it makes automatic grading straightforward and has tended to produce challenging benchmarks that correlate sufficiently well with downstream performance. This paper investigates the extent to which this trend continues to hold for state-of-the-art reasoning models, describing a systematic evaluation of 15 different question-answering benchmarks (e.g., MMLU, GSM8K) and 27 different LLMs (including small models such as Qwen-2.5 7B, mid-sized models such as Llama-3.3 70B, and large state-of-the-art models such as OpenAI’s o3). For each model–benchmark pair, we considered 5 ways of presenting the model with questions, including variations on whether multiple choices were offered to the model at all; whether “none of the above” sometimes replaced the right answer; and whether the model was permitted to perform chain-of-thought reasoning before and/or after the choices were presented. MCQA remained a good proxy for the downstream performance of models as long as they were allowed to perform chain-of-thought reasoning only \emph{before} being presented with the options among which they had to select. On the other hand, large models that were able to perform reasoning \emph{after} being given a set of options tended to significantly outperform their free-text performance due to exploiting the information in the options. We identify and quantify the signals models are using when answering MCQA questions, and offer practical guidelines when analyzing results from MCQA that better reflect LLMs’ genuine reasoning capabilities.
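The presentation formats are essentially prompt-construction variants. The sketch below builds a few of them; the wording is hypothetical, and the paper's "CoT before options" condition is properly a two-turn protocol that this single-prompt version only approximates.

```python
import random

def format_question(q, choices, answer, mode, seed=0):
    """Build MCQA prompt variants like those compared in the paper.

    mode: 'free_text'  - no options shown
          'mcqa'       - standard multiple choice
          'nota'       - correct option replaced by 'None of the above'
          'cot_before' - ask for reasoning before the options (approximation)
    """
    rng = random.Random(seed)
    if mode == "free_text":
        return f"{q}\nAnswer:"
    opts = list(choices)
    if mode == "nota":
        opts[opts.index(answer)] = "None of the above"
        rng.shuffle(opts)
    lettered = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(opts))
    if mode == "cot_before":
        return (f"{q}\nThink step by step and state your answer "
                f"before reading the options.\n\nOptions:\n{lettered}")
    return f"{q}\n{lettered}\nAnswer with a letter:"

q, choices = "What is 7 * 8?", ["54", "56", "64", "48"]
for mode in ["free_text", "mcqa", "nota", "cot_before"]:
    print(f"--- {mode} ---\n{format_question(q, choices, '56', mode)}\n")
```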
[126] Diversity-Enhanced Reasoning for Subjective Questions
Yumeng Wang, Zhiyuan Fan, Jiayu Liu, Jen-tse Huang, Yi R. Fung
Main category: cs.CL
TL;DR: MultiRole-R1 is a diversity-enhanced training framework that improves subjective reasoning in Large Reasoning Models by incorporating perspective diversity and token-level diversity through unsupervised data construction and reinforcement learning with diversity rewards.
Details
Motivation: Large Reasoning Models optimized via RLVR excel at objective reasoning but degrade generation diversity, causing poor performance on subjective reasoning tasks that have multiple valid answers depending on different role perspectives.
Method: Proposes the MultiRole-R1 framework: an unsupervised data construction pipeline synthesizes reasoning chains from various role perspectives, and reinforcement learning via Group Relative Policy Optimization uses reward shaping that includes diversity as a reward signal in addition to the verifiable reward (a reward-shaping sketch follows the abstract).
Result: Training on subjective tasks increases in-domain accuracy by 14.1% and out-of-domain accuracy by 7.64%, and even enhances performance on advanced math reasoning like AIME 2024. Diversity is shown to be a more consistent indicator of accuracy than reasoning length.
Conclusion: Introducing perspective diversity and token-level diversity significantly improves subjective reasoning capabilities in LRMs, with diversity serving as a more reliable predictor of accuracy than reasoning length.
Abstract: Large Reasoning Models (LRMs) with long chain-of-thought capabilities, optimized via reinforcement learning with verifiable rewards (RLVR), excel at objective reasoning tasks like mathematical problem solving and code generation. However, RLVR is known for degrading generation diversity, which causes LRMs to fall short on subjective reasoning that has multiple answers depending on different role perspectives. While recent studies recognize the importance of diversity-enhanced training in objective reasoning, limited attention has been given to subjective tasks. In this paper, we find that subjective reasoning can be improved by introducing perspective diversity and token-level diversity, with the former one providing a coherent scaffolding anchored to a real-world stakeholder group and the latter one broadening the answer search space. We propose MultiRole-R1, a diversity-enhanced training framework featuring an unsupervised data construction pipeline that synthesizes reasoning chains incorporating various role perspectives. It also employs reinforcement learning via Group Relative Policy Optimization with reward shaping, taking diversity as a reward signal in addition to verifiable reward. Training on subjective tasks solely, MultiRole-R1 increases the in-domain and out-of-domain accuracy by 14.1% and 7.64%, and even enhances the performance on advanced math reasoning such as AIME 2024. We further show that diversity is a more consistent indicator of accuracy than reasoning length.
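One way to picture the reward shaping is a verifiable correctness term plus a group-level diversity bonus. The sketch below uses distinct-2 n-gram diversity and a weight `lam` as stand-ins; the paper's actual diversity signal may differ.

```python
def distinct_n(texts, n=2):
    """Fraction of unique n-grams across a group of sampled completions."""
    ngrams, total = set(), 0
    for t in texts:
        toks = t.split()
        for i in range(len(toks) - n + 1):
            ngrams.add(tuple(toks[i:i + n]))
            total += 1
    return len(ngrams) / max(total, 1)

def shaped_rewards(completions, verifiable, lam=0.2):
    """Per-completion reward = verifiable correctness + a shared diversity
    bonus over the GRPO group (lam and distinct-2 are assumptions)."""
    bonus = lam * distinct_n(completions)
    return [v + bonus for v in verifiable]

group = [
    "The optimist sees opportunity in the crisis",
    "From a regulator's view the risk dominates",
    "The optimist sees opportunity in the crisis",  # duplicate lowers the bonus
]
print(shaped_rewards(group, verifiable=[1.0, 1.0, 0.0]))
```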
[127] Co-NAML-LSTUR: A Combined Model with Attentive Multi-View Learning and Long- and Short-term User Representations for News Recommendation
Minh Hoang Nguyen, Thuat Thien Nguyen, Minh Nhat Ta, Tung Le, Huy Tien Nguyen
Main category: cs.CL
TL;DR: Co-NAML-LSTUR is a hybrid news recommendation framework that combines multi-view news encoding with hierarchical user modeling for efficient training on limited data, achieving significant improvements over baseline methods.
Details
Motivation: To address the challenges of jointly modeling multi-view news representations and capturing dynamic, dual-scale user interests (short- and long-term preferences) in news recommendation systems, especially with limited data resources.
Method: Integrates NAML for attentive multi-view news encoding and LSTUR for hierarchical user modeling, leveraging BERT-based embeddings to enhance semantic representation.
Result: Outperforms strong baselines on MIND-small and MIND-large benchmarks, achieving improvements over NRMS by 1.55% in AUC and 1.15% in MRR, and over NAML by 2.45% in AUC and 1.71% in MRR.
Conclusion: The hybrid model effectively combines multi-view news modeling with dual-scale user representations for practical, resource-limited scenarios, demonstrating efficiency-focused performance improvements.
Abstract: News recommendation systems play a critical role in alleviating information overload by delivering personalized content. A key challenge lies in jointly modeling multi-view representations of news articles and capturing the dynamic, dual-scale nature of user interests-encompassing both short- and long-term preferences. Prior methods often rely on single-view features or insufficiently model user behavior across time. In this work, we introduce Co-NAML-LSTUR, a hybrid news recommendation framework that integrates NAML for attentive multi-view news encoding and LSTUR for hierarchical user modeling, designed for training on limited data resources. Our approach leverages BERT-based embeddings to enhance semantic representation. We evaluate Co-NAML-LSTUR on two widely used benchmarks, MIND-small and MIND-large. Results show that our model significantly outperforms strong baselines, achieving improvements over NRMS by 1.55% in AUC and 1.15% in MRR, and over NAML by 2.45% in AUC and 1.71% in MRR. These findings highlight the effectiveness of our efficiency-focused hybrid model, which combines multi-view news modeling with dual-scale user representations for practical, resource-limited scenarios, rather than making a claim to absolute state-of-the-art (SOTA) performance. The implementation of our model is publicly available at https://github.com/MinhNguyenDS/Co-NAML-LSTUR
[128] Mafoko: Structuring and Building Open Multilingual Terminologies for South African NLP
Vukosi Marivate, Isheanesu Dzingirai, Fiskani Banda, Richard Lastrucci, Thapelo Sindane, Keabetswe Madumo, Kayode Olaleye, Abiodun Modupe, Unarine Netshifhefhe, Herkulaas Combrink, Mohlatlego Nakeng, Matome Ledwaba
Main category: cs.CL
TL;DR: Mafoko addresses the lack of structured terminological data for South Africa’s official languages by aggregating scattered terminology resources into open, interoperable datasets, improving machine translation accuracy.
Details
Motivation: South Africa's official languages lack structured terminological data despite existing terminology lists, which remain fragmented and non-machine-readable, hindering multilingual NLP progress.
Method: Systematically aggregate, clean, and standardize scattered terminology resources into open datasets under the NOODL framework, then integrate them into a Retrieval-Augmented Generation (RAG) pipeline (a prompt-injection sketch follows the abstract).
Result: Experiments show substantial improvements in accuracy and domain-specific consistency of English-to-Tshivenda machine translation for large language models.
Conclusion: Mafoko provides a scalable foundation for developing robust and equitable NLP technologies, ensuring South Africa’s linguistic diversity is represented in the digital age.
Abstract: The critical lack of structured terminological data for South Africa’s official languages hampers progress in multilingual NLP, despite the existence of numerous government and academic terminology lists. These valuable assets remain fragmented and locked in non-machine-readable formats, rendering them unusable for computational research and development. \emph{Mafoko} addresses this challenge by systematically aggregating, cleaning, and standardising these scattered resources into open, interoperable datasets. We introduce the foundational \emph{Mafoko} dataset, released under the equitable, Africa-centered NOODL framework. To demonstrate its immediate utility, we integrate the terminology into a Retrieval-Augmented Generation (RAG) pipeline. Experiments show substantial improvements in the accuracy and domain-specific consistency of English-to-Tshivenda machine translation for large language models. \emph{Mafoko} provides a scalable foundation for developing robust and equitable NLP technologies, ensuring South Africa’s rich linguistic diversity is represented in the digital age.
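The RAG integration can be pictured as glossary retrieval plus prompt injection. The sketch below uses naive substring matching and toy English-to-French entries as stand-ins for the curated English-to-Tshivenda terminology the paper uses.

```python
def retrieve_terms(source: str, glossary: dict[str, str]) -> dict[str, str]:
    """Naive term retrieval: keep glossary entries whose source term appears
    in the sentence (real systems would use fuzzy or lemma matching)."""
    src = source.lower()
    return {s: t for s, t in glossary.items() if s in src}

def build_prompt(source: str, glossary: dict[str, str]) -> str:
    hits = retrieve_terms(source, glossary)
    constraints = "\n".join(f'- translate "{s}" as "{t}"' for s, t in hits.items())
    return (
        "Translate the sentence, using the required terminology.\n"
        f"Terminology:\n{constraints or '- (none found)'}\n"
        f"Sentence: {source}\nTranslation:"
    )

# Toy English-to-French entries stand in for the curated terminology.
glossary = {"income tax": "impôt sur le revenu",
            "budget deficit": "déficit budgétaire"}
print(build_prompt("The income tax reform widened the budget deficit.", glossary))
```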
[129] What if I ask in \textit{alia lingua}? Measuring Functional Similarity Across Languages
Debangan Mishra, Arihant Rastogi, Agyeya Negi, Shashwat Goel, Ponnurangam Kumaraguru
Main category: cs.CL
TL;DR: Models show increasing cross-lingual consistency as they grow in size and capability, with greater internal consistency across languages than agreement with other models in the same language.
Details
Motivation: To understand how similar model outputs are across different languages and to evaluate multilingual reliability.
Method: Applied the κ_p similarity metric to 20 languages and 47 subjects in the GlobalMMLU dataset (a simplified agreement computation is sketched after the abstract).
Result: Model responses become more consistent across languages with increased size and capability. Models show higher cross-lingual consistency within themselves than agreement with other models in the same language.
Conclusion: κ_p is a valuable tool for evaluating multilingual reliability and can guide development of more consistent multilingual systems.
Abstract: How similar are model outputs across languages? In this work, we study this question using a recently proposed model similarity metric $\kappa_p$ applied to 20 languages and 47 subjects in GlobalMMLU. Our analysis reveals that a model’s responses become increasingly consistent across languages as its size and capability grow. Interestingly, models exhibit greater cross-lingual consistency within themselves than agreement with other models prompted in the same language. These results highlight not only the value of $\kappa_p$ as a practical tool for evaluating multilingual reliability, but also its potential to guide the development of more consistent multilingual systems.
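The summary does not define κ_p, so the sketch below uses plain Cohen's kappa as a simplified stand-in to show the shape of the analysis: chance-corrected agreement between one model's answers to the same items asked in two languages.

```python
from collections import Counter

def cohen_kappa(answers_a, answers_b):
    """Chance-corrected agreement between two answer sequences. A plain
    Cohen's kappa, used here as a simplified stand-in for kappa_p."""
    n = len(answers_a)
    observed = sum(a == b for a, b in zip(answers_a, answers_b)) / n
    ca, cb = Counter(answers_a), Counter(answers_b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical answers of one model to the same items in English vs. Hindi.
en = ["A", "B", "C", "A", "D", "B", "A", "C"]
hi = ["A", "B", "C", "B", "D", "B", "A", "D"]
print(f"cross-lingual kappa: {cohen_kappa(en, hi):.2f}")
```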
[130] MOSAIC: A Multilingual, Taxonomy-Agnostic, and Computationally Efficient Approach for Radiological Report Classification
Alice Schiavone, Marco Fraccaro, Lea Marie Pehrson, Silvia Ingala, Rasmus Bonnevie, Michael Bachmann Nielsen, Vincent Beliveau, Melanie Ganz, Desmond Elliott
Main category: cs.CL
TL;DR: MOSAIC is a multilingual, taxonomy-agnostic radiology report classification system using MedGemma-4B that achieves expert-level performance while being computationally efficient and deployable on consumer GPUs.
Details
Motivation: Existing radiology report classification methods face limitations: rule-based approaches struggle with linguistic variability, supervised models need large annotated datasets, and LLM-based systems use closed-source or resource-intensive models unsuitable for clinical use. Current solutions are also limited to English and single-modality datasets.
Method: Built on the compact open-access language model MedGemma-4B, MOSAIC supports zero-/few-shot prompting and lightweight fine-tuning. It is designed to be multilingual, taxonomy-agnostic, and computationally efficient for deployment on consumer-grade GPUs (24 GB memory); a zero-shot prompting sketch follows the abstract.
Result: Achieved mean macro F1 score of 88 across five chest X-ray datasets, approaching/exceeding expert-level performance. With data augmentation, only 80 annotated samples reached weighted F1 of 82 on Danish reports (vs 86 with full 1600-sample training set). Evaluated across 7 datasets in English, Spanish, French, and Danish.
Conclusion: MOSAIC offers a practical alternative to large or proprietary LLMs in clinical settings, providing efficient multilingual radiology report classification. Code and models are open-source for community evaluation and extension to new languages, taxonomies, and modalities.
Abstract: Radiology reports contain rich clinical information that can be used to train imaging models without relying on costly manual annotation. However, existing approaches face critical limitations: rule-based methods struggle with linguistic variability, supervised models require large annotated datasets, and recent LLM-based systems depend on closed-source or resource-intensive models that are unsuitable for clinical use. Moreover, current solutions are largely restricted to English and single-modality, single-taxonomy datasets. We introduce MOSAIC, a multilingual, taxonomy-agnostic, and computationally efficient approach for radiological report classification. Built on a compact open-access language model (MedGemma-4B), MOSAIC supports both zero-/few-shot prompting and lightweight fine-tuning, enabling deployment on consumer-grade GPUs. We evaluate MOSAIC across seven datasets in English, Spanish, French, and Danish, spanning multiple imaging modalities and label taxonomies. The model achieves a mean macro F1 score of 88 across five chest X-ray datasets, approaching or exceeding expert-level performance, while requiring only 24 GB of GPU memory. With data augmentation, as few as 80 annotated samples are sufficient to reach a weighted F1 score of 82 on Danish reports, compared to 86 with the full 1600-sample training set. MOSAIC offers a practical alternative to large or proprietary LLMs in clinical settings. Code and models are open-source. We invite the community to evaluate and extend MOSAIC on new languages, taxonomies, and modalities.
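A zero-shot prompting setup of the kind MOSAIC supports might look like the following; the Hugging Face model id (google/medgemma-4b-it), the prompt wording, and the label set are assumptions, not the paper's schema.

```python
from transformers import pipeline

LABELS = ["pneumothorax", "pleural effusion", "consolidation", "normal"]

def classify_report(report: str, generator) -> str:
    """Zero-shot prompt classification of a radiology report.
    Prompt and labels are illustrative, not MOSAIC's actual taxonomy."""
    messages = [{
        "role": "user",
        "content": (
            "Classify the radiology report. Reply with exactly one label "
            f"from: {', '.join(LABELS)}.\n\nReport: {report}"
        ),
    }]
    out = generator(messages, max_new_tokens=10, do_sample=False)
    # Chat-format pipelines return the conversation with the reply appended.
    return out[0]["generated_text"][-1]["content"].strip()

generator = pipeline("text-generation", model="google/medgemma-4b-it")
print(classify_report("Large right-sided pleural effusion, no pneumothorax.",
                      generator))
```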
[131] Investigating ReLoRA: Effects on the Learning Dynamics of Small Language Models
Yuval Weiss, David Demitri Africa, Paula Buttery, Richard Diehl Martinez
Main category: cs.CL
TL;DR: ReLoRA’s rank-expanding approach, effective for large models, underperforms in small language models (SLMs) by amplifying rank deficiencies and causing ill-conditioned updates.
Details
Motivation: To investigate whether ReLoRA can improve learning dynamics in small language models by addressing their known rank deficiencies and under-utilization of dimensionality.
Method: Systematic study of ReLoRA in SLMs (11M-66M parameters) using repeated merging and reinitialization of low-rank adapters during pretraining, evaluated on loss, Paloma perplexity, and BLiMP metrics (a merge-and-reinit sketch follows the abstract).
Result: ReLoRA underperforms full-rank training across all metrics, with performance gaps widening at larger scales. Analysis shows it amplifies existing rank deficiencies and induces ill-conditioned updates early in training.
Conclusion: ReLoRA’s merge-and-restart strategy doesn’t translate well to capacity-limited SLMs, suggesting the need for adaptive-rank or hybrid-rank approaches for low-compute pretraining.
Abstract: Parameter-efficient methods like LoRA have revolutionised large language model (LLM) fine-tuning. ReLoRA extends this idea to pretraining by repeatedly merging and reinitialising low-rank adapters, increasing cumulative rank while keeping updates cheap. This aligns well with observations that high-capacity models learn through locally low-rank trajectories that expand over time. By contrast, recent work suggests that small language models (SLMs) exhibit rank deficiencies and under-utilise their available dimensionality. This raises a natural question: can ReLoRA’s rank-expanding update rule \textit{steer} SLMs toward healthier learning dynamics, mitigating rank bottlenecks in a capacity-constrained regime? We argue SLMs are an ideal testbed: they train quickly, enable controlled ablations, and make rank phenomena more measurable. We present the first systematic study of ReLoRA in SLMs (11M-66M parameters), evaluating both performance and learning dynamics. Across loss, Paloma perplexity, and BLiMP, we find that ReLoRA underperforms full-rank training, with gaps widening at larger scales. Analysis of proportional effective rank and condition numbers shows that ReLoRA amplifies existing rank deficiencies and induces ill-conditioned updates early in training. Our results suggest that while ReLoRA’s merge-and-restart strategy can expand ranks in larger models, it does not straightforwardly translate to capacity-limited SLMs, motivating adaptive-rank or hybrid-rank approaches for low-compute pretraining.
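ReLoRA's update rule is easy to sketch: train a low-rank adapter on a frozen base weight, then periodically fold it into the base and reinitialize it, so cumulative rank can grow across cycles. The scaling and init conventions below follow common LoRA practice and are assumptions.

```python
import torch

class ReLoRALinear(torch.nn.Module):
    """Frozen base weight plus a trainable low-rank adapter that is
    periodically merged and reinitialised (a minimal ReLoRA-style sketch)."""

    def __init__(self, d_in, d_out, rank=8, alpha=16):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(d_out, d_in) * 0.02,
                                         requires_grad=False)
        self.A = torch.nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(d_out, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return x @ (self.weight + self.scale * self.B @ self.A).T

    @torch.no_grad()
    def merge_and_reinit(self):
        """Fold the adapter into the base weight, then restart the adapter."""
        self.weight += self.scale * self.B @ self.A
        torch.nn.init.normal_(self.A, std=0.01)
        torch.nn.init.zeros_(self.B)

layer = ReLoRALinear(32, 32)
x = torch.randn(4, 32)
y_before = layer(x)
layer.merge_and_reinit()
assert torch.allclose(y_before, layer(x), atol=1e-5)  # merging preserves the function
```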
[132] Linguistic Nepotism: Trading-off Quality for Language Preference in Multilingual RAG
Dayeon Ki, Marine Carpuat, Paul McNamee, Daniel Khashabi, Eugene Yang, Dawn Lawrie, Kevin Duh
Main category: cs.CL
TL;DR: Models in multilingual RAG systems show preference for citing English sources over non-English ones, even when relevance is equal, with bias stronger for lower-resource languages and documents in middle context positions.
Details
Motivation: To investigate whether the mixture of different document languages in multilingual RAG systems impacts generation and citation behavior in unintended ways, particularly whether language preference affects citation choices beyond document relevance.
Method: Used a controlled methodology with model internals to measure language preference while holding document relevance constant, testing across eight languages and six open-weight models.
Result: Models preferentially cite English sources for English queries, with bias amplified for lower-resource languages and documents positioned mid-context. Models sometimes trade document relevance for language preference in citations.
Conclusion: Citation choices in multilingual RAG systems are not always driven by informativeness alone, revealing how language models leverage multilingual context and influence citation behavior through language preferences.
Abstract: Multilingual Retrieval-Augmented Generation (mRAG) systems enable language models to answer knowledge-intensive queries with citation-supported responses across languages. While such systems have been proposed, an open question is whether the mixture of different document languages impacts generation and citation in unintended ways. To investigate, we introduce a controlled methodology using model internals to measure language preference while holding other factors such as document relevance constant. Across eight languages and six open-weight models, we find that models preferentially cite English sources when queries are in English, with this bias amplified for lower-resource languages and for documents positioned mid-context. Crucially, we find that models sometimes trade off document relevance for language preference, indicating that citation choices are not always driven by informativeness alone. Our findings shed light on how language models leverage multilingual context and influence citation behavior.
[133] On Code-Induced Reasoning in LLMs
Abdul Waheed, Zhen Wu, Carolyn Rosé, Daphne Ippolito
Main category: cs.CL
TL;DR: LLMs are more vulnerable to structural than semantic code perturbations, with pseudocode and flowcharts being as effective as actual code. Python favors natural language reasoning while Java/Rust favor math tasks.
Details
Motivation: To understand which aspects of code data most enhance LLM reasoning capabilities, as it remains unclear what specific properties of code are responsible for improved reasoning.
Method: Systematic framework using parallel instruction datasets in 10 programming languages with controlled perturbations that selectively disrupt structural or semantic properties. Finetuned LLMs from 5 model families and 8 scales on each variant (toy perturbation examples follow the abstract).
Result: Across 3,331 experiments, LLMs showed greater vulnerability to structural perturbations than semantic ones, especially on math and code tasks. Pseudocode and flowcharts were equally effective as code. Corrupted code with misleading signals remained competitive when surface regularities persisted. Python favored natural language reasoning while Java/Rust favored math.
Conclusion: Structural properties of code are more critical than semantic ones for LLM reasoning. Appropriate abstractions can be as effective as actual code while using fewer tokens. Syntactic styles influence task-specific performance gains.
Abstract: Code data has been shown to enhance the reasoning capabilities of large language models (LLMs), but it remains unclear which aspects of code are most responsible. We investigate this question with a systematic, data-centric framework. We construct parallel instruction datasets in ten programming languages and apply controlled perturbations that selectively disrupt structural or semantic properties of code. We then finetune LLMs from five model families and eight scales on each variant and evaluate their performance on natural language, math, and code tasks. Across 3,331 experiments, our results show that LLMs are more vulnerable to structural perturbations than semantic ones, particularly on math and code tasks. Appropriate abstractions like pseudocode and flowcharts can be as effective as code, while encoding the same information with fewer tokens without adhering to original syntax can often retain or even improve performance. Remarkably, even corrupted code with misleading signals remains competitive when surface-level regularities persist. Finally, syntactic styles also shape task-specific gains with Python favoring natural language reasoning and lower-level languages such as Java and Rust favoring math. Through our systematic framework, we aim to provide insight into how different properties of code influence reasoning and inform the design of training data for enhancing LLM reasoning capabilities.
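The two perturbation families can be illustrated on a toy snippet: a structural perturbation destroys formatting and line order, while a semantic one keeps structure but swaps in misleading identifiers. These specific operations are invented examples of the categories, not the paper's exact perturbations.

```python
import random
import re

SNIPPET = """def total_price(items):
    subtotal = sum(i.price for i in items)
    tax = subtotal * 0.2
    return subtotal + tax"""

def perturb_structure(code: str, seed=0) -> str:
    """Structural perturbation: flatten indentation and shuffle body lines."""
    lines = code.splitlines()
    body = [l.strip() for l in lines[1:]]
    random.Random(seed).shuffle(body)
    return "\n".join(lines[:1] + body)

def perturb_semantics(code: str) -> str:
    """Semantic perturbation: keep structure, swap in misleading identifiers."""
    renames = {"subtotal": "velocity", "tax": "temperature", "items": "graph"}
    for old, new in renames.items():
        code = re.sub(rf"\b{old}\b", new, code)
    return code

print(perturb_structure(SNIPPET), "\n")
print(perturb_semantics(SNIPPET))
```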
[134] The Hidden Costs of Translation Accuracy: Distillation, Quantization, and Environmental Impact
Dhaathri Vijay, Anandaswarup Vadapalli
Main category: cs.CL
TL;DR: Model compression strategies (distillation and quantization) can significantly reduce computational costs and environmental impact of LLMs while maintaining competitive translation quality, with minimal quality degradation.
Details
Motivation: Address concerns about the computational and environmental costs of large language models by investigating trade-offs between translation quality and efficiency.
Method: Compared full-scale, distilled, and quantized models on the Flores+ benchmark and in human evaluations of conversational translations in French, Hindi, and Kannada (an INT4 quantization sketch follows the abstract).
Result: Distilled 600M FP32 model reduced inference time by 71-78% and carbon emissions by 63-65% with minimal BLEU score reductions. Aggressive quantization (INT4) preserved high accuracy and fluency.
Conclusion: Model compression can substantially reduce computational demands and environmental impact while maintaining competitive quality, though trade-offs are more pronounced in low-resource settings. Need evaluation frameworks that integrate efficiency and sustainability alongside accuracy.
Abstract: The rapid expansion of large language models (LLMs) has heightened concerns about their computational and environmental costs. This study investigates the trade-offs between translation quality and efficiency by comparing full-scale, distilled, and quantized models using machine translation as a case study. We evaluated performance on the Flores+ benchmark and through human judgments of conversational translations in French, Hindi, and Kannada. Our analysis revealed that the full 3.3B FP32 model, while achieving the highest BLEU scores, incurred the largest environmental footprint (~ 0.007-0.008 kg CO2 per run). The distilled 600M FP32 model reduced inference time by 71-78% and carbon emissions by 63-65% compared with the full model, with only minimal reductions in BLEU scores. Human evaluations further showed that even aggressive quantization (INT4) preserved high levels of accuracy and fluency, with differences between models generally minor. These findings demonstrate that model compression strategies can substantially reduce computational demands and environmental impact while maintaining competitive translation quality, though trade-offs are more pronounced in low-resource settings. We argue for evaluation frameworks that integrate efficiency and sustainability alongside accuracy as central dimensions of progress in NLP.
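To see why INT4 is aggressive yet often tolerable, here is a minimal symmetric INT4 quantize/dequantize round trip with a size estimate; the study itself presumably used library quantization rather than this hand-rolled version.

```python
import numpy as np

def quantize_int4(w: np.ndarray):
    """Symmetric per-tensor INT4 quantization: 16 levels in [-8, 7]."""
    scale = np.abs(w).max() / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.05, size=(1024, 1024)).astype(np.float32)
q, scale = quantize_int4(w)
w_hat = dequantize(q, scale)

print(f"mean abs error: {np.abs(w - w_hat).mean():.6f}")
print(f"fp32 size: {w.nbytes / 1e6:.1f} MB, "
      f"int4 size (packed): {q.size / 2 / 1e6:.1f} MB")
```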
[135] The Rise of AfricaNLP: Contributions, Contributors, and Community Impact (2005-2025)
Tadesse Destaw Belay, Kedir Yassin Hussen, Sukairaj Hafiz Imam, Ibrahim Said Ahmad, Isa Inuwa-Dutse, Abrham Belete Haile, Grigori Sidorov, Iqra Ameer, Idris Abdulmumin, Tajuddeen Gwadabe, Vukosi Marivate, Seid Muhie Yimam, Shamsuddeen Hassan Muhammad
Main category: cs.CL
TL;DR: This study analyzes the progress of African NLP research over two decades using 1.9K paper abstracts, 4.9K authors, and 7.8K annotated contribution sentences to track trends, contributions, and key players in the field.
Details
Motivation: To track the progress of NLP research and automatically analyze research paper contributions, focusing on African NLP (AfricaNLP) to understand the field's evolution and its contributors.
Method: Quantitative examination using 1.9K NLP paper abstracts, 4.9K author contributors, and 7.8K human-annotated contribution sentences (AfricaNLPContributions) with benchmark results.
Result: Created a dataset and a continuously maintained NLP progress-tracking website that provide insights into AfricaNLP research trends and enable data-driven literature surveys.
Conclusion: The study provides a powerful lens for tracing AfricaNLP research trends and holds potential for generating data-driven literature surveys through the developed dataset and tracking platform.
Abstract: Natural Language Processing (NLP) is undergoing constant transformation, as Large Language Models (LLMs) are driving daily breakthroughs in research and practice. In this regard, tracking the progress of NLP research and automatically analyzing the contributions of research papers provides key insights into the nature of the field and the researchers. This study explores the progress of African NLP (AfricaNLP) by asking (and answering) basic research questions such as: i) How has the nature of NLP evolved over the last two decades?, ii) What are the contributions of AfricaNLP papers?, and iii) Which individuals and organizations (authors, affiliated institutions, and funding bodies) have been involved in the development of AfricaNLP? We quantitatively examine the contributions of AfricaNLP research using 1.9K NLP paper abstracts, 4.9K author contributors, and 7.8K human-annotated contribution sentences (AfricaNLPContributions) along with benchmark results. Our dataset and continuously maintained NLP progress tracking website provide a powerful lens for tracing AfricaNLP research trends and hold potential for generating data-driven literature surveys.
[136] Generating Difficult-to-Translate Texts
Vilém Zouhar, Wenda Xu, Parker Riley, Juraj Juraska, Mara Finkelstein, Markus Freitag, Daniel Deutsch
Main category: cs.CL
TL;DR: MT-breaker uses an LLM to iteratively refine source texts to create more challenging machine translation test cases that better reveal model weaknesses while maintaining naturalness and diversity.
Details
Motivation: Real-world machine translation benchmarks become obsolete quickly as models improve, limiting their ability to distinguish model performance and identify weaknesses. Current methods for creating difficult test cases are either ineffective at identifying truly hard examples or lack diversity and naturalness.
Method: MT-breaker employs a large language model that iteratively refines source texts, querying a target machine translation model to guide the generation of difficult examples. The LLM probes the MT model for failure points and creates increasingly challenging source texts.
Result: The approach generates examples that are more challenging for the target MT model while preserving natural text diversity. The difficulty transfers to other models and languages, even though examples are tailored to a specific model during generation.
Conclusion: MT-breaker provides an effective method for creating challenging, diverse, and natural machine translation benchmarks that better reveal model weaknesses and can generalize across different translation models and languages.
Abstract: Machine translation benchmarks sourced from the real world are quickly obsoleted, due to most examples being easy for state-of-the-art translation models. This limits the benchmark’s ability to distinguish which model is better or to reveal models’ weaknesses. Current methods for creating difficult test cases, such as subsampling or from-scratch synthesis, either fall short of identifying difficult examples or suffer from a lack of diversity and naturalness. Inspired by the iterative process of human experts probing for model failures, we propose MT-breaker, a method where a large language model iteratively refines a source text to increase its translation difficulty. The LLM iteratively queries a target machine translation model to guide its generation of difficult examples. Our approach generates examples that are more challenging for the target MT model while preserving the diversity of natural texts. While the examples are tailored to a particular machine translation model during the generation, the difficulty also transfers to other models and languages.
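The iterative refinement loop can be sketched as follows; `llm_rewrite`, `mt_translate`, and `difficulty` are hypothetical stand-ins, since the paper's actual prompts and difficulty scoring are not specified here:

```python
# One plausible shape of the MT-breaker loop: rewrite the source with an LLM
# that sees the MT model's current output, keep the rewrite only if it is
# harder to translate, and repeat for a fixed number of rounds.
def mt_breaker(source, llm_rewrite, mt_translate, difficulty, rounds=5):
    best, best_score = source, difficulty(source, mt_translate(source))
    for _ in range(rounds):
        candidate = llm_rewrite(best, mt_translate(best))  # LLM sees the MT failure
        score = difficulty(candidate, mt_translate(candidate))
        if score > best_score:  # accept only rewrites that increase difficulty
            best, best_score = candidate, score
    return best
```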
[137] Agent-ScanKit: Unraveling Memory and Reasoning of Multimodal Agents via Sensitivity Perturbations
Pengzhou Cheng, Lingzhong Dong, Zeng Wu, Zongru Wu, Zhuosheng Zhang, Gongshen Liu
Main category: cs.CL
TL;DR: Agent-ScanKit is a probing framework that reveals multimodal GUI agents rely more on memorization than systematic reasoning, showing limited generalization in complex tasks.
Details
Motivation: Existing multimodal agents for GUI interaction show limited reliability on complex or out-of-domain tasks, raising concerns about whether they are reasoning spuriously rather than systematically.
Method: Proposed Agent-ScanKit framework with three orthogonal probing paradigms (visual-guided, text-guided, structure-guided) to quantify memorization vs reasoning contributions without accessing model internals.
Result: Evaluation on 5 GUI benchmarks with 18 multimodal agents shows mechanical memorization often outweighs systematic reasoning; models function mainly as retrievers of training-aligned knowledge with limited generalization.
Conclusion: Findings underscore the necessity of robust reasoning modeling for reliable multimodal agents in real-world scenarios, providing insights for developing more capable agents.
Abstract: Although numerous strategies have recently been proposed to enhance the autonomous interaction capabilities of multimodal agents in graphical user interfaces (GUIs), their reliability remains limited when faced with complex or out-of-domain tasks. This raises a fundamental question: Are existing multimodal agents reasoning spuriously? In this paper, we propose Agent-ScanKit, a systematic probing framework to unravel the memory and reasoning capabilities of multimodal agents under controlled perturbations. Specifically, we introduce three orthogonal probing paradigms: visual-guided, text-guided, and structure-guided, each designed to quantify the contributions of memorization and reasoning without requiring access to model internals. On five publicly available GUI benchmarks involving 18 multimodal agents, the results demonstrate that mechanical memorization often outweighs systematic reasoning. Most of the models function predominantly as retrievers of training-aligned knowledge, exhibiting limited generalization. Our findings underscore the necessity of robust reasoning modeling for multimodal agents in real-world scenarios, offering valuable insights toward the development of reliable multimodal agents.
[138] MOSS-Speech: Towards True Speech-to-Speech Models Without Text Guidance
Xingjian Zhao, Zhe Xu, Qinyuan Cheng, Zhaoye Fei, Luozhijie Jin, Yang Wang, Hanfu Chen, Yaozhou Jiang, Qinghui Gao, Ke Chen, Ruixiao Li, Mingshu Chen, Ruiming Wang, Wenbo Zhang, Yiyang Zhang, Donghua Yu, Yang Gao, Xiaogui Yang, Yitian Gong, Yuanfan Xu, Yaqian Zhou, Xuanjing Huang, Xipeng Qiu
Main category: cs.CL
TL;DR: MOSS-Speech is a speech-to-speech large language model that directly processes and generates speech without text intermediates, using modality-based layer-splitting and frozen pre-training to preserve text LLM capabilities while adding native speech understanding.
Details
Motivation: Current spoken dialogue systems use cascaded pipelines that discard paralinguistic cues and limit expressivity. Even recent end-to-end methods still rely on text intermediates, creating a fundamental bottleneck for true speech-to-speech interaction.
Method: Combines modality-based layer-splitting architecture with frozen pre-training strategy to preserve reasoning and knowledge of pretrained text LLMs while adding native speech capabilities, enabling direct speech understanding and generation without text guidance.
Result: Achieves state-of-the-art results in spoken question answering and delivers comparable speech-to-speech performance to existing text-guided systems while maintaining competitive text performance.
Conclusion: Establishes a new paradigm for expressive and efficient end-to-end speech interaction by narrowing the gap between text-guided and direct speech generation, enabling true speech-to-speech communication without text bottlenecks.
Abstract: Spoken dialogue systems often rely on cascaded pipelines that transcribe, process, and resynthesize speech. While effective, this design discards paralinguistic cues and limits expressivity. Recent end-to-end methods reduce latency and better preserve these cues, yet still rely on text intermediates, creating a fundamental bottleneck. We present MOSS-Speech, a true speech-to-speech large language model that directly understands and generates speech without relying on text guidance. Our approach combines a modality-based layer-splitting architecture with a frozen pre-training strategy, preserving the reasoning and knowledge of pretrained text LLMs while adding native speech capabilities. Experiments show that our model achieves state-of-the-art results in spoken question answering and delivers comparable speech-to-speech performance relative to existing text-guided systems, while still maintaining competitive text performance. By narrowing the gap between text-guided and direct speech generation, our work establishes a new paradigm for expressive and efficient end-to-end speech interaction.
[139] Tenyidie Syllabification corpus creation and deep learning applications
Teisovi Angami, Kevisino Khate
Main category: cs.CL
TL;DR: This paper presents the first syllabification work for the Tenyidie language using deep learning models, achieving 99.21% accuracy with BLSTM on a newly created dataset of 10,120 syllabified words.
Details
Motivation: Tenyidie is a low-resource Tibeto-Burman language with no prior work on syllabification, which is an important NLP task for various downstream applications.
Method: Created a corpus of 10,120 syllabified Tenyidie words and applied LSTM, BLSTM, BLSTM+CRF, and Encoder-decoder deep learning architectures on an 80:10:10 train-validation-test split.
Result: Achieved highest accuracy of 99.21% with BLSTM model on the test set.
Conclusion: This work enables numerous NLP applications for Tenyidie including morphological analysis, POS tagging, and machine translation.
Abstract: The Tenyidie language is a low-resource language of the Tibeto-Burman family spoken by the Tenyimia Community of Nagaland in the north-eastern part of India and is considered a major language in Nagaland. It is tonal, Subject-Object-Verb, and highly agglutinative in nature. Being a low-resource language, very limited Natural Language Processing (NLP) research has been conducted on it. To the best of our knowledge, no work on syllabification has been reported for this language. Among the many NLP tasks, syllabification (or syllabication) is an important task in which the syllables of a given word are identified. The contribution of this work is the creation of a corpus of 10,120 syllabified Tenyidie words and the application of deep learning techniques to the created corpus. In this paper, we apply LSTM, BLSTM, BLSTM+CRF, and Encoder-decoder deep learning architectures to our created dataset. With an 80:10:10 (train:validation:test) split of the dataset, we achieved the highest accuracy of 99.21% with the BLSTM model on the test set. This work will find its application in numerous other NLP tasks, such as morphological analysis, part-of-speech tagging, and machine translation, for the Tenyidie language. Keywords: Tenyidie; NLP; syllabification; deep learning; LSTM; BLSTM; CRF; Encoder-decoder
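As an illustration (not the authors' code), syllabification can be cast as character-level sequence labeling with a BLSTM, the best-performing architecture reported; the vocabulary size and sequence length below are assumptions:

```python
# Each character of a word is tagged, e.g., 1 = syllable boundary after this
# character, 0 = none; a bidirectional LSTM reads the whole word both ways.
import tensorflow as tf

VOCAB_SIZE = 64   # assumed character inventory size
MAX_LEN = 32      # assumed maximum word length

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 64, mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128, return_sequences=True)),
    tf.keras.layers.Dense(2, activation="softmax"),  # boundary / no boundary
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(x_train, y_train, validation_data=(x_val, y_val)) on the 80:10:10 split
```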
[140] Benchmarking Foundation Models with Retrieval-Augmented Generation in Olympic-Level Physics Problem Solving
Shunfeng Zheng, Yudi Zhang, Meng Fang, Zihan Zhang, Zhitan Wu, Mykola Pechenizkiy, Ling Chen
Main category: cs.CL
TL;DR: RAG with foundation models shows promise for expert-level physics reasoning using a specialized multimodal dataset called PhoPile, though challenges remain.
Details
Motivation: To explore whether retrieval-augmented generation can enhance physics reasoning in foundation models for solving Olympiad-level problems, inspired by how students review past problems.
Method: Created PhoPile, a high-quality multimodal dataset for Olympiad-level physics with diagrams, graphs, and equations. Benchmarked RAG-augmented foundation models (LLMs and LMMs) with multiple retrievers.
Result: Integration of retrieval with physics corpora improves model performance, but challenges in retrieval-augmented physics reasoning persist.
Conclusion: RAG shows potential for enhancing physics reasoning in foundation models, though further research is needed to address remaining challenges.
Abstract: Retrieval-augmented generation (RAG) with foundation models has achieved strong performance across diverse tasks, but their capacity for expert-level reasoning, such as solving Olympiad-level physics problems, remains largely unexplored. Inspired by the way students prepare for competitions by reviewing past problems, we investigate the potential of RAG to enhance physics reasoning in foundation models. We introduce PhoPile, a high-quality multimodal dataset specifically designed for Olympiad-level physics, enabling systematic study of retrieval-based reasoning. PhoPile includes diagrams, graphs, and equations, capturing the inherently multimodal nature of physics problem solving. Using PhoPile, we benchmark RAG-augmented foundation models, covering both large language models (LLMs) and large multimodal models (LMMs) with multiple retrievers. Our results demonstrate that integrating retrieval with physics corpora can improve model performance, while also highlighting challenges that motivate further research in retrieval-augmented physics reasoning.
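The benchmarked setup reduces to a retrieve-then-prompt loop; the sketch below assumes hypothetical `embed` and `ask_model` stand-ins for the retriever and the foundation model:

```python
# Embed a corpus of past Olympiad problems, retrieve the nearest ones by
# cosine similarity, and prepend them to the prompt before asking the model.
import numpy as np

def rag_answer(question, corpus, embed, ask_model, k=3):
    q = embed(question)                              # (d,) query embedding
    docs = np.stack([embed(d) for d in corpus])      # (n, d) corpus embeddings
    sims = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))
    retrieved = [corpus[i] for i in np.argsort(-sims)[:k]]
    context = "\n\n".join(retrieved)
    return ask_model(f"Reference problems:\n{context}\n\nSolve:\n{question}")
```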
cs.CV
[141] LVTINO: LAtent Video consisTency INverse sOlver for High Definition Video Restoration
Alessio Spagnoletti, Andrés Almansa, Marcelo Pereyra
Main category: cs.CV
TL;DR: LVTINO is a zero-shot video restoration method that uses Video Consistency Models (VCMs) instead of frame-by-frame image diffusion models to achieve temporally consistent reconstructions with high computational efficiency.
Details
Motivation: Current zero-shot image restoration methods using latent diffusion models (LDMs) produce temporally inconsistent results when applied frame-by-frame to video, due to their inability to capture temporal dependencies.
Method: Leverages Video Consistency Models (VCMs) that distill video diffusion models into fast generators with explicit temporal causality. Uses a conditioning mechanism that bypasses automatic differentiation and requires only a few neural function evaluations.
Result: Achieves state-of-the-art video reconstruction quality with strong measurement consistency and smooth temporal transitions. Shows significant perceptual improvements over frame-by-frame LDM methods across diverse video inverse problems.
Conclusion: LVTINO establishes a new benchmark for zero-shot video restoration, combining high reconstruction fidelity with computational efficiency through VCM-based priors that explicitly model temporal dependencies.
Abstract: Computational imaging methods increasingly rely on powerful generative diffusion models to tackle challenging image restoration tasks. In particular, state-of-the-art zero-shot image inverse solvers leverage distilled text-to-image latent diffusion models (LDMs) to achieve unprecedented accuracy and perceptual quality with high computational efficiency. However, extending these advances to high-definition video restoration remains a significant challenge, due to the need to recover fine spatial detail while capturing subtle temporal dependencies. Consequently, methods that naively apply image-based LDM priors on a frame-by-frame basis often result in temporally inconsistent reconstructions. We address this challenge by leveraging recent advances in Video Consistency Models (VCMs), which distill video latent diffusion models into fast generators that explicitly capture temporal causality. Building on this foundation, we propose LVTINO, the first zero-shot or plug-and-play inverse solver for high definition video restoration with priors encoded by VCMs. Our conditioning mechanism bypasses the need for automatic differentiation and achieves state-of-the-art video reconstruction quality with only a few neural function evaluations, while ensuring strong measurement consistency and smooth temporal transitions across frames. Extensive experiments on a diverse set of video inverse problems show significant perceptual improvements over current state-of-the-art methods that apply image LDMs frame by frame, establishing a new benchmark in both reconstruction fidelity and computational efficiency.
[142] Image Generation Based on Image Style Extraction
Shuochen Chang
Main category: cs.CV
TL;DR: A method for fine-grained style-controlled image generation using style extraction from reference images and alignment with text representations.
Details
Motivation: Fine-grained styles are hard to describe in natural language, and stylized reference images don't align well with traditional text-guided generation methods.
Method: Three-stage training method using style encoder and style projection layer to extract style from reference images and align with text representations, without changing the generative model structure.
Result: Proposed method enables fine-grained controlled stylized image generation by injecting extracted style representations into the generative model.
Conclusion: The approach successfully achieves fine-grained style control in image generation by extracting and aligning style representations with text conditions.
Abstract: Text-to-image generation faces a practical challenge: fine-grained styles cannot be precisely described and controlled in natural language, while the guidance provided by stylized reference images is difficult to align directly with the textual conditions of traditional text-guided generation. This study focuses on maximizing the generative capability of the pretrained generative model by obtaining fine-grained stylistic representations from a single given stylistic reference image and injecting these representations into the generative body without changing the structural framework of the downstream generative model, so as to achieve fine-grained controlled stylized image generation. We propose a three-stage training, style-extraction-based image generation method that uses a style encoder and a style projection layer to align the style representations with the textual representations, realizing fine-grained, textual-cue-based style-guided generation. In addition, this study constructs the Style30k-captions dataset, whose samples contain a triad of image, style label, and text description, to train the style encoder and style projection layer.
[143] EvoStruggle: A Dataset Capturing the Evolution of Struggle across Activities and Skill Levels
Shijia Feng, Michael Wray, Walterio Mayol-Cuevas
Main category: cs.CV
TL;DR: This paper introduces a dataset for temporal struggle determination in skill acquisition, featuring 61.68 hours of video recordings across 18 tasks from 4 activities, with participants repeating tasks to capture skill evolution. The problem is framed as temporal action localization, showing that struggle detection can generalize across tasks and activities.
Details
Motivation: Existing manipulation datasets haven't focused on how struggle evolves over time during skill acquisition, which is crucial for optimizing human learning and developing effective assistive systems.
Method: Collected a large dataset with 61.68 hours of video, 2,793 videos, and 5,385 annotated temporal struggle segments from 76 participants across 18 tasks in 4 activities. Participants repeated tasks 5 times to capture skill evolution. Used Temporal Action Localization models to identify and precisely localize struggle segments.
Result: Temporal Action Localization models successfully learned to detect struggle cues, achieving 34.56% average mAP when generalizing across tasks and 19.24% across activities, indicating struggle is a transferable concept across skill-based tasks.
Conclusion: Struggle can be detected and localized as a transferable concept across various skill-based tasks, though there’s room for improvement in detection performance. The dataset enables further research in understanding skill acquisition and developing assistive systems.
Abstract: The ability to determine when a person struggles during skill acquisition is crucial for both optimizing human learning and enabling the development of effective assistive systems. As skills develop, the type and frequency of struggles tend to change, and understanding this evolution is key to determining the user’s current stage of learning. However, existing manipulation datasets have not focused on how struggle evolves over time. In this work, we collect a dataset for struggle determination, featuring 61.68 hours of video recordings, 2,793 videos, and 5,385 annotated temporal struggle segments collected from 76 participants. The dataset includes 18 tasks grouped into four diverse activities – tying knots, origami, tangram puzzles, and shuffling cards, representing different task variations. In addition, participants repeated the same task five times to capture their evolution of skill. We define the struggle determination problem as a temporal action localization task, focusing on identifying and precisely localizing struggle segments with start and end times. Experimental results show that Temporal Action Localization models can successfully learn to detect struggle cues, even when evaluated on unseen tasks or activities. The models attain an overall average mAP of 34.56% when generalizing across tasks and 19.24% across activities, indicating that struggle is a transferable concept across various skill-based tasks while still posing challenges for further improvement in struggle detection. Our dataset is available at https://github.com/FELIXFENG2019/EvoStruggle.
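Scoring in this formulation hinges on temporal IoU between predicted and annotated struggle segments; a minimal version of that matching criterion:

```python
# Temporal IoU between two (start, end) segments in seconds: intersection
# length over union length, the standard localization matching criterion.
def temporal_iou(pred, gt):
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# A prediction typically counts as a true positive when IoU exceeds a
# threshold (e.g., 0.5); mAP then averages precision across thresholds.
print(temporal_iou((2.0, 8.0), (4.0, 10.0)))  # 0.5
```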
[144] SPUS: A Lightweight and Parameter-Efficient Foundation Model for PDEs
Abu Bucker Siddik, Diane Oyen, Alexander Most, Michal Kucer, Ayan Biswas
Main category: cs.CV
TL;DR: SPUS is a compact foundation model using lightweight residual U-Net architecture for solving diverse PDEs, achieving SOTA generalization with fewer parameters and minimal fine-tuning.
Details
Motivation: To create an efficient foundation model for PDE solving that avoids the computational overhead of large transformer architectures used in existing state-of-the-art PDE foundation models.
Method: Uses a lightweight residual U-Net-based architecture with auto-regressive pretraining strategy that replicates numerical solver behavior, pretrained on diverse fluid dynamics PDEs.
Result: Achieves state-of-the-art generalization on 6 challenging unseen downstream PDEs across various physical systems with significantly fewer parameters and minimal fine-tuning data.
Conclusion: SPUS demonstrates potential as a highly parameter-efficient foundation model for solving diverse PDE systems, making U-Net architecture a viable alternative to complex transformer models in this domain.
Abstract: We introduce Small PDE U-Net Solver (SPUS), a compact and efficient foundation model (FM) designed as a unified neural operator for solving a wide range of partial differential equations (PDEs). Unlike existing state-of-the-art PDE FMs, which are primarily based on large, complex transformer architectures with high computational and parameter overhead, SPUS leverages a lightweight residual U-Net-based architecture that has been largely underexplored as a foundation model architecture in this domain. To enable effective learning in this minimalist framework, we utilize a simple yet powerful auto-regressive pretraining strategy which closely replicates the behavior of numerical solvers to learn the underlying physics. SPUS is pretrained on a diverse set of fluid dynamics PDEs and evaluated across 6 challenging unseen downstream PDEs spanning various physical systems. Experimental results demonstrate that SPUS using residual U-Net based architecture achieves state-of-the-art generalization on these downstream tasks while requiring significantly fewer parameters and minimal fine-tuning data, highlighting its potential as a highly parameter-efficient FM for solving diverse PDE systems.
[145] DisCo: Reinforcement with Diversity Constraints for Multi-Human Generation
Shubhankar Borse, Farzad Farhadzadeh, Munawar Hayat, Fatih Porikli
Main category: cs.CV
TL;DR: DisCo is a reinforcement learning framework that optimizes identity diversity in multi-human image generation, solving face duplication and identity merging issues in text-to-image models.
Details
Motivation: Current text-to-image models fail at multi-human prompts by duplicating faces, merging identities, and miscounting individuals, creating an identity crisis in generative models.
Method: Uses Group-Relative Policy Optimization (GRPO) with a compositional reward that penalizes facial similarity, discourages identity repetition, enforces accurate person counts, and preserves visual fidelity. Includes a single-stage curriculum for stable training.
Result: Achieves 98.6% Unique Face Accuracy and near-perfect Global Identity Spread on DiverseHumans Testset, surpassing both open-source and proprietary methods while maintaining competitive perceptual quality.
Conclusion: DisCo establishes a scalable, annotation-free solution that resolves identity issues in generative models and sets a new benchmark for compositional multi-human generation.
Abstract: State-of-the-art text-to-image models excel at realism but collapse on multi-human prompts: duplicating faces, merging identities, and miscounting individuals. We introduce DisCo (Reinforcement with Diversity Constraints), the first RL-based framework to directly optimize identity diversity in multi-human generation. DisCo fine-tunes flow-matching models via Group-Relative Policy Optimization (GRPO) with a compositional reward that (i) penalizes intra-image facial similarity, (ii) discourages cross-sample identity repetition, (iii) enforces accurate person counts, and (iv) preserves visual fidelity through human preference scores. A single-stage curriculum stabilizes training as complexity scales, requiring no extra annotations. On the DiverseHumans Testset, DisCo achieves 98.6% Unique Face Accuracy and near-perfect Global Identity Spread, surpassing both open-source and proprietary methods (e.g., Gemini, GPT-Image) while maintaining competitive perceptual quality. Our results establish DisCo as a scalable, annotation-free solution that resolves the long-standing identity crisis in generative models and sets a new benchmark for compositional multi-human generation.
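The compositional reward has roughly the following shape. This is a sketch: the paper's exact weights and scoring models are not given, the cross-sample repetition term (ii) is omitted for brevity, and `face_embeds`, `count_faces`, and `pref_score` are hypothetical stand-ins for a face-identity model, a face detector, and a human-preference model:

```python
# Per-image reward combining (i) an intra-image identity-similarity penalty,
# (iii) a person-count check, and (iv) a visual-fidelity preference score.
import itertools

def disco_reward(image, prompt_count, face_embeds, count_faces, pref_score,
                 w=(1.0, 1.0, 1.0)):
    embeds = face_embeds(image)  # list of unit-norm identity vectors
    # (i) worst-case cosine similarity over all face pairs in the image
    sim = max((float(a @ b) for a, b in itertools.combinations(embeds, 2)),
              default=0.0)
    # (iii) does the number of detected people match the prompt?
    count_ok = float(count_faces(image) == prompt_count)
    # (iv) preserve fidelity via a preference score
    return -w[0] * sim + w[1] * count_ok + w[2] * pref_score(image)
```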
[146] GeoSURGE: Geo-localization using Semantic Fusion with Hierarchy of Geographic Embeddings
Angel Daruna, Nicholas Meegan, Han-Pang Chiu, Supun Samarasekera, Rakesh Kumar
Main category: cs.CV
TL;DR: The paper presents a novel visual geo-localization method that aligns visual representations with hierarchical geographic embeddings, achieving state-of-the-art performance across multiple benchmarks.
Details
Motivation: To improve worldwide visual geo-localization by developing better learned representations of geography that can effectively align with visual content from query images.
Method: Formulates geo-localization as aligning visual representations with learned geographic representations, using hierarchical geographic embeddings and fusing appearance features with semantic segmentation maps.
Result: Achieved improved all-time bests in 22 out of 25 metrics across five benchmark datasets, outperforming prior state-of-the-art methods and recent Large Vision-Language Models.
Conclusion: The combination of geographic and visual representations is the primary driver for the performance gains in visual geo-localization tasks.
Abstract: Worldwide visual geo-localization seeks to determine the geographic location of an image anywhere on Earth using only its visual content. Learned representations of geography for visual geo-localization remain an active research topic despite much progress. We formulate geo-localization as aligning the visual representation of the query image with a learned geographic representation. Our novel geographic representation explicitly models the world as a hierarchy of geographic embeddings. Additionally, we introduce an approach to efficiently fuse the appearance features of the query image with its semantic segmentation map, forming a robust visual representation. Our main experiments demonstrate improved all-time bests in 22 out of 25 metrics measured across five benchmark datasets compared to prior state-of-the-art (SOTA) methods and recent Large Vision-Language Models (LVLMs). Additional ablation studies support the claim that these gains are primarily driven by the combination of geographic and visual representations.
[147] Data Selection for Fine-tuning Vision Language Models via Cross Modal Alignment Trajectories
Nilay Naharas, Dang Nguyen, Nesihan Bulut, Mohammadhossein Bateni, Vahab Mirrokni, Baharan Mirzasoleiman
Main category: cs.CV
TL;DR: XMAS is the first principled method for data-efficient instruction tuning of Large Vision-Language Models (LVLMs) that clusters examples based on cross-modal attention matrix trajectories to remove redundancy in training data.
Details
Motivation: Data selection has been extensively explored for vision models and LLMs but remains underexplored for LVLMs, with no existing methods outperforming random selection at different subset sizes.
Method: XMAS clusters examples based on trajectories of top singular values of attention matrices from fine-tuning a small proxy LVLM, then samples a balanced subset from these clusters to remove redundancy.
Result: XMAS can discard 50% of LLaVA-665k dataset and 85% of Vision-Flan dataset while fully preserving LLaVA-1.5-7B performance on 10 benchmarks and speeding up training by 1.2x, achieving 30% more data reduction than best baseline.
Conclusion: XMAS provides an effective approach for data-efficient LVLM instruction tuning by leveraging cross-modal attention patterns to identify and eliminate redundant training examples.
Abstract: Data-efficient learning aims to eliminate redundancy in large training datasets by training models on smaller subsets of the most informative examples. While data selection has been extensively explored for vision models and large language models (LLMs), it remains underexplored for Large Vision-Language Models (LVLMs). Notably, none of the existing methods outperforms random selection at different subset sizes. In this work, we propose the first principled method for data-efficient instruction tuning of LVLMs. We prove that examples with similar cross-modal attention matrices during instruction tuning have similar gradients. Thus, they influence model parameters in a similar manner and convey the same information to the model during training. Building on this insight, we propose XMAS, which clusters examples based on the trajectories of the top singular values of their attention matrices obtained from fine-tuning a small proxy LVLM. By sampling a balanced subset from these clusters, XMAS effectively removes redundancy in large-scale LVLM training data. Extensive experiments show that XMAS can discard 50% of the LLaVA-665k dataset and 85% of the Vision-Flan dataset while fully preserving performance of LLaVA-1.5-7B on 10 downstream benchmarks and speeding up its training by 1.2x. This is 30% more data reduction compared to the best baseline for LLaVA-665k. The project’s website can be found at https://bigml-cs-ucla.github.io/XMAS-project-page/.
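The core selection step can be sketched as follows, assuming per-checkpoint attention matrices are available; this illustrates the idea, not the released implementation:

```python
# Track the top singular values of each example's cross-modal attention
# matrix across fine-tuning checkpoints, cluster the resulting trajectories,
# and sample a balanced subset from the clusters.
import numpy as np
from sklearn.cluster import KMeans

def xmas_select(attn_per_ckpt, n_clusters, per_cluster, top_k=8, seed=0):
    """attn_per_ckpt: list over checkpoints of (n_examples, rows, cols) arrays."""
    trajs = []
    for ckpt in attn_per_ckpt:                      # one snapshot per checkpoint
        s = np.linalg.svd(ckpt, compute_uv=False)   # batched singular values
        trajs.append(s[:, :top_k])
    feats = np.concatenate(trajs, axis=1)           # trajectory = concatenation
    labels = KMeans(n_clusters, random_state=seed).fit_predict(feats)
    rng = np.random.default_rng(seed)
    keep = [i for c in range(n_clusters)
            for i in rng.choice(np.where(labels == c)[0],
                                size=min(per_cluster, int((labels == c).sum())),
                                replace=False)]
    return sorted(keep)
```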
[148] Purrception: Variational Flow Matching for Vector-Quantized Image Generation
Răzvan-Andrei Matişan, Vincent Tao Hu, Grigory Bartosh, Björn Ommer, Cees G. M. Snoek, Max Welling, Jan-Willem van de Meent, Mohammad Mahdi Derakhshani, Floor Eijkelboom
Main category: cs.CV
TL;DR: Purrception is a variational flow matching method for vector-quantized image generation that combines continuous transport dynamics with explicit categorical supervision over codebook indices.
Details
Motivation: To bridge the gap between continuous flow matching methods (which have geometric awareness) and discrete categorical approaches (which provide explicit supervision), enabling uncertainty quantification and temperature-controlled generation.
Method: Adapts Variational Flow Matching to vector-quantized latents by learning categorical posteriors over codebook indices while computing velocity fields in the continuous embedding space.
Result: Training converges faster than both continuous flow matching and discrete flow matching baselines, achieving competitive FID scores on ImageNet-1k 256x256 generation compared to state-of-the-art models.
Conclusion: Variational Flow Matching can effectively bridge continuous transport and discrete supervision for improved training efficiency in image generation.
Abstract: We introduce Purrception, a variational flow matching approach for vector-quantized image generation that provides explicit categorical supervision while maintaining continuous transport dynamics. Our method adapts Variational Flow Matching to vector-quantized latents by learning categorical posteriors over codebook indices while computing velocity fields in the continuous embedding space. This combines the geometric awareness of continuous methods with the discrete supervision of categorical approaches, enabling uncertainty quantification over plausible codes and temperature-controlled generation. We evaluate Purrception on ImageNet-1k 256x256 generation. Training converges faster than both continuous flow matching and discrete flow matching baselines while achieving competitive FID scores with state-of-the-art models. This demonstrates that Variational Flow Matching can effectively bridge continuous transport and discrete supervision for improved training efficiency in image generation.
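One way to read the construction, as a sketch under assumed linear probability paths (not necessarily the authors' exact parameterization): a network predicts a categorical posterior over codebook indices, and the velocity field is the posterior expectation of per-code velocities toward the codebook embeddings:

```python
# Categorical posterior over codes -> expected endpoint in embedding space ->
# linear-path velocity toward that endpoint.
import torch

def vfm_velocity(x_t, t, logits, codebook):
    """x_t: (B, d) latents; t: (B, 1) times in [0, 1);
    logits: (B, K) over codes; codebook: (K, d) embeddings."""
    probs = torch.softmax(logits, dim=-1)   # categorical posterior over codes
    targets = probs @ codebook              # (B, d) expected endpoint embedding
    return (targets - x_t) / (1.0 - t)      # velocity under a linear path
```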
[149] AortaDiff: A Unified Multitask Diffusion Framework For Contrast-Free AAA Imaging
Yuxuan Ou, Ning Bi, Jiazhen Pan, Jiancheng Yang, Boliang Yu, Usama Zidan, Regent Lee, Vicente Grau
Main category: cs.CV
TL;DR: A unified deep learning framework that generates synthetic contrast-enhanced CT images from non-contrast CT scans while simultaneously segmenting aortic lumen and thrombus, outperforming single-task and multi-stage models.
Details
Motivation: To reduce risks associated with iodinated contrast agents (nephrotoxicity, allergies, environmental harm) by generating synthetic CECT from NCCT scans, addressing limitations of multi-stage pipelines that cause error accumulation.
Method: Integrates conditional diffusion models with multi-task learning for end-to-end joint optimization of image synthesis and segmentation, sharing encoder/decoder parameters across tasks and using semi-supervised training for missing labels.
Result: Achieved PSNR of 25.61 dB for image synthesis (vs 23.80 dB single-task), improved lumen Dice to 0.89 (from 0.87) and thrombus Dice to 0.53 (from 0.48), reduced lumen diameter MAE to 4.19 mm (from 5.78 mm) and thrombus area error to 33.85% (from 41.45%).
Conclusion: The proposed unified framework effectively reduces contrast agent use while improving both image synthesis quality and anatomical segmentation accuracy for AAA assessment.
Abstract: While contrast-enhanced CT (CECT) is standard for assessing abdominal aortic aneurysms (AAA), the required iodinated contrast agents pose significant risks, including nephrotoxicity, patient allergies, and environmental harm. To reduce contrast agent use, recent deep learning methods have focused on generating synthetic CECT from non-contrast CT (NCCT) scans. However, most adopt a multi-stage pipeline that first generates images and then performs segmentation, which leads to error accumulation and fails to leverage shared semantic and anatomical structures. To address this, we propose a unified deep learning framework that generates synthetic CECT images from NCCT scans while simultaneously segmenting the aortic lumen and thrombus. Our approach integrates conditional diffusion models (CDM) with multi-task learning, enabling end-to-end joint optimization of image synthesis and anatomical segmentation. Unlike previous multitask diffusion models, our approach requires no initial predictions (e.g., a coarse segmentation mask), shares both encoder and decoder parameters across tasks, and employs a semi-supervised training strategy to learn from scans with missing segmentation labels, a common constraint in real-world clinical data. We evaluated our method on a cohort of 264 patients, where it consistently outperformed state-of-the-art single-task and multi-stage models. For image synthesis, our model achieved a PSNR of 25.61 dB, compared to 23.80 dB from a single-task CDM. For anatomical segmentation, it improved the lumen Dice score to 0.89 from 0.87 and the challenging thrombus Dice score to 0.53 from 0.48 (nnU-Net). These segmentation enhancements led to more accurate clinical measurements, reducing the lumen diameter MAE to 4.19 mm from 5.78 mm and the thrombus area error to 33.85% from 41.45% when compared to nnU-Net. Code is available at https://github.com/yuxuanou623/AortaDiff.git.
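The joint objective implied by the abstract can be skeletonized as a denoising term plus a segmentation term that is masked out when labels are missing; shapes and weighting below are assumptions, not the paper's exact recipe:

```python
# Diffusion denoising loss + semi-supervised segmentation loss: the
# segmentation term contributes only for scans that carry labels.
import torch
import torch.nn.functional as F

def joint_loss(pred_noise, true_noise, seg_logits, seg_labels, has_label, lam=1.0):
    """seg_logits: (B, C, H, W); seg_labels: (B, H, W); has_label: (B,) floats."""
    l_diff = F.mse_loss(pred_noise, true_noise)            # synthesis term
    l_seg = F.cross_entropy(seg_logits, seg_labels, reduction="none")
    l_seg = (l_seg.mean(dim=(1, 2)) * has_label).sum() / has_label.sum().clamp(min=1)
    return l_diff + lam * l_seg
```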
[150] From Videos to Indexed Knowledge Graphs – Framework to Marry Methods for Multimodal Content Analysis and Understanding
Basem Rizk, Joel Walsh, Mark Core, Benjamin Nye
Main category: cs.CV
TL;DR: A framework for efficiently prototyping multi-modal video analysis pipelines using pre-trained models to convert videos into temporal semi-structured data and queryable knowledge graphs.
Details
Motivation: Multi-modal content analysis is complex, computationally expensive, and requires significant engineering effort. Existing pre-trained models are available but challenging to fuse with complex data like videos.
Method: Develops a framework for creating pipelines that marry pre-trained models to convert videos into temporal semi-structured data, then translates this structure into frame-level indexed knowledge graphs that are queryable and support continual learning.
Result: The framework enables dynamic incorporation of new domain-specific knowledge through an interactive medium and supports efficient prototyping of multi-modal analysis pipelines.
Conclusion: The presented framework successfully addresses the challenges of multi-modal video analysis by providing an efficient prototyping approach that leverages pre-trained models and enables queryable knowledge representation with continual learning capabilities.
Abstract: Analysis of multi-modal content can be tricky and computationally expensive, and it requires a significant amount of engineering effort. A large body of work applies pre-trained models to static data, yet fusing these open-source models and methods with complex data such as videos remains challenging. In this paper, we present a framework that enables efficient prototyping of pipelines for multi-modal content analysis. We craft a candidate recipe for a pipeline, marrying a set of pre-trained models, to convert videos into a temporal semi-structured data format. We translate this structure further into a frame-level indexed knowledge graph representation that is queryable and supports continual learning, enabling the dynamic incorporation of new domain-specific knowledge through an interactive medium.
[151] WALT: Web Agents that Learn Tools
Viraj Prabhu, Yutong Dai, Matthew Fernandez, Jing Gu, Krithika Ramakrishnan, Yanqi Luo, Silvio Savarese, Caiming Xiong, Junnan Li, Zeyuan Chen, Ran Xu
Main category: cs.CV
TL;DR: WALT is a framework that reverse-engineers website functionality into reusable tools, enabling web agents to perform high-level operations like search and filter instead of brittle step-by-step UI interactions.
Details
Motivation: Current web agents are brittle and rely on step-by-step UI interactions with heavy LLM reasoning, which breaks under dynamic layouts and long horizons. Humans use high-level website operations, so WALT aims to expose these latent functionalities as tools.
Method: WALT reverse-engineers latent website functionality into reusable invocable tools spanning discovery (search, filter, sort), communication (post, comment, upvote), and content management (create, edit, delete). Agents call tools like search(query) instead of reasoning about low-level interactions.
Result: On VisualWebArena and WebArena, WALT achieves higher success with fewer steps and less LLM-dependent reasoning compared to traditional methods.
Conclusion: WALT establishes a robust and generalizable paradigm for browser automation by shifting computational burden from fragile step-by-step reasoning to reliable tool invocation.
Abstract: Web agents promise to automate complex browser tasks, but current methods remain brittle – relying on step-by-step UI interactions and heavy LLM reasoning that break under dynamic layouts and long horizons. Humans, by contrast, exploit website-provided functionality through high-level operations like search, filter, and sort. We introduce WALT (Web Agents that Learn Tools), a framework that reverse-engineers latent website functionality into reusable invocable tools. Rather than hypothesizing ad-hoc skills, WALT exposes robust implementations of automations already designed into websites – spanning discovery (search, filter, sort), communication (post, comment, upvote), and content management (create, edit, delete). Tools abstract away low-level execution: instead of reasoning about how to click and type, agents simply call search(query) or create(listing). This shifts the computational burden from fragile step-by-step reasoning to reliable tool invocation. On VisualWebArena and WebArena, WALT achieves higher success with fewer steps and less LLM-dependent reasoning, establishing a robust and generalizable paradigm for browser automation.
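The abstraction WALT argues for, in miniature; the dataclass and the browser calls here are illustrative stand-ins, not WALT's actual API:

```python
# Instead of emitting raw click/type actions, the agent calls named tools
# that wrap a site's own functionality behind a stable interface.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str                 # e.g., "search", "create", "comment"
    description: str          # shown to the agent when choosing an action
    run: Callable[..., str]   # replays the recorded UI automation

def make_search_tool(browser) -> Tool:
    def run(query: str) -> str:
        browser.fill("#search-box", query)    # hypothetical selector and API
        browser.press("#search-box", "Enter")
        return browser.page_text()
    return Tool("search", "Search the site for a query string.", run)

# The agent then plans over calls such as search("red bike"), shifting the
# burden from fragile step-by-step UI reasoning to reliable tool invocation.
```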
[152] MATCH: Multi-faceted Adaptive Topo-Consistency for Semi-Supervised Histopathology Segmentation
Meilong Xu, Xiaoling Hu, Shahira Abousamra, Chen Li, Chao Chen
Main category: cs.CV
TL;DR: A semi-supervised segmentation framework that uses topological consistency across multiple predictions to preserve meaningful structures in histopathology images, with a novel matching strategy for feature alignment.
Details
Motivation: To address the challenge of capturing meaningful semantic structures from unlabeled data in histopathology image analysis, where objects are densely distributed and distinguishing biological structures from noise is difficult.
Method: Leverages multiple perturbed predictions through stochastic dropouts and temporal training snapshots, enforcing topological consistency across outputs. Introduces a novel matching strategy that integrates spatial overlap with global structural alignment to match topological features without ground truth.
Result: Extensive experiments show the approach effectively reduces topological errors and produces more robust and accurate segmentations essential for reliable downstream analysis.
Conclusion: The proposed framework successfully identifies and preserves relevant topological features in semi-supervised segmentation, particularly beneficial for histopathology image analysis where biological structures need to be distinguished from artifacts.
Abstract: In semi-supervised segmentation, capturing meaningful semantic structures from unlabeled data is essential. This is particularly challenging in histopathology image analysis, where objects are densely distributed. To address this issue, we propose a semi-supervised segmentation framework designed to robustly identify and preserve relevant topological features. Our method leverages multiple perturbed predictions obtained through stochastic dropouts and temporal training snapshots, enforcing topological consistency across these varied outputs. This consistency mechanism helps distinguish biologically meaningful structures from transient and noisy artifacts. A key challenge in this process is to accurately match the corresponding topological features across the predictions in the absence of ground truth. To overcome this, we introduce a novel matching strategy that integrates spatial overlap with global structural alignment, minimizing discrepancies among predictions. Extensive experiments demonstrate that our approach effectively reduces topological errors, resulting in more robust and accurate segmentations essential for reliable downstream analysis. Code is available at https://github.com/Melon-Xu/MATCH.
[153] Towards Better Optimization For Listwise Preference in Diffusion Models
Jiamu Bai, Xin Yu, Meilong Xu, Weitao Lu, Xin Pan, Kiwan Maeng, Daniel Kifer, Jian Wang, Yu Wang
Main category: cs.CV
TL;DR: Diffusion-LPO extends Direct Preference Optimization (DPO) to handle listwise preferences in diffusion models, using ranked image lists to better capture human preferences than pairwise comparisons.
Details
Motivation: Current DPO applications in diffusion models rely mainly on pairwise preferences, but human feedback often contains implicit ranked information that provides more precise preference data than pairwise comparisons.
Method: Proposes Diffusion-LPO framework that aggregates user feedback into ranked image lists and derives a listwise extension of DPO objective under the Plackett-Luce model, enforcing consistency across entire rankings.
Result: Diffusion-LPO consistently outperforms pairwise DPO baselines across text-to-image generation, image editing, and personalized preference alignment tasks in terms of visual quality and preference alignment.
Conclusion: Listwise preference optimization through Diffusion-LPO provides more effective alignment of diffusion models with human preferences compared to pairwise approaches.
Abstract: Reinforcement learning from human feedback (RLHF) has proven effective for aligning text-to-image (T2I) diffusion models with human preferences. Although Direct Preference Optimization (DPO) is widely adopted for its computational efficiency and avoidance of explicit reward modeling, its applications to diffusion models have primarily relied on pairwise preferences. The precise optimization of listwise preferences remains largely unaddressed. In practice, human feedback on image preferences often contains implicit ranked information, which conveys more precise human preferences than pairwise comparisons. In this work, we propose Diffusion-LPO, a simple and effective framework for Listwise Preference Optimization in diffusion models with listwise data. Given a caption, we aggregate user feedback into a ranked list of images and derive a listwise extension of the DPO objective under the Plackett-Luce model. Diffusion-LPO enforces consistency across the entire ranking by encouraging each sample to be preferred over all of its lower-ranked alternatives. We empirically demonstrate the effectiveness of Diffusion-LPO across various tasks, including text-to-image generation, image editing, and personalized preference alignment. Diffusion-LPO consistently outperforms pairwise DPO baselines on visual quality and preference alignment.
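The listwise objective's general shape under the Plackett-Luce model, as a sketch: the log-likelihood of a ranking sums, at each rank, the chosen item's score against the log-sum-exp over all items still unranked, so every sample is encouraged to beat all of its lower-ranked alternatives. `scores` stands in for whatever per-image (implicit-reward) scores the method derives:

```python
# Negative Plackett-Luce log-likelihood of a ranked list of scores.
import torch

def plackett_luce_nll(scores):
    """scores: (B, L) per-image scores, columns ordered best-to-worst."""
    # for each rank k, log-sum-exp over positions k..L-1 (the items still unranked)
    rev_lse = torch.logcumsumexp(scores.flip(-1), dim=-1).flip(-1)
    return -(scores - rev_lse).sum(dim=-1).mean()
```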
[154] Growing Visual Generative Capacity for Pre-Trained MLLMs
Hanyu Wang, Jiaming Han, Ziyan Yang, Qi Zhao, Shanchuan Lin, Xiangyu Yue, Abhinav Shrivastava, Zhenheng Yang, Hao Chen
Main category: cs.CV
TL;DR: Bridge is a pure autoregressive unified MLLM that enables both image understanding and generation within a single next-token prediction framework using a Mixture-of-Transformers architecture and semantic-to-pixel discrete representation.
Details
Motivation: Current unified MLLMs face challenges: hybrid approaches break the autoregressive paradigm, while pure autoregressive approaches trade off between semantic alignment and pixel-level fidelity.
Method: Augments pre-trained visual understanding models with generative ability through Mixture-of-Transformers architecture and semantic-to-pixel discrete representation that integrates compact semantic tokens with fine-grained pixel tokens.
Result: Achieves competitive or superior results in both understanding and generation benchmarks with only 7.9% increase in sequence length, requiring less training data and reduced training time.
Conclusion: Bridge successfully demonstrates that pure autoregressive unified MLLMs can achieve strong performance in both understanding and generation tasks while maintaining efficiency.
Abstract: Multimodal large language models (MLLMs) extend the success of language models to visual understanding, and recent efforts have sought to build unified MLLMs that support both understanding and generation. However, constructing such models remains challenging: hybrid approaches combine continuous embeddings with diffusion or flow-based objectives, producing high-quality images but breaking the autoregressive paradigm, while pure autoregressive approaches unify text and image prediction over discrete visual tokens but often face trade-offs between semantic alignment and pixel-level fidelity. In this work, we present Bridge, a pure autoregressive unified MLLM that augments pre-trained visual understanding models with generative ability through a Mixture-of-Transformers architecture, enabling both image understanding and generation within a single next-token prediction framework. To further improve visual generation fidelity, we propose a semantic-to-pixel discrete representation that integrates compact semantic tokens with fine-grained pixel tokens, achieving strong language alignment and precise description of visual details with only a 7.9% increase in sequence length. Extensive experiments across diverse multimodal benchmarks demonstrate that Bridge achieves competitive or superior results in both understanding and generation benchmarks, while requiring less training data and reduced training time compared to prior unified MLLMs.
[155] Robust Classification of Oral Cancer with Limited Training Data
Akshay Bhagwan Sonawane, Lena D. Swamikannan, Lakshman Tamil
Main category: cs.CV
TL;DR: A hybrid CNN-Bayesian deep learning model for oral cancer classification that uses variational inference for uncertainty quantification, achieving better generalizability than traditional CNNs on small datasets.
Details
Motivation: Oral cancer has high mortality rates, especially in regions with limited healthcare access. Early diagnosis is crucial but challenging due to limited infrastructure and healthcare practitioners. Traditional deep learning models are overconfident and require large datasets, which is unrealistic in data-scarce settings.
Method: Proposed a hybrid model combining convolutional neural network (CNN) with Bayesian deep learning using variational inference for uncertainty quantification. Trained on smartphone-captured photographic color images with small training sets.
Result: Achieved 94% accuracy on similar distribution test data (comparable to traditional CNN). Superior generalizability on real-world photographic data with 88% accuracy vs 72.94% for traditional CNNs. Model shows low uncertainty for correct classifications and high uncertainty for misclassifications.
Conclusion: Bayesian inference effectively enhances model reliability and generalizability in data-scarce environments, improving early oral cancer diagnosis capabilities.
Abstract: Oral cancer ranks among the most prevalent cancers globally, with a particularly high mortality rate in regions lacking adequate healthcare access. Early diagnosis is crucial for reducing mortality; however, challenges persist due to limited oral health programs, inadequate infrastructure, and a shortage of healthcare practitioners. Conventional deep learning models, while promising, often rely on point estimates, leading to overconfidence and reduced reliability. Critically, these models require large datasets to mitigate overfitting and ensure generalizability, an unrealistic demand in settings with limited training data. To address these issues, we propose a hybrid model that combines a convolutional neural network (CNN) with Bayesian deep learning for oral cancer classification using small training sets. This approach employs variational inference to enhance reliability through uncertainty quantification. The model was trained on photographic color images captured by smartphones and evaluated on three distinct test datasets. The proposed method achieved 94% accuracy on a test dataset with a distribution similar to that of the training data, comparable to traditional CNN performance. Notably, for real-world photographic image data, despite limitations and variations differing from the training dataset, the proposed model demonstrated superior generalizability, achieving 88% accuracy on diverse datasets compared to 72.94% for traditional CNNs, even with a smaller dataset. Confidence analysis revealed that the model exhibits low uncertainty (high confidence) for correctly classified samples and high uncertainty (low confidence) for misclassified samples. These results underscore the effectiveness of Bayesian inference in data-scarce environments in enhancing early oral cancer diagnosis by improving model reliability and generalizability.
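The reported confidence analysis is typically computed as follows (a sketch, assuming a `bayesian_model` whose weights are resampled from the variational posterior on every call):

```python
# Average several stochastic forward passes and use predictive entropy as
# the per-example uncertainty: low entropy = confident, high = uncertain.
import numpy as np

def predict_with_uncertainty(bayesian_model, x, n_samples=30):
    probs = np.stack([bayesian_model(x) for _ in range(n_samples)])  # (S, B, C)
    mean = probs.mean(axis=0)                                        # (B, C)
    entropy = -(mean * np.log(mean + 1e-12)).sum(axis=-1)            # (B,)
    return mean.argmax(axis=-1), entropy  # predicted label + uncertainty score
```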
[156] Consistent Assistant Domains Transformer for Source-free Domain Adaptation
Renrong Shao, Wei Zhang, Kangyang Luo, Qin Li, and Jun Wang
Main category: cs.CV
TL;DR: CADTrans proposes a novel source-free domain adaptation method that constructs invariable feature representations using assistant domains and multiple consistency strategies to address hard samples and domain bias.
Details
Motivation: Current SFDA methods struggle with hard samples and domain bias because they cannot access source domain data to obtain deterministic invariable features. Existing approaches focus on aligning the target domain with the source domain but are limited in representing diversity.
Method: Uses an assistant domain module to obtain diversified representations from the intermediate aggregated global attentions. Employs multiple consistency strategies to obtain invariable feature representations that distinguish easy and hard samples. Uses conditional multi-kernel maximum mean discrepancy (CMK-MMD) to align hard samples to their corresponding easy samples.
Result: Extensive experiments on Office-31, Office-Home, VISDA-C, and DomainNet-126 benchmarks show significant performance improvements over existing methods.
Conclusion: CADTrans effectively addresses SFDA challenges by constructing domain-consistent invariable feature representations through assistant domains and multiple consistency strategies, achieving superior performance across multiple benchmarks.
Abstract: Source-free domain adaptation (SFDA) aims to address the challenge of adapting to a target domain without accessing the source domain directly. However, due to the inaccessibility of source domain data, deterministic invariable features cannot be obtained. Current mainstream methods primarily focus on evaluating invariant features in the target domain that closely resemble those in the source domain, subsequently aligning the target domain with the source domain. However, these methods are susceptible to hard samples and influenced by domain bias. In this paper, we propose a Consistent Assistant Domains Transformer for SFDA, abbreviated as CADTrans, which solves the issue by constructing invariable feature representations of domain consistency. Concretely, we develop an assistant domain module for CADTrans to obtain diversified representations from the intermediate aggregated global attentions, which addresses the limitation of existing methods in adequately representing diversity. Based on assistant and target domains, invariable feature representations are obtained by multiple consistent strategies, which can be used to distinguish easy and hard samples. Finally, to align the hard samples to the corresponding easy samples, we construct a conditional multi-kernel max mean discrepancy (CMK-MMD) strategy to distinguish between samples of the same category and those of different categories. Extensive experiments are conducted on various benchmarks such as Office-31, Office-Home, VISDA-C, and DomainNet-126, proving the significant performance improvements achieved by our proposed approaches. Code is available at https://github.com/RoryShao/CADTrans.git.
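The building block behind CMK-MMD is the multi-kernel MMD; a standard biased estimator, without the paper's class-conditional weighting, looks like this:

```python
# Multi-kernel MMD: average several RBF kernels over pairwise squared
# distances, then compare within-set and cross-set kernel means.
import torch

def mk_mmd(x, y, sigmas=(1.0, 2.0, 4.0)):
    """x: (n, d) easy-sample features; y: (m, d) hard-sample features."""
    z = torch.cat([x, y], dim=0)
    d2 = torch.cdist(z, z).pow(2)                    # pairwise squared distances
    k = sum(torch.exp(-d2 / (2 * s**2)) for s in sigmas) / len(sigmas)
    n = x.size(0)
    kxx, kyy, kxy = k[:n, :n], k[n:, n:], k[:n, n:]
    return kxx.mean() + kyy.mean() - 2 * kxy.mean()
```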
[157] Guiding Multimodal Large Language Models with Blind and Low Vision People Visual Questions for Proactive Visual Interpretations
Ricardo Gonzalez Penuela, Felipe Arias-Russi, Victor Capriles
Main category: cs.CV
TL;DR: A system that uses historical BLV user questions to guide MLLMs in generating context-aware image descriptions, improving relevance for blind and low vision users.
Details
Motivation: Current MLLM applications provide lengthy, comprehensive descriptions regardless of context, leading to inefficient exchanges where users must filter through irrelevant details instead of getting specific information they need.
Method: Developed a system that identifies similar past visual contexts from VizWiz-LF dataset and uses associated historical BLV user questions to guide MLLM in generating more relevant descriptions.
Result: Context-aware descriptions anticipated and answered users’ questions in 76.1% of cases (70/92) and were preferred in 54.4% of comparisons (50/92) over context-free descriptions.
Conclusion: Using historical BLV user questions to guide MLLMs significantly improves the relevance and usefulness of image descriptions for blind and low vision users.
Abstract: Multimodal large language models (MLLMs) have been integrated into visual interpretation applications to support Blind and Low Vision (BLV) users because of their accuracy and ability to provide rich, human-like interpretations. However, these applications often default to comprehensive, lengthy descriptions regardless of context. This leads to inefficient exchanges, as users must go through irrelevant details rather than receiving the specific information they are likely to seek. To deliver more contextually relevant information, we developed a system that draws on historical BLV users’ questions. When given an image, our system identifies similar past visual contexts from the VizWiz-LF dataset and uses the associated questions to guide the MLLM to generate descriptions more relevant to BLV users. An evaluation with three human labelers who revised 92 context-aware and context-free descriptions showed that context-aware descriptions anticipated and answered users’ questions in 76.1% of cases (70 out of 92) and were preferred in 54.4% of comparisons (50 out of 92). Our paper reviews and data analysis are publicly available in a GitHub repository at https://github.com/rgonzalezp/guiding-multimodal-large-language-models-with-blind-and-low-vision-people-visual-questions.
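A minimal sketch of the retrieval step, assuming precomputed image embeddings: rank past visual contexts by cosine similarity and prepend their questions to the description prompt. The embedding source and prompt wording are assumptions, not the paper's exact pipeline.

```python
# Illustrative retrieval step: rank past visual contexts by cosine similarity
# and reuse their questions to steer the description (names are hypothetical).
import numpy as np

def retrieve_guiding_questions(query_emb, past_embs, past_questions, k=3):
    sims = past_embs @ query_emb / (
        np.linalg.norm(past_embs, axis=1) * np.linalg.norm(query_emb) + 1e-8
    )
    return [past_questions[i] for i in np.argsort(-sims)[:k]]

rng = np.random.default_rng(0)
past_embs = rng.normal(size=(100, 512))     # stand-in image embeddings
past_questions = [f"question {i}" for i in range(100)]
guides = retrieve_guiding_questions(rng.normal(size=512), past_embs, past_questions)
prompt = "Describe this image, prioritizing details that answer:\n" + "\n".join(guides)
print(prompt)                               # fed to the MLLM with the image
```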
[158] ImageNet-Think-250K: A Large-Scale Synthetic Dataset for Multimodal Reasoning for Vision Language Models
Krishna Teja Chitty-Venkata, Murali Emani
Main category: cs.CV
TL;DR: ImageNet-Think is a multimodal reasoning dataset with 250,000 images from ImageNet21k, featuring structured thinking tokens and answers generated by state-of-the-art VLMs to train reasoning-capable vision language models.
Details
Motivation: To develop Vision Language Models (VLMs) with explicit reasoning capabilities and contribute to understanding multimodal reasoning mechanisms.
Method: Synthetic dataset generation using two advanced VLMs (GLM-4.1V-9B-Thinking and Kimi-VL-A3B-Thinking-2506) on 250,000 ImageNet21k images, capturing step-by-step reasoning processes and final answers.
Result: Created a comprehensive dataset with structured thinking tokens and corresponding answers, providing two thinking-answer sequence pairs per image for training and evaluation.
Conclusion: The dataset will be publicly available to support research in reasoning/thinking multimodal VLMs and enable development of more robust vision language models.
Abstract: We develop ImageNet-Think, a multimodal reasoning dataset designed to aid the development of Vision Language Models (VLMs) with explicit reasoning capabilities. Our dataset is built on 250,000 images from ImageNet21k dataset, providing structured thinking tokens and corresponding answers. Our synthetic dataset is generated by two state-of-the-art VLMs: GLM-4.1V-9B-Thinking and Kimi-VL-A3B-Thinking-2506. Each image is accompanied by two pairs of thinking-answer sequences, creating a resource for training and evaluating multimodal reasoning models. We capture the step-by-step reasoning process of VLMs and the final descriptive answers. Our goal with this dataset is to enable the development of more robust VLMs while contributing to the broader understanding of multimodal reasoning mechanisms. The dataset and evaluation benchmarks will be publicly available to aid research in reasoning/thinking multimodal VLMs.
[159] NPN: Non-Linear Projections of the Null-Space for Imaging Inverse Problems
Roman Jacome, Romario Gualdrón-Hurtado, Leon Suarez, Henry Arguello
Main category: cs.CV
TL;DR: NPN is a novel regularization method that uses neural networks to project solutions into low-dimensional null-space structures of sensing matrices, improving interpretability and flexibility in imaging inverse problems.
Details
Motivation: Traditional priors ignore the task-specific structure of the null-space in imaging inverse problems, leading to suboptimal reconstructions from undersampled, noisy measurements.
Method: Proposes Non-Linear Projections of the Null-Space (NPN) that promotes solutions in low-dimensional projections of the sensing matrix’s null-space using neural networks, focusing on information orthogonal to measurable signal components.
Result: NPN priors consistently enhance reconstruction fidelity across various imaging inverse problems including compressive sensing, deblurring, super-resolution, CT, and MRI, working with multiple reconstruction frameworks.
Conclusion: NPN provides interpretable, flexible regularization that captures null-space structure, offering theoretical guarantees and empirical improvements across diverse inverse problems and reconstruction methods.
Abstract: Imaging inverse problems aim to recover high-dimensional signals from undersampled, noisy measurements, a fundamentally ill-posed task with infinitely many solutions in the null-space of the sensing operator. To resolve this ambiguity, prior information is typically incorporated through handcrafted regularizers or learned models that constrain the solution space. However, these priors typically ignore the task-specific structure of that null-space. In this work, we propose Non-Linear Projections of the Null-Space (NPN), a novel class of regularization that, instead of enforcing structural constraints in the image domain, promotes solutions that lie in a low-dimensional projection of the sensing matrix’s null-space with a neural network. Our approach has two key advantages: (1) Interpretability: by focusing on the structure of the null-space, we design sensing-matrix-specific priors that capture information orthogonal to the signal components that are fundamentally blind to the sensing process. (2) Flexibility: NPN is adaptable to various inverse problems, compatible with existing reconstruction frameworks, and complementary to conventional image-domain priors. We provide theoretical guarantees on convergence and reconstruction accuracy when used within plug-and-play methods. Empirical results across diverse sensing matrices demonstrate that NPN priors consistently enhance reconstruction fidelity in various imaging inverse problems, such as compressive sensing, deblurring, super-resolution, computed tomography, and magnetic resonance imaging, with plug-and-play methods, unrolling networks, deep image prior, and diffusion models.
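For intuition, the linear version of the null-space structure NPN builds on can be computed directly with an SVD. The sketch below (NumPy, illustrative sizes) extracts a null-space basis of a random sensing matrix and shows that components in it are invisible to the measurements, which is exactly the part of the solution a prior must supply.

```python
# Illustrative null-space computation for a random sensing matrix A (m < n);
# rank(A) = m holds almost surely for Gaussian A.
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 128
A = rng.normal(size=(m, n))

_, _, Vt = np.linalg.svd(A)
null_basis = Vt[m:]                         # (n - m, n): rows span null(A)

x = rng.normal(size=n)                      # candidate reconstruction
z = null_basis @ x                          # null-space coordinates of x
v = null_basis.T @ z                        # that component, back in signal space
print(np.allclose(A @ v, 0, atol=1e-8))     # True: the measurements never see it
# NPN replaces this linear projection with a learned non-linear one and
# regularizes reconstructions through it.
```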
[160] Automated Genomic Interpretation via Concept Bottleneck Models for Medical Robotics
Zijun Li, Jinchang Zhang, Ming Zhang, Guoyu Lu
Main category: cs.CV
TL;DR: An automated genomic interpretation system that transforms DNA sequences into actionable decisions using Chaos Game Representation and Concept Bottleneck Model, achieving state-of-the-art HIV subtype classification with interpretable evidence and cost-aware recommendations.
Details
Motivation: To bridge the gap between interpretable genomic modeling and automated decision-making for medical automation and robotic systems, enabling reliable genomic medicine applications.
Method: Combines Chaos Game Representation (CGR) with Concept Bottleneck Model (CBM) using biologically meaningful concepts (GC content, CpG density, k-mer motifs), with concept fidelity supervision, prior consistency alignment, KL distribution matching, and uncertainty calibration.
Result: Achieves state-of-the-art HIV subtype classification across in-house and LANL datasets, superior concept prediction fidelity, and favorable cost-benefit trade-offs compared to existing baselines.
Conclusion: The proposed system establishes a reliable foundation for robotic and clinical automation in genomic medicine by delivering interpretable evidence and cost-aware decision policies that balance accuracy, calibration, and clinical utility.
Abstract: We propose an automated genomic interpretation module that transforms raw DNA sequences into actionable, interpretable decisions suitable for integration into medical automation and robotic systems. Our framework combines Chaos Game Representation (CGR) with a Concept Bottleneck Model (CBM), enforcing predictions to flow through biologically meaningful concepts such as GC content, CpG density, and k-mer motifs. To enhance reliability, we incorporate concept fidelity supervision, prior consistency alignment, KL distribution matching, and uncertainty calibration. Beyond accurate classification of HIV subtypes across both in-house and LANL datasets, our module delivers interpretable evidence that can be directly validated against biological priors. A cost-aware recommendation layer further translates predictive outputs into decision policies that balance accuracy, calibration, and clinical utility, reducing unnecessary retests and improving efficiency. Extensive experiments demonstrate that the proposed system achieves state-of-the-art classification performance, superior concept prediction fidelity, and more favorable cost-benefit trade-offs compared to existing baselines. By bridging the gap between interpretable genomic modeling and automated decision-making, this work establishes a reliable foundation for robotic and clinical automation in genomic medicine.
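Chaos Game Representation itself is a classical construction, so a faithful miniature is easy to give: each base pulls the current point halfway toward its assigned corner of the unit square, and binning the visited points yields the image fed to the downstream model. The grid size and random sequence here are illustrative.

```python
# Frequency Chaos Game Representation: a faithful miniature of the classical
# construction; grid size and the random sequence are illustrative.
import numpy as np

CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def cgr_image(seq, size=64):
    img = np.zeros((size, size))
    x, y = 0.5, 0.5
    for base in seq:
        if base not in CORNERS:
            continue                        # skip ambiguous bases such as 'N'
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2   # move halfway toward the corner
        img[min(int(y * size), size - 1), min(int(x * size), size - 1)] += 1
    return img

rng = np.random.default_rng(0)
seq = "".join(rng.choice(list("ACGT"), size=5000))
print(cgr_image(seq).sum())                 # 5000 visits binned into the grid
```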
[161] VLA-R1: Enhancing Reasoning in Vision-Language-Action Models
Angen Ye, Zeyu Zhang, Boyuan Wang, Xiaofeng Wang, Dapeng Zhang, Zheng Zhu
Main category: cs.CV
TL;DR: VLA-R1 is a reasoning-enhanced Vision-Language-Action model that integrates Reinforcement Learning from Verifiable Rewards with Group Relative Policy Optimization to systematically optimize both reasoning and execution, achieving superior generalization and real-world performance.
Details
Motivation: Current VLA models lack explicit step-by-step reasoning, emit final actions without considering affordance constraints or geometric relations, and have post-training pipelines that rarely reinforce reasoning quality with weak reward design.
Method: Integrates RLVR with GRPO for systematic optimization of reasoning and execution, designs RLVR-based post-training strategy with verifiable rewards for region alignment, trajectory consistency, and output formatting, and develops VLA-CoT-13K dataset with chain-of-thought supervision aligned with affordance and trajectory annotations.
Result: Extensive evaluations on in-domain, out-of-domain, simulation, and real-robot platforms demonstrate superior generalization and real-world performance compared to prior VLA methods.
Conclusion: VLA-R1 addresses key limitations in current VLA models by enhancing reasoning capabilities through verifiable reward optimization and high-quality chain-of-thought supervision, achieving improved performance across diverse evaluation settings.
Abstract: Vision-Language-Action (VLA) models aim to unify perception, language understanding, and action generation, offering strong cross-task and cross-scene generalization with broad impact on embodied AI. However, current VLA models often lack explicit step-by-step reasoning, instead emitting final actions without considering affordance constraints or geometric relations. Their post-training pipelines also rarely reinforce reasoning quality, relying primarily on supervised fine-tuning with weak reward design. To address these challenges, we present VLA-R1, a reasoning-enhanced VLA that integrates Reinforcement Learning from Verifiable Rewards (RLVR) with Group Relative Policy Optimization (GRPO) to systematically optimize both reasoning and execution. Specifically, we design an RLVR-based post-training strategy with verifiable rewards for region alignment, trajectory consistency, and output formatting, thereby strengthening reasoning robustness and execution accuracy. Moreover, we develop VLA-CoT-13K, a high-quality dataset that provides chain-of-thought supervision explicitly aligned with affordance and trajectory annotations. Furthermore, extensive evaluations on in-domain, out-of-domain, simulation, and real-robot platforms demonstrate that VLA-R1 achieves superior generalization and real-world performance compared to prior VLA methods. We plan to release the model, code, and dataset following the publication of this work. Code: https://github.com/GigaAI-research/VLA-R1. Website: https://gigaai-research.github.io/VLA-R1.
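The GRPO component has a simple core: rewards for a group of rollouts from the same prompt are normalized within the group to form critic-free advantages. A minimal sketch follows; the reward composition and shapes are illustrative, not the paper's exact recipe.

```python
# Illustrative GRPO advantage computation: rewards are normalized within each
# prompt's rollout group, so no learned value function is needed.
import torch

def grpo_advantages(rewards, eps=1e-6):
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# 4 prompts x 8 rollouts; each reward could sum verifiable terms such as
# region alignment, trajectory consistency, and format (composition assumed).
rewards = torch.rand(4, 8)
adv = grpo_advantages(rewards)
print(adv.mean(dim=1))                      # ~0 per group by construction
```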
[162] Joint Deblurring and 3D Reconstruction for Macrophotography
Yifan Zhao, Liangchen Li, Yuqi Zhou, Kai Wang, Yan Liang, Juyong Zhang
Main category: cs.CV
TL;DR: Proposes a joint deblurring and 3D reconstruction method for macrophotography that simultaneously optimizes clear 3D models and defocus blur kernels using differentiable rendering.
Details
Motivation: Defocus blur in macrophotography hinders clear imaging and high-quality 3D reconstruction, while traditional methods require many images and annotations with no existing multi-view 3D reconstruction for macrophotography.
Method: Joint optimization of clear 3D model and defocus blur kernel for each pixel using differentiable rendering for self-supervised optimization from multi-view blurry images.
Result: Achieves high-quality image deblurring and recovers high-fidelity 3D appearance from a small number of multi-view images.
Conclusion: The proposed joint framework effectively addresses defocus blur in macrophotography and enables high-quality 3D reconstruction from limited multi-view images.
Abstract: Macro lenses offer high resolution and large magnification, and 3D modeling of small, detailed objects can provide richer information. However, defocus blur in macrophotography is a long-standing problem that heavily hinders clear imaging of the captured objects and their high-quality 3D reconstruction. Traditional image deblurring methods require a large number of images and annotations, and there is currently no multi-view 3D reconstruction method for macrophotography. In this work, we propose a joint deblurring and 3D reconstruction method for macrophotography. Starting from captured multi-view blurry images, we jointly optimize the clear 3D model of the object and the defocus blur kernel of each pixel. The entire framework adopts a differentiable rendering method to self-supervise the optimization of the 3D model and the defocus blur kernel. Extensive experiments show that from a small number of multi-view images, our proposed method can not only achieve high-quality image deblurring but also recover high-fidelity 3D appearance.
[163] FideDiff: Efficient Diffusion Model for High-Fidelity Image Motion Deblurring
Xiaoyang Liu, Zhengyan Zhou, Zihang Xu, Jiezhang Cao, Zheng Chen, Yulun Zhang
Main category: cs.CV
TL;DR: FideDiff is a novel single-step diffusion model for high-fidelity motion deblurring that reformulates deblurring as a diffusion process, trains a consistency model to align all timesteps to clean images, and achieves superior performance with one-step inference.
Details
Motivation: To address the limitations of current diffusion models for image deblurring, including unbearable inference time and compromised fidelity, while leveraging the strong generative capabilities of pre-trained diffusion models.
Method: Reformulates motion deblurring as a diffusion-like process with progressively blurred images, trains a consistency model with matched blur trajectories, integrates Kernel ControlNet for blur kernel estimation, and introduces adaptive timestep prediction.
Result: Achieves superior performance on full-reference metrics, surpasses previous diffusion-based methods, and matches the performance of other state-of-the-art models with efficient one-step deblurring.
Conclusion: FideDiff offers a new direction for applying pre-trained diffusion models to high-fidelity image restoration tasks and establishes a robust baseline for advancing diffusion models in real-world industrial applications.
Abstract: Recent advancements in image motion deblurring, driven by CNNs and transformers, have made significant progress. Large-scale pre-trained diffusion models, which are rich in real-world modeling, have shown great promise for high-quality image restoration tasks such as deblurring, demonstrating stronger generative capabilities than CNN and transformer-based methods. However, challenges such as unbearable inference time and compromised fidelity still limit the full potential of the diffusion models. To address this, we introduce FideDiff, a novel single-step diffusion model designed for high-fidelity deblurring. We reformulate motion deblurring as a diffusion-like process where each timestep represents a progressively blurred image, and we train a consistency model that aligns all timesteps to the same clean image. By reconstructing training data with matched blur trajectories, the model learns temporal consistency, enabling accurate one-step deblurring. We further enhance model performance by integrating Kernel ControlNet for blur kernel estimation and introducing adaptive timestep prediction. Our model achieves superior performance on full-reference metrics, surpassing previous diffusion-based methods and matching the performance of other state-of-the-art models. FideDiff offers a new direction for applying pre-trained diffusion models to high-fidelity image restoration tasks, establishing a robust baseline for further advancing diffusion models in real-world industrial applications. Our dataset and code will be available at https://github.com/xyLiu339/FideDiff.
[164] LadderMoE: Ladder-Side Mixture of Experts Adapters for Bronze Inscription Recognition
Rixin Zhou, Peiqiang Qiu, Qian Zhang, Chuntao Li, Xi Yang
Main category: cs.CV
TL;DR: This paper presents a large-scale bronze inscription dataset and a two-stage detection-recognition pipeline with LadderMoE to address challenges in automatic recognition of ancient Chinese bronze inscriptions.
Details
Motivation: Bronze inscriptions are crucial for archaeological studies but automatic recognition is difficult due to severe visual degradation, multi-domain variability across photographs/rubbings/tracings, and extremely long-tailed character distribution.
Method: A two-stage detection-recognition pipeline that first localizes inscriptions then transcribes characters, equipped with LadderMoE which augments a pretrained CLIP encoder with ladder-style MoE adapters for dynamic expert specialization and robustness.
Result: The method substantially outperforms state-of-the-art scene text recognition baselines, achieving superior accuracy across head, mid, and tail categories as well as all acquisition modalities (photographs, rubbings, tracings).
Conclusion: The approach establishes a strong foundation for bronze inscription recognition and downstream archaeological analysis, successfully addressing the challenges of multi-domain variability and long-tailed character distribution.
Abstract: Bronze inscriptions (BI), engraved on ritual vessels, constitute a crucial stage of early Chinese writing and provide indispensable evidence for archaeological and historical studies. However, automatic BI recognition remains difficult due to severe visual degradation, multi-domain variability across photographs, rubbings, and tracings, and an extremely long-tailed character distribution. To address these challenges, we curate a large-scale BI dataset comprising 22454 full-page images and 198598 annotated characters spanning 6658 unique categories, enabling robust cross-domain evaluation. Building on this resource, we develop a two-stage detection-recognition pipeline that first localizes inscriptions and then transcribes individual characters. To handle heterogeneous domains and rare classes, we equip the pipeline with LadderMoE, which augments a pretrained CLIP encoder with ladder-style MoE adapters, enabling dynamic expert specialization and stronger robustness. Comprehensive experiments on single-character and full-page recognition tasks demonstrate that our method substantially outperforms state-of-the-art scene text recognition baselines, achieving superior accuracy across head, mid, and tail categories as well as all acquisition modalities. These results establish a strong foundation for bronze inscription recognition and downstream archaeological analysis.
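The abstract does not spell out the adapter internals, so the following sketch is only a plausible reading: a small gated mixture of expert MLPs attached on the side of a frozen encoder, refining features residually while the backbone stays fixed. All sizes and names are assumptions.

```python
# Speculative reading of a ladder-side MoE adapter: a gated mixture of small
# expert MLPs refines frozen encoder features residually; sizes are assumed.
import torch
import torch.nn as nn

class LadderMoEAdapter(nn.Module):
    def __init__(self, dim=512, n_experts=4, hidden=128):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        ])

    def forward(self, feat):                       # feat: (batch, dim)
        weights = self.gate(feat).softmax(-1)      # per-sample expert routing
        out = torch.stack([e(feat) for e in self.experts], dim=1)
        return feat + (weights.unsqueeze(-1) * out).sum(dim=1)

frozen_clip_feat = torch.randn(8, 512)             # stand-in for CLIP features
print(LadderMoEAdapter()(frozen_clip_feat).shape)  # torch.Size([8, 512])
```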
[165] VirDA: Reusing Backbone for Unsupervised Domain Adaptation with Visual Reprogramming
Duy Nguyen, Dat Nguyen
Main category: cs.CV
TL;DR: VirDA proposes a parameter-efficient domain adaptation method using visual reprogramming layers instead of fine-tuning full backbones, achieving competitive accuracy with significantly fewer trainable parameters.
Details
Motivation: Existing UDA methods require fine-tuning backbone parameters for each new source-target pair, leading to linear growth in parameters and storage, and preventing backbone reuse.
Method: Prepends domain-specific visual reprogramming layers to the backbone that produce visual prompts as textural bias, optimizing with multiple objective functions for intra- and inter-domain distribution differences without modifying backbone parameters.
Result: Achieves 92.8% mean accuracy on Office-31 with only 1.5M trainable parameters, outperforming PDA by +1.6% accuracy using 46% of parameters, and outperforming CDTrans and FixBi while using only 1.7% and 2.8% of their parameters respectively.
Conclusion: VirDA enables efficient domain adaptation by reusing backbone parameters across domains through visual reprogramming, achieving strong performance with dramatically reduced parameter requirements.
Abstract: Existing UDA pipelines fine-tune already well-trained backbone parameters for every new source-and-target pair, resulting in the number of training parameters and storage memory growing linearly with each new pair, and also preventing the reuse of these well-trained backbone parameters. Inspired by recent implications that existing backbones have textural biases, we propose making use of domain-specific textural bias for domain adaptation via visual reprogramming, namely VirDA. Instead of fine-tuning the full backbone, VirDA prepends a domain-specific visual reprogramming layer to the backbone. This layer produces visual prompts that act as an added textural bias to the input image, adapting its "style" to a target domain. To optimize these visual reprogramming layers, we use multiple objective functions that optimize the intra- and inter-domain distribution differences when domain-adapting visual prompts are applied. This process does not require modifying the backbone parameters, allowing the same backbone to be reused across different domains. We evaluate VirDA on Office-31 and obtain 92.8% mean accuracy with only 1.5M trainable parameters. VirDA surpasses PDA, the state-of-the-art parameter-efficient UDA baseline, by +1.6% accuracy while using just 46% of its parameters. Compared with full-backbone fine-tuning, VirDA outperforms CDTrans and FixBi by +0.2% and +1.4%, respectively, while requiring only 1.7% and 2.8% of their trainable parameters. Relative to the strongest current methods (PMTrans and TVT), VirDA uses ~1.7% of their parameters and trades off only 2.2% and 1.1% accuracy, respectively.
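A minimal sketch of the reprogramming idea, under the common formulation of visual prompts as a learnable, bounded additive pattern on the input (the paper's exact layer may differ): only the prompt parameters train, so the backbone can be shared across source-target pairs.

```python
# Illustrative visual reprogramming: a learnable, bounded pattern added to the
# input; the frozen backbone is a toy stand-in and never updated.
import torch
import torch.nn as nn

class ReprogrammingLayer(nn.Module):
    def __init__(self, image_size=224):
        super().__init__()
        self.delta = nn.Parameter(torch.zeros(1, 3, image_size, image_size))

    def forward(self, x):
        return x + torch.tanh(self.delta)   # bounded "textural bias" prompt

backbone = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(3, 10))
for p in backbone.parameters():
    p.requires_grad = False                 # reused across domains, never tuned

reprogram = ReprogrammingLayer()
logits = backbone(reprogram(torch.randn(2, 3, 224, 224)))
trainable = sum(p.numel() for p in reprogram.parameters())
print(logits.shape, trainable)              # only the prompt parameters train
```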
[166] Discrete Facial Encoding: A Framework for Data-driven Facial Display Discovery
Minh Tran, Maksim Siniukov, Zhangyu Jin, Mohammad Soleymani
Main category: cs.CV
TL;DR: DFE is an unsupervised facial expression encoding method that learns a discrete dictionary of facial expressions from 3D mesh sequences using RVQ-VAE, outperforming FACS and other methods in psychological tasks.
Details
Motivation: Existing facial expression coding systems like FACS have limited coverage and require costly manual annotation, creating a need for scalable, data-driven alternatives.
Method: Extract identity-invariant expression features using 3DMM, then encode them with Residual Vector Quantized VAE to produce discrete tokens from a shared codebook that capture reusable facial deformation patterns.
Result: DFE captures more precise facial behaviors than FACS and outperforms FACS-based pipelines and representation learning models in stress detection, personality prediction, and depression detection tasks.
Conclusion: DFE serves as a scalable and effective alternative to FACS for psychological and affective computing applications, covering a wider variety of facial displays.
Abstract: Facial expression analysis is central to understanding human behavior, yet existing coding systems such as the Facial Action Coding System (FACS) are constrained by limited coverage and costly manual annotation. In this work, we introduce Discrete Facial Encoding (DFE), an unsupervised, data-driven alternative: a compact and interpretable dictionary of facial expressions learned from 3D mesh sequences through a Residual Vector Quantized Variational Autoencoder (RVQ-VAE). Our approach first extracts identity-invariant expression features from images using a 3D Morphable Model (3DMM), effectively disentangling factors such as head pose and facial geometry. We then encode these features using an RVQ-VAE, producing a sequence of discrete tokens from a shared codebook, where each token captures a specific, reusable facial deformation pattern that contributes to the overall expression. Through extensive experiments, we demonstrate that Discrete Facial Encoding captures more precise facial behaviors than FACS and other facial encoding alternatives. We evaluate the utility of our representation across three high-level psychological tasks: stress detection, personality prediction, and depression detection. Using a simple Bag-of-Words model built on top of the learned tokens, our system consistently outperforms both FACS-based pipelines and strong image and video representation learning models such as Masked Autoencoders. Further analysis reveals that our representation covers a wider variety of facial displays, highlighting its potential as a scalable and effective alternative to FACS for psychological and affective computing applications.
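Residual vector quantization is standard enough to sketch faithfully: each stage snaps the remaining residual to its nearest code in the shared codebook, so early tokens capture coarse deformations and later tokens add detail. Dimensions and codebook contents below are illustrative.

```python
# Faithful miniature of residual vector quantization with a shared codebook;
# dimensions and codebook contents are illustrative.
import numpy as np

def rvq_encode(x, codebook, n_stages=3):
    tokens, residual = [], x.copy()
    for _ in range(n_stages):
        idx = int(np.argmin(((residual[None, :] - codebook) ** 2).sum(axis=1)))
        tokens.append(idx)                  # one discrete token per stage
        residual = residual - codebook[idx] # next stage refines what is left
    return tokens, residual

rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 64))       # shared across stages
expr = rng.normal(size=64)                  # stand-in 3DMM expression code
tokens, res = rvq_encode(expr, codebook)
print(tokens, round(float(np.linalg.norm(res)), 3))  # coarse-to-fine "words"
```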
[167] Non-Rigid Structure-from-Motion via Differential Geometry with Recoverable Conformal Scale
Yongbo Chen, Yanhao Zhang, Shaifali Parashar, Liang Zhao, Shoudong Huang
Main category: cs.CV
TL;DR: Con-NRSfM is a novel method for non-rigid structure-from-motion that handles conformal deformations, eliminates strict assumptions, accurately computes local conformal scale, and enables precise depth estimation through a graph-based framework with parallel optimization.
Details
Motivation: To address limitations in existing NRSfM methods that rely on strict assumptions like locally planar surfaces or locally linear deformations, and fail to recover conformal scale, while improving monocular visual deformable SLAM.
Method: Uses point-wise reconstruction with 2D selected image warps optimized through graph-based framework, decouples depth and conformal scale constraints, employs parallel separable iterative optimization, and incorporates self-supervised learning with encoder-decoder network for dense 3D point clouds.
Result: Method surpasses existing approaches in reconstruction accuracy and robustness on both synthetic and real datasets, demonstrating improved performance in conformal deformation handling.
Conclusion: Con-NRSfM provides an effective solution for NRSfM under conformal deformations with superior accuracy and robustness compared to existing methods, with code being made publicly available.
Abstract: Non-rigid structure-from-motion (NRSfM), a promising technique for addressing the mapping challenges in monocular visual deformable simultaneous localization and mapping (SLAM), has attracted growing attention. We introduce a novel method, called Con-NRSfM, for NRSfM under conformal deformations, encompassing isometric deformations as a subset. Our approach performs point-wise reconstruction using 2D selected image warps optimized through a graph-based framework. Unlike existing methods that rely on strict assumptions, such as locally planar surfaces or locally linear deformations, and fail to recover the conformal scale, our method eliminates these constraints and accurately computes the local conformal scale. Additionally, our framework decouples constraints on depth and conformal scale, which are inseparable in other approaches, enabling more precise depth estimation. To address the sensitivity of the formulated problem, we employ a parallel separable iterative optimization strategy. Furthermore, a self-supervised learning framework, utilizing an encoder-decoder network, is incorporated to generate dense 3D point clouds with texture. Simulation and experimental results using both synthetic and real datasets demonstrate that our method surpasses existing approaches in terms of reconstruction accuracy and robustness. The code for the proposed method will be made publicly available on the project website: https://sites.google.com/view/con-nrsfm.
[168] UniVerse: Unleashing the Scene Prior of Video Diffusion Models for Robust Radiance Field Reconstruction
Jin Cao, Hongrui Wu, Ziyong Feng, Hujun Bao, Xiaowei Zhou, Sida Peng
Main category: cs.CV
TL;DR: UniVerse is a unified framework for robust 3D reconstruction from inconsistent multi-view images, using a video diffusion model to restore image consistency before reconstruction.
Details
Motivation: Existing methods for robust reconstruction rely heavily on dense observations and struggle with optimizing model parameters when dealing with inconsistent multi-view images.
Method: Decouples robust reconstruction into restoration and reconstruction subtasks. Uses a video diffusion model to first convert inconsistent images into initial videos, then restore them into consistent images, and finally reconstruct 3D scenes from the restored images.
Result: Extensive experiments on synthetic and real-world datasets demonstrate strong generalization capability and superior performance in robust reconstruction. The method can also control the style of reconstructed 3D scenes.
Conclusion: UniVerse provides an effective solution for robust 3D reconstruction from inconsistent multi-view images by leveraging diffusion models’ scene priors learned from large-scale data, overcoming limitations of per-view degradation modeling approaches.
Abstract: This paper tackles the challenge of robust reconstruction, i.e., the task of reconstructing a 3D scene from a set of inconsistent multi-view images. Some recent works have attempted to simultaneously remove image inconsistencies and perform reconstruction by integrating image degradation modeling into neural 3D scene representations. However, these methods rely heavily on dense observations for robustly optimizing model parameters. To address this issue, we propose to decouple robust reconstruction into two subtasks: restoration and reconstruction, which naturally simplifies the optimization process. To this end, we introduce UniVerse, a unified framework for robust reconstruction based on a video diffusion model. Specifically, UniVerse first converts inconsistent images into initial videos, then uses a specially designed video diffusion model to restore them into consistent images, and finally reconstructs the 3D scenes from these restored images. Compared with case-by-case per-view degradation modeling, the diffusion model learns a general scene prior from large-scale data, making it applicable to diverse image inconsistencies. Extensive experiments on both synthetic and real-world datasets demonstrate the strong generalization capability and superior performance of our method in robust reconstruction. Moreover, UniVerse can control the style of the reconstructed 3D scene. Project page: https://jin-cao-tma.github.io/UniVerse.github.io/
[169] An Efficient Deep Template Matching and In-Plane Pose Estimation Method via Template-Aware Dynamic Convolution
Ke Jia, Ji Zhou, Hanxin Li, Zhigan Zhou, Haojie Chu, Xiaojie Li
Main category: cs.CV
TL;DR: A lightweight end-to-end template matching framework that jointly estimates position, rotation, and scaling for industrial applications, achieving high precision with 14ms inference time.
Details
Motivation: Traditional template matching methods are inefficient due to exhaustive enumeration of angles and scales, while deep learning approaches lack explicit geometric pose modeling, making them unsuitable for real-world industrial deployment.
Method: Proposes joint localization and geometric regression with Template-Aware Dynamic Convolution Module (TDCM) that injects template features dynamically. Uses depthwise separable convolutions, pixel shuffle, rotation-shear augmentation with pseudo labels, and a lightweight refinement module for local optimization.
Result: The 3.07M parameter model achieves high precision with 14ms inference under compound transformations. Demonstrates strong robustness in small-template and multi-object scenarios.
Conclusion: The framework is highly suitable for real-time industrial applications, providing efficient and precise template matching with explicit geometric pose estimation.
Abstract: In industrial inspection and component alignment tasks, template matching requires efficient estimation of a target’s position and geometric state (rotation and scaling) under complex backgrounds to support precise downstream operations. Traditional methods rely on exhaustive enumeration of angles and scales, leading to low efficiency under compound transformations. Meanwhile, most deep learning-based approaches only estimate similarity scores without explicitly modeling geometric pose, making them inadequate for real-world deployment. To overcome these limitations, we propose a lightweight end-to-end framework that reformulates template matching as joint localization and geometric regression, outputting the center coordinates, rotation angle, and independent horizontal and vertical scales. A Template-Aware Dynamic Convolution Module (TDCM) dynamically injects template features at inference to guide generalizable matching. The compact network integrates depthwise separable convolutions and pixel shuffle for efficient matching. To enable geometric-annotation-free training, we introduce a rotation-shear-based augmentation strategy with structure-aware pseudo labels. A lightweight refinement module further improves angle and scale precision via local optimization. Experiments show our 3.07M model achieves high precision and 14ms inference under compound transformations. It also demonstrates strong robustness in small-template and multi-object scenarios, making it highly suitable for deployment in real-time industrial applications. The code is available at: https://github.com/ZhouJ6610/PoseMatch-TDCM.
[170] Look Less, Reason More: Rollout-Guided Adaptive Pixel-Space Reasoning
Xuchen Li, Xuzhao Li, Jiahui Gao, Renjie Pi, Shiyu Hu, Wentao Zhang
Main category: cs.CV
TL;DR: A framework for adaptive pixel reasoning in Vision-Language Models that dynamically determines when to use pixel-level operations based on query difficulty, achieving superior performance while significantly reducing unnecessary visual operations.
Details
Motivation: VLMs struggle with fine-grained visual understanding due to information loss during image encoding and insufficient attention to critical regions. Current pixel-level approaches are often overused, leading to inefficiency and distraction from irrelevant details.
Method: Operation-aware supervised fine-tuning followed by rollout-guided reinforcement learning that enables the VLM to determine when pixel operations should be invoked based on query difficulty, using feedback from the model’s own responses.
Result: Achieves 73.4% accuracy on HR-Bench 4K while maintaining only 20.1% tool usage ratio, improving accuracy and reducing tool usage by 66.5% compared to previous methods.
Conclusion: The proposed adaptive pixel reasoning framework enables VLMs to achieve better performance on fine-grained visual tasks while being significantly more efficient by dynamically controlling pixel-level operations.
Abstract: Vision-Language Models (VLMs) excel at many multimodal tasks, yet they frequently struggle with tasks requiring precise understanding and handling of fine-grained visual elements. This is mainly due to information loss during image encoding or insufficient attention to critical regions. Recent work has shown promise by incorporating pixel-level visual information into the reasoning process, enabling VLMs to access high-resolution visual details during their thought process. However, this pixel-level information is often overused, leading to inefficiency and distraction from irrelevant visual details. To address these challenges, we propose the first framework for adaptive pixel reasoning that dynamically determines necessary pixel-level operations based on the input query. Specifically, we first apply operation-aware supervised fine-tuning to establish baseline competence in textual reasoning and visual operations, then design a novel rollout-guided reinforcement learning framework relying on feedback of the model’s own responses, which enables the VLM to determine when pixel operations should be invoked based on query difficulty. Experiments on extensive multimodal reasoning benchmarks show that our model achieves superior performance while significantly reducing unnecessary visual operations. Impressively, our model achieves 73.4% accuracy on HR-Bench 4K while maintaining a tool usage ratio of only 20.1%, improving accuracy and simultaneously reducing tool usage by 66.5% compared to the previous methods.
[171] Uncovering Overconfident Failures in CXR Models via Augmentation-Sensitivity Risk Scoring
Han-Jay Shu, Wei-Ning Chiu, Shun-Ting Chang, Meng-Ping Huang, Takeshi Tohyama, Ahram Han, Po-Chih Kuo
Main category: cs.CV
TL;DR: ASRS framework identifies error-prone chest X-ray cases using rotation augmentations and embedding shifts, improving selective prediction without labels.
Details
Motivation: Address fairness and reliability concerns in chest radiograph interpretation by detecting subtle within-distribution errors that existing confidence calibration and OOD methods miss.
Method: Apply ±15°/±30° rotations to CXRs, measure embedding shifts using RAD-DINO encoder, and stratify samples into stability quartiles based on sensitivity scores.
Result: Highly sensitive cases show substantially lower recall (-0.2 to -0.3) despite high AUROC and confidence, enabling effective error detection.
Conclusion: ASRS provides label-free selective prediction for clinician review, improving fairness and safety in medical AI systems.
Abstract: Deep learning models achieve strong performance in chest radiograph (CXR) interpretation, yet fairness and reliability concerns persist. Models often show uneven accuracy across patient subgroups, leading to hidden failures not reflected in aggregate metrics. Existing error detection approaches – based on confidence calibration or out-of-distribution (OOD) detection – struggle with subtle within-distribution errors, while image- and representation-level consistency-based methods remain underexplored in medical imaging. We propose an augmentation-sensitivity risk scoring (ASRS) framework to identify error-prone CXR cases. ASRS applies clinically plausible rotations ($\pm 15^\circ$/$\pm 30^\circ$) and measures embedding shifts with the RAD-DINO encoder. Sensitivity scores stratify samples into stability quartiles, where highly sensitive cases show substantially lower recall ($-0.2$ to $-0.3$) despite high AUROC and confidence. ASRS provides a label-free means for selective prediction and clinician review, improving fairness and safety in medical AI.
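The scoring recipe is concrete enough to sketch: embed the radiograph and its rotated variants with a frozen encoder and average the embedding shifts. The cosine-distance aggregation and the toy encoder below are assumptions standing in for RAD-DINO.

```python
# Illustrative sensitivity score: mean cosine distance between embeddings of a
# CXR and its rotated variants; the toy encoder stands in for RAD-DINO.
import numpy as np
from scipy.ndimage import rotate

def asrs_score(embed, image, angles=(-30, -15, 15, 30)):
    e0 = embed(image)
    dists = []
    for a in angles:
        ea = embed(rotate(image, a, reshape=False, mode="nearest"))
        cos = e0 @ ea / (np.linalg.norm(e0) * np.linalg.norm(ea) + 1e-8)
        dists.append(1.0 - cos)
    return float(np.mean(dists))

embed = lambda img: img.reshape(-1)[::64].astype(float)  # toy encoder
rng = np.random.default_rng(0)
score = asrs_score(embed, rng.random((224, 224)))
print(score)   # rank scores over a test set; flag the top quartile for review
```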
[172] FreeViS: Training-free Video Stylization with Inconsistent References
Jiacong Xu, Yiqun Mei, Ke Zhang, Vishal M. Patel
Main category: cs.CV
TL;DR: FreeViS is a training-free video stylization framework that generates stylized videos with rich style details and strong temporal coherence by integrating multiple stylized references into a pretrained image-to-video model.
Details
Motivation: Existing video stylization methods either suffer from temporal inconsistency when applied frame-by-frame, or require expensive paired video data and computational resources for training dedicated models.
Method: Integrates multiple stylized references to a pretrained I2V model, uses high-frequency compensation to constrain content layout and motion, and employs flow-based motion cues to preserve style textures in low-saliency regions.
Result: FreeViS delivers higher stylization fidelity and superior temporal consistency compared to recent baselines, achieving strong human preference without introducing flickers and stutters.
Conclusion: The training-free pipeline offers a practical and economic solution for high-quality, temporally coherent video stylization.
Abstract: Video stylization plays a key role in content creation, but it remains a challenging problem. Naïvely applying image stylization frame-by-frame hurts temporal consistency and reduces style richness. Alternatively, training a dedicated video stylization model typically requires paired video data and is computationally expensive. In this paper, we propose FreeViS, a training-free video stylization framework that generates stylized videos with rich style details and strong temporal coherence. Our method integrates multiple stylized references to a pretrained image-to-video (I2V) model, effectively mitigating the propagation errors observed in prior works, without introducing flickers and stutters. In addition, it leverages high-frequency compensation to constrain the content layout and motion, together with flow-based motion cues to preserve style textures in low-saliency regions. Through extensive evaluations, FreeViS delivers higher stylization fidelity and superior temporal consistency, outperforming recent baselines and achieving strong human preference. Our training-free pipeline offers a practical and economic solution for high-quality, temporally coherent video stylization. The code and videos can be accessed via https://xujiacong.github.io/FreeViS/
[173] MedQ-Bench: Evaluating and Exploring Medical Image Quality Assessment Abilities in MLLMs
Jiyao Liu, Jinjie Wei, Wanying Qu, Chenglong Ma, Junzhi Ning, Yunheng Li, Ying Chen, Xinzhe Luo, Pengcheng Chen, Xin Gao, Ming Hu, Huihui Xu, Xin Wang, Shujian Gao, Dingkang Yang, Zhongying Deng, Jin Ye, Lihao Liu, Junjun He, Ningsheng Xu
Main category: cs.CV
TL;DR: MedQ-Bench is a new benchmark for evaluating medical image quality assessment using multi-modal LLMs, featuring perception and reasoning tasks across 5 imaging modalities and 40+ quality attributes.
Details
Motivation: Existing medical IQA approaches use scalar score-based metrics that don't capture human-like reasoning processes used by clinical experts, creating a gap for AI evaluation.
Method: Created MedQ-Bench with two tasks: MedQ-Perception (low-level visual attributes) and MedQ-Reasoning (no-reference and comparison tasks). Uses 2,600 perceptual queries and 708 reasoning assessments across diverse image sources including clinical, simulated degradations, and AI-generated images.
Result: Evaluation of 14 state-of-the-art MLLMs shows they have preliminary but unstable perceptual and reasoning skills, with insufficient accuracy for reliable clinical use. Human-AI alignment validation confirms this gap.
Conclusion: Targeted optimization of MLLMs is needed for medical IQA. MedQ-Bench aims to catalyze further exploration and unlock MLLMs’ potential for medical image quality evaluation.
Abstract: Medical Image Quality Assessment (IQA) serves as the first-mile safety gate for clinical AI, yet existing approaches remain constrained by scalar, score-based metrics and fail to reflect the descriptive, human-like reasoning process central to expert evaluation. To address this gap, we introduce MedQ-Bench, a comprehensive benchmark that establishes a perception-reasoning paradigm for language-based evaluation of medical image quality with Multi-modal Large Language Models (MLLMs). MedQ-Bench defines two complementary tasks: (1) MedQ-Perception, which probes low-level perceptual capability via human-curated questions on fundamental visual attributes; and (2) MedQ-Reasoning, encompassing both no-reference and comparison reasoning tasks, aligning model evaluation with human-like reasoning on image quality. The benchmark spans five imaging modalities and over forty quality attributes, totaling 2,600 perceptual queries and 708 reasoning assessments, covering diverse image sources including authentic clinical acquisitions, images with simulated degradations via physics-based reconstructions, and AI-generated images. To evaluate reasoning ability, we propose a multi-dimensional judging protocol that assesses model outputs along four complementary axes. We further conduct rigorous human-AI alignment validation by comparing LLM-based judgement with radiologists. Our evaluation of 14 state-of-the-art MLLMs demonstrates that models exhibit preliminary but unstable perceptual and reasoning skills, with insufficient accuracy for reliable clinical use. These findings highlight the need for targeted optimization of MLLMs in medical IQA. We hope that MedQ-Bench will catalyze further exploration and unlock the untapped potential of MLLMs for medical image quality evaluation.
[174] Holistic Order Prediction in Natural Scenes
Pierre Musacchio, Hyunmin Lee, Jaesik Park
Main category: cs.CV
TL;DR: InstaFormer is a network that predicts full occlusion and depth orderings for all instances in a scene from a single RGB image input, using object queries and latent mask descriptors.
Details
Motivation: Existing systems for understanding instance-wise geometries require expensive input formats (category labels, segmentation masks) and have high inference costs (quadratic forward passes), which limits their practical application.
Method: InstaFormer uses interactions between object queries and latent mask descriptors that semantically represent the same objects while carrying complementary information, enabling holistic order prediction in a single forward pass.
Result: The approach is comprehensively benchmarked and ablated to demonstrate its effectiveness in predicting occlusion and depth orderings.
Conclusion: InstaFormer provides an efficient solution for instance-wise geometry understanding by eliminating the need for expensive inputs and reducing inference costs to a single forward pass.
Abstract: Even in controlled settings, understanding instance-wise geometries is a challenging task for a wide range of visual models. Although specialized systems exist, modern arts rely on expensive input formats (category labels, binary segmentation masks) and inference costs (a quadratic amount of forward passes). We mitigate these limitations by proposing InstaFormer, a network capable of holistic order prediction. That is, solely given an input RGB image, InstaFormer returns the full occlusion and depth orderings for all the instances in the scene in a single forward pass. At its core, InstaFormer relies on interactions between object queries and latent mask descriptors that semantically represent the same objects while carrying complementary information. We comprehensively benchmark and ablate our approach to highlight its effectiveness. Our code and models are open-source and available at this URL: https://github.com/SNU-VGILab/InstaOrder.
[175] PyramidStyler: Transformer-Based Neural Style Transfer with Pyramidal Positional Encoding and Reinforcement Learning
Raahul Krishna Durairaju, K. Saruladha
Main category: cs.CV
TL;DR: PyramidStyler is a transformer framework with Pyramidal Positional Encoding that improves neural style transfer by capturing multi-scale details while reducing computational load, achieving real-time high-quality artistic rendering.
Details
Motivation: Existing CNN and transformer-based neural style transfer models struggle with scaling to complex styles and high-resolution inputs efficiently.
Method: Transformer framework with Pyramidal Positional Encoding (hierarchical multi-scale encoding) and reinforcement learning for dynamic stylization optimization.
Result: Reduced content loss by 62.6% (to 2.07) and style loss by 57.4% (to 0.86) after 4000 epochs with 1.39s inference time. With RL: content loss 2.03, style loss 0.75, inference time 1.40s.
Conclusion: PyramidStyler enables real-time, high-quality artistic rendering with broad applications in media and design.
Abstract: Neural Style Transfer (NST) has evolved from Gatys et al.’s (2015) CNN-based algorithm, enabling AI-driven artistic image synthesis. However, existing CNN and transformer-based models struggle to scale efficiently to complex styles and high-resolution inputs. We introduce PyramidStyler, a transformer framework with Pyramidal Positional Encoding (PPE): a hierarchical, multi-scale encoding that captures both local details and global context while reducing computational load. We further incorporate reinforcement learning to dynamically optimize stylization, accelerating convergence. Trained on Microsoft COCO and WikiArt, PyramidStyler reduces content loss by 62.6% (to 2.07) and style loss by 57.4% (to 0.86) after 4000 epochs, achieving 1.39 s inference, and yields further improvements (content 2.03; style 0.75) with minimal speed penalty (1.40 s) when using RL. These results demonstrate real-time, high-quality artistic rendering, with broad applications in media and design.
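The abstract does not define PPE precisely, so the following sketch is speculative: one plausible reading is sinusoidal positional grids computed at several resolutions and merged, so coarse levels carry global context and fine levels carry local detail.

```python
# Speculative PPE sketch: sinusoidal grids at several resolutions, upsampled
# and averaged, so coarse levels add global context to fine local detail.
import torch
import torch.nn.functional as F

def sincos_grid(h, w, dim):
    """Standard 2D sinusoidal encoding of shape (dim, h, w)."""
    assert dim % 4 == 0
    k = dim // 4
    freqs = torch.exp(torch.arange(k) * (-torch.log(torch.tensor(10000.0)) / k))
    ys = torch.arange(h).float()[:, None] * freqs           # (h, k)
    xs = torch.arange(w).float()[:, None] * freqs           # (w, k)
    pe_y = torch.cat([ys.sin(), ys.cos()], dim=-1)          # (h, dim/2)
    pe_x = torch.cat([xs.sin(), xs.cos()], dim=-1)          # (w, dim/2)
    pe = torch.cat([pe_y[:, None].expand(h, w, -1),
                    pe_x[None].expand(h, w, -1)], dim=-1)   # (h, w, dim)
    return pe.permute(2, 0, 1)

def pyramidal_pe(h, w, dim, levels=3):
    pe = torch.zeros(dim, h, w)
    for lvl in range(levels):
        f = 2 ** lvl
        coarse = sincos_grid(max(h // f, 1), max(w // f, 1), dim)
        pe += F.interpolate(coarse[None], size=(h, w), mode="bilinear",
                            align_corners=False)[0]
    return pe / levels

print(pyramidal_pe(32, 32, 64).shape)       # torch.Size([64, 32, 32])
```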
[176] LOBE-GS: Load-Balanced and Efficient 3D Gaussian Splatting for Large-Scale Scene Reconstruction
Sheng-Hsiang Hung, Ting-Yu Yen, Wei-Fang Sun, Simon See, Shih-Hsuan Hung, Hung-Kuo Chu
Main category: cs.CV
TL;DR: LoBE-GS is a load-balanced and efficient 3D Gaussian Splatting framework that addresses scalability issues in large-scale 3D scene reconstruction by introducing depth-aware partitioning, optimization-based load balancing, and lightweight optimization techniques.
Details
Motivation: Scaling 3D Gaussian Splatting to large unbounded scenes like city blocks is challenging due to memory constraints and inefficient partitioning in existing divide-and-conquer methods, which suffer from load imbalance and high overhead in coarse-to-fine pipelines.
Method: Introduces depth-aware partitioning that reduces preprocessing time, optimization-based strategy to balance visible Gaussians across blocks, and two lightweight techniques: visibility cropping and selective densification to reduce training costs.
Result: Achieves up to 2× faster end-to-end training time than state-of-the-art baselines while maintaining reconstruction quality and enabling scalability to scenes that are infeasible with vanilla 3DGS.
Conclusion: LoBE-GS successfully addresses the scalability limitations of 3D Gaussian Splatting for large-scale scenes through efficient partitioning and load balancing strategies, making large urban and outdoor scene reconstruction more practical.
Abstract: 3D Gaussian Splatting (3DGS) has established itself as an efficient representation for real-time, high-fidelity 3D scene reconstruction. However, scaling 3DGS to large and unbounded scenes such as city blocks remains difficult. Existing divide-and-conquer methods alleviate memory pressure by partitioning the scene into blocks, but introduce new bottlenecks: (i) partitions suffer from severe load imbalance since uniform or heuristic splits do not reflect actual computational demands, and (ii) coarse-to-fine pipelines fail to exploit the coarse stage efficiently, often reloading the entire model and incurring high overhead. In this work, we introduce LoBE-GS, a novel Load-Balanced and Efficient 3D Gaussian Splatting framework, that re-engineers the large-scale 3DGS pipeline. LoBE-GS introduces a depth-aware partitioning method that reduces preprocessing from hours to minutes, an optimization-based strategy that balances visible Gaussians – a strong proxy for computational load – across blocks, and two lightweight techniques, visibility cropping and selective densification, to further reduce training cost. Evaluations on large-scale urban and outdoor datasets show that LoBE-GS consistently achieves up to $2\times$ faster end-to-end training time than state-of-the-art baselines, while maintaining reconstruction quality and enabling scalability to scenes infeasible with vanilla 3DGS.
[177] Pack and Force Your Memory: Long-form and Consistent Video Generation
Xiaofei Wu, Guozhen Zhang, Zhiyong Xu, Yuan Zhou, Qinglin Lu, Xuming He
Main category: cs.CV
TL;DR: The paper introduces MemoryPack for dynamic context modeling and Direct Forcing to mitigate error accumulation in long-form video generation, improving temporal consistency and training-inference alignment.
Details
Motivation: Long-form video generation faces challenges with capturing long-range dependencies and preventing error accumulation in autoregressive decoding, which limits practical usability.
Method: Proposes MemoryPack (learnable context-retrieval using text and image guidance) for joint short/long-term dependency modeling, and Direct Forcing (single-step approximation) for better training-inference alignment.
Result: Achieves minute-level temporal consistency, scales gracefully with video length, maintains linear complexity, and reduces error propagation during inference.
Conclusion: MemoryPack and Direct Forcing together enhance context consistency and reliability of long-form video generation, advancing practical usability of autoregressive video models.
Abstract: Long-form video generation presents a dual challenge: models must capture long-range dependencies while preventing the error accumulation inherent in autoregressive decoding. To address these challenges, we make two contributions. First, for dynamic context modeling, we propose MemoryPack, a learnable context-retrieval mechanism that leverages both textual and image information as global guidance to jointly model short- and long-term dependencies, achieving minute-level temporal consistency. This design scales gracefully with video length, preserves computational efficiency, and maintains linear complexity. Second, to mitigate error accumulation, we introduce Direct Forcing, an efficient single-step approximating strategy that improves training-inference alignment and thereby curtails error propagation during inference. Together, MemoryPack and Direct Forcing substantially enhance the context consistency and reliability of long-form video generation, advancing the practical usability of autoregressive video models.
[178] Calibrating the Full Predictive Class Distribution of 3D Object Detectors for Autonomous Driving
Cornelius Schröder, Marius-Raphael Schlüter, Markus Lienkamp
Main category: cs.CV
TL;DR: This paper proposes methods for confidence calibration in 3D object detectors, focusing on both dominant and secondary class predictions through auxiliary loss terms and post-hoc calibration techniques.
Details
Motivation: Precise object detection and uncertainty estimation are critical for autonomous systems' self-aware and safe operation, particularly for 3D object detectors' classification task.
Method: Proposed two auxiliary regularizing loss terms for calibration of either dominant prediction or full prediction vector, and evaluated post-hoc and train-time methods including isotonic regression on CenterPoint, PillarNet and DSVT-Pillar detectors.
Result: Combining the full class prediction calibration loss term with isotonic regression achieved best calibration for CenterPoint and PillarNet for both dominant and secondary predictions, but DSVT-Pillar could not be jointly calibrated using the same method.
Conclusion: Different 3D object detectors require tailored calibration approaches, with the proposed full prediction vector calibration method combined with isotonic regression being effective for some architectures but not universally applicable.
Abstract: In autonomous systems, precise object detection and uncertainty estimation are critical for self-aware and safe operation. This work addresses confidence calibration for the classification task of 3D object detectors. We argue that it is necessary to consider the calibration of the full predictive confidence distribution over all classes, and we deduce a metric which captures the calibration of dominant and secondary class predictions. We propose two auxiliary regularizing loss terms which introduce either calibration of the dominant prediction or the full prediction vector as a training goal. We evaluate a range of post-hoc and train-time methods for CenterPoint, PillarNet and DSVT-Pillar and find that combining our loss term, which regularizes for calibration of the full class prediction, and isotonic regression leads to the best calibration of CenterPoint and PillarNet with respect to both dominant and secondary class predictions. We further find that DSVT-Pillar cannot be jointly calibrated for dominant and secondary predictions using the same method.
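To make the post-hoc step concrete, here is a minimal sketch of one-vs-rest isotonic regression over the full class-probability vector; the paper's auxiliary loss terms and detector integration are not reproduced, and the data below is synthetic.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def fit_isotonic_calibrators(probs, labels, n_classes):
    """Fit one isotonic regressor per class in a one-vs-rest fashion.

    probs:  (N, n_classes) raw predicted class probabilities
    labels: (N,) integer ground-truth class indices
    """
    calibrators = []
    for c in range(n_classes):
        target = (labels == c).astype(float)          # 1 if class c, else 0
        iso = IsotonicRegression(out_of_bounds="clip")
        iso.fit(probs[:, c], target)                  # monotone map: conf -> empirical freq
        calibrators.append(iso)
    return calibrators

def calibrate(probs, calibrators):
    """Apply per-class calibrators and renormalize the full vector."""
    cal = np.column_stack([iso.predict(probs[:, c])
                           for c, iso in enumerate(calibrators)])
    cal = np.clip(cal, 1e-8, 1.0)
    return cal / cal.sum(axis=1, keepdims=True)       # keep a valid distribution

# Toy usage with synthetic, over-confident predictions
rng = np.random.default_rng(0)
probs = rng.dirichlet(alpha=[0.5, 0.5, 0.5], size=1000)
labels = rng.integers(0, 3, size=1000)
cals = fit_isotonic_calibrators(probs, labels, n_classes=3)
print(calibrate(probs[:5], cals))
```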
[179] Leveraging Prior Knowledge of Diffusion Model for Person Search
Giyeol Kim, Sooyoung Yang, Jihyong Oh, Myungjoo Kang, Chanho Eom
Main category: cs.CV
TL;DR: DiffPS is a novel person search framework that leverages pre-trained diffusion models to address optimization conflicts between detection and re-identification tasks, achieving state-of-the-art performance.
Details
Motivation: Existing methods use ImageNet pre-trained backbones that are suboptimal for person search's complex spatial context and fine-grained identity cues, and suffer from conflicting optimization objectives between detection and re-identification tasks.
Method: Proposes three specialized modules: Diffusion-Guided Region Proposal Network (DGRPN) for person localization, Multi-Scale Frequency Refinement Network (MSFRN) to reduce shape bias, and Semantic-Adaptive Feature Aggregation Network (SFAN) to utilize text-aligned diffusion features.
Result: DiffPS sets a new state-of-the-art performance on CUHK-SYSU and PRW datasets.
Conclusion: Leveraging diffusion model priors effectively resolves optimization conflicts in person search and enables superior performance through specialized modules for localization, bias mitigation, and feature aggregation.
Abstract: Person search aims to jointly perform person detection and re-identification by localizing and identifying a query person within a gallery of uncropped scene images. Existing methods predominantly utilize ImageNet pre-trained backbones, which may be suboptimal for capturing the complex spatial context and fine-grained identity cues necessary for person search. Moreover, they rely on a shared backbone feature for both person detection and re-identification, leading to suboptimal features due to conflicting optimization objectives. In this paper, we propose DiffPS (Diffusion Prior Knowledge for Person Search), a novel framework that leverages a pre-trained diffusion model while eliminating the optimization conflict between two sub-tasks. We analyze key properties of diffusion priors and propose three specialized modules: (i) Diffusion-Guided Region Proposal Network (DGRPN) for enhanced person localization, (ii) Multi-Scale Frequency Refinement Network (MSFRN) to mitigate shape bias, and (iii) Semantic-Adaptive Feature Aggregation Network (SFAN) to leverage text-aligned diffusion features. DiffPS sets a new state-of-the-art on CUHK-SYSU and PRW.
[180] Flow-Matching Guided Deep Unfolding for Hyperspectral Image Reconstruction
Yi Ai, Yuanhao Cai, Yulun Zhang, Xiaokang Yang
Main category: cs.CV
TL;DR: FMU integrates flow matching with deep unfolding for hyperspectral image reconstruction, using a mean velocity loss to improve accuracy and outperform existing methods.
Details
Motivation: Hyperspectral imaging is costly and reconstruction from compressed measurements suffers from degradation and loss of spectral details, requiring better reconstruction methods.
Method: Proposes Flow-Matching-guided Unfolding network (FMU) that combines flow matching generative prior with deep unfolding framework, enhanced by mean velocity loss for global consistency.
Result: Extensive experiments on simulated and real datasets show FMU significantly outperforms existing approaches in reconstruction quality.
Conclusion: FMU successfully integrates flow matching with deep unfolding, providing a robust and accurate solution for hyperspectral image reconstruction.
Abstract: Hyperspectral imaging (HSI) provides rich spatial-spectral information but remains costly to acquire due to hardware limitations and the difficulty of reconstructing three-dimensional data from compressed measurements. Although compressive sensing systems such as CASSI improve efficiency, accurate reconstruction is still challenged by severe degradation and loss of fine spectral details. We propose the Flow-Matching-guided Unfolding network (FMU), which, to our knowledge, is the first to integrate flow matching into HSI reconstruction by embedding its generative prior within a deep unfolding framework. To further strengthen the learned dynamics, we introduce a mean velocity loss that enforces global consistency of the flow, leading to a more robust and accurate reconstruction. This hybrid design leverages the interpretability of optimization-based methods and the generative capacity of flow matching. Extensive experiments on both simulated and real datasets show that FMU significantly outperforms existing approaches in reconstruction quality. Code and models will be available at https://github.com/YiAi03/FMU.
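As a rough illustration of the training signal, here is a sketch of a conditional flow-matching loss with an added "mean velocity" regularizer; the regularizer's exact form in FMU is not given in the summary, so the version below (averaging predicted velocities at two path points against the global displacement) is an assumption.

```python
import torch

def flow_matching_losses(model, x0, x1, lam=0.1):
    """Conditional flow matching loss plus a hedged 'mean velocity' term.

    x0: noise samples, x1: data samples, model(x_t, t) predicts velocity.
    The mean-velocity term is an illustrative guess: it pushes the average
    of predicted velocities at two path points toward the global
    displacement (x1 - x0), encouraging consistency of the flow.
    """
    b = x0.shape[0]
    t = torch.rand(b, 1, device=x0.device)
    xt = (1 - t) * x0 + t * x1                 # linear interpolation path
    target_v = x1 - x0                         # ground-truth velocity of the path
    fm_loss = ((model(xt, t) - target_v) ** 2).mean()

    # "Mean velocity" regularizer (assumed form): sample a second time point
    s = torch.rand(b, 1, device=x0.device)
    xs = (1 - s) * x0 + s * x1
    mean_v = 0.5 * (model(xt, t) + model(xs, s))
    mv_loss = ((mean_v - target_v) ** 2).mean()
    return fm_loss + lam * mv_loss

# Toy velocity network and usage
net = torch.nn.Sequential(torch.nn.Linear(17, 64), torch.nn.SiLU(),
                          torch.nn.Linear(64, 16))
model = lambda x, t: net(torch.cat([x, t], dim=1))
loss = flow_matching_losses(model, torch.randn(8, 16), torch.randn(8, 16))
loss.backward()
print(float(loss))
```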
[181] Automated Defect Detection for Mass-Produced Electronic Components Based on YOLO Object Detection Models
Wei-Lung Mao, Chun-Chi Wang, Po-Heng Chou, Yen-Ting Liu
Main category: cs.CV
TL;DR: Automated defect detection system for DIP components using YOLO models and ConSinGAN data augmentation, achieving 95.50% accuracy with YOLOv7.
Details
Motivation: Conventional defect detection is time-consuming and labor-intensive, creating burden on quality inspection personnel and making quality management difficult.
Method: Used ConSinGAN to generate dataset for training, investigated four YOLO models (v3, v4, v7, v9) with and without ConSinGAN augmentation, and developed SCADA system with sensor architecture.
Result: YOLOv7 with ConSinGAN achieved best performance: 95.50% accuracy and 285 ms detection time, significantly outperforming threshold-based approaches.
Conclusion: The proposed automated defect detection system can be easily established for various defect types and works well even with insufficient defect data.
Abstract: Since the defect detection of conventional industry components is time-consuming and labor-intensive, it places a significant burden on quality inspection personnel and makes it difficult to manage product quality. In this paper, we propose an automated defect detection system for the dual in-line package (DIP) that is widely used in industry, using digital camera optics and a deep learning (DL)-based model. The two most common defect categories of DIP are examined: (1) surface defects, and (2) pin-leg defects. However, the lack of defective component images poses a challenge for detection tasks. To solve this problem, ConSinGAN is used to generate a suitably sized dataset for training and testing. Four varieties of the YOLO model are investigated (v3, v4, v7, and v9), both in isolation and with ConSinGAN augmentation. The proposed YOLOv7 with ConSinGAN is superior to the other YOLO versions, achieving an accuracy of 95.50% and a detection time of 285 ms, and far outperforms threshold-based approaches. In addition, a supervisory control and data acquisition (SCADA) system is developed, and the associated sensor architecture is described. The proposed automated defect detection system can be easily established for numerous types of defects, even with insufficient defect data.
[182] Foundation Visual Encoders Are Secretly Few-Shot Anomaly Detectors
Guangyao Zhai, Yue Zhou, Xinyan Deng, Lars Heckler, Nassir Navab, Benjamin Busam
Main category: cs.CV
TL;DR: FoundAD is a few-shot anomaly detection method that uses foundation visual encoders and learns a nonlinear projection operator to identify anomalies by measuring embedding differences from the natural image manifold.
Details
Motivation: Few-shot anomaly detection is challenging due to limited samples and category-agnostic conditions. Foundation visual encoders trained on large datasets can learn general normal image distributions, but existing methods struggle with accurate differentiation between normal and abnormal features.
Method: The method learns a nonlinear projection operator onto the natural image manifold. It leverages the observation that anomaly amount correlates with embedding differences, using this to characterize and identify out-of-distribution regions in images.
Result: Extensive experiments show FoundAD supports multi-class detection and achieves competitive performance while using substantially fewer parameters than prior methods. The approach works with multiple foundation encoders including DINOv3.
Conclusion: This work broadens the perspective on foundation features and advances few-shot anomaly detection by providing an effective tool for identifying anomalies through embedding differences from the natural image manifold.
Abstract: Few-shot anomaly detection streamlines and simplifies industrial safety inspection. However, limited samples make accurate differentiation between normal and abnormal features challenging, and even more so under category-agnostic conditions. Large-scale pre-training of foundation visual encoders has advanced many fields, as the enormous quantity of data helps to learn the general distribution of normal images. We observe that the amount of anomaly in an image directly correlates with the difference in the learnt embeddings and utilize this to design a few-shot anomaly detector termed FoundAD. This is done by learning a nonlinear projection operator onto the natural image manifold. The simple operator acts as an effective tool for anomaly detection to characterize and identify out-of-distribution regions in an image. Extensive experiments show that our approach supports multi-class detection and achieves competitive performance while using substantially fewer parameters than prior methods. Backed by evaluations with multiple foundation encoders, including the recent DINOv3, we believe this idea broadens the perspective on foundation features and advances the field of few-shot anomaly detection.
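The core scoring idea, distance between an embedding and its projection onto the normal-image manifold, can be sketched as follows; the projector architecture and training budget are illustrative stand-ins, not the paper's configuration.

```python
import torch
import torch.nn as nn

class ProjectionOperator(nn.Module):
    """Nonlinear projector trained to map patch embeddings back onto the
    manifold of normal images (a minimal sketch of the FoundAD idea)."""
    def __init__(self, dim=768, hidden=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(),
                                 nn.Linear(hidden, dim))

    def forward(self, feats):
        return self.net(feats)

def train_on_few_shot_normals(proj, normal_feats, steps=200, lr=1e-3):
    """Fit the projector so that normal embeddings are near-fixed points."""
    opt = torch.optim.Adam(proj.parameters(), lr=lr)
    for _ in range(steps):
        recon = proj(normal_feats)
        loss = ((recon - normal_feats) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()

def anomaly_score(proj, feats):
    """Score = distance between an embedding and its projection; large
    residuals indicate out-of-distribution (anomalous) patches."""
    with torch.no_grad():
        return ((proj(feats) - feats) ** 2).sum(dim=-1).sqrt()

# Toy usage with random stand-ins for frozen foundation-encoder features
normal = torch.randn(64, 768)          # few-shot normal patch embeddings
proj = ProjectionOperator()
train_on_few_shot_normals(proj, normal)
print(anomaly_score(proj, torch.randn(4, 768)))
```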
[183] ClustViT: Clustering-based Token Merging for Semantic Segmentation
Fabio Montello, Ronja Güldenring, Lazaros Nalpantidis
Main category: cs.CV
TL;DR: ClustViT reduces Vision Transformer computational complexity for semantic segmentation by merging similar tokens using a trainable Cluster module and restoring details with a Regenerator module, achieving significant speedup with comparable accuracy.
Details
Motivation: Vision Transformers have quadratic attention complexity that limits their practical use in real-world robotic systems, especially for dense prediction tasks like semantic segmentation where existing token merging approaches designed for classification are less effective.
Method: Propose ClustViT architecture with a trainable Cluster module that merges similar tokens guided by pseudo-clusters from segmentation masks, followed by a Regenerator module that restores fine details for downstream processing.
Result: Achieves up to 2.18x fewer GFLOPs and 1.64x faster inference on three datasets while maintaining comparable segmentation accuracy to standard Vision Transformers.
Conclusion: ClustViT effectively addresses the computational complexity limitations of Vision Transformers for semantic segmentation tasks, making them more practical for real-world robotic applications while preserving performance.
Abstract: Vision Transformers can achieve high accuracy and strong generalization across various contexts, but their practical applicability on real-world robotic systems is limited due to their quadratic attention complexity. Recent works have focused on dynamically merging tokens according to the image complexity. Token merging works well for classification but is less suited to dense prediction. We propose ClustViT, where we expand upon the Vision Transformer (ViT) backbone and address semantic segmentation. Within our architecture, a trainable Cluster module merges similar tokens along the network guided by pseudo-clusters from segmentation masks. Subsequently, a Regenerator module restores fine details for downstream heads. Our approach achieves up to 2.18x fewer GFLOPs and 1.64x faster inference on three different datasets, with comparable segmentation accuracy. Our code and models will be made publicly available.
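A minimal sketch of the merge-then-regenerate idea: tokens sharing a cluster id are averaged before attention and broadcast back afterwards. The trainable clustering and pseudo-cluster supervision are omitted; `assign` is assumed given.

```python
import torch

def merge_tokens(tokens, assign, num_clusters):
    """Average tokens within each cluster (the 'Cluster module' effect).

    tokens: (N, D) token features; assign: (N,) cluster id per token.
    Returns (num_clusters, D) merged tokens.
    """
    d = tokens.shape[1]
    sums = torch.zeros(num_clusters, d).index_add_(0, assign, tokens)
    counts = torch.zeros(num_clusters).index_add_(
        0, assign, torch.ones(assign.shape[0])).clamp(min=1).unsqueeze(1)
    return sums / counts

def regenerate_tokens(merged, assign):
    """Broadcast each merged token back to its member positions, a crude
    stand-in for the Regenerator module that restores per-token detail."""
    return merged[assign]

# Toy usage: 8 tokens, 3 clusters
tokens = torch.randn(8, 16)
assign = torch.tensor([0, 0, 1, 2, 1, 0, 2, 2])
merged = merge_tokens(tokens, assign, num_clusters=3)   # attention now runs on 3 tokens
restored = regenerate_tokens(merged, assign)            # back to 8 positions
print(merged.shape, restored.shape)
```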
[184] Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs
Yongyi Su, Haojie Zhang, Shijie Li, Nanqing Liu, Jingyi Liao, Junyi Pan, Yuan Liu, Xiaofen Xing, Chong Sun, Chen Li, Nancy F. Chen, Shuicheng Yan, Xulei Yang, Xun Xu
Main category: cs.CV
TL;DR: PaDT introduces a unified paradigm for MLLMs to directly generate both textual and visual outputs using Visual Reference Tokens (VRTs) interleaved with text tokens, enabling dense prediction tasks like detection and segmentation.
Details
Motivation: Existing MLLM approaches for vision tasks rely on indirect representations like generating coordinates as text, which limits performance and prevents dense prediction tasks like segmentation.
Method: Proposes Patch-as-Decodable Token (PaDT) with Visual Reference Tokens (VRTs) from visual patch embeddings, interleaved with LLM output tokens. Uses lightweight decoder for predictions and dynamic embedding table expansion. Includes training strategy with random VRT selection and per-token cross-entropy loss.
Result: Achieves state-of-the-art performance across four visual perception and understanding tasks, outperforming significantly larger MLLM models.
Conclusion: PaDT provides a unified framework that enables MLLMs to directly generate diverse visual outputs, overcoming limitations of indirect representations and enabling dense prediction tasks.
Abstract: Multimodal large language models (MLLMs) have advanced rapidly in recent years. However, existing approaches for vision tasks often rely on indirect representations, such as generating coordinates as text for detection, which limits performance and prevents dense prediction tasks like segmentation. To overcome these challenges, we introduce Patch-as-Decodable Token (PaDT), a unified paradigm that enables MLLMs to directly generate both textual and diverse visual outputs. Central to PaDT are Visual Reference Tokens (VRTs), derived from visual patch embeddings of query images and interleaved seamlessly with LLM’s output textual tokens. A lightweight decoder then transforms LLM’s outputs into detection, segmentation, and grounding predictions. Unlike prior methods, PaDT processes VRTs independently at each forward pass and dynamically expands the embedding table, thus improving localization and differentiation among similar objects. We further tailor a training strategy for PaDT by randomly selecting VRTs for supervised fine-tuning and introducing a robust per-token cross-entropy loss. Our empirical studies across four visual perception and understanding tasks show that PaDT consistently achieves state-of-the-art performance, even compared with significantly larger MLLM models. The code is available at https://github.com/Gorilla-Lab-SCUT/PaDT.
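The dynamic embedding-table expansion can be sketched as appending the query image's patch embeddings to the LM's output vocabulary at each forward pass; PaDT's projection layers and lightweight decoder are omitted here, so this is only a data-flow illustration.

```python
import torch

def expand_embedding_table(text_embeddings, patch_embeddings):
    """Append per-image Visual Reference Tokens to the output vocabulary.

    text_embeddings:  (V, D) frozen LM token embeddings
    patch_embeddings: (P, D) visual patch embeddings of the current query image
    Returns the expanded table and the id range assigned to the VRTs.
    """
    table = torch.cat([text_embeddings, patch_embeddings], dim=0)
    vrt_ids = torch.arange(text_embeddings.shape[0], table.shape[0])
    return table, vrt_ids

def decode_step_logits(hidden, table):
    """Score text tokens and VRTs jointly, so the LM can emit either."""
    return hidden @ table.t()                    # (B, V + P) logits

# Toy usage: vocabulary of 100 tokens, 9 patches, hidden size 32
text_emb = torch.randn(100, 32)
patch_emb = torch.randn(9, 32)                   # recomputed each forward pass
table, vrt_ids = expand_embedding_table(text_emb, patch_emb)
logits = decode_step_logits(torch.randn(2, 32), table)
print(logits.shape, vrt_ids.tolist())            # (2, 109), ids 100..108
```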
[185] TriAlignXA: An Explainable Trilemma Alignment Framework for Trustworthy Agri-product Grading
Jianfei Xie, Ziyang Li
Main category: cs.CV
TL;DR: This paper addresses the trust deficit in online fruit/vegetable e-commerce by proposing a ‘Trust Pyramid’ model and ‘Triangular Trust Index’ (TTI) to quantify trade-offs in agricultural grading. It introduces TriAlignXA, an explainable AI framework that redefines algorithms as transparent decision-making providers rather than decision-makers.
Details
Motivation: The 'trust deficit' in online produce e-commerce stems from the inability of digital transactions to provide direct sensory perception of product quality, creating a fundamental barrier to consumer trust in online agricultural markets.
Method: The study constructs a ‘Trust Pyramid’ model through ‘dual-source verification’ and proposes the TriAlignXA framework with three optimization engines: Bio-Adaptive Engine for quality description, Timeliness Optimization Engine for processing efficiency, and Economic Optimization Engine for cost control. It also uses a ‘Pre-Mapping Mechanism’ with QR codes for transparent quality information.
Result: Experiments on grading tasks demonstrate significantly higher accuracy than baseline models. The framework successfully balances the ‘impossible triangle’ of agricultural product grading (biological characteristics, timeliness, and economic viability), with empirical evidence and theoretical analysis verifying its balancing capability.
Conclusion: The research provides comprehensive support from theory to practice for building trustworthy online produce ecosystems, establishing a critical pathway from algorithmic decision-making to consumer trust through transparent, explainable AI systems that address fundamental trade-offs in agricultural grading.
Abstract: The ‘trust deficit’ in online fruit and vegetable e-commerce stems from the inability of digital transactions to provide direct sensory perception of product quality. This paper constructs a ‘Trust Pyramid’ model through ‘dual-source verification’ of consumer trust. Experiments confirm that quality is the cornerstone of trust. The study reveals an ‘impossible triangle’ in agricultural product grading, comprising biological characteristics, timeliness, and economic viability, highlighting the limitations of traditional absolute grading standards. To quantitatively assess this trade-off, we propose the ‘Triangular Trust Index’ (TTI). We redefine the role of algorithms from ‘decision-makers’ to ‘providers of transparent decision-making bases’, designing the explainable AI framework, TriAlignXA. This framework supports trustworthy online transactions within agricultural constraints through multi-objective optimization. Its core relies on three engines: the Bio-Adaptive Engine for granular quality description; the Timeliness Optimization Engine for processing efficiency; and the Economic Optimization Engine for cost control. Additionally, the ‘Pre-Mapping Mechanism’ encodes process data into QR codes, transparently conveying quality information. Experiments on grading tasks demonstrate significantly higher accuracy than baseline models. Empirical evidence and theoretical analysis verify the framework’s balancing capability in addressing the ‘impossible triangle’. This research provides comprehensive support, from theory to practice, for building a trustworthy online produce ecosystem, establishing a critical pathway from algorithmic decision-making to consumer trust.
[186] 4DGS-Craft: Consistent and Interactive 4D Gaussian Splatting Editing
Lei Liu, Can Wang, Zhenghao Chen, Dong Xu
Main category: cs.CV
TL;DR: 4DGS-Craft is a consistent and interactive 4D Gaussian Splatting editing framework that addresses view, temporal, and non-editing region consistency issues, and handles complex text instructions through LLM-based intent understanding.
Details
Motivation: To overcome challenges in 4D Gaussian Splatting editing including view, temporal, and non-editing region consistency problems, and difficulties in handling complex text instructions.
Method: Uses a 4D-aware InstructPix2Pix model with 4D VGGT geometry features and multi-view grid module for consistency, Gaussian selection mechanism for non-edited region preservation, and LLM-based module for user intent understanding and instruction decomposition.
Result: The framework enables more consistent and controllable 4D scene editing compared to related works, handling intricate user commands effectively.
Conclusion: 4DGS-Craft provides a comprehensive solution for consistent and interactive 4D Gaussian Splatting editing, addressing key challenges in the field through novel consistency mechanisms and intelligent user interaction capabilities.
Abstract: Recent advances in 4D Gaussian Splatting (4DGS) editing still face challenges with view, temporal, and non-editing region consistency, as well as with handling complex text instructions. To address these issues, we propose 4DGS-Craft, a consistent and interactive 4DGS editing framework. We first introduce a 4D-aware InstructPix2Pix model to ensure both view and temporal consistency. This model incorporates 4D VGGT geometry features extracted from the initial scene, enabling it to capture underlying 4D geometric structures during editing. We further enhance this model with a multi-view grid module that enforces consistency by iteratively refining multi-view input images while jointly optimizing the underlying 4D scene. Furthermore, we preserve the consistency of non-edited regions through a novel Gaussian selection mechanism, which identifies and optimizes only the Gaussians within the edited regions. Beyond consistency, facilitating user interaction is also crucial for effective 4DGS editing. Therefore, we design an LLM-based module for user intent understanding. This module employs a user instruction template to define atomic editing operations and leverages an LLM for reasoning. As a result, our framework can interpret user intent and decompose complex instructions into a logical sequence of atomic operations, enabling it to handle intricate user commands and further enhance editing performance. Compared to related works, our approach enables more consistent and controllable 4D scene editing. Our code will be made available upon acceptance.
[187] Pure-Pass: Fine-Grained, Adaptive Masking for Dynamic Token-Mixing Routing in Lightweight Image Super-Resolution
Junyu Wu, Jie Tang, Jie Liu, Gangshan Wu
Main category: cs.CV
TL;DR: Pure-Pass (PP) is a pixel-level masking mechanism that identifies pure pixels and exempts them from expensive computations in image super-resolution, achieving superior performance with minimal overhead.
Details
Motivation: Existing lightweight SR methods like CAMixer have limitations including poor adaptability, coarse-grained masking, and spatial inflexibility, which hinder practical deployment due to computational complexity.
Method: PP uses fixed color center points to classify pixels into distinct categories, enabling fine-grained, spatially flexible masking while maintaining adaptive flexibility. It’s integrated into the ATD-light model.
Result: PP-ATD-light achieves superior SR performance with minimal overhead, outperforming CAMixer-ATD-light in reconstruction quality and parameter efficiency when saving similar computation.
Conclusion: The proposed Pure-Pass mechanism effectively addresses limitations of previous methods by providing fine-grained pixel-level masking that improves both performance and efficiency in image super-resolution.
Abstract: Image Super-Resolution (SR) aims to reconstruct high-resolution images from low-resolution counterparts, but the computational complexity of deep learning-based methods often hinders practical deployment. CAMixer is the pioneering work to integrate the advantages of existing lightweight SR methods and proposes a content-aware mixer to route token mixers of varied complexities according to the difficulty of content recovery. However, several limitations remain, such as poor adaptability, coarse-grained masking and spatial inflexibility, among others. We propose Pure-Pass (PP), a pixel-level masking mechanism that identifies pure pixels and exempts them from expensive computations. PP utilizes fixed color center points to classify pixels into distinct categories, enabling fine-grained, spatially flexible masking while maintaining adaptive flexibility. Integrated into the state-of-the-art ATD-light model, PP-ATD-light achieves superior SR performance with minimal overhead, outperforming CAMixer-ATD-light in reconstruction quality and parameter efficiency when saving a similar amount of computation.
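A minimal sketch of the pixel-level masking: each pixel is assigned to its nearest fixed color center and marked "pure" when the distance is small, so those pixels can skip the expensive token mixer. The specific centers and threshold below are assumptions.

```python
import torch

def pure_pass_mask(img, centers, tau=0.02):
    """Mark 'pure' pixels whose color sits close to one of a fixed set of
    color center points.

    img:     (3, H, W) image in [0, 1]
    centers: (K, 3) fixed color centers (an assumed choice here)
    Returns a boolean (H, W) mask, True where the pixel is 'pure'.
    """
    c, h, w = img.shape
    pix = img.permute(1, 2, 0).reshape(-1, 3)                    # (H*W, 3)
    d2 = ((pix[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (H*W, K)
    return (d2.min(dim=1).values < tau).reshape(h, w)

# Toy usage: a flat gray image -> almost all pixels classified as pure
centers = torch.tensor([[0., 0., 0.], [1., 1., 1.], [.5, .5, .5],
                        [1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])
img = torch.full((3, 32, 32), 0.5) + 0.01 * torch.randn(3, 32, 32)
mask = pure_pass_mask(img, centers)
print(f"pure pixels skipped: {mask.float().mean().item():.1%}")
```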
[188] Generating Findings for Jaw Cysts in Dental Panoramic Radiographs Using GPT-4o: Building a Two-Stage Self-Correction Loop with Structured Output (SLSO) Framework
Nanaka Hosokawa, Ryo Takahashi, Tomoya Kitano, Yukihiro Iida, Chisako Muramatsu, Tatsuro Hayashi, Yuta Seino, Xiangrong Zhou, Takeshi Hara, Akitoshi Katsumata, Hiroshi Fujita
Main category: cs.CV
TL;DR: Proposed Self-correction Loop with Structured Output (SLSO) framework using GPT-4o to automatically generate jaw cyst findings on dental panoramic radiographs, improving accuracy through iterative correction loops.
Details
Motivation: To improve the accuracy of automated jaw cyst finding generation on dental radiographs by addressing inconsistencies and hallucinations in conventional methods.
Method: 10-step process using SLSO framework with iterative regeneration when inconsistencies detected, compared against conventional Chain-of-Thought method across 7 evaluation items on 22 jaw cyst cases.
Result: SLSO improved output accuracy with 66.9%, 33.3%, and 28.6% improvement rates for tooth number, tooth movement, and root resorption respectively. Achieved consistent structured output after up to 5 regenerations in successful cases.
Conclusion: SLSO framework effectively enforced negative finding descriptions, suppressed hallucinations, and improved tooth number accuracy, though limited for extensive lesions spanning multiple teeth. Further refinement needed for practical implementation.
Abstract: In this study, we utilized the multimodal capabilities of OpenAI GPT-4o to automatically generate jaw cyst findings on dental panoramic radiographs. To improve accuracy, we constructed a Self-correction Loop with Structured Output (SLSO) framework and verified its effectiveness. A 10-step process was implemented for 22 cases of jaw cysts, including image input and analysis, structured data generation, tooth number extraction and consistency checking, iterative regeneration when inconsistencies were detected, and finding generation with subsequent restructuring and consistency verification. A comparative experiment was conducted using the conventional Chain-of-Thought (CoT) method across seven evaluation items: transparency, internal structure, borders, root resorption, tooth movement, relationships with other structures, and tooth number. The results showed that the proposed SLSO framework improved output accuracy for many items, with 66.9%, 33.3%, and 28.6% improvement rates for tooth number, tooth movement, and root resorption, respectively. In the successful cases, a consistently structured output was achieved after up to five regenerations. Although statistical significance was not reached because of the small size of the dataset, the overall SLSO framework enforced negative finding descriptions, suppressed hallucinations, and improved tooth number identification accuracy. However, the accurate identification of extensive lesions spanning multiple teeth remains limited, and further refinement is required to enhance overall performance and move toward a practical finding generation system.
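The two-stage loop can be sketched as generate, parse, consistency-check, then regenerate with feedback; the `call_model` stub and the tooth-number rule below are placeholders, not the paper's prompts or checks.

```python
import json

MAX_REGENERATIONS = 5   # the paper reports success within up to five retries

def call_model(image, prompt):
    """Stand-in for a multimodal GPT-4o call returning structured findings
    as JSON; replace with a real API client in practice."""
    return json.dumps({"tooth_numbers": [36, 37],
                       "referenced_teeth": [36, 37],
                       "finding": "radiolucent lesion near tooth 36"})

def check_consistency(findings):
    """Toy consistency rule: every tooth referenced by the findings must
    appear in the structured tooth-number list."""
    return (set(findings.get("referenced_teeth", []))
            <= set(findings.get("tooth_numbers", [])))

def slso_generate(image, prompt):
    """Self-correction Loop with Structured Output (simplified sketch):
    regenerate until the structured output passes the consistency check."""
    feedback = ""
    for attempt in range(1, MAX_REGENERATIONS + 1):
        raw = call_model(image, prompt + feedback)
        try:
            findings = json.loads(raw)
        except json.JSONDecodeError:
            feedback = "\nThe previous output was not valid JSON. Retry."
            continue
        if check_consistency(findings):
            return findings, attempt
        feedback = "\nTooth numbers were inconsistent. Re-examine and retry."
    return None, MAX_REGENERATIONS   # give up; flag for manual review

print(slso_generate(image=None, prompt="Describe jaw cyst findings."))
```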
[189] LiLa-Net: Lightweight Latent LiDAR Autoencoder for 3D Point Cloud Reconstruction
Mario Resino, Borja Pérez, Jaime Godoy, Abdulla Al-Kaff, Fernando García
Main category: cs.CV
TL;DR: LiLa-Net is a 3D autoencoder that uses LiDAR point clouds from real traffic environments, featuring simplified skip connections and reduced encoder layers for efficient performance while maintaining accurate reconstruction.
Details
Motivation: To develop an efficient 3D autoencoder architecture that can encode features from LiDAR point clouds in real traffic environments without requiring extensive computational resources like state-of-the-art architectures.
Method: Proposed LiLa-Net architecture with reduced encoder layers and simplified skip connections, leveraging real semi-autonomous vehicle data with Velodyne LiDAR to create an efficient latent space for point cloud reconstruction.
Result: The model achieves effective balance between skip connection information and latent encoding, leading to improved reconstruction quality without performance compromise, and demonstrates strong generalization by reconstructing objects outside the original traffic environment.
Conclusion: LiLa-Net successfully creates an efficient 3D autoencoder that accurately reconstructs LiDAR point clouds with reduced computational requirements while maintaining strong generalization capabilities.
Abstract: This work proposes a 3D autoencoder architecture, named LiLa-Net, which encodes efficient features from real traffic environments using only LiDAR point clouds. For this purpose, we use a real semi-autonomous vehicle equipped with a Velodyne LiDAR. The system leverages the skip-connection concept to improve performance without the extensive resources required by state-of-the-art architectures. Key changes include reducing the number of encoder layers and simplifying the skip connections, while still producing an efficient and representative latent space that allows the original point cloud to be accurately reconstructed. Furthermore, an effective balance has been achieved between the information carried by the skip connections and the latent encoding, leading to improved reconstruction quality without compromising performance. Finally, the model demonstrates strong generalization capabilities, successfully reconstructing objects unrelated to the original traffic environment.
[190] kabr-tools: Automated Framework for Multi-Species Behavioral Monitoring
Jenna Kline, Maksim Kholiavchenko, Samuel Stevens, Nina van Tiel, Alison Zhong, Namrata Banerji, Alec Sheets, Sowbaranika Balasubramaniam, Isla Duporge, Matthew Thompson, Elizabeth Campolongo, Jackson Miliko, Neil Rosser, Tanya Berger-Wolf, Charles V. Stewart, Daniel I. Rubenstein
Main category: cs.CV
TL;DR: kabr-tools is an open-source package for automated multi-species behavioral monitoring using drone-based video and machine learning, enabling scalable analysis of complex behavioral patterns in wildlife.
Details
Motivation: Traditional field observations are limited in scope, time-consuming, and labor-intensive, hindering comprehensive assessment of behavioral responses across landscapes.
Method: Integrates drone-based video with machine learning systems including object detection, tracking, and behavioral classification to extract behavioral, social, and spatial metrics from wildlife footage.
Result: Drone-based observations improved behavioral granularity, reducing visibility loss by 15% and capturing more transitions with higher accuracy. Case studies analyzed 969 behavioral sequences, revealing species-specific behavioral patterns and spatial segregation.
Conclusion: kabr-tools enables automated behavioral monitoring at scale, offering a powerful tool for ecosystem-wide studies that advances conservation, biodiversity research, and ecological monitoring.
Abstract: A comprehensive understanding of animal behavior ecology depends on scalable approaches to quantify and interpret complex, multidimensional behavioral patterns. Traditional field observations are often limited in scope, time-consuming, and labor-intensive, hindering the assessment of behavioral responses across landscapes. To address this, we present kabr-tools (Kenyan Animal Behavior Recognition Tools), an open-source package for automated multi-species behavioral monitoring. This framework integrates drone-based video with machine learning systems to extract behavioral, social, and spatial metrics from wildlife footage. Our pipeline leverages object detection, tracking, and behavioral classification systems to generate key metrics, including time budgets, behavioral transitions, social interactions, habitat associations, and group composition dynamics. Compared to ground-based methods, drone-based observations significantly improved behavioral granularity, reducing visibility loss by 15% and capturing more transitions with higher accuracy and continuity. We validate kabr-tools through three case studies, analyzing 969 behavioral sequences, surpassing the capacity of traditional methods for data capture and annotation. We found that vigilance in Grevy’s zebras, as in Plains zebras, decreases with herd size, but that, unlike in Plains zebras, habitat has a negligible impact. Plains and Grevy’s zebras exhibit strong behavioral inertia, with rare transitions to alert behaviors and observed spatial segregation between Grevy’s zebras, Plains zebras, and giraffes in mixed-species herds. By enabling automated behavioral monitoring at scale, kabr-tools offers a powerful tool for ecosystem-wide studies, advancing conservation, biodiversity research, and ecological monitoring.
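Two of the metrics named above, time budgets and behavioral transitions, reduce to simple aggregations over per-frame behavior tracks; here is a sketch (not the kabr-tools API):

```python
import numpy as np

def transition_matrix(sequence, behaviors):
    """Row-normalized matrix of behavior-to-behavior transition frequencies.
    Strong 'behavioral inertia' would show up as a heavy diagonal."""
    idx = {b: i for i, b in enumerate(behaviors)}
    counts = np.zeros((len(behaviors), len(behaviors)))
    for a, b in zip(sequence[:-1], sequence[1:]):
        counts[idx[a], idx[b]] += 1
    rows = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, rows, out=np.zeros_like(counts), where=rows > 0)

def time_budget(sequence, behaviors):
    """Fraction of observed frames spent in each behavior."""
    return {b: sequence.count(b) / len(sequence) for b in behaviors}

# Toy usage on a per-frame behavior track from the classifier
track = ["graze", "graze", "alert", "graze", "walk", "walk", "graze"]
behaviors = ["graze", "walk", "alert"]
print(transition_matrix(track, behaviors))
print(time_budget(track, behaviors))
```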
[191] GaussianMorphing: Mesh-Guided 3D Gaussians for Semantic-Aware Object Morphing
Mengtian Li, Yunshu Bai, Yimin Chu, Yijun Shen, Zhongmei Li, Weifeng Ge, Zhifeng Xie, Chaofeng Chen
Main category: cs.CV
TL;DR: GaussianMorphing is a novel framework for semantic-aware 3D shape and texture morphing from multi-view images using mesh-guided 3D Gaussian Splatting.
Details
Motivation: To overcome limitations of previous approaches that rely on point clouds or require pre-defined homeomorphic mappings for untextured data.
Method: Uses mesh-guided 3D Gaussian Splatting with unified deformation strategy that anchors 3D Gaussians to mesh patches, ensuring geometrically consistent transformations and preserving texture fidelity through topology-aware constraints.
Result: Outperforms prior 2D/3D methods on TexMorph benchmark, reducing color consistency error (ΔE) by 22.2% and EI by 26.2%.
Conclusion: The framework enables semantic-aware 3D morphing that preserves both local detail and global semantic coherence without requiring labeled data.
Abstract: We introduce GaussianMorphing, a novel framework for semantic-aware 3D shape and texture morphing from multi-view images. Previous approaches usually rely on point clouds or require pre-defined homeomorphic mappings for untextured data. Our method overcomes these limitations by leveraging mesh-guided 3D Gaussian Splatting (3DGS) for high-fidelity geometry and appearance modeling. The core of our framework is a unified deformation strategy that anchors 3D Gaussians to reconstructed mesh patches, ensuring geometrically consistent transformations while preserving texture fidelity through topology-aware constraints. In parallel, our framework establishes unsupervised semantic correspondence by using the mesh topology as a geometric prior and maintains structural integrity via physically plausible point trajectories. This integrated approach preserves both local detail and global semantic coherence throughout the morphing process without requiring labeled data. On our proposed TexMorph benchmark, GaussianMorphing substantially outperforms prior 2D/3D methods, reducing color consistency error ($\Delta E$) by 22.2% and EI by 26.2%. Project page: https://baiyunshu.github.io/GAUSSIANMORPHING.github.io/
[192] Zero-shot Human Pose Estimation using Diffusion-based Inverse solvers
Sahil Bhandary Karnoor, Romit Roy Choudhury
Main category: cs.CV
TL;DR: InPose: A diffusion-based pose estimation method that uses only rotational measurements and generative priors for zero-shot generalization across users.
Details
Motivation: Existing pose estimation methods using conditional diffusion models generalize poorly across users because location measurements are highly influenced by body size variations.
Method: Formulates pose estimation as an inverse problem using a pre-trained diffusion model conditioned only on rotational measurements, guided by a likelihood term derived from measured locations.
Result: Proposed InPose method can generatively estimate pose sequences that best explain sparse on-body measurements for any user.
Conclusion: The approach enables zero-shot generalization across users by decoupling body-size-dependent location measurements from the generative process.
Abstract: Pose estimation refers to tracking a human’s full body posture, including their head, torso, arms, and legs. The problem is challenging in practical settings where the number of body sensors are limited. Past work has shown promising results using conditional diffusion models, where the pose prediction is conditioned on both <location, rotation> measurements from the sensors. Unfortunately, nearly all these approaches generalize poorly across users, primarily because location measurements are highly influenced by the body size of the user. In this paper, we formulate pose estimation as an inverse problem and design an algorithm capable of zero-shot generalization. Our idea utilizes a pre-trained diffusion model and conditions it on rotational measurements alone; the priors from this model are then guided by a likelihood term, derived from the measured locations. Thus, given any user, our proposed InPose method generatively estimates the highly likely sequence of poses that best explains the sparse on-body measurements.
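Viewed as an inverse problem, each sampling step combines the rotation-conditioned prior with a gradient from the location likelihood; the sketch below is a DPS-style simplification with placeholder modules, not InPose's exact update rule.

```python
import torch

def guided_denoise_step(x_t, t, denoiser, measured_loc, loc_fn, eta=0.5):
    """One diffusion sampling step with likelihood guidance.

    x_t:          current noisy pose sequence
    denoiser:     model conditioned on rotations only; returns an x0 estimate
    measured_loc: sparse measured locations y
    loc_fn:       differentiable map from a pose estimate to its locations
    The prior is user-agnostic (rotations only); the location likelihood
    term steers samples toward the measurements at inference time.
    """
    x_t = x_t.detach().requires_grad_(True)
    x0_hat = denoiser(x_t, t)                          # prior's clean estimate
    residual = ((loc_fn(x0_hat) - measured_loc) ** 2).sum()
    grad = torch.autograd.grad(residual, x_t)[0]       # likelihood gradient
    return (x0_hat - eta * grad).detach()              # simplified update

# Toy usage with stand-in modules
denoiser = lambda x, t: x * 0.9                        # placeholder denoiser
loc_fn = lambda x: x[..., :3]                          # first 3 dims as 'location'
x = torch.randn(1, 10, 6)                              # 10 frames, 6-dof pose
y = torch.zeros(1, 10, 3)
x = guided_denoise_step(x, t=0.5, denoiser=denoiser, measured_loc=y, loc_fn=loc_fn)
print(x.shape)
```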
[193] VGDM: Vision-Guided Diffusion Model for Brain Tumor Detection and Segmentation
Arman Behnam
Main category: cs.CV
TL;DR: VGDM: Vision-Guided Diffusion Model for Brain Tumor Detection and Segmentation - a transformer-driven diffusion framework that combines global contextual reasoning with iterative denoising to improve brain tumor segmentation accuracy and boundary precision.
Details
Motivation: Convolutional architectures like U-Net have limited capacity to capture long-range dependencies, which constrains performance on complex tumor structures. Diffusion models show strong potential for generating high-fidelity medical images and refining segmentation boundaries.
Method: Embed a vision transformer at the core of the diffusion process to leverage global contextual reasoning together with iterative denoising. The transformer backbone enables effective modeling of spatial relationships across entire MRI volumes, while diffusion refinement mitigates voxel-level errors and recovers fine-grained tumor details.
Result: Experimental validation on MRI brain tumor datasets demonstrates consistent gains in Dice similarity and Hausdorff distance metrics.
Conclusion: The hybrid transformer-diffusion design provides improved robustness and scalability in neuro-oncology, moving beyond conventional U-Net baselines and advancing the state of the art in tumor segmentation.
Abstract: Accurate detection and segmentation of brain tumors from magnetic resonance imaging (MRI) are essential for diagnosis, treatment planning, and clinical monitoring. While convolutional architectures such as U-Net have long been the backbone of medical image segmentation, their limited capacity to capture long-range dependencies constrains performance on complex tumor structures. Recent advances in diffusion models have demonstrated strong potential for generating high-fidelity medical images and refining segmentation boundaries. In this work, we propose VGDM: Vision-Guided Diffusion Model for Brain Tumor Detection and Segmentation framework, a transformer-driven diffusion framework for brain tumor detection and segmentation. By embedding a vision transformer at the core of the diffusion process, the model leverages global contextual reasoning together with iterative denoising to enhance both volumetric accuracy and boundary precision. The transformer backbone enables more effective modeling of spatial relationships across entire MRI volumes, while diffusion refinement mitigates voxel-level errors and recovers fine-grained tumor details. This hybrid design provides a pathway toward improved robustness and scalability in neuro-oncology, moving beyond conventional U-Net baselines. Experimental validation on MRI brain tumor datasets demonstrates consistent gains in Dice similarity and Hausdorff distance, underscoring the potential of transformer-guided diffusion models to advance the state of the art in tumor segmentation.
[194] Mapping Historic Urban Footprints in France: Balancing Quality, Scalability and AI Techniques
Walid Rabehi, Marion Le Texier, Rémi Lemoy
Main category: cs.CV
TL;DR: A deep learning pipeline using dual-pass U-Net to extract urban footprints from historical French maps (1925-1950), creating the first national-scale urban dataset for this period with 73% accuracy.
Details
Motivation: To overcome the lack of nationwide digital urban footprint data for historical urban sprawl analysis in France before the 1970s, particularly using the Scan Histo historical map series.
Method: Dual-pass U-Net approach: first pass generates preliminary map to identify confusing areas (text, roads), second pass uses refined dataset and binarized output to minimize radiometric noise. Deployed on HPC cluster to process 941 high-resolution tiles covering France.
Result: Produced first open-access national-scale urban footprint dataset for 1925-1950 period with 73% overall accuracy, effectively capturing diverse urban patterns while overcoming artifacts like labels and contour lines.
Conclusion: Successfully bridged the data gap for historical urban analysis in France, with openly released code, training datasets, and nationwide urban raster to support future urbanization dynamics research.
Abstract: Quantitative analysis of historical urban sprawl in France before the 1970s is hindered by the lack of nationwide digital urban footprint data. This study bridges this gap by developing a scalable deep learning pipeline to extract urban areas from the Scan Histo historical map series (1925-1950), which produces the first open-access, national-scale urban footprint dataset for this pivotal period. Our key innovation is a dual-pass U-Net approach designed to handle the high radiometric and stylistic complexity of historical maps. The first pass, trained on an initial dataset, generates a preliminary map that identifies areas of confusion, such as text and roads, to guide targeted data augmentation. The second pass uses a refined dataset and the binarized output of the first model to minimize radiometric noise, which significantly reduces false positives. Deployed on a high-performance computing cluster, our method processes 941 high-resolution tiles covering the entirety of metropolitan France. The final mosaic achieves an overall accuracy of 73%, effectively capturing diverse urban patterns while overcoming common artifacts like labels and contour lines. We openly release the code, training datasets, and the resulting nationwide urban raster to support future research in long-term urbanization dynamics.
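The dual-pass inference can be sketched as feeding the binarized first-pass map back as an extra input channel for the second pass; the placeholder models below only illustrate the data flow, not the trained U-Nets.

```python
import numpy as np

def dual_pass_inference(tile, first_model, second_model, thresh=0.5):
    """Two-pass urban-footprint extraction (simplified sketch).

    Pass 1 produces a preliminary probability map; its binarized output is
    stacked onto the input as an extra channel so that pass 2 can suppress
    radiometric noise from text, roads, and contour lines.
    tile: (H, W, C) scanned map tile; models return (H, W) probabilities.
    """
    prelim = first_model(tile)                       # preliminary urban map
    prelim_bin = (prelim > thresh).astype(np.float32)
    guided = np.concatenate([tile, prelim_bin[..., None]], axis=-1)
    return second_model(guided) > thresh             # final binary mask

# Toy usage with placeholder models
first = lambda t: t.mean(axis=-1)                    # stand-in probability map
second = lambda t: t[..., :3].mean(axis=-1) * t[..., 3]
mask = dual_pass_inference(np.random.rand(256, 256, 3), first, second)
print(mask.shape, mask.dtype)
```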
[195] When Tracking Fails: Analyzing Failure Modes of SAM2 for Point-Based Tracking in Surgical Videos
Woowon Jang, Jiwon Im, Juseung Choi, Niki Rashidian, Wesley De Neve, Utku Ozbulak
Main category: cs.CV
TL;DR: Point-based tracking in surgical videos is competitive for tools but underperforms for anatomical targets due to tissue similarity and ambiguous boundaries.
Details
Motivation: To systematically analyze the failure modes of point-based tracking in laparoscopic cholecystectomy videos and understand its reliability in complex surgical environments.
Method: Compared point-based tracking performance with segmentation mask initialization for three surgical targets (gallbladder, grasper, L-hook electrocautery) in laparoscopic cholecystectomy videos.
Result: Point-based tracking is competitive for surgical tools but consistently underperforms for anatomical targets where tissue similarity and ambiguous boundaries lead to failure.
Conclusion: Provides actionable recommendations for selecting and placing tracking points to improve performance in surgical video analysis, highlighting key factors influencing tracking outcomes.
Abstract: Video object segmentation (VOS) models such as SAM2 offer promising zero-shot tracking capabilities for surgical videos using minimal user input. Among the available input types, point-based tracking offers an efficient and low-cost alternative, yet its reliability and failure cases in complex surgical environments are not well understood. In this work, we systematically analyze the failure modes of point-based tracking in laparoscopic cholecystectomy videos. Focusing on three surgical targets, the gallbladder, grasper, and L-hook electrocautery, we compare the performance of point-based tracking with segmentation mask initialization. Our results show that point-based tracking is competitive for surgical tools but consistently underperforms for anatomical targets, where tissue similarity and ambiguous boundaries lead to failure. Through qualitative analysis, we reveal key factors influencing tracking outcomes and provide several actionable recommendations for selecting and placing tracking points to improve performance in surgical video analysis.
[196] FRIEREN: Federated Learning with Vision-Language Regularization for Segmentation
Ding-Ruei Shen
Main category: cs.CV
TL;DR: Proposes FRIEREN framework for FFREEDG task - federated learning for semantic segmentation using only unlabeled client data after pretraining on labeled server data, leveraging vision foundation models with vision-language integration and consistency learning.
Details
Motivation: Address challenges in federated learning for semantic segmentation with domain shifts when client data is unlabeled, overcoming limitations of existing methods that assume labeled client data or don't leverage modern vision foundation models.
Method: Uses vision-language decoder guided by CLIP-based text embeddings for semantic disambiguation, employs weak-to-strong consistency learning strategy for robust local training on pseudo-labels, integrates vision and language modalities from vision foundation models.
Result: Achieves competitive performance against established domain generalization and adaptation methods on synthetic-to-real and clear-to-adverse-weather benchmarks, effectively tackles the FFREEDG task.
Conclusion: Sets a strong baseline for future research in federated learning for semantic segmentation with unlabeled client data and domain shifts, demonstrating the effectiveness of vision-language integration and consistency learning strategies.
Abstract: Federated Learning (FL) offers a privacy-preserving solution for Semantic Segmentation (SS) tasks to adapt to new domains, but faces significant challenges from these domain shifts, particularly when client data is unlabeled. However, most existing FL methods unrealistically assume access to labeled data on remote clients or fail to leverage the power of modern Vision Foundation Models (VFMs). Here, we propose a novel and challenging task, FFREEDG, in which a model is pretrained on a server’s labeled source dataset and subsequently trained across clients using only their unlabeled data, without ever re-accessing the source. To solve FFREEDG, we propose FRIEREN, a framework that leverages the knowledge of a VFM by integrating vision and language modalities. Our approach employs a Vision-Language decoder guided by CLIP-based text embeddings to improve semantic disambiguation and uses a weak-to-strong consistency learning strategy for robust local training on pseudo-labels. Our experiments on synthetic-to-real and clear-to-adverse-weather benchmarks demonstrate that our framework effectively tackles this new task, achieving competitive performance against established domain generalization and adaptation methods and setting a strong baseline for future research.
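The weak-to-strong consistency signal resembles FixMatch-style pseudo-labeling: confident predictions on a weakly augmented view supervise a strongly augmented one. A minimal per-pixel sketch follows (the CLIP-guided decoder is omitted, and the augmentations here are toy stand-ins).

```python
import torch
import torch.nn.functional as F

def weak_to_strong_loss(model, images, weak_aug, strong_aug, conf_thresh=0.9):
    """FixMatch-style weak-to-strong consistency on unlabeled client images.

    Pseudo-labels come from the weakly augmented view; only confident
    pixels supervise the strongly augmented view.
    """
    with torch.no_grad():
        weak_logits = model(weak_aug(images))            # (B, C, H, W)
        probs = weak_logits.softmax(dim=1)
        conf, pseudo = probs.max(dim=1)                  # (B, H, W)
    strong_logits = model(strong_aug(images))
    loss = F.cross_entropy(strong_logits, pseudo, reduction="none")
    mask = (conf > conf_thresh).float()                  # keep confident pixels
    return (loss * mask).sum() / mask.sum().clamp(min=1)

# Toy usage with a tiny stand-in segmentation head
model = torch.nn.Conv2d(3, 5, kernel_size=1)
weak = lambda x: x                                       # identity as 'weak' aug
strong = lambda x: x + 0.3 * torch.randn_like(x)         # noise as 'strong' aug
print(float(weak_to_strong_loss(model, torch.randn(2, 3, 16, 16),
                                weak, strong, conf_thresh=0.1)))
```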
[197] Unlocking Vision-Language Models for Video Anomaly Detection via Fine-Grained Prompting
Shu Zou, Xinyu Tian, Lukas Wesemann, Fabian Waschkowski, Zhaoyuan Yang, Jing Zhang
Main category: cs.CV
TL;DR: ASK-Hint is a structured prompting framework for video anomaly detection that uses action-centric knowledge to improve frozen vision-language models’ performance by providing fine-grained guiding questions aligned with discriminative visual cues.
Details
Motivation: Existing prompting methods for video anomaly detection are often too abstract and overlook fine-grained human-object interactions or action semantics that define complex anomalies in surveillance videos.
Method: Organizes prompts into semantically coherent groups (e.g., violence, property crimes, public safety) and formulates fine-grained guiding questions that align model predictions with discriminative visual cues, leveraging action-centric knowledge.
Result: Extensive experiments on UCF-Crime and XD-Violence show ASK-Hint consistently improves AUC over prior baselines, achieving state-of-the-art performance compared to both fine-tuned and training-free methods.
Conclusion: ASK-Hint establishes a new training-free and generalizable solution for explainable video anomaly detection, highlighting the critical role of prompt granularity and providing interpretable reasoning traces.
Abstract: Prompting has emerged as a practical way to adapt frozen vision-language models (VLMs) for video anomaly detection (VAD). Yet, existing prompts are often overly abstract, overlooking the fine-grained human-object interactions or action semantics that define complex anomalies in surveillance videos. We propose ASK-Hint, a structured prompting framework that leverages action-centric knowledge to elicit more accurate and interpretable reasoning from frozen VLMs. Our approach organizes prompts into semantically coherent groups (e.g. violence, property crimes, public safety) and formulates fine-grained guiding questions that align model predictions with discriminative visual cues. Extensive experiments on UCF-Crime and XD-Violence show that ASK-Hint consistently improves AUC over prior baselines, achieving state-of-the-art performance compared to both fine-tuned and training-free methods. Beyond accuracy, our framework provides interpretable reasoning traces towards anomaly and demonstrates strong generalization across datasets and VLM backbones. These results highlight the critical role of prompt granularity and establish ASK-Hint as a new training-free and generalizable solution for explainable video anomaly detection.
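Here is a sketch of how grouped, fine-grained guiding questions might be assembled into a single prompt for a frozen VLM; the groups and questions below are illustrative, not the paper's exact prompt set.

```python
# Illustrative ASK-Hint-style prompt groups (assumed content, not the paper's)
PROMPT_GROUPS = {
    "violence": [
        "Is any person striking, kicking, or grappling with another?",
        "Is a weapon visible in anyone's hands?",
    ],
    "property_crimes": [
        "Is someone forcing open a door, window, or vehicle?",
        "Is an object being taken while its owner is absent or unaware?",
    ],
    "public_safety": [
        "Is there fire, smoke, or an explosion in the scene?",
        "Is anyone lying motionless in an unusual location?",
    ],
}

def build_ask_hint_prompt(frame_description: str) -> str:
    """Assemble fine-grained, action-centric guiding questions into a
    single prompt, grouped by anomaly category."""
    lines = [f"Scene: {frame_description}",
             "Answer each question with yes/no and one short justification:"]
    for group, questions in PROMPT_GROUPS.items():
        lines.append(f"[{group}]")
        lines.extend(f"- {q}" for q in questions)
    lines.append("Finally, state whether the scene is anomalous and why.")
    return "\n".join(lines)

print(build_ask_hint_prompt("a crowded subway platform at night"))
```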
[198] GeoPurify: A Data-Efficient Geometric Distillation Framework for Open-Vocabulary 3D Segmentation
Weijia Dou, Xu Zhang, Yi Bin, Jian Liu, Bo Peng, Guoqing Wang, Yang Yang, Heng Tao Shen
Main category: cs.CV
TL;DR: GeoPurify addresses the trade-off in 3D semantic segmentation by using a Student Affinity Network to purify 2D VLM-generated 3D point features with geometric priors from a 3D self-supervised teacher model, achieving state-of-the-art performance with only 1.5% training data.
Details
Motivation: Current methods for transferring 2D VLM features to 3D segmentation face a trade-off: direct projection yields noisy predictions while geometric coherence requires costly training and large annotated datasets. The segmentation-and-matching paradigm fails to reconcile 2D semantics with 3D structure.
Method: Proposes GeoPurify with a Student Affinity Network that purifies 2D VLM-generated 3D point features using geometric priors from a 3D self-supervised teacher model. Includes a Geometry-Guided Pooling module during inference for denoising and ensuring semantic-structural consistency.
Result: GeoPurify achieves or surpasses state-of-the-art performance on major 3D benchmarks while using only about 1.5% of training data, effectively mitigating the trade-off between projection noise and geometric coherence.
Conclusion: GeoPurify demonstrates that latent geometric information in noisy 2D-to-3D features can be effectively exploited through geometric purification, enabling superior data efficiency and performance in 3D semantic segmentation without extensive training pipelines.
Abstract: Recent attempts to transfer features from 2D Vision-Language Models (VLMs) to 3D semantic segmentation expose a persistent trade-off. Directly projecting 2D features into 3D yields noisy and fragmented predictions, whereas enforcing geometric coherence necessitates costly training pipelines and large-scale annotated 3D data. We argue that this limitation stems from the dominant segmentation-and-matching paradigm, which fails to reconcile 2D semantics with 3D geometric structure. The geometric cues are not eliminated during the 2D-to-3D transfer but remain latent within the noisy and view-aggregated features. To exploit this property, we propose GeoPurify that applies a small Student Affinity Network to purify 2D VLM-generated 3D point features using geometric priors distilled from a 3D self-supervised teacher model. During inference, we devise a Geometry-Guided Pooling module to further denoise the point cloud and ensure the semantic and structural consistency. Benefiting from latent geometric information and the learned affinity network, GeoPurify effectively mitigates the trade-off and achieves superior data efficiency. Extensive experiments on major 3D benchmarks demonstrate that GeoPurify achieves or surpasses state-of-the-art performance while utilizing only about 1.5% of the training data. Our codes and checkpoints are available at https://github.com/tj12323/GeoPurify.
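Geometry-Guided Pooling can be approximated by averaging features over geometric neighbors; the sketch below substitutes uniform kNN weights for the learned affinities, so it only illustrates the denoising effect.

```python
import torch

def geometry_guided_pooling(xyz, feats, k=8):
    """Denoise per-point features by averaging over geometric neighbors,
    a simplified stand-in for GeoPurify's Geometry-Guided Pooling.

    xyz: (N, 3) point coordinates; feats: (N, D) noisy 2D-lifted features.
    """
    d2 = torch.cdist(xyz, xyz)                        # (N, N) pairwise distances
    knn = d2.topk(k, largest=False).indices           # (N, k) neighbor ids
    return feats[knn].mean(dim=1)                     # (N, D) pooled features

# Toy usage: two well-separated clusters keep distinct features after pooling
xyz = torch.cat([torch.randn(50, 3), torch.randn(50, 3) + 10.0])
feats = torch.cat([torch.zeros(50, 4), torch.ones(50, 4)])
feats += 0.5 * torch.randn_like(feats)                # projection noise
pooled = geometry_guided_pooling(xyz, feats)
print(pooled[:2], pooled[-2:])                        # near 0s vs near 1s
```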
[199] Cross-Breed Pig Identification Using Auricular Vein Pattern Recognition: A Machine Learning Approach for Small-Scale Farming Applications
Emmanuel Nsengiyumva, Leonard Niyitegeka, Eric Umuhoza
Main category: cs.CV
TL;DR: A noninvasive pig identification system using auricular vein patterns achieved 98.12% accuracy with SVM classification, offering a cost-effective alternative to traditional methods like ear tags.
Details
Motivation: Traditional pig identification methods (ear tags, microchips) are unreliable, costly, target pure breeds, and impractical for small-scale farmers, creating a need for better solutions.
Method: Collected 800 ear images from 20 mixed-breed pigs using smartphone and back lighting, developed multistage computer vision pipeline for vein enhancement and feature extraction, used machine learning models for classification.
Result: Support Vector Machines achieved 98.12% precision in pig identification across mixed-breed populations, with entire process taking average 8.3 seconds from image to classification.
Conclusion: Auricular vein biometrics provide a cost-effective, stress-free alternative to physical identifiers, confirming practicality for digitizing livestock management and extending precision farming benefits to resource-constrained communities.
Abstract: Accurate livestock identification is a cornerstone of modern farming: it supports health monitoring, breeding programs, and productivity tracking. However, common pig identification methods, such as ear tags and microchips, are often unreliable, costly, target pure breeds, and thus impractical for small-scale farmers. To address this gap, we propose a noninvasive biometric identification approach that leverages the uniqueness of auricular vein patterns. To this end, we have collected 800 ear images from 20 mixed-breed pigs (Landrace cross Pietrain and Duroc cross Pietrain), captured using a standard smartphone and simple back lighting. A multistage computer vision pipeline was developed to enhance vein visibility, extract structural and spatial features, and generate biometric signatures. These features were then classified using machine learning models. Support Vector Machines (SVM) achieved the highest accuracy: correctly identifying pigs with 98.12% precision across mixed-breed populations. The entire process from image processing to classification was completed in an average of 8.3 seconds, demonstrating feasibility for real-time farm deployment. We believe that by replacing fragile physical identifiers with permanent biological markers, this system provides farmers with a cost-effective and stress-free method of animal identification. More broadly, the findings confirm the practicality of auricular vein biometrics for digitizing livestock management, reinforcing its potential to extend the benefits of precision farming to resource-constrained agricultural communities.
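The classification stage maps precomputed vein-pattern feature vectors to pig identities; here is a minimal scikit-learn sketch on synthetic features mirroring the 800-image, 20-pig setup (the actual vein-enhancement pipeline is not reproduced).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-ins for vein-pattern feature vectors: one prototype
# signature per pig, plus per-image noise.
rng = np.random.default_rng(0)
n_pigs, imgs_per_pig, feat_dim = 20, 40, 64           # mirrors 800 images / 20 pigs
prototypes = rng.normal(size=(n_pigs, feat_dim))
X = np.vstack([p + 0.3 * rng.normal(size=(imgs_per_pig, feat_dim))
               for p in prototypes])
y = np.repeat(np.arange(n_pigs), imgs_per_pig)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=0)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0))
clf.fit(X_tr, y_tr)
print(f"held-out identification accuracy: {clf.score(X_te, y_te):.2%}")
```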
[200] MMDEW: Multipurpose Multiclass Density Estimation in the Wild
Villanelle O’Reilly, Jonathan Cox, Georgios Leontidis, Marc Hanheide, Petra Bosilj, James Brown
Main category: cs.CV
TL;DR: A multi-category counting framework using Twins pyramid vision-transformer backbone with multi-class counting head and Category Focus Module, achieving superior performance on crowd counting benchmarks and demonstrating applicability to biodiversity monitoring.
Details
Motivation: To address object counting in dense and occluded scenes where discrete counting-by-detection methods fail, particularly for multi-category scenarios with inter-category interference.
Method: Uses Twins pyramid vision-transformer backbone with specialized multi-class counting head based on multiscale decoding. Includes two-task design with segmentation-based Category Focus Module to suppress inter-category cross-talk during training.
Result: Achieved 33%, 43% and 64% reduction in MAE on VisDrone and iSAID benchmarks compared to prior multi-category crowd-counting approaches. Outperformed YOLOv11 in dense scenes.
Conclusion: The method’s regional loss enables multi-class crowd counting applications in new domains like biodiversity monitoring, providing scalable ecological insights for conservation efforts.
Abstract: Density map estimation can be used to estimate object counts in dense and occluded scenes where discrete counting-by-detection methods fail. We propose a multicategory counting framework that leverages a Twins pyramid vision-transformer backbone and a specialised multi-class counting head built on a state-of-the-art multiscale decoding approach. A two-task design adds a segmentation-based Category Focus Module, suppressing inter-category cross-talk at training time. Training and evaluation on the VisDrone and iSAID benchmarks demonstrate superior performance versus prior multicategory crowd-counting approaches (33%, 43% and 64% reductions in MAE), and the comparison with YOLOv11 underscores the necessity of crowd counting methods in dense scenes. The method’s regional loss opens up multi-class crowd counting to new domains, demonstrated through the application to a biodiversity monitoring dataset, highlighting its capacity to inform conservation efforts and enable scalable ecological insights.
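As a toy illustration of the counting-by-density idea (not the paper's Twins backbone or Category Focus Module), here is a head that decodes backbone features into one density map per category; the per-category count is the spatial sum of its map:

```python
import torch
import torch.nn as nn

class MultiClassDensityHead(nn.Module):
    """Toy decoder: backbone features -> one non-negative density map per class."""
    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.decode = nn.Sequential(
            nn.Conv2d(in_channels, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, num_classes, 1), nn.ReLU(inplace=True),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.decode(feats)  # (B, num_classes, H, W)

feats = torch.randn(2, 256, 64, 64)            # stand-in backbone features
density = MultiClassDensityHead(256, 5)(feats)
counts = density.sum(dim=(2, 3))               # (B, num_classes) per-class counts
```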
[201] TempoControl: Temporal Attention Guidance for Text-to-Video Models
Shira Schiber, Ofir Lindenbaum, Idan Schwartz
Main category: cs.CV
TL;DR: TempoControl enables fine-grained temporal control in text-to-video generation by using cross-attention maps to align visual concepts with timing signals during inference, without retraining.
Details
Motivation: Current generative video models lack fine-grained temporal control, preventing users from specifying when specific visual elements should appear in generated sequences.
Method: TempoControl uses cross-attention maps from text-to-video diffusion models and guides timing through a novel optimization approach built on three principles: temporal shape alignment via correlation, amplification via energy, and spatial focus maintenance via entropy (sketched below).
Result: The method achieves precise timing control while maintaining high video quality and diversity across applications including temporal reordering, action generation, and audio-aligned generation.
Conclusion: TempoControl provides effective temporal alignment of visual concepts during video generation without requiring retraining or additional supervision.
Abstract: Recent advances in generative video models have enabled the creation of high-quality videos based on natural language prompts. However, these models frequently lack fine-grained temporal control, meaning they do not allow users to specify when particular visual elements should appear within a generated sequence. In this work, we introduce TempoControl, a method that allows for temporal alignment of visual concepts during inference, without requiring retraining or additional supervision. TempoControl utilizes cross-attention maps, a key component of text-to-video diffusion models, to guide the timing of concepts through a novel optimization approach. Our method steers attention using three complementary principles: aligning its temporal shape with a control signal (via correlation), amplifying it where visibility is needed (via energy), and maintaining spatial focus (via entropy). TempoControl allows precise control over timing while ensuring high video quality and diversity. We demonstrate its effectiveness across various video generation applications, including temporal reordering for single and multiple objects, as well as action and audio-aligned generation.
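The three guidance principles can be folded into a single loss; the version below is our illustrative reconstruction with assumed shapes and weights, not the authors' implementation:

```python
import torch

def tempo_control_loss(attn: torch.Tensor, timing: torch.Tensor,
                       w=(1.0, 0.1, 0.1)) -> torch.Tensor:
    # attn: (T, H, W) cross-attention maps of one concept token over T frames;
    # timing: (T,) desired visibility signal in [0, 1].
    energy = attn.flatten(1).mean(dim=1)                        # per-frame mass
    corr = torch.corrcoef(torch.stack([energy, timing]))[0, 1]  # shape alignment
    p = attn.flatten(1).softmax(dim=1)                          # spatial dist.
    entropy = -(p * p.clamp_min(1e-8).log()).sum(dim=1).mean()  # spatial spread
    # Maximize correlation and energy on visible frames; keep attention focused.
    return -w[0] * corr - w[1] * (energy * timing).mean() + w[2] * entropy

# At each denoising step, this loss would be backpropagated to the latent for a
# single-pass update before sampling continues (no retraining of the model).
```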
[202] RewardMap: Tackling Sparse Rewards in Fine-grained Visual Reasoning via Multi-Stage Reinforcement Learning
Sicheng Feng, Kaiwen Tuo, Song Wang, Lingdong Kong, Jianke Zhu, Huan Wang
Main category: cs.CV
TL;DR: RewardMap is a multi-stage RL framework that improves multimodal LLMs’ fine-grained visual reasoning by using difficulty-aware rewards and staged training from perception to reasoning tasks, achieving 3.47% average improvement across 6 benchmarks.
Details
Motivation: Advanced MLLMs struggle with fine-grained visual reasoning in structured settings like transit maps, and standard RL faces sparse rewards and unstable optimization.
Method: Constructed the ReasonMap-Plus dataset with dense VQA rewards, then proposed RewardMap with a difficulty-aware reward design and a multi-stage RL scheme that bootstraps from simple perception to complex reasoning tasks (a toy reward form is sketched below).
Result: Each component of RewardMap contributes to consistent performance gains, with combined approach achieving best results. Models show 3.47% average improvement across 6 benchmarks spanning spatial reasoning, fine-grained visual reasoning, and general tasks.
Conclusion: RewardMap effectively addresses sparse reward challenges in visual reasoning and provides a better cold-start strategy than conventional SFT, enhancing MLLMs’ visual understanding and reasoning capabilities.
Abstract: Fine-grained visual reasoning remains a core challenge for multimodal large language models (MLLMs). The recently introduced ReasonMap highlights this gap by showing that even advanced MLLMs struggle with spatial reasoning in structured and information-rich settings such as transit maps, a task of clear practical and scientific importance. However, standard reinforcement learning (RL) on such tasks is impeded by sparse rewards and unstable optimization. To address this, we first construct ReasonMap-Plus, an extended dataset that introduces dense reward signals through Visual Question Answering (VQA) tasks, enabling effective cold-start training of fine-grained visual understanding skills. Next, we propose RewardMap, a multi-stage RL framework designed to improve both visual understanding and reasoning capabilities of MLLMs. RewardMap incorporates two key designs. First, we introduce a difficulty-aware reward design that incorporates detail rewards, directly tackling the sparse rewards while providing richer supervision. Second, we propose a multi-stage RL scheme that bootstraps training from simple perception to complex reasoning tasks, offering a more effective cold-start strategy than conventional Supervised Fine-Tuning (SFT). Experiments on ReasonMap and ReasonMap-Plus demonstrate that each component of RewardMap contributes to consistent performance gains, while their combination yields the best results. Moreover, models trained with RewardMap achieve an average improvement of 3.47% across 6 benchmarks spanning spatial reasoning, fine-grained visual reasoning, and general tasks beyond transit maps, underscoring enhanced visual understanding and reasoning capabilities.
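A toy version of a difficulty-aware reward; the functional form and weights below are our assumptions, intended only to show how dense detail rewards can supplement a sparse final-answer signal:

```python
def reward(final_correct: bool, detail_hits: int, detail_total: int,
           difficulty: float) -> float:
    """difficulty in [0, 1]; detail_hits counts correct sub-answers (VQA)."""
    dense = detail_hits / max(detail_total, 1)   # dense per-detail signal
    sparse = 1.0 if final_correct else 0.0       # sparse end-answer signal
    return (1.0 + difficulty) * (0.7 * sparse + 0.3 * dense)

# Stage 1 trains mostly on perception-style VQA (dense term dominates); later
# stages shift weight toward the sparse end-task reward on full reasoning.
```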
[203] DragFlow: Unleashing DiT Priors with Region Based Supervision for Drag Editing
Zihan Zhou, Shilin Lu, Shuli Leng, Shaocong Zhang, Zhuming Lian, Xinlei Yu, Adams Wai-Kin Kong
Main category: cs.CV
TL;DR: DragFlow is the first framework to effectively use FLUX’s strong generative priors for drag-based image editing, overcoming limitations of previous methods by introducing region-based editing with affine transformations and integrating personalization adapters for better subject consistency.
Details
Motivation: Previous drag-based editing methods suffer from distortions due to the insufficient priors of Stable Diffusion. With the shift to stronger DiT-based models like FLUX, there is an opportunity to improve drag editing, but current methods do not leverage these enhanced priors effectively.
Method: DragFlow introduces region-based editing with affine transformations for richer feature supervision (sketched below), integrates IP-Adapter for subject consistency, uses gradient masks for background preservation, and employs MLLMs to resolve task ambiguities. It also creates a new evaluation benchmark, ReD Bench.
Result: Extensive experiments on DragBench-DR and ReD Bench show that DragFlow surpasses both point-based and region-based baselines, achieving state-of-the-art performance in drag-based image editing with substantial gains over existing methods.
Conclusion: DragFlow successfully harnesses FLUX’s strong generative priors for drag-based editing, overcoming the limitations of previous approaches through region-based supervision and comprehensive integration of personalization techniques, setting a new standard in the field.
Abstract: Drag-based image editing has long suffered from distortions in the target region, largely because the priors of earlier base models, Stable Diffusion, are insufficient to project optimized latents back onto the natural image manifold. With the shift from UNet-based DDPMs to more scalable DiT with flow matching (e.g., SD3.5, FLUX), generative priors have become significantly stronger, enabling advances across diverse editing tasks. However, drag-based editing has yet to benefit from these stronger priors. This work proposes the first framework to effectively harness FLUX’s rich prior for drag-based editing, dubbed DragFlow, achieving substantial gains over baselines. We first show that directly applying point-based drag editing to DiTs performs poorly: unlike the highly compressed features of UNets, DiT features are insufficiently structured to provide reliable guidance for point-wise motion supervision. To overcome this limitation, DragFlow introduces a region-based editing paradigm, where affine transformations enable richer and more consistent feature supervision. Additionally, we integrate pretrained open-domain personalization adapters (e.g., IP-Adapter) to enhance subject consistency, while preserving background fidelity through gradient mask-based hard constraints. Multimodal large language models (MLLMs) are further employed to resolve task ambiguities. For evaluation, we curate a novel Region-based Dragging benchmark (ReD Bench) featuring region-level dragging instructions. Extensive experiments on DragBench-DR and ReD Bench show that DragFlow surpasses both point-based and region-based baselines, setting a new state-of-the-art in drag-based image editing. Code and datasets will be publicly available upon publication.
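A simplified sketch of region-based motion supervision under an affine drag; this is our construction from the summary, not the paper's code:

```python
import torch
import torch.nn.functional as F

def region_drag_loss(feats, ref_feats, mask, theta):
    # feats: (1, C, H, W) features being optimized; ref_feats: frozen reference
    # features; mask: (1, 1, H, W) source region; theta: (1, 2, 3) affine map
    # describing the drag (rotation / translation / scale of the region).
    grid = F.affine_grid(theta, list(ref_feats.shape), align_corners=False)
    target = F.grid_sample(ref_feats, grid, align_corners=False)  # dragged ref
    region = F.grid_sample(mask, grid, align_corners=False)       # dragged mask
    # Pull current features toward the dragged reference inside the region.
    return (region * (feats - target.detach()).abs()).mean()

theta = torch.tensor([[[1.0, 0.0, 0.2], [0.0, 1.0, 0.0]]])  # small shift example
loss = region_drag_loss(torch.randn(1, 64, 32, 32, requires_grad=True),
                        torch.randn(1, 64, 32, 32),
                        torch.ones(1, 1, 32, 32), theta)
```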
[204] From Frames to Clips: Efficient Key Clip Selection for Long-Form Video Understanding
Guangyu Sun, Archit Singhal, Burak Uzkent, Mubarak Shah, Chen Chen, Garin Kessler
Main category: cs.CV
TL;DR: F2C addresses the “needle in a haystack” problem in Video LLMs by selecting key clips instead of isolated frames, maintaining temporal coherence while keeping token count constant through adaptive resolution.
Details
Motivation: Current Video LLMs suffer from excessive visual tokens that exhaust context windows, and existing frame-wise selection methods discard essential temporal dynamics, leading to poor reasoning about motion and event continuity.
Method: Proposes F2C, which selects key clips (short, temporally coherent segments) instead of isolated frames and uses adaptive resolution to balance spatial resolution against clip length while keeping the token count per video constant (the budget arithmetic is sketched below).
Result: Outperforms uniform sampling by 8.1% on Video-MME, 5.6% on LongVideoBench, and 10.3% on MLVU benchmarks. The training-free approach demonstrates significant improvements in long-form video understanding.
Conclusion: Preserving temporal coherence through clip selection is crucial for video understanding, and F2C provides a practical solution for scaling Video LLMs to real-world applications.
Abstract: Video Large Language Models (VLMs) have achieved remarkable results on a variety of vision language tasks, yet their practical use is limited by the “needle in a haystack” problem: the massive number of visual tokens produced from raw video frames exhausts the model’s context window. Existing solutions alleviate this issue by selecting a sparse set of frames, thereby reducing token count, but such frame-wise selection discards essential temporal dynamics, leading to suboptimal reasoning about motion and event continuity. In this work we systematically explore the impact of temporal information and demonstrate that extending selection from isolated key frames to key clips, which are short, temporally coherent segments, improves video understanding. To maintain a fixed computational budget while accommodating the larger token footprint of clips, we propose an adaptive resolution strategy that dynamically balances spatial resolution and clip length, ensuring a constant token count per video. Experiments on three long-form video benchmarks demonstrate that our training-free approach, F2C, outperforms uniform sampling by up to 8.1%, 5.6%, and 10.3% on the Video-MME, LongVideoBench and MLVU benchmarks, respectively. These results highlight the importance of preserving temporal coherence in frame selection and provide a practical pathway for scaling Video LLMs to real-world video understanding applications. Project webpage is available at https://guangyusun.com/f2c .
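The constant-budget trade-off reduces to simple arithmetic; the patch size and square-frame assumption below are ours, not from the paper:

```python
import math

def adaptive_side(budget_tokens: int, clip_len: int, patch: int = 14) -> int:
    """Frame side length (px) so that clip_len frames fit a fixed token budget."""
    side_patches = int(math.sqrt(budget_tokens / clip_len))  # square frames
    return side_patches * patch

for clip_len in (1, 4, 8, 16):
    print(clip_len, adaptive_side(budget_tokens=4096, clip_len=clip_len))
# Longer clips buy temporal coherence at proportionally coarser resolution,
# keeping the total visual-token count per video constant.
```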
[205] Paving the Way Towards Kinematic Assessment Using Monocular Video: A Preclinical Benchmark of State-of-the-Art Deep-Learning-Based 3D Human Pose Estimators Against Inertial Sensors in Daily Living Activities
Mario Medrano-Paredes, Carmen Fernández-González, Francisco-Javier Díaz-Pernas, Hichem Saoudi, Javier González-Alonso, Mario Martínez-Zarzuela
Main category: cs.CV
TL;DR: This study compares video-based 3D human pose estimation models with IMU sensors for movement assessment, finding MotionAGFormer performs best among video methods and both technologies are viable for out-of-lab kinematic analysis.
Details
Motivation: To enable accurate human movement assessment outside specialized labs for telemedicine, sports science, and rehabilitation by comparing the viability of video-based vs. IMU-based approaches.
Method: Used the VIDIMU dataset with 13 daily activities captured by commodity cameras and 5 IMUs. Evaluated joint angles from MotionAGFormer, MotionBERT, MMPose 2D-to-3D, and NVIDIA BodyTrack against IMU data processed through OpenSim inverse kinematics, using the Human3.6M format with 17 keypoints (the agreement metrics are sketched below).
Result: MotionAGFormer achieved best performance with lowest RMSE (9.27° ± 4.80°), lowest MAE (7.86° ± 4.18°), highest Pearson correlation (0.86 ± 0.15), and highest R² (0.67 ± 0.28). Both video and IMU approaches are viable for out-of-lab assessment.
Conclusion: Video models provide clinically promising kinematics in healthy adults but lag behind IMU-based estimates in some areas. The study establishes guidelines for developing robust, cost-effective telehealth solutions, highlighting trade-offs between video and sensor approaches in cost, accessibility, and precision.
Abstract: Advances in machine learning and wearable sensors offer new opportunities for capturing and analyzing human movement outside specialized laboratories. Accurate assessment of human movement under real-world conditions is essential for telemedicine, sports science, and rehabilitation. This preclinical benchmark compares monocular video-based 3D human pose estimation models with inertial measurement units (IMUs), leveraging the VIDIMU dataset containing a total of 13 clinically relevant daily activities, which were captured using both commodity video cameras and five IMUs. During this initial study only healthy subjects were recorded, so results cannot be generalized to pathological cohorts. Joint angles derived from state-of-the-art deep learning frameworks (MotionAGFormer, MotionBERT, MMPose 2D-to-3D pose lifting, and NVIDIA BodyTrack) were evaluated against joint angles computed from IMU data using OpenSim inverse kinematics following the Human3.6M dataset format with 17 keypoints. Among them, MotionAGFormer demonstrated superior performance, achieving the lowest overall RMSE ($9.27^\circ \pm 4.80^\circ$) and MAE ($7.86^\circ \pm 4.18^\circ$), as well as the highest Pearson correlation ($0.86 \pm 0.15$) and the highest coefficient of determination $R^{2}$ ($0.67 \pm 0.28$). The results reveal that both technologies are viable for out-of-the-lab kinematic assessment. However, they also highlight key trade-offs between video- and sensor-based approaches including costs, accessibility, and precision. This study clarifies where off-the-shelf video models already provide clinically promising kinematics in healthy adults and where they lag behind IMU-based estimates while establishing valuable guidelines for researchers and clinicians seeking to develop robust, cost-effective, and user-friendly solutions for telehealth and remote patient monitoring.
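The reported metrics are standard agreement measures between two joint-angle time series; a compact reference implementation (variable names are ours):

```python
import numpy as np

def agreement_metrics(video_deg: np.ndarray, imu_deg: np.ndarray) -> dict:
    """RMSE, MAE, Pearson r, and R^2 between paired angle series (degrees)."""
    err = video_deg - imu_deg
    rmse = float(np.sqrt(np.mean(err ** 2)))
    mae = float(np.mean(np.abs(err)))
    r = float(np.corrcoef(video_deg, imu_deg)[0, 1])   # Pearson correlation
    r2 = 1.0 - float(np.sum(err ** 2)) / float(np.sum((imu_deg - imu_deg.mean()) ** 2))
    return {"RMSE": rmse, "MAE": mae, "r": r, "R2": r2}

t = np.linspace(0, 2 * np.pi, 200)                     # toy knee-flexion traces
print(agreement_metrics(30 * np.sin(t) + np.random.normal(0, 2, t.size),
                        30 * np.sin(t)))
```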
[206] NeuroSwift: A Lightweight Cross-Subject Framework for fMRI Visual Reconstruction of Complex Scenes
Shiyi Zhang, Dong Liang, Yihang Zhou
Main category: cs.CV
TL;DR: NeuroSwift is a diffusion-based method that integrates AutoKL and CLIP adapters for cross-subject visual stimulus reconstruction from fMRI data, achieving state-of-the-art performance with minimal training time.
Details
Motivation: To overcome challenges in cross-subject visual reconstruction from brain activity, including inter-subject variability and the brain's abstract semantic encoding of complex visual inputs.
Method: Integrates complementary adapters via diffusion: AutoKL for low-level features and CLIP for semantics. The CLIP Adapter is trained on Stable Diffusion-generated images paired with COCO captions. For new subjects, the model is pretrained on one subject and only 17% of parameters are fine-tuned (sketched below).
Result: Achieves state-of-the-art performance with only one hour of training per subject on lightweight GPUs (three RTX 4090), outperforming existing methods.
Conclusion: NeuroSwift enables efficient and accurate cross-subject visual stimulus reconstruction from fMRI data through its adapter-based diffusion approach and parameter-efficient fine-tuning strategy.
Abstract: Reconstructing visual information from brain activity via computer vision technology provides an intuitive understanding of visual neural mechanisms. Despite progress in decoding fMRI data with generative models, achieving accurate cross-subject reconstruction of visual stimuli remains challenging and computationally demanding. This difficulty arises from inter-subject variability in neural representations and the brain’s abstract encoding of core semantic features in complex visual inputs. To address these challenges, we propose NeuroSwift, which integrates complementary adapters via diffusion: AutoKL for low-level features and CLIP for semantics. NeuroSwift’s CLIP Adapter is trained on Stable Diffusion generated images paired with COCO captions to emulate higher visual cortex encoding. For cross-subject generalization, we pretrain on one subject and then fine-tune only 17 percent of parameters (fully connected layers) for new subjects, while freezing other components. This enables state-of-the-art performance with only one hour of training per subject on lightweight GPUs (three RTX 4090), and it outperforms existing methods.
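Assuming the fine-tuned fully connected layers are nn.Linear modules (the paper only says "fully connected layers"), the cross-subject recipe amounts to freezing everything else; a sketch:

```python
import torch.nn as nn

def prepare_for_new_subject(model: nn.Module) -> float:
    """Freeze all weights except nn.Linear layers; return the trainable fraction."""
    trainable, total = 0, 0
    for module in model.modules():
        for p in module.parameters(recurse=False):
            p.requires_grad = isinstance(module, nn.Linear)
            trainable += p.numel() * int(p.requires_grad)
            total += p.numel()
    return trainable / max(total, 1)

# frac = prepare_for_new_subject(neuroswift_model)  # expected to be around 0.17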
[207] microCLIP: Unsupervised CLIP Adaptation via Coarse-Fine Token Fusion for Fine-Grained Image Classification
Sathira Silva, Eman Ali, Chetan Arora, Muhammad Haris Khan
Main category: cs.CV
TL;DR: microCLIP is a self-training framework that enhances CLIP’s fine-grained classification by fusing saliency-guided local features with global representations and using LLM-derived classifiers for stable adaptation.
Details
Motivation: CLIP's reliance on coarse global features limits its performance on fine-grained classification tasks, and previous methods that align LLM descriptions with CLIP's [CLS] token lack spatial precision.
Method: Uses Saliency-Oriented Attention Pooling (SOAP) to build saliency-guided [FG] tokens from patch embeddings, fuses them with the global [CLS] token (sketched below), and employs a two-headed LLM-derived classifier with Dynamic Knowledge Aggregation for stable pseudo-label refinement.
Result: Achieves a consistent 2.90% average accuracy gain across 13 fine-grained benchmarks with only light adaptation.
Conclusion: microCLIP effectively uncovers latent fine-grained signals in CLIP through joint refinement of visual and textual representations, improving fine-grained classification performance.
Abstract: Unsupervised adaptation of CLIP-based vision-language models (VLMs) for fine-grained image classification requires sensitivity to microscopic local cues. While CLIP exhibits strong zero-shot transfer, its reliance on coarse global features restricts its performance on fine-grained classification tasks. Prior efforts inject fine-grained knowledge by aligning large language model (LLM) descriptions with the CLIP $\texttt{[CLS]}$ token; however, this approach overlooks spatial precision. We propose $\textbf{microCLIP}$, a self-training framework that jointly refines CLIP’s visual and textual representations using fine-grained cues. At its core is Saliency-Oriented Attention Pooling (SOAP) within a lightweight TokenFusion module, which builds a saliency-guided $\texttt{[FG]}$ token from patch embeddings and fuses it with the global $\texttt{[CLS]}$ token for coarse-fine alignment. To stabilize adaptation, we introduce a two-headed LLM-derived classifier: a frozen classifier that, via multi-view alignment, provides a stable text-based prior for pseudo-labeling, and a learnable classifier initialized from LLM descriptions and fine-tuned with TokenFusion. We further develop Dynamic Knowledge Aggregation, which convexly combines fixed LLM/CLIP priors with TokenFusion’s evolving logits to iteratively refine pseudo-labels. Together, these components uncover latent fine-grained signals in CLIP, yielding a consistent 2.90% average accuracy gain across 13 fine-grained benchmarks while requiring only light adaptation. Our code is available at https://github.com/sathiiii/microCLIP.
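A minimal sketch of saliency-weighted pooling in the spirit of SOAP; this is our simplification, and the fusion weight alpha is an assumption:

```python
import torch

def soap_fg_token(patch_tokens, saliency, cls_token, alpha=0.5):
    # patch_tokens: (B, N, D); saliency: (B, N) unnormalized patch scores;
    # cls_token: (B, D). alpha balances the coarse and fine streams.
    w = saliency.softmax(dim=1).unsqueeze(-1)      # (B, N, 1) pooling weights
    fg = (w * patch_tokens).sum(dim=1)             # saliency-pooled [FG] token
    return alpha * cls_token + (1.0 - alpha) * fg  # coarse-fine fusion

tokens, sal, cls = torch.randn(2, 196, 512), torch.randn(2, 196), torch.randn(2, 512)
fused = soap_fg_token(tokens, sal, cls)            # (2, 512)
```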
[208] VidGuard-R1: AI-Generated Video Detection and Explanation via Reasoning MLLMs and RL
Kyoungjun Park, Yifan Yang, Juheon Yi, Shicheng Zheng, Yifei Shen, Dongqi Han, Caihua Shan, Muhammad Muaz, Lili Qiu
Main category: cs.CV
TL;DR: VidGuard-R1 is a video authenticity detector that fine-tunes a multi-modal large language model using group relative policy optimization to provide accurate AI-generated video detection with interpretable explanations.
Details
Motivation: Address the urgent need for effective detection tools to mitigate societal risks from AI-generated videos, including misinformation and reputational harm, while ensuring transparency through interpretable explanations for regulators and end users.
Method: Fine-tunes the Qwen-VL multi-modal large language model using group relative policy optimization (GRPO; sketched below) with two specialized reward models targeting temporal artifacts and generation complexity, trained on a challenging dataset of 140k real and AI-generated videos.
Result: Achieves state-of-the-art zero-shot performance on existing benchmarks, with additional training pushing accuracy above 95%. Produces precise and interpretable rationales behind predictions.
Conclusion: VidGuard-R1 successfully addresses the need for both accurate detection and interpretable explanations in AI-generated video authentication, demonstrating superior performance through specialized fine-tuning approaches.
Abstract: With the rapid advancement of AI-generated videos, there is an urgent need for effective detection tools to mitigate societal risks such as misinformation and reputational harm. In addition to accurate classification, it is essential that detection models provide interpretable explanations to ensure transparency for regulators and end users. To address these challenges, we introduce VidGuard-R1, the first video authenticity detector that fine-tunes a multi-modal large language model (MLLM) using group relative policy optimization (GRPO). Our model delivers both highly accurate judgments and insightful reasoning. We curate a challenging dataset of 140k real and AI-generated videos produced by state-of-the-art generation models, carefully designing the generation process to maximize discrimination difficulty. We then fine-tune Qwen-VL using GRPO with two specialized reward models that target temporal artifacts and generation complexity. Extensive experiments demonstrate that VidGuard-R1 achieves state-of-the-art zero-shot performance on existing benchmarks, with additional training pushing accuracy above 95%. Case studies further show that VidGuard-R1 produces precise and interpretable rationales behind its predictions. The code is publicly available at https://VidGuard-R1.github.io.
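Generic GRPO standardizes rewards within a group of sampled responses to the same prompt; a minimal sketch of that advantage computation (not the authors' code):

```python
import torch

def grpo_advantages(group_rewards: torch.Tensor) -> torch.Tensor:
    """group_rewards: (G,) rewards for G sampled answers to one prompt, G > 1."""
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + 1e-6)

# e.g. rewards mixing the two specialized signals for each sampled judgment:
r = torch.tensor([0.9, 0.2, 0.7, 0.1])
print(grpo_advantages(r))   # standardized within the group
# Each response is then reinforced with its advantage under a clipped
# policy-gradient objective, as in PPO but without a learned value network.
```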
[209] Self-Forcing++: Towards Minute-Scale High-Quality Video Generation
Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, Cho-Jui Hsieh
Main category: cs.CV
TL;DR: Proposes a method to improve long-horizon video generation by using teacher models to guide student models through self-generated video segments, avoiding quality degradation without requiring long-video supervision.
Details
Motivation: Current diffusion models for video generation suffer from quality degradation when extending beyond their training horizon due to error compounding in latent space, and they require expensive long-video training data.
Method: Uses teacher models to guide student models through segments sampled from self-generated long videos, maintaining temporal consistency without recomputing overlapping frames (sketched below).
Result: Achieves 20x video length scaling beyond teacher capability, generates videos up to 4 minutes 15 seconds (50x longer than baseline), and outperforms baselines in fidelity and consistency on benchmarks.
Conclusion: The approach effectively mitigates quality degradation in long-horizon video generation without requiring long-video supervision or retraining, demonstrating significant improvements in scaling video length while maintaining quality.
Abstract: Diffusion models have revolutionized image and video generation, achieving unprecedented visual quality. However, their reliance on transformer architectures incurs prohibitively high computational costs, particularly when extending generation to long videos. Recent work has explored autoregressive formulations for long video generation, typically by distilling from short-horizon bidirectional teachers. Nevertheless, given that teacher models cannot synthesize long videos, the extrapolation of student models beyond their training horizon often leads to pronounced quality degradation, arising from the compounding of errors within the continuous latent space. In this paper, we propose a simple yet effective approach to mitigate quality degradation in long-horizon video generation without requiring supervision from long-video teachers or retraining on long video datasets. Our approach centers on exploiting the rich knowledge of teacher models to provide guidance for the student model through sampled segments drawn from self-generated long videos. Our method maintains temporal consistency while scaling video length by up to 20x beyond teacher’s capability, avoiding common issues such as over-exposure and error-accumulation without recomputing overlapping frames like previous methods. When scaling up the computation, our method shows the capability of generating videos up to 4 minutes and 15 seconds, equivalent to 99.9% of the maximum span supported by our base model’s position embedding and more than 50x longer than that of our baseline model. Experiments on standard benchmarks and our proposed improved benchmark demonstrate that our approach substantially outperforms baseline methods in both fidelity and consistency. Our long-horizon videos demo can be found at https://self-forcing-plus-plus.github.io/
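A schematic training step as we read the approach; `rollout` and `denoise` are hypothetical interfaces, so this is a sketch of the idea rather than the authors' method:

```python
import random
import torch

def distill_step(student, teacher, prompt, long_len=600, win=48):
    """One schematic update; model interfaces here are hypothetical stand-ins."""
    with torch.no_grad():
        long_video = student.rollout(prompt, num_frames=long_len)  # self-generated
    s = random.randrange(long_len - win)
    segment = long_video[s:s + win]                # fits the teacher's horizon
    with torch.no_grad():
        target = teacher.denoise(segment, prompt)  # short-horizon teacher guidance
    pred = student.denoise(segment, prompt)
    return (pred - target).pow(2).mean()           # regress student to teacher
```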
[210] Learning to Generate Object Interactions with Physics-Guided Video Diffusion
David Romero, Ariana Bermudez, Hao Li, Fabio Pizzati, Ivan Laptev
Main category: cs.CV
TL;DR: KineMask introduces physics-guided video generation with rigid body control, enabling realistic object interactions and motion effects from single images and velocity inputs.
Details
Motivation: Current video generation models struggle with physically plausible object interactions and lack physics-grounded control mechanisms, limiting their use as world simulators for robotics and embodied decision making.
Method: A two-stage training strategy gradually removes future-motion supervision via object masks (sketched below), training video diffusion models on synthetic scenes with simple interactions and integrating low-level motion control with high-level textual conditioning.
Result: Significant improvements in object interactions in real scenes, strong improvements over recent models of comparable size, and effective synthesis of complex dynamical phenomena.
Conclusion: KineMask successfully addresses limitations in physics-grounded video generation, with ablation studies confirming the complementary roles of low- and high-level conditioning in video diffusion models.
Abstract: Recent models for video generation have achieved remarkable progress and are now deployed in film, social media production, and advertising. Beyond their creative potential, such models also hold promise as world simulators for robotics and embodied decision making. Despite strong advances, however, current approaches still struggle to generate physically plausible object interactions and lack physics-grounded control mechanisms. To address this limitation, we introduce KineMask, an approach for physics-guided video generation that enables realistic rigid body control, interactions, and effects. Given a single image and a specified object velocity, our method generates videos with inferred motions and future object interactions. We propose a two-stage training strategy that gradually removes future motion supervision via object masks. Using this strategy we train video diffusion models (VDMs) on synthetic scenes of simple interactions and demonstrate significant improvements of object interactions in real scenes. Furthermore, KineMask integrates low-level motion control with high-level textual conditioning via predictive scene descriptions, leading to effective support for synthesis of complex dynamical phenomena. Extensive experiments show that KineMask achieves strong improvements over recent models of comparable size. Ablation studies further highlight the complementary roles of low- and high-level conditioning in VDMs. Our code, model, and data will be made publicly available.
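One plausible schedule for gradually removing mask supervision; the exact schedule and loss decomposition below are assumptions:

```python
def mask_weight(step: int, stage1_steps: int, anneal_steps: int) -> float:
    """Weight of the mask-gated motion loss: 1.0 early, annealed to 0.0."""
    if step < stage1_steps:
        return 1.0                               # stage 1: full mask supervision
    t = min(1.0, (step - stage1_steps) / anneal_steps)
    return 1.0 - t                               # stage 2: supervision removed

# total_loss = recon_loss + mask_weight(step, 10_000, 5_000) * masked_motion_loss
```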
[211] MultiModal Action Conditioned Video Generation
Yichen Li, Antonio Torralba
Main category: cs.CV
TL;DR: The paper introduces fine-grained multimodal actions incorporating proprioception, kinesthesia, force haptics, and muscle activation to address limitations of current video models in real-time fine motor control for household robots.
Details
Motivation: Current video models fail as world models due to lack of fine-grained control needed for delicate tasks and urgent situations in household robotics.
Method: Developed a feature learning paradigm that aligns multimodal senses while preserving unique information from each modality, with a regularization scheme to enhance causality of action trajectory features.
Result: Experiments show improved simulation accuracy and reduced temporal drift when incorporating multimodal senses. Ablation studies and downstream applications demonstrate effectiveness.
Conclusion: The proposed multimodal approach enables fine-grained interactions that are difficult to simulate with text-conditioned generative models, showing practical benefits for robotic control.
Abstract: Current video models fail as world models because they lack fine-grained control. General-purpose household robots require real-time fine motor control to handle delicate tasks and urgent situations. In this work, we introduce fine-grained multimodal actions to capture such precise control. We consider senses of proprioception, kinesthesia, force haptics, and muscle activation. Such multimodal senses naturally enable fine-grained interactions that are difficult to simulate with text-conditioned generative models. To effectively simulate fine-grained multisensory actions, we develop a feature learning paradigm that aligns these modalities while preserving the unique information each modality provides. We further propose a regularization scheme to enhance causality of the action trajectory features in representing intricate interaction dynamics. Experiments show that incorporating multimodal senses improves simulation accuracy and reduces temporal drift. Extensive ablation studies and downstream applications demonstrate the effectiveness and practicality of our work.
[212] VideoNSA: Native Sparse Attention Scales Video Understanding
Enxin Song, Wenhao Chai, Shusheng Yang, Ethan Armand, Xiaojun Shan, Haiyang Xu, Jianwen Xie, Zhuowen Tu
Main category: cs.CV
TL;DR: VideoNSA adapts Native Sparse Attention to video-language models, enabling reliable scaling to 128K tokens and improved performance on long-video understanding tasks through a hardware-aware hybrid attention approach.
Details
Motivation: Video understanding in multimodal language models is limited by context length constraints, causing models to miss key transition frames and struggle with coherence across long time scales.
Method: Adapts Native Sparse Attention (NSA) to video-language models by fine-tuning Qwen2.5-VL end-to-end on a 216K video instruction dataset, using hardware-aware hybrid attention (dense for text, NSA for video).
Result: VideoNSA achieves improved performance on long-video understanding, temporal reasoning, and spatial benchmarks compared to token-compression and training-free sparse baselines.
Conclusion: The approach enables reliable scaling to 128K tokens, reveals optimal global-local attention allocation, task-dependent branch usage patterns, and learnable combined sparse attention helps induce dynamic attention sinks.
Abstract: Video understanding in multimodal language models remains limited by context length: models often miss key transition frames and struggle to maintain coherence across long time scales. To address this, we adapt Native Sparse Attention (NSA) to video-language models. Our method, VideoNSA, adapts Qwen2.5-VL through end-to-end training on a 216K video instruction dataset. We employ a hardware-aware hybrid approach to attention, preserving dense attention for text, while employing NSA for video. Compared to token-compression and training-free sparse baselines, VideoNSA achieves improved performance on long-video understanding, temporal reasoning, and spatial benchmarks. Further ablation analysis reveals four key findings: (1) reliable scaling to 128K tokens; (2) an optimal global-local attention allocation at a fixed budget; (3) task-dependent branch usage patterns; and (4) learnable combined sparse attention helps induce dynamic attention sinks.
[213] NoiseShift: Resolution-Aware Noise Recalibration for Better Low-Resolution Image Generation
Ruozhen He, Moayed Haji-Ali, Ziyan Yang, Vicente Ordonez
Main category: cs.CV
TL;DR: NoiseShift is a training-free method that recalibrates noise levels in diffusion models based on resolution size to improve low-resolution image generation without architectural changes.
Details
Motivation: Text-to-image diffusion models fail to generalize to resolutions lower than those they were trained on, preventing budget-efficient alternatives for users who do not need high-resolution images.
Method: Identifies that noise schedulers have unequal perceptual effects across resolutions: the same noise level removes more signal from lower-resolution images. Proposes NoiseShift to recalibrate the denoiser's noise-level conditioning based on resolution size (sketched below).
Result: Significantly improves quality at low resolutions: SD3.5 improved by 15.89%, SD3 by 8.56%, Flux-Dev by 2.44% in FID on LAION-COCO; SD3.5 improved by 10.36%, SD3 by 5.19%, Flux-Dev by 3.02% on CelebA.
Conclusion: NoiseShift effectively mitigates resolution-dependent artifacts and enhances low-resolution image generation quality without requiring model retraining or architectural changes.
Abstract: Text-to-image diffusion models trained on a fixed set of resolutions often fail to generalize, even when asked to generate images at lower resolutions than those seen during training. High-resolution text-to-image generators are currently unable to easily offer an out-of-the-box budget-efficient alternative to their users who might not need high-resolution images. We identify a key technical insight in diffusion models that when addressed can help tackle this limitation: Noise schedulers have unequal perceptual effects across resolutions. The same level of noise removes disproportionately more signal from lower-resolution images than from high-resolution images, leading to a train-test mismatch. We propose NoiseShift, a training-free method that recalibrates the noise level of the denoiser conditioned on resolution size. NoiseShift requires no changes to model architecture or sampling schedule and is compatible with existing models. When applied to Stable Diffusion 3, Stable Diffusion 3.5, and Flux-Dev, quality at low resolutions is significantly improved. On LAION-COCO, NoiseShift improves SD3.5 by 15.89%, SD3 by 8.56%, and Flux-Dev by 2.44% in FID on average. On CelebA, NoiseShift improves SD3.5 by 10.36%, SD3 by 5.19%, and Flux-Dev by 3.02% in FID on average. These results demonstrate the effectiveness of NoiseShift in mitigating resolution-dependent artifacts and enhancing the quality of low-resolution image generation.
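One plausible recalibration, borrowing the timestep-shifting form used in flow-matching practice; the paper's exact mapping is not given in the summary and may differ:

```python
def shifted_timestep(t: float, train_res: int, target_res: int) -> float:
    """t in (0, 1); returns an assumed recalibrated noise level for the denoiser."""
    s = target_res / train_res          # resolution ratio (s < 1 when smaller)
    return s * t / (1.0 + (s - 1.0) * t)

# The sampler keeps its schedule; only the noise-level *conditioning* passed to
# the denoiser changes, which is why no retraining is needed.
print(shifted_timestep(0.5, train_res=1024, target_res=512))  # ~0.333
```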
[214] Inferring Dynamic Physical Properties from Video Foundation Models
Guanqi Zhan, Xianzheng Ma, Weidi Xie, Andrew Zisserman
Main category: cs.CV
TL;DR: This paper studies predicting dynamic physical properties from videos, focusing on elasticity, viscosity, and friction. It introduces new datasets and explores three methods: oracle using computer vision, prompt-based readout from pre-trained models, and MLLM prompting.
Details
Motivation: To enable prediction of dynamic physical properties that require temporal information from videos, addressing the challenge of inferring properties like elasticity, viscosity, and friction that evolve over time.
Method: Three approaches: (1) an oracle method using classical computer vision to extract visual cues (an example is sketched below), (2) a prompt-based readout mechanism with trainable vectors on pre-trained video models, and (3) prompting strategies for Multi-modal Large Language Models.
Result: Video foundation models achieve similar performance to each other but lag behind the oracle method. MLLMs currently perform worse but can be improved with better prompting strategies.
Conclusion: The study demonstrates the feasibility of predicting dynamic physical properties from videos, with video foundation models showing promising results, though MLLMs need further development for this task.
Abstract: We study the task of predicting dynamic physical properties from videos. More specifically, we consider physical properties that require temporal information to be inferred: elasticity of a bouncing object, viscosity of a flowing liquid, and dynamic friction of an object sliding on a surface. To this end, we make the following contributions: (i) We collect a new video dataset for each physical property, consisting of synthetic training and testing splits, as well as a real split for real world evaluation. (ii) We explore three ways to infer the physical property from videos: (a) an oracle method where we supply the visual cues that intrinsically reflect the property using classical computer vision techniques; (b) a simple read out mechanism using a visual prompt and trainable prompt vector for cross-attention on pre-trained video generative and self-supervised models; and (c) prompt strategies for Multi-modal Large Language Models (MLLMs). (iii) We show that video foundation models trained in a generative or self-supervised manner achieve a similar performance, though behind that of the oracle, and MLLMs are currently inferior to the other models, though their performance can be improved through suitable prompting.
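For the oracle route, some properties reduce to classical formulas once the right visual cue is tracked; for instance, elasticity follows from consecutive bounce apex heights:

```python
import math

def restitution_from_bounces(h0: float, h1: float) -> float:
    """Coefficient of restitution from successive apex heights (same units)."""
    return math.sqrt(h1 / h0)   # e = v_after / v_before = sqrt(h1 / h0)

print(restitution_from_bounces(1.00, 0.64))  # -> 0.8, a fairly elastic ball
```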
[215] Clink! Chop! Thud! – Learning Object Sounds from Real-World Interactions
Mengyu Yang, Yiming Chen, Haozheng Pei, Siddhant Agarwal, Arun Balajee Vasudevan, James Hays
Main category: cs.CV
TL;DR: The paper introduces sounding object detection to link sounds to objects involved in interactions, using a multimodal object-aware framework trained on egocentric videos with automatic segmentation masks and slot attention.
Details
Motivation: To evaluate if models can distinguish sounds from object interactions (e.g., spoon on hardwood vs carpet) by linking sounds to specific objects, inspired by human perception.
Method: Multimodal object-aware framework using in-the-wild egocentric videos, an automatic pipeline for object segmentation masks to guide training focus, and a slot attention visual encoder to enforce an object prior.
Result: Achieves state-of-the-art performance on the new sounding object detection task and existing multimodal action understanding tasks.
Conclusion: The proposed framework effectively links sounds to objects in interactions, demonstrating improved performance in multimodal understanding tasks.
Abstract: Can a model distinguish between the sound of a spoon hitting a hardwood floor versus a carpeted one? Everyday object interactions produce sounds unique to the objects involved. We introduce the sounding object detection task to evaluate a model’s ability to link these sounds to the objects directly involved. Inspired by human perception, our multimodal object-aware framework learns from in-the-wild egocentric videos. To encourage an object-centric approach, we first develop an automatic pipeline to compute segmentation masks of the objects involved to guide the model’s focus during training towards the most informative regions of the interaction. A slot attention visual encoder is used to further enforce an object prior. We demonstrate state-of-the-art performance on our new task along with existing multimodal action understanding tasks.
[216] StealthAttack: Robust 3D Gaussian Splatting Poisoning via Density-Guided Illusions
Bo-Hsu Ke, You-Zhe Xie, Yu-Lun Liu, Wei-Chen Chiu
Main category: cs.CV
TL;DR: The paper analyzes 3D Gaussian Splatting (3DGS) vulnerabilities to poisoning attacks and proposes a density-guided poisoning method that injects Gaussian points into low-density regions using KDE, creating viewpoint-dependent illusory objects while maintaining stealth.
Details
Motivation: As 3D scene representation methods like NeRF and 3DGS become widely used, understanding and addressing their security vulnerabilities becomes critical, particularly against image-level poisoning attacks.
Method: Proposes a density-guided poisoning method using Kernel Density Estimation (KDE) to identify low-density regions for strategic Gaussian point injection (site selection sketched below), combined with an adaptive noise strategy to disrupt multi-view consistency.
Result: Extensive experiments show superior performance compared to state-of-the-art techniques, with the method effectively embedding viewpoint-dependent illusory objects visible from poisoned views while minimally affecting innocent views.
Conclusion: The paper demonstrates significant vulnerabilities in 3DGS to poisoning attacks and provides a systematic evaluation protocol using KDE for objective benchmarking of attack difficulty in future research.
Abstract: 3D scene representation methods like Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have significantly advanced novel view synthesis. As these methods become prevalent, addressing their vulnerabilities becomes critical. We analyze 3DGS robustness against image-level poisoning attacks and propose a novel density-guided poisoning method. Our method strategically injects Gaussian points into low-density regions identified via Kernel Density Estimation (KDE), embedding viewpoint-dependent illusory objects clearly visible from poisoned views while minimally affecting innocent views. Additionally, we introduce an adaptive noise strategy to disrupt multi-view consistency, further enhancing attack effectiveness. We propose a KDE-based evaluation protocol to assess attack difficulty systematically, enabling objective benchmarking for future research. Extensive experiments demonstrate our method’s superior performance compared to state-of-the-art techniques. Project page: https://hentci.github.io/stealthattack/
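The site-selection stage can be illustrated with an off-the-shelf KDE; this is our simplification of the attack's first step, not the authors' implementation:

```python
import numpy as np
from scipy.stats import gaussian_kde

def low_density_sites(centers: np.ndarray, candidates: np.ndarray, k: int):
    # centers: (N, 3) existing Gaussian means; candidates: (M, 3) proposals.
    density = gaussian_kde(centers.T)(candidates.T)   # (M,) estimated density
    return candidates[np.argsort(density)[:k]]        # the k emptiest locations

rng = np.random.default_rng(0)
sites = low_density_sites(rng.normal(size=(500, 3)),
                          rng.uniform(-3.0, 3.0, size=(200, 3)), k=10)
# Illusory Gaussians injected at `sites` stay visible from poisoned viewpoints
# while barely disturbing the reconstruction seen from innocent views.
```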
[217] Optimal Control Meets Flow Matching: A Principled Route to Multi-Subject Fidelity
Eric Tillmann Bill, Enis Simsar, Thomas Hofmann
Main category: cs.CV
TL;DR: The paper introduces a theoretical framework and algorithms for improving multi-subject fidelity in text-to-image models by formulating subject disentanglement as stochastic optimal control over flow matching samplers.
Details
Motivation: Text-to-image models struggle with multi-subject prompts, showing attribute leakage, identity entanglement, and subject omissions, which motivates the need for principled solutions to improve multi-subject generation.
Method: Two architecture-agnostic algorithms: (1) a training-free test-time controller that perturbs the base velocity with a single-pass update, and (2) Adjoint Matching, a lightweight fine-tuning rule that regresses a control network to a backward adjoint signal while preserving base-model capabilities.
Result: Both algorithms consistently improve multi-subject alignment while maintaining base-model style across Stable Diffusion 3.5, FLUX, and Stable Diffusion XL. Test-time control runs efficiently on commodity GPUs, and fine-tuned controllers generalize to unseen prompts.
Conclusion: The proposed FOCUS (Flow Optimal Control for Unentangled Subjects) achieves state-of-the-art multi-subject fidelity across models, providing the first fine-tuning route explicitly designed for multi-subject generation.
Abstract: Text-to-image (T2I) models excel on single-entity prompts but struggle with multi-subject descriptions, often showing attribute leakage, identity entanglement, and subject omissions. We introduce the first theoretical framework with a principled, optimizable objective for steering sampling dynamics toward multi-subject fidelity. Viewing flow matching (FM) through stochastic optimal control (SOC), we formulate subject disentanglement as control over a trained FM sampler. This yields two architecture-agnostic algorithms: (i) a training-free test-time controller that perturbs the base velocity with a single-pass update, and (ii) Adjoint Matching, a lightweight fine-tuning rule that regresses a control network to a backward adjoint signal while preserving base-model capabilities. The same formulation unifies prior attention heuristics, extends to diffusion models via a flow-diffusion correspondence, and provides the first fine-tuning route explicitly designed for multi-subject fidelity. Empirically, on Stable Diffusion 3.5, FLUX, and Stable Diffusion XL, both algorithms consistently improve multi-subject alignment while maintaining base-model style. Test-time control runs efficiently on commodity GPUs, and fine-tuned controllers trained on limited prompts generalize to unseen ones. We further highlight FOCUS (Flow Optimal Control for Unentangled Subjects), which achieves state-of-the-art multi-subject fidelity across models.
[218] LiDAR-HMR: 3D Human Mesh Recovery from LiDAR
Bohao Fan, Wenzhao Zheng, Jianjiang Feng, Jie Zhou
Main category: cs.CV
TL;DR: First method to estimate 3D human body mesh from sparse LiDAR point clouds using sparse-to-dense reconstruction with graph transformers.
Details
Motivation: Address challenges of sparsity, noise, and incompletion in LiDAR point clouds for human pose and mesh estimation.
Method: Sparse-to-dense reconstruction scheme using a cascaded graph transformer (Graphormer) to incorporate point cloud features.
Result: Experimental results on three public databases demonstrate the approach’s effectiveness.
Conclusion: Proposed method successfully reconstructs 3D human mesh from sparse LiDAR point clouds.
Abstract: In recent years, point cloud perception tasks have been garnering increasing attention. This paper presents the first attempt to estimate 3D human body mesh from sparse LiDAR point clouds. We found that the major challenge in estimating human pose and mesh from point clouds lies in the sparsity, noise, and incompletion of LiDAR point clouds. Facing these challenges, we propose an effective sparse-to-dense reconstruction scheme to reconstruct 3D human mesh. This involves estimating a sparse representation of a human (3D human pose) and gradually reconstructing the body mesh. To better leverage the 3D structural information of point clouds, we employ a cascaded graph transformer (graphormer) to introduce point cloud features during sparse-to-dense reconstruction. Experimental results on three publicly available databases demonstrate the effectiveness of the proposed approach. Code: https://github.com/soullessrobot/LiDAR-HMR/
[219] Meta-Transfer Derm-Diagnosis: Exploring Few-Shot Learning and Transfer Learning for Skin Disease Classification in Long-Tail Distribution
Zeynep Özdemir, Hacer Yalim Keles, Ömer Özgür Tanrıöver
Main category: cs.CV
TL;DR: This paper compares three learning strategies for rare skin disease classification: episodic learning, supervised transfer learning, and contrastive self-supervised pretraining. Traditional transfer learning with data augmentation achieves state-of-the-art performance.
Details
Motivation: Rare skin disease modeling faces limited labeled data, a long-tailed distribution of samples, and dataset inconsistencies, creating a need for effective few-shot learning approaches.
Method: Compared three learning strategies in a few-shot framework: episodic learning, supervised transfer learning, and contrastive self-supervised pretraining. Evaluated five training setups on the ISIC2018, Derm7pt, and SD-198 datasets using MobileNetV2 and ViT architectures with batch-level data augmentation (MixUp, CutMix, ResizeMix; MixUp sketched below).
Result: Traditional transfer learning approaches consistently outperform episodic and self-supervised methods as training examples increase. When combined with batch-level data augmentation, these models achieve state-of-the-art performance on SD-198 and Derm7pt, and competitive results on ISIC2018.
Conclusion: Transfer learning with data augmentation is the most effective approach for rare skin disease classification in few-shot settings, outperforming more complex episodic and self-supervised methods.
Abstract: Building accurate models for rare skin diseases remains challenging due to the lack of sufficient labeled data and the inherently long-tailed distribution of available samples. These issues are further complicated by inconsistencies in how datasets are collected and their varying objectives. To address these challenges, we compare three learning strategies: episodic learning, supervised transfer learning, and contrastive self-supervised pretraining, within a few-shot learning framework. We evaluate five training setups on three benchmark datasets: ISIC2018, Derm7pt, and SD-198. Our findings show that traditional transfer learning approaches, particularly those based on MobileNetV2 and Vision Transformer (ViT) architectures, consistently outperform episodic and self-supervised methods as the number of training examples increases. When combined with batch-level data augmentation techniques such as MixUp, CutMix, and ResizeMix, these models achieve state-of-the-art performance on the SD-198 and Derm7pt datasets, and deliver highly competitive results on ISIC2018. All the source codes related to this work will be made publicly available soon at the provided URL.
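For reference, the standard MixUp operation used in the augmentation recipe (generic technique, not the paper's code):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha: float = 0.2):
    """Standard MixUp: convex-combine two samples and their one-hot labels."""
    lam = np.random.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```

CutMix and ResizeMix follow the same batch-level pattern but mix spatial patches rather than whole images.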
[220] DiffCut: Catalyzing Zero-Shot Semantic Segmentation with Diffusion Features and Recursive Normalized Cut
Paul Couairon, Mustafa Shukor, Jean-Emmanuel Haugeard, Matthieu Cord, Nicolas Thome
Main category: cs.CV
TL;DR: DiffCut is an unsupervised zero-shot segmentation method that uses diffusion UNet encoder features with a graph-based segmentation algorithm, outperforming previous state-of-the-art methods.
Details
Motivation: Foundation models are powerful across domains, but prior unsupervised image segmentation methods significantly lag behind supervised models; the authors aim to leverage diffusion features for better segmentation.
Method: Use the diffusion UNet encoder as a foundation vision encoder, extract features from its final self-attention block, and apply a recursive Normalized Cut algorithm for graph-based segmentation that regulates object granularity (one bisection step is sketched below).
Result: Significantly outperforms previous state-of-the-art methods on zero-shot segmentation, producing well-defined segmentation maps that capture intricate image details.
Conclusion: Diffusion UNet encoders contain remarkably accurate semantic knowledge that can serve as foundation vision encoders for downstream tasks.
Abstract: Foundation models have emerged as powerful tools across various domains including language, vision, and multimodal tasks. While prior works have addressed unsupervised image segmentation, they significantly lag behind supervised models. In this paper, we use a diffusion UNet encoder as a foundation vision encoder and introduce DiffCut, an unsupervised zero-shot segmentation method that solely harnesses the output features from the final self-attention block. Through extensive experimentation, we demonstrate that the utilization of these diffusion features in a graph-based segmentation algorithm significantly outperforms previous state-of-the-art methods on zero-shot segmentation. Specifically, we leverage a recursive Normalized Cut algorithm that softly regulates the granularity of detected objects and produces well-defined segmentation maps that precisely capture intricate image details. Our work highlights the remarkably accurate semantic knowledge embedded within diffusion UNet encoders that could then serve as foundation vision encoders for downstream tasks. Project page at https://diffcut-segmentation.github.io
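One normalized-cut bisection on feature affinities, as a simplified, recursion-free step of the procedure (our construction):

```python
import numpy as np

def ncut_bisect(feats: np.ndarray) -> np.ndarray:
    """feats: (N, D) L2-normalized patch features; returns a boolean bisection."""
    W = np.clip(feats @ feats.T, 0.0, None)            # cosine affinities
    d = W.sum(axis=1)
    L = np.eye(len(W)) - W / np.sqrt(np.outer(d, d))   # normalized Laplacian
    _, vecs = np.linalg.eigh(L)
    fiedler = vecs[:, 1]                               # 2nd-smallest eigenvector
    return fiedler > np.median(fiedler)                # threshold; recurse on halves

f = np.random.randn(64, 32)
f /= np.linalg.norm(f, axis=1, keepdims=True)
labels = ncut_bisect(f)
```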
[221] Interior Object Geometry via Fitted Frames
Stephen M. Pizer, Zhiyuan Liu, Junjie Zhao, Nicholas Tapp-Hughes, James Damon, Miaomiao Zhang, JS Marron, Mohsen Taheri, Jared Vicory
Main category: cs.CV
TL;DR: The paper proposes an evolutionary s-rep representation that uses fitted frames on object boundaries and interiors to create alignment-free geometric features with strong local correspondence across object populations, showing improved classification performance for hippocampal shape analysis.
Details
Motivation: To develop a geometric representation that provides strong locational correspondence across populations of anatomic objects, enabling powerful object statistics and improved classification performance.
Method: Represents objects as diffeomorphic deformations of ellipsoid interiors, uses a skeletal representation fitted throughout the deformation to model target objects from boundary meshes, and computes fitted frames on boundaries and interiors.
Result: The evolutionary s-rep method shows notably improved classification performance compared to two state-of-the-art methods for distinguishing hippocampal shape between individuals with disorders vs. others.
Conclusion: The evolutionary s-rep representation with fitted frames provides superior geometric features for statistical analysis and classification of anatomic objects, particularly for hippocampal shape analysis.
Abstract: We propose a means of computing fitted frames on the boundary and in the interior of objects and using them to provide the basis for producing geometric features from them that are not only alignment-free but most importantly can be made to correspond locally across a population of objects. We describe a representation targeted for anatomic objects which is designed to enable this strong locational correspondence within object populations and thus to provide powerful object statistics. It accomplishes this by understanding an object as the diffeomorphic deformation of the closure of the interior of an ellipsoid and by using a skeletal representation fitted throughout the deformation to produce a model of the target object, where the object is provided initially in the form of a boundary mesh. Via classification performance on hippocampal shape between individuals with a disorder vs. others, we compare our method to two state-of-the-art methods for producing object representations that are intended to capture geometric correspondence across a population of objects and to yield geometric features useful for statistics, and we show notably improved classification performance by this new representation, which we call the evolutionary s-rep. The geometric features that are derived from each of the representations, especially via fitted frames, are discussed.
[222] Sparkle: Mastering Basic Spatial Capabilities in Vision Language Models Elicits Generalization to Spatial Reasoning
Yihong Tang, Ao Qu, Zhaokai Wang, Dingyi Zhuang, Zhaofeng Wu, Wei Ma, Shenhao Wang, Yunhan Zheng, Zhan Zhao, Jinhua Zhao
Main category: cs.CV
TL;DR: The paper introduces Sparkle, a framework that enhances 2D spatial reasoning in VLMs by training them on three core spatial capabilities: direction comprehension, distance estimation, and localization using synthetic data, which improves performance on complex spatial tasks.
Details
Motivation: Current vision language models (VLMs) perform well on many tasks but often fail at spatial reasoning, which is essential for navigation and interaction with physical environments. They give implausible answers to simple spatial problems that humans solve effortlessly.
Method: The authors disentangle 2D spatial reasoning into three core components (direction comprehension, distance estimation, and localization) and introduce Sparkle framework that generates synthetic data to provide targeted supervision across these capabilities, creating instruction datasets for fine-tuning VLMs.
Result: Experiments show that VLMs fine-tuned with Sparkle improve not only on basic spatial tasks but also on composite and out-of-distribution real-world spatial reasoning tasks.
Conclusion: Enhancing basic spatial skills through synthetic generalization effectively advances complex spatial reasoning and offers a systematic strategy for boosting the spatial understanding of VLMs.
Abstract: Vision language models (VLMs) perform well on many tasks but often fail at spatial reasoning, which is essential for navigation and interaction with physical environments. Many spatial reasoning tasks depend on fundamental two-dimensional (2D) skills, yet our evaluation shows that state-of-the-art VLMs give implausible or incorrect answers to composite spatial problems, including simple pathfinding tasks that humans solve effortlessly. To address this, we enhance 2D spatial reasoning in VLMs by training them only on basic spatial capabilities. We first disentangle 2D spatial reasoning into three core components: direction comprehension, distance estimation, and localization. We hypothesize that mastering these skills substantially improves performance on complex spatial tasks that require advanced reasoning and combinatorial problem solving, while also generalizing to real-world scenarios. To test this, we introduce Sparkle, a framework that generates synthetic data to provide targeted supervision across these three capabilities and yields an instruction dataset for each. Experiments show that VLMs fine-tuned with Sparkle improve not only on basic tasks but also on composite and out-of-distribution real-world spatial reasoning tasks. These results indicate that enhancing basic spatial skills through synthetic generalization effectively advances complex spatial reasoning and offers a systematic strategy for boosting the spatial understanding of VLMs. Source codes of Sparkle are available at https://github.com/YihongT/Sparkle.
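The kind of synthetic supervision described is easy to picture with a toy generator. The sketch below fabricates question-answer pairs for the three basic capabilities on a 2D grid; all names and templates are hypothetical and only illustrate the data-generation idea, not Sparkle's actual pipeline.

```python
# Hypothetical synthetic-QA generator for the three basic spatial capabilities.
import random, math

def sample_scene(n=2, size=10):
    return {f"object_{i}": (random.randint(0, size), random.randint(0, size))
            for i in range(n)}

def direction_qa(scene):
    (a, (ax, ay)), (b, (bx, by)) = scene.items()
    horiz = "right of" if bx > ax else "left of"   # ties broken arbitrarily
    vert = "above" if by > ay else "below"
    return (f"Is {b} left of, right of, above, or below {a}?",
            f"{b} is {horiz} and {vert} {a}.")

def distance_qa(scene):
    (a, pa), (b, pb) = scene.items()
    return (f"What is the distance between {a} and {b}?",
            f"{math.dist(pa, pb):.1f} units")

def localization_qa(scene):
    name, (x, y) = next(iter(scene.items()))
    return (f"Where is {name}?", f"{name} is at grid cell ({x}, {y}).")

scene = sample_scene()
for make_qa in (direction_qa, distance_qa, localization_qa):
    q, a = make_qa(scene)
    print(q, "->", a)
```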
[223] There and Back Again: On the relation between Noise and Image Inversions in Diffusion Models
Łukasz Staniszewski, Łukasz Kuciński, Kamil Deja
Main category: cs.CV
TL;DR: DDIM inversion in diffusion models creates structural patterns in latent encodings, limiting editability. The paper identifies issues with first inversion steps and proposes replacing them with forward diffusion to improve latent space quality.
Details
Motivation: Diffusion models lack low-dimensional latent spaces for editable features, and current inversion methods create structural patterns that reduce manipulation capabilities.
Method: Replace the first DDIM inversion steps with a forward diffusion process to decorrelate latent encodings and improve noise diversity.
Result: The proposed fix successfully decorrelates latent encodings, enabling higher quality image edits and interpolations compared to prior inversion methods.
Conclusion: Simple modification to DDIM inversion by incorporating forward diffusion steps resolves structural pattern issues and improves latent space editability.
Abstract: Diffusion Models achieve state-of-the-art performance in generating new samples but lack a low-dimensional latent space that encodes the data into editable features. Inversion-based methods address this by reversing the denoising trajectory, transferring images to their approximated starting noise. In this work, we thoroughly analyze this procedure and focus on the relation between the initial noise, the generated samples, and their corresponding latent encodings obtained through the DDIM inversion. First, we show that latents exhibit structural patterns in the form of less diverse noise predicted for smooth image areas (e.g., plain sky). Through a series of analyses, we trace this issue to the first inversion steps, which fail to provide accurate and diverse noise. Consequently, the DDIM inversion space is notably less manipulative than the original noise. We show that prior inversion methods do not fully resolve this issue, but our simple fix, where we replace the first DDIM inversion steps with a forward diffusion process, successfully decorrelates latent encodings and enables higher-quality edits and interpolations. The code is available at https://github.com/luk-st/taba.
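The proposed fix is simple enough to sketch. Assuming a standard DDIM setup with a cumulative-alpha schedule and an existing `ddim_inversion_step` routine (both placeholders here), the idea is to jump to an intermediate timestep with the closed-form forward process and only then run inversion:

```python
# Minimal sketch of the fix under standard DDIM assumptions; the schedule and
# `ddim_inversion_step` are assumed to exist and are not defined here.
import torch

def invert_with_forward_start(x0, alphas_cumprod, ddim_inversion_step,
                              t_switch, T):
    """Invert image x0 to noise, replacing the first steps with forward diffusion."""
    # Forward diffusion to t_switch: x_t = sqrt(a_t) x0 + sqrt(1 - a_t) eps
    a_t = alphas_cumprod[t_switch]
    noise = torch.randn_like(x0)
    x_t = a_t.sqrt() * x0 + (1.0 - a_t).sqrt() * noise

    # Standard DDIM inversion for the remaining timesteps t_switch..T
    for t in range(t_switch, T):
        x_t = ddim_inversion_step(x_t, t)
    return x_t
```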
[224] What Makes a Good Dataset for Knowledge Distillation?
Logan Frank, Jim Davis
Main category: cs.CV
TL;DR: This paper explores using alternative datasets for knowledge distillation when the teacher’s original dataset is unavailable, finding that even synthetic imagery can work effectively.
Details
Motivation: Traditional knowledge distillation assumes access to the teacher's original dataset, but this isn't always possible in scenarios like continual learning or with proprietary datasets. The paper investigates what makes a good surrogate dataset for distillation.
Method: The authors explore multiple surrogate distillation datasets including various types of imagery, even unnatural synthetic imagery, to identify criteria for effective knowledge transfer from teacher to student models.
Result: The research demonstrates that many different datasets, including unnatural synthetic imagery, can serve as suitable alternatives for knowledge distillation, challenging the assumption that only real in-domain data is viable.
Conclusion: The paper identifies specific criteria that characterize good datasets for distillation and shows that practitioners have more options than just real in-domain data when the original teacher dataset is unavailable.
Abstract: Knowledge distillation (KD) has been a popular and effective method for model compression. One important assumption of KD is that the teacher’s original dataset will also be available when training the student. However, in situations such as continual learning and distilling large models trained on company-withheld datasets, having access to the original data may not always be possible. This leads practitioners towards utilizing other sources of supplemental data, which could yield mixed results. One must then ask: “what makes a good dataset for transferring knowledge from teacher to student?” Many would assume that only real in-domain imagery is viable, but is that the only option? In this work, we explore multiple possible surrogate distillation datasets and demonstrate that many different datasets, even unnatural synthetic imagery, can serve as a suitable alternative in KD. From examining these alternative datasets, we identify and present various criteria describing what makes a good dataset for distillation. Source code is available at https://github.com/osu-cvl/good-kd-dataset.
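For context, the surrogate-dataset setting leaves the distillation objective itself unchanged; only the images fed through teacher and student differ from the teacher's original training data. A standard soft-label distillation loop looks roughly like this (a generic sketch, not the paper's exact recipe):

```python
# Generic knowledge-distillation sketch: the student matches the teacher's
# soft predictions on a surrogate dataset that need not contain in-domain data.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """Soft-label KL distillation loss (Hinton-style)."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_student = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * t * t

def distill_epoch(student, teacher, surrogate_loader, optimizer):
    teacher.eval()
    for images, _ in surrogate_loader:   # labels are unused: any imagery works
        with torch.no_grad():
            t_logits = teacher(images)
        loss = kd_loss(student(images), t_logits)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```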
[225] VideoGen-of-Thought: Step-by-step generating multi-shot video with minimal manual intervention
Mingzhe Zheng, Yongqi Xu, Haojian Huang, Xuran Ma, Yexin Liu, Wenjie Shu, Yatian Pang, Feilong Tang, Qifeng Chen, Harry Yang, Ser-Nam Lim
Main category: cs.CV
TL;DR: VGoT is a framework that automates multi-shot video generation from single sentences by addressing narrative fragmentation, visual inconsistency, and transition artifacts through dynamic storyline modeling, identity-aware cross-shot propagation, and adjacent latent transition mechanisms.
Details
Motivation: Current video generation models produce disjointed multi-shot narratives with fractured storylines and visual inconsistencies, requiring extensive manual editing and lacking cross-scene continuity for movie-like content.
Method: VGoT uses dynamic storyline modeling to convert prompts into shot drafts with detailed specifications across five domains, identity-aware cross-shot propagation with IPP tokens for character consistency, and adjacent latent transition mechanisms for seamless shot transitions.
Result: VGoT surpasses baselines by 20.4% in within-shot face consistency and 17.4% in style consistency, while requiring 10x fewer manual adjustments compared to existing methods.
Conclusion: VGoT bridges the gap between raw visual synthesis and director-level storytelling, enabling automated multi-shot video generation with cohesive narratives and consistent visual quality.
Abstract: Current video generation models excel at short clips but fail to produce cohesive multi-shot narratives due to disjointed visual dynamics and fractured storylines. Existing solutions either rely on extensive manual scripting/editing or prioritize single-shot fidelity over cross-scene continuity, limiting their practicality for movie-like content. We introduce VideoGen-of-Thought (VGoT), a step-by-step framework that automates multi-shot video synthesis from a single sentence by systematically addressing three core challenges: (1) Narrative fragmentation: Existing methods lack structured storytelling. We propose dynamic storyline modeling, which turns the user prompt into concise shot drafts and then expands them into detailed specifications across five domains (character dynamics, background continuity, relationship evolution, camera movements, and HDR lighting) with self-validation to ensure logical progress. (2) Visual inconsistency: previous approaches struggle to maintain consistent appearance across shots. Our identity-aware cross-shot propagation builds identity-preserving portrait (IPP) tokens that keep character identity while allowing controlled trait changes (expressions, aging) required by the story. (3) Transition artifacts: Abrupt shot changes disrupt immersion. Our adjacent latent transition mechanisms implement boundary-aware reset strategies that process adjacent shots’ features at transition points, enabling seamless visual flow while preserving narrative continuity. Combined in a training-free pipeline, VGoT surpasses strong baselines by 20.4% in within-shot face consistency and 17.4% in style consistency, while requiring 10x fewer manual adjustments. VGoT bridges the gap between raw visual synthesis and director-level storytelling for automated multi-shot video generation.
[226] Post-hoc Probabilistic Vision-Language Models
Anton Baumann, Rui Li, Marcus Klasson, Santeri Mentu, Shyamgopal Karthik, Zeynep Akata, Arno Solin, Martin Trapp
Main category: cs.CV
TL;DR: The paper proposes a post-hoc uncertainty estimation method for vision-language models that doesn’t require additional training, using Bayesian posterior approximation over the last layers to quantify uncertainties over cosine similarities.
Details
Motivation: Current VLMs use deterministic mappings that fail to capture uncertainties from domain shifts in downstream tasks, which is problematic for safety-critical applications.
Method: Leverages Bayesian posterior approximation over the last layers in VLMs and analytically quantifies uncertainties over cosine similarities without requiring additional training.
Result: The method achieves improved and well-calibrated predictive uncertainties, interpretable uncertainty estimates, and sample-efficient active learning compared to baselines.
Conclusion: The approach shows promise for safety-critical applications of large-scale models by providing reliable uncertainty quantification.
Abstract: Vision-language models (VLMs), such as CLIP and SigLIP, have found remarkable success in classification, retrieval, and generative tasks. For this, VLMs deterministically map images and text descriptions to a joint latent space in which their similarity is assessed using the cosine similarity. However, a deterministic mapping of inputs fails to capture uncertainties over concepts arising from domain shifts when used in downstream tasks. In this work, we propose post-hoc uncertainty estimation in VLMs that does not require additional training. Our method leverages a Bayesian posterior approximation over the last layers in VLMs and analytically quantifies uncertainties over cosine similarities. We demonstrate its effectiveness for uncertainty quantification and support set selection in active learning. Compared to baselines, we obtain improved and well-calibrated predictive uncertainties, interpretable uncertainty estimates, and sample-efficient active learning. Our results show promise for safety-critical applications of large-scale models.
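The paper quantifies uncertainty over cosine similarities analytically; as a rough illustration of the same idea, the sketch below instead uses Monte Carlo sampling from an assumed Gaussian posterior over the last projection layer (e.g., from a Laplace approximation). All shapes and names are illustrative.

```python
# Monte Carlo illustration of last-layer uncertainty over cosine similarity;
# the paper computes this analytically, which this sketch does not attempt.
import torch
import torch.nn.functional as F

def cosine_similarity_uncertainty(feat_img, feat_txt, W_mean, W_var, n_samples=64):
    """Mean/std of cosine similarity under sampled last-layer weights.

    feat_img: (d_in,) penultimate image feature; feat_txt: (d_out,) text embedding.
    W_mean, W_var: (d_out, d_in) posterior mean and (diagonal) variance.
    """
    sims = []
    for _ in range(n_samples):
        W = W_mean + W_var.sqrt() * torch.randn_like(W_mean)  # sample weights
        z_img = W @ feat_img                                  # projected embedding
        sims.append(F.cosine_similarity(z_img, feat_txt, dim=0))
    sims = torch.stack(sims)
    return sims.mean(), sims.std()
```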
[227] DreamOmni: Unified Image Generation and Editing
Bin Xia, Yuechen Zhang, Jingyao Li, Chengyao Wang, Yitong Wang, Xinglong Wu, Bei Yu, Jiaya Jia
Main category: cs.CV
TL;DR: DreamOmni is a unified model for image generation and editing that integrates text-to-image models with various editing tasks through joint training, using a synthetic data pipeline for efficient dataset creation.
Details
Motivation: Current text-to-image models focus on generation quality but lack integration with downstream editing tasks, limiting their practical usability and synergistic benefits across different vision tasks.
Method: Proposes a unified framework that jointly trains text-to-image generation and editing tasks, using a synthetic data pipeline with sticker-like elements to efficiently create high-quality editing datasets for instruction-based and drag-based editing.
Result: Extensive experiments confirm the model’s effectiveness, with T2I training enhancing concept understanding and generation quality, while editing training improves the model’s grasp of editing nuances, significantly boosting editing performance.
Conclusion: DreamOmni successfully demonstrates that unified multitasking in computer vision can enhance model usability and foster synergistic benefits across generation and editing tasks, with code and model to be released.
Abstract: Currently, the success of large language models (LLMs) illustrates that a unified multitasking approach can significantly enhance model usability, streamline deployment, and foster synergistic benefits across different tasks. However, in computer vision, while text-to-image (T2I) models have significantly improved generation quality through scaling up, their framework design did not initially consider how to unify with downstream tasks, such as various types of editing. To address this, we introduce DreamOmni, a unified model for image generation and editing. We begin by analyzing existing frameworks and the requirements of downstream tasks, proposing a unified framework that integrates both T2I models and various editing tasks. Furthermore, another key challenge is the efficient creation of high-quality editing data, particularly for instruction-based and drag-based editing. To this end, we develop a synthetic data pipeline using sticker-like elements to synthesize accurate, high-quality datasets efficiently, which enables editing data scaling up for unified model training. For training, DreamOmni jointly trains T2I generation and downstream tasks. T2I training enhances the model’s understanding of specific concepts and improves generation quality, while editing training helps the model grasp the nuances of the editing task. This collaboration significantly boosts editing performance. Extensive experiments confirm the effectiveness of DreamOmni. The code and model will be released.
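One can imagine the sticker-style data pipeline along the following lines; this is a hypothetical sketch of how compositing an element onto a base image yields a (before, after, instruction) training triple, not DreamOmni's actual generator.

```python
# Hypothetical sticker-compositing sketch; assumes float images and that the
# sticker fits within the base image bounds.
import numpy as np

def make_edit_pair(base, sticker, alpha_mask, top_left, name="sticker"):
    """Composite `sticker` onto `base`; return before/after plus an instruction."""
    edited = base.copy()
    y, x = top_left
    h, w = sticker.shape[:2]
    region = edited[y:y + h, x:x + w]
    edited[y:y + h, x:x + w] = (alpha_mask[..., None] * sticker
                                + (1 - alpha_mask[..., None]) * region)
    return base, edited, f"add a {name} at position ({x}, {y})"
```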
[228] A Survey on Dynamic Neural Networks: from Computer Vision to Multi-modal Sensor Fusion
Fabio Montello, Ronja Güldenring, Simone Scardapane, Lazaros Nalpantidis
Main category: cs.CV
TL;DR: This paper presents a comprehensive survey of Dynamic Neural Networks for Computer Vision, providing a taxonomy based on adaptive components and highlighting their benefits for sensor fusion applications.
Details
Motivation: Static model compression techniques neglect input complexity variations, while Dynamic Neural Networks adapt computation to specific inputs, but current research is fragmented and needs unification.
Method: The authors conduct a systematic survey of existing Dynamic Neural Networks research, develop a logical taxonomy based on which network component is adaptive (output, computation graph, or input), and analyze applications in sensor fusion.
Result: The survey synthesizes and unifies fragmented research, provides a clear taxonomy for Dynamic Neural Networks, and demonstrates their benefits for sensor fusion including better adaptivity, noise reduction, and information prioritization.
Conclusion: Dynamic Neural Networks offer significant advantages over static compression methods by adapting computation to input complexity, and they show particular promise for sensor fusion applications. The authors provide a curated repository of surveyed papers with summaries and code links.
Abstract: Model compression is essential in the deployment of large Computer Vision models on embedded devices. However, static optimization techniques (e.g., pruning and quantization) neglect the fact that different inputs have different complexities and thus require different amounts of computation. Dynamic Neural Networks allow the amount of computation to be conditioned on the specific input. The current literature on the topic is very extensive and fragmented. We present a comprehensive survey that synthesizes and unifies existing Dynamic Neural Networks research in the context of Computer Vision. Additionally, we provide a logical taxonomy based on which component of the network is adaptive: the output, the computation graph or the input. Furthermore, we argue that Dynamic Neural Networks are particularly beneficial in the context of Sensor Fusion for better adaptivity, noise reduction and information prioritization. We present preliminary works in this direction. We complement this survey with a curated repository listing all the surveyed papers, each with a brief summary of the solution and the code base when available: https://github.com/DTU-PAS/awesome-dynn-for-cv
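As a concrete instance of the "adaptive computation graph" branch of the taxonomy, the toy early-exit classifier below skips later blocks once an intermediate head is confident (shown for a single sample; batched early exiting needs per-sample bookkeeping):

```python
# Toy early-exit network: one example of conditioning computation on the input.
import torch.nn as nn

class EarlyExitNet(nn.Module):
    def __init__(self, dim=128, num_classes=10, threshold=0.9):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(3)])
        self.heads = nn.ModuleList(
            [nn.Linear(dim, num_classes) for _ in range(3)])
        self.threshold = threshold

    def forward(self, x):
        for block, head in zip(self.blocks, self.heads):
            x = block(x)
            probs = head(x).softmax(dim=-1)
            if probs.max() >= self.threshold:  # confident: skip remaining blocks
                return probs
        return probs                           # fall through to the last head
```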
[229] Diffusion Adversarial Post-Training for One-Step Video Generation
Shanchuan Lin, Xin Xia, Yuxi Ren, Ceyuan Yang, Xuefeng Xiao, Lu Jiang
Main category: cs.CV
TL;DR: APT enables real-time one-step video generation by adversarially post-training diffusion models, achieving high-quality 1280x720 videos and 1024px images.
Details
Motivation: Diffusion models are slow due to iterative generation; existing distillation methods suffer quality degradation in one-step generation.
Method: Adversarial Post-Training (APT) with model architecture improvements, training procedure enhancements, and approximated R1 regularization.
Result: Seaweed-APT generates 2-second 1280x720 24fps videos in real time with single forward step; 1024px image quality comparable to SOTA.
Conclusion: APT successfully enables high-quality one-step video and image generation through adversarial post-training of diffusion models.
Abstract: Diffusion models are widely used for image and video generation, but their iterative generation process is slow and expensive. While existing distillation approaches have demonstrated the potential for one-step generation in the image domain, they still suffer from significant quality degradation. In this work, we propose Adversarial Post-Training (APT) against real data following diffusion pre-training for one-step video generation. To improve the training stability and quality, we introduce several improvements to the model architecture and training procedures, along with an approximated R1 regularization objective. Empirically, our experiments show that our adversarial post-trained model, Seaweed-APT, can generate 2-second, 1280x720, 24fps videos in real time using a single forward evaluation step. Additionally, our model is capable of generating 1024px images in a single step, achieving quality comparable to state-of-the-art methods.
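The abstract mentions an approximated R1 regularization objective. The classic R1 term penalizes the squared gradient norm of the discriminator at real data; one common approximation replaces the double backward pass with a finite difference under a small input perturbation. The sketch below shows that finite-difference variant; the exact form and scaling used in APT are assumptions here.

```python
# Finite-difference surrogate for R1 regularization (sketch, not APT's code):
# for eps ~ N(0, I), E[(D(x) - D(x + sigma*eps))^2] / sigma^2 approximates the
# squared gradient norm of D at x, avoiding higher-order gradients.
import torch

def approx_r1_penalty(discriminator, real, sigma=0.01):
    perturbed = real + sigma * torch.randn_like(real)
    d_real = discriminator(real)
    d_pert = discriminator(perturbed)
    # The 1/sigma^2 rescaling matches the gradient-norm interpretation; the
    # actual scaling is a design choice.
    return ((d_real - d_pert) ** 2).mean() / (sigma ** 2)
```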
[230] Multiple Queries with Multiple Keys: A Precise Prompt Matching Paradigm for Prompt-based Continual Learning
Dunwei Tu, Huiyu Yi, Yuchi Wang, Baile Xu, Jian Zhao, Furao Shen
Main category: cs.CV
TL;DR: Proposes MQMK prompt matching paradigm for continual learning, using multiple queries with multiple keys to improve prompt selection accuracy and reduce catastrophic forgetting.
Details
Motivation: Existing prompt-based continual learning methods suffer from low accuracy in prompt selection, leading to biased knowledge and predictions.
Method: Multiple Queries with Multiple Keys (MQMK) paradigm: Multiple queries enable precise breadth search with task-specific knowledge, multiple keys perform deep search with fine-grained feature distribution representation, and local matching reduces interference.
Result: Enhances prompt matching rate by over 30% in challenging scenarios and achieves state-of-the-art performance on three continual learning benchmarks.
Conclusion: MQMK effectively addresses prompt selection accuracy issues in continual learning through precise distribution matching between training and test data.
Abstract: Continual learning requires machine learning models to continuously acquire new knowledge in dynamic environments while avoiding the forgetting of previous knowledge. Prompt-based continual learning methods effectively address the issue of catastrophic forgetting through prompt expansion and selection. However, existing approaches often suffer from low accuracy in prompt selection, which can result in the model receiving biased knowledge and making biased predictions. To address this issue, we propose the Multiple Queries with Multiple Keys (MQMK) prompt matching paradigm for precise prompt selection. The goal of MQMK is to select the prompts whose training data distribution most closely matches that of the test sample. Specifically, Multiple Queries enable precise breadth search by introducing task-specific knowledge, while Multiple Keys perform deep search by representing the feature distribution of training samples at a fine-grained level. Each query is designed to perform local matching with a designated task to reduce interference across queries. Experiments show that MQMK enhances the prompt matching rate by over 30% in challenging scenarios and achieves state-of-the-art performance on three widely adopted continual learning benchmarks. The code is available at https://github.com/DunweiTu/MQMK.
[231] Diffusion Model as a Noise-Aware Latent Reward Model for Step-Level Preference Optimization
Tao Zhang, Cheng Da, Kun Ding, Huan Yang, Kun Jin, Yan Li, Tingting Gao, Di Zhang, Shiming Xiang, Chunhong Pan
Main category: cs.CV
TL;DR: Latent Preference Optimization (LPO) uses diffusion models as step-level reward models in latent space, achieving better alignment with human preferences and 2.5-28x training speedup over existing methods.
Details
Motivation: Existing preference optimization methods use Vision-Language Models as pixel-level reward models, which struggle with noisy images at different timesteps and require complex pixel-space transformations.
Method: Proposed Latent Reward Model (LRM) that repurposes diffusion model components to predict preferences of latent images at arbitrary timesteps, and Latent Preference Optimization (LPO) for step-level optimization in noisy latent space.
Result: LPO significantly improves alignment with general, aesthetic, and text-image alignment preferences while achieving 2.5-28x training speedup over existing methods.
Conclusion: Pre-trained diffusion models are naturally suited for step-level reward modeling in latent space, enabling more efficient and effective preference optimization.
Abstract: Preference optimization for diffusion models aims to align them with human preferences for images. Previous methods typically use Vision-Language Models (VLMs) as pixel-level reward models to approximate human preferences. However, when used for step-level preference optimization, these models face challenges in handling noisy images of different timesteps and require complex transformations into pixel space. In this work, we show that pre-trained diffusion models are naturally suited for step-level reward modeling in the noisy latent space, as they are explicitly designed to process latent images at various noise levels. Accordingly, we propose the Latent Reward Model (LRM), which repurposes components of the diffusion model to predict preferences of latent images at arbitrary timesteps. Building on LRM, we introduce Latent Preference Optimization (LPO), a step-level preference optimization method conducted directly in the noisy latent space. Experimental results indicate that LPO significantly improves the model’s alignment with general, aesthetic, and text-image alignment preferences, while achieving a 2.5-28x training speedup over existing preference optimization methods. Our code and models are available at https://github.com/Kwai-Kolors/LPO.
[232] L4P: Towards Unified Low-Level 4D Vision Perception
Abhishek Badki, Hang Su, Bowen Wen, Orazio Gallo
Main category: cs.CV
TL;DR: L4P is a unified feedforward architecture for low-level 4D perception tasks that uses a pre-trained ViT-based video encoder with lightweight task-specific heads, achieving competitive performance across multiple tasks simultaneously.
Details
Motivation: Current state-of-the-art methods use specialized architectures for each low-level 4D perception task, but spatio-temporal pixel relationships in videos contain critical information that a single general model should be able to leverage for multiple tasks.
Method: L4P combines a pre-trained ViT-based video encoder with lightweight per-task heads that require minimal training, enabling unified processing of multiple 4D perception tasks in a feedforward manner.
Result: The method achieves competitive performance with specialized approaches on both dense tasks (depth, optical flow estimation) and sparse tasks (2D/3D tracking), while solving all tasks simultaneously in comparable time to single-task methods.
Conclusion: L4P demonstrates that a general-purpose architecture can effectively solve multiple low-level 4D perception tasks without sacrificing performance or efficiency compared to specialized approaches.
Abstract: The spatio-temporal relationship between the pixels of a video carries critical information for low-level 4D perception tasks. A single model that reasons about it should be able to solve several such tasks well. Yet, most state-of-the-art methods rely on architectures specialized for the task at hand. We present L4P, a feedforward, general-purpose architecture that solves low-level 4D perception tasks in a unified framework. L4P leverages a pre-trained ViT-based video encoder and combines it with per-task heads that are lightweight and therefore do not require extensive training. Despite its general and feedforward formulation, our method is competitive with existing specialized methods on both dense tasks, such as depth or optical flow estimation, and sparse tasks, such as 2D/3D tracking. Moreover, it solves all tasks at once in a time comparable to that of single-task methods.
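The architectural pattern, a shared pre-trained video encoder feeding lightweight per-task heads, can be captured in a few lines. This is a minimal sketch under assumed shapes, not the L4P code; real heads for dense tasks like depth would be decoders rather than linear layers.

```python
# Minimal shared-encoder / per-task-head pattern (illustrative only).
import torch.nn as nn

class UnifiedPerception(nn.Module):
    def __init__(self, encoder, feat_dim, tasks):
        super().__init__()
        self.encoder = encoder                 # pre-trained, e.g. ViT-based
        self.heads = nn.ModuleDict(
            {name: nn.Linear(feat_dim, out_dim)
             for name, out_dim in tasks.items()})

    def forward(self, video, task):
        feats = self.encoder(video)            # shared spatio-temporal features
        return self.heads[task](feats)         # lightweight task-specific head
```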
[233] SCoT: Unifying Consistency Models and Rectified Flows via Straight-Consistent Trajectories
Zhangkai Wu, Xuhui Fan, Hongyu Wu, Longbing Cao
Main category: cs.CV
TL;DR: SCoT bridges consistency models and rectified flow methods to create straight and consistent trajectories for faster diffusion model sampling.
Details
Motivation: Existing methods have limitations: consistency models lack sampling efficiency, while rectified flow methods introduce approximation errors through numerical solvers.
Method: Proposes SCoT model that balances straight trajectories (constant gradient mapping) and trajectory consistency, combining benefits of both approaches.
Result: Extensive experiments show SCoT achieves effective and efficient sampling performance.
Conclusion: SCoT successfully bridges the gap between consistency models and rectified flow methods, enabling fast sampling with both straight and consistent trajectories.
Abstract: Pre-trained diffusion models are commonly used to generate clean data (e.g., images) from random noises, effectively forming pairs of noises and corresponding clean images. Distillation on these pre-trained models can be viewed as the process of constructing advanced trajectories within the pair to accelerate sampling. For instance, consistency model distillation develops consistent projection functions to regulate trajectories, although sampling efficiency remains a concern. Rectified flow method enforces straight trajectories to enable faster sampling, yet relies on numerical ODE solvers, which may introduce approximation errors. In this work, we bridge the gap between the consistency model and the rectified flow method by proposing a Straight Consistent Trajectory (SCoT) model. SCoT enjoys the benefits of both approaches for fast sampling, producing trajectories with consistent and straight properties simultaneously. These dual properties are strategically balanced by targeting two critical objectives: (1) regulating the gradient of SCoT's mapping to a constant, (2) ensuring trajectory consistency. Extensive experimental results demonstrate the effectiveness and efficiency of SCoT.
[234] How far can we go with ImageNet for Text-to-Image generation?
L. Degeorge, A. Ghosh, N. Dufour, D. Picard, V. Kalogeiton
Main category: cs.CV
TL;DR: Challenges the ‘bigger is better’ paradigm in text-to-image generation by showing ImageNet with proper augmentations can outperform models trained on massive web-scraped datasets, using only 1/10th parameters and 1/1000th training images.
Details
Motivation: To challenge the established paradigm that prioritizes massive web-scraped datasets over data availability and reproducibility in text-to-image generation.
Method: Use ImageNet enhanced with well-designed text and image augmentations instead of massive web-scraped collections, and demonstrate finetuning on task-specific datasets.
Result: Achieved +6% overall score over SD-XL on GenEval and +5% on DPGBench while using 1/10th parameters and 1/1000th training images. Showed ImageNet pretrained models can be effectively finetuned for specific tasks.
Conclusion: ImageNet is sufficient for acquiring general capabilities in text-to-image generation, enabling more reproducible research with standardized training requiring only 500 H100 hours.
Abstract: Recent text-to-image (T2I) generation models have achieved remarkable success by training on billion-scale datasets, following a 'bigger is better' paradigm that prioritizes data quantity over availability (closed vs open source) and reproducibility (data decay vs established collections). We challenge this established paradigm by demonstrating that one can achieve the capabilities of models trained on massive web-scraped collections using only ImageNet enhanced with well-designed text and image augmentations. With this much simpler setup, we achieve a +6% overall score over SD-XL on GenEval and +5% on DPGBench while using just 1/10th the parameters and 1/1000th the training images. We also show that ImageNet pretrained models can be finetuned on task-specific datasets (like for high-resolution aesthetic applications) with good results, indicating that ImageNet is sufficient for acquiring general capabilities. This opens the way for more reproducible research, as ImageNet is widely available and the proposed standardized training setup requires only 500 H100 hours to train a text-to-image model.
[235] What are You Looking at? Modality Contribution in Multimodal Medical Deep Learning
Christian Gapp, Elias Tappeiner, Martin Welk, Karl Fritscher, Elke Ruth Gizewski, Rainer Schubert
Main category: cs.CV
TL;DR: Developed an occlusion-based modality contribution method to measure importance of each modality in multimodal models, applied to medical problems, revealing modality preferences and dataset imbalances.
Details
Motivation: To understand how multimodal models process information from different sources, especially in medical applications where multimodal patient data is prevalent but model interpretability is underexplored.
Method: Implemented an occlusion-based modality contribution method that is model- and performance-agnostic, quantitatively measuring importance of each modality for model task completion.
Result: Found networks have modality preferences leading to unimodal collapses, some datasets are inherently imbalanced, and provided fine-grained quantitative and visual attribute importance for each modality.
Conclusion: The metric offers valuable insights for multimodal model development and dataset creation, contributing to interpretability in deep learning and facilitating integration of multimodal AI into clinical practice.
Abstract: Purpose: High-dimensional, multimodal data can nowadays be analyzed by huge deep neural networks with little effort. Several fusion methods for bringing together different modalities have been developed. Given the prevalence of high-dimensional, multimodal patient data in medicine, the development of multimodal models marks a significant advancement. However, how these models process information from individual sources in detail is still underexplored. Methods: To this end, we implemented an occlusion-based modality contribution method that is both model- and performance-agnostic. This method quantitatively measures the importance of each modality in the dataset for the model to fulfill its task. We applied our method to three different multimodal medical problems for experimental purposes. Results: We found that some networks have modality preferences that tend toward unimodal collapses, while some datasets are imbalanced from the ground up. Moreover, we provide fine-grained quantitative and visual attribute importance for each modality. Conclusion: Our metric offers valuable insights that can support the advancement of multimodal model development and dataset creation. By introducing this method, we contribute to the growing field of interpretability in deep learning for multimodal research. This approach helps to facilitate the integration of multimodal AI into clinical practice. Our code is publicly available at https://github.com/ChristianGappGit/MC_MMD.
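The occlusion-based score is straightforward to implement in a model-agnostic way: evaluate the task metric with each modality occluded in turn and record the drop. The sketch below zeroes a modality out as the occlusion; the paper's exact occlusion scheme may differ.

```python
# Model-agnostic occlusion-based modality contribution (illustrative sketch).
import torch

def modality_contributions(model, batches, metric, modalities):
    """Return {modality: metric drop when that modality is occluded}."""
    def evaluate(occlude=None):
        scores = []
        for inputs, target in batches:          # inputs: dict of modality tensors
            if occlude is not None:
                inputs = dict(inputs)
                inputs[occlude] = torch.zeros_like(inputs[occlude])
            with torch.no_grad():
                scores.append(metric(model(inputs), target))
        return sum(scores) / len(scores)

    base = evaluate()
    return {m: base - evaluate(occlude=m) for m in modalities}
```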
[236] An Improved Pure Fully Connected Neural Network for Rice Grain Classification
Wanke Xia, Bo Lv, Xunwen Xiang, Ruoxin Peng, Haoqi Chu, Xinlei Zhu, Zhiyu Yang, Lili Yang
Main category: cs.CV
TL;DR: The paper proposes two enhancements to fully connected neural networks for rice grain classification: two-stage training and position correction preprocessing, improving accuracy from 97% to 99% for distinguishing similar rice varieties.
Details
Motivation: Classical deep learning models struggle to distinguish rice varieties with similar external characteristics, leading to misclassifications. The authors aim to improve classification accuracy while maintaining model transparency and feasibility.
Method: Used pure fully connected neural network with two key improvements: 1) Changed from one-stage to two-stage training to better distinguish similar rice types, 2) Modified preprocessing from random tilting to horizontal/vertical position correction. Dataset included global and domestic rice images from websites and laboratories.
Result: The proposed methods significantly improved classification accuracy from 97% to 99%, effectively enhancing the model’s ability to distinguish between similar rice varieties.
Conclusion: Two subtle methods - two-stage training and position correction preprocessing - can remarkably enhance the classification ability of deep learning models for rice grain classification, achieving high accuracy while maintaining model transparency.
Abstract: Rice is a staple food for a significant portion of the world's population, providing essential nutrients and serving as a versatile ingredient in a wide range of culinary traditions. Recently, the use of deep learning has enabled automated classification of rice, improving accuracy and efficiency. However, classical models based on first-stage training may face difficulties in distinguishing between rice varieties with similar external characteristics, leading to misclassifications. Considering the transparency and feasibility of the model, we selected and gradually improved a pure fully connected neural network to achieve classification of rice grain. The dataset we used contains both global and domestic rice images obtained from websites and laboratories, respectively. First, the training mode was changed from one-stage training to two-stage training, which significantly contributes to distinguishing two similar types of rice. Secondly, the preprocessing method was changed from random tilting to horizontal or vertical position correction. After these two enhancements, the accuracy of our model increased notably from 97% to 99%. In summary, the two subtle methods proposed in this study can remarkably enhance the classification ability of deep learning models for rice grain classification.
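The position-correction preprocessing can be approximated with classical image tools: estimate the grain's principal axis from foreground pixels and rotate it upright. The sketch below assumes a normalized grayscale image with a bright grain on a dark background; the thresholding and sign conventions are illustrative, not the authors' exact procedure.

```python
# Illustrative orientation correction via PCA on foreground pixel coordinates.
import numpy as np
from scipy import ndimage

def correct_orientation(gray_image, threshold=0.5):
    """Rotate a grain image so its long axis is vertical (sketch)."""
    mask = gray_image > threshold              # assumes bright grain, dark background
    ys, xs = np.nonzero(mask)
    coords = np.stack([xs, ys], axis=1).astype(float)
    coords -= coords.mean(axis=0)
    cov = coords.T @ coords / len(coords)      # principal axis via 2x2 PCA
    eigvals, eigvecs = np.linalg.eigh(cov)
    major = eigvecs[:, np.argmax(eigvals)]
    angle = np.degrees(np.arctan2(major[1], major[0]))
    return ndimage.rotate(gray_image, angle - 90, reshape=False)
```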
[237] Temporal Overlapping Prediction: A Self-supervised Pre-training Method for LiDAR Moving Object Segmentation
Ziliang Miao, Runjian Chen, Yixi Cai, Buwei He, Wenquan Zhao, Wenqi Shao, Bo Zhang, Fu Zhang
Main category: cs.CV
TL;DR: TOP is a self-supervised pre-training method for LiDAR moving object segmentation that uses temporal overlapping points and occupancy prediction to learn spatiotemporal representations without manual annotations.
Details
Motivation: To reduce the heavy reliance on costly manual annotations for moving object segmentation in LiDAR point clouds by leveraging naturally captured temporal motion cues in LiDAR sequences for self-supervised learning.
Method: Proposes Temporal Overlapping Prediction (TOP) that learns spatiotemporal representations by predicting occupancy states of temporal overlapping points between current and adjacent scans, with current occupancy reconstruction as an auxiliary objective.
Result: TOP outperforms supervised training-from-scratch baseline and other self-supervised pre-training baselines by up to 28.77% relative improvement on nuScenes and SemanticKITTI datasets, showing strong transferability across LiDAR setups.
Conclusion: The proposed self-supervised TOP method effectively reduces labeling burden for moving object segmentation while achieving superior performance and generalization capabilities across different LiDAR configurations and tasks.
Abstract: Moving object segmentation (MOS) on LiDAR point clouds is crucial for autonomous systems like self-driving vehicles. Previous supervised approaches rely heavily on costly manual annotations, while LiDAR sequences naturally capture temporal motion cues that can be leveraged for self-supervised learning. In this paper, we propose Temporal Overlapping Prediction (TOP), a self-supervised pre-training method that alleviates the labeling burden for MOS. TOP explores the temporal overlapping points commonly observed by current and adjacent scans, and learns spatiotemporal representations by predicting the occupancy states of these temporal overlapping points. Moreover, we utilize current occupancy reconstruction as an auxiliary pre-training objective, which enhances the current structural awareness of the model. We conduct extensive experiments and observe that the conventional metric Intersection-over-Union (IoU) shows a strong bias toward objects with more scanned points, which might neglect small or distant objects. To compensate for this bias, we introduce an additional metric called mIoU_obj to evaluate object-level performance. Experiments on nuScenes and SemanticKITTI show that TOP outperforms both the supervised training-from-scratch baseline and other self-supervised pre-training baselines by up to 28.77% relative improvement, demonstrating strong transferability across LiDAR setups and generalization to other tasks. Code and pre-trained models will be publicly available upon publication.
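One plausible way to construct the self-supervised labels is a radius query between consecutive scans: a point in the current scan is "overlapping" if the adjacent scan observed something nearby. The sketch below is an assumption about the label construction, not the released code:

```python
# Hypothetical label construction for temporal overlapping points.
import numpy as np
from scipy.spatial import cKDTree

def overlap_labels(current_pts, adjacent_pts, radius=0.2):
    """1 if a current point has a neighbor in the adjacent scan, else 0."""
    tree = cKDTree(adjacent_pts)
    neighbors = tree.query_ball_point(current_pts, r=radius)
    return np.array([len(n) > 0 for n in neighbors], dtype=np.float32)
```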
[238] Oh-A-DINO: Understanding and Enhancing Attribute-Level Information in Self-Supervised Object-Centric Representations
Stefan Sylvius Wagner, Stefan Harmeling
Main category: cs.CV
TL;DR: Self-supervised vision models like CLIP, DINOv2, and DINOv3 excel at geometric object properties but fail to capture surface-level attributes like color and texture, which are crucial for fine-grained object retrieval. The paper proposes using VAE-regularized latent spaces over segmented patches to recover these missing attributes.
Details
Motivation: Object-centric understanding is fundamental for complex reasoning tasks. As pre-trained representations are increasingly deployed in downstream applications like retrieval and manipulation, there's a need to ensure they can faithfully identify specific objects with fine-grained understanding.
Method: The authors investigate self-supervised representations (CLIP, DINOv2, DINOv3) and slot-based approaches for multi-object instance retrieval. They propose learning an auxiliary latent space over segmented patches with VAE regularization to enforce compact, disentangled object-centric representations that capture missing surface-level attributes.
Result: Self-supervised vision models and slot-based representations perform well on geometric properties (shape, size) but fail to preserve non-geometric surface-level cues (color, material, texture). Augmenting these methods with VAE-regularized latent spaces improves retrieval across all attributes.
Conclusion: Learning auxiliary latent spaces with VAE regularization over segmented patches can recover missing surface-level attributes in self-supervised representations, making them more reliable for downstream tasks requiring precise object-level reasoning.
Abstract: Object-centric understanding is fundamental to human vision and required for complex reasoning. Traditional methods define slot-based bottlenecks to learn object properties explicitly, while recent self-supervised vision models like DINO have shown emergent object understanding. We investigate the effectiveness of self-supervised representations from models such as CLIP, DINOv2 and DINOv3, as well as slot-based approaches, for multi-object instance retrieval, where specific objects must be faithfully identified in a scene. This scenario is increasingly relevant as pre-trained representations are deployed in downstream tasks, e.g., retrieval, manipulation, and goal-conditioned policies that demand fine-grained object understanding. Our findings reveal that self-supervised vision models and slot-based representations excel at identifying edge-derived geometry (shape, size) but fail to preserve non-geometric surface-level cues (colour, material, texture), which are critical for disambiguating objects when reasoning about or selecting them in such tasks. We show that learning an auxiliary latent space over segmented patches, where VAE regularisation enforces compact, disentangled object-centric representations, recovers these missing attributes. Augmenting the self-supervised methods with such latents improves retrieval across all attributes, suggesting a promising direction for making self-supervised representations more reliable in downstream tasks that require precise object-level reasoning.
[239] One-Step Residual Shifting Diffusion for Image Super-Resolution via Distillation
Daniil Selikhanovych, David Li, Aleksei Leonov, Nikita Gushchin, Sergei Kushneriuk, Alexander Filippov, Evgeny Burnaev, Iaroslav Koshelev, Alexander Korotin
Main category: cs.CV
TL;DR: RSD is a new distillation method for ResShift that enables single-step super-resolution while maintaining high perceptual quality and better alignment with input images compared to other diffusion-based methods.
Details
Motivation: Existing diffusion models for super-resolution are computationally expensive, and current acceleration methods either fail to produce realistic details or hallucinate non-existent structures.
Method: Train student network to produce images that would make a fake ResShift model trained on them match the teacher model, enabling single-step restoration.
Result: RSD achieves single-step restoration, outperforms teacher by large margin, surpasses SinSR, and produces competitive perceptual quality with better alignment and fewer parameters than text-to-image based methods.
Conclusion: RSD provides an effective distillation approach for diffusion-based super-resolution that balances computational efficiency with high-quality results across various datasets.
Abstract: Diffusion models for super-resolution (SR) produce high-quality visual results but incur high computational costs. Despite the development of several methods to accelerate diffusion-based SR models, some (e.g., SinSR) fail to produce realistic perceptual details, while others (e.g., OSEDiff) may hallucinate non-existent structures. To overcome these issues, we present RSD, a new distillation method for ResShift, one of the top diffusion-based SR models. Our method trains the student network to produce images such that a new fake ResShift model trained on them coincides with the teacher model. RSD achieves single-step restoration and outperforms the teacher by a large margin. We show that our distillation method can surpass the other distillation-based method for ResShift, SinSR, making it on par with state-of-the-art diffusion-based SR distillation methods. Compared to SR methods based on pre-trained text-to-image models, RSD produces competitive perceptual quality, provides images with better alignment to degraded input images, and requires fewer parameters and GPU memory. We provide experimental results on various real-world and synthetic datasets, including RealSR, RealSet65, DRealSR, ImageNet, and DIV2K.
[240] RGS-DR: Deferred Reflections and Residual Shading in 2D Gaussian Splatting
Georgios Kouros, Minye Wu, Tinne Tuytelaars
Main category: cs.CV
TL;DR: The paper introduces a refinement stage for inverse rendering using 2D Gaussian splatting with deferred shading to improve specular detail, bridging the gap with reconstruction-only methods.
Details
Motivation: To address the limitations of per-Gaussian shading with shortest-axis normals and normal residuals, which often result in noisy geometry and specular appearance, and to improve the quality of specular highlights and material editability.
Method: The pipeline uses 2D Gaussian splatting with deferred shading, estimates editable material properties and environment illumination, and employs a directional residual pass for capturing view-dependent effects. A pixel-deferred surfel formulation with specular residuals is used instead of per-Gaussian shading.
Result: The approach yields sharper highlights, cleaner materials, improved editability, and is evaluated on three popular datasets featuring glossy objects, demonstrating high-quality relighting and material editing.
Conclusion: The proposed method effectively improves specular appearance in inverse rendering, providing better reconstruction quality and editability compared to previous approaches.
Abstract: In this work, we address specular appearance in inverse rendering using 2D Gaussian splatting with deferred shading and argue for a refinement stage to improve specular detail, thereby bridging the gap with reconstruction-only methods. Our pipeline estimates editable material properties and environment illumination while employing a directional residual pass that captures leftover view-dependent effects for further refining novel view synthesis. In contrast to per-Gaussian shading with shortest-axis normals and normal residuals, which tends to result in more noisy geometry and specular appearance, a pixel-deferred surfel formulation with specular residuals yields sharper highlights, cleaner materials, and improved editability. We evaluate our approach on rendering and reconstruction quality on three popular datasets featuring glossy objects, and also demonstrate high-quality relighting and material editing.
[241] LRFusionPR: A Polar BEV-Based LiDAR-Radar Fusion Network for Place Recognition
Zhangshuo Qi, Luqi Cheng, Zijie Zhou, Guangming Xiong
Main category: cs.CV
TL;DR: LRFusionPR is a LiDAR-radar fusion method for place recognition in autonomous driving that improves accuracy and robustness through cross-modality feature interaction and knowledge distillation.
Details
Motivation: Place recognition is critical for GPS-denied autonomous driving. While LiDAR provides precise ranging and radar excels in adverse weather, effectively fusing them is challenging due to radar's noisy/sparse data and heterogeneous configurations.
Method: A dual-branch network fuses LiDAR with single-chip or scanning radar in unified polar coordinate BEV. Cross-attention enables cross-modality feature interaction in fusion branch, with knowledge transferred to radar-only distillation branch for robustness. Both branches' descriptors are concatenated for place retrieval.
Result: Extensive evaluations on multiple datasets show LRFusionPR achieves accurate place recognition while maintaining robustness under varying weather conditions.
Conclusion: LRFusionPR successfully addresses LiDAR-radar fusion challenges for place recognition, providing accurate and weather-robust performance suitable for autonomous driving applications.
Abstract: In autonomous driving, place recognition is critical for global localization in GPS-denied environments. LiDAR and radar-based place recognition methods have garnered increasing attention, as LiDAR provides precise ranging, whereas radar excels in adverse weather resilience. However, effectively leveraging LiDAR-radar fusion for place recognition remains challenging. The noisy and sparse nature of radar data limits its potential to further improve recognition accuracy. In addition, heterogeneous radar configurations complicate the development of unified cross-modality fusion frameworks. In this paper, we propose LRFusionPR, which improves recognition accuracy and robustness by fusing LiDAR with either single-chip or scanning radar. Technically, a dual-branch network is proposed to fuse different modalities within the unified polar coordinate bird’s eye view (BEV) representation. In the fusion branch, cross-attention is utilized to perform cross-modality feature interactions. The knowledge from the fusion branch is simultaneously transferred to the distillation branch, which takes radar as its only input to further improve the robustness. Ultimately, the descriptors from both branches are concatenated, producing the multimodal global descriptor for place retrieval. Extensive evaluations on multiple datasets demonstrate that our LRFusionPR achieves accurate place recognition, while maintaining robustness under varying weather conditions. Our open-source code will be released at https://github.com/QiZS-BIT/LRFusionPR.
[242] Fusing Foveal Fixations Using Linear Retinal Transformations and Bayesian Experimental Design
Christopher K. I. Williams
Main category: cs.CV
TL;DR: The paper presents a Bayesian approach for fusing multiple retinal fixations into a complete scene representation, modeling retinal transformations as linear downsampling and solving “where to look next” as a Bayesian experimental design problem.
Details
Motivation: Humans and vertebrates need to fuse multiple fixations with varying resolution (high-resolution fovea, decreasing peripheral resolution) to form complete scene representations, which poses a computational challenge.
Method: Represent retinal transformation as linear downsampling of a high-resolution latent image, enabling exact inference in factor analysis and mixture models, and formulate fixation selection as Bayesian experimental design using Expected Information Gain.
Result: Experiments on Frey faces and MNIST datasets demonstrate the effectiveness of the proposed models for scene representation and fixation selection.
Conclusion: The linear retinal transformation representation enables exact Bayesian inference and optimal fixation selection through information-theoretic criteria, providing a principled computational framework for visual scene understanding.
Abstract: Humans (and many vertebrates) face the problem of fusing together multiple fixations of a scene in order to obtain a representation of the whole, where each fixation uses a high-resolution fovea and decreasing resolution in the periphery. In this paper we explicitly represent the retinal transformation of a fixation as a linear downsampling of a high-resolution latent image of the scene, exploiting the known geometry. This linear transformation allows us to carry out exact inference for the latent variables in factor analysis (FA) and mixtures of FA models of the scene. Further, this allows us to formulate and solve the choice of “where to look next” as a Bayesian experimental design problem using the Expected Information Gain criterion. Experiments on the Frey faces and MNIST datasets demonstrate the effectiveness of our models.
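Because the retinal transformation is linear and the scene prior is factor analysis, everything stays Gaussian and closed-form. The sketch below works through the latent posterior for one fixation and the resulting log-determinant form of the information gain; it is a simplified single-fixation version of the model, with shapes as documented in the docstrings.

```python
# Simplified linear-Gaussian sketch: fixation y = T x + noise, latent image
# x = W z + mu + e with e ~ N(0, diag(psi)); posterior and EIG in closed form.
import numpy as np

def fa_posterior(y, T, W, mu, psi, obs_var):
    """Posterior over FA latents z given one fixation.

    T: (m, n) retinal downsampling; W: (n, k) loadings; mu, psi: (n,); y: (m,)."""
    A = T @ W                                    # effective loading for this fixation
    C = T @ np.diag(psi) @ T.T + obs_var * np.eye(T.shape[0])
    C_inv = np.linalg.inv(C)
    Sigma_z = np.linalg.inv(np.eye(W.shape[1]) + A.T @ C_inv @ A)
    mu_z = Sigma_z @ A.T @ C_inv @ (y - T @ mu)
    return mu_z, Sigma_z

def expected_information_gain(Sigma_prior, T, W, psi, obs_var):
    """EIG of a candidate fixation: half the log-det ratio of latent covariances."""
    A = T @ W
    C = T @ np.diag(psi) @ T.T + obs_var * np.eye(T.shape[0])
    Sigma_post = np.linalg.inv(np.linalg.inv(Sigma_prior)
                               + A.T @ np.linalg.inv(C) @ A)
    return 0.5 * (np.linalg.slogdet(Sigma_prior)[1]
                  - np.linalg.slogdet(Sigma_post)[1])
```

Choosing "where to look next" then amounts to evaluating `expected_information_gain` for each candidate downsampling matrix T and picking the maximizer.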
[243] PlaceIt3D: Language-Guided Object Placement in Real 3D Scenes
Ahmed Abdelreheem, Filippo Aleotti, Jamie Watson, Zawar Qureshi, Abdelrahman Eldesokey, Peter Wonka, Gabriel Brostow, Sara Vicente, Guillermo Garcia-Hernando
Main category: cs.CV
TL;DR: A new task called Language-Guided Object Placement in Real 3D Scenes is introduced, where models must place 3D assets in scenes based on textual prompts, requiring geometric reasoning about free space.
Details
Motivation: To address the challenge of ambiguous object placement in 3D scenes guided by natural language, which has multiple valid solutions and requires understanding 3D geometric relationships.
Method: Proposed a new benchmark with evaluation protocol, created a dataset for training 3D LLMs, and developed the first non-trivial baseline method for this task.
Result: Established a new challenging benchmark for evaluating 3D LLM capabilities in language-guided object placement tasks.
Conclusion: This task and benchmark could become part of the standard evaluation suite for comparing generalist 3D LLM models.
Abstract: We introduce the novel task of Language-Guided Object Placement in Real 3D Scenes. Our model is given a 3D scene’s point cloud, a 3D asset, and a textual prompt broadly describing where the 3D asset should be placed. The task here is to find a valid placement for the 3D asset that respects the prompt. Compared with other language-guided localization tasks in 3D scenes such as grounding, this task has specific challenges: it is ambiguous because it has multiple valid solutions, and it requires reasoning about 3D geometric relationships and free space. We inaugurate this task by proposing a new benchmark and evaluation protocol. We also introduce a new dataset for training 3D LLMs on this task, as well as the first method to serve as a non-trivial baseline. We believe that this challenging task and our new benchmark could become part of the suite of benchmarks used to evaluate and compare generalist 3D LLM models.
[244] Accelerating Targeted Hard-Label Adversarial Attacks in Low-Query Black-Box Settings
Arjhun Swaminathan, Mete Akgün
Main category: cs.CV
TL;DR: TEA is a novel black-box targeted adversarial attack that uses edge information from target images to create effective perturbations with significantly fewer queries than state-of-the-art methods.
Details
Motivation: Current black-box targeted attacks are challenging due to narrow decision regions and often rely on geometric properties rather than incorporating image-specific information like edges.
Method: The proposed TEA attack utilizes edge information from the target image to carefully perturb it, creating adversarial examples that are closer to the source image while achieving target classification.
Result: TEA outperforms state-of-the-art methods across different models in low query settings, using nearly 70% fewer queries, and provides improved target initialization for geometry-based attacks.
Conclusion: Edge-informed attacks like TEA are highly effective for black-box targeted adversarial attacks, especially in query-limited real-world scenarios.
Abstract: Deep neural networks for image classification remain vulnerable to adversarial examples – small, imperceptible perturbations that induce misclassifications. In black-box settings, where only the final prediction is accessible, crafting targeted attacks that aim to misclassify into a specific target class is particularly challenging due to narrow decision regions. Current state-of-the-art methods often exploit the geometric properties of the decision boundary separating a source image and a target image rather than incorporating information from the images themselves. In contrast, we propose Targeted Edge-informed Attack (TEA), a novel attack that utilizes edge information from the target image to carefully perturb it, thereby producing an adversarial image that is closer to the source image while still achieving the desired target classification. Our approach consistently outperforms current state-of-the-art methods across different models in low query settings (nearly 70% fewer queries are used), a scenario especially relevant in real-world applications with limited queries and black-box access. Furthermore, by efficiently generating a suitable adversarial example, TEA provides an improved target initialization for established geometry-based attacks.
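The abstract does not give the attack's update rule, but the core intuition, keeping the target image's edge structure (which carries its class evidence) while pulling everything else toward the source, can be sketched. A hypothetical illustration only: `alpha` is an assumed blending knob, and the query loop that verifies the black-box model still predicts the target class is omitted.

```python
import numpy as np
from skimage import color, feature

def edge_informed_init(source, target, alpha=0.8):
    """Blend non-edge regions of the target toward the source while leaving
    the target's edges intact; images are float arrays in [0, 1]."""
    edges = feature.canny(color.rgb2gray(target), sigma=2.0)
    mask = edges[..., None].astype(float)        # 1 on edge pixels, 0 elsewhere
    moved = (1.0 - alpha) * target + alpha * source
    return np.clip(mask * target + (1.0 - mask) * moved, 0.0, 1.0)
```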
[245] GARLIC: GAussian Representation LearnIng for spaCe partitioning
Panagiotis Rigas, Panagiotis Drivas, Charalambos Tzamos, Ioannis Chamodrakas, George Ioannakis, Leonidas J. Guibas, Ioannis Z. Emiris
Main category: cs.CV
TL;DR: GARLIC is a geometry-aware ANN search method that uses anisotropic Gaussian cells aligned with local data geometry and adaptive to data density, improving efficiency with small probe budgets.
Details
Motivation: Existing ANN methods use isotropic cells or fixed resolutions that fragment dense regions and merge unrelated points in sparse areas, increasing candidate counts during search.
Method: Partitions space into anisotropic Gaussian cells aligned with local geometry, uses information-theoretic objectives for coverage/overlap/alignment balance, and employs split/clone refinement. Query time uses Mahalanobis distance for cell selection and localized quantization for pruning.
Result: Reduces cross-cell neighbor splits and candidate counts under small probe budgets, remains robust even when trained on small dataset fractions, and offers competitive recall-efficiency trade-offs.
Conclusion: GARLIC introduces a geometry-aware space-partitioning paradigm combining information-theoretic objectives with adaptive density refinement for improved Euclidean ANN search.
Abstract: We present \textbf{GARLIC}, a representation learning approach for Euclidean approximate nearest neighbor (ANN) search in high dimensions. Existing partitions tend to rely on isotropic cells, fixed global resolution, or balanced constraints, which fragment dense regions and merge unrelated points in sparse ones, thereby increasing the candidate count when probing only a few cells. Our method instead partitions $\mathbb{R}^d$ into anisotropic Gaussian cells whose shapes align with local geometry and sizes adapt to data density. Information-theoretic objectives balance coverage, overlap, and geometric alignment, while split/clone refinement introduces Gaussians only where needed. At query time, Mahalanobis distance selects relevant cells and localized quantization prunes candidates. This yields partitions that reduce cross-cell neighbor splits and candidate counts under small probe budgets, while remaining robust even when trained on only a small fraction of the dataset. Overall, GARLIC introduces a geometry-aware space-partitioning paradigm that combines information-theoretic objectives with adaptive density refinement, offering competitive recall–efficiency trade-offs for Euclidean ANN search.
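The query-time cell selection is the most concrete piece of the pipeline: rank cells by Mahalanobis distance and probe the closest few. A minimal sketch (parameter names are ours; the localized-quantization pruning step is omitted):

```python
import numpy as np

def select_cells(query, means, precisions, n_probe):
    """Rank anisotropic Gaussian cells by Mahalanobis distance to the query;
    `precisions` holds each cell's inverse covariance matrix."""
    d = np.array([(query - mu) @ P @ (query - mu)
                  for mu, P in zip(means, precisions)])
    return np.argsort(d)[:n_probe]               # indices of the n_probe closest cells
```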
[246] Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation
Shanchuan Lin, Ceyuan Yang, Hao He, Jianwen Jiang, Yuxi Ren, Xin Xia, Yang Zhao, Xuefeng Xiao, Lu Jiang
Main category: cs.CV
TL;DR: AAPT transforms pre-trained latent video diffusion models into real-time, interactive video generators using autoregressive adversarial post-training, achieving 24fps streaming at 736x416 resolution on single H100.
Details
Motivation: Existing large-scale video generation models are computationally intensive and not suitable for real-time interactive applications, preventing adoption in scenarios requiring immediate user feedback.
Method: Autoregressive adversarial post-training (AAPT) that generates one latent frame at a time using single neural function evaluation (1NFE), utilizes KV cache efficiently, and employs student-forcing training to reduce error accumulation.
Result: 8B model achieves real-time 24fps streaming video generation at 736x416 resolution on single H100 GPU, or 1280x720 on 8xH100 for up to 1440 frames (1 minute long).
Conclusion: Adversarial training is an effective paradigm for autoregressive generation, enabling real-time interactive video generation while maintaining quality and reducing computational requirements.
Abstract: Existing large-scale video generation models are computationally intensive, preventing adoption in real-time and interactive applications. In this work, we propose autoregressive adversarial post-training (AAPT) to transform a pre-trained latent video diffusion model into a real-time, interactive video generator. Our model autoregressively generates a latent frame at a time using a single neural function evaluation (1NFE). The model can stream the result to the user in real time and receive interactive responses as controls to generate the next latent frame. Unlike existing approaches, our method explores adversarial training as an effective paradigm for autoregressive generation. This not only allows us to design an architecture that is more efficient for one-step generation while fully utilizing the KV cache, but also enables training the model in a student-forcing manner that proves to be effective in reducing error accumulation during long video generation. Our experiments demonstrate that our 8B model achieves real-time, 24fps, streaming video generation at 736x416 resolution on a single H100, or 1280x720 on 8xH100 up to a minute long (1440 frames). Visit our research website at https://seaweed-apt.com/2
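The inference pattern described above, one latent frame per single forward pass conditioned on a KV cache of everything generated so far plus the user's control signal, reduces to a short loop. A sketch with hypothetical names (`model.step` and the `controls` mapping are our assumptions, not the released API):

```python
def stream_video(model, first_latent, controls, n_frames, kv_cache=None):
    """Generate and stream one latent frame per step (1NFE each)."""
    frame = first_latent
    for t in range(n_frames):
        control = controls.get(t)                # interactive user input; may be None
        frame, kv_cache = model.step(frame, control, kv_cache)
        yield frame                              # decoded and shown in real time
```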
[247] LEGATO: Large-scale End-to-end Generalizable Approach to Typeset OMR
Guang Yang, Victoria Ebert, Nazif Tamer, Brian Siyuan Zheng, Luiza Pozzobon, Noah A. Smith
Main category: cs.CV
TL;DR: Legato is a new end-to-end optical music recognition model that can process full-page or multi-page typeset music scores and generate ABC notation, outperforming previous state-of-the-art methods with significant error reductions.
Details
Motivation: To create the first large-scale pretrained OMR model capable of recognizing full-page/multi-page typeset music scores and generating documents in human-readable ABC notation, addressing limitations in existing OMR systems.
Method: Combines a pretrained vision encoder with an ABC decoder trained on a large dataset of over 214K images, creating an end-to-end model for optical music recognition.
Result: Outperforms previous state-of-the-art methods with 68% absolute error reduction on TEDn and 47.6% on OMR-NED metrics on realistic datasets, demonstrating strong generalization across various typeset scores.
Conclusion: Legato represents a significant advancement in optical music recognition, being the first model to handle full-page/multi-page scores and generate ABC notation while achieving substantial performance improvements over existing methods.
Abstract: We propose Legato, a new end-to-end model for optical music recognition (OMR), a task of converting music score images to machine-readable documents. Legato is the first large-scale pretrained OMR model capable of recognizing full-page or multi-page typeset music scores and the first to generate documents in ABC notation, a concise, human-readable format for symbolic music. Bringing together a pretrained vision encoder with an ABC decoder trained on a dataset of more than 214K images, our model exhibits the strong ability to generalize across various typeset scores. We conduct comprehensive experiments on a range of datasets and metrics and demonstrate that Legato outperforms the previous state of the art. On our most realistic dataset, we see a 68% and 47.6% absolute error reduction on the standard metrics TEDn and OMR-NED, respectively.
[248] Concept Unlearning by Modeling Key Steps of Diffusion Process
Chaoshuo Zhang, Chenhao Lin, Zhengyu Zhao, Le Yang, Qian Wang, Chao Shen
Main category: cs.CV
TL;DR: KSCU is a concept unlearning method for text-to-image diffusion models that selectively fine-tunes key denoising steps to prevent harmful content generation while preserving image quality.
Details
Motivation: Text-to-image diffusion models are prone to misuse for generating harmful content, and existing concept unlearning methods struggle to balance unlearning effectiveness with generation quality preservation.
Method: Key Step Concept Unlearning (KSCU) selectively fine-tunes the model at key denoising steps where the target concept is most influential, avoiding uniform treatment of all steps.
Result: On I2P dataset, KSCU outperforms ESD by 8.3% in nudity unlearning accuracy while improving FID by 8.4%, achieving overall score of 0.92 and surpassing other SOTA methods.
Conclusion: KSCU effectively addresses the trade-off between unlearning effectiveness and generation quality by focusing on key denoising steps, providing higher efficiency and effectiveness than previous approaches.
Abstract: Text-to-image diffusion models (T2I DMs), represented by Stable Diffusion, which generate highly realistic images based on textual input, have been widely used, but their flexibility also makes them prone to misuse for producing harmful or unsafe content. Concept unlearning has been used to prevent text-to-image diffusion models from being misused to generate undesirable visual content. However, existing methods struggle to trade off unlearning effectiveness with the preservation of generation quality. To address this limitation, we propose Key Step Concept Unlearning (KSCU), which selectively fine-tunes the model at the key denoising steps for the target concept. KSCU is inspired by the fact that different diffusion denoising steps contribute unequally to the final generation. Compared to previous approaches, which treat all denoising steps uniformly, KSCU avoids over-optimization of unnecessary steps for higher effectiveness and reduces the number of parameter updates for higher efficiency. For example, on the I2P dataset, KSCU outperforms ESD by 8.3% in nudity unlearning accuracy while improving FID by 8.4%, and achieves a high overall score of 0.92, substantially surpassing all other SOTA methods.
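For orientation, a key-step variant of a standard unlearning objective might look as follows, written against Hugging Face diffusers conventions. This is a hedged sketch, not the paper's training code: the exact unlearning loss and the rule for choosing `key_steps` (assumed precomputed per concept) are the paper's contributions and may differ.

```python
import random
import torch
import torch.nn.functional as F

def kscu_style_loss(unet, scheduler, latents, concept_emb, neutral_emb, key_steps):
    """Fine-tune only at key denoising steps: sample t from key_steps and pull
    the concept-conditioned noise prediction toward the neutral one."""
    step = random.choice(key_steps)
    t = torch.full((latents.size(0),), step, device=latents.device, dtype=torch.long)
    noise = torch.randn_like(latents)
    noisy = scheduler.add_noise(latents, noise, t)
    eps_concept = unet(noisy, t, encoder_hidden_states=concept_emb).sample
    with torch.no_grad():                        # the neutral branch acts as the target
        eps_neutral = unet(noisy, t, encoder_hidden_states=neutral_emb).sample
    return F.mse_loss(eps_concept, eps_neutral)
```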
[249] VITA: Vision-to-Action Flow Matching Policy
Dechen Gao, Boqi Zhao, Andrew Lee, Ian Chuang, Hanchu Zhou, Hang Wang, Zhe Zhao, Junshan Zhang, Iman Soltani
Main category: cs.CV
TL;DR: VITA is a noise-free, conditioning-free policy learning framework that directly maps visual representations to latent actions using flow matching, achieving faster inference while maintaining performance.
Details
Motivation: To reduce the complexity of conventional flow matching and diffusion-based policies that require iterative denoising and conditioning mechanisms, which incur substantial time and memory overhead.
Method: Uses flow matching with visual representations as source, introduces action autoencoder to map raw actions to structured latent space aligned with visual latents, and employs flow latent decoding to prevent latent space collapse by backpropagating action reconstruction loss through ODE solving steps.
Result: VITA outperforms or matches state-of-the-art generative policies on 8 simulation and 2 real-world tasks from ALOHA and Robomimic, while achieving 1.5-2.3x faster inference compared to conventional methods with conditioning.
Conclusion: VITA provides an efficient alternative to conventional generative policies by eliminating noise sampling and conditioning requirements while maintaining competitive performance.
Abstract: Conventional flow matching and diffusion-based policies sample through iterative denoising from standard noise distributions (e.g., Gaussian), and require conditioning mechanisms to incorporate visual information during the generative process, incurring substantial time and memory overhead. To reduce the complexity, we develop VITA(VIsion-To-Action policy), a noise-free and conditioning-free policy learning framework that directly maps visual representations to latent actions using flow matching. VITA treats latent visual representations as the source of the flow, thus eliminating the need of conditioning. As expected, bridging vision and action is challenging, because actions are lower-dimensional, less structured, and sparser than visual representations; moreover, flow matching requires the source and target to have the same dimensionality. To overcome this, we introduce an action autoencoder that maps raw actions into a structured latent space aligned with visual latents, trained jointly with flow matching. To further prevent latent space collapse, we propose flow latent decoding, which anchors the latent generation process by backpropagating the action reconstruction loss through the flow matching ODE (ordinary differential equations) solving steps. We evaluate VITA on 8 simulation and 2 real-world tasks from ALOHA and Robomimic. VITA outperforms or matches state-of-the-art generative policies, while achieving 1.5-2.3x faster inference compared to conventional methods with conditioning. Project page: https://ucd-dare.github.io/VITA/
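Because the source of the flow is the visual latent itself, the training loss collapses to the standard straight-line flow-matching objective with no noise sampling and no conditioning input. A minimal sketch (`velocity_net` is a hypothetical network; the jointly trained action autoencoder and the flow latent decoding loss are omitted):

```python
import torch
import torch.nn.functional as F

def vita_style_flow_loss(velocity_net, z_vision, z_action):
    """Flow matching from visual latents to aligned action latents of the same
    dimensionality; the straight-line path has a constant target velocity."""
    t = torch.rand(z_vision.size(0), 1, device=z_vision.device)
    z_t = (1 - t) * z_vision + t * z_action      # linear interpolant at time t
    v_target = z_action - z_vision               # constant velocity of that path
    return F.mse_loss(velocity_net(z_t, t), v_target)
```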
[250] Efficient Whole Slide Pathology VQA via Token Compression
Weimin Lyu, Qingqiao Hu, Kehan Qi, Zhan Shi, Wentao Huang, Saumya Gupta, Chao Chen
Main category: cs.CV
TL;DR: TCP-LLaVA is a novel MLLM architecture that uses token compression to efficiently handle whole-slide images for visual question answering, reducing computational costs while improving accuracy.
Details
Motivation: Whole-slide images are extremely large (up to 10,000x10,000 pixels), creating computational challenges for MLLMs. Existing methods either lack generative capabilities for VQA or consume excessive resources by processing thousands of patch tokens directly.
Method: Proposes TCP-LLaVA with trainable compression tokens that aggregate visual and textual information through a modality compression module, inspired by BERT’s [CLS] token. Only compressed tokens are forwarded to the LLM for answer generation.
Result: Experiments on ten TCGA tumor subtypes show TCP-LLaVA outperforms existing MLLM baselines in VQA accuracy while substantially reducing training resource consumption.
Conclusion: Token compression effectively addresses the computational challenges of WSI analysis in MLLMs, enabling efficient VQA while maintaining high accuracy.
Abstract: Whole-slide images (WSIs) in pathology can reach up to 10,000 x 10,000 pixels, posing significant challenges for multimodal large language models (MLLMs) due to long context length and high computational demands. Previous methods typically focus on patch-level analysis or slide-level classification using CLIP-based models with multi-instance learning, but they lack the generative capabilities needed for visual question answering (VQA). More recent MLLM-based approaches address VQA by feeding thousands of patch tokens directly into the language model, which leads to excessive resource consumption. To address these limitations, we propose Token Compression Pathology LLaVA (TCP-LLaVA), the first MLLM architecture to perform WSI VQA via token compression. TCP-LLaVA introduces a set of trainable compression tokens that aggregate visual and textual information through a modality compression module, inspired by the [CLS] token mechanism in BERT. Only the compressed tokens are forwarded to the LLM for answer generation, significantly reducing input length and computational cost. Experiments on ten TCGA tumor subtypes show that TCP-LLaVA outperforms existing MLLM baselines in VQA accuracy while reducing training resource consumption by a substantial margin.
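The compression idea is easy to picture as a small cross-attention module: a fixed budget of trainable tokens attends over the (very long) patch sequence plus the text, and only those tokens reach the LLM. A hedged sketch of one plausible realization, not the paper's exact module:

```python
import torch
import torch.nn as nn

class ModalityCompression(nn.Module):
    """Trainable compression tokens cross-attend over concatenated patch and
    text tokens; only the compressed tokens are forwarded to the LLM."""
    def __init__(self, dim, n_compress=64, n_heads=8):
        super().__init__()
        self.compress_tokens = nn.Parameter(torch.randn(n_compress, dim))
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, patch_tokens, text_tokens):
        kv = torch.cat([patch_tokens, text_tokens], dim=1)   # (B, N, D), N huge for WSIs
        q = self.compress_tokens.unsqueeze(0).expand(kv.size(0), -1, -1)
        compressed, _ = self.attn(q, kv, kv)                 # (B, n_compress, D)
        return compressed
```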
[251] StreamAgent: Towards Anticipatory Agents for Streaming Video Understanding
Haolin Yang, Feilong Tang, Linxiao Zhao, Xiang An, Ming Hu, Huifa Li, Xinlin Zhuang, Boqian Wang, Yifan Lu, Xiaofeng Zhang, Abdalla Swikir, Junjun He, Zongyuan Ge, Imran Razzak
Main category: cs.CV
TL;DR: StreamAgent enables proactive video understanding by anticipating future task-relevant information through temporal and spatial prediction, using a streaming KV-cache for efficient real-time processing.
Details
Motivation: Existing real-time video understanding methods lack proactive decision making and future anticipation, limiting responsiveness in dynamic streaming scenarios like autonomous driving and surveillance.
Method: Integrates question semantics and historical observations to anticipate key events, aligns current observations with expected future evidence, and uses a streaming KV-cache memory mechanism for efficient token recall.
Result: Outperforms existing methods in response accuracy and real-time efficiency on streaming and long video understanding tasks.
Conclusion: The proposed StreamAgent provides practical value for real-world streaming scenarios by enabling proactive, goal-driven responses through anticipation and efficient memory management.
Abstract: Real-time streaming video understanding in domains such as autonomous driving and intelligent surveillance poses challenges beyond conventional offline video processing, requiring continuous perception, proactive decision making, and responsive interaction based on dynamically evolving visual content. However, existing methods rely on alternating perception-reaction or asynchronous triggers, lacking task-driven planning and future anticipation, which limits their real-time responsiveness and proactive decision making in evolving video streams. To this end, we propose a StreamAgent that anticipates the temporal intervals and spatial regions expected to contain future task-relevant information to enable proactive and goal-driven responses. Specifically, we integrate question semantics and historical observations through prompting the anticipatory agent to anticipate the temporal progression of key events, align current observations with the expected future evidence, and subsequently adjust the perception action (e.g., attending to task-relevant regions or continuously tracking in subsequent frames). To enable efficient inference, we design a streaming KV-cache memory mechanism that constructs a hierarchical memory structure for selective recall of relevant tokens, enabling efficient semantic retrieval while reducing the overhead of storing all tokens in the traditional KV-cache. Extensive experiments on streaming and long video understanding tasks demonstrate that our method outperforms existing methods in response accuracy and real-time efficiency, highlighting its practical value for real-world streaming scenarios.
[252] Robust Prompt Tuning for Vision-Language Models with Mild Semantic Noise
Yansheng Gao, Yufei Zheng, Shengsheng Wang
Main category: cs.CV
TL;DR: ANPrompt is a robust prompt tuning framework that actively incorporates weak semantic noise through noise prompts and NRVPP, achieving better robustness and generalization than existing methods.
Details
Motivation: Existing prompt tuning methods suppress semantic noise, which limits robustness and generalization to unseen categories. Complete removal of semantic noise restricts model performance.
Method: ANPrompt clusters weakly perturbed features into noise prompts and integrates them with learnable tokens in text and vision encoders. Uses NRVPP to stabilize visual semantics and WALoss for consistency between clean and perturbed predictions.
Result: Extensive experiments on 11 benchmarks show ANPrompt consistently outperforms existing prompt tuning methods, offering superior robustness to semantic noise and improved generalization across tasks.
Conclusion: Actively incorporating weak semantic noise with controlled exposure and logits-based consistency prevents overfitting while preserving semantic integrity, leading to better robustness and generalization.
Abstract: Prompt tuning has shown promising results, but its robustness and generalization to unseen categories remain limited. Through our experiments, we demonstrate that the complete removal of semantic noise is a key factor restricting robustness. Existing methods typically suppress or filter out semantic noise in the prompt space, inadvertently hindering the model’s robustness and its ability to generalize to unseen categories. To address this, we propose ANPrompt, a robust prompt tuning framework that actively incorporates weak semantic noise. By clustering weakly perturbed features into noise prompts and integrating them with learnable tokens in both the text and vision encoders, ANPrompt ensures controlled exposure to semantic variations. To enhance the visual pathway, we introduce the Noise-Resistant Visual Prompt Prototype (NRVPP), which stabilizes visual semantics under weak perturbations. Additionally, we propose a Weak Alignment Loss (WALoss) at the logits level to enforce consistency between clean and perturbed predictions, providing stable supervision. By combining weak semantic noise exposure with logits-based consistency, ANPrompt prevents overfitting to specific phrasings while preserving semantic integrity. Extensive experiments across 11 benchmarks, including base-to-new splits, show that ANPrompt consistently outperforms existing prompt tuning methods, offering superior robustness to semantic noise and improved generalization across tasks.
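Of the three components, the Weak Alignment Loss is the most self-contained: a logits-level consistency term between clean and noise-perturbed predictions. One plausible form, sketched here as an assumption (the paper may define WALoss differently):

```python
import torch.nn.functional as F

def weak_alignment_loss(clean_logits, noisy_logits):
    """KL consistency at the logits level; the clean branch is detached so it
    acts as the teacher for the perturbed branch."""
    p_clean = F.softmax(clean_logits.detach(), dim=-1)
    log_p_noisy = F.log_softmax(noisy_logits, dim=-1)
    return F.kl_div(log_p_noisy, p_clean, reduction="batchmean")
```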
[253] Automated Model Evaluation for Object Detection via Prediction Consistency and Reliability
Seungju Yoo, Hyuk Kwon, Joong-Won Hwang, Kibok Lee
Main category: cs.CV
TL;DR: AutoEval framework for object detection that uses Prediction Consistency and Reliability (PCR) to estimate performance without ground-truth labels by analyzing spatial consistency and confidence reliability of bounding boxes before/after NMS.
Details
Motivation: Manual annotation for evaluating object detectors is costly and time-consuming, creating a need for automated evaluation methods that can assess performance without ground-truth labels.
Method: Proposes PCR metric that measures spatial consistency between boxes before/after NMS and reliability of retained boxes via confidence scores of overlapping boxes. Uses meta-dataset with varying image corruptions for realistic evaluation.
Result: PCR provides more accurate performance estimates than existing AutoEval methods, and the meta-dataset covers wider performance range than previous approaches.
Conclusion: The AutoEval framework with PCR enables efficient and effective evaluation of object detectors without manual annotation, offering practical solution for real-world applications.
Abstract: Recent advances in computer vision have made training object detectors more efficient and effective; however, assessing their performance in real-world applications still relies on costly manual annotation. To address this limitation, we develop an automated model evaluation (AutoEval) framework for object detection. We propose Prediction Consistency and Reliability (PCR), which leverages the multiple candidate bounding boxes that conventional detectors generate before non-maximum suppression (NMS). PCR estimates detection performance without ground-truth labels by jointly measuring 1) the spatial consistency between boxes before and after NMS, and 2) the reliability of the retained boxes via the confidence scores of overlapping boxes. For a more realistic and scalable evaluation, we construct a meta-dataset by applying image corruptions of varying severity. Experimental results demonstrate that PCR yields more accurate performance estimates than existing AutoEval methods, and the proposed meta-dataset covers a wider range of detection performance. The code is available at https://github.com/YonseiML/autoeval-det.
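A PCR-style score can be sketched from the two ingredients the abstract names: spatial consistency of pre-NMS candidates around each kept box, and the confidence mass of those candidates. The combination below (their product, averaged over kept boxes) is our assumption, not the paper's exact formula:

```python
import numpy as np

def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def pcr_style_score(kept_boxes, pre_boxes, pre_scores, iou_thr=0.5):
    """Label-free detection-quality estimate from boxes before/after NMS."""
    per_box = []
    for box in kept_boxes:
        overlaps = [(iou(box, b), s) for b, s in zip(pre_boxes, pre_scores)]
        matched = [(o, s) for o, s in overlaps if o >= iou_thr]
        if not matched:
            continue
        consistency = np.mean([o for o, _ in matched])   # spatial agreement
        reliability = np.mean([s for _, s in matched])   # confidence mass
        per_box.append(consistency * reliability)
    return float(np.mean(per_box)) if per_box else 0.0
```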
[254] DiCache: Let Diffusion Model Determine Its Own Cache
Jiazi Bu, Pengyang Ling, Yujie Zhou, Yibin Wang, Yuhang Zang, Dahua Lin, Jiaqi Wang
Main category: cs.CV
TL;DR: DiCache is a training-free adaptive caching strategy that dynamically determines when and how to cache features in diffusion models using shallow-layer feature analysis to improve efficiency and fidelity.
Details
Motivation: Existing caching methods for diffusion models rely on predefined rules and dataset-level priors, which lack generalizability and fail to handle diverse samples effectively due to the dynamic nature of diffusion processes.
Method: DiCache uses two components: (1) Online Probe Profiling Scheme that uses shallow-layer features to dynamically determine caching timing per sample, and (2) Dynamic Cache Trajectory Alignment that approximates deep-layer features from multi-step historical caches based on shallow-layer feature trajectories.
Result: Extensive experiments show DiCache achieves higher efficiency and improved fidelity compared to state-of-the-art approaches on various diffusion models including WAN 2.1, HunyuanVideo and Flux.
Conclusion: DiCache provides a unified framework for adaptive caching in diffusion models that dynamically customizes caching strategies per sample, leading to better performance across diverse models and samples.
Abstract: Recent years have witnessed the rapid development of acceleration techniques for diffusion models, especially caching-based acceleration methods. These studies seek to answer two fundamental questions: “When to cache” and “How to use cache”, typically relying on predefined empirical laws or dataset-level priors to determine caching timings and adopting handcrafted rules for multi-step cache utilization. However, given the highly dynamic nature of the diffusion process, they often exhibit limited generalizability and fail to cope with diverse samples. In this paper, a strong sample-specific correlation is revealed between the variation patterns of the shallow-layer feature differences in the diffusion model and those of deep-layer features. Moreover, we have observed that the features from different model layers form similar trajectories. Based on these observations, we present DiCache, a novel training-free adaptive caching strategy for accelerating diffusion models at runtime, answering both when and how to cache within a unified framework. Specifically, DiCache is composed of two principal components: (1) Online Probe Profiling Scheme leverages a shallow-layer online probe to obtain an on-the-fly indicator for the caching error in real time, enabling the model to dynamically customize the caching schedule for each sample. (2) Dynamic Cache Trajectory Alignment adaptively approximates the deep-layer feature output from multi-step historical caches based on the shallow-layer feature trajectory, facilitating higher visual quality. Extensive experiments validate DiCache’s capability in achieving higher efficiency and improved fidelity over state-of-the-art approaches on various leading diffusion models including WAN 2.1, HunyuanVideo and Flux.
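At its simplest, the online probe reduces to a cheap drift test on shallow-layer features. The thresholded relative-drift rule below is our illustration of the idea, not the paper's actual indicator:

```python
import torch

def should_refresh(probe_feat, cached_probe_feat, tol=0.05):
    """Recompute the deep layers when the shallow-layer probe has drifted more
    than `tol` (relative L2) since the cache was last written."""
    drift = torch.norm(probe_feat - cached_probe_feat)
    drift = drift / (torch.norm(cached_probe_feat) + 1e-8)
    return bool(drift > tol)
```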
[255] Enhancing Corpus Callosum Segmentation in Fetal MRI via Pathology-Informed Domain Randomization
Marina Grifell i Plana, Vladyslav Zalevskyi, Léa Schmidt, Yvan Gomez, Thomas Sanchez, Vincent Dunet, Mériam Koob, Vanessa Siffredi, Meritxell Bach Cuadra
Main category: cs.CV
TL;DR: Proposes pathology-informed domain randomization for fetal brain segmentation in rare corpus callosum dysgenesis (CCD), using synthetic data generation from healthy data to overcome annotation scarcity.
Details
Motivation: Accurate fetal brain segmentation is crucial for neurodevelopment assessment, but CCD rarity severely limits annotated data, hindering deep learning model generalization.
Method: Pathology-informed domain randomization strategy that embeds prior knowledge of CCD manifestations into synthetic data generation pipeline, simulating diverse brain alterations from healthy data alone.
Result: Validated on 248 healthy, 26 CCD, and 47 other pathology cases; reduced LCC estimation error from 1.89mm to 0.80mm (healthy) and 10.9mm to 0.7mm (CCD); improved topological consistency in segmentations.
Conclusion: Incorporating domain-specific anatomical priors into synthetic data pipelines effectively mitigates data scarcity and enhances analysis of rare but clinically significant malformations.
Abstract: Accurate fetal brain segmentation is crucial for extracting biomarkers and assessing neurodevelopment, especially in conditions such as corpus callosum dysgenesis (CCD), which can induce drastic anatomical changes. However, the rarity of CCD severely limits annotated data, hindering the generalization of deep learning models. To address this, we propose a pathology-informed domain randomization strategy that embeds prior knowledge of CCD manifestations into a synthetic data generation pipeline. By simulating diverse brain alterations from healthy data alone, our approach enables robust segmentation without requiring pathological annotations. We validate our method on a cohort comprising 248 healthy fetuses, 26 with CCD, and 47 with other brain pathologies, achieving substantial improvements on CCD cases while maintaining performance on both healthy fetuses and those with other pathologies. From the predicted segmentations, we derive clinically relevant biomarkers, such as corpus callosum length (LCC) and volume, and show their utility in distinguishing CCD subtypes. Our pathology-informed augmentation reduces the LCC estimation error from 1.89 mm to 0.80 mm in healthy cases and from 10.9 mm to 0.7 mm in CCD cases. Beyond these quantitative gains, our approach yields segmentations with improved topological consistency relative to available ground truth, enabling more reliable shape-based analyses. Overall, this work demonstrates that incorporating domain-specific anatomical priors into synthetic data pipelines can effectively mitigate data scarcity and enhance analysis of rare but clinically significant malformations.
[256] TrimTokenator: Towards Adaptive Visual Token Pruning for Large Multimodal Models
Hao Zhang, Mengsi Lyu, Chenrui He, Yulong Ao, Yonghua Lin
Main category: cs.CV
TL;DR: A visual token pruning method for Large Multimodal Models that reduces computational costs by selectively removing redundant visual tokens while preserving cross-modal alignment and intra-modal diversity.
Details
Motivation: Large Multimodal Models suffer from high computational and memory costs due to increased token counts from visual inputs. Existing token pruning methods have limitations like costly calibration and suboptimal importance metrics.
Method: Proposes pruning only visual tokens using mutual information to preserve cross-modal alignment and maximizing pairwise distances in embedding space to maintain intra-modal diversity. Uses greedy algorithm for efficient implementation.
Result: Achieves 88.9% token reduction on LLaVA-1.5-7B and LLaVA-NEXT-7B models while maintaining strong performance, with 56.7% improvement in inference speed.
Conclusion: The proposed visual token pruning strategy effectively reduces computational overhead while preserving model performance by focusing on cross-modal alignment and intra-modal diversity.
Abstract: Large Multimodal Models (LMMs) have achieved significant success across various tasks. These models usually encode visual inputs into dense token sequences, which are then concatenated with textual tokens and jointly processed by a language model. However, the increased token count substantially raises computational and memory costs during inference. Token pruning has emerged as a promising approach to address this issue. Existing token pruning methods often rely on costly calibration or suboptimal importance metrics, leading to redundant retained tokens. In this paper, we analyze the redundancy differences between visual and textual tokens and propose pruning exclusively on visual tokens. Based on this, we propose a visual token pruning strategy that explicitly preserves both cross-modal alignment and intra-modal informational diversity. We introduce a mutual information-based token pruning strategy that removes visual tokens semantically misaligned with textual tokens, effectively preserving the alignment between the visual and textual modalities. To further improve the representational quality of the retained tokens, we additionally prune redundant visual tokens by maximizing the expected pairwise distances in the embedding space, which is solved efficiently with a greedy algorithm. Extensive experiments demonstrate that our method maintains strong performance while reducing tokens by 88.9% on models such as LLaVA-1.5-7B and LLaVA-NEXT-7B, resulting in a 56.7% improvement in inference speed.
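The diversity step, maximizing expected pairwise distances with a greedy algorithm, corresponds to classic farthest-point selection. A sketch of that step alone (the mutual-information filter against text tokens comes first in the paper and is omitted here):

```python
import torch

def diverse_visual_tokens(tokens, k):
    """Greedy farthest-point selection over token embeddings, tokens: (N, D).
    Approximately maximizes pairwise distances among the k kept tokens."""
    dist = torch.cdist(tokens, tokens)          # (N, N) pairwise distances
    picked = [0]                                # seed with an arbitrary token
    min_dist = dist[0].clone()                  # distance of each token to the picked set
    for _ in range(k - 1):
        nxt = int(torch.argmax(min_dist))       # farthest token from the picked set
        picked.append(nxt)
        min_dist = torch.minimum(min_dist, dist[nxt])
    return tokens[picked]
```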
[257] Towards Methane Detection Onboard Satellites
Maggie Chen, Hala Lambdouar, Luca Marini, Laura Martínez-Ferrer, Chris Bridges, Giacomo Acciarini
Main category: cs.CV
TL;DR: ML models trained on unorthorectified satellite data achieve comparable methane detection performance to orthorectified data, enabling faster onboard processing with reduced preprocessing steps.
Details
Motivation: Methane is a potent greenhouse gas requiring timely detection for climate mitigation. Current methods need preprocessing like orthorectification, which adds complexity and delays.
Method: Proposed UnorthoDOS approach using unorthorectified hyperspectral data from EMIT sensor, bypassing conventional preprocessing steps. Also trained models on orthorectified data for comparison.
Result: ML models on unorthorectified data perform comparably to orthorectified data models. Orthorectified models outperform matched filter baseline (mag1c).
Conclusion: Unorthorectified data enables effective methane detection with reduced preprocessing, supporting faster onboard satellite processing for rapid response systems.
Abstract: Methane is a potent greenhouse gas and a major driver of climate change, making its timely detection critical for effective mitigation. Machine learning (ML) deployed onboard satellites can enable rapid detection while reducing downlink costs, supporting faster response systems. Conventional methane detection methods often rely on image processing techniques, such as orthorectification to correct geometric distortions and matched filters to enhance plume signals. We introduce a novel approach that bypasses these preprocessing steps by using \textit{unorthorectified} data (UnorthoDOS). We find that ML models trained on this dataset achieve performance comparable to those trained on orthorectified data. Moreover, we also train models on an orthorectified dataset, showing that they can outperform the matched filter baseline (mag1c). We release model checkpoints and two ML-ready datasets comprising orthorectified and unorthorectified hyperspectral images from the Earth Surface Mineral Dust Source Investigation (EMIT) sensor at https://huggingface.co/datasets/SpaceML/UnorthoDOS , along with code at https://github.com/spaceml-org/plume-hunter.
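For context on the baseline: the matched-filter family that mag1c belongs to has a standard closed form, scoring each pixel spectrum against a gas signature whitened by the background covariance. The textbook version (not mag1c's exact implementation):

```python
import numpy as np

def matched_filter(pixels, target_spectrum):
    """Classical spectral matched filter. pixels: (N, B) spectra,
    target_spectrum: (B,) gas absorption signature."""
    mu = pixels.mean(axis=0)
    cov = np.cov(pixels, rowvar=False) + 1e-6 * np.eye(pixels.shape[1])  # regularized
    w = np.linalg.solve(cov, target_spectrum)    # covariance-whitened signature
    return (pixels - mu) @ w / (target_spectrum @ w)
```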
[258] RS-OOD: A Vision-Language Augmented Framework for Out-of-Distribution Detection in Remote Sensing
Chenhao Wang, Yingrui Ji, Yu Meng, Yunjian Zhang, Yao Zhu
Main category: cs.CV
TL;DR: RS-OOD is a novel framework for out-of-distribution detection in remote sensing that uses vision-language modeling with spatial feature enhancement, dual-prompt alignment, and confidence-guided self-training to achieve robust few-shot performance.
Details
Motivation: Existing OOD detection methods are poorly suited for remote sensing due to data scarcity, complex multi-scale scene structures, and pronounced distribution shifts, creating a critical need for specialized solutions.
Method: Leverages remote sensing-specific vision-language modeling with spatial feature enhancement, dual-prompt alignment for spatial-semantic consistency, and confidence-guided self-training to dynamically mine pseudo-labels.
Result: Consistently outperforms existing methods across multiple remote sensing benchmarks and enables efficient adaptation with minimal labeled data.
Conclusion: Demonstrates the critical value of spatial-semantic integration for robust OOD detection in remote sensing applications.
Abstract: Out-of-distribution (OOD) detection represents a critical challenge in remote sensing applications, where reliable identification of novel or anomalous patterns is essential for autonomous monitoring, disaster response, and environmental assessment. Despite remarkable progress in OOD detection for natural images, existing methods and benchmarks remain poorly suited to remote sensing imagery due to data scarcity, complex multi-scale scene structures, and pronounced distribution shifts. To this end, we propose RS-OOD, a novel framework that leverages remote sensing-specific vision-language modeling to enable robust few-shot OOD detection. Our approach introduces three key innovations: spatial feature enhancement that improved scene discrimination, a dual-prompt alignment mechanism that cross-verifies scene context against fine-grained semantics for spatial-semantic consistency, and a confidence-guided self-training loop that dynamically mines pseudo-labels to expand training data without manual annotation. RS-OOD consistently outperforms existing methods across multiple remote sensing benchmarks and enables efficient adaptation with minimal labeled data, demonstrating the critical value of spatial-semantic integration.
[259] Patch-Level Kernel Alignment for Dense Self-Supervised Learning
Juan Yeo, Ijun Jang, Taesup Kim
Main category: cs.CV
TL;DR: PaKA is a non-parametric, kernel-based post-training method that improves dense representations of pretrained vision encoders using statistical dependency alignment and refined augmentations, achieving SOTA performance with minimal additional training.
Details
Motivation: Existing dense self-supervised learning methods rely on parametric assumptions or complex post-processing, limiting flexibility and stability.
Method: Introduces Patch-level Kernel Alignment (PaKA), a non-parametric kernel-based approach with statistical dependency alignment and refined dense SSL augmentations, applied as lightweight post-training on pretrained models.
Result: Achieves state-of-the-art performance across dense vision benchmarks with only 14 hours of additional training on a single GPU.
Conclusion: PaKA provides an efficient and effective framework for enhancing dense representations in vision models through non-parametric kernel alignment and optimized augmentations.
Abstract: Dense self-supervised learning (SSL) methods have shown their effectiveness in enhancing the fine-grained semantic understanding of vision models. However, existing approaches often rely on parametric assumptions or complex post-processing (e.g., clustering, sorting), limiting their flexibility and stability. To overcome these limitations, we introduce Patch-level Kernel Alignment (PaKA), a non-parametric, kernel-based approach that improves the dense representations of pretrained vision encoders with a post-(pre)training. Our method proposes a robust and effective alignment objective that captures statistical dependencies matching the intrinsic structure of high-dimensional dense feature distributions. In addition, we revisit the augmentation strategies inherited from image-level SSL and propose a refined augmentation strategy for dense SSL. Our framework improves dense representations by conducting a lightweight post-training stage on top of a pretrained model. With only 14 hours of additional training on a single GPU, our method achieves state-of-the-art performance across a range of dense vision benchmarks, demonstrating both efficiency and effectiveness.
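The abstract does not spell out the alignment objective, but kernel-alignment objectives of this kind typically build on centered kernel alignment (a normalized HSIC) between the patch Gram matrices of two views. Shown as background, not as PaKA's exact loss:

```python
import torch

def centered_kernel_alignment(K1, K2):
    """CKA between two (n, n) Gram matrices: a normalized HSIC that measures
    the statistical dependency between two sets of patch features."""
    n = K1.size(0)
    H = torch.eye(n, device=K1.device) - 1.0 / n     # centering matrix
    K1c, K2c = H @ K1 @ H, H @ K2 @ H
    hsic = (K1c * K2c).sum()
    return hsic / (K1c.norm() * K2c.norm() + 1e-8)
```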
[260] Using KL-Divergence to Focus Frequency Information in Low-Light Image Enhancement
Yan Xingyang, Huang Xiaohong, Zhang Zhao, You Tian, Xu Ziheng
Main category: cs.CV
TL;DR: LLFDisc is a U-shaped deep enhancement network that uses cross-attention and gating mechanisms for frequency-aware enhancement, with novel distribution-aware loss in Fourier domain and enhanced perceptual loss.
Details
Motivation: Traditional Fourier frequency fitting with pixel-wise losses focuses too much on local information and causes global information loss, so better Fourier-domain alignment is needed.
Method: U-shaped deep network with cross-attention and gating mechanisms. Uses a distribution-aware loss with KL-Divergence for Fourier-domain fitting and an enhanced VGG perceptual loss with KL-Divergence on deep features.
Result: Achieves state-of-the-art performance across multiple benchmarks in both qualitative and quantitative evaluations.
Conclusion: LLFDisc effectively enhances frequency-aware image processing through Fourier-domain distribution alignment and improved structural fidelity.
Abstract: In the Fourier domain, luminance information is primarily encoded in the amplitude spectrum, while spatial structures are captured in the phase components. Traditional Fourier frequency information fitting employs pixel-wise loss functions, which tend to focus excessively on local information and may lead to global information loss. In this paper, we present LLFDisc, a U-shaped deep enhancement network that integrates cross-attention and gating mechanisms tailored for frequency-aware enhancement. We propose a novel distribution-aware loss that directly fits the Fourier-domain information and minimizes their divergence using a closed-form KL-Divergence objective. This enables the model to align Fourier-domain information more robustly than with conventional MSE-based losses. Furthermore, we enhance the perceptual loss based on VGG by embedding KL-Divergence on extracted deep features, enabling better structural fidelity. Extensive experiments across multiple benchmarks demonstrate that LLFDisc achieves state-of-the-art performance in both qualitative and quantitative evaluations. Our code will be released at: https://github.com/YanXY000/LLFDisc
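A distribution-aware Fourier loss of the kind described can be written compactly: normalize the amplitude spectra into distributions and take a KL divergence between them. One plausible realization, sketched as an assumption (the paper's closed-form objective and its treatment of phase may differ):

```python
import torch
import torch.nn.functional as F

def fourier_kl_loss(pred, target, eps=1e-8):
    """KL divergence between normalized Fourier amplitude spectra of the
    prediction and the reference; pred/target: (B, C, H, W)."""
    amp_p = torch.fft.fft2(pred).abs().flatten(1)
    amp_t = torch.fft.fft2(target).abs().flatten(1)
    p = amp_p / (amp_p.sum(dim=1, keepdim=True) + eps)
    q = amp_t / (amp_t.sum(dim=1, keepdim=True) + eps)
    # F.kl_div expects log-probabilities as its first argument.
    return F.kl_div(p.clamp_min(eps).log(), q, reduction="batchmean")
```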
[261] Landcover classification and change detection using remote sensing and machine learning: a case study of Western Fiji
Yadvendra Gurjar, Ruoni Wan, Ehsan Farahbakhsh, Rohitash Chandra
Main category: cs.CV
TL;DR: Machine learning and remote sensing frameworks were used to analyze land use/land cover changes in Nadi, Fiji from 2013 to 2024, focusing on urbanization impacts.
Details
Motivation: Fiji is experiencing rapid urbanization with massive development projects, creating a need for technical support in land cover/land use modeling and change detection.
Method: Used Landsat-8 satellite imagery with supervised machine learning training datasets, Google Earth Engine with k-means clustering for land cover mapping, and convolutional neural networks for land cover classification.
Result: Generated land cover maps and visualizations showing urban area changes over time, enabling monitoring of urbanization patterns.
Conclusion: The study successfully developed frameworks for monitoring land use/land cover changes, providing valuable tools for tracking urbanization impacts in developing regions like Fiji.
Abstract: As a developing country, Fiji is facing rapid urbanisation, which is visible in the massive development projects that include housing, roads, and civil works. In this study, we present machine learning and remote sensing frameworks to compare land use and land cover change from 2013 to 2024 in Nadi, Fiji. The ultimate goal of this study is to provide technical support in land cover/land use modelling and change detection. We used Landsat-8 satellite imagery of the study region and created our training dataset with labels for supervised machine learning. We used Google Earth Engine and unsupervised machine learning via k-means clustering to generate the land cover map. We used convolutional neural networks to classify the selected regions’ land cover types. We present a visualisation of change detection that highlights urban area changes over time.
[262] GenExam: A Multidisciplinary Text-to-Image Exam
Zhaokai Wang, Penghao Yin, Xiangyu Zhao, Changyao Tian, Yu Qiao, Wenhai Wang, Jifeng Dai, Gen Luo
Main category: cs.CV
TL;DR: GenExam is the first benchmark for multidisciplinary text-to-image exams with 1,000 samples across 10 subjects, designed to rigorously evaluate models’ ability to integrate understanding, reasoning, and generation through exam-style prompts.
Details
Motivation: Existing benchmarks focus on understanding/reasoning or world knowledge illustration, but neglect rigorous drawing exams that test integrated capabilities needed for expert-level intelligence.
Method: Created a benchmark with 1,000 exam-style prompts across 10 subjects organized under a four-level taxonomy, each with ground-truth images and fine-grained scoring points for precise evaluation of semantic correctness and visual plausibility.
Result: State-of-the-art models like GPT-Image-1 and Gemini-2.5-Flash-Image achieve less than 15% strict scores, with most models scoring almost 0%, demonstrating the benchmark’s significant challenge.
Conclusion: GenExam provides a rigorous assessment of models’ integrated understanding, reasoning, and generation capabilities, offering insights for advancing toward general AGI.
Abstract: Exams are a fundamental test of expert-level intelligence and require integrated understanding, reasoning, and generation. Existing exam-style benchmarks mainly focus on understanding and reasoning tasks, and current generation benchmarks emphasize the illustration of world knowledge and visual concepts, neglecting the evaluation of rigorous drawing exams. We introduce GenExam, the first benchmark for multidisciplinary text-to-image exams, featuring 1,000 samples across 10 subjects with exam-style prompts organized under a four-level taxonomy. Each problem is equipped with ground-truth images and fine-grained scoring points to enable a precise evaluation of semantic correctness and visual plausibility. Experiments show that even state-of-the-art models such as GPT-Image-1 and Gemini-2.5-Flash-Image achieve less than 15% strict scores, and most models yield almost 0%, suggesting the great challenge of our benchmark. By framing image generation as an exam, GenExam offers a rigorous assessment of models’ ability to integrate understanding, reasoning, and generation, providing insights on the path to general AGI. Our benchmark and evaluation code are released at https://github.com/OpenGVLab/GenExam.
[263] EyePCR: A Comprehensive Benchmark for Fine-Grained Perception, Knowledge Comprehension and Clinical Reasoning in Ophthalmic Surgery
Gui Wang, Yang Wennuo, Xusen Ma, Zehao Zhong, Zhuoru Wu, Ende Wu, Rong Qu, Wooi Ping Cheah, Jianfeng Ren, Linlin Shen
Main category: cs.CV
TL;DR: EyePCR is a large-scale benchmark for evaluating MLLMs in ophthalmic surgery analysis across Perception, Comprehension, and Reasoning, featuring 210k VQAs and domain-adapted models that rival commercial solutions.
Details
Motivation: To address the under-explored performance of MLLMs in high-stakes, domain-specific surgical settings by creating a comprehensive benchmark for ophthalmic surgery analysis.
Method: Developed EyePCR benchmark with 210k VQAs covering 1048 fine-grained attributes, 25k medical knowledge graph triplets, and four clinical reasoning tasks. Created EyePCR-MLLM, a domain-adapted variant of Qwen2.5-VL-7B.
Result: EyePCR-MLLM achieved highest accuracy on Perception MCQs among compared models, outperformed open-source models in Comprehension and Reasoning, and rivaled commercial models like GPT-4.1.
Conclusion: EyePCR reveals limitations of existing MLLMs in surgical cognition and establishes foundation for benchmarking and enhancing clinical reliability of surgical video understanding models.
Abstract: MLLMs (Multimodal Large Language Models) have showcased remarkable capabilities, but their performance in high-stakes, domain-specific scenarios like surgical settings remains largely under-explored. To address this gap, we develop \textbf{EyePCR}, a large-scale benchmark for ophthalmic surgery analysis, grounded in structured clinical knowledge to evaluate cognition across \textit{Perception}, \textit{Comprehension} and \textit{Reasoning}. EyePCR offers a richly annotated corpus with more than 210k VQAs, which cover 1048 fine-grained attributes for multi-view perception, a medical knowledge graph of more than 25k triplets for comprehension, and four clinically grounded reasoning tasks. The rich annotations facilitate in-depth cognitive analysis, simulating how surgeons perceive visual cues and combine them with domain knowledge to make decisions, thus greatly improving models’ cognitive ability. In particular, \textbf{EyePCR-MLLM}, a domain-adapted variant of Qwen2.5-VL-7B, achieves the highest accuracy on MCQs for \textit{Perception} among compared models and outperforms open-source models in \textit{Comprehension} and \textit{Reasoning}, rivalling commercial models like GPT-4.1. EyePCR reveals the limitations of existing MLLMs in surgical cognition and lays the foundation for benchmarking and enhancing clinical reliability of surgical video understanding models.
[264] Semantic and Visual Crop-Guided Diffusion Models for Heterogeneous Tissue Synthesis in Histopathology
Saghir Alfasly, Wataru Uegami, MD Enamul Hoq, Ghazal Alabtah, H. R. Tizhoosh
Main category: cs.CV
TL;DR: A latent diffusion model for generating realistic histopathology images using dual-conditioning with semantic segmentation maps and tissue-specific visual crops, achieving competitive performance with real data in downstream tasks.
Details
Motivation: Address challenges in synthetic histopathology data generation: preserving tissue heterogeneity, capturing morphological features, and scaling to unannotated datasets.
Method: Dual-conditioning latent diffusion model combining semantic segmentation maps with raw tissue crops. For unannotated data, uses self-supervised clustering of whole-slide images into 100 tissue types using foundation model embeddings to generate pseudo-semantic maps.
Result: Reduces Frechet Distance by up to 6X on Camelyon16 (430.1 to 72.0) and yields 2-3x lower FD across Panda and TCGA. Downstream models trained on synthetic data achieve test IoU of 0.71 and 0.95 on Camelyon16 and Panda, within 1-2% of real-data baselines.
Conclusion: The framework provides a practical solution for generating diverse, annotated histopathology data at scale, addressing critical bottlenecks in computational pathology.
Abstract: Synthetic data generation in histopathology faces unique challenges: preserving tissue heterogeneity, capturing subtle morphological features, and scaling to unannotated datasets. We present a latent diffusion model that generates realistic heterogeneous histopathology images through a novel dual-conditioning approach combining semantic segmentation maps with tissue-specific visual crops. Unlike existing methods that rely on text prompts or abstract visual embeddings, our approach preserves critical morphological details by directly incorporating raw tissue crops from corresponding semantic regions. For annotated datasets (i.e., Camelyon16, Panda), we extract patches ensuring 20-80% tissue heterogeneity. For unannotated data (i.e., TCGA), we introduce a self-supervised extension that clusters whole-slide images into 100 tissue types using foundation model embeddings, automatically generating pseudo-semantic maps for training. Our method synthesizes high-fidelity images with precise region-wise annotations, achieving superior performance on downstream segmentation tasks. When evaluated on annotated datasets, models trained on our synthetic data show competitive performance to those trained on real data, demonstrating the utility of controlled heterogeneous tissue generation. In quantitative evaluation, prompt-guided synthesis reduces Frechet Distance by up to 6X on Camelyon16 (from 430.1 to 72.0) and yields 2-3x lower FD across Panda and TCGA. Downstream DeepLabv3+ models trained solely on synthetic data attain test IoU of 0.71 and 0.95 on Camelyon16 and Panda, within 1-2% of real-data baselines (0.72 and 0.96). By scaling to 11,765 TCGA whole-slide images without manual annotations, our framework offers a practical solution to the urgent need for diverse, annotated histopathology data, addressing a critical bottleneck in computational pathology.
[265] CAMILA: Context-Aware Masking for Image Editing with Language Alignment
Hyunseung Kim, Chiho Choi, Srikanth Malla, Sai Prahladh Padmanabhan, Saurabh Bagchi, Joon Hee Choi
Main category: cs.CV
TL;DR: CAMILA is a context-aware image editing method that validates instruction coherence with images, applying only feasible edits while ignoring contradictory instructions to prevent nonsensical outputs.
Details
Motivation: Existing image editing models naively follow all user instructions, even infeasible or contradictory ones, leading to nonsensical results. There is a need for systems that can validate instruction feasibility before applying edits.
Method: Proposed CAMILA (Context-Aware Masking for Image Editing with Language Alignment) that validates contextual coherence between instructions and images, applies edits only to designated regions while ignoring non-executable instructions.
Result: CAMILA achieves better performance and higher semantic alignment than state-of-the-art models on constructed datasets for single- and multi-instruction editing with infeasible requests.
Conclusion: CAMILA effectively handles complex instruction challenges while preserving image integrity, demonstrating superiority in contextual validation over existing methods.
Abstract: Text-guided image editing allows users to transform and synthesize images through natural language instructions, offering considerable flexibility. However, most existing image editing models naively attempt to follow all user instructions, even if those instructions are inherently infeasible or contradictory, often resulting in nonsensical output. To address these challenges, we propose a context-aware method for image editing named CAMILA (Context-Aware Masking for Image Editing with Language Alignment). CAMILA is designed to validate the contextual coherence between instructions and the image, ensuring that only relevant edits are applied to the designated regions while ignoring non-executable instructions. For comprehensive evaluation of this new method, we constructed datasets for both single- and multi-instruction image editing, incorporating the presence of infeasible requests. Our method achieves better performance and higher semantic alignment than state-of-the-art models, demonstrating its effectiveness in handling complex instruction challenges while preserving image integrity.
[266] Euclid’s Gift: Enhancing Spatial Perception and Reasoning in Vision-Language Models via Geometric Surrogate Tasks
Shijie Lian, Changti Wu, Laurence Tianruo Yang, Hang Yuan, Bin Yu, Lei Zhang, Kai Chen
Main category: cs.CV
TL;DR: This paper shows that fine-tuning multimodal LLMs on Euclidean geometry problems (Euclid30K dataset) significantly improves their spatial reasoning abilities across multiple benchmarks without task-specific adaptations.
Details
Motivation: Spatial intelligence remains a critical unresolved challenge for Multimodal Large Language Models (MLLMs), including abilities like visualizing shapes, mental rotation, position judgment, and numerosity estimation.
Method: Used Euclidean geometry problem-solving as a surrogate task, created Euclid30K dataset with 30K geometry problems, and employed Group Relative Policy Optimization (GRPO) to finetune Qwen2.5VL and RoboBrain2.0 model families.
Result: Models achieved substantial zero-shot gains across four spatial reasoning benchmarks. Mean VSI-Bench accuracy rose from 34.5% to 40.5% (+5.5 points). RoboBrain2.0-Euclid-7B achieved 49.6% accuracy, surpassing previous state-of-the-art Spatial-MLLM.
Conclusion: This is the first systematic study showing that geometry-centric fine-tuning can confer vision-language models with broadly transferable spatial skills, demonstrating that Euclidean geometry training enables acquisition of general spatial reasoning capabilities.
Abstract: Spatial intelligence spans a rich suite of abilities, including visualising and transforming shapes, mentally rotating objects, judging relational positions and containment, and estimating numerosity. However, it still remains a critical unresolved challenge for Multimodal Large Language Models (MLLMs). To fill this gap, we propose to treat Euclidean geometry problem-solving as a surrogate task. Specifically, we meticulously constructed a curated multimodal dataset, called Euclid30K, comprising approximately 30K plane and solid geometry problems. To enable the model to acquire and apply Euclidean principles from these geometry problems, we employed Group Relative Policy Optimization (GRPO) to finetune the Qwen2.5VL family and RoboBrain2.0 family, inspiring the models to identify shapes, count, and relate entities, and perform multi-step deductive reasoning using Euclidean principles. Our experiments demonstrate that the resulting models achieve substantial zero-shot gains across four spatial reasoning benchmarks (Super-CLEVR, Omni3DBench, VSI-Bench, and MindCube) without any task-specific adaptations. Notably, after training on the Euclid30K, the mean VSI-Bench accuracy of all evaluated models rose from 34.5% to 40.5%, improving by 5.5 percentage points. Among them, RoboBrain2.0-Euclid-7B achieves 49.6% accuracy, surpassing the previous state-of-the-art model, Spatial-MLLM. To our knowledge, this is the first systematic study showing that geometry-centric fine-tuning can confer vision-language models with broadly transferable spatial skills. Code and Euclid30K dataset can be found in https://zgca-ai4edu.github.io/Euclids_Gift.
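Since GRPO recurs across several entries in this digest, its core trick is worth a line of code: advantages come from standardizing rewards within a group of responses sampled for the same problem, so no value network is needed. A minimal sketch (the clipped policy-gradient update around it is omitted):

```python
import torch

def grpo_advantages(rewards):
    """Group-relative advantages for one prompt's sampled responses,
    rewards: (G,) tensor for a group of G responses."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)
```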
[267] RIFLE: Removal of Image Flicker-Banding via Latent Diffusion Enhancement
Libo Zhu, Zihan Zhou, Xiaoyang Liu, Weihang Zhang, Keyu Shi, Yifan Fu, Yulun Zhang
Main category: cs.CV
TL;DR: RIFLE is a diffusion-based framework that removes flicker-banding (FB) from screen photos while preserving details, using a flicker-banding prior estimator and masked loss, with a simulation pipeline for training data.
Details
Motivation: Flicker-banding in screen photos is a common but underexplored problem that severely impacts readability and quality, unlike moire degradation which has been well-studied.Method: Proposes RIFLE framework with flicker-banding prior estimator (FPE) to predict banding attributes, masked loss for focused supervision, and a simulation pipeline to generate realistic FB data with stochastic jitter, feathered boundaries, and sensor noise.
Result: RIFLE consistently outperforms recent image reconstruction baselines across quantitative metrics and visual comparisons on real-world dataset, from mild to severe flicker-banding.
Conclusion: This is the first work to research FB simulation and removal, establishing foundations for future research in dataset construction and removal model design.
Abstract: Capturing screens is now routine in our everyday lives, but photographs of emissive displays are often degraded by flicker-banding (FB): alternating bright–dark stripes that arise from temporal aliasing between a camera’s rolling-shutter readout and the display’s brightness modulation. Unlike moire degradation, which has been extensively studied, FB remains underexplored despite its frequent and severe impact on readability and perceived quality. We formulate FB removal as a dedicated restoration task and introduce Removal of Image Flicker-Banding via Latent Diffusion Enhancement, RIFLE, a diffusion-based framework designed to remove FB while preserving fine details. We propose the flicker-banding prior estimator (FPE), which predicts key banding attributes and injects them into the restoration network. Additionally, a Masked Loss (ML) is proposed to concentrate supervision on banded regions without sacrificing global fidelity. To overcome data scarcity, we provide a simulation pipeline that synthesizes FB in the luminance domain with stochastic jitter in banding angle, spacing, and width. Feathered boundaries and sensor noise are also applied for a more realistic simulation. For evaluation, we collect a paired real-world FB dataset with pixel-aligned banding-free references captured via long exposure. Across quantitative metrics and visual comparisons on our real-world dataset, RIFLE consistently outperforms recent image reconstruction baselines from mild to severe flicker-banding. To the best of our knowledge, this is the first work to study the simulation and removal of FB. It lays a foundation for subsequent research in both dataset construction and removal model design. Our dataset and code will be released soon.
[268] Causal-Adapter: Taming Text-to-Image Diffusion for Faithful Counterfactual Generation
Lei Tong, Zhihua Liu, Chaochao Lu, Dino Oglic, Tom Diethe, Philip Teare, Sotirios A. Tsaftaris, Chen Jin
Main category: cs.CV
TL;DR: Causal-Adapter is a modular framework that adapts frozen text-to-image diffusion models for counterfactual image generation by enabling causal interventions on target attributes while preserving image identity.
Details
Motivation: Prior approaches rely on prompt engineering without explicit causal structure, lacking precise control over attribute interventions and their causal effects on dependent attributes.Method: Leverages structural causal modeling with two attribute regularization strategies: prompt-aligned injection for semantic control and conditioned token contrastive loss to disentangle attribute factors and reduce spurious correlations.
Result: Achieves state-of-the-art performance with up to 91% MAE reduction on Pendulum for accurate attribute control and 87% FID reduction on ADNI for high-fidelity MRI image generation.
Conclusion: The approach enables robust, generalizable counterfactual editing with faithful attribute modification and strong identity preservation.
Abstract: We present Causal-Adapter, a modular framework that adapts frozen text-to-image diffusion backbones for counterfactual image generation. Our method enables causal interventions on target attributes, consistently propagating their effects to causal dependents without altering the core identity of the image. In contrast to prior approaches that rely on prompt engineering without explicit causal structure, Causal-Adapter leverages structural causal modeling augmented with two attribute regularization strategies: prompt-aligned injection, which aligns causal attributes with textual embeddings for precise semantic control, and a conditioned token contrastive loss to disentangle attribute factors and reduce spurious correlations. Causal-Adapter achieves state-of-the-art performance on both synthetic and real-world datasets, with up to 91% MAE reduction on Pendulum for accurate attribute control and 87% FID reduction on ADNI for high-fidelity MRI image generation. These results show that our approach enables robust, generalizable counterfactual editing with faithful attribute modification and strong identity preservation.
[269] Segmentor-Guided Counterfactual Fine-Tuning for Locally Coherent and Targeted Image Synthesis
Tian Xia, Matthew Sinclair, Andreas Schuh, Fabio De Sousa Ribeiro, Raghav Mehta, Rajat Rasal, Esther Puyol-Antón, Samuel Gerber, Kersten Petersen, Michiel Schaap, Ben Glocker
Main category: cs.CV
TL;DR: Seg-CFT is a method for generating counterfactual medical images that enables structure-specific interventions without requiring pixel-level labels, producing locally coherent changes while avoiding global artifacts.
Details
Motivation: Current counterfactual generation methods rely on external classifiers and are insufficient for structure-specific interventions, often causing undesirable global effects. Previous approaches required tedious pixel-level label maps.Method: Segmentor-guided Counterfactual Fine-Tuning (Seg-CFT) uses structure-specific scalar variables to guide counterfactual generation, preserving simplicity while ensuring local coherence without needing hypothetical segmentations.
Result: The method successfully generates realistic chest radiographs and shows promising results for modeling coronary artery disease, producing effective counterfactuals with localized changes.
Conclusion: Seg-CFT provides a practical solution for structure-specific counterfactual image generation in medical imaging, eliminating the need for pixel-level labels while maintaining intervention effectiveness and local coherence.
Abstract: Counterfactual image generation is a powerful tool for augmenting training data, de-biasing datasets, and modeling disease. Current approaches rely on external classifiers or regressors to increase the effectiveness of subject-level interventions (e.g., changing the patient’s age). For structure-specific interventions (e.g., changing the area of the left lung in a chest radiograph), we show that this is insufficient, and can result in undesirable global effects across the image domain. Previous work used pixel-level label maps as guidance, requiring a user to provide hypothetical segmentations which are tedious and difficult to obtain. We propose Segmentor-guided Counterfactual Fine-Tuning (Seg-CFT), which preserves the simplicity of intervening on scalar-valued, structure-specific variables while producing locally coherent and effective counterfactuals. We demonstrate the capability of generating realistic chest radiographs, and we show promising results for modeling coronary artery disease. Code: https://github.com/biomedia-mira/seg-cft.
[270] GEM: 3D Gaussian Splatting for Efficient and Accurate Cryo-EM Reconstruction
Huaizhi Qu, Xiao Wang, Gengwei Zhang, Jie Peng, Tianlong Chen
Main category: cs.CV
TL;DR: GEM introduces a cryo-EM reconstruction framework using 3D Gaussian Splatting that achieves faster training, lower memory usage, and improved resolution compared to existing methods.
Details
Motivation: Address the computational expense and memory intensity of cryo-EM reconstruction, where traditional Fourier-space methods lose fidelity and neural radiance field approaches have cubic memory/computation overhead.Method: Represents proteins with compact 3D Gaussians (11 parameters each) and uses a novel gradient computation design to reduce memory footprint and training cost.
Result: Achieves up to 48% faster training, 12% lower memory usage, and 38.8% improvement in local resolution on standard cryo-EM benchmarks.
Conclusion: GEM establishes a practical and scalable paradigm for cryo-EM reconstruction that unifies speed, efficiency, and high-resolution accuracy.
Abstract: Cryo-electron microscopy (cryo-EM) has become a central tool for high-resolution structural biology, yet the massive scale of datasets (often exceeding 100k particle images) renders 3D reconstruction both computationally expensive and memory intensive. Traditional Fourier-space methods are efficient but lose fidelity due to repeated transforms, while recent real-space approaches based on neural radiance fields (NeRFs) improve accuracy but incur cubic memory and computation overhead. Therefore, we introduce GEM, a novel cryo-EM reconstruction framework built on 3D Gaussian Splatting (3DGS) that operates directly in real space while maintaining high efficiency. Instead of modeling the entire density volume, GEM represents proteins with compact 3D Gaussians, each parameterized by only 11 values. To further improve training efficiency, we designed a novel gradient computation restricted to the 3D Gaussians that contribute to each voxel. This design substantially reduces both the memory footprint and the training cost. On standard cryo-EM benchmarks, GEM achieves up to 48% faster training and 12% lower memory usage compared to state-of-the-art methods, while improving local resolution by as much as 38.8%. These results establish GEM as a practical and scalable paradigm for cryo-EM reconstruction, unifying speed, efficiency, and high-resolution accuracy. Our code is available at https://github.com/UNITES-Lab/GEM.
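The abstract does not spell out how the 11 parameters per Gaussian break down; one plausible decomposition (an assumption on our part, not the paper's) is a 3-D mean, 3-D log-scales, a unit quaternion for orientation, and a scalar amplitude. A minimal sketch of evaluating one such Gaussian's density contribution at voxel centers:

```python
import numpy as np

def quat_to_rot(q):
    """Unit quaternion (w, x, y, z) -> 3x3 rotation matrix."""
    w, x, y, z = np.asarray(q, dtype=float) / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def gaussian_density(points, mu, log_scale, quat, amp):
    """Density contributed by one 11-parameter Gaussian at voxel centers.

    points: (N, 3) voxel coordinates; mu: (3,) mean; log_scale: (3,) log
    standard deviations; quat: (4,) orientation; amp: scalar amplitude.
    3 + 3 + 4 + 1 = 11 parameters per Gaussian (an assumed split).
    """
    R = quat_to_rot(quat)
    cov = R @ np.diag(np.exp(log_scale) ** 2) @ R.T          # anisotropic covariance
    d = points - mu
    m = np.einsum("ni,ij,nj->n", d, np.linalg.inv(cov), d)   # squared Mahalanobis
    return amp * np.exp(-0.5 * m)
```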
[271] More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models
Xinyu Tian, Shu Zou, Zhaoyuan Yang, Mengqi He, Fabian Waschkowski, Lukas Wesemann, Peter Tu, Jing Zhang
Main category: cs.CV
TL;DR: Multimodal reasoning in Vision-Language Models enhances logical inference but can impair perceptual grounding through visual forgetting. The proposed VAPO method addresses this by steering reasoning toward visually grounded trajectories, achieving state-of-the-art results.
Details
Motivation: While multimodal reasoning shows promise for complex visual tasks, it may gradually impair basic visual recognition due to visual forgetting - where models increasingly disregard visual input during prolonged reasoning.Method: Proposed Vision-Anchored Policy Optimization (VAPO), a method that explicitly steers the reasoning process toward visually grounded trajectories to maintain perceptual grounding.
Result: VAPO-Thinker-7B significantly strengthens visual information reliance and achieves new state-of-the-art results across multiple established benchmarks.
Conclusion: VAPO effectively addresses the visual forgetting problem in multimodal reasoning, balancing enhanced logical inference with maintained perceptual grounding for robust performance.
Abstract: Reasoning has emerged as a pivotal capability in Large Language Models (LLMs). Through Reinforcement Learning (RL), typically Group Relative Policy Optimization (GRPO), these models are able to solve complex tasks such as mathematics and code generation. Building on these advances, recent research has sought to extend reasoning to Vision-Language Models (VLMs), yielding promising results across diverse visual tasks. Despite this progress, our study uncovers the dual nature of multimodal reasoning: while it substantially enhances logical inference and facilitates performance on challenging problems, it may gradually impair perceptual grounding, leading to recognition failures on otherwise basic visual questions. Through further analysis, we attribute this phenomenon to visual forgetting, wherein prolonged reasoning causes the model to increasingly disregard visual input. To address this, we propose Vision-Anchored Policy Optimization (VAPO), a simple yet effective method that explicitly steers the reasoning process toward visually grounded trajectories. The resulting model, VAPO-Thinker-7B, significantly strengthens the model’s reliance on visual information and achieves new state-of-the-art results on a wide range of established benchmarks. Project page: https://xytian1008.github.io/VAPO/
[272] LLaVAShield: Safeguarding Multimodal Multi-Turn Dialogues in Vision-Language Models
Guolei Huang, Qinzhi Peng, Gan Xu, Yuxuan Lu, Yongjun Shen
Main category: cs.CV
TL;DR: This paper introduces the first systematic study of Multimodal Multi-Turn (MMT) dialogue safety, presents the MMDS dataset with 4,484 annotated samples, develops an automated red-teaming framework using MCTS, and proposes LLaVAShield for joint risk detection in user inputs and assistant responses.
Details
Motivation: Vision-Language Models (VLMs) moving into interactive multi-turn use create new safety risks that single-turn or single-modality moderation misses, as malicious intent can be spread across turns and images while context-sensitive replies may still advance harmful content.Method: Systematic definition of MMT dialogue safety, creation of MMDS dataset with fine-grained annotations, development of automated multimodal multi-turn red-teaming framework using Monte Carlo Tree Search (MCTS), and building LLaVAShield for joint risk detection and assessment.
Result: LLaVAShield consistently outperforms strong baselines on MMT content moderation tasks and under dynamic policy configurations, establishing new state-of-the-art results. The MMDS dataset contains 4,484 annotated multimodal dialogue samples with safety ratings, policy dimensions, and evidence-based rationales.
Conclusion: The work provides the first systematic approach to MMT dialogue safety, with publicly released dataset and model to support future research in multimodal multi-turn content moderation.
Abstract: As Vision-Language Models (VLMs) move into interactive, multi-turn use, new safety risks arise that single-turn or single-modality moderation misses. In Multimodal Multi-Turn (MMT) dialogues, malicious intent can be spread across turns and images, while context-sensitive replies may still advance harmful content. To address this challenge, we present the first systematic definition and study of MMT dialogue safety. Building on this formulation, we introduce the Multimodal Multi-turn Dialogue Safety (MMDS) dataset. We further develop an automated multimodal multi-turn red-teaming framework based on Monte Carlo Tree Search (MCTS) to generate unsafe multimodal multi-turn dialogues for MMDS. MMDS contains 4,484 annotated multimodal dialogue samples with fine-grained safety ratings, policy dimension labels, and evidence-based rationales for both users and assistants. Leveraging MMDS, we present LLaVAShield, a powerful tool that jointly detects and assesses risk in user inputs and assistant responses. Across comprehensive experiments, LLaVAShield consistently outperforms strong baselines on MMT content moderation tasks and under dynamic policy configurations, establishing new state-of-the-art results. We will publicly release the dataset and model to support future research.
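The summary does not specify the tree policy used by the red-teaming framework; MCTS implementations typically select which dialogue branch to expand with a UCT-style rule that trades off a branch's estimated value (here, presumably its likelihood of eliciting unsafe content) against how rarely it has been visited. The standard formulation, assumed here rather than taken from the paper, is:

```latex
% Standard UCT selection rule (assumed; the paper's criterion may differ):
a^{*} = \arg\max_{a} \left[ Q(s, a) + c \sqrt{\frac{\ln N(s)}{N(s, a)}} \right]
```

where Q(s, a) is the mean reward of branch a from dialogue state s, N counts visits, and c controls exploration.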
[273] VRWKV-Editor: Reducing quadratic complexity in transformer-based video editing
Abdelilah Aitrouga, Youssef Hmamouche, Amal El Fallah Seghrouchni
Main category: cs.CV
TL;DR: VRWKV-Editor is a novel video editing model that uses linear spatio-temporal aggregation with RWKV transformer to achieve linear complexity, enabling 3.7x speedup and 60% lower memory usage while maintaining competitive performance.
Details
Motivation: Current video editing models suffer from quadratic computational complexity of attention mechanisms, making them unsuitable for long-duration and high-resolution videos in practical real-time applications.Method: Integrates linear spatio-temporal aggregation module into video-based diffusion models using bidirectional weighted key-value recurrence mechanism of RWKV transformer to capture global dependencies with linear complexity.
Result: Achieves up to 3.7x speedup and 60% lower memory usage compared to state-of-the-art diffusion-based methods, while maintaining competitive frame consistency and text alignment. Performance gap widens with longer videos.
Conclusion: VRWKV-Editor successfully addresses computational complexity limitations in video editing, enabling efficient processing of long-duration videos while preserving quality.
Abstract: In light of recent progress in video editing, deep learning models focusing on both spatial and temporal dependencies have emerged as the primary method. However, these models suffer from the quadratic computational complexity of traditional attention mechanisms, making them difficult to adapt to long-duration and high-resolution videos. This limitation restricts their applicability in practical contexts such as real-time video processing. To tackle this challenge, we introduce a method to reduce both the time and space complexity of these systems by proposing VRWKV-Editor, a novel video editing model that integrates a linear spatio-temporal aggregation module into video-based diffusion models. VRWKV-Editor leverages the bidirectional weighted key-value recurrence mechanism of the RWKV transformer to capture global dependencies while preserving temporal coherence, achieving linear complexity without sacrificing quality. Extensive experiments demonstrate that the proposed method achieves up to 3.7x speedup and 60% lower memory usage compared to state-of-the-art diffusion-based video editing methods, while maintaining competitive performance in frame consistency and text alignment. Furthermore, a comparative analysis conducted on videos of different sequence lengths confirms that the gap in editing speed between our approach and self-attention architectures widens with longer videos.
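The source of the linear complexity is that a key-value recurrence carries running statistics forward in O(T) instead of forming a T×T attention matrix. A simplified, single-direction sketch of a WKV-style recurrence (the bidirectional, weighted variant used in the paper adds a second pass; this is an illustration, not the paper's exact formulation):

```python
import numpy as np

def wkv_recurrence(k, v, w, u):
    """Simplified single-direction WKV-style recurrence, O(T) in length T.

    k, v: (T, D) keys/values; w: (D,) positive per-channel decay rates;
    u: (D,) bonus applied to the current token. Numerically naive: real
    implementations track a running maximum of the exponents for stability.
    """
    T, D = k.shape
    decay = np.exp(-w)             # per-channel exponential decay
    num = np.zeros(D)              # running sum of exp(k_i) * v_i
    den = np.zeros(D)              # running sum of exp(k_i)
    out = np.zeros((T, D))
    for t in range(T):
        cur = np.exp(u + k[t])     # current token receives a bonus weight
        out[t] = (num + cur * v[t]) / (den + cur)
        num = decay * num + np.exp(k[t]) * v[t]
        den = decay * den + np.exp(k[t])
    return out
```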
[274] Does Bigger Mean Better? Comparative Analysis of CNNs and Biomedical Vision Language Models in Medical Diagnosis
Ran Tong, Jiaqi Liu, Su Liu, Jiexi Xu, Lanruo Wang, Tong Wang
Main category: cs.CV
TL;DR: Comparative analysis shows that zero-shot medical VLMs like BiomedCLIP can match or outperform supervised CNNs for chest radiograph interpretation when properly calibrated with optimized decision thresholds.
Details
Motivation: To evaluate the diagnostic performance of zero-shot medical vision-language models compared to supervised CNNs for chest radiograph interpretation tasks like pneumonia and tuberculosis detection.Method: Comparative analysis between supervised lightweight CNN and zero-shot medical VLM (BiomedCLIP) on PneumoniaMNIST for pneumonia detection and Shenzhen TB dataset for tuberculosis detection, with decision threshold calibration on validation sets.
Result: After threshold calibration, BiomedCLIP achieved F1-score of 0.8841 vs CNN’s 0.8803 for pneumonia detection, and improved from 0.4812 to 0.7684 (vs CNN’s 0.7834) for tuberculosis detection.
Conclusion: Proper calibration is essential for unlocking the full diagnostic potential of zero-shot VLMs, enabling them to compete with or surpass task-specific supervised models in medical imaging tasks.
Abstract: The accurate interpretation of chest radiographs using automated methods is a critical task in medical imaging. This paper presents a comparative analysis between a supervised lightweight Convolutional Neural Network (CNN) and a state-of-the-art, zero-shot medical Vision-Language Model (VLM), BiomedCLIP, across two distinct diagnostic tasks: pneumonia detection on the PneumoniaMNIST benchmark and tuberculosis detection on the Shenzhen TB dataset. Our experiments show that supervised CNNs serve as highly competitive baselines in both cases. While the default zero-shot performance of the VLM is lower, we demonstrate that its potential can be unlocked via a simple yet crucial remedy: decision threshold calibration. By optimizing the classification threshold on a validation set, the performance of BiomedCLIP is significantly boosted across both datasets. For pneumonia detection, calibration enables the zero-shot VLM to achieve a superior F1-score of 0.8841, surpassing the supervised CNN’s 0.8803. For tuberculosis detection, calibration dramatically improves the F1-score from 0.4812 to 0.7684, bringing it close to the supervised baseline’s 0.7834. This work highlights a key insight: proper calibration is essential for leveraging the full diagnostic power of zero-shot VLMs, enabling them to match or even outperform efficient, task-specific supervised models.
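The calibration step itself is simple: sweep candidate thresholds over the validation set's zero-shot scores and keep the one that maximizes F1. A minimal sketch (loading the data and computing BiomedCLIP image-text similarity scores are assumed to happen elsewhere):

```python
import numpy as np
from sklearn.metrics import f1_score

def calibrate_threshold(val_scores, val_labels, n_grid=101):
    """Pick the decision threshold that maximizes F1 on a validation set.

    val_scores: (N,) zero-shot positive-class scores, e.g. softmaxed
    image-text similarities; val_labels: (N,) binary ground truth.
    """
    candidates = np.linspace(val_scores.min(), val_scores.max(), n_grid)
    f1s = [f1_score(val_labels, val_scores >= t) for t in candidates]
    return candidates[int(np.argmax(f1s))]

# At test time, classify with the calibrated threshold instead of 0.5:
# preds = test_scores >= calibrate_threshold(val_scores, val_labels)
```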
[275] Normal-Abnormal Guided Generalist Anomaly Detection
Yuexin Wang, Xiaolei Wang, Yizheng Gong, Jimin Xiao
Main category: cs.CV
TL;DR: NAGL is a generalist anomaly detection framework that uses both normal and abnormal samples as references, outperforming previous methods that only used normal samples.
Details
Motivation: Previous GAD methods only used normal samples as references, ignoring valuable information from anomalous samples that are often available in real-world scenarios.Method: Proposed Normal-Abnormal Generalist Learning (NAGL) with two components: Residual Mining (RM) to extract abnormal patterns from normal-abnormal reference residuals, and Anomaly Feature Learning (AFL) to adaptively learn anomaly features through residual mapping.
Result: Extensive experiments across multiple benchmarks show the method significantly outperforms existing GAD approaches.
Conclusion: This is the first work to adopt a mixture of normal and abnormal samples as references in generalist anomaly detection, enabling more accurate and efficient cross-domain anomaly detection.
Abstract: Generalist Anomaly Detection (GAD) aims to train a unified model on an original domain that can detect anomalies in new target domains. Previous GAD methods primarily use only normal samples as references, overlooking the valuable information contained in anomalous samples that are often available in real-world scenarios. To address this limitation, we propose a more practical approach: normal-abnormal-guided generalist anomaly detection, which leverages both normal and anomalous samples as references to guide anomaly detection across diverse domains. We introduce the Normal-Abnormal Generalist Learning (NAGL) framework, consisting of two key components: Residual Mining (RM) and Anomaly Feature Learning (AFL). RM extracts abnormal patterns from normal-abnormal reference residuals to establish transferable anomaly representations, while AFL adaptively learns anomaly features in query images through residual mapping to identify instance-aware anomalies. Our approach effectively utilizes both normal and anomalous references for more accurate and efficient cross-domain anomaly detection. Extensive experiments across multiple benchmarks demonstrate that our method significantly outperforms existing GAD approaches. This work is the first to adopt a mixture of normal and abnormal samples as references in generalist anomaly detection. The code and datasets are available at https://github.com/JasonKyng/NAGL.
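Although the paper's Residual Mining module is more involved, the underlying intuition of scoring a query against both normal and abnormal references can be illustrated with a toy patch-feature residual score (feature extraction and the learned residual mapping are abstracted away; this is not the paper's method):

```python
import torch

def residual_anomaly_score(query, normal_ref, abnormal_ref):
    """Toy residual-based score: distance to the nearest normal reference
    patch minus distance to the nearest abnormal one. Higher = more anomalous.

    query: (P, D) patch features; normal_ref: (M, D); abnormal_ref: (K, D).
    """
    d_normal = torch.cdist(query, normal_ref).min(dim=1).values      # (P,)
    d_abnormal = torch.cdist(query, abnormal_ref).min(dim=1).values  # (P,)
    return d_normal - d_abnormal   # per-patch anomaly score
```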
[276] Multi-Domain Brain Vessel Segmentation Through Feature Disentanglement
Francesco Galati, Daniele Falcetta, Rosa Cortese, Ferran Prados, Ninon Burgos, Maria A. Zuluaga
Main category: cs.CV
TL;DR: A domain adaptation framework for brain vessel segmentation that uses image-to-image translation to handle multiple imaging modalities without domain-specific model design or data harmonization.
Details
Motivation: Brain vessel morphology is complex and requires comprehensive understanding across different imaging modalities for accurate treatment of brain conditions, but current models typically focus on single modalities.Method: Employs disentanglement techniques to independently manipulate image properties, specifically focusing on changing vessel appearances while preserving spatial information (shapes and locations) crucial for segmentation.
Result: Effectively bridges large domain gaps across medical centers, image modalities, and vessel types, demonstrating robustness and versatility in cerebrovascular image segmentation.
Conclusion: The framework shows potential for accurate cerebrovascular segmentation in multiple scenarios through domain adaptation methodologies, with code publicly available.
Abstract: The intricate morphology of brain vessels poses significant challenges for automatic segmentation models, which usually focus on a single imaging modality. However, accurately treating brain-related conditions requires a comprehensive understanding of the cerebrovascular tree, regardless of the specific acquisition procedure. Our framework effectively segments brain arteries and veins in various datasets through image-to-image translation while avoiding domain-specific model design and data harmonization between the source and the target domain. This is accomplished by employing disentanglement techniques to independently manipulate different image properties, allowing them to move from one domain to another in a label-preserving manner. Specifically, we focus on manipulating vessel appearances during adaptation while preserving spatial information, such as shapes and locations, which are crucial for correct segmentation. Our evaluation effectively bridges large and varied domain gaps across medical centers, image modalities, and vessel types. Additionally, we conduct ablation studies on the optimal number of required annotations and other architectural choices. The results highlight our framework’s robustness and versatility, demonstrating the potential of domain adaptation methodologies to perform cerebrovascular image segmentation in multiple scenarios accurately. Our code is available at https://github.com/i-vesseg/MultiVesSeg.
[277] Equivariant Splitting: Self-supervised learning from incomplete data
Victor Sechaud, Jérémy Scanvic, Quentin Barthélemy, Patrice Abry, Julián Tachella
Main category: cs.CV
TL;DR: A new self-supervised learning method for inverse problems using equivariant networks and splitting losses that matches supervised performance without ground-truth data.
Details
Motivation: Ground-truth data for training reconstruction networks is often expensive or impossible to obtain, especially for inverse problems with incomplete measurements.Method: Combines self-supervised splitting losses with equivariant reconstruction networks, defining a new notion of equivariance for reconstruction tasks.
Result: Achieves state-of-the-art performance on image inpainting, accelerated MRI, and compressive sensing, particularly with highly rank-deficient forward models.
Conclusion: The proposed self-supervised approach effectively replaces supervised learning for inverse problems, enabling learning-based solutions without ground-truth references.
Abstract: Self-supervised learning for inverse problems makes it possible to train a reconstruction network from noisy and/or incomplete data alone. These methods have the potential to enable learning-based solutions when obtaining ground-truth references for training is expensive or even impossible. In this paper, we propose a new self-supervised learning strategy devised for the challenging setting where measurements are observed via a single incomplete observation model. We introduce a new definition of equivariance in the context of reconstruction networks, and show that the combination of self-supervised splitting losses and equivariant reconstruction networks results in the same minimizer in expectation as that of a supervised loss. Through a series of experiments on image inpainting, accelerated magnetic resonance imaging, and compressive sensing, we demonstrate that the proposed loss achieves state-of-the-art performance in settings with highly rank-deficient forward models.
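To make the construction concrete, a measurement-splitting loss in the spirit of what the paper builds on (the notation is ours; the paper's precise definitions may differ) splits each incomplete measurement y = Ax with operators M1, M2, reconstructs from one part, and supervises on the held-out part, with equivariance imposed on the network:

```latex
% Splitting loss with an equivariance constraint (notation ours, assumed):
\mathcal{L}(\theta) = \mathbb{E}_{y,\,(M_1, M_2)}
  \big\| M_2\, A\, f_\theta(M_1 y) - M_2\, y \big\|_2^2,
\qquad \text{s.t.} \qquad
f_\theta(T_g\, y) = T'_g\, f_\theta(y) \quad \forall g \in \mathcal{G},
```

where T_g and T'_g are the actions of a transformation group on measurements and reconstructions, respectively.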
[278] SoftCFG: Uncertainty-guided Stable Guidance for Visual Autoregressive Model
Dongli Xu, Aleksei Tiulpin, Matthew B. Blaschko
Main category: cs.CV
TL;DR: SoftCFG is a training-free, model-agnostic method that addresses guidance diminishing and over-guidance issues in autoregressive image generation by distributing adaptive perturbations across all tokens with certainty weighting and Step Normalization.
Details
Motivation: Classifier-Free Guidance (CFG) in autoregressive models suffers from guidance diminishing (conditional-unconditional gap vanishes during decoding) and over-guidance (strong conditions distort visual coherence), limiting their effectiveness in conditional image generation.Method: Proposes SoftCFG with uncertainty-guided inference that distributes adaptive perturbations across all tokens using certainty weighting, and introduces Step Normalization to bound cumulative perturbations for stable long-sequence generation.
Result: SoftCFG significantly improves image quality over standard CFG and achieves state-of-the-art FID on ImageNet 256×256 among autoregressive models.
Conclusion: SoftCFG effectively resolves CFG issues in autoregressive models through adaptive token-wise guidance and stabilization techniques, enabling better conditional image generation without requiring model retraining.
Abstract: Autoregressive (AR) models have emerged as powerful tools for image generation by modeling images as sequences of discrete tokens. While Classifier-Free Guidance (CFG) has been adopted to improve conditional generation, its application in AR models faces two key issues: guidance diminishing, where the conditional-unconditional gap quickly vanishes as decoding progresses, and over-guidance, where strong conditions distort visual coherence. To address these challenges, we propose SoftCFG, an uncertainty-guided inference method that distributes adaptive perturbations across all tokens in the sequence. The key idea behind SoftCFG is to let each generated token contribute certainty-weighted guidance, ensuring that the signal persists across steps while resolving conflicts between text guidance and visual context. To further stabilize long-sequence generation, we introduce Step Normalization, which bounds the cumulative perturbations of SoftCFG. Our method is training-free, model-agnostic, and seamlessly integrates with existing AR pipelines. Experiments show that SoftCFG significantly improves image quality over standard CFG and achieves state-of-the-art FID on ImageNet 256×256 among autoregressive models.
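For reference, standard CFG forms each decoding step's logits by extrapolating from the unconditional toward the conditional prediction. A certainty-weighted variant in the spirit of SoftCFG could look roughly like the following; the exact weighting and the Step Normalization constant are our assumptions, not the paper's definitions:

```python
import torch

def softcfg_logits(cond_logits, uncond_logits, token_certainty,
                   w=3.0, step_norm=1.0):
    """Hypothetical SoftCFG-style guidance at one decoding step.

    cond_logits / uncond_logits: (V,) logits with and without the condition.
    token_certainty: scalar in [0, 1], e.g. derived from the confidence of
    previously generated tokens, modulating the guidance strength.
    """
    guidance = cond_logits - uncond_logits        # standard CFG direction
    scale = w * token_certainty / step_norm       # certainty-weighted, bounded
    return uncond_logits + scale * guidance

# Standard CFG is recovered with token_certainty = 1 and step_norm = 1.
```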
cs.AI
[279] OR-Toolformer: Modeling and Solving Operations Research Problems with Tool Augmented Large Language Models
Jianzhang Zhang, Jialong Zhou, Chuang Liu
Main category: cs.AI
TL;DR: OR-Toolformer fine-tunes Llama-3.1-8B-Instruct with tool augmentation to solve OR problems, achieving up to 80.1% accuracy and strong generalization to unseen problem types.
Details
Motivation: Address privacy concerns with closed-source LLM APIs for OR tasks and reduce high compute costs of training open-source models from scratch.Method: Fine-tunes Llama-3.1-8B-Instruct using semi-automatic data synthesis pipeline that generates diverse OR problem-answer pairs and augments the model with external solvers to produce API calls.
Result: Achieves up to 80.1% execution accuracy on three of four standard benchmarks (exceeding baselines by over 4.3%) and 54% average accuracy on unseen OR problem types (21 percentage-point improvement over strongest baseline).
Conclusion: Tool-augmented fine-tuning of LLMs is effective for accurate and generalizable OR problem modeling and solving.
Abstract: Large language models (LLMs) demonstrate strong mathematical reasoning, but reliance on closed-source APIs for OR tasks raises privacy concerns, and training open-source models from scratch incurs high compute costs. We introduce OR-Toolformer, which fine-tunes Llama-3.1-8B-Instruct with a semi-automatic data synthesis pipeline that generates diverse OR problem-answer pairs and augments the model with external solvers to produce API calls. On three of four standard benchmarks, OR-Toolformer achieves up to 80.1% execution accuracy, exceeding size-matched baselines by over 4.3%. In zero-shot evaluation on two unseen OR problem types, it attains 54% average accuracy, a 21 percentage-point improvement over the strongest baseline. These findings validate the efficacy of tool-augmented fine-tuning of LLMs for accurate and generalizable OR problem modeling and solving.
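The paper's API schema is not published in the summary, but the pattern, in which the LLM emits a structured solver call that is executed externally, can be sketched with a tiny linear program routed to scipy. The call format below is hypothetical:

```python
from scipy.optimize import linprog

# Hypothetical API call the fine-tuned model might emit for:
# "maximize 3x + 2y subject to x + y <= 4, x <= 2, x >= 0, y >= 0"
call = {
    "solver": "linprog",
    "c": [-3, -2],               # negate the objective for maximization
    "A_ub": [[1, 1], [1, 0]],    # x + y <= 4, x <= 2
    "b_ub": [4, 2],
    "bounds": [(0, None), (0, None)],
}

res = linprog(c=call["c"], A_ub=call["A_ub"], b_ub=call["b_ub"],
              bounds=call["bounds"])
print(-res.fun, res.x)           # optimal value 10.0 at x = 2, y = 2
```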
[280] Modeling Others’ Minds as Code
Kunal Jha, Aydan Yuenan Huang, Eric Ye, Natasha Jaques, Max Kleiman-Weiner
Main category: cs.AI
TL;DR: ROTE models human behavior as predictable behavioral programs rather than rational policies, using LLMs for program synthesis and probabilistic inference for uncertainty, achieving 50% better accuracy than baselines.
Details
Motivation: Existing human behavior modeling approaches are data-hungry and brittle due to unrealistic rationality assumptions or computational demands. Many social interactions follow predictable patterns that minimize cognitive load.Method: ROTE treats action understanding as program synthesis, using LLMs to generate hypothesis space of behavioral programs and probabilistic inference to reason about uncertainty over that space.
Result: ROTE outperforms competitive baselines (behavior cloning and LLM-based methods) by up to 50% in both in-sample accuracy and out-of-sample generalization across gridworld tasks and embodied household simulator.
Conclusion: By modeling behavior as programs rather than belief-conditioned policies, ROTE enables efficient and effective human behavior prediction for real-world AI systems.
Abstract: Accurate prediction of human behavior is essential for robust and safe human-AI collaboration. However, existing approaches for modeling people are often data-hungry and brittle because they either make unrealistic assumptions about rationality or are too computationally demanding to adapt rapidly. Our key insight is that many everyday social interactions may follow predictable patterns; efficient “scripts” that minimize cognitive load for actors and observers, e.g., “wait for the green light, then go.” We propose modeling these routines as behavioral programs instantiated in computer code rather than policies conditioned on beliefs and desires. We introduce ROTE, a novel algorithm that leverages both large language models (LLMs) for synthesizing a hypothesis space of behavioral programs, and probabilistic inference for reasoning about uncertainty over that space. We test ROTE in a suite of gridworld tasks and a large-scale embodied household simulator. ROTE predicts human and AI behaviors from sparse observations, outperforming competitive baselines – including behavior cloning and LLM-based methods – by as much as 50% in terms of in-sample accuracy and out-of-sample generalization. By treating action understanding as a program synthesis problem, ROTE opens a path for AI systems to efficiently and effectively predict human behavior in the real-world.
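A toy version of the inference step: given LLM-proposed candidate programs, weight each by how well it reproduces the observed actions. The program format and scoring below are illustrative assumptions, not the paper's implementation:

```python
def posterior_over_programs(programs, observations, eps=0.05):
    """Score candidate behavioral programs by likelihood of observed actions.

    programs: list of callables mapping a state dict to a predicted action.
    observations: list of (state, action) pairs.
    eps: probability of an off-program action (keeps likelihoods nonzero).
    """
    weights = []
    for prog in programs:
        lik = 1.0
        for state, action in observations:
            lik *= (1 - eps) if prog(state) == action else eps
        weights.append(lik)
    z = sum(weights)
    return [w / z for w in weights]

# Example: two hypotheses about a driver at a traffic light.
progs = [lambda s: "go" if s["light"] == "green" else "wait",
         lambda s: "go"]  # this one ignores the light
obs = [({"light": "red"}, "wait"), ({"light": "green"}, "go")]
print(posterior_over_programs(progs, obs))  # the first program dominates
```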
[281] The Social Laboratory: A Psychometric Framework for Multi-Agent LLM Evaluation
Zarreen Reza
Main category: cs.AI
TL;DR: A multi-agent debate framework for evaluating emergent social behaviors in LLM agents, revealing strong consensus-seeking tendencies and measurable psychometric profiles.
Details
Motivation: Traditional benchmarks are insufficient for capturing emergent social and cognitive dynamics in autonomous LLM agents during communication, persuasion, and collaboration.Method: Multi-agent debate framework with LLM agents having distinct personas and incentives, supervised by an LLM moderator, analyzed using psychometric and semantic metrics.
Result: Agents consistently reach high semantic agreement (μ > 0.88) even without explicit instruction, personas induce stable psychometric profiles, and moderator persona significantly alters debate outcomes.
Conclusion: Provides a blueprint for dynamic, psychometrically grounded evaluation protocols for understanding and shaping social behaviors in next-generation AI agents.
Abstract: As Large Language Models (LLMs) transition from static tools to autonomous agents, traditional evaluation benchmarks that measure performance on downstream tasks are becoming insufficient. These methods fail to capture the emergent social and cognitive dynamics that arise when agents communicate, persuade, and collaborate in interactive environments. To address this gap, we introduce a novel evaluation framework that uses multi-agent debate as a controlled “social laboratory” to discover and quantify these behaviors. In our framework, LLM-based agents, instantiated with distinct personas and incentives, deliberate on a wide range of challenging topics under the supervision of an LLM moderator. Our analysis, enabled by a new suite of psychometric and semantic metrics, reveals several key findings. Across hundreds of debates, we uncover a powerful and robust emergent tendency for agents to seek consensus, consistently reaching high semantic agreement (μ > 0.88) even without explicit instruction and across sensitive topics. We show that assigned personas induce stable, measurable psychometric profiles, particularly in cognitive effort, and that the moderator’s persona can significantly alter debate outcomes by structuring the environment, a key finding for external AI alignment. This work provides a blueprint for a new class of dynamic, psychometrically grounded evaluation protocols designed for the agentic setting, offering a crucial methodology for understanding and shaping the social behaviors of the next generation of AI agents. We have released the code and results at https://github.com/znreza/multi-agent-LLM-eval-for-debate.
[282] Cyber Academia-Chemical Engineering (CA-ChemE): A Living Digital Town for Self-Directed Research Evolution and Emergent Scientific Discovery
Zekun Jiang, Chunming Xu, Tianhang Zhou
Main category: cs.AI
TL;DR: The CA-ChemE system is a multi-agent digital ecosystem that enables autonomous scientific discovery in chemical engineering through knowledge-enhanced collaboration, addressing interdisciplinary gaps and improving research efficiency.
Details
Motivation: To overcome limitations in existing AI systems for chemical engineering, particularly their inability to handle interdisciplinary collaboration and explore uncharted problems autonomously.Method: Developed a living digital town with multi-agent collaboration, integrating domain-specific knowledge bases, knowledge enhancement technologies, and a Collaboration Agent with ontology engineering capabilities.
Result: Knowledge base enhancement improved dialogue quality by 10-15%, while the Collaboration Agent achieved 8.5% improvements for distant-domain expert pairs versus only 0.8% for domain-proximate pairs, revealing a 10.6-fold difference in collaborative efficiency.
Conclusion: Carefully designed multi-agent architectures provide a viable pathway toward autonomous scientific discovery in chemical engineering by addressing knowledge-base gaps and enabling effective interdisciplinary collaboration.
Abstract: The rapid advancement of artificial intelligence (AI) has demonstrated substantial potential in chemical engineering, yet existing AI systems remain limited in interdisciplinary collaboration and exploration of uncharted problems. To address these issues, we present the Cyber Academia-Chemical Engineering (CA-ChemE) system, a living digital town that enables self-directed research evolution and emergent scientific discovery through multi-agent collaboration. By integrating domain-specific knowledge bases, knowledge enhancement technologies, and collaboration agents, the system successfully constructs an intelligent ecosystem capable of deep professional reasoning and efficient interdisciplinary collaboration. Our findings demonstrate that knowledge base-enabled enhancement mechanisms improved dialogue quality scores by 10-15% on average across all seven expert agents, fundamentally ensuring technical judgments are grounded in verifiable scientific evidence. However, we observed a critical bottleneck in cross-domain collaboration efficiency, prompting the introduction of a Collaboration Agent (CA) equipped with ontology engineering capabilities. CA’s intervention achieved 8.5% improvements for distant-domain expert pairs compared to only 0.8% for domain-proximate pairs - a 10.6-fold difference - unveiling the “diminished collaborative efficiency caused by knowledge-base gaps” effect. This study demonstrates how carefully designed multi-agent architectures can provide a viable pathway toward autonomous scientific discovery in chemical engineering.
[283] Agentic Jigsaw Interaction Learning for Enhancing Visual Perception and Reasoning in Vision-Language Models
Yu Zeng, Wenxuan Huang, Shiting Huang, Xikun Bao, Yukun Qi, Yiming Zhao, Qiuchen Wang, Lin Chen, Zehui Chen, Huaian Chen, Wanli Ouyang, Feng Zhao
Main category: cs.AI
TL;DR: AGILE is an agentic jigsaw interaction learning method that enhances visual perception and reasoning in VLMs through interactive jigsaw solving, achieving significant performance improvements on jigsaw tasks and strong generalization across 9 vision tasks.
Details
Motivation: Current VLMs have limited fundamental perceptual and reasoning abilities, performing poorly even on simple jigsaw tasks. The scarcity and limited scalability of high-quality vision-language data constrains improving these capabilities.Method: AGILE formulates jigsaw solving as an interactive process where the model generates executable code to perform actions based on current state, while the environment provides fine-grained visual feedback to guide task completion through iterative observation and interaction cycles.
Result: AGILE substantially boosts jigsaw task performance (increasing accuracy from 9.5% to 82.8% under 2×2 setting) and demonstrates strong generalization across 9 general vision tasks with average 3.1% improvement.
Conclusion: AGILE provides an efficient, scalable solution to multimodal data scarcity and opens new avenues for advancing reasoning and generalization in multimodal models through interactive learning.
Abstract: Although current large Vision-Language Models (VLMs) have advanced in multimodal understanding and reasoning, their fundamental perceptual and reasoning abilities remain limited. Specifically, even on simple jigsaw tasks, existing VLMs perform near randomly, revealing deficiencies in core perception and reasoning capabilities. While high-quality vision-language data can enhance these capabilities, its scarcity and limited scalability impose significant constraints. To address this, we propose AGILE, an Agentic jiGsaw Interaction Learning for Enhancing visual perception and reasoning in VLMs. AGILE formulates jigsaw solving as an interactive process, enabling the model to progressively engage with the environment. At each step, the model generates executable code to perform an action based on the current state, while the environment provides fine-grained visual feedback to guide task completion. Through this iterative cycle of observation and interaction, the model incrementally improves its perceptual and reasoning capabilities via exploration and feedback. Experimental results show that AGILE not only substantially boosts performance on jigsaw tasks of varying complexity (e.g., increasing accuracy from 9.5% to 82.8% under the 2×2 setting) but also demonstrates strong generalization across 9 general vision tasks, achieving an average improvement of 3.1%. These results indicate notable enhancements in both perceptual and reasoning abilities. This work opens a new avenue for advancing reasoning and generalization in multimodal models and provides an efficient, scalable solution to the scarcity of multimodal reinforcement learning data. The code and datasets are available at https://github.com/yuzeng0-0/AGILE.
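The interaction protocol reduces to a loop: observe the state, have the model emit executable code, let the environment run it and return visual feedback. A schematic sketch, with the model/environment interfaces invented for illustration:

```python
def agile_episode(model, env, max_steps=20):
    """Schematic observe-act-feedback loop for interactive jigsaw solving.

    model.propose(state) is assumed to return a string of executable code
    (e.g. a swap of two pieces); env.execute(code) applies it and returns
    the new rendered state, fine-grained visual feedback, and a done flag.
    """
    state = env.reset()
    solved = False
    for _ in range(max_steps):
        code = model.propose(state)          # e.g. "swap(piece_a=0, piece_b=3)"
        state, feedback, solved = env.execute(code)
        model.observe(feedback)              # fold feedback into the context
        if solved:
            break
    return solved
```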
[284] Aristotle: IMO-level Automated Theorem Proving
Tudor Achim, Alex Best, Kevin Der, Mathïs Fédérico, Sergei Gukov, Daniel Halpern-Leister, Kirsten Henningsgard, Yury Kudryashov, Alexander Meiburg, Martin Michelsen, Riley Patterson, Eric Rodriguez, Laura Scharff, Vikram Shanker, Vladmir Sicca, Hari Sowrirajan, Aidan Swope, Matyas Tamas, Vlad Tenev, Jonathan Thomm, Harold Williams, Lawrence Wu
Main category: cs.AI
TL;DR: Aristotle is an AI system that combines formal verification with informal reasoning, achieving gold-medal-level performance on 2025 IMO problems through integration of Lean proof search, lemma generation/formalization, and a geometry solver.
Details
Motivation: To advance automated theorem proving by combining formal verification with human-like informal reasoning approaches, enabling state-of-the-art performance on challenging mathematical problems like the International Mathematical Olympiad.Method: Integrates three main components: a Lean proof search system for formal verification, an informal reasoning system that generates and formalizes lemmas, and a dedicated geometry solver for geometric problems.
Result: Achieved gold-medal-equivalent performance on the 2025 International Mathematical Olympiad problems, demonstrating state-of-the-art performance with favorable scaling properties for automated theorem proving.
Conclusion: The Aristotle system successfully demonstrates that combining formal verification with informal reasoning enables superior performance in automated theorem proving, particularly on complex mathematical problems like IMO challenges.
Abstract: We introduce Aristotle, an AI system that combines formal verification with informal reasoning, achieving gold-medal-equivalent performance on the 2025 International Mathematical Olympiad problems. Aristotle integrates three main components: a Lean proof search system, an informal reasoning system that generates and formalizes lemmas, and a dedicated geometry solver. Our system demonstrates state-of-the-art performance with favorable scaling properties for automated theorem proving.
[285] MEMTRACK: Evaluating Long-Term Memory and State Tracking in Multi-Platform Dynamic Agent Environments
Darshan Deshpande, Varun Gangal, Hersh Mehta, Anand Kannappan, Rebecca Qian, Peng Wang
Main category: cs.AI
TL;DR: MEMTRACK is a benchmark for evaluating long-term memory and state tracking in multi-platform agent environments, focusing on realistic organizational workflows across communication and productivity platforms.
Details
Motivation: Existing context and memory benchmarks focus on conversational instances, but there's a need for evaluating memory in dynamic enterprise environments for effective application.Method: MEMTRACK integrates asynchronous events across multiple platforms (Slack, Linear, Git) with noisy, conflicting, cross-referring information. Dataset is curated through manual expert design and agent-based synthesis of real-world software development scenarios.
Result: Experiments show challenges in long-horizon memory utilization, cross-platform dependencies, and contradiction resolution. The best performing GPT-5 model achieves only 60% Correctness score.
Conclusion: MEMTRACK provides an extensible framework for advancing memory-augmented agent evaluation beyond conversational setups, enabling multi-agent, multi-platform memory benchmarking in complex organizational settings.
Abstract: Recent works on context and memory benchmarking have primarily focused on conversational instances, but evaluating memory in dynamic enterprise environments is crucial for its effective application. We introduce MEMTRACK, a benchmark designed to evaluate long-term memory and state tracking in multi-platform agent environments. MEMTRACK models realistic organizational workflows by integrating asynchronous events across multiple communication and productivity platforms such as Slack, Linear and Git. Each benchmark instance provides a chronologically platform-interleaved timeline, with noisy, conflicting, cross-referring information as well as potential codebase/file-system comprehension and exploration. Consequently, our benchmark tests memory capabilities such as acquisition, selection and conflict resolution. We curate the MEMTRACK dataset through both manual expert-driven design and scalable agent-based synthesis, generating ecologically valid scenarios grounded in real-world software development processes. We introduce pertinent metrics for Correctness, Efficiency, and Redundancy that capture the effectiveness of memory mechanisms beyond simple QA performance. Experiments across SoTA LLMs and memory backends reveal challenges in utilizing memory across long horizons, handling cross-platform dependencies, and resolving contradictions. Notably, the best performing GPT-5 model only achieves a 60% Correctness score on MEMTRACK. This work provides an extensible framework for advancing evaluation research for memory-augmented agents, beyond the existing focus on conversational setups, and sets the stage for multi-agent, multi-platform memory benchmarking in complex organizational settings.
[286] To Mask or to Mirror: Human-AI Alignment in Collective Reasoning
Crystal Qian, Aaron Parisi, Clémentine Bouleau, Vivian Tsai, Maël Lebreton, Lucas Dixon
Main category: cs.AI
TL;DR: LLMs show varying alignment with human collective decision-making in social reasoning tasks, with some mirroring human biases and others attempting to compensate for them.
Details
Motivation: To examine how large language models align with human social reasoning in collective decision-making contexts, moving beyond individual-level analysis.Method: Conducted large-scale online experiment (N=748) using Lost at Sea social psychology task, comparing human groups with visible demographic attributes vs pseudonymous aliases, then simulated matched LLM groups (Gemini 2.5, GPT 4.1, Claude Haiku 3.5, Gemma 3) conditioned on human data.
Result: LLM behaviors diverged significantly - some models mirrored human biases while others masked these biases and attempted to compensate for them.
Conclusion: Human-AI alignment in collective reasoning depends on context, cues, and model-specific inductive biases, requiring dynamic benchmarks that capture collective reasoning complexities.
Abstract: As large language models (LLMs) are increasingly used to model and augment collective decision-making, it is critical to examine their alignment with human social reasoning. We present an empirical framework for assessing collective alignment, in contrast to prior work on the individual level. Using the Lost at Sea social psychology task, we conduct a large-scale online experiment (N=748), randomly assigning groups to leader elections with either visible demographic attributes (e.g. name, gender) or pseudonymous aliases. We then simulate matched LLM groups conditioned on the human data, benchmarking Gemini 2.5, GPT 4.1, Claude Haiku 3.5, and Gemma 3. LLM behaviors diverge: some mirror human biases; others mask these biases and attempt to compensate for them. We empirically demonstrate that human-AI alignment in collective reasoning depends on context, cues, and model-specific inductive biases. Understanding how LLMs align with collective human behavior is critical to advancing socially-aligned AI, and demands dynamic benchmarks that capture the complexities of collective reasoning.
[287] Retrieval-Augmented Framework for LLM-Based Clinical Decision Support
Leon Garza, Anantaa Kotal, Michael A. Grasso, Emre Umucu
Main category: cs.AI
TL;DR: A clinical decision support system using Large Language Models (LLMs) with retrieval-augmented generation (RAG) to assist clinicians in prescribing decisions by analyzing EHR data and generating therapeutic suggestions based on similar historical cases.
Details
Motivation: Address the complexity of clinical decision-making and leverage the growing availability of electronic health records to provide data-informed care support for prescribing clinicians.Method: Uses a retrieval-augmented generation (RAG) pipeline that integrates natural language processing with structured clinical inputs, harmonizing unstructured narratives and codified data to support LLM-based inference for generating therapeutic suggestions.
Result: Preliminary evaluations with de-identified and synthetic clinical datasets show promising clinical plausibility and consistency of the model’s outputs, suggesting LLM-based tools may provide valuable decision support in prescribing workflows.
Conclusion: This work represents an initial step toward integrating generative AI into real-world clinical decision-making with emphasis on transparency, safety, and alignment with established practices, augmenting rather than replacing clinician judgment.
Abstract: The increasing complexity of clinical decision-making, alongside the rapid expansion of electronic health records (EHR), presents both opportunities and challenges for delivering data-informed care. This paper proposes a clinical decision support system powered by Large Language Models (LLMs) to assist prescribing clinicians. The system generates therapeutic suggestions by analyzing historical EHR data, including patient demographics, presenting complaints, clinical symptoms, diagnostic information, and treatment histories. The framework integrates natural language processing with structured clinical inputs to produce contextually relevant recommendations. Rather than replacing clinician judgment, it is designed to augment decision-making by retrieving and synthesizing precedent cases with comparable characteristics, drawing on local datasets or federated sources where applicable. At its core, the system employs a retrieval-augmented generation (RAG) pipeline that harmonizes unstructured narratives and codified data to support LLM-based inference. We outline the system’s technical components, including representation alignment and generation strategies. Preliminary evaluations, conducted with de-identified and synthetic clinical datasets, examine the clinical plausibility and consistency of the model’s outputs. Early findings suggest that LLM-based tools may provide valuable decision support in prescribing workflows when appropriately constrained and rigorously validated. This work represents an initial step toward the integration of generative AI into real-world clinical decision-making, with an emphasis on transparency, safety, and alignment with established practices.
[288] Is It Thinking or Cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort
Xinpeng Wang, Nitish Joshi, Barbara Plank, Rico Angell, He He
Main category: cs.AI
TL;DR: TRACE detects implicit reward hacking by measuring how early a model’s reasoning becomes sufficient to pass verification, identifying shortcuts in reasoning chains.
Details
Motivation: Reward hacking poses a significant threat where models exploit loopholes in reward functions without solving intended tasks, which can be implicit and bypass chain-of-thought monitors.Method: TRACE progressively truncates a model’s chain-of-thought at various lengths, forces the model to answer, and measures verifier-passing rates at each cutoff to quantify reasoning effort.
Result: TRACE achieves over 65% gains over 72B CoT monitors in math reasoning and over 30% gains over 32B monitors in coding, and can discover unknown loopholes during training.
Conclusion: TRACE provides a scalable unsupervised approach for oversight where current monitoring methods are ineffective against implicit reward hacking.
Abstract: Reward hacking, where a reasoning model exploits loopholes in a reward function to achieve high rewards without solving the intended task, poses a significant threat. This behavior may be explicit, i.e. verbalized in the model’s chain-of-thought (CoT), or implicit, where the CoT appears benign and thus bypasses CoT monitors. To detect implicit reward hacking, we propose TRACE (Truncated Reasoning AUC Evaluation). Our key observation is that hacking occurs when exploiting the loophole is easier than solving the actual task. This means that the model is using less ‘effort’ than required to achieve high reward. TRACE quantifies effort by measuring how early a model’s reasoning becomes sufficient to pass a verifier. We progressively truncate a model’s CoT at various lengths, force the model to answer, and measure the verifier-passing rate at each cutoff. A hacking model, which takes a shortcut, will achieve a high passing rate with only a small fraction of its CoT, yielding a large area under the accuracy-vs-length curve. TRACE achieves over 65% gains over our strongest 72B CoT monitor in math reasoning, and over 30% gains over a 32B monitor in coding. We further show that TRACE can discover unknown loopholes during training. Overall, TRACE offers a scalable unsupervised approach for oversight where current monitoring methods prove ineffective.
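The core measurement is straightforward to sketch: truncate the CoT at increasing fractions, force an answer, record the verifier pass rate at each cutoff, and integrate the resulting curve. The answer-forcing and verification interfaces are abstracted away here:

```python
import numpy as np

def trace_score(cot_tokens, force_answer, verifier,
                fractions=np.linspace(0.1, 1.0, 10)):
    """Area under the pass-rate-vs-truncation-length curve.

    force_answer(prefix) is assumed to make the model answer from a
    truncated CoT prefix; verifier(answer) returns True/False. A large
    area means the answer is already decided early in the CoT, which is
    the signature of a shortcut.
    """
    pass_rates = []
    for frac in fractions:
        prefix = cot_tokens[: max(1, int(frac * len(cot_tokens)))]
        pass_rates.append(float(verifier(force_answer(prefix))))
    # Normalize so a model that always passes scores 1.0
    return np.trapz(pass_rates, fractions) / (fractions[-1] - fractions[0])
```

In practice each cutoff would be sampled multiple times and averaged, since forced answers are stochastic.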
[289] Fine-tuning with RAG for Improving LLM Learning of New Skills
Humaid Ibrahim, Nikolai Rozanov, Marek Rei
Main category: cs.AI
TL;DR: A distillation method that converts inference-time retrieval into learned competence by extracting hints from failures, generating improved teacher trajectories, and training student models to internalize knowledge without runtime dependencies.
Details
Motivation: LLM agents frequently fail in predictable ways during multi-step tasks, and while RAG can help, it requires external knowledge databases and adds computational overhead at every deployment.
Method: Extract compact hints from agent failures, use hints to generate improved teacher trajectories via one-shot retrieval at episode start, and train student models on these trajectories with hint strings removed to force internalization.
Result: Distilled students achieved 91% success on ALFWorld (vs. 79% for baselines) and a WebShop score of 72 (vs. 61 for baselines), using 10-60% fewer tokens than retrieval-augmented teachers.
Conclusion: Retrieval benefits can be effectively internalized through targeted fine-tuning without permanent runtime dependencies, generalizing across model scales and agent architectures.
Abstract: Large language model (LLM) agents deployed for multi-step tasks frequently fail in predictable ways: attempting actions with unmet preconditions, issuing redundant commands, or mishandling environment constraints. While retrieval-augmented generation (RAG) can improve performance by providing runtime guidance, it requires maintaining external knowledge databases and adds computational overhead at every deployment. We propose a simple pipeline that converts inference-time retrieval into learned competence through distillation. Our approach: (1) extracts compact, reusable hints from agent failures, (2) uses these hints to generate improved teacher trajectories via one-shot retrieval at episode start, and (3) trains student models on these trajectories with hint strings removed, forcing internalization rather than memorization. Across two interactive benchmarks, ALFWorld (household tasks) and WebShop (online shopping), distilled students consistently outperform baseline agents, achieving up to 91% success on ALFWorld (vs. 79% for baselines) and improving WebShop scores to 72 (vs. 61 for baselines), while using 10-60% fewer tokens than retrieval-augmented teachers depending on the environment. The approach generalizes across model scales (7B/14B parameters) and agent architectures (ReAct/StateAct), demonstrating that retrieval benefits can be effectively internalized through targeted fine-tuning without permanent runtime dependencies.
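The three-stage pipeline lends itself to a short sketch. The interfaces below (`agent.run`, `teacher.run`, `student.finetune`, `extract_hint`, and the trajectory fields) are hypothetical stand-ins for the paper’s setup:

```python
def distill_hints_into_student(agent, teacher, student, envs, extract_hint):
    # 1) Mine compact, reusable hints from the base agent's failures.
    hints = {}
    for env in envs:
        traj = agent.run(env)
        if not traj.success:
            hints[env] = extract_hint(traj)  # e.g. "check preconditions first"

    # 2) Roll out a teacher with one-shot retrieval: the hint is injected
    #    once, at the start of the episode.
    teacher_trajs = [teacher.run(env, hint=hint) for env, hint in hints.items()]

    # 3) Strip hint strings before supervised training, so the student must
    #    internalize the behavior instead of memorizing retrieved text.
    data = [t.strip_hint_text() for t in teacher_trajs if t.success]
    student.finetune(data)
```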
[290] Automating Data-Driven Modeling and Analysis for Engineering Applications using Large Language Model Agents
Yang Liu, Zaid Abulawi, Abhiram Garimidi, Doyeong Lim
Main category: cs.AI
TL;DR: LLM agents automate data-driven modeling for regression tasks, achieving performance comparable to human experts on critical heat flux prediction.
Details
Motivation: Address limitations of traditional data-driven methods that require extensive manual intervention and lack scalability for large scientific datasets.
Method: Two LLM-agent frameworks: multi-agent system with specialized agents, and single-agent system using ReAct paradigm, both handling data preprocessing, neural network development, training, hyperparameter optimization, and uncertainty quantification.
Result: LLM-agent-developed model outperforms traditional CHF lookup tables and matches predictive accuracy and uncertainty quantification of state-of-the-art Bayesian optimized deep neural networks developed by human experts.
Conclusion: LLM-based agents show significant potential to automate complex engineering modeling tasks, reducing human workload while meeting or exceeding current performance standards.
Abstract: Modern engineering increasingly relies on vast datasets generated by experiments and simulations, driving a growing demand for efficient, reliable, and broadly applicable modeling strategies. There is also heightened interest in developing data-driven approaches, particularly neural network models, for effective prediction and analysis of scientific datasets. Traditional data-driven methods frequently involve extensive manual intervention, limiting their ability to scale effectively and generalize to diverse applications. In this study, we propose an innovative pipeline utilizing Large Language Model (LLM) agents to automate data-driven modeling and analysis, with a particular emphasis on regression tasks. We evaluate two LLM-agent frameworks: a multi-agent system featuring specialized collaborative agents, and a single-agent system based on the Reasoning and Acting (ReAct) paradigm. Both frameworks autonomously handle data preprocessing, neural network development, training, hyperparameter optimization, and uncertainty quantification (UQ). We validate our approach using a critical heat flux (CHF) prediction benchmark, involving approximately 25,000 experimental data points from the OECD/NEA benchmark dataset. Results indicate that our LLM-agent-developed model surpasses traditional CHF lookup tables and delivers predictive accuracy and UQ on par with state-of-the-art Bayesian optimized deep neural network models developed by human experts. These outcomes underscore the significant potential of LLM-based agents to automate complex engineering modeling tasks, greatly reducing human workload while meeting or exceeding existing standards of predictive performance.
[291] OntoLogX: Ontology-Guided Knowledge Graph Extraction from Cybersecurity Logs with Large Language Models
Luca Cotti, Idilio Drago, Anisa Rula, Devis Bianchini, Federico Cerutti
Main category: cs.AI
TL;DR: OntoLogX is an AI agent that uses LLMs to transform raw system logs into ontology-grounded knowledge graphs for extracting actionable cyber threat intelligence, with RAG and iterative correction for validity.
Details
Motivation: System logs contain valuable cyber threat intelligence but are limited by lack of structure, semantic inconsistency, and fragmentation, making it difficult to extract coherent threat information.
Method: Uses LLMs with lightweight log ontology, Retrieval Augmented Generation (RAG), and iterative correction to generate valid knowledge graphs, then aggregates KGs into sessions and maps to MITRE ATT&CK tactics.
Result: Robust KG generation across multiple backends and accurate mapping to ATT&CK tactics on both benchmark and real-world honeypot datasets, with improved precision and recall through retrieval and correction.
Conclusion: Ontology-grounded representations with LLMs and correction mechanisms effectively extract actionable CTI from noisy logs, with code-oriented models showing particular effectiveness in structured log analysis.
Abstract: System logs represent a valuable source of Cyber Threat Intelligence (CTI), capturing attacker behaviors, exploited vulnerabilities, and traces of malicious activity. Yet their utility is often limited by lack of structure, semantic inconsistency, and fragmentation across devices and sessions. Extracting actionable CTI from logs therefore requires approaches that can reconcile noisy, heterogeneous data into coherent and interoperable representations. We introduce OntoLogX, an autonomous Artificial Intelligence (AI) agent that leverages Large Language Models (LLMs) to transform raw logs into ontology-grounded Knowledge Graphs (KGs). OntoLogX integrates a lightweight log ontology with Retrieval Augmented Generation (RAG) and iterative correction steps, ensuring that generated KGs are syntactically and semantically valid. Beyond event-level analysis, the system aggregates KGs into sessions and employs an LLM to predict MITRE ATT&CK tactics, linking low-level log evidence to higher-level adversarial objectives. We evaluate OntoLogX on both logs from a public benchmark and a real-world honeypot dataset, demonstrating robust KG generation across multiple KG backends and accurate mapping of adversarial activity to ATT&CK tactics. Results highlight the benefits of retrieval and correction for precision and recall, the effectiveness of code-oriented models in structured log analysis, and the value of ontology-grounded representations for actionable CTI extraction.
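A minimal sketch of the generate-validate-correct loop, assuming an illustrative `validator` that checks a candidate graph against the log ontology and a `retrieve_examples` function implementing the RAG step (neither is the paper’s exact interface):

```python
def log_to_kg(llm, validator, retrieve_examples, log_line, max_rounds=3):
    examples = retrieve_examples(log_line)  # RAG: similar previously-extracted logs
    kg = llm.generate(
        "Extract a knowledge graph conforming to the log ontology.\n"
        f"Examples:\n{examples}\nLog: {log_line}"
    )
    for _ in range(max_rounds):  # iterative correction
        errors = validator(kg)   # syntactic + ontology-conformance checks
        if not errors:
            return kg
        kg = llm.generate(f"Fix these violations: {errors}\nGraph: {kg}")
    return kg  # best effort if violations remain
```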
[292] A Tale of LLMs and Induced Small Proxies: Scalable Agents for Knowledge Mining
Sipeng Zhang, Longfei Yun, Zilong Wang, Jingbo Shang, Letian Peng
Main category: cs.AI
TL;DR: Falconer is a collaborative framework that combines LLMs’ reasoning with lightweight proxy models for scalable knowledge mining, reducing costs by 90% while maintaining accuracy.
Details
Motivation: LLMs are too expensive for large-scale knowledge mining, while traditional pipelines are brittle and can’t generalize to new tasks.
Method: Uses LLMs as planners to decompose instructions and as annotators to train small proxy models, unifying classification and extraction into two atomic operations.
Result: Matches state-of-the-art LLMs in accuracy while reducing inference cost by 90% and accelerating knowledge mining by 20x.
Conclusion: Falconer offers an efficient and scalable foundation for Deep Research by combining LLM reasoning with cost-effective proxy models.
Abstract: At the core of Deep Research is knowledge mining, the task of extracting structured information from massive unstructured text in response to user instructions. Large language models (LLMs) excel at interpreting such instructions but are prohibitively expensive to deploy at scale, while traditional pipelines of classifiers and extractors remain efficient yet brittle and unable to generalize to new tasks. We introduce Falconer, a collaborative framework that combines the agentic reasoning of LLMs with lightweight proxy models for scalable knowledge mining. In Falconer, LLMs act as planners, decomposing user instructions into executable pipelines, and as annotators, generating supervision to train small proxies. The framework unifies classification and extraction into two atomic operations, get label and get span, enabling a single instruction-following model to replace multiple task-specific components. To evaluate the consistency between proxy models incubated by Falconer and annotations provided by humans and large models, we construct new benchmarks covering both planning and end-to-end execution. Experiments show that Falconer closely matches state-of-the-art LLMs in instruction-following accuracy while reducing inference cost by up to 90% and accelerating large-scale knowledge mining by more than 20x, offering an efficient and scalable foundation for Deep Research.
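The two atomic operations are easy to picture as an interface. A sketch of what a Falconer-style proxy might expose, with hypothetical names and an invented example instruction:

```python
from dataclasses import dataclass

@dataclass
class Span:
    start: int
    end: int
    text: str

class Proxy:
    """One small instruction-following model replacing many task-specific
    components; interface assumed for illustration."""

    def get_label(self, text: str, instruction: str, labels: list[str]) -> str:
        """Classification: choose one of `labels` for `text`."""
        ...

    def get_span(self, text: str, instruction: str) -> list[Span]:
        """Extraction: return the spans of `text` answering `instruction`."""
        ...

# An LLM planner would compile "find articles announcing layoffs and pull
# the headcount" into calls such as:
#   proxy.get_label(doc, "Does this article announce layoffs?", ["yes", "no"])
#   proxy.get_span(doc, "Extract the number of employees laid off.")
```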
[293] On the Role of Domain Experts in Creating Effective Tutoring Systems
Sarath Sreedharan, Kelsey Sikes, Nathaniel Blanchard, Lisa Mason, Nikhil Krishnaswamy, Jill Zarestky
Main category: cs.AI
TL;DR: The paper discusses how expert-curated knowledge can enhance AI tutoring systems through explainable AI for lesson generation and curriculum-based adaptive tutoring.
Details
Motivation: To highlight the overlooked role of domain expert knowledge in creating effective AI tutoring systems and demonstrate its potential benefits.
Method: Proposes two approaches: 1) Using explainable AI (XAI) techniques with expert-specified rules to automatically generate lessons, 2) Leveraging expert-defined curricula to develop adaptive tutoring systems with more efficient algorithms.
Result: Presents a case study of creating a tutoring system for pollinator identification where expert knowledge can be easily elicited and applied.
Conclusion: Expert-curated knowledge plays a crucial role in developing novel educational systems and can significantly improve the effectiveness and efficiency of AI tutoring systems.
Abstract: The role that highly curated knowledge, provided by domain experts, could play in creating effective tutoring systems is often overlooked within the AI for education community. In this paper, we highlight this topic by discussing two ways such highly curated expert knowledge could help in creating novel educational systems. First, we will look at how one could use explainable AI (XAI) techniques to automatically create lessons. Most existing XAI methods are primarily aimed at debugging AI systems. However, we will discuss how one could use expert-specified rules about solving specific problems along with novel XAI techniques to automatically generate lessons that could be provided to learners. Second, we will see how an expert-specified curriculum for learning a target concept can help develop adaptive tutoring systems that can not only provide a better learning experience, but could also allow us to use more efficient algorithms to create these systems. Finally, we will highlight the importance of such methods using a case study of creating a tutoring system for pollinator identification, where such knowledge could easily be elicited from experts.
[294] VOGUE: Guiding Exploration with Visual Uncertainty Improves Multimodal Reasoning
Rui Liu, Dian Yu, Tong Zheng, Runpeng Dai, Zongxia Li, Wenhao Yu, Zhenwen Liang, Linfeng Song, Haitao Mi, Pratap Tokekar, Dong Yu
Main category: cs.AI
TL;DR: VOGUE introduces visual uncertainty guided exploration for MLLMs, treating images as stochastic contexts to improve reasoning by quantifying policy sensitivity to visual perturbations.
Details
Motivation: Current RLVR methods for MLLMs struggle with exploration and treat visual input as deterministic, overlooking visual ambiguity and failing to build robust policies against visual variations.
Method: VOGUE shifts exploration from text to visual space by treating images as stochastic contexts, using symmetric KL divergence between raw and noisy branches to quantify visual uncertainty, and combining uncertainty-proportional bonus with token-entropy bonus and annealed sampling.
Result: VOGUE boosts pass@1 accuracy by 2.6% on visual math benchmarks and 3.7% on general-domain reasoning benchmarks, improves pass@4 performance, and mitigates exploration decay in RL fine-tuning.
Conclusion: Grounding exploration in visual input uncertainty is an effective strategy for improving multimodal reasoning in MLLMs.
Abstract: Reinforcement learning with verifiable rewards (RLVR) improves reasoning in large language models (LLMs) but struggles with exploration, an issue that still persists for multimodal LLMs (MLLMs). Current methods treat the visual input as a fixed, deterministic condition, overlooking a critical source of ambiguity and struggling to build policies robust to plausible visual variations. We introduce $\textbf{VOGUE (Visual Uncertainty Guided Exploration)}$, a novel method that shifts exploration from the output (text) to the input (visual) space. By treating the image as a stochastic context, VOGUE quantifies the policy’s sensitivity to visual perturbations using the symmetric KL divergence between a “raw” and “noisy” branch, creating a direct signal for uncertainty-aware exploration. This signal shapes the learning objective via an uncertainty-proportional bonus, which, combined with a token-entropy bonus and an annealed sampling schedule, effectively balances exploration and exploitation. Implemented within GRPO on two model scales (Qwen2.5-VL-3B/7B), VOGUE boosts pass@1 accuracy by an average of 2.6% on three visual math benchmarks and 3.7% on three general-domain reasoning benchmarks, while simultaneously increasing pass@4 performance and mitigating the exploration decay commonly observed in RL fine-tuning. Our work shows that grounding exploration in the inherent uncertainty of visual inputs is an effective strategy for improving multimodal reasoning.
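The uncertainty signal itself is a small computation. A sketch in PyTorch, where the Gaussian pixel noise and the mean reduction are illustrative assumptions rather than the paper’s exact perturbation:

```python
import torch
import torch.nn.functional as F

def visual_uncertainty_bonus(policy, text_ids, image, noise_std=0.1):
    """Symmetric KL between next-token distributions from a raw and a
    perturbed ("noisy") view of the same image; `policy` is assumed to
    return next-token logits."""
    noisy = image + noise_std * torch.randn_like(image)  # stochastic context
    logp_raw = F.log_softmax(policy(text_ids, image), dim=-1)
    logp_noisy = F.log_softmax(policy(text_ids, noisy), dim=-1)
    kl_rn = (logp_raw.exp() * (logp_raw - logp_noisy)).sum(-1)
    kl_nr = (logp_noisy.exp() * (logp_noisy - logp_raw)).sum(-1)
    return 0.5 * (kl_rn + kl_nr).mean()  # uncertainty-proportional bonus
```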
[295] AIReg-Bench: Benchmarking Language Models That Assess AI Regulation Compliance
Bill Marino, Rosco Hunter, Zubair Jamali, Marinos Emmanouil Kalpakos, Mudra Kashyap, Isaiah Hinton, Alexa Hanson, Maahum Nazir, Christoph Schnabl, Felix Steffek, Hongkai Wen, Nicholas D. Lane
Main category: cs.AI
TL;DR: AIReg-Bench is the first benchmark dataset for evaluating LLMs’ ability to assess compliance with the EU AI Act, created through LLM-generated technical documentation samples and expert legal annotations.
Details
Motivation: As governments regulate AI, there’s growing interest in using LLMs to assess AI system compliance with regulations, but no existing benchmarks to evaluate LLM performance on this task.
Method: Created dataset through two-step process: (1) prompting LLM to generate 120 technical documentation excerpts describing plausible AI systems, (2) legal experts reviewing and annotating each sample for AI Act violations.
Result: Established AIReg-Bench dataset with expert-annotated compliance labels, providing a foundation to evaluate frontier LLMs’ ability to reproduce expert compliance assessments.
Conclusion: AIReg-Bench provides a starting point to understand opportunities and limitations of LLM-based AI regulation compliance assessment tools and establishes a benchmark for comparing future LLMs.
Abstract: As governments move to regulate AI, there is growing interest in using Large Language Models (LLMs) to assess whether or not an AI system complies with a given AI Regulation (AIR). However, there is presently no way to benchmark the performance of LLMs at this task. To fill this void, we introduce AIReg-Bench: the first benchmark dataset designed to test how well LLMs can assess compliance with the EU AI Act (AIA). We created this dataset through a two-step process: (1) by prompting an LLM with carefully structured instructions, we generated 120 technical documentation excerpts (samples), each depicting a fictional, albeit plausible, AI system - of the kind an AI provider might produce to demonstrate their compliance with AIR; (2) legal experts then reviewed and annotated each sample to indicate whether, and in what way, the AI system described therein violates specific Articles of the AIA. The resulting dataset, together with our evaluation of whether frontier LLMs can reproduce the experts’ compliance labels, provides a starting point to understand the opportunities and limitations of LLM-based AIR compliance assessment tools and establishes a benchmark against which subsequent LLMs can be compared. The dataset and evaluation code are available at https://github.com/camlsys/aireg-bench.
[296] Lateral Tree-of-Thoughts Surpasses ToT by Incorporating Logically-Consistent, Low-Utility Candidates
Abhinav Madahar
Main category: cs.AI
TL;DR: Lateral Tree-of-Thoughts (LToT) addresses two pathologies in large-budget search: breadth saturation (duplicate candidates) and depth myopia (premature pruning). It separates utility from consistency, treating low-utility but consistent candidates as assets via lateral racing with short-circuit promotion.
Details
Motivation: Standard Tree-of-Thoughts search suffers from breadth saturation (additional samples produce near-duplicates) and depth myopia (noisy short-horizon utilities prune branches with delayed payoffs) when given large test-time compute budgets.
Method: LToT splits the frontier into mainlines (high-utility candidates) and laterals (consistent but low-utility candidates). It uses Lateral Racing with Short-Circuit (LR-SC): capped successive-halving races with width-aware thresholds and immediate promotion when branches clear the mainline bar.
Result: Theoretical analysis shows pseudolinear lateral cost Θ(N₀ log_η N₀) with logarithmic rungs, contrasting with exponential growth of uncapped mainlines. Empirical evaluations are in preparation.
Conclusion: LToT turns large test-time budgets into principled diversity while preserving promotion discipline, mitigating saturation and myopia without inflating compute.
Abstract: Modern deployments increasingly allocate large test-time compute (thousands of tokens or many node expansions) to boost reliability. Under such budgets, standard Tree-of-Thoughts-style search exhibits two pathologies: breadth saturation (additional samples mostly produce near-duplicates, so width stops growing) and depth myopia (noisy short-horizon utilities prune branches whose payoff appears after a few more steps). We propose Lateral Tree-of-Thoughts (LToT), a drop-in controller that separates utility from logical consistency and treats low-utility but consistent candidates as assets rather than waste. The frontier is split into mainlines (high-utility candidates used for exploitation) and laterals (consistent, initially low-utility candidates that receive short, cheap probes before judgment). LToT explores laterals via Lateral Racing with Short-Circuit (LR–SC): a capped successive-halving race that spreads tiny probes across a very wide lateral set, uses width-aware thresholds with repeat-to-confirm, and immediately promotes a branch once its envelope clears the mainline bar; mainlines are kept intentionally narrow so surplus compute is invested where width is cheap. We prove a pseudolinear lateral cost $\Theta(N_0 \log_{\eta} N_0)$ with logarithmically many rungs (initial lateral width $N_0$; culling factor $\eta>1$), in contrast to the exponential growth of uncapped mainlines. Empirical evaluations on benchmark tasks are in preparation and will be added in a future revision. In short, LToT turns large test-time budgets into principled diversity while preserving promotion discipline, mitigating saturation and myopia without inflating compute.
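The LR-SC race is essentially capped successive halving with an early exit. A sketch assuming a hypothetical `probe(candidate, budget)` call that returns a utility estimate; the width-aware thresholds and repeat-to-confirm logic are omitted:

```python
def lateral_race(laterals, probe, mainline_bar, eta=3, budget0=1):
    pool, budget = list(laterals), budget0
    while pool:
        scores = {c: probe(c, budget) for c in pool}  # tiny, cheap probes
        best = max(pool, key=scores.get)
        if scores[best] >= mainline_bar:              # short-circuit promotion
            return best
        if len(pool) == 1:
            break                                     # sole survivor never cleared the bar
        pool = sorted(pool, key=scores.get, reverse=True)[: max(1, len(pool) // eta)]
        budget *= eta                                 # survivors get deeper probes
    return None
```

Each rung probes about N₀/η^i candidates at budget η^i, so every rung costs on the order of N₀ probe units and there are about log_η N₀ rungs, which is where the pseudolinear $\Theta(N_0 \log_{\eta} N_0)$ bound comes from.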
[297] Towards Interpretable and Inference-Optimal COT Reasoning with Sparse Autoencoder-Guided Generation
Daniel Zhao, Abhilash Shankarampeta, Lanxiang Hu, Tajana Rosing, Hao Zhang
Main category: cs.AI
TL;DR: A method using sparse autoencoders and clustering to analyze LLM token representations and guide mathematical reasoning generations by balancing exploitation and exploration.
Details
Motivation: To improve mathematical reasoning in LLMs by analyzing internal token representations and guiding generations to achieve better balance between exploitation (following established reasoning patterns) and exploration (generating diverse solutions).
Method: Trains sparse autoencoders to create sparse vector representations of tokens, applies k-means clustering to build a graph of token clusters with weighted edges for sequential transitions, defines edge-weight based reward function to quantify reasoning adherence, and measures generation diversity from clustering.
Result: The approach successfully identifies exploitative reasoning trajectories and assesses exploration extent, showing that balancing both is crucial for high accuracy in mathematical reasoning tasks.
Conclusion: Sparse autoencoders can serve as scalable reward models to guide LLM generations, ensuring balanced trade-off between exploitation and exploration, which prevents extreme behaviors and fosters higher-quality reasoning processes.
Abstract: We propose a novel method that leverages sparse autoencoders (SAEs) and clustering techniques to analyze the internal token representations of large language models (LLMs) and guide generations in mathematical reasoning tasks. Our approach first trains an SAE to generate sparse vector representations for training tokens, then applies k-means clustering to construct a graph where vertices represent token clusters and weighted edges capture sequential token transitions. Using this graph, we define an edge-weight based reward function to quantify adherence to established reasoning traces, thereby identifying exploitative reasoning trajectories. Additionally, we measure generation diversity from clustering to assess the extent of exploration. Our findings indicate that balancing both exploitation and exploration is crucial for achieving high accuracy in mathematical reasoning tasks. During generation, the SAE can serve as a scalable reward model to guide generations, ensuring a balanced trade-off between exploitation and exploration. This prevents extreme behaviors in either direction, ultimately fostering a higher-quality reasoning process in LLMs.
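A sketch of the graph construction and the edge-weight reward, assuming per-token SAE codes are already computed; the cluster count and the log1p weighting are illustrative choices, not the paper’s:

```python
import math
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

def build_transition_graph(sae_codes, n_clusters=512):
    """`sae_codes`: list of (seq_len, d_sparse) arrays, one per training trace."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(np.vstack(sae_codes))
    edges = Counter()
    for seq in sae_codes:
        labels = km.predict(seq)
        edges.update(zip(labels[:-1], labels[1:]))  # weighted directed edges
    return km, edges

def adherence_reward(km, edges, gen_codes):
    """Mean log edge weight along a generation's cluster path: high values
    mean the trace follows well-trodden (exploitative) reasoning routes."""
    labels = km.predict(gen_codes)
    steps = list(zip(labels[:-1], labels[1:]))
    return sum(math.log1p(edges[s]) for s in steps) / max(1, len(steps))
```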
[298] LOGicalThought: Logic-Based Ontological Grounding of LLMs for High-Assurance Reasoning
Navapat Nananukul, Yue Zhang, Ryan Lee, Eric Boxer, Jonathan May, Vibhav Giridhar Gogate, Jay Pujara, Mayank Kejriwal
Main category: cs.AI
TL;DR: LOGicalThought (LogT) is a neurosymbolic architecture that combines logical reasoning with LLMs to improve high-assurance reasoning in domains like law and medicine, achieving significant performance gains over baselines.
Details
Motivation: High-assurance reasoning in critical domains requires accurate, verifiable conclusions grounded in evidence, but LLMs struggle with defeasible logic and exceptions inherent in legal/medical texts.
Method: LogT uses an advanced logical language and reasoner with LLMs to create dual symbolic graph and logic-based contexts, transforming long-form guidelines into compact grounded evaluation.
Result: LogT improves overall performance by 11.84% across all LLMs, with gains of +10.2% on negation, +13.2% on implication, and +5.5% on defeasible reasoning compared to strongest baseline.
Conclusion: The neurosymbolic approach effectively addresses LLMs’ limitations in high-assurance reasoning by integrating logical reasoning capabilities, demonstrating significant improvements across multiple reasoning modes.
Abstract: High-assurance reasoning, particularly in critical domains such as law and medicine, requires conclusions that are accurate, verifiable, and explicitly grounded in evidence. This reasoning relies on premises codified from rules, statutes, and contracts, inherently involving defeasible or non-monotonic logic due to numerous exceptions, where the introduction of a single fact can invalidate general rules, posing significant challenges. While large language models (LLMs) excel at processing natural language, their capabilities in standard inference tasks do not translate to the rigorous reasoning required over high-assurance text guidelines. Core reasoning challenges within such texts often manifest specific logical structures involving negation, implication, and, most critically, defeasible rules and exceptions. In this paper, we propose a novel neurosymbolically-grounded architecture called LOGicalThought (LogT) that uses an advanced logical language and reasoner in conjunction with an LLM to construct a dual symbolic graph context and logic-based context. These two context representations transform the problem from inference over long-form guidelines into a compact grounded evaluation. Evaluated on four multi-domain benchmarks against four baselines, LogT improves overall performance by 11.84% across all LLMs. Performance improves significantly across all three modes of reasoning: by up to +10.2% on negation, +13.2% on implication, and +5.5% on defeasible reasoning compared to the strongest baseline.
[299] Information Seeking for Robust Decision Making under Partial Observability
Djengo Cyun-Jyun Fang, Tsung-Wei Ke
Main category: cs.AI
TL;DR: InfoSeeker is an LLM decision-making framework that integrates task-oriented planning with active information seeking to handle uncertainty in partially observable environments, achieving 74% performance improvement over prior methods.
Details
Motivation: Existing LLM planning agents address observational uncertainty but overlook discrepancies between their internal dynamics and actual environment, limiting their effectiveness in practical problem-solving scenarios with incomplete information.
Method: InfoSeeker prompts LLMs to actively gather information by planning actions to validate understanding, detect environmental changes, or test hypotheses before generating or revising task-oriented plans.
Result: InfoSeeker achieves 74% absolute performance gain over prior methods without sacrificing sample efficiency, and generalizes well across LLMs and established benchmarks like robotic manipulation and web navigation.
Conclusion: Tight integration of planning and information seeking is crucial for robust behavior in partially observable environments, as demonstrated by InfoSeeker’s superior performance.
Abstract: Explicit information seeking is essential to human problem-solving in practical environments characterized by incomplete information and noisy dynamics. When the true environmental state is not directly observable, humans seek information to update their internal dynamics and inform future decision-making. Although existing Large Language Model (LLM) planning agents have addressed observational uncertainty, they often overlook discrepancies between their internal dynamics and the actual environment. We introduce Information Seeking Decision Planner (InfoSeeker), an LLM decision-making framework that integrates task-oriented planning with information seeking to align internal dynamics and make optimal decisions under uncertainty in both agent observations and environmental dynamics. InfoSeeker prompts an LLM to actively gather information by planning actions to validate its understanding, detect environmental changes, or test hypotheses before generating or revising task-oriented plans. To evaluate InfoSeeker, we introduce a novel benchmark suite featuring partially observable environments with incomplete observations and uncertain dynamics. Experiments demonstrate that InfoSeeker achieves a 74% absolute performance gain over prior methods without sacrificing sample efficiency. Moreover, InfoSeeker generalizes across LLMs and outperforms baselines on established benchmarks such as robotic manipulation and web navigation. These findings underscore the importance of tightly integrating planning and information seeking for robust behavior in partially observable environments. The project page is available at https://infoseekerllm.github.io
[300] Step-Aware Policy Optimization for Reasoning in Diffusion Large Language Models
Shaoan Xie, Lingjing Kong, Xiangchen Song, Xinshuai Dong, Guangyi Chen, Eric P. Xing, Kun Zhang
Main category: cs.AI
TL;DR: SAPO is a novel RL algorithm that addresses flawed reasoning in diffusion language models by using process-based rewards to align denoising with latent reasoning hierarchies, improving performance and interpretability.
Details
Motivation: Current RL approaches for diffusion language models rely on sparse outcome-based rewards, which can reinforce flawed reasoning paths that coincidentally lead to correct answers, creating a mismatch with natural reasoning structure.
Method: Proposed Step-Aware Policy Optimization (SAPO) - an RL algorithm that uses process-based reward functions to encourage incremental progress and align the denoising process with latent reasoning hierarchies.
Result: Empirical results show significant performance improvements on challenging reasoning benchmarks and enhanced interpretability of the generation process.
Conclusion: The principled approach of SAPO successfully addresses unstructured refinement failure modes and improves complex reasoning in diffusion language models through structured, coherent reasoning paths.
Abstract: Diffusion language models (dLLMs) offer a promising, non-autoregressive paradigm for text generation, yet training them for complex reasoning remains a key challenge. Current reinforcement learning approaches often rely on sparse, outcome-based rewards, which can reinforce flawed reasoning paths that lead to coincidentally correct answers. We argue that this stems from a fundamental mismatch with the natural structure of reasoning. We first propose a theoretical framework that formalizes complex problem solving as a hierarchical selection process, where an intractable global constraint is decomposed into a series of simpler, localized logical steps. This framework provides a principled foundation for algorithm design, including theoretical insights into the identifiability of this latent reasoning structure. Motivated by this theory, we identify unstructured refinement – a failure mode where a model’s iterative steps do not contribute meaningfully to the solution – as a core deficiency in existing methods. We then introduce Step-Aware Policy Optimization (SAPO), a novel RL algorithm that aligns the dLLM’s denoising process with the latent reasoning hierarchy. By using a process-based reward function that encourages incremental progress, SAPO guides the model to learn structured, coherent reasoning paths. Our empirical results show that this principled approach significantly improves performance on challenging reasoning benchmarks and enhances the interpretability of the generation process.
[301] InvThink: Towards AI Safety via Inverse Reasoning
Yubin Kim, Taehan Kim, Eugene Park, Chunjong Park, Cynthia Breazeal, Daniel McDuff, Hae Won Park
Main category: cs.AI
TL;DR: InvThink is a safety alignment method that enables LLMs to perform inverse thinking by reasoning through potential failure modes before generating responses, achieving significant safety improvements while preserving general capabilities.
Details
Motivation: Existing safety alignment methods directly optimize for safe responses but may not systematically consider potential harms. The motivation is to develop a more robust approach that proactively identifies and avoids risks through systematic failure analysis.
Method: Instructs models to: 1) enumerate potential harms, 2) analyze their consequences, and 3) generate safe outputs that proactively avoid these risks. Implemented via supervised fine-tuning and reinforcement learning across three LLM families.
Result: Achieves up to 15.7% reduction in harmful responses compared to baseline methods like SafetyPrompt. Shows stronger safety scaling with model size, mitigates safety tax by preserving general reasoning capabilities, and excels in high-stakes domains including medicine, finance, law, and agentic risk scenarios.
Conclusion: Inverse reasoning provides a scalable and generalizable path toward safer, more capable language models by systematically considering failure modes before response generation.
Abstract: We present InvThink, a simple yet powerful approach that gives large language models (LLMs) the capability of inverse thinking: reasoning through failure modes before generating responses. Unlike existing safety alignment methods that optimize directly for safe responses, InvThink instructs models to 1) enumerate potential harms, 2) analyze their consequences, and 3) generate safe outputs that proactively avoid these risks. Our method reveals three key findings: (i) safety improvements show stronger scaling with model size compared to existing safety methods. (ii) InvThink mitigates the safety tax; by training models to systematically consider failure modes, it preserves general reasoning capabilities on standard benchmarks. (iii) beyond general safety tasks, InvThink excels in high-stakes domains including external-facing (medicine, finance, law) and agentic (blackmail, murder) risk scenarios, achieving up to 15.7% reduction in harmful responses compared to baseline methods like SafetyPrompt. We further implement InvThink via supervised fine-tuning and reinforcement learning across three LLM families. These results suggest that inverse reasoning provides a scalable and generalizable path toward safer, more capable language models.
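At inference time the three-step recipe can be pictured as one scaffolded prompt. The wording below is an illustrative reconstruction, not the paper’s template (the paper additionally distills the behavior via SFT and RL rather than relying on prompting alone):

```python
INVTHINK_PROMPT = """Answer the request below. Before responding, reason inversely:
1. ENUMERATE potential harms a response could cause.
2. ANALYZE the consequences of each harm and who would be affected.
3. RESPOND with an answer that proactively avoids every harm above.

Request: {request}"""

def invthink_answer(llm, request: str) -> str:
    # `llm.generate` is an assumed text-in/text-out interface.
    return llm.generate(INVTHINK_PROMPT.format(request=request))
```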
[302] AdvEvo-MARL: Shaping Internalized Safety through Adversarial Co-Evolution in Multi-Agent Reinforcement Learning
Zhenyu Pan, Yiting Zhang, Zhuo Liu, Yolo Yunlong Tang, Zeliang Zhang, Haozheng Luo, Yuwei Han, Jianshu Zhang, Dennis Wu, Hong-Yu Chen, Haoran Lu, Haoyang Fang, Manling Li, Chenliang Xu, Philip S. Yu, Han Liu
Main category: cs.AI
TL;DR: AdvEvo-MARL is a co-evolutionary multi-agent reinforcement learning framework that internalizes safety into task agents through adversarial training between attackers and defenders, eliminating the need for external guard modules.
Details
Motivation: Existing defenses for LLM-based multi-agent systems either use self-verification (which underperforms due to limited detection capacity) or external guard modules (which create single-point-of-failure risks and increase system overhead).
Method: Joint optimization of attackers (synthesizing evolving jailbreak prompts) and defenders (task agents trained to accomplish duties while resisting attacks) in adversarial learning environments, with a public baseline for advantage estimation using group-level mean-return baselines.
Result: AdvEvo-MARL consistently keeps attack-success rate below 20% (vs. 38.33% for baselines) while preserving or improving task accuracy (up to +3.67% on reasoning tasks).
Conclusion: Safety and utility can be jointly improved without relying on extra guard agents or added system overhead through internalized safety mechanisms.
Abstract: LLM-based multi-agent systems excel at planning, tool use, and role coordination, but their openness and interaction complexity also expose them to jailbreak, prompt-injection, and adversarial collaboration. Existing defenses fall into two lines: (i) self-verification that asks each agent to pre-filter unsafe instructions before execution, and (ii) external guard modules that police behaviors. The former often underperforms because a standalone agent lacks sufficient capacity to detect cross-agent unsafe chains and delegation-induced risks; the latter increases system overhead and creates a single point of failure: once compromised, system-wide safety collapses, and adding more guards worsens cost and complexity. To solve these challenges, we propose AdvEvo-MARL, a co-evolutionary multi-agent reinforcement learning framework that internalizes safety into task agents. Rather than relying on external guards, AdvEvo-MARL jointly optimizes attackers (which synthesize evolving jailbreak prompts) and defenders (task agents trained to both accomplish their duties and resist attacks) in adversarial learning environments. To stabilize learning and foster cooperation, we introduce a public baseline for advantage estimation: agents within the same functional group share a group-level mean-return baseline, enabling lower-variance updates and stronger intra-group coordination. Across representative attack scenarios, AdvEvo-MARL consistently keeps attack-success rate (ASR) below 20%, whereas baselines reach up to 38.33%, while preserving, and sometimes improving, task accuracy (up to +3.67% on reasoning tasks). These results show that safety and utility can be jointly improved without relying on extra guard agents or added system overhead.
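The public baseline amounts to subtracting the mean return of an agent’s functional group. A minimal sketch, assuming episodic returns keyed by agent id:

```python
from collections import defaultdict

def group_baseline_advantages(returns, group_of):
    """`returns`: {agent_id: episodic return}; `group_of`: {agent_id: group}."""
    totals, counts = defaultdict(float), defaultdict(int)
    for agent, ret in returns.items():
        totals[group_of[agent]] += ret
        counts[group_of[agent]] += 1
    mean = {g: totals[g] / counts[g] for g in totals}
    # Sharing one baseline per functional group lowers update variance and
    # credits an agent only for beating its own cohort's average.
    return {a: ret - mean[group_of[a]] for a, ret in returns.items()}
```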
[303] AgentRec: Next-Generation LLM-Powered Multi-Agent Collaborative Recommendation with Adaptive Intelligence
Bo Ma, Hang Li, ZeHua Hu, XiaoFan Gui, LuYao Liu, Simon Lau
Main category: cs.AI
TL;DR: AgentRec is an LLM-powered multi-agent framework that improves conversational recommender systems by using specialized agents for different tasks, achieving better success rates, accuracy, and efficiency than existing methods.
Details
Motivation: Existing conversational recommender systems struggle with dynamic user preferences, conversation coherence, and balancing multiple ranking objectives simultaneously.
Method: Uses hierarchical agent networks with adaptive intelligence, employing specialized LLM-powered agents for conversation understanding, preference modeling, context awareness, and dynamic ranking, coordinated through adaptive weighting mechanisms and a three-tier learning strategy.
Result: Achieved 2.8% improvement in conversation success rate, 1.9% improvement in recommendation accuracy (NDCG@10), and 3.2% better conversation efficiency while maintaining comparable computational costs.
Conclusion: AgentRec demonstrates that multi-agent collaborative frameworks with adaptive intelligence can effectively address key challenges in conversational recommender systems, providing consistent improvements over state-of-the-art approaches.
Abstract: Interactive conversational recommender systems have gained significant attention for their ability to capture user preferences through natural language interactions. However, existing approaches face substantial challenges in handling dynamic user preferences, maintaining conversation coherence, and balancing multiple ranking objectives simultaneously. This paper introduces AgentRec, a next-generation LLM-powered multi-agent collaborative recommendation framework that addresses these limitations through hierarchical agent networks with adaptive intelligence. Our approach employs specialized LLM-powered agents for conversation understanding, preference modeling, context awareness, and dynamic ranking, coordinated through an adaptive weighting mechanism that learns from interaction patterns. We propose a three-tier learning strategy combining rapid response for simple queries, intelligent reasoning for complex preferences, and deep collaboration for challenging scenarios. Extensive experiments on three real-world datasets demonstrate that AgentRec achieves consistent improvements over state-of-the-art baselines, with 2.8% enhancement in conversation success rate, 1.9% improvement in recommendation accuracy (NDCG@10), and 3.2% better conversation efficiency while maintaining comparable computational costs through intelligent agent coordination.
[304] PsychoBench: Evaluating the Psychology Intelligence of Large Language Models
Min Zeng
Main category: cs.AI
TL;DR: The paper introduces PsychoBench, a benchmark based on US counselor certification exams to evaluate if LLMs can qualify as psychological counselors by testing their psychological knowledge.
Details
Motivation: To investigate whether LLMs can be effectively applied to psychological counseling by assessing if they meet the qualification standards required for professional counselors.
Method: Created PsychoBench with ~2,252 single-choice questions from US national counselor examinations that require deep psychological understanding across various sub-disciplines.
Result: Advanced models like GPT-4o, Llama3.3-70B, and Gemma3-27B achieved well above the 70% passing threshold, while smaller models (Qwen2.5-7B, Mistral-7B) remained far below.
Conclusion: Only frontier LLMs currently meet counseling exam standards, highlighting both the promise and challenges of developing psychology-oriented LLMs.
Abstract: Large Language Models (LLMs) have demonstrated remarkable success across a wide range of industries, primarily due to their impressive generative abilities. Yet, their potential in applications requiring cognitive abilities, such as psychological counseling, remains largely untapped. This paper investigates the key question: Can LLMs be effectively applied to psychological counseling? To determine whether an LLM can effectively take on the role of a psychological counselor, the first step is to assess whether it meets the qualifications required for such a role, namely the ability to pass the U.S. National Counselor Certification Exam (NCE). This is because, just as a human counselor must pass a certification exam to practice, an LLM must demonstrate sufficient psychological knowledge to meet the standards required for such a role. To address this, we introduce PsychoBench, a benchmark grounded in U.S. national counselor examinations, a licensure test for professional counselors that requires about 70% accuracy to pass. PsychoBench comprises approximately 2,252 carefully curated single-choice questions, crafted to require deep understanding and broad enough to cover various sub-disciplines of psychology. This benchmark provides a comprehensive assessment of an LLM’s ability to function as a counselor. Our evaluation shows that advanced models such as GPT-4o, Llama3.3-70B, and Gemma3-27B achieve well above the passing threshold, while smaller open-source models (e.g., Qwen2.5-7B, Mistral-7B) remain far below it. These results suggest that only frontier LLMs are currently capable of meeting counseling exam standards, highlighting both the promise and the challenges of developing psychology-oriented LLMs.
[305] Learning to Decide with Just Enough: Information-Theoretic Context Summarization for CMDPs
Peidong Liu, Junjiang Lin, Shaowen Wang, Yao Xu, Haiqing Li, Xuhao Xie, Siyi Wu, Hao Li
Main category: cs.AI
TL;DR: The paper proposes using LLMs to compress contextual inputs into low-dimensional summaries for CMDPs, improving efficiency and performance while providing theoretical guarantees.
Details
Motivation: Existing CMDP methods struggle with high-dimensional contexts, leading to computational inefficiency and unstable performance in complex environments.
Method: Information-theoretic summarization using LLMs to compress contextual inputs into semantically rich summaries that augment states while preserving decision-critical information.
Result: Outperforms raw-context and non-context baselines across benchmarks, improving reward, success rate, sample efficiency while reducing latency and memory usage.
Conclusion: LLM-based summarization provides a scalable and interpretable solution for efficient decision-making in context-rich, resource-constrained environments.
Abstract: Contextual Markov Decision Processes (CMDPs) offer a framework for sequential decision-making under external signals, but existing methods often fail to generalize in high-dimensional or unstructured contexts, resulting in excessive computation and unstable performance. We propose an information-theoretic summarization approach that uses large language models (LLMs) to compress contextual inputs into low-dimensional, semantically rich summaries. These summaries augment states by preserving decision-critical cues while reducing redundancy. Building on the notion of approximate context sufficiency, we provide, to our knowledge, the first regret bounds and a latency-entropy trade-off characterization for CMDPs. Our analysis clarifies how informativeness impacts computational cost. Experiments across discrete, continuous, visual, and recommendation benchmarks show that our method outperforms raw-context and non-context baselines, improving reward, success rate, and sample efficiency, while reducing latency and memory usage. These findings demonstrate that LLM-based summarization offers a scalable and interpretable solution for efficient decision-making in context-rich, resource-constrained environments.
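The state-augmentation step reduces to one summarization call before the policy acts. A sketch with an invented prompt and word budget; the paper formalizes this via approximate context sufficiency rather than a fixed word count:

```python
def summarize_context(llm, context: str, max_words: int = 32) -> str:
    # Keep only decision-critical cues; drop redundancy.
    return llm.generate(
        f"In at most {max_words} words, state only the facts below that could "
        f"change what an agent should do next:\n{context}"
    )

def augmented_state(state, llm, context):
    # The policy conditions on (state, summary) instead of the raw,
    # high-dimensional context.
    return (state, summarize_context(llm, context))
```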
[306] Understanding the Geospatial Reasoning Capabilities of LLMs: A Trajectory Recovery Perspective
Thinh Hung Truong, Jey Han Lau, Jianzhong Qi
Main category: cs.AI
TL;DR: LLMs can read road network maps and perform navigation tasks by reconstructing masked GPS trajectories, showing strong geospatial reasoning capabilities without external tools.
Details
Motivation: To explore whether LLMs can understand and reason about geospatial information, specifically road networks, for navigation tasks without relying on external navigation tools.
Method: Framed trajectory recovery as a proxy task using GLOBALTRACE dataset with 4,000+ real-world trajectories across diverse regions and modes. Used road network as context in prompting framework for LLMs to generate valid paths.
Result: LLMs outperformed off-the-shelf baselines and specialized trajectory recovery models with strong zero-shot generalization. Showed strong comprehension of road networks and coordinate systems but exhibited systematic biases by region and transportation mode.
Conclusion: LLMs demonstrate promising geospatial reasoning capabilities for navigation tasks and can enhance navigation experiences by flexibly incorporating user preferences through map reasoning.
Abstract: We explore the geospatial reasoning capabilities of Large Language Models (LLMs), specifically, whether LLMs can read road network maps and perform navigation. We frame trajectory recovery as a proxy task, which requires models to reconstruct masked GPS traces, and introduce GLOBALTRACE, a dataset with over 4,000 real-world trajectories across diverse regions and transportation modes. Using the road network as context, our prompting framework enables LLMs to generate valid paths without accessing any external navigation tools. Experiments show that LLMs outperform off-the-shelf baselines and specialized trajectory recovery models, with strong zero-shot generalization. Fine-grained analysis shows that LLMs have strong comprehension of the road network and coordinate systems, but also exhibit systematic biases with respect to regions and transportation modes. Finally, we demonstrate how LLMs can enhance navigation experiences by reasoning over maps in flexible ways to incorporate user preferences.
[307] GuruAgents: Emulating Wise Investors with Prompt-Guided LLM Agents
Yejin Kim, Youngbin Lee, Juhyeong Kim, Yongjae Lee
Main category: cs.AI
TL;DR: GuruAgents are prompt-guided AI agents that operationalize legendary investment gurus’ strategies, achieving up to 42.2% CAGR in backtests.
Details
Motivation: To translate qualitative investment philosophies of iconic investors into reproducible, quantitative strategies using AI agents.
Method: Developed five distinct GuruAgents by encoding gurus’ philosophies into LLM prompts with financial tools and deterministic reasoning pipeline, backtested on NASDAQ-100 from Q4 2023 to Q2 2025.
Result: Buffett GuruAgent achieved highest performance with 42.2% CAGR, significantly outperforming benchmarks; other agents showed varied results.
Conclusion: Prompt engineering can successfully translate investment gurus’ qualitative philosophies into reproducible quantitative strategies, opening new directions for automated systematic investing.
Abstract: This study demonstrates that GuruAgents, prompt-guided AI agents, can systematically operationalize the strategies of legendary investment gurus. We develop five distinct GuruAgents, each designed to emulate an iconic investor, by encoding their distinct philosophies into LLM prompts that integrate financial tools and a deterministic reasoning pipeline. In a backtest on NASDAQ-100 constituents from Q4 2023 to Q2 2025, the GuruAgents exhibit unique behaviors driven by their prompted personas. The Buffett GuruAgent achieves the highest performance, delivering a 42.2% CAGR that significantly outperforms benchmarks, while other agents show varied results. These findings confirm that prompt engineering can successfully translate the qualitative philosophies of investment gurus into reproducible, quantitative strategies, highlighting a novel direction for automated systematic investing. The source code and data are available at https://github.com/yejining99/GuruAgents.
[308] Just Do It!? Computer-Use Agents Exhibit Blind Goal-Directedness
Erfan Shayegani, Keegan Hines, Yue Dong, Nael Abu-Ghazaleh, Roman Lutz, Spencer Whitehead, Vidhisha Balachandran, Besmira Nushi, Vibhav Vineet
Main category: cs.AI
TL;DR: Computer-Use Agents exhibit Blind Goal-Directedness - pursuing goals regardless of feasibility, safety, or context. The paper introduces BLIND-ACT benchmark to evaluate this risk and finds 80.8% BGD rates across frontier models.
Details
Motivation: To identify and characterize the systematic risk of Blind Goal-Directedness in Computer-Use Agents, where agents pursue goals without considering feasibility, safety, reliability, or context.
Method: Developed BLIND-ACT benchmark with 90 tasks capturing three BGD patterns, built on OSWorld with realistic environments and LLM-based judges achieving 93.75% human agreement. Evaluated nine frontier models including Claude Sonnet/Opus 4, Computer-Use-Preview, and GPT-5.
Result: High average BGD rates (80.8%) across all tested models. Prompting-based interventions reduce BGD but substantial risk persists. Identified failure modes: execution-first bias, thought-action disconnect, and request-primacy.
Conclusion: Blind Goal-Directedness exposes subtle risks even with non-harmful inputs. Stronger training- or inference-time interventions are needed beyond prompting. BLIND-ACT provides foundation for studying and mitigating this fundamental risk in Computer-Use Agents.
Abstract: Computer-Use Agents (CUAs) are an increasingly deployed class of agents that take actions on GUIs to accomplish user goals. In this paper, we show that CUAs consistently exhibit Blind Goal-Directedness (BGD): a bias to pursue goals regardless of feasibility, safety, reliability, or context. We characterize three prevalent patterns of BGD: (i) lack of contextual reasoning, (ii) assumptions and decisions under ambiguity, and (iii) contradictory or infeasible goals. We develop BLIND-ACT, a benchmark of 90 tasks capturing these three patterns. Built on OSWorld, BLIND-ACT provides realistic environments and employs LLM-based judges to evaluate agent behavior, achieving 93.75% agreement with human annotations. We use BLIND-ACT to evaluate nine frontier models, including Claude Sonnet and Opus 4, Computer-Use-Preview, and GPT-5, observing high average BGD rates (80.8%) across them. We show that BGD exposes subtle risks that arise even when inputs are not directly harmful. While prompting-based interventions lower BGD levels, substantial risk persists, highlighting the need for stronger training- or inference-time interventions. Qualitative analysis reveals three failure modes: execution-first bias (focusing on how to act over whether to act), thought-action disconnect (execution diverging from reasoning), and request-primacy (justifying actions due to user request). Identifying BGD and introducing BLIND-ACT establishes a foundation for future research on studying and mitigating this fundamental risk and ensuring safe CUA deployment.
[309] A Locally Executable AI System for Improving Preoperative Patient Communication: A Multi-Domain Clinical Evaluation
Motoki Sato, Yuki Matsushita, Hidekazu Takahashi, Tomoaki Kakazu, Sou Nagata, Mizuho Ohnuma, Atsushi Yoshikawa, Masayuki Yamamura
Main category: cs.AI
TL;DR: LENOHA is a safety-first clinical system that uses sentence-transformer classifiers to route queries and return verbatim answers from curated FAQs, avoiding free-text generation while achieving high accuracy comparable to GPT-4o with significantly lower energy consumption.
Details
Motivation: Address the challenge of providing personalized pre-procedural counseling to patients while respecting time-pressured workflows and privacy constraints in clinical settings.
Method: Uses a high-precision sentence-transformer classifier to route inputs and returns verbatim answers from clinician-curated FAQs, eliminating free-text generation in the clinical path. Evaluated on two clinical domains (tooth extraction and gastroscopy) with expert-reviewed validation and test sets.
Result: E5-large-instruct achieved overall accuracy of 0.983 (95% CI 0.964-0.991), AUC 0.996 with only 7 total errors, statistically indistinguishable from GPT-4o. Non-generative clinical path consumes ~1.0 mWh per input vs ~168 mWh for small-talk replies (~170x difference) while maintaining ~0.10 s latency.
Conclusion: The approach structurally avoids generation-induced errors by returning vetted FAQ answers verbatim, supporting privacy, sustainability, and equitable deployment in bandwidth-limited clinical environments.
Abstract: Patients awaiting invasive procedures often have unanswered pre-procedural questions; however, time-pressured workflows and privacy constraints limit personalized counseling. We present LENOHA (Low Energy, No Hallucination, Leave No One Behind Architecture), a safety-first, local-first system that routes inputs with a high-precision sentence-transformer classifier and returns verbatim answers from a clinician-curated FAQ for clinical queries, eliminating free-text generation in the clinical path. We evaluated two domains (tooth extraction and gastroscopy) using expert-reviewed validation sets (n=400/domain) for thresholding and independent test sets (n=200/domain). Among the four encoders, E5-large-instruct (560M) achieved an overall accuracy of 0.983 (95% CI 0.964-0.991), AUC 0.996, and seven total errors, which were statistically indistinguishable from GPT-4o on this task; Gemini made no errors on this test set. Energy logging shows that the non-generative clinical path consumes ~1.0 mWh per input versus ~168 mWh per small-talk reply from a local 8B SLM, a ~170x difference, while maintaining ~0.10 s latency on a single on-prem GPU. These results indicate that near-frontier discrimination and generation-induced errors are structurally avoided in the clinical path by returning vetted FAQ answers verbatim, supporting privacy, sustainability, and equitable deployment in bandwidth-limited environments.
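The clinical path reduces to embed, match, threshold, and return verbatim. A sketch using sentence-transformers; the model name, threshold, and fallback behavior are illustrative stand-ins for the paper’s tuned configuration:

```python
from sentence_transformers import SentenceTransformer, util

class LenohaRouter:
    def __init__(self, faq_pairs, threshold=0.85,
                 model_name="intfloat/multilingual-e5-large-instruct"):
        self.model = SentenceTransformer(model_name)
        self.questions, self.answers = zip(*faq_pairs)  # clinician-curated
        self.q_emb = self.model.encode(list(self.questions),
                                       normalize_embeddings=True)
        self.threshold = threshold

    def reply(self, text: str):
        emb = self.model.encode([text], normalize_embeddings=True)
        scores = util.cos_sim(emb, self.q_emb)[0]
        best = int(scores.argmax())
        if float(scores[best]) >= self.threshold:
            return self.answers[best]  # verbatim vetted answer: no generation
        return None  # route to the small-talk path (or escalate to staff)
```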
[310] Improving AGI Evaluation: A Data Science Perspective
John Hawkins
Main category: cs.AI
TL;DR: The paper critiques current AGI evaluation methods based on synthetic tasks and proposes an alternative approach focused on robust task execution and deployment competence.
Details
Motivation: Current AGI evaluation methods using synthetic tasks based on human intuitions about intelligence have performed poorly historically. There's a need for better evaluation approaches.
Method: Proposes an alternative design philosophy for AGI evaluation that focuses on robust task execution and demonstrating competence, drawing from data science practices for reliable system deployment.
Result: Provides practical examples of what this alternative AGI evaluation approach would entail, shifting from synthetic tasks to real-world deployment competence.
Conclusion: AGI evaluation should move away from synthetic tasks based on human intuitions and instead focus on robust task execution and deployment competence, following data science practices for reliable system evaluation.
Abstract: Evaluation of potential AGI systems and methods is difficult due to the breadth of the engineering goal. We have no methods for perfect evaluation of the end state, and instead measure performance on small tests designed to provide directional indication that we are approaching AGI. In this work we argue that AGI evaluation methods have been dominated by a design philosophy that uses our intuitions of what intelligence is to create synthetic tasks, an approach that has performed poorly in the history of AI. Instead we argue for an alternative design philosophy focused on evaluating robust task execution that seeks to demonstrate AGI through competence. This perspective is developed from common practices in data science that are used to show that a system can be reliably deployed. We provide practical examples of what this would mean for AGI evaluation.
[311] VaPR – Vision-language Preference alignment for Reasoning
Rohan Wadhawan, Fabrice Y Harel-Canada, Zi-Yi Dou, Suhaila Shakiah, Robinson Piramuthu, Nanyun Peng
Main category: cs.AI
TL;DR: VaPR introduces a hard-negative response generation framework using LLM-guided editing to address noise in synthetic preference annotations for LVLM alignment, achieving significant performance improvements across multiple benchmarks.
Details
Motivation: Existing preference finetuning methods overlook noise in synthetic preference annotations, particularly stylistic and length biases that affect LVLM alignment quality.
Method: Developed a hard-negative response generation framework using LLM-guided response editing to create rejected responses with targeted errors while maintaining stylistic and length similarity to accepted responses. Created VaPR dataset with 30K samples.
Result: VaPR models achieved average gains of 6.5% (LLaVA), 4.0% (Qwen2VL), and 1.5% (Qwen2.5VL) across ten benchmarks, with notable improvements on reasoning tasks. Reduced tendency to answer “Yes” in binary questions and showed consistent performance scaling with data size.
Conclusion: The framework effectively addresses noise in synthetic preference data and generalizes to open-source LLMs as editors, with VaPR-OS achieving ~99% of the performance of GPT-4o-synthesized data.
Abstract: Preference finetuning methods like Direct Preference Optimization (DPO) with AI-generated feedback have shown promise in aligning Large Vision-Language Models (LVLMs) with human preferences. However, existing techniques overlook the prevalence of noise in synthetic preference annotations in the form of stylistic and length biases. To this end, we introduce a hard-negative response generation framework based on LLM-guided response editing, that produces rejected responses with targeted errors, maintaining stylistic and length similarity to the accepted ones. Using this framework, we develop the VaPR dataset, comprising 30K high-quality samples, to finetune three LVLM families: LLaVA-V1.5, Qwen2VL & Qwen2.5VL (2B-13B sizes). Our VaPR models deliver significant performance improvements across ten benchmarks, achieving average gains of 6.5% (LLaVA), 4.0% (Qwen2VL), and 1.5% (Qwen2.5VL), with notable improvements on reasoning tasks. A scaling analysis shows that performance consistently improves with data size, with LLaVA models benefiting even at smaller scales. Moreover, VaPR reduces the tendency to answer “Yes” in binary questions - addressing a common failure mode in LVLMs like LLaVA. Lastly, we show that the framework generalizes to open-source LLMs as editors, with models trained on VaPR-OS achieving ~99% of the performance of models trained on VaPR, which is synthesized using GPT-4o. Our data, models, and code can be found on the project page https://vap-r.github.io
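To make the hard-negative construction concrete, here is a hedged sketch of how an editor LLM could turn an accepted answer into a style- and length-matched rejected answer, yielding a DPO preference pair; `call_editor`, the prompt wording, and the error label are hypothetical stand-ins, not the paper's actual pipeline.

```python
# Sketch: inject one targeted error while preserving style and length,
# producing a (chosen, rejected) pair for preference finetuning.
EDIT_PROMPT = (
    "Rewrite the answer so it contains exactly one {error_type} error. "
    "Keep the wording style and length as close to the original as possible.\n"
    "Question: {question}\nAnswer: {answer}\nEdited answer:"
)

def call_editor(prompt: str) -> str:
    raise NotImplementedError  # placeholder for GPT-4o or an open-source editor

def make_preference_pair(question: str, accepted: str,
                         error_type: str = "object attribute"):
    rejected = call_editor(EDIT_PROMPT.format(
        error_type=error_type, question=question, answer=accepted))
    return {"prompt": question, "chosen": accepted, "rejected": rejected}
```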
[312] MetaboT: AI-based agent for natural language-based interaction with metabolomics knowledge graphs
Madina Bekbergenova, Lucas Pradi, Benjamin Navet, Emma Tysinger, Franck Michel, Matthieu Feraud, Yousouf Taghzouti, Yan Zhou Chen, Olivier Kirchhoffer, Florence Mehl, Martin Legrand, Tao Jiang, Marco Pagni, Soha Hassoun, Jean-Luc Wolfender, Wout Bittremieux, Fabien Gandon, Louis-Félix Nothias
Main category: cs.AI
TL;DR: MetaboT is an AI system using large language models to translate natural language questions into SPARQL queries for metabolomics knowledge graphs, achieving 83.67% accuracy compared to 8.16% for standard LLMs.
Details
Motivation: Knowledge graphs help structure mass spectrometry metabolomics data but require deep understanding of ontology and query syntax, creating barriers for researchers.
Method: Multi-agent system using LLMs with specialized agents for validation, chemical conversion, and SPARQL query generation, built with LangChain and LangGraph libraries.
Result: Achieved 83.67% accuracy on curated metabolomics questions, significantly outperforming baseline LLM (8.16%), demonstrating effective automated query generation.
Conclusion: MetaboT successfully bridges the gap between complex semantic technologies and user-friendly interaction, enabling researchers to access structured metabolomics data through natural language queries.
Abstract: Mass spectrometry metabolomics generates vast amounts of data requiring advanced methods for interpretation. Knowledge graphs address these challenges by structuring mass spectrometry data, metabolite information, and their relationships into a connected network (Gaudry et al. 2024). However, effective use of a knowledge graph demands an in-depth understanding of its ontology and its query language syntax. To overcome this, we designed MetaboT, an AI system utilizing large language models (LLMs) to translate user questions into SPARQL semantic query language for operating on knowledge graphs (Steve Harris 2013). We demonstrate its effectiveness using the Experimental Natural Products Knowledge Graph (ENPKG), a large-scale public knowledge graph for plant natural products (Gaudry et al. 2024). MetaboT employs specialized AI agents for handling user queries and interacting with the knowledge graph by breaking down complex tasks into discrete components, each managed by a specialised agent (Fig. 1a). The multi-agent system is constructed using the LangChain and LangGraph libraries, which facilitate the integration of LLMs with external tools and information sources (LangChain, n.d.). The query generation process follows a structured workflow. First, the Entry Agent determines if the question is new or a follow-up to previous interactions. New questions are forwarded to the Validator Agent, which verifies if the question is related to the knowledge graph. Then, the valid question is sent to the Supervisor Agent, which identifies if the question requires chemical conversions or standardized identifiers. In this case it delegates the question to the Knowledge Graph Agent, which can use tools to extract necessary details, such as URIs or taxonomies of chemical names, from the user query. Finally, an agent responsible for crafting the SPARQL queries equipped with the ontology of the knowledge graph uses the provided identifiers to generate the query. Then, the system executes the generated query against the metabolomics knowledge graph and returns structured results to the user (Fig. 1b). To assess the performance of MetaboT we have curated 50 metabolomics-related questions and their expected answers. In addition to submitting these questions to MetaboT, we evaluated a baseline by submitting them to a standard LLM (GPT-4o) with a prompt that incorporated the knowledge graph ontology but did not provide specific entity IDs. This baseline achieved only 8.16% accuracy, compared to MetaboT’s 83.67%, underscoring the necessity of our multi-agent system for accurately retrieving entities and generating correct SPARQL queries. MetaboT demonstrates promising performance as a conversational question-answering assistant, enabling researchers to retrieve structured metabolomics data through natural language queries. By automating the generation and execution of SPARQL queries, it removes technical barriers that have traditionally hindered access to knowledge graphs. Importantly, MetaboT leverages the capabilities of LLMs while maintaining experimentally grounded query generation, ensuring that outputs remain aligned with domain-specific standards and data structures. This approach facilitates data-driven discoveries by bridging the gap between complex semantic technologies and user-friendly interaction. MetaboT is accessible at https://metabot.holobiomicslab.eu/, and its source code is available at https://github.com/HolobiomicsLab/MetaboT.
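The staged workflow compresses naturally into a short pipeline. The plain-Python sketch below only mirrors the control flow described above (the actual system wires these roles together with LangChain/LangGraph); `ask_llm` and `run_sparql` are hypothetical stand-ins and the prompts are paraphrases.

```python
# Sketch of MetaboT's control flow: validate, resolve identifiers, generate
# SPARQL, execute against the knowledge graph.
def ask_llm(instruction: str, text: str) -> str:
    raise NotImplementedError  # any LLM call

def run_sparql(query: str):
    raise NotImplementedError  # execute against the ENPKG endpoint

def metabot(question: str):
    verdict = ask_llm("Answer yes/no: is this question about the metabolomics "
                      "knowledge graph?", question)
    if verdict.strip().lower() != "yes":
        return "I can only answer questions about the metabolomics knowledge graph."
    ids = ask_llm("Extract chemical names and return standardized identifiers/URIs.",
                  question)
    sparql = ask_llm("Using the ENPKG ontology and these identifiers, write a "
                     "SPARQL query: " + ids, question)
    return run_sparql(sparql)
```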
[313] A cybersecurity AI agent selection and decision support framework
Masike Malatji
Main category: cs.AI
TL;DR: A structured decision support framework that aligns AI agent architectures with NIST Cybersecurity Framework 2.0 for systematic cybersecurity solution deployment.
Details
Motivation: To bridge the gap between theoretical AI constructs and operational cybersecurity demands by providing a transparent methodology for selecting AI solutions that address contemporary cyber threats.
Method: Granular decomposition of NIST CSF 2.0 functions into specific tasks, linking AI agent properties (autonomy, adaptive learning, real-time responsiveness) to security requirements, and defining graduated autonomy levels (assisted, augmented, fully autonomous).
Result: Conceptual validation shows the framework enables tailored AI agent deployments that enhance situational awareness, accelerate response times, and fortify long-term resilience through adaptive risk management.
Conclusion: The research establishes a foundation for robust, empirically validated multi-agent systems that adhere to industry standards, providing unified detection, incident response, and governance strategy.
Abstract: This paper presents a novel, structured decision support framework that systematically aligns diverse artificial intelligence (AI) agent architectures (reactive, cognitive, hybrid, and learning) with the comprehensive National Institute of Standards and Technology (NIST) Cybersecurity Framework (CSF) 2.0. By integrating agent theory with industry guidelines, this framework provides a transparent and stepwise methodology for selecting and deploying AI solutions to address contemporary cyber threats. Employing a granular decomposition of NIST CSF 2.0 functions into specific tasks, the study links essential AI agent properties such as autonomy, adaptive learning, and real-time responsiveness to each subcategory’s security requirements. In addition, it outlines graduated levels of autonomy (assisted, augmented, and fully autonomous) to accommodate organisations at varying stages of cybersecurity maturity. This holistic approach transcends isolated AI applications, providing a unified detection, incident response, and governance strategy. Through conceptual validation, the framework demonstrates how tailored AI agent deployments can align with real-world constraints and risk profiles, enhancing situational awareness, accelerating response times, and fortifying long-term resilience via adaptive risk management. Ultimately, this research bridges the gap between theoretical AI constructs and operational cybersecurity demands, establishing a foundation for robust, empirically validated multi-agent systems that adhere to industry standards.
[314] REBot: From RAG to CatRAG with Semantic Enrichment and Graph Routing
Thanh Ma, Tri-Tam La, Lam-Thu Le Huu, Minh-Nghi Nguyen, Khanh-Van Pham Luu, Huu-Hoa Nguyen
Main category: cs.AI
TL;DR: REBot is an LLM-enhanced advisory chatbot for academic regulation advising that uses CatRAG framework combining retrieval-augmented generation with graph-based reasoning to achieve state-of-the-art performance.
Details
Motivation: Building effective academic regulation advising systems requires domain-specific regulatory resources, which is challenging to develop.
Method: Proposed CatRAG framework that integrates retrieval-augmented generation with graph-based reasoning, using a hierarchical knowledge graph with semantic features and a lightweight intent classifier to route queries.
Result: Achieved state-of-the-art performance with 98.89% F1 score on regulation-specific classification and question answering tasks.
Conclusion: Implemented a web application demonstrating REBot’s practical value in real-world academic advising scenarios.
Abstract: Academic regulation advising is essential for helping students interpret and comply with institutional policies, yet building effective systems requires domain-specific regulatory resources. To address this challenge, we propose REBot, an LLM-enhanced advisory chatbot powered by CatRAG, a hybrid retrieval-reasoning framework that integrates retrieval-augmented generation with graph-based reasoning. CatRAG unifies dense retrieval and graph reasoning, supported by a hierarchical, category-labeled knowledge graph enriched with semantic features for domain alignment. A lightweight intent classifier routes queries to the appropriate retrieval modules, ensuring both factual accuracy and contextual depth. We construct a regulation-specific dataset and evaluate REBot on classification and question answering tasks, achieving state-of-the-art performance with an F1 score of 98.89%. Finally, we implement a web application that demonstrates the practical value of REBot in real-world academic advising scenarios.
[315] Human-AI Teaming Co-Learning in Military Operations
Clara Maathuis, Kasper Cools
Main category: cs.AI
TL;DR: This paper proposes a trustworthy co-learning model for human-AI teaming in military operations, featuring four key dimensions: adjustable autonomy, multi-layered control, bidirectional feedback, and collaborative decision-making.
Details
Motivation: The integration of AI into military operations offers significant advantages but poses challenges regarding effective and ethical deployment. Current approaches often treat human-AI teams as collective agents, missing internal dynamics that affect responsibility, safety, and robustness.
Method: Design of a co-learning model with four integrated dimensions: adjustable autonomy (dynamic calibration of autonomy levels), multi-layered control (continuous oversight and accountability), bidirectional feedback (explicit and implicit feedback loops), and collaborative decision-making (joint generation and evaluation of decisions with confidence levels).
Result: The proposed model includes concrete exemplifications and recommendations for developing responsible and trustworthy human-AI teaming systems in military contexts.
Conclusion: The research contributes to addressing multidimensional responsibility, safety, and robustness aspects in human-AI military teams through a comprehensive co-learning framework that enables continuous bidirectional adaptation to evolving battlefield conditions.
Abstract: In a time of rapidly evolving military threats and increasingly complex operational environments, the integration of AI into military operations offers significant advantages. At the same time, this implies various challenges and risks regarding building and deploying human-AI teaming systems in an effective and ethical manner. Currently, understanding and coping with them are often tackled from an external perspective considering the human-AI teaming system as a collective agent. Nevertheless, zooming into the dynamics involved inside the system allows dealing with a broader palette of relevant multidimensional responsibility, safety, and robustness aspects. To this end, this research proposes the design of a trustworthy co-learning model for human-AI teaming in military operations that encompasses a continuous and bidirectional exchange of insights between the human and AI agents as they jointly adapt to evolving battlefield conditions. It does that by integrating four dimensions. First, adjustable autonomy for dynamically calibrating the autonomy levels of agents depending on aspects like mission state, system confidence, and environmental uncertainty. Second, multi-layered control which accounts for continuous oversight, monitoring of activities, and accountability. Third, bidirectional feedback with explicit and implicit feedback loops between the agents to assure proper communication of reasoning, uncertainties, and learned adaptations that each of the agents has. And fourth, collaborative decision-making which implies the generation, evaluation, and proposal of decisions associated with confidence levels and the rationale behind them. The model proposed is accompanied by concrete exemplifications and recommendations that contribute to further developing responsible and trustworthy human-AI teaming systems in military operations.
[316] Plan Then Action: High-Level Planning Guidance Reinforcement Learning for LLM Reasoning
Zhihao Dou, Qinjian Zhao, Zhongwei Wan, Dinggen Zhang, Weida Wang, Towsif Raiyan, Benteng Chen, Qingtao Pan, Yang Ouyang, Zhiqiang Gao, Shufei Zhang, Sumon Biswas
Main category: cs.AI
TL;DR: PTA-GRPO is a two-stage framework that enhances LLM reasoning by first distilling CoT into high-level guidance via SFT, then using guidance-aware RL to jointly optimize final outputs and guidance quality.
Details
Motivation: Current LLM reasoning with CoT is constrained by local token-level generation, lacking global planning, leading to redundant and inaccurate reasoning. Existing methods like tree-based algorithms and RL are computationally expensive and often fail to produce optimal reasoning trajectories.
Method: Two-stage framework: 1) Use advanced LLMs to distill CoT into compact high-level guidance for supervised fine-tuning; 2) Introduce guidance-aware RL that jointly optimizes final output and high-level guidance quality.
Result: Extensive experiments on MATH, AIME2024, AIME2025, and AMC benchmarks with Qwen2.5-7B-Instruct, Qwen3-8B, Qwen3-14B, and LLaMA3.2-3B models show consistent and significant improvements across different models and tasks.
Conclusion: PTA-GRPO effectively enhances reasoning by combining high-level planning with fine-grained CoT reasoning, demonstrating stable performance improvements and good generalization across various mathematical reasoning tasks.
Abstract: Large language models (LLMs) have demonstrated remarkable reasoning abilities in complex tasks, often relying on Chain-of-Thought (CoT) reasoning. However, due to their autoregressive token-level generation, the reasoning process is largely constrained to local decision-making and lacks global planning. This limitation frequently results in redundant, incoherent, or inaccurate reasoning, which significantly degrades overall performance. Existing approaches, such as tree-based algorithms and reinforcement learning (RL), attempt to address this issue but suffer from high computational costs and often fail to produce optimal reasoning trajectories. To tackle this challenge, we propose Plan-Then-Action Enhanced Reasoning with Group Relative Policy Optimization (PTA-GRPO), a two-stage framework designed to improve both high-level planning and fine-grained CoT reasoning. In the first stage, we leverage advanced LLMs to distill CoT into compact high-level guidance, which is then used for supervised fine-tuning (SFT). In the second stage, we introduce a guidance-aware RL method that jointly optimizes the final output and the quality of high-level guidance, thereby enhancing reasoning effectiveness. We conduct extensive experiments on multiple mathematical reasoning benchmarks, including MATH, AIME2024, AIME2025, and AMC, across diverse base models such as Qwen2.5-7B-Instruct, Qwen3-8B, Qwen3-14B, and LLaMA3.2-3B. Experimental results demonstrate that PTA-GRPO consistently achieves stable and significant improvements across different models and tasks, validating its effectiveness and generalization.
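A rough sketch of what the stage-two, guidance-aware reward could look like, assuming rollouts emit a delimited plan before the solution and an LLM judge scores the plan; the tag format, 0.5 weight, and answer extraction are illustrative assumptions, not details from the paper.

```python
# Sketch: jointly reward final-answer correctness and plan quality.
def guidance_aware_reward(rollout: str, gold_answer: str, judge) -> float:
    plan, _, solution = rollout.partition("</plan>")  # assumes <plan>...</plan> format
    answer_reward = 1.0 if extract_answer(solution) == gold_answer else 0.0
    plan_reward = judge(plan)                 # e.g., a 0..1 score from an LLM judge
    return answer_reward + 0.5 * plan_reward  # weight is an illustrative choice

def extract_answer(solution: str) -> str:
    lines = solution.strip().splitlines()
    return lines[-1] if lines else ""         # toy final-line extraction
```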
[317] Learning a Dense Reasoning Reward Model from Expert Demonstration via Inverse Reinforcement Learning
Claudio Fanconi, Nicolás Astorga, Mihaela van der Schaar
Main category: cs.AI
TL;DR: Adversarial IRL is applied to LLM reasoning to learn token-level rewards from expert demonstrations, enabling step-level training feedback and inference-time reranking to improve reasoning correctness.
Details
Motivation: To move beyond style imitation via supervised fine-tuning and instead learn reasoning rewards that prioritize correctness over surface form, enabling better multi-step reasoning in language models.
Method: Reframe adversarial inverse reinforcement learning for LLM reasoning, learning dense token-level reward models from expert demonstrations that provide step-level feedback during training and serve as critics for reranking sampled traces during inference.
Result: The approach yields scores correlating with answer validity and enables error localization. On GSM8K with Llama3 and Qwen2.5, it demonstrates that dense reasoning rewards can elicit reasoning and improve performance through reward-guided reranking, especially for Llama-based policies.
Conclusion: Unifying training signals, inference-time selection, and token-level diagnostics into a single reasoning reward suggests reusable process-level rewards with broad potential to enhance multi-step reasoning in language models.
Abstract: We reframe and operationalise adversarial inverse reinforcement learning (IRL) for large language model reasoning, learning a dense, token-level reward model for process supervision directly from expert demonstrations rather than imitating style via supervised fine-tuning. The learned reasoning reward serves two complementary roles: (i) it provides step-level feedback to optimise a reasoning policy during training; and (ii) it functions at inference as a critic to rerank sampled traces under fixed compute budgets. We demonstrate that our approach prioritises correctness over surface form, yielding scores that correlate with eventual answer validity and enabling interpretable localisation of errors within a trace. Empirically, on GSM8K with Llama3 and Qwen2.5 backbones, we demonstrate: (i) dense reasoning rewards can be used as a learning signal to elicit reasoning, and (ii) predictive performance is improved from reward-guided reranking (notably for Llama-based policies). By unifying training signals, inference-time selection, and token-level diagnostics into a single reasoning reward, this work suggests reusable process-level rewards with broad potential to enhance multi-step reasoning in language models.
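The inference-time role of the learned reward is essentially reranking. A minimal sketch, with `token_rewards` standing in for the IRL-trained reward model and length normalization as one plausible aggregation choice:

```python
# Sketch: score each sampled trace with dense token-level rewards and keep
# the best one under a fixed sampling budget.
from typing import Callable, List

def rerank(traces: List[List[str]],
           token_rewards: Callable[[List[str]], List[float]]) -> List[str]:
    def score(trace: List[str]) -> float:
        rewards = token_rewards(trace)              # one scalar per token
        return sum(rewards) / max(len(rewards), 1)  # length-normalized mean
    return max(traces, key=score)
```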
[318] Constrained Adaptive Rejection Sampling
Paweł Parys, Sairam Vaidya, Taylor Berg-Kirkpatrick, Loris D’Antoni
Main category: cs.AI
TL;DR: CARS is a constrained generation method that improves rejection sampling efficiency by adaptively pruning invalid continuations while preserving the LM’s distribution.
Details
Motivation: Existing constrained generation methods either distort the LM's distribution (greedy decoding) or waste computation (rejection sampling), both problematic for domains requiring both validity and diversity like program fuzzing.
Method: CARS uses unconstrained LM sampling and adaptively rules out constraint-violating continuations by recording them in a trie and subtracting their probability mass from future draws, ensuring invalid prefixes are never revisited.
Result: CARS consistently achieves higher efficiency (fewer LM forward passes per valid sample) and produces stronger sample diversity than both greedy constrained decoding and distribution-approximating methods across various domains.
Conclusion: CARS strictly improves rejection sampling efficiency without distributional distortion, making it suitable for domains requiring both validity and diversity in constrained generation.
Abstract: Language Models (LMs) are increasingly used in applications where generated outputs must satisfy strict semantic or syntactic constraints. Existing approaches to constrained generation fall along a spectrum: greedy constrained decoding (GCD) methods enforce validity during decoding but distort the LM’s distribution, while rejection sampling (RS) preserves fidelity but wastes computation by discarding invalid outputs. Both extremes are problematic in domains such as program fuzzing, where both validity and diversity of samples are essential. We present Constrained Adaptive Rejection Sampling (CARS), an approach that strictly improves the sample-efficiency of RS without distributional distortion. CARS begins with unconstrained LM sampling and adaptively rules out constraint-violating continuations by recording them in a trie and subtracting their probability mass from future draws. This adaptive pruning ensures that prefixes proven invalid are never revisited, acceptance rates improve monotonically, and the resulting samples exactly follow the constrained distribution. In experiments on a variety of domains – e.g., program fuzzing and molecular generation – CARS consistently achieves higher efficiency – measured in the number of LM forward passes per valid sample – while also producing stronger sample diversity than both GCD and methods that approximate the LM’s distribution.
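A simplified sketch of the CARS loop: sample unconstrained, and when a rollout violates the constraint, record its shortest invalid prefix in a trie so future draws exclude it and renormalize over what remains. `next_token_probs` and `violates` are toy stand-ins, and the paper's exact probability-mass bookkeeping (and the resulting exactness guarantee) is not reproduced here.

```python
import random

class Node:
    """Trie node over token prefixes; `dead` marks prefixes proven invalid."""
    def __init__(self):
        self.children, self.dead = {}, False

def cars_sample(next_token_probs, violates, root, eos="<eos>", max_len=32):
    while True:                                   # retry until a sample is accepted
        node, prefix = root, []
        for _ in range(max_len):
            probs = {t: p for t, p in next_token_probs(prefix).items()
                     if not (t in node.children and node.children[t].dead)}
            if not probs:                         # every continuation pruned here
                node.dead = True
                break
            tokens, weights = zip(*probs.items())
            t = random.choices(tokens, weights=weights)[0]  # renormalized draw
            prefix.append(t)
            node = node.children.setdefault(t, Node())
            if violates(prefix):                  # never revisit this prefix
                node.dead = True
                break
            if t == eos:
                return prefix
```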
[319] Zero-shot reasoning for simulating scholarly peer-review
Khalid M. Saqr
Main category: cs.AI
TL;DR: A deterministic simulation framework provides stable, evidence-based standards for evaluating AI-generated peer review reports, addressing the crisis of unmanageable submission volumes and unregulated AI in scholarly publishing.
Details
Motivation: The scholarly publishing ecosystem faces a dual crisis of unmanageable submission volumes and unregulated AI, creating an urgent need for new governance models to safeguard scientific integrity. Traditional human-only peer review lacks scalable, objective benchmarks.
Method: Deterministic simulation framework that analyzes peer-review simulation reports to identify consistent system state indicators. The study analyzed 352 peer-review simulation reports.
Result: The system demonstrates calibrated editorial judgment with ‘Revise’ decisions forming majority outcomes (>50%) across disciplines, while ‘Reject’ rates adapt to field-specific norms (up to 45% in Health Sciences). It maintains a 29% evidence-anchoring compliance rate that is invariant across domains.
Conclusion: The framework provides a transparent tool for ensuring fairness in scientific publishing and offers scalable instruments for auditing workflows, managing integrity risks, and implementing evidence-based governance, repositioning AI as essential for institutional accountability.
Abstract: The scholarly publishing ecosystem faces a dual crisis of unmanageable submission volumes and unregulated AI, creating an urgent need for new governance models to safeguard scientific integrity. The traditional human-only peer review regime lacks a scalable, objective benchmark, making editorial processes opaque and difficult to audit. Here we investigate a deterministic simulation framework that provides the first stable, evidence-based standard for evaluating AI-generated peer review reports. Analyzing 352 peer-review simulation reports, we identify consistent system state indicators that demonstrate its reliability. First, the system is able to simulate calibrated editorial judgment, with ‘Revise’ decisions consistently forming the majority outcome (>50%) across all disciplines, while ‘Reject’ rates dynamically adapt to field-specific norms, rising to 45% in Health Sciences. Second, it maintains unwavering procedural integrity, enforcing a stable 29% evidence-anchoring compliance rate that remains invariant across diverse review tasks and scientific domains. These findings demonstrate a system that is predictably rule-bound, mitigating the stochasticity of generative AI. For the scientific community, this provides a transparent tool to ensure fairness; for publishing strategists, it offers a scalable instrument for auditing workflows, managing integrity risks, and implementing evidence-based governance. The framework repositions AI as an essential component of institutional accountability, providing the critical infrastructure to maintain trust in scholarly communication.
[320] ReTabAD: A Benchmark for Restoring Semantic Context in Tabular Anomaly Detection
Sanghyu Yoon, Dongmin Kim, Suhee Yoon, Ye Seul Sim, Seungdong Yoa, Hye-Seung Cho, Soonyoung Lee, Hankook Lee, Woohyung Lim
Main category: cs.AI
TL;DR: ReTabAD introduces a benchmark for tabular anomaly detection that incorporates textual semantics and domain knowledge, which are missing in existing benchmarks, to enable context-aware research.
Details
Motivation: Existing tabular AD benchmarks lack textual semantics and domain context that experts use in practice, limiting research flexibility and model performance.
Method: Created 20 curated datasets with structured textual metadata and implemented state-of-the-art AD algorithms including classical, deep learning, and LLM-based approaches. Also developed a zero-shot LLM framework that leverages semantic context without task-specific training.
Result: Semantic context improves detection performance and enhances interpretability by supporting domain-aware reasoning.
Conclusion: ReTabAD establishes a benchmark for systematic exploration of context-aware anomaly detection, demonstrating the importance of textual metadata in improving AD performance and interpretability.
Abstract: In tabular anomaly detection (AD), textual semantics often carry critical signals, as the definition of an anomaly is closely tied to domain-specific context. However, existing benchmarks provide only raw data points without semantic context, overlooking rich textual metadata such as feature descriptions and domain knowledge that experts rely on in practice. This limitation restricts research flexibility and prevents models from fully leveraging domain knowledge for detection. ReTabAD addresses this gap by restoring textual semantics to enable context-aware tabular AD research. We provide (1) 20 carefully curated tabular datasets enriched with structured textual metadata, together with implementations of state-of-the-art AD algorithms including classical, deep learning, and LLM-based approaches, and (2) a zero-shot LLM framework that leverages semantic context without task-specific training, establishing a strong baseline for future research. Furthermore, this work provides insights into the role and utility of textual metadata in AD through experiments and analysis. Results show that semantic context improves detection performance and enhances interpretability by supporting domain-aware reasoning. These findings establish ReTabAD as a benchmark for systematic exploration of context-aware AD.
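The zero-shot framework presumably reduces to serializing a record together with its restored metadata and asking an LLM for a judgment. A minimal sketch of such prompt construction, with invented field names and wording:

```python
# Sketch: turn a table row plus feature descriptions and domain notes into an
# anomaly-detection prompt for a zero-shot LLM.
def build_prompt(row: dict, feature_docs: dict, domain_note: str) -> str:
    lines = [f"- {k} = {v}  ({feature_docs.get(k, 'no description')})"
             for k, v in row.items()]
    return (f"Domain context: {domain_note}\n"
            "Record:\n" + "\n".join(lines) +
            "\nIs this record anomalous? Answer 'normal' or 'anomaly' with a reason.")

print(build_prompt(
    {"pressure_psi": 912, "valve_state": "open"},
    {"pressure_psi": "line pressure; typical operating range 40-120 psi"},
    "sensor readings from a water distribution network"))
```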
[321] Demystifying the Roles of LLM Layers in Retrieval, Knowledge, and Reasoning
Xinyuan Song, Keyu Wang, PengXiang Li, Lu Yin, Shiwei Liu
Main category: cs.AI
TL;DR: Deep layers in LLMs show heterogeneous importance - shallow layers handle knowledge/retrieval while deeper layers enable reasoning and coherence, with their utility varying significantly by evaluation metric and task type.
Details
Motivation: To systematically investigate the actual contributions of different layers in LLMs, challenging claims that deeper layers are largely redundant and can be removed without performance loss.
Method: Conducted systematic analysis across diverse dimensions including evaluation protocols (likelihood vs generation), task categories, and model architectures to study depth utilization patterns.
Result: Found that under likelihood metrics, only initial layers are critical, but generation-based evaluation reveals middle/deeper layers are indispensable for reasoning and long-range coherence. Knowledge is concentrated in shallow layers while reasoning depends on deeper layers.
Conclusion: Depth usage in LLMs is highly heterogeneous and context-dependent, requiring task-, metric-, and model-aware perspectives for both interpretation and compression of large models.
Abstract: Recent studies suggest that the deeper layers of Large Language Models (LLMs) contribute little to representation learning and can often be removed without significant performance loss. However, such claims are typically drawn from narrow evaluations and may overlook important aspects of model behavior. In this work, we present a systematic study of depth utilization across diverse dimensions, including evaluation protocols, task categories, and model architectures. Our analysis confirms that very deep layers are generally less effective than earlier ones, but their contributions vary substantially with the evaluation setting. Under likelihood-based metrics without generation, pruning most layers preserves performance, with only the initial few being critical. By contrast, generation-based evaluation uncovers indispensable roles for middle and deeper layers in enabling reasoning and maintaining long-range coherence. We further find that knowledge and retrieval are concentrated in shallow components, whereas reasoning accuracy relies heavily on deeper layers – yet can be reshaped through distillation. These results highlight that depth usage in LLMs is highly heterogeneous and context-dependent, underscoring the need for task-, metric-, and model-aware perspectives in both interpreting and compressing large models.
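The depth-ablation protocol itself is straightforward to reproduce. A hedged sketch for a LLaMA-style Hugging Face checkpoint (attribute paths differ across architectures, and the model name is only an example):

```python
# Sketch: drop the deepest k transformer blocks, then compare the
# likelihood-based view (loss) with the generation-based view (decoding).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-3.2-3B"  # illustrative checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)

k = 4
model.model.layers = model.model.layers[:-k]   # prune the deepest k blocks
model.config.num_hidden_layers -= k

ids = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    nll = model(**ids, labels=ids["input_ids"]).loss   # likelihood metric
    gen = model.generate(**ids, max_new_tokens=20)     # generation behavior
print(nll.item(), tok.decode(gen[0], skip_special_tokens=True))
```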
[322] Do AI Models Perform Human-like Abstract Reasoning Across Modalities?
Claas Beger, Ryan Yi, Shuhao Fu, Arseny Moskvichev, Sarah W. Tsai, Sivasankaran Rajamanickam, Melanie Mitchell
Main category: cs.AI
TL;DR: While some AI models achieve human-level accuracy on ConceptARC benchmark, they often rely on surface-level shortcuts rather than true abstract reasoning. Visual modality performance drops significantly, but models still show some abstraction capability that accuracy alone doesn’t capture.
Details
Motivation: To investigate whether state-of-the-art models truly recognize and reason with intended abstractions in ARC-AGI benchmarks, rather than just achieving high accuracy through surface-level patterns.
Method: Evaluated models on ConceptARC with varied input modalities (textual vs. visual), external Python tool usage, and reasoning effort. Performed dual evaluation of both output accuracy and fine-grained analysis of natural-language rules generated to explain solutions.
Result: Text-based models matched human accuracy but their rules often used surface-level shortcuts. Visual modality accuracy dropped sharply, but models still exhibited substantial abstraction in their rules, though unable to correctly apply them. Accuracy alone overestimates abstract reasoning in text and underestimates it in visual modalities.
Conclusion: Models still lag humans in abstract reasoning. Accuracy-only evaluation provides misleading picture - overestimating capabilities in text and underestimating in visual modalities. The proposed evaluation framework offers more faithful assessment of abstract reasoning abilities.
Abstract: OpenAI’s o3-preview reasoning model exceeded human accuracy on the ARC-AGI benchmark, but does that mean state-of-the-art models recognize and reason with the abstractions that the task creators intended? We investigate models’ abstraction abilities on ConceptARC. We evaluate models under settings that vary the input modality (textual vs. visual), whether the model is permitted to use external Python tools, and, for reasoning models, the amount of reasoning effort. In addition to measuring output accuracy, we perform fine-grained evaluation of the natural-language rules that models generate to explain their solutions. This dual evaluation lets us assess whether models solve tasks using the abstractions ConceptARC was designed to elicit, rather than relying on surface-level patterns. Our results show that, while some models using text-based representations match human output accuracy, the best models’ rules are often based on surface-level “shortcuts” and capture intended abstractions far less often than humans. Thus their capabilities for general abstract reasoning may be overestimated by evaluations based on accuracy alone. In the visual modality, AI models’ output accuracy drops sharply, yet our rule-level analysis reveals that models might be underestimated, as they still exhibit a substantial share of rules that capture intended abstractions, but are often unable to correctly apply these rules. In short, our results show that models still lag humans in abstract reasoning, and that using accuracy alone to evaluate abstract reasoning on ARC-like tasks may overestimate abstract-reasoning capabilities in textual modalities and underestimate them in visual modalities. We believe that our evaluation framework offers a more faithful picture of multimodal models’ abstract reasoning abilities and a more principled way to track progress toward human-like, abstraction-centered intelligence.
[323] FlexDoc: Parameterized Sampling for Diverse Multilingual Synthetic Documents for Training Document Understanding Models
Karan Dua, Hitesh Laxmichand Patel, Puneet Mittal, Ranjeet Gupta, Amit Agarwal, Praneet Pabolu, Srikant Panda, Hansa Meghwani, Graham Horwood, Fahad Shah
Main category: cs.AI
TL;DR: FlexDoc is a synthetic data generation framework that creates realistic multilingual semi-structured documents with rich annotations to address the high costs and privacy constraints of collecting real enterprise document data.
Details
Motivation: Developing document understanding models requires large, diverse datasets, but collecting real data is prohibitively expensive due to privacy constraints, legal restrictions, and high annotation costs that can reach millions of dollars.
Method: Uses Stochastic Schemas and Parameterized Sampling to probabilistically model layout patterns, visual structure, and content variability, enabling controlled generation of diverse document variants at scale.
Result: FlexDoc-generated data improves Key Information Extraction F1 Score by up to 11% when augmenting real datasets, while reducing annotation effort by over 90% compared to traditional hard-template methods.
Conclusion: FlexDoc is successfully deployed in enterprise settings, accelerating document understanding model development while significantly reducing data acquisition and annotation costs.
Abstract: Developing document understanding models at enterprise scale requires large, diverse, and well-annotated datasets spanning a wide range of document types. However, collecting such data is prohibitively expensive due to privacy constraints, legal restrictions, and the sheer volume of manual annotation needed - costs that can scale into millions of dollars. We introduce FlexDoc, a scalable synthetic data generation framework that combines Stochastic Schemas and Parameterized Sampling to produce realistic, multilingual semi-structured documents with rich annotations. By probabilistically modeling layout patterns, visual structure, and content variability, FlexDoc enables the controlled generation of diverse document variants at scale. Experiments on Key Information Extraction (KIE) tasks demonstrate that FlexDoc-generated data improves the absolute F1 Score by up to 11% when used to augment real datasets, while reducing annotation effort by over 90% compared to traditional hard-template methods. The solution is in active deployment, where it has accelerated the development of enterprise-grade document understanding models while significantly reducing data acquisition and annotation costs.
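A toy illustration of parameterized sampling from a stochastic schema; the fields and probabilities below are invented and far simpler than FlexDoc's layout and content modeling, but they show how one schema yields many annotated variants.

```python
import random

SCHEMA = {  # illustrative stochastic schema
    "language": [("en", 0.6), ("de", 0.2), ("hi", 0.2)],
    "layout":   [("two-column", 0.5), ("single-column", 0.3), ("tabular", 0.2)],
    "fields":   [("invoice_no", 0.9), ("due_date", 0.7), ("tax_id", 0.4)],
}

def weighted(options):
    values, weights = zip(*options)
    return random.choices(values, weights=weights)[0]

def sample_document() -> dict:
    doc = {"language": weighted(SCHEMA["language"]),
           "layout": weighted(SCHEMA["layout"]),
           # each optional field is included independently with its probability
           "fields": [f for f, p in SCHEMA["fields"] if random.random() < p]}
    doc["annotations"] = {f: f"<value of {f}>" for f in doc["fields"]}
    return doc

print(sample_document())  # each call yields a new annotated document spec
```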
[324] A Rigorous Benchmark with Multidimensional Evaluation for Deep Research Agents: From Answers to Reports
Yang Yao, Yixu Wang, Yuxuan Zhang, Yi Lu, Tianle Gu, Lingyu Li, Dingyi Zhao, Keming Wu, Haozhe Wang, Ping Nie, Yan Teng, Yingchun Wang
Main category: cs.AI
TL;DR: This paper introduces a new benchmark and evaluation framework specifically designed for Deep Research Agents (DRAs) to address limitations in existing benchmarks for assessing agent systems with report-style responses.
Details
Motivation: Existing benchmarks are deficient in evaluation dimensions, response formatting, and scoring mechanisms, limiting their ability to effectively assess interconnected agent systems like DRAs that perform complex tasks through task decomposition, cross-source retrieval, and multi-stage reasoning.
Method: The authors created a benchmark with 214 expert-curated challenging queries across 10 domains, accompanied by manually constructed reference bundles. They developed a multidimensional evaluation framework with integrated scoring metrics for semantic quality, topical focus, and retrieval trustworthiness of long-form reports.
Result: Experiments showed that mainstream DRAs outperform web-search-tool-augmented reasoning models, but there is still considerable room for improvement in DRA capabilities.
Conclusion: This study provides a robust foundation for capability assessment, architectural refinement, and paradigm advancement in DRA systems, addressing the gap in evaluating interconnected agent systems.
Abstract: Artificial intelligence is undergoing the paradigm shift from closed language models to interconnected agent systems capable of external perception and information integration. As a representative embodiment, Deep Research Agents (DRAs) systematically exhibit the capabilities for task decomposition, cross-source retrieval, multi-stage reasoning, and structured output, which markedly enhance performance on complex and open-ended tasks. However, existing benchmarks remain deficient in evaluation dimensions, response formatting, and scoring mechanisms, limiting their capacity to assess such systems effectively. This paper introduces a rigorous benchmark and a multidimensional evaluation framework tailored to DRAs and report-style responses. The benchmark comprises 214 expert-curated challenging queries distributed across 10 broad thematic domains, each accompanied by manually constructed reference bundles to support composite evaluation. The framework enables comprehensive evaluation of long-form reports generated by DRAs, incorporating integrated scoring metrics for semantic quality, topical focus, and retrieval trustworthiness. Extensive experimentation confirms the superior performance of mainstream DRAs over web-search-tool-augmented reasoning models, yet reveals considerable scope for further improvement. This study provides a robust foundation for capability assessment, architectural refinement, and paradigm advancement in DRA systems.
[325] UpSafe°C: Upcycling for Controllable Safety in Large Language Models
Yuhao Sun, Zhuoer Xu, Shiwen Cui, Kun Yang, Lingyun Yu, Yongdong Zhang, Hongtao Xie
Main category: cs.AI
TL;DR: UpSafe°C is a unified framework that enhances LLM safety through safety-aware upcycling, converting safety-critical layers into sparse Mixture-of-Experts with safety experts, and introducing safety temperature for inference-time control.
Details
Motivation: LLMs remain vulnerable to safety risks like harmful content generation and jailbreak attacks, while existing safety techniques face limitations in balancing safety, utility, and controllability.
Method: Identify safety-critical layers and upcycle them into sparse MoE structure with safety experts, use two-stage SFT strategy to strengthen safety discrimination, and introduce safety temperature mechanism for dynamic control.
Result: Achieves robust safety improvements against harmful and jailbreak inputs while maintaining competitive performance on general tasks, with safety temperature providing Pareto-optimal trade-off between utility and safety.
Conclusion: Highlights a new direction for LLM safety: moving from static alignment toward dynamic, modular, and inference-aware control.
Abstract: Large Language Models (LLMs) have achieved remarkable progress across a wide range of tasks, but remain vulnerable to safety risks such as harmful content generation and jailbreak attacks. Existing safety techniques – including external guardrails, inference-time guidance, and post-training alignment – each face limitations in balancing safety, utility, and controllability. In this work, we propose UpSafe°C, a unified framework for enhancing LLM safety through safety-aware upcycling. Our approach first identifies safety-critical layers and upcycles them into a sparse Mixture-of-Experts (MoE) structure, where the router acts as a soft guardrail that selectively activates original MLPs and added safety experts. We further introduce a two-stage SFT strategy to strengthen safety discrimination while preserving general capabilities. To enable flexible control at inference time, we introduce a safety temperature mechanism, allowing dynamic adjustment of the trade-off between safety and utility. Experiments across multiple benchmarks, base models, and model scales demonstrate that UpSafe°C achieves robust safety improvements against harmful and jailbreak inputs, while maintaining competitive performance on general tasks. Moreover, analysis shows that safety temperature provides fine-grained inference-time control that achieves the Pareto-optimal frontier between utility and safety. Our results highlight a new direction for LLM safety: moving from static alignment toward dynamic, modular, and inference-aware control.
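One way to picture the safety temperature is as a scale applied to the safety experts' router logits before the softmax, so a single knob shifts routing mass toward or away from the safety experts at inference time. A minimal sketch under that assumption (the paper's exact parameterization may differ):

```python
import torch
import torch.nn.functional as F

def route(router_logits: torch.Tensor, n_safety: int, safety_temp: float):
    """router_logits: (..., n_experts); the last n_safety are safety experts."""
    logits = router_logits.clone()
    logits[..., -n_safety:] *= safety_temp  # >1 favors safety experts (for positive logits)
    return F.softmax(logits, dim=-1)        # soft-guardrail expert mixing weights

logits = torch.tensor([1.0, 0.2, 0.5])      # [original MLP, safety expert 1, safety expert 2]
for t in (0.5, 1.0, 2.0):
    print(t, route(logits, n_safety=2, safety_temp=t))
```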
[326] The Reasoning Boundary Paradox: How Reinforcement Learning Constrains Language Models
Phuc Minh Nguyen, Chinh D. La, Duy M. H. Nguyen, Nitesh V. Chawla, Binh T. Nguyen, Khoa D. Doan
Main category: cs.AI
TL;DR: RLVR (Reinforcement Learning with Verifiable Rewards) paradoxically shrinks reasoning boundaries in LLMs due to negative interference and winner-take-all phenomena, but a data curation algorithm focusing on low-likelihood problems improves performance.
Details
Motivation: To investigate why RLVR, despite being designed to improve reasoning capabilities, actually shrinks the reasoning boundary in Large Language Models rather than expanding it.
Method: The paper analyzes RLVR learning dynamics through theoretical and empirical analysis on mathematical reasoning benchmarks, revealing negative interference and winner-take-all phenomena. It then proposes a data curation algorithm that focuses RLVR learning on low-likelihood problems.
Result: The study shows that standard RL objectives cause models to converge toward narrow solution strategies, but the proposed data curation algorithm achieves notable improvement in Pass@k performance.
Conclusion: RLVR’s shrinkage issue stems from inherent on-policy sampling in standard RL objectives, but can be mitigated by focusing learning on low-likelihood problems through proper data curation.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a key method for improving Large Language Models’ reasoning capabilities, yet recent evidence suggests it may paradoxically shrink the reasoning boundary rather than expand it. This paper investigates the shrinkage issue of RLVR by analyzing its learning dynamics and reveals two critical phenomena that explain this failure. First, we expose negative interference in RLVR, where learning to solve certain training problems actively reduces the likelihood of correct solutions for others, leading to the decline of Pass@$k$ performance, or the probability of generating a correct solution within $k$ attempts. Second, we uncover the winner-take-all phenomenon: RLVR disproportionately reinforces problems with high-likelihood correct solutions under the base model, while suppressing other initially low-likelihood ones. Through extensive theoretical and empirical analysis on multiple mathematical reasoning benchmarks, we show that this effect arises from the inherent on-policy sampling in standard RL objectives, causing the model to converge toward narrow solution strategies. Based on these insights, we propose a simple yet effective data curation algorithm that focuses RLVR learning on low-likelihood problems, achieving notable improvement in Pass@$k$ performance. Our code is available at https://github.com/mail-research/SELF-llm-interference.
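The curation rule can be sketched in a few lines: estimate each problem's solve rate under the base model and keep only the hard ones for RLVR. `sample_answer`, `k`, and the cutoff are illustrative assumptions:

```python
# Sketch: focus RLVR training on problems the base model rarely solves.
def solve_rate(problem, gold, sample_answer, k=16):
    return sum(sample_answer(problem) == gold for _ in range(k)) / k

def curate(dataset, sample_answer, cutoff=0.25):
    return [(p, g) for p, g in dataset
            if solve_rate(p, g, sample_answer) < cutoff]  # keep low-likelihood problems
```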
[327] The Unreasonable Effectiveness of Scaling Agents for Computer Use
Gonzalo Gonzalez-Pumariega, Vincent Tu, Chih-Lun Lee, Jiachen Yang, Ang Li, Xin Eric Wang
Main category: cs.AI
TL;DR: Behavior Best-of-N (bBoN) is a scaling method that improves computer-use agents by generating multiple rollouts and selecting the best ones using behavior narratives, achieving near-human performance on complex digital tasks.
Details
Motivation: Computer-use agents (CUAs) are unreliable and have high variance, which limits their application to long-horizon, complex digital tasks.
Method: Generate multiple agent rollouts and select among them using behavior narratives that describe the agents’ rollouts, enabling both wide exploration and principled trajectory selection.
Result: Achieved 69.9% success rate on OSWorld (approaching human-level 72%), established new SoTA, and demonstrated strong generalization to WindowsAgentArena and AndroidWorld.
Conclusion: Effective scaling of CUAs requires structured trajectory understanding and selection, and bBoN provides a practical framework to achieve this, highlighting the unreasonable effectiveness of proper scaling methods.
Abstract: Computer-use agents (CUAs) hold promise for automating everyday digital tasks, but their unreliability and high variance hinder their application to long-horizon, complex tasks. We introduce Behavior Best-of-N (bBoN), a method that scales over agents by generating multiple rollouts and selecting among them using behavior narratives that describe the agents’ rollouts. It enables both wide exploration and principled trajectory selection, substantially improving robustness and success rates. On OSWorld, our bBoN scaling method establishes a new state of the art (SoTA) at 69.9%, significantly outperforming prior methods and approaching human-level performance at 72%, with comprehensive ablations validating key design choices. We further demonstrate strong generalization results to different operating systems on WindowsAgentArena and AndroidWorld. Crucially, our results highlight the unreasonable effectiveness of scaling CUAs, when you do it right: effective scaling requires structured trajectory understanding and selection, and bBoN provides a practical framework to achieve this.
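A skeletal sketch of the bBoN selection loop, with `narrate` and `judge_pick` standing in for the paper's behavior-narrative generator and comparative judge:

```python
# Sketch: run N rollouts, describe each as a narrative, let a judge pick one.
def narrate(rollout) -> str:
    raise NotImplementedError   # summarize what the agent actually did on screen

def judge_pick(task: str, narratives: list[str]) -> int:
    raise NotImplementedError   # return index of the narrative that best fits the task

def bbon(task: str, run_agent, n: int = 8):
    rollouts = [run_agent(task) for _ in range(n)]   # wide exploration
    narratives = [narrate(r) for r in rollouts]
    return rollouts[judge_pick(task, narratives)]    # principled selection
```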
[328] RLAD: Training LLMs to Discover Abstractions for Solving Reasoning Problems
Yuxiao Qu, Anikait Singh, Yoonho Lee, Amrith Setlur, Ruslan Salakhutdinov, Chelsea Finn, Aviral Kumar
Main category: cs.AI
TL;DR: The paper introduces reasoning abstractions - concise natural language descriptions of procedural knowledge - and RLAD, a two-player RL training paradigm that jointly trains an abstraction generator and solution generator to improve algorithmic reasoning.
Details
Motivation: Current reasoning traces in large models often fail to consistently capture or reuse procedures, instead drifting into verbose exploration. The goal is to enable more effective reasoning by guiding models toward learning successful algorithmic procedures.
Method: RLAD (two-player RL training): train models to propose multiple abstractions for a problem, then use RL to incentivize building solutions using these abstractions. This decouples learning signals and enables structured exploration.
Result: The approach improves generalization to harder problems and shows that allocating more test-time compute to generating abstractions is more beneficial than generating more solutions at large test budgets.
Conclusion: Reasoning abstractions effectively guide meaningful exploration in reasoning tasks, and the RLAD framework enables structured learning of algorithmic procedures through the interplay between abstraction generation and solution building.
Abstract: Reasoning requires going beyond pattern matching or memorization of solutions to identify and implement “algorithmic procedures” that can be used to deduce answers to hard problems. Doing so requires realizing the most relevant primitives, intermediate results, or shared procedures, and building upon them. While RL post-training on long chains of thought ultimately aims to uncover this kind of algorithmic behavior, most reasoning traces learned by large models fail to consistently capture or reuse procedures, instead drifting into verbose and degenerate exploration. To enable more effective reasoning, we introduce reasoning abstractions: concise natural language descriptions of procedural and factual knowledge that guide the model toward learning successful reasoning. We train models to be capable of proposing multiple abstractions given a problem, followed by RL that incentivizes building a solution while using the information provided by these abstractions. This results in a two-player RL training paradigm, abbreviated as RLAD, that jointly trains an abstraction generator and a solution generator. This setup effectively enables structured exploration, decouples learning signals of abstraction proposal and solution generation, and improves generalization to harder problems. We also show that allocating more test-time compute to generating abstractions is more beneficial for performance than generating more solutions at large test budgets, illustrating the role of abstractions in guiding meaningful exploration.
[329] BioX-Bridge: Model Bridging for Unsupervised Cross-Modal Knowledge Transfer across Biosignals
Chenqi Li, Yu Liu, Timothy Denison, Tingting Zhu
Main category: cs.AI
TL;DR: BioX-Bridge enables efficient cross-modal knowledge transfer between biosignal foundation models using lightweight bridge networks, reducing parameters by 88-99% while maintaining performance.
Details
Motivation: Biosignals are intercorrelated but have limited labeled data. Existing cross-modal transfer methods using knowledge distillation are computationally expensive, especially with large foundation models.
Method: Train lightweight bridge networks to align intermediate representations between foundation models across modalities, with efficient alignment position selection and flexible prototype bridge architecture.
Result: BioX-Bridge reduces trainable parameters by 88-99% while maintaining or improving transfer performance across multiple biosignal modalities, tasks, and datasets.
Conclusion: The proposed framework enables efficient cross-modal knowledge transfer for biosignals, addressing computational challenges of foundation models while leveraging their superior performance.
Abstract: Biosignals offer valuable insights into the physiological states of the human body. Although biosignal modalities differ in functionality, signal fidelity, sensor comfort, and cost, they are often intercorrelated, reflecting the holistic and interconnected nature of human physiology. This opens up the possibility of performing the same tasks using alternative biosignal modalities, thereby improving the accessibility, usability, and adaptability of health monitoring systems. However, the limited availability of large labeled datasets presents challenges for training models tailored to specific tasks and modalities of interest. Unsupervised cross-modal knowledge transfer offers a promising solution by leveraging knowledge from an existing modality to support model training for a new modality. Existing methods are typically based on knowledge distillation, which requires running a teacher model alongside student model training, resulting in high computational and memory overhead. This challenge is further exacerbated by the recent development of foundation models that demonstrate superior performance and generalization across tasks at the cost of large model sizes. To this end, we explore a new framework for unsupervised cross-modal knowledge transfer of biosignals by training a lightweight bridge network to align the intermediate representations and enable information flow between foundation models and across modalities. Specifically, we introduce an efficient strategy for selecting alignment positions where the bridge should be constructed, along with a flexible prototype network as the bridge architecture. Extensive experiments across multiple biosignal modalities, tasks, and datasets show that BioX-Bridge reduces the number of trainable parameters by 88–99% while maintaining or even improving transfer performance compared to state-of-the-art methods.
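A minimal PyTorch sketch of the bridge idea, assuming intermediate activations from the two frozen foundation models have already been extracted at the selected alignment positions; the dimensions, MLP shape, and MSE objective are simplifying assumptions rather than the paper's prototype design.

```python
import torch
import torch.nn as nn

d_src, d_tgt = 512, 768  # illustrative hidden sizes of the two frozen models
bridge = nn.Sequential(nn.Linear(d_src, 1024), nn.GELU(), nn.Linear(1024, d_tgt))
opt = torch.optim.AdamW(bridge.parameters(), lr=1e-4)

def train_step(h_src: torch.Tensor, h_tgt: torch.Tensor) -> float:
    """h_src/h_tgt: paired activations for time-aligned biosignal inputs."""
    loss = nn.functional.mse_loss(bridge(h_src), h_tgt)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# toy example with random tensors standing in for real paired features
print(train_step(torch.randn(32, d_src), torch.randn(32, d_tgt)))
```

Only the bridge's parameters receive gradients, which is where the reported 88-99% reduction in trainable parameters comes from.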
[330] A* Search Without Expansions: Learning Heuristic Functions with Deep Q-Networks
Forest Agostinelli, Shahaf S. Shperberg, Alexander Shmakov, Stephen McAleer, Roy Fox, Pierre Baldi
Main category: cs.AI
TL;DR: Q* search is a new algorithm that uses Q-value heuristics to efficiently handle large action spaces in A* search, reducing computation time and memory usage significantly.
Details
Motivation: A* search becomes inefficient with large action spaces due to linear growth in nodes generated and heuristic function applications, especially when using expensive function approximators like deep neural networks.
Method: Introduces Q* search that uses heuristics providing cost-to-go estimates for all possible transitions from a state in a single function call, eliminating the need to generate successor states. Uses deep Q-network to learn state-action heuristic function from domain interaction.
Result: Q* search shows minimal runtime overhead as action space size increases, and is up to 129 times faster while generating up to 1288 times fewer nodes than A* search.
Conclusion: Q* search effectively addresses the computational burden of large action spaces in A* search through Q-value heuristics, maintaining optimality guarantees while achieving significant performance improvements.
Abstract: Efficiently solving problems with large action spaces using A* search remains a significant challenge. This is because, for each iteration of A* search, the number of nodes generated and the number of heuristic function applications grow linearly with the size of the action space. This burden becomes even more apparent when A* search uses a heuristic function learned by computationally expensive function approximators, such as deep neural networks. To address this issue, we introduce Q*, a search algorithm that leverages heuristics capable of receiving a state and, in a single function call, returning cost-to-go estimates for all possible transitions from that state, along with estimates of the corresponding transition costs – without the need to apply the transitions or generate the successor states; such action-state estimates are typically known as Q-values. This significantly reduces computation time and memory usage. In addition, we prove that Q* search is guaranteed to find a shortest path given a heuristic function that does not overestimate the sum of the transition cost and cost-to-go of the state. To obtain heuristics for Q* search, we employ a deep Q-network architecture to learn a state-action heuristic function from domain interaction, without any prior knowledge. We use Q* with our learned heuristic on different domains and action spaces, showing that Q* suffers from only a small runtime overhead as the size of the action space increases. In addition, our empirical results show Q* search is up to 129 times faster and generates up to 1288 times fewer nodes than A* search.
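A minimal sketch of the core idea, assuming problem-specific callables `q_values` (one call returns a cost estimate for every action), `step`, and `is_goal`; details such as tie-breaking differ from the paper's implementation:

```python
import heapq
import itertools

# Q*-style best-first search: successors are only materialized when a
# (state, action) pair is popped, and each expanded state costs exactly
# one heuristic call regardless of the action-space size.
def q_star_search(start, q_values, step, is_goal):
    if is_goal(start):
        return 0.0
    tie = itertools.count()              # tie-breaker for the heap
    g = {start: 0.0}
    frontier = [(q, next(tie), 0.0, start, a)
                for a, q in q_values(start).items()]
    heapq.heapify(frontier)
    while frontier:
        f, _, g_s, s, a = heapq.heappop(frontier)
        succ, cost = step(s, a)          # generate the successor lazily
        g_succ = g_s + cost
        if succ in g and g[succ] <= g_succ:
            continue
        g[succ] = g_succ
        if is_goal(succ):
            return g_succ
        for a2, q in q_values(succ).items():  # one heuristic call per state
            heapq.heappush(frontier, (g_succ + q, next(tie), g_succ, succ, a2))
    return None
```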
[331] Towards end-to-end ASP computation
Taisuke Sato, Akihiro Takemura, Katsumi Inoue
Main category: cs.AI
TL;DR: End-to-end approach for Answer Set Programming using linear algebra to compute stable models by minimizing a cost function derived from matricized logic programs and loop formulas, without symbolic ASP/SAT solvers.
Details
Motivation: To develop a purely numerical approach for ASP that avoids symbolic solvers by implementing Lin-Zhao's theorem directly in vector spaces through linear algebra computations.
Method: Transform normal logic programs into matrices, construct cost function from program matrices, loop formulas, and constraints, then minimize numerically. Includes precomputation for program size reduction and heuristics for loop formulas.
Result: Empirically tested on problems like 3-coloring and Hamiltonian cycle, demonstrating the approach can compute stable models without traditional ASP/SAT solvers.
Conclusion: The linear algebraic approach provides an alternative numerical method for ASP that eliminates dependency on symbolic solvers while maintaining computational feasibility through optimizations.
Abstract: We propose an end-to-end approach for Answer Set Programming (ASP) that computes stable models satisfying given constraints using linear algebra. The idea is to implement Lin-Zhao’s theorem, together with the constraints, directly in vector spaces: a cost function is constructed from a matricized normal logic program, the loop formulas of Lin-Zhao’s theorem, and the constraints, and is then minimized numerically, so that no symbolic ASP or SAT solvers are involved in our approach. We also propose precomputation that shrinks the program size and heuristics for loop formulas to reduce computational difficulty. We empirically test our approach with programming examples including the 3-coloring and Hamiltonian cycle problems.
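As a toy illustration of "stable models as numerical minimization" (not the paper's exact cost function), the sketch below relaxes the two-atom program {p :- not q. q :- not p.} to continuous truth values and descends a cost whose zeros are exactly the two stable models:

```python
import numpy as np

# Toy cost: rule-satisfaction residual plus a penalty pushing values to {0,1}.
# Its global minima are the stable models (1,0) and (0,1).
def cost(x):
    p, q = x
    supported = (p - (1 - q)) ** 2 + (q - (1 - p)) ** 2
    binary = p * (1 - p) + q * (1 - q)
    return supported + binary

def grad(x, eps=1e-5):  # numerical gradient keeps the sketch short
    return np.array([(cost(x + eps * e) - cost(x - eps * e)) / (2 * eps)
                     for e in np.eye(2)])

rng = np.random.default_rng(0)
x = rng.uniform(size=2)
for _ in range(2000):
    x = np.clip(x - 0.1 * grad(x), 0.0, 1.0)
print(np.round(x))  # converges to one of the two stable models
```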
[332] Forms of Understanding for XAI-Explanations
Hendrik Buschmeier, Heike M. Buhl, Friederike Kern, Angela Grimminger, Helen Beierling, Josephine Fisher, André Groß, Ilona Horwath, Nils Klowait, Stefan Lazarov, Michael Lenke, Vivien Lohmer, Katharina Rohlfing, Ingrid Scharlau, Amit Singh, Lutz Terfloth, Anna-Lisa Vollmer, Yu Wang, Annedore Wilmes, Britta Wrede
Main category: cs.AI
TL;DR: This paper proposes a model of understanding for XAI, defining two types of understanding (enabledness and comprehension) and their dynamics in explanation processes.
Details
Motivation: To address the lack of a clear definition of 'understanding' in XAI and provide a systematic framework for what it means to understand explanations.
Method: Interdisciplinary approach combining computer science, linguistics, sociology, philosophy and psychology to define understanding, its forms, assessment, and dynamics in everyday explanations.
Result: Developed a model with two types of understanding outcomes: enabledness (‘knowing how’) and comprehension (‘knowing that’), both existing on a spectrum from shallow to deep.
Conclusion: Understanding in XAI involves interdependent growth of comprehension and enabledness, with deep understanding being a prerequisite for human agency, and the paper discusses special challenges for XAI in this context.
Abstract: Explainability has become an important topic in computer science and artificial intelligence, leading to a subfield called Explainable Artificial Intelligence (XAI). The goal of providing or seeking explanations is to achieve (better) ‘understanding’ on the part of the explainee. However, what it means to ‘understand’ is still not clearly defined, and the concept itself is rarely the subject of scientific investigation. This conceptual article aims to present a model of forms of understanding for XAI-explanations and beyond. From an interdisciplinary perspective bringing together computer science, linguistics, sociology, philosophy and psychology, a definition of understanding and its forms, assessment, and dynamics during the process of giving everyday explanations are explored. Two types of understanding are considered as possible outcomes of explanations, namely enabledness, ‘knowing how’ to do or decide something, and comprehension, ‘knowing that’ – both in different degrees (from shallow to deep). Explanations regularly start with shallow understanding in a specific domain and can lead to deep comprehension and enabledness of the explanandum, which we see as a prerequisite for human users to gain agency. In this process, the increase of comprehension and enabledness are highly interdependent. Against the background of this systematization, special challenges of understanding in XAI are discussed.
[333] Goal Recognition Design for General Behavioral Agents using Machine Learning
Robert Kasumba, Guanghui Yu, Chien-Ju Ho, Sarah Keren, William Yeoh
Main category: cs.AI
TL;DR: A machine learning-based approach for goal recognition design that improves runtime efficiency and handles suboptimal agent behavior, using worst-case distinctiveness as a measure and gradient-based optimization to enhance goal inference environments.
Details
Motivation: Existing goal recognition design approaches are computationally demanding and assume near-optimal agent behavior, limiting their practical applicability in real-world scenarios.
Method: Train ML model to predict worst-case distinctiveness (wcd), then use gradient-based optimization with various constraints to optimize decision-making environments for better goal recognition.
Result: Outperforms existing methods in reducing wcd and improving runtime efficiency; adapts to flexible budget constraints, complex environments, and suboptimal agent behavior; validated through human-subject experiments.
Conclusion: The proposed ML-based framework provides an efficient and flexible solution for goal recognition design that works with realistic agent behaviors and various environmental constraints.
Abstract: Goal recognition design (GRD) aims to make limited modifications to decision-making environments to make it easier to infer the goals of agents acting within those environments. Although various research efforts have been made in goal recognition design, existing approaches are computationally demanding and often assume that agents are (near-)optimal in their decision-making. To address these limitations, we leverage machine learning methods for goal recognition design that can both improve run-time efficiency and account for agents with general behavioral models. Following existing literature, we use worst-case distinctiveness (wcd) as a measure of the difficulty in inferring the goal of an agent in a decision-making environment. Our approach begins by training a machine learning model to predict the wcd for a given environment and the agent behavior model. We then propose a gradient-based optimization framework that accommodates various constraints to optimize decision-making environments for enhanced goal recognition. Through extensive simulations, we demonstrate that our approach outperforms existing methods in reducing wcd and enhances runtime efficiency. Moreover, our approach also adapts to settings in which existing approaches do not apply, such as those involving flexible budget constraints, more complex environments, and suboptimal agent behavior. Finally, we conducted human-subject experiments that demonstrate that our method creates environments that facilitate efficient goal recognition from human decision-makers.
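A minimal sketch of the gradient-based step, assuming a trained differentiable surrogate `wcd_model` and a simple penalty term standing in for the paper's budget constraints:

```python
import torch

# Sketch of gradient-based GRD: lower the surrogate's predicted wcd while
# penalizing deviation from the original environment design.
def optimize_environment(env0, wcd_model, budget_weight=1.0, steps=200, lr=0.05):
    env = env0.clone().requires_grad_(True)
    opt = torch.optim.Adam([env], lr=lr)
    for _ in range(steps):
        loss = wcd_model(env) + budget_weight * (env - env0).abs().sum()
        opt.zero_grad(); loss.backward(); opt.step()
    return env.detach()  # continuous design; round/discretize for actual maps
```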
[334] A Flexible Method for Behaviorally Measuring Alignment Between Human and Artificial Intelligence Using Representational Similarity Analysis
Mattson Ogg, Ritwik Bose, Jamie Scharf, Christopher Ratto, Michael Wolmetz
Main category: cs.AI
TL;DR: The paper adapts Representational Similarity Analysis (RSA) to measure alignment between AI models and human cognition across text and image modalities, finding GPT-4o shows strongest alignment but no model captures individual human variability well.
Details
Motivation: As LLMs are increasingly used in societal decision-making roles, there's a critical need to measure their alignment with human cognition to ensure trustworthy deployment.
Method: Adapted Representational Similarity Analysis (RSA) using pairwise similarity ratings to quantify alignment between AI models and humans across text and image modalities at both group and individual levels.
Result: GPT-4o showed strongest alignment with human performance, especially using text processing over image processing. No model adequately captured inter-individual variability, and models only moderately aligned with individual humans’ responses.
Conclusion: RSA provides efficient and flexible quantification of human-AI alignment, complementing accuracy-based benchmarks, and helps understand how LLMs encode knowledge and align with human cognition across multiple modalities.
Abstract: As we consider entrusting Large Language Models (LLMs) with key societal and decision-making roles, measuring their alignment with human cognition becomes critical. This requires methods that can assess how these systems represent information and facilitate comparisons with human understanding across diverse tasks. To meet this need, we adapted Representational Similarity Analysis (RSA), a method that uses pairwise similarity ratings to quantify alignment between AIs and humans. We tested this approach on semantic alignment across text and image modalities, measuring how different Large Language and Vision Language Model (LLM and VLM) similarity judgments aligned with human responses at both group and individual levels. GPT-4o showed the strongest alignment with human performance among the models we tested, particularly when leveraging its text processing capabilities rather than image processing, regardless of the input modality. However, no model we studied adequately captured the inter-individual variability observed among human participants, and only moderately aligned with any individual human’s responses. This method helped uncover certain hyperparameters and prompts that could steer model behavior to have more or less human-like qualities at an inter-individual or group level. Pairwise ratings and RSA enable the efficient and flexible quantification of human-AI alignment, which complements existing accuracy-based benchmark tasks. We demonstrate the utility of this approach across multiple modalities (words, sentences, images) for understanding how LLMs encode knowledge and for examining representational alignment with human cognition.
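The core RSA computation is compact; a minimal sketch with toy similarity matrices (the paper's stimuli and rating procedure are of course richer):

```python
import numpy as np
from scipy.stats import spearmanr

# Minimal RSA: correlate the upper triangles of a human and a model
# pairwise-similarity matrix over the same item set.
def rsa_alignment(sim_human, sim_model):
    iu = np.triu_indices_from(sim_human, k=1)  # unique pairs only
    rho, _ = spearmanr(sim_human[iu], sim_model[iu])
    return rho

# Example with 4 items: matrices are symmetric, diagonal ignored.
h = np.array([[1.0, .9, .2, .1], [.9, 1, .3, .2], [.2, .3, 1, .8], [.1, .2, .8, 1]])
m = np.array([[1.0, .8, .3, .2], [.8, 1, .2, .3], [.3, .2, 1, .9], [.2, .3, .9, 1]])
print(rsa_alignment(h, m))  # higher rank correlation = stronger alignment
```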
[335] AgentDAM: Privacy Leakage Evaluation for Autonomous Web Agents
Arman Zharmagambetov, Chuan Guo, Ivan Evtimov, Maya Pavlova, Ruslan Salakhutdinov, Kamalika Chaudhuri
Main category: cs.AI
TL;DR: AgentDAM benchmark measures if AI web-navigation agents follow data minimization principles, showing current agents (GPT-4, Llama-3, Claude) often use unnecessary sensitive information, with proposed prompting defense reducing leakage.
Details
Motivation: AI agents need access to personal information for complex tasks, raising concerns about whether they use it appropriately and follow privacy principles like data minimization.
Method: Introduced AgentDAM benchmark that simulates realistic web interaction scenarios to measure data minimization, adaptable to all web navigation agents, with a prompting-based defense proposed.
Result: Current AI agents (GPT-4, Llama-3, Claude) are prone to inadvertent use of unnecessary sensitive information, but the proposed prompting defense reduces information leakage.
Conclusion: Further research is needed to develop AI agents that can prioritize data minimization at inference time, and end-to-end benchmarking provides more realistic privacy measurement than probing LLMs.
Abstract: Autonomous AI agents that can follow instructions and perform complex multi-step tasks have tremendous potential to boost human productivity. However, to perform many of these tasks, the agents need access to personal information from their users, raising the question of whether they are capable of using it appropriately. In this work, we introduce a new benchmark AgentDAM that measures if AI web-navigation agents follow the privacy principle of "data minimization". For the purposes of our benchmark, data minimization means that the agent uses a piece of potentially sensitive information only if it is "necessary" to complete a particular task. Our benchmark simulates realistic web interaction scenarios end-to-end and is adaptable to all existing web navigation agents. We use AgentDAM to evaluate how well AI agents built on top of GPT-4, Llama-3 and Claude can limit processing of potentially private information, and show that they are prone to inadvertent use of unnecessary sensitive information. We also propose a prompting-based defense that reduces information leakage, and demonstrate that our end-to-end benchmarking provides a more realistic measure than probing LLMs about privacy. Our results highlight that further research is needed to develop AI agents that can prioritize data minimization at inference time.
[336] Neurosymbolic Association Rule Mining from Tabular Data
Erkan Karabulut, Paul Groth, Victoria Degeler
Main category: cs.AI
TL;DR: Aerial+ is a neurosymbolic ARM method that uses an under-complete autoencoder to create neural representations of data and extracts rules from the reconstruction mechanism, achieving state-of-the-art results with concise, high-quality rule sets.
Details
Motivation: High-dimensional datasets in Association Rule Mining (ARM) often lead to rule explosion, increasing execution time and negatively impacting downstream task performance. Managing this rule explosion is a central challenge in ARM research.
Method: Aerial+ uses an under-complete autoencoder to create neural representations of data that capture feature associations. Rules are extracted from this neural representation by exploiting the model’s reconstruction mechanism.
Result: Extensive evaluations on five datasets against seven baselines show Aerial+ achieves state-of-the-art results, learning more concise, high-quality rule sets with full data coverage. When integrated into rule-based interpretable ML models, it significantly reduces execution time while maintaining or improving accuracy.
Conclusion: Aerial+ effectively addresses the rule explosion problem in ARM through its neurosymbolic approach, producing compact yet comprehensive rule sets that enhance both efficiency and performance in downstream applications.
Abstract: Association Rule Mining (ARM) is the task of mining patterns among data features in the form of logical rules, with applications across a myriad of domains. However, high-dimensional datasets often result in an excessive number of rules, increasing execution time and negatively impacting downstream task performance. Managing this rule explosion remains a central challenge in ARM research. To address this, we introduce Aerial+, a novel neurosymbolic ARM method. Aerial+ leverages an under-complete autoencoder to create a neural representation of the data, capturing associations between features. It extracts rules from this neural representation by exploiting the model’s reconstruction mechanism. Extensive evaluations on five datasets against seven baselines demonstrate that Aerial+ achieves state-of-the-art results by learning more concise, high-quality rule sets with full data coverage. When integrated into rule-based interpretable machine learning models, Aerial+ significantly reduces execution time while maintaining or improving accuracy.
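A rough sketch in the spirit of the method: an under-complete autoencoder over one-hot tabular features, with rules read off by clamping a candidate antecedent at the input and thresholding the reconstruction (the baseline input, threshold, and probing scheme here are assumptions):

```python
import torch
import torch.nn as nn

# Under-complete autoencoder over one-hot-encoded tabular data.
class AE(nn.Module):
    def __init__(self, n_features, bottleneck=8):
        super().__init__()
        self.enc = nn.Linear(n_features, bottleneck)
        self.dec = nn.Linear(bottleneck, n_features)

    def forward(self, x):
        return torch.sigmoid(self.dec(torch.relu(self.enc(x))))

def extract_consequents(model, n_features, antecedent_idx, threshold=0.8):
    x = torch.full((1, n_features), 0.5)   # uninformative baseline input
    x[0, antecedent_idx] = 1.0             # assert the antecedent
    with torch.no_grad():
        probs = model(x)[0]
    # Features reconstructed with high probability are read as consequents.
    return [j for j in range(n_features)
            if j != antecedent_idx and probs[j] > threshold]
```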
[337] MathArena: Evaluating LLMs on Uncontaminated Math Competitions
Mislav Balunović, Jasper Dekoninck, Ivo Petrov, Nikola Jovanović, Martin Vechev
Main category: cs.AI
TL;DR: MathArena is a new benchmark that uses recurring math competitions for real-time evaluation of LLMs to prevent contamination, evaluates proof-writing capabilities, and has tested over 50 models across 162 problems from 7 competitions.
Details
Motivation: To address contamination issues in existing math benchmarks and evaluate genuine reasoning and proof-writing capabilities in LLMs, which current benchmarks don't assess.
Method: Using recurring math competitions as a stream of high-quality problems for real-time evaluation, eliminating contamination risk by testing models immediately after problem release.
Result: Found strong contamination signs in AIME 2024, impressive reasoning on harder competitions like CMIMC 2025, and top models achieving nearly 40% on IMO 2025 proof-writing tasks.
Conclusion: MathArena provides rigorous, contamination-free evaluation of mathematical reasoning and proof-writing, showing notable progress but significant room for improvement, and will continue evolving with new competitions.
Abstract: The rapid advancement of reasoning capabilities in large language models (LLMs) has led to notable improvements on mathematical benchmarks. However, many of the most commonly used evaluation datasets (e.g., AIME 2024) are widely available online, making it difficult to disentangle genuine reasoning from potential memorization. Furthermore, these benchmarks do not evaluate proof-writing capabilities, which are crucial for many mathematical tasks. To address this, we introduce MathArena, a new benchmark based on the following key insight: recurring math competitions provide a stream of high-quality, challenging problems that can be used for real-time evaluation of LLMs. By evaluating models as soon as new problems are released, we effectively eliminate the risk of contamination. Using this framework, we find strong signs of contamination in AIME 2024. Nonetheless, evaluations on harder competitions, such as CMIMC 2025, demonstrate impressive reasoning capabilities in top-performing models. MathArena is also the first benchmark for proof-writing capabilities. On IMO 2025, top models achieve slightly less than 40%, demonstrating both notable progress and significant room for improvement. So far, we have evaluated over 50 models across seven competitions, totaling 162 problems. As an evolving benchmark, MathArena will continue to track the progress of LLMs on newly released competitions, ensuring rigorous and up-to-date evaluation of mathematical reasoning.
[338] DrKGC: Dynamic Subgraph Retrieval-Augmented LLMs for Knowledge Graph Completion across General and Biomedical Domains
Yongkang Xiao, Sinian Zhang, Yi Dai, Huixue Zhou, Jue Hou, Jie Ding, Rui Zhang
Main category: cs.AI
TL;DR: DrKGC is a novel approach for knowledge graph completion that combines LLMs with dynamic subgraph retrieval and structural embeddings to better leverage graph structure information.
Details
Motivation: Current LLM-based approaches for knowledge graph completion encode graph context as text, failing to fully utilize LLMs' potential for perceiving and reasoning about graph structures.
Method: Uses lightweight model training to learn structural embeddings and logical rules, employs bottom-up graph retrieval to extract subgraphs guided by learned rules, and integrates GCN adapter to enhance structural embeddings for LLM fine-tuning.
Result: Demonstrates superior performance on two general domain benchmark datasets and two biomedical datasets, with a case study highlighting interpretability and practical utility.
Conclusion: DrKGC effectively bridges the gap between LLMs and graph structure perception, providing a powerful and interpretable solution for knowledge graph completion tasks.
Abstract: Knowledge graph completion (KGC) aims to predict missing triples in knowledge graphs (KGs) by leveraging existing triples and textual information. Recently, generative large language models (LLMs) have been increasingly employed for graph tasks. However, current approaches typically encode graph context in textual form, which fails to fully exploit the potential of LLMs for perceiving and reasoning about graph structures. To address this limitation, we propose DrKGC (Dynamic Subgraph Retrieval-Augmented LLMs for Knowledge Graph Completion). DrKGC employs a flexible lightweight model training strategy to learn structural embeddings and logical rules within the KG. It then leverages a novel bottom-up graph retrieval method to extract a subgraph for each query guided by the learned rules. Finally, a graph convolutional network (GCN) adapter uses the retrieved subgraph to enhance the structural embeddings, which are then integrated into the prompt for effective LLM fine-tuning. Experimental results on two general domain benchmark datasets and two biomedical datasets demonstrate the superior performance of DrKGC. Furthermore, a realistic case study in the biomedical domain highlights its interpretability and practical utility.
[339] Schema Generation for Large Knowledge Graphs Using Large Language Models
Bohui Zhang, Yuan He, Lydia Pintscher, Albert Meroño Peñuela, Elena Simperl
Main category: cs.AI
TL;DR: LLMs can generate high-quality Shape Expressions schemas for knowledge graphs using local and global information, with new datasets and metrics introduced for evaluation.
Details
Motivation: Traditional schema creation requires significant manual effort from experts, while LLMs show promise in ontology engineering tasks, suggesting potential for automated schema generation.
Method: Developed LLM-based pipelines that use both local and global information from knowledge graphs to generate schemas in Shape Expressions format, with new datasets (YAGO Schema and Wikidata EntitySchema) and evaluation metrics.
Result: Experiments show LLMs have strong potential for producing high-quality ShEx schemas, enabling scalable automated schema generation for large knowledge graphs.
Conclusion: LLM-based approaches can automate schema generation for knowledge graphs while introducing new challenges for structured generation with syntactically rich formalisms.
Abstract: Schemas play a vital role in ensuring data quality and supporting usability in the Semantic Web and natural language processing. Traditionally, their creation demands substantial involvement from knowledge engineers and domain experts. Leveraging the impressive capabilities of large language models (LLMs) in tasks like ontology engineering, we explore schema generation using LLMs. To bridge the resource gap, we introduce two datasets: YAGO Schema and Wikidata EntitySchema, along with novel evaluation metrics. The LLM-based pipelines utilize local and global information from knowledge graphs (KGs) to generate schemas in Shape Expressions (ShEx). Experiments demonstrate LLMs’ strong potential in producing high-quality ShEx schemas, paving the way for scalable, automated schema generation for large KGs. Furthermore, our benchmark introduces a new challenge for structured generation, pushing the limits of LLMs on syntactically rich formalisms.
[340] Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs
Xumeng Wen, Zihan Liu, Shun Zheng, Shengyu Ye, Zhirong Wu, Yang Wang, Zhijian Xu, Xiao Liang, Junjie Li, Ziming Miao, Jiang Bian, Mao Yang
Main category: cs.AI
TL;DR: RLVR (Reinforcement Learning with Verifiable Rewards) enhances LLM reasoning by extending reasoning boundaries through free exploration, with improvements validated by a new CoT-Pass@K metric and theoretical framework.
Details
Motivation: To systematically investigate whether RLVR truly enhances reasoning abilities in LLMs or merely improves sampling efficiency, addressing ongoing debates about its effectiveness.
Method: Revisiting Pass@K experiments, introducing CoT-Pass@K evaluation metric, developing theoretical framework for RLVR’s incentive mechanism, and analyzing training dynamics.
Result: RLVR extends reasoning boundaries for mathematical and coding tasks, incentivizes correct reasoning early in the process, and shows substantial improvements in reasoning quality through extensive evaluations.
Conclusion: RLVR has strong potential to enhance LLM reasoning, providing valuable insights into its mechanisms and performance improvements beyond just sampling efficiency.
Abstract: Recent advancements in long chain-of-thought (CoT) reasoning, particularly through the Group Relative Policy Optimization algorithm used by DeepSeek-R1, have led to significant interest in the potential of Reinforcement Learning with Verifiable Rewards (RLVR) for Large Language Models (LLMs). While RLVR promises to improve reasoning by allowing models to learn from free exploration, there remains debate over whether it truly enhances reasoning abilities or simply boosts sampling efficiency. This paper systematically investigates the impact of RLVR on LLM reasoning. We revisit Pass@K experiments and demonstrate that RLVR can extend the reasoning boundary for both mathematical and coding tasks. This is supported by our introduction of a novel evaluation metric, CoT-Pass@K, which captures reasoning success by accounting for both the final answer and intermediate reasoning steps. Furthermore, we present a theoretical framework explaining RLVR’s incentive mechanism, demonstrating how it can encourage correct reasoning even when rewards are based solely on answer correctness. Our analysis of RLVR’s training dynamics reveals that it incentivizes correct reasoning early in the process, with substantial improvements in reasoning quality confirmed through extensive evaluations. These findings provide strong evidence of RLVR’s potential to enhance LLM reasoning, offering valuable insights into its mechanisms and performance improvements.
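The CoT-Pass@K idea can be expressed with the standard unbiased pass@k estimator, counting a sample as correct only when both the final answer and the intermediate reasoning are judged valid; a minimal sketch:

```python
from math import comb

# Unbiased pass@k estimator over joint (answer AND reasoning) correctness.
def cot_pass_at_k(answer_ok, reasoning_ok, k):
    n = len(answer_ok)
    c = sum(a and r for a, r in zip(answer_ok, reasoning_ok))
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 samples: 6 have the right answer, but only 4 also have valid reasoning.
ans = [True] * 6 + [False] * 4
cot = [True] * 4 + [False] * 6
print(cot_pass_at_k(ans, cot, k=1))  # 0.4, versus plain pass@1 = 0.6
```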
[341] VAR-MATH: Probing True Mathematical Reasoning in LLMS via Symbolic Multi-Instance Benchmarks
Jian Yao, Ran Cheng, Kay Chen Tan
Main category: cs.AI
TL;DR: The paper proposes VAR-MATH, a symbolic evaluation framework that converts fixed math problems into parameterized templates to test reasoning consistency and mitigate benchmark contamination, revealing that RL-trained models’ performance drops significantly on variabilized benchmarks.
Details
Motivation: Recent RL improvements in math reasoning appear to persist even with flawed training signals, raising questions about whether gains reflect genuine reasoning or benchmark overfitting due to contamination and fragile evaluation protocols.
Method: VAR-MATH framework transforms fixed numerical problems into symbolic templates with parameters, requiring models to solve multiple instantiations of each problem to test consistency across structurally equivalent variants.
Result: Experimental results show substantial performance drops for RL-trained models on variabilized benchmarks: 47.9% on AMC23, 58.8% on AIME24, and 72.9% on AIME25, especially for smaller models.
Conclusion: Current RL methods for math reasoning rely on superficial heuristics and fail to generalize beyond specific numerical forms, highlighting the need for more robust evaluation paradigms that test genuine reasoning ability.
Abstract: Recent advances in reinforcement learning (RL) have led to substantial improvements in the mathematical reasoning abilities of LLMs, as measured by standard benchmarks. Yet these gains often persist even when models are trained with flawed signals, such as random or inverted rewards. This raises a fundamental question: do such improvements reflect genuine reasoning, or are they merely artifacts of overfitting to benchmark-specific patterns? To answer this question, we adopt an evaluation-centric perspective and highlight two critical shortcomings in existing protocols. First, benchmark contamination arises because test problems are publicly available, thereby increasing the risk of data leakage. Second, evaluation fragility results from reliance on single-instance assessments, which are sensitive to stochastic outputs and fail to capture reasoning consistency. These limitations suggest the need for a new evaluation paradigm that can probe reasoning ability beyond memorization and one-off success. In response, we propose VAR-MATH, a symbolic evaluation framework that converts fixed numerical problems into parameterized templates and requires models to solve multiple instantiations of each. This design enforces consistency across structurally equivalent variants, mitigates contamination, and enhances robustness through bootstrapped metrics. We apply VAR-MATH to transform three popular benchmarks, AMC23, AIME24, and AIME25, into their symbolic counterparts, VAR-AMC23, VAR-AIME24, and VAR-AIME25. Experimental results show substantial performance drops for RL-trained models on these variabilized benchmarks, especially for smaller models, with average declines of 47.9% on AMC23, 58.8% on AIME24, and 72.9% on AIME25. These findings indicate that some existing RL methods rely on superficial heuristics and fail to generalize beyond specific numerical forms.
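A minimal sketch of the evaluation protocol on a hypothetical template (`solve` stands in for a model call); credit is all-or-nothing across instantiations:

```python
import random

# VAR-MATH-style scoring: a fixed problem becomes a symbolic template, and
# a model is credited only if it solves every sampled instantiation.
def var_math_score(solve, n_instances=3, seed=0):
    rng = random.Random(seed)
    for _ in range(n_instances):
        a, b = rng.randint(2, 9), rng.randint(2, 9)
        prompt = f"Compute the sum of the first {a} multiples of {b}."
        truth = b * a * (a + 1) // 2
        if solve(prompt) != truth:
            return 0.0  # all-or-nothing over structurally equal variants
    return 1.0
```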
[342] Defend LLMs Through Self-Consciousness
Boshi Huang, Fabio Nonato de Paula
Main category: cs.AI
TL;DR: A self-consciousness defense mechanism for LLMs that uses Meta-Cognitive and Arbitration Modules to autonomously evaluate and regulate outputs against prompt injection attacks.
Details
Motivation: To combat prompt injection attacks in LLMs using their inherent reasoning capabilities rather than external classifiers, providing a lightweight and cost-effective solution.
Method: Proposes a framework with Meta-Cognitive and Arbitration Modules that enable LLMs to perform self-protection through autonomous output evaluation and regulation.
Result: Significant improvements in defense success rates across 7 state-of-the-art LLMs on AdvBench and Prompt-Injection-Mixed-Techniques-2024 datasets, with some achieving perfect/near-perfect defense in Enhanced Mode.
Conclusion: The self-consciousness method offers an effective, lightweight solution for enhancing LLM ethics and is particularly beneficial for GenAI use cases across various platforms.
Abstract: This paper introduces a novel self-consciousness defense mechanism for Large Language Models (LLMs) to combat prompt injection attacks. Unlike traditional approaches that rely on external classifiers, our method leverages the LLM’s inherent reasoning capabilities to perform self-protection. We propose a framework that incorporates Meta-Cognitive and Arbitration Modules, enabling LLMs to evaluate and regulate their own outputs autonomously. Our approach is evaluated on seven state-of-the-art LLMs using two datasets: AdvBench and Prompt-Injection-Mixed-Techniques-2024. Experiment results demonstrate significant improvements in defense success rates across models and datasets, with some achieving perfect and near-perfect defense in Enhanced Mode. We also analyze the trade-off between defense success rate improvement and computational overhead. This self-consciousness method offers a lightweight, cost-effective solution for enhancing LLM ethics, particularly beneficial for GenAI use cases across various platforms.
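A rough sketch of the two-module flow with a hypothetical `llm` chat function; the paper's actual prompts and arbitration logic will differ:

```python
def self_conscious_respond(llm, user_input):
    draft = llm(user_input)
    # Meta-Cognitive Module: the model audits its own draft for signs of
    # injected instructions or policy-violating content.
    critique = llm(
        "You are auditing your own draft answer for prompt injection.\n"
        f"User input: {user_input}\nDraft: {draft}\n"
        "Answer SAFE or UNSAFE with a one-line reason."
    )
    # Arbitration Module: decide whether to release the draft or refuse.
    if critique.strip().upper().startswith("UNSAFE"):
        return "I can't help with that request."
    return draft
```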
[343] Disentangling Multiplex Spatial-Temporal Transition Graph Representation Learning for Socially Enhanced POI Recommendation
Jie Li, Haoye Dong, Zhengyang Wu, Zetao Zheng, Mingrong Lin
Main category: cs.AI
TL;DR: DiMuST is a POI recommendation model that uses disentangled representation learning on multiplex spatial-temporal graphs to address misalignment issues in existing methods, achieving superior performance.
Details
Motivation: Existing POI recommendation models separately model spatial and temporal transitions, causing misaligned representations and redundant information that increases model uncertainty and reduces interpretability.
Method: Uses a Disentangled variational multiplex graph Auto-Encoder (DAE) with multiplex spatial-temporal graphs to disentangle shared and private distributions, fuses shared features via Product of Experts (PoE), and denoises private features through contrastive constraints.
Result: Experiments on two challenging datasets show DiMuST significantly outperforms existing methods across multiple metrics.
Conclusion: DiMuST effectively captures spatial-temporal transition representations while preserving intrinsic correlations, addressing the misalignment problem in POI recommendation.
Abstract: Next Point-of-Interest (POI) recommendation is a research hotspot in business intelligence, where users’ spatial-temporal transitions and social relationships play key roles. However, most existing works model spatial and temporal transitions separately, leading to misaligned representations of the same spatial-temporal key nodes. This misalignment introduces redundant information during fusion, increasing model uncertainty and reducing interpretability. To address this issue, we propose DiMuST, a socially enhanced POI recommendation model based on disentangled representation learning over multiplex spatial-temporal transition graphs. The model employs a novel Disentangled variational multiplex graph Auto-Encoder (DAE), which first disentangles shared and private distributions using a multiplex spatial-temporal graph strategy. It then fuses the shared features via a Product of Experts (PoE) mechanism and denoises the private features through contrastive constraints. The model effectively captures the spatial-temporal transition representations of POIs while preserving the intrinsic correlation of their spatial-temporal relationships. Experiments on two challenging datasets demonstrate that our DiMuST significantly outperforms existing methods across multiple metrics.
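For Gaussian experts, the Product of Experts fusion has a closed form: precisions add, and the fused mean is precision-weighted. A minimal sketch of this common instantiation, which may differ from the paper's exact parameterization:

```python
import numpy as np

# Closed-form PoE for Gaussian experts: precisions add, and the fused
# mean is the precision-weighted average of the expert means.
def poe_gaussian(mus, vars_):
    precision = sum(1.0 / v for v in vars_)
    var = 1.0 / precision
    mu = var * sum(m / v for m, v in zip(mus, vars_))
    return mu, var

# Fuse the spatial-graph and temporal-graph estimates of a shared factor.
mu, var = poe_gaussian(mus=[np.array([0.2]), np.array([0.6])],
                       vars_=[np.array([0.1]), np.array([0.3])])
print(mu, var)  # fused mean lies closer to the more confident expert
```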
[344] Aligning Reasoning LLMs for Materials Discovery with Physics-aware Rejection Sampling
Lee Hyun, Sohee Yoon, Jinwoo Park, Sue In Chae, Seongeon Park, Jooyeon Ahn, Yebin Jung, Youjung Chung, Hogeun Chang, Sujin Park, Myeonginn Kang, Jina Kim, Ho-Gyeong Kim, Myeonghun Jeong
Main category: cs.AI
TL;DR: Physics-aware Rejection Sampling (PaRS) improves AI-driven materials discovery by selecting reasoning traces that are physically admissible and numerically accurate, enhancing model reliability and efficiency.
Details
Motivation: Current AI methods for materials discovery lack physical admissibility in their reasoning traces, leading to unreliable predictions that violate fundamental physics principles.
Method: Introduces Physics-aware Rejection Sampling (PaRS) that selects reasoning traces based on physical consistency and numerical accuracy, using lightweight halting to control computational costs.
Result: PaRS improves accuracy and calibration, reduces physics-violation rates, and lowers sampling costs compared to existing rejection sampling baselines.
Conclusion: Domain-aware constraints combined with trace-level selection provide a practical path toward reliable, efficient large reasoning models for process-aware property prediction and closed-loop materials design.
Abstract: AI-driven materials discovery that couples automated experimentation with algorithmic decision-making requires process-aware recipe-to-property predictors that are accurate, calibrated, and physically admissible. We approach this as a reasoning problem with large reasoning models (LRMs). To instill reasoning capability into language models, we curate reasoning traces from a teacher model to train a student model. However, most training pipelines select reasoning traces using binary correctness or learned preference signals that poorly reflect physical admissibility. We introduce Physics-aware Rejection Sampling (PaRS), a training-time trace selection scheme that favors traces consistent with fundamental physics and numerically close to targets, with lightweight halting to control compute. We instantiate our framework with a large student model fine-tuned on traces synthesized by a larger teacher model, and evaluate under matched token budgets against various rejection sampling baselines. Our method improves accuracy and calibration, reduces physics-violation rates, and lowers sampling cost relative to baselines. These results indicate that modest, domain-aware constraints combined with trace-level selection provide a practical path toward reliable, efficient LRMs for process-aware property prediction and closed-loop materials design.
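A minimal sketch of the selection loop, with `sample_trace`, `violates_physics`, and `predict` as hypothetical domain hooks:

```python
# Physics-aware rejection sampling: keep only teacher traces that pass
# physics checks and land numerically close to the target property.
def pars_filter(problem, target, sample_trace, violates_physics, predict,
                n_samples=16, rel_tol=0.05, max_keep=4):
    kept = []
    for _ in range(n_samples):
        trace = sample_trace(problem)
        if violates_physics(trace):
            continue                       # physically inadmissible: reject
        if abs(predict(trace) - target) > rel_tol * abs(target):
            continue                       # numerically off-target: reject
        kept.append(trace)
        if len(kept) >= max_keep:          # lightweight halting caps compute
            break
    return kept
```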
[345] The Future of Artificial Intelligence and the Mathematical and Physical Sciences (AI+MPS)
Andrew Ferguson, Marisa LaFleur, Lars Ruthotto, Jesse Thaler, Yuan-Sen Ting, Pratyush Tiwary, Soledad Villar, E. Paulo Alves, Jeremy Avigad, Simon Billinge, Camille Bilodeau, Keith Brown, Emmanuel Candes, Arghya Chattopadhyay, Bingqing Cheng, Jonathan Clausen, Connor Coley, Andrew Connolly, Fred Daum, Sijia Dong, Chrisy Xiyu Du, Cora Dvorkin, Cristiano Fanelli, Eric B. Ford, Luis Manuel Frutos, Nicolás García Trillos, Cecilia Garraffo, Robert Ghrist, Rafael Gomez-Bombarelli, Gianluca Guadagni, Sreelekha Guggilam, Sergei Gukov, Juan B. Gutiérrez, Salman Habib, Johannes Hachmann, Boris Hanin, Philip Harris, Murray Holland, Elizabeth Holm, Hsin-Yuan Huang, Shih-Chieh Hsu, Nick Jackson, Olexandr Isayev, Heng Ji, Aggelos Katsaggelos, Jeremy Kepner, Yannis Kevrekidis, Michelle Kuchera, J. Nathan Kutz, Branislava Lalic, Ann Lee, Matt LeBlanc, Josiah Lim, Rebecca Lindsey, Yongmin Liu, Peter Y. Lu, Sudhir Malik, Vuk Mandic, Vidya Manian, Emeka P. Mazi, Pankaj Mehta, Peter Melchior, Brice Ménard, Jennifer Ngadiuba, Stella Offner, Elsa Olivetti, Shyue Ping Ong, Christopher Rackauckas, Philippe Rigollet, Chad Risko, Philip Romero, Grant Rotskoff, Brett Savoie, Uros Seljak, David Shih, Gary Shiu, Dima Shlyakhtenko, Eva Silverstein, Taylor Sparks, Thomas Strohmer, Christopher Stubbs, Stephen Thomas, Suriyanarayanan Vaikuntanathan, Rene Vidal, Francisco Villaescusa-Navarro, Gregory Voth, Benjamin Wandelt, Rachel Ward, Melanie Weber, Risa Wechsler, Stephen Whitelam, Olaf Wiest, Mike Williams, Zhuoran Yang, Yaroslava G. Yingling, Bin Yu, Shuwen Yue, Ann Zabludoff, Huimin Zhao, Tong Zhang
Main category: cs.AI
TL;DR: This community paper from an NSF workshop outlines strategies to strengthen the bidirectional relationship between AI and Mathematical & Physical Sciences (MPS), proposing activities to enable research, build interdisciplinary communities, and foster education.
Details
Motivation: To understand how MPS domains (Astronomy, Chemistry, Materials Research, Mathematical Sciences, Physics) can best capitalize on and contribute to AI development, and to strengthen the inextricable link between AI and science at this crucial moment.
Method: Community perspective developed through NSF Workshop on Future of AI and MPS, proposing strategic priorities including: enabling bidirectional AI+MPS research, building interdisciplinary communities, and fostering education/workforce development.
Result: A comprehensive framework and snapshot of the MPS community’s perspective on AI integration, with specific recommendations for funding agencies, educational institutions, and researchers to position MPS as leaders in AI+science transformation.
Conclusion: The paper concludes with prioritized recommendations to help the MPS community lead in and fully leverage the transformative potential of AI+MPS integration through strategic funding, education, and research initiatives.
Abstract: This community paper developed out of the NSF Workshop on the Future of Artificial Intelligence (AI) and the Mathematical and Physics Sciences (MPS), which was held in March 2025 with the goal of understanding how the MPS domains (Astronomy, Chemistry, Materials Research, Mathematical Sciences, and Physics) can best capitalize on, and contribute to, the future of AI. We present here a summary and snapshot of the MPS community’s perspective, as of Spring/Summer 2025, in a rapidly developing field. The link between AI and MPS is becoming increasingly inextricable; now is a crucial moment to strengthen the link between AI and Science by pursuing a strategy that proactively and thoughtfully leverages the potential of AI for scientific discovery and optimizes opportunities to impact the development of AI by applying concepts from fundamental science. To achieve this, we propose activities and strategic priorities that: (1) enable AI+MPS research in both directions; (2) build up an interdisciplinary community of AI+MPS researchers; and (3) foster education and workforce development in AI for MPS researchers and students. We conclude with a summary of suggested priorities for funding agencies, educational institutions, and individual researchers to help position the MPS community to be a leader in, and take full advantage of, the transformative potential of AI+MPS.
[346] nDNA – the Semantic Helix of Artificial Cognition
Amitava Das
Main category: cs.AI
TL;DR: Proposes Neural DNA (nDNA) as a semantic-genotypic representation that captures AI models’ latent cognitive identity through intrinsic geometry of belief, enabling tracing of model lineages, inheritance, and evolution.
Details
Motivation: To understand what shapes AI foundation models' internal cognitive identity beyond just output fluency, by capturing their latent geometry as a form of digital DNA that encodes ancestry, mutation, and semantic inheritance.
Method: nDNA synthesizes three dimensions of latent geometry: spectral curvature (conceptual flow across layers), thermodynamic length (semantic effort in representational transitions), and belief vector field (semantic torsion fields guiding belief orientations). Models are treated as semantic fluid dynamics systems.
Result: Creates a stable, coordinate-free neural DNA fingerprint that enables tracing model lineages across pretraining, fine-tuning, alignment, pruning, distillation, and merges; measuring inheritance between checkpoints; detecting drift; and studying evolution of artificial cognition.
Conclusion: Establishes Neural Genomics as a new field where AI models are viewed as digital semantic organisms with traceable inner cognition, enabling model comparison, risk diagnosis, and governance of cognitive change over time.
Abstract: As AI foundation models grow in capability, a deeper question emerges: What shapes their internal cognitive identity – beyond fluency and output? Benchmarks measure behavior, but the soul of a model resides in its latent geometry. In this work, we propose Neural DNA (nDNA) as a semantic-genotypic representation that captures this latent identity through the intrinsic geometry of belief. At its core, nDNA is synthesized from three principled and indispensable dimensions of latent geometry: spectral curvature, which reveals the curvature of conceptual flow across layers; thermodynamic length, which quantifies the semantic effort required to traverse representational transitions through layers; and belief vector field, which delineates the semantic torsion fields that guide a model’s belief directional orientations. Like biological DNA, it encodes ancestry, mutation, and semantic inheritance, found in finetuning and alignment scars, cultural imprints, and architectural drift. In naming it, we open a new field: Neural Genomics, where models are not just tools, but digital semantic organisms with traceable inner cognition. Modeling statement. We read AI foundation models as semantic fluid dynamics: meaning is transported through layers like fluid in a shaped conduit; nDNA is the physics-grade readout of that flow – a geometry-first measure of how meaning is bent, paid for, and pushed – yielding a stable, coordinate-free neural DNA fingerprint tied to on-input behavior; with this fingerprint we cross into biology: tracing lineages across pretraining, fine-tuning, alignment, pruning, distillation, and merges; measuring inheritance between checkpoints; detecting drift as traits shift under new data or objectives; and, ultimately, studying the evolution of artificial cognition to compare models, diagnose risks, and govern change over time.
[347] Adaptive Cybersecurity Architecture for Digital Product Ecosystems Using Agentic AI
Oluwakemi T. Olayinka, Sumeet Jeswani, Divine Iloh
Main category: cs.AI
TL;DR: This paper introduces an autonomous agent-based cybersecurity framework using agentic AI for dynamic threat detection and mitigation in complex digital ecosystems.
Details
Motivation: Traditional static cybersecurity models are inadequate for modern digital ecosystems with cloud services, APIs, mobile platforms, and edge devices due to scalability issues, lack of real-time detection, and poor contextual responsiveness.
Method: The framework integrates agentic AI across ecosystem layers with autonomous goal-driven agents capable of dynamic learning and context-aware decision making. It employs behavioral baselining, decentralized risk scoring, and federated threat intelligence sharing.
Result: Native cloud simulations demonstrated the system’s ability to detect zero-day attacks and dynamically modify access policies. Evaluation showed increased adaptability, decreased response latency, and improved detection accuracy.
Conclusion: The architecture provides an intelligent and scalable blueprint for safeguarding complex digital infrastructure, compatible with zero-trust models and supporting adherence to international cybersecurity regulations.
Abstract: Traditional static cybersecurity models often struggle with scalability, real-time detection, and contextual responsiveness in the current digital product ecosystems which include cloud services, application programming interfaces (APIs), mobile platforms, and edge devices. This study introduces autonomous goal driven agents capable of dynamic learning and context-aware decision making as part of an adaptive cybersecurity architecture driven by agentic artificial intelligence (AI). To facilitate autonomous threat mitigation, proactive policy enforcement, and real-time anomaly detection, this framework integrates agentic AI across the key ecosystem layers. Behavioral baselining, decentralized risk scoring, and federated threat intelligence sharing are important features. The capacity of the system to identify zero-day attacks and dynamically modify access policies was demonstrated through native cloud simulations. The evaluation results show increased adaptability, decreased response latency, and improved detection accuracy. The architecture provides an intelligent and scalable blueprint for safeguarding complex digital infrastructure and is compatible with zero-trust models, thereby supporting the adherence to international cybersecurity regulations.
[348] DS-STAR: Data Science Agent via Iterative Planning and Verification
Jaehyun Nam, Jinsung Yoon, Jiefeng Chen, Tomas Pfister
Main category: cs.AI
TL;DR: DS-STAR is a novel data science agent that addresses LLM limitations in data science tasks by automatically analyzing diverse data formats, verifying analysis plan sufficiency, and iteratively refining plans through sequential planning.
Details
Motivation: Large language models struggle with heterogeneous data formats and generating optimal analysis plans for complex data science tasks, as verifying plan sufficiency is difficult without ground-truth labels.
Method: DS-STAR features: (1) data file analysis module for exploring diverse data formats, (2) LLM-based judge for verifying analysis plan sufficiency, and (3) sequential planning that starts simple and iteratively refines plans based on feedback.
Result: DS-STAR achieves state-of-the-art performance on DABStep, KramaBench, and DA-Code benchmarks, with particularly strong performance on hard tasks requiring processing multiple data files with heterogeneous formats.
Conclusion: The iterative refinement approach allows DS-STAR to reliably navigate complex analyses involving diverse data sources, overcoming limitations of current LLM-based approaches in data science automation.
Abstract: Data science, which transforms raw data into actionable insights, is critical for data-driven decision-making. However, these tasks are often complex, involving steps for exploring multiple data sources and synthesizing findings to deliver insightful answers. While large language models (LLMs) show significant promise in automating this process, they often struggle with heterogeneous data formats and generate sub-optimal analysis plans, as verifying plan sufficiency is inherently difficult without ground-truth labels for such open-ended tasks. To overcome these limitations, we introduce DS-STAR, a novel data science agent. Specifically, DS-STAR makes three key contributions: (1) a data file analysis module that automatically explores and extracts context from diverse data formats, including unstructured types; (2) a verification step where an LLM-based judge evaluates the sufficiency of the analysis plan at each stage; and (3) a sequential planning mechanism that starts with a simple, executable plan and iteratively refines it based on the DS-STAR’s feedback until its sufficiency is verified. This iterative refinement allows DS-STAR to reliably navigate complex analyses involving diverse data sources. Our experiments show that DS-STAR achieves state-of-the-art performance across three challenging benchmarks: DABStep, KramaBench, and DA-Code. Moreover, DS-STAR particularly outperforms baselines on hard tasks that require processing multiple data files with heterogeneous formats.
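A compact sketch of the plan-verify-refine loop with a hypothetical `llm` call; the actual DS-STAR modules and prompts are more elaborate:

```python
def ds_star(llm, question, data_context, max_rounds=5):
    plan = llm(f"Write a minimal executable analysis plan for: {question}\n"
               f"Data context:\n{data_context}")
    for _ in range(max_rounds):
        # LLM-as-judge verifies whether the current plan is sufficient.
        verdict = llm(f"Question: {question}\nPlan:\n{plan}\n"
                      "Is this plan sufficient to answer the question? "
                      "Reply SUFFICIENT or list what is missing.")
        if verdict.strip().upper().startswith("SUFFICIENT"):
            break
        # Otherwise, refine the plan using the judge's feedback.
        plan = llm(f"Refine the plan to address this feedback:\n{verdict}\n"
                   f"Current plan:\n{plan}")
    return plan
```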
[349] GSM-Agent: Understanding Agentic Reasoning Using Controllable Environments
Hanlin Zhu, Tianyu Guo, Song Mei, Stuart Russell, Nikhil Ghosh, Alberto Bietti, Jiantao Jiao
Main category: cs.AI
TL;DR: GSM-Agent benchmark tests LLM agentic reasoning by requiring models to solve grade-school math problems without premises, forcing proactive information collection through tools. Even frontier models struggle, achieving only 67% accuracy.
Details
Motivation: Current agent benchmarks mix agentic reasoning with other advanced capabilities, making it hard to isolate and study pure agentic reasoning - the ability to combine tool use and reasoning.
Method: Created GSM-Agent benchmark where LLMs solve grade-school math problems but must proactively collect necessary information using tools. Proposed agentic reasoning graphs to analyze patterns and developed tool-augmented test-time scaling to improve performance.
Result: Frontier models like GPT-5 only achieve 67% accuracy on grade-school level problems when requiring agentic reasoning. Analysis revealed models often lack the ability to revisit previously visited nodes, a crucial reasoning pattern.
Conclusion: The benchmark and agentic reasoning framework provide tools for understanding and improving LLM agentic reasoning capabilities, highlighting significant gaps even in simple reasoning tasks when tool use is required.
Abstract: As LLMs are increasingly deployed as agents, agentic reasoning - the ability to combine tool use, especially search, and reasoning - becomes a critical skill. However, it is hard to disentangle agentic reasoning when evaluated in complex environments and tasks. Current agent benchmarks often mix agentic reasoning with challenging math reasoning, expert-level knowledge, and other advanced capabilities. To fill this gap, we build a novel benchmark, GSM-Agent, where an LLM agent is required to solve grade-school-level reasoning problems, but is only presented with the question in the prompt without the premises that contain the necessary information to solve the task, and needs to proactively collect that information using tools. Although the original tasks are grade-school math problems, we observe that even frontier models like GPT-5 only achieve 67% accuracy. To understand and analyze the agentic reasoning patterns, we propose the concept of agentic reasoning graph: cluster the environment’s document embeddings into nodes, and map each tool call to its nearest node to build a reasoning path. Surprisingly, we identify that the ability to revisit a previously visited node, widely taken as a crucial pattern in static reasoning, is often missing for agentic reasoning for many models. Based on the insight, we propose a tool-augmented test-time scaling method to improve LLM’s agentic reasoning performance by adding tools to encourage models to revisit. We expect our benchmark and the agentic reasoning framework to aid future studies of understanding and pushing the boundaries of agentic reasoning.
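A minimal sketch of building an agentic reasoning path and counting revisits, using k-means centroids as nodes (one plausible instantiation of the clustering described above):

```python
from sklearn.cluster import KMeans

# Cluster the environment's document embeddings into graph nodes, map each
# tool call to its nearest node (nearest centroid), and count revisits.
def reasoning_path(doc_embeddings, call_embeddings, n_nodes=8, seed=0):
    km = KMeans(n_clusters=n_nodes, random_state=seed, n_init=10)
    km.fit(doc_embeddings)
    path = km.predict(call_embeddings).tolist()
    revisits = len(path) - len(dict.fromkeys(path))  # repeated-node visits
    return path, revisits
```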
[350] Evaluating LLMs for Combinatorial Optimization: One-Phase and Two-Phase Heuristics for 2D Bin-Packing
Syed Mahbubul Huq, Daniel Brito, Daniel Sikar, Chris Child, Tillman Weyde, Rajesh Mojumder
Main category: cs.AI
TL;DR: This paper evaluates LLMs for combinatorial optimization, specifically 2D bin-packing, by combining LLMs with evolutionary algorithms to generate efficient heuristics.
Details
Motivation: To assess LLM capabilities in specialized domains like combinatorial optimization and establish benchmarks for evaluating LLM performance in such tasks.
Method: Systematic methodology combining LLMs with evolutionary algorithms to iteratively generate and refine heuristic solutions for 2D bin-packing problems.
Result: GPT-4o achieved optimal solutions within two iterations, reducing average bin usage from 16 to 15 bins and improving space utilization from 0.76-0.78 to 0.83, outperforming traditional approaches.
Conclusion: LLMs can produce more efficient solutions than traditional methods while requiring fewer computational resources, contributing to understanding LLM evaluation in specialized optimization domains.
Abstract: This paper presents an evaluation framework for assessing Large Language Models’ (LLMs) capabilities in combinatorial optimization, specifically addressing the 2D bin-packing problem. We introduce a systematic methodology that combines LLMs with evolutionary algorithms to generate and refine heuristic solutions iteratively. Through comprehensive experiments comparing LLM generated heuristics against traditional approaches (Finite First-Fit and Hybrid First-Fit), we demonstrate that LLMs can produce more efficient solutions while requiring fewer computational resources. Our evaluation reveals that GPT-4o achieves optimal solutions within two iterations, reducing average bin usage from 16 to 15 bins while improving space utilization from 0.76-0.78 to 0.83. This work contributes to understanding LLM evaluation in specialized domains and establishes benchmarks for assessing LLM performance in combinatorial optimization tasks.
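For reference, a shelf-based first-fit heuristic is one simple baseline of the kind compared against; a minimal sketch (the paper's exact Finite First-Fit and Hybrid First-Fit variants may differ):

```python
# Shelf-based first-fit for 2D bin-packing: each bin is a stack of shelves;
# a rectangle goes on the first shelf (in the first bin) where it fits,
# otherwise a new shelf or a new bin is opened.
def shelf_first_fit(rects, bin_w, bin_h):
    bins = []  # each bin: list of [shelf_height, used_width]
    for w, h in sorted(rects, key=lambda r: -r[1]):  # tall pieces first
        placed = False
        for shelves in bins:
            for shelf in shelves:
                if shelf[0] >= h and shelf[1] + w <= bin_w:
                    shelf[1] += w; placed = True; break
            if placed:
                break
            used_h = sum(s[0] for s in shelves)
            if used_h + h <= bin_h and w <= bin_w:   # open a new shelf
                shelves.append([h, w]); placed = True; break
        if not placed:                               # open a new bin
            bins.append([[h, w]])
    return len(bins)

print(shelf_first_fit([(4, 3), (3, 3), (5, 2), (2, 2)], bin_w=6, bin_h=6))  # 2
```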
[351] StepORLM: A Self-Evolving Framework With Generative Process Supervision For Operations Research Language Models
Chenyu Zhou, Tianyi Xu, Jianghao Lin, Dongdong Ge
Main category: cs.AI
TL;DR: StepORLM is a self-evolving framework that addresses limitations in LLM training for Operations Research problems through generative process supervision and a co-evolutionary loop between policy and reward models.
Details
Motivation: Existing LLM training for OR problems faces two key limitations: outcome reward suffers from credit assignment problems (correct final answers can reinforce flawed reasoning), and conventional discriminative process supervision is myopic and fails to evaluate interdependent steps holistically.
Method: StepORLM features a co-evolutionary loop where a policy model and generative process reward model (GenPRM) iteratively improve each other using dual feedback: outcome-based verification from external solvers and holistic process evaluation from GenPRM. The combined signal aligns policy via Weighted Direct Preference Optimization (W-DPO) and refines GenPRM.
Result: The 8B-parameter StepORLM establishes new state-of-the-art across six benchmarks, significantly outperforming larger generalist models, agentic methods, and specialized baselines. The co-evolved GenPRM acts as a powerful universal process verifier, substantially boosting inference scaling performance of both StepORLM and other existing LLMs.
Conclusion: StepORLM successfully addresses key limitations in LLM training for OR problems through generative process supervision and co-evolution, achieving superior performance and creating a universally applicable process verification capability.
Abstract: Large Language Models (LLMs) have shown promising capabilities for solving Operations Research (OR) problems. While reinforcement learning serves as a powerful paradigm for LLM training on OR problems, existing works generally face two key limitations. First, outcome reward suffers from the credit assignment problem, where correct final answers can reinforce flawed reasoning. Second, conventional discriminative process supervision is myopic, failing to evaluate the interdependent steps of OR modeling holistically. To this end, we introduce StepORLM, a novel self-evolving framework with generative process supervision. At its core, StepORLM features a co-evolutionary loop where a policy model and a generative process reward model (GenPRM) iteratively improve on each other. This loop is driven by a dual-feedback mechanism: definitive, outcome-based verification from an external solver, and nuanced, holistic process evaluation from the GenPRM. The combined signal is used to align the policy via Weighted Direct Preference Optimization (W-DPO) and simultaneously refine the GenPRM. Our resulting 8B-parameter StepORLM establishes a new state-of-the-art across six benchmarks, significantly outperforming vastly larger generalist models, agentic methods, and specialized baselines. Moreover, the co-evolved GenPRM is able to act as a powerful and universally applicable process verifier, substantially boosting the inference scaling performance of both our own model and other existing LLMs.
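The summary does not spell out the exact form of W-DPO. One plausible minimal reading is the standard DPO objective with a per-pair weight supplied by the GenPRM's holistic process score, sketched here with illustrative names; this is not the paper's verified objective.

```python
import torch.nn.functional as F

def weighted_dpo_loss(logp_chosen, logp_rejected,
                      ref_logp_chosen, ref_logp_rejected,
                      weights, beta=0.1):
    """Standard DPO preference margin, scaled per-pair by a weight
    (here imagined to come from the GenPRM's process evaluation)."""
    margins = beta * ((logp_chosen - ref_logp_chosen)
                      - (logp_rejected - ref_logp_rejected))
    # Higher-weight pairs contribute more to the alignment update.
    return (weights * -F.logsigmoid(margins)).mean()
```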
[352] Beyond the Strongest LLM: Multi-Turn Multi-Agent Orchestration vs. Single LLMs on Benchmarks
Aaron Xuxiang Tian, Ruofan Zhang, Jiayao Tang, Young Min Cho, Xueqian Li, Qiang Yi, Ji Wang, Zhunping Zhang, Danrui Qi, Zekun Li, Xingyu Xiang, Sharath Chandra Guntuku, Lyle Ungar, Tianyu Shi, Chi Wang
Main category: cs.AI
TL;DR: Multi-agent orchestration with multiple LLMs interacting over multiple turns through answer proposals and voting achieves performance matching or exceeding the strongest single model baseline.
Details
Motivation: To study how multi-turn multi-agent orchestration can improve performance over single LLM baselines by leveraging collective intelligence through iterative interactions.
Method: Used four LLMs (Gemini 2.5 Pro, GPT-5, Grok 4, Claude Sonnet 4) on GPQA-Diamond, IFEval, and MuSR datasets. Conducted two experiments: benchmarking orchestration against single-LLM baselines, and ablations varying whether agents see answer authorship and ongoing votes.
Result: Orchestration matches or exceeds the strongest single model and consistently outperforms others. Analysis shows potential for further gains. Ablations reveal that revealing authorship increases self-voting and ties, while showing ongoing votes amplifies herding behavior.
Conclusion: Multi-agent orchestration is effective for improving LLM performance, but careful design is needed to manage biases like self-voting and herding that can affect consensus quality.
Abstract: We study multi-turn multi-agent orchestration, where multiple large language model (LLM) agents interact over multiple turns by iteratively proposing answers or casting votes until reaching consensus. Using four LLMs (Gemini 2.5 Pro, GPT-5, Grok 4, and Claude Sonnet 4) on GPQA-Diamond, IFEval, and MuSR, we conduct two experiments: (i) benchmarking orchestration against single-LLM baselines; and (ii) ablations on GPQA-Diamond that vary whether agents see who authored answers and whether they can observe ongoing votes. Orchestration matches or exceeds the strongest single model and consistently outperforms the others. Analysis of best-achievable orchestration performance shows potential for further gains. The ablations show that revealing authorship increases self-voting and ties, and that showing ongoing votes amplifies herding, which speeds convergence but can sometimes yield premature consensus.
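A minimal sketch of the propose-then-vote loop the abstract describes; the `propose()`/`vote()` interface is an illustrative assumption, not the authors' harness.

```python
from collections import Counter

def orchestrate(agents, question, max_turns=5):
    """Toy consensus loop: each turn, agents vote over the current
    proposals; stop at a majority, otherwise re-propose with the
    current proposals as context."""
    proposals = [a.propose(question) for a in agents]
    answer = proposals[0]
    for _ in range(max_turns):
        votes = Counter(a.vote(question, proposals) for a in agents)
        answer, count = votes.most_common(1)[0]
        if count > len(agents) / 2:  # majority consensus reached
            break
        proposals = [a.propose(question, context=proposals) for a in agents]
    return answer  # plurality winner if no majority emerges
```

The ablation findings map directly onto this loop's design choices: whether `vote()` sees who authored each proposal, and whether it sees the running `votes` tally.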
[353] Rethinking and Benchmarking Large Language Models for Graph Reasoning
Yuwei Hu, Xinyi Huang, Zhewei Wei, Yongchao Liu, Chuntao Hong
Main category: cs.AI
TL;DR: LLMs for graph reasoning are currently underestimated due to improper focus on replicating algorithms rather than designing them. A new challenging benchmark GraphAlgorithm and a simple baseline method Simple-RTC significantly outperform existing approaches.
Details
Motivation: Existing LLM methods for graph reasoning underperform because they focus on replicating graph algorithms rather than designing them. Current benchmarks are not sufficiently challenging to evaluate true graph reasoning capabilities.
Method: Proposed Simple-RTC (Simple-Reasoning-Then-Coding) method that guides LLMs to first design graph algorithms and then code to solve graph reasoning tasks. Created GraphAlgorithm benchmark with 239 graph problems and 3,041 test instances from 4 competition platforms.
Result: Simple-RTC achieves near-perfect accuracy on existing benchmarks and significantly outperforms GPT-4o-mini and all prior methods on the GraphAlgorithm benchmark.
Conclusion: Redirecting LLM reasoning focus from algorithm replication to algorithm design enables solving most graph reasoning tasks. The strong Simple-RTC baseline encourages future advancements in LLMs for graph reasoning.
Abstract: Large Language Models (LLMs) for Graph Reasoning have been extensively studied over the past two years, which involves enabling LLMs to understand graph structures and reason on graphs to solve various graph problems, with graph algorithm problems being the most prevalent. Recent studies underscore the potential of LLMs in handling graph reasoning tasks, but their performance is underwhelming. In this work, we point out issues with existing methods and benchmarks, and rethink the direction that LLMs for graph reasoning should strive toward. We find that base models, e.g., GPT-4o-mini, are largely underestimated due to improper reasoning focus. Base models with reasoning focus redirected from replicating graph algorithms to designing them can easily solve most graph reasoning tasks in existing benchmarks. To truly evaluate the graph reasoning capabilities of LLMs, we construct a more challenging GraphAlgorithm benchmark, comprising 239 different graph problems and 3,041 test instances collected from 4 competition platforms. Finally, we introduce a simple and strong baseline, Simple-Reasoning-Then-Coding (Simple-RTC), which guides LLMs to design graph algorithms first and then code to address graph reasoning tasks. Simple-RTC achieves near-perfect accuracy on existing benchmarks and significantly outperforms GPT-4o-mini and all prior methods on the GraphAlgorithm benchmark. This strong baseline encourages further advancements in LLMs for Graph Reasoning in the future.
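A sketch of what a design-then-code pipeline in the spirit of Simple-RTC could look like. The prompt wording and the unsandboxed `exec` runner are assumptions for illustration, not the paper's implementation; a real system would isolate the generated program.

```python
import contextlib
import io
import sys

DESIGN_THEN_CODE_PROMPT = """\
Step 1 (Reasoning): Design an algorithm for the graph problem below.
Do not trace it by hand; state the idea and its time complexity.

Step 2 (Coding): Write a complete Python program that reads the input
from stdin, runs your algorithm, and prints the answer.

Problem:
{problem}
"""

def run_generated_program(code: str, stdin_text: str) -> str:
    """Run the model-written program on the instance input.
    exec here is only for illustration; sandbox in practice."""
    out = io.StringIO()
    old_stdin = sys.stdin
    sys.stdin = io.StringIO(stdin_text)
    try:
        with contextlib.redirect_stdout(out):
            exec(code, {"__name__": "__main__"})
    finally:
        sys.stdin = old_stdin
    return out.getvalue().strip()

def design_then_code(llm, problem: str, instance_input: str) -> str:
    """The model writes a program; the program, not the model,
    computes the answer. `llm` is any callable from prompt to code."""
    code = llm(DESIGN_THEN_CODE_PROMPT.format(problem=problem))
    return run_generated_program(code, instance_input)
```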
[354] Fact Grounded Attention: Eliminating Hallucination in Large Language Models Through Attention Level Knowledge Integration
Aayush Gupta
Main category: cs.AI
TL;DR: Fact Grounded Attention (FGA) eliminates hallucinations in LLMs by injecting verifiable knowledge directly into attention mechanisms, achieving 99.7% accuracy on technical queries without retraining.
Details
Motivation: Current LLMs suffer from confident hallucinations due to their probabilistic nature, creating an "illusion of knowledge" rather than true factual accuracy.
Method: FGA modifies the transformer architecture by injecting verifiable knowledge directly into pre-softmax attention scores, creating deterministic truth-telling without post-generation patching.
Result: Accuracy improved from 6.3% in vanilla Llama 3.2 to 99.7% with FGA across 1,107 technical queries. Knowledge updates occur in under one second without retraining.
Conclusion: FGA represents a fundamental shift from probabilistic approximation to deterministic precision in neural language generation, completely eliminating hallucinations for verifiable facts.
Abstract: “The greatest enemy of knowledge is not ignorance, it is the illusion of knowledge.” Large Language Models have conquered natural language but remain prisoners of their own probabilistic nature–confidently hallucinating facts they never truly knew. We present Fact Grounded Attention (FGA), a novel architectural modification that transforms unreliable language models into deterministic truth tellers by injecting verifiable knowledge directly into the attention mechanism. Unlike existing approaches that patch hallucinations after generation or prepend retrieved text, FGA intervenes at the mathematical heart of the transformer–the pre-softmax attention scores–creating a model that cannot hallucinate when facts exist in its knowledge base. Our experiments across 1,107 technical queries spanning smartphones, laptops, and electric vehicles demonstrate a transformation from 6.3% accuracy in vanilla Llama 3.2 to 99.7% accuracy with FGA. More critically, knowledge updates occur in under one second without retraining, compared to hours for parameter editing approaches. FGA doesn’t just reduce hallucination–it eliminates it entirely for verifiable facts, marking a fundamental shift from probabilistic approximation to deterministic precision in neural language generation.
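A minimal sketch of attention with an additive pre-softmax bias, which is the intervention point FGA targets. How the fact-dependent bias itself is computed from the knowledge base is the paper's contribution and is not reproduced here.

```python
import torch

def fact_grounded_attention(q, k, v, fact_bias, scale):
    """Scaled dot-product attention with an additive pre-softmax bias:
    key positions carrying verified facts get boosted scores before
    normalization. q/k/v: (batch, heads, len, dim);
    fact_bias: (batch, heads, q_len, k_len)."""
    scores = (q @ k.transpose(-2, -1)) * scale  # raw attention logits
    scores = scores + fact_bias                 # inject the knowledge signal
    attn = torch.softmax(scores, dim=-1)
    return attn @ v
```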
[355] Planner-R1: Reward Shaping Enables Efficient Agentic RL with Smaller LLMs
Siyu Zhu, Yanbin Jiang, Hejian Sang, Shao Tang, Qingquan Song, Biao He, Rohit Jain, Zhipeng Wang, Alborz Geramifard
Main category: cs.AI
TL;DR: Agentic RL with LLMs on TravelPlanner benchmark shows 8B models achieve competitive performance with dense reward shaping, being 3.5x more compute-efficient and 1.5x more memory-efficient than 32B models, while maintaining generalization.
Details
Motivation: To investigate how reward shaping affects agentic reinforcement learning with large language models, particularly examining the efficiency and performance trade-offs between different model sizes.
Method: Used Planner-R1 approach with dense process-level reward signals on TravelPlanner benchmark, comparing 8B and 32B models under different reward conditions (dense vs sparse).
Result: Achieved 56.9% final-pass rate with only 180 training queries (2.7x improvement over GPT-5 baseline). 8B models with dense rewards were 3.5x more compute-efficient and 1.5x more memory-efficient than 32B models while maintaining competitive performance.
Conclusion: Reward shaping is crucial for scaling agentic RL, smaller models (8B) are highly efficient and competitive with proper shaping, and these efficiency gains don’t compromise generalization on out-of-domain tasks.
Abstract: We investigated Agentic RL with large language models on the TravelPlanner benchmark. Our approach, Planner-R1, achieved a 56.9% final-pass rate with only 180 training queries, a 2.7× improvement over GPT-5’s 21.2% baseline and the strongest agentic result on the public leaderboard. A central finding was that smaller models (8B) were highly responsive to reward shaping: with dense process-level signals, they reached competitive performance while being 3.5× more compute-efficient and 1.5× more memory-efficient than 32B models. Larger models were more robust under sparse rewards but exhibited smaller relative gains from shaping and higher variance across runs. While curriculum learning offered no significant benefit, shaped rewards consistently amplified learning dynamics, making 8B models the most efficient setting for agentic RL. Crucially, these gains did not come at the cost of overfitting: fine-tuned models mostly maintained or exceeded baseline performance on out-of-domain tasks, including Multi-IF, NaturalPlan, and τ-Bench. These results establish reward shaping as a decisive lever for scaling agentic RL, highlight the competitive strength of smaller models, and demonstrate that efficiency can be achieved without sacrificing generalization.
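A toy illustration of the dense-versus-sparse distinction at the heart of this result: shaped rewards give partial credit per satisfied constraint, while the sparse signal fires only on a full pass. The constraint checkers are illustrative placeholders, not the paper's reward function.

```python
def shaped_reward(plan, constraints):
    """Dense vs. sparse reward on a candidate plan. `constraints` maps
    names (e.g. 'budget', 'room_type') to checker callables."""
    passed = sum(bool(check(plan)) for check in constraints.values())
    dense = passed / len(constraints)            # process-level, in [0, 1]
    sparse = float(passed == len(constraints))   # final-pass only
    return dense, sparse
```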
[356] Interactive Learning for LLM Reasoning
Hehai Lin, Shilei Cao, Sudong Wang, Haotian Wu, Minzhi Li, Linyi Yang, Juepeng Zheng, Chengwei Qin
Main category: cs.AI
TL;DR: ILR is a multi-agent co-learning framework that enhances LLMs’ independent problem-solving through Dynamic Interaction and Perception Calibration, achieving up to 5% improvement over single-agent baselines.
Details
Motivation: Existing multi-agent systems require re-executing the full system during inference, unlike human cognition where individuals can independently solve problems after learning from interactions.
Method: ILR uses Dynamic Interaction with adaptive cooperative/competitive strategies and Idea3 paradigm (Idea Sharing, Analysis, Fusion), plus Perception Calibration with Group Relative Policy Optimization (GRPO) to align reward distributions.
Result: ILR consistently outperforms single-agent learning across 3 LLMs and 5 mathematical + 1 coding benchmarks, with up to 5% improvement over strongest baseline. Idea3 enhances robustness and dynamic interaction boosts learning.
Conclusion: Multi-agent interaction can enhance LLMs’ independent problem-solving ability, with ILR’s dynamic interaction and perception calibration effectively transferring collaborative learning to individual performance.
Abstract: Existing multi-agent learning approaches have developed interactive training environments to explicitly promote collaboration among multiple Large Language Models (LLMs), thereby constructing stronger multi-agent systems (MAS). However, during inference, they require re-executing the MAS to obtain final solutions, which diverges from human cognition that individuals can enhance their reasoning capabilities through interactions with others and resolve questions independently in the future. To investigate whether multi-agent interaction can enhance LLMs’ independent problem-solving ability, we introduce ILR, a novel co-learning framework for MAS that integrates two key components: Dynamic Interaction and Perception Calibration. Specifically, Dynamic Interaction first adaptively selects either cooperative or competitive strategies depending on question difficulty and model ability. LLMs then exchange information through Idea3 (Idea Sharing, Idea Analysis, and Idea Fusion), an innovative interaction paradigm designed to mimic human discussion, before deriving their respective final answers. In Perception Calibration, ILR employs Group Relative Policy Optimization (GRPO) to train LLMs while integrating one LLM’s reward distribution characteristics into another’s reward function, thereby enhancing the cohesion of multi-agent interactions. We validate ILR on three LLMs across two model families of varying scales, evaluating performance on five mathematical benchmarks and one coding benchmark. Experimental results show that ILR consistently outperforms single-agent learning, yielding an improvement of up to 5% over the strongest baseline. We further discover that Idea3 can enhance the robustness of stronger LLMs during multi-agent inference, and dynamic interaction types can boost multi-agent learning compared to pure cooperative or competitive strategies.
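For reference, the core of GRPO is a group-relative advantage, sketched below. ILR's Perception Calibration, which mixes one model's reward distribution characteristics into another's reward function, is not reproduced here.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Core GRPO step: normalize each sampled completion's reward
    against its own group's statistics, so advantages are relative to
    sibling samples rather than a learned value baseline.
    rewards: (num_groups, samples_per_group)."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-6)
```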
[357] Communication-Efficient and Accurate Approach for Aggregation in Federated Low-Rank Adaptation
Le-Tuan Nguyen, Minh-Duong Nguyen, Seon-Geun Jeong, Dung D. Le, Quoc-Viet Pham
Main category: cs.AI
TL;DR: FLoRA-NA is a federated low-rank adaptation method that addresses communication overhead and local-global generalization gap in FedLoRA by using server-side estimation of aggregated matrices without additional communication costs.
Details
Motivation: Current FedLoRA methods face challenges with inexact updates, local-global generalization gap, and substantial communication overhead, limiting scalability and effectiveness.
Method: FLoRA-NA leverages local LoRA matrices on the server to estimate aggregated matrices Â and B̂, which are distributed to clients for local updates, minimizing divergence between ideal and practical updates without extra communication.
Result: Extensive evaluations across natural language understanding, mathematical reasoning, and code-solving tasks show FLoRA-NA achieves state-of-the-art global performance with low communication overhead.
Conclusion: FLoRA-NA effectively bridges local personalization and global generalization while maintaining communication efficiency, addressing key limitations of prior personalized FedLoRA approaches.
Abstract: With the rapid emergence of foundation models and the increasing need for fine-tuning across distributed environments, Federated Low-Rank Adaptation (FedLoRA) has recently gained significant attention. Despite enormous potential, current FedLoRA methods face notable challenges due to inexact updates. Existing approaches have attempted to mitigate this issue, but they often introduce a local-global generalization gap and incur substantial communication overhead, limiting their scalability and effectiveness. To address these limitations, we propose Federated Low-Rank Aggregation with Nearly Accurate Estimation (FLoRA-NA). FLoRA-NA leverages the local LoRA matrices on the server to estimate the aggregated matrices $\hat{A}$ and $\hat{B}$, which are then distributed to clients for local updates. These surrogate aggregated matrices minimize the divergence between the ideal update $\nabla \bar{W} = \sum_{u=1}^{U} B_u A_u$ and the practical update $\nabla \hat{W} = \hat{B}\hat{A}$ without adding communication cost beyond vanilla FedLoRA. By doing so, FLoRA-NA achieves communication efficiency and bridges the gap between local personalization and global generalization, addressing a key limitation of prior personalized FedLoRA approaches. We conduct extensive evaluations across diverse tasks, including natural language understanding, mathematical reasoning, and code-solving ability using various foundation models. Experimental results consistently demonstrate that FLoRA-NA achieves state-of-the-art global performance while maintaining low communication overhead.
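One plausible way to realize the server-side estimation: form the ideal averaged update and take its best rank-r factorization by truncated SVD, which (by the Eckart-Young theorem) minimizes the divergence the abstract describes. The paper's exact estimator may differ; this is a sketch under that assumption.

```python
import numpy as np

def aggregate_lora(As, Bs, rank):
    """Estimate aggregated LoRA factors A_hat, B_hat from client
    factors. As[u]: (r, d_in); Bs[u]: (d_out, r)."""
    W = sum(B @ A for A, B in zip(As, Bs)) / len(As)  # ideal averaged update
    u, s, vt = np.linalg.svd(W, full_matrices=False)
    B_hat = u[:, :rank] * s[:rank]   # (d_out, rank)
    A_hat = vt[:rank]                # (rank, d_in)
    return A_hat, B_hat              # B_hat @ A_hat best approximates W
```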
[358] Rethinking Reward Models for Multi-Domain Test-Time Scaling
Dong Bok Lee, Seanie Lee, Sangwoo Park, Minki Kang, Jinheon Baek, Dongki Kim, Dominik Wagner, Jiongdao Jin, Heejun Lee, Tobias Bocklet, Jinyu Wang, Jingjing Fu, Sung Ju Hwang, Jiang Bian, Lei Song
Main category: cs.AI
TL;DR: Contrary to conventional wisdom, generative outcome reward models (GenORM) outperform process reward models (PRMs) across 14 diverse domains, challenging the assumption that fine-grained stepwise supervision is always better.
Details
Motivation: To evaluate the reliability of large language models during test-time scaling, challenging the prevailing assumption that process reward models (PRMs) always outperform outcome reward models (ORMs) based on evidence from narrow math domains.
Method: Conducted unified evaluation of four reward model variants (discriminative ORM/PRM and generative ORM/PRM) across 14 diverse domains, with theoretical analysis of error compounding in stepwise scoring.
Result: Discriminative ORM performs on par with discriminative PRM, generative PRM is not competitive, and generative ORM is the most robust with significant gains across all tested domains.
Conclusion: Stepwise scoring inherits label noise from LLM auto-labeling and struggles with long reasoning trajectories, supporting generative outcome verification as more effective for multi-domain deployment.
Abstract: The reliability of large language models (LLMs) during test-time scaling is often assessed with external verifiers or reward models that distinguish correct reasoning from flawed logic. Prior work generally assumes that process reward models (PRMs), which score every intermediate reasoning step, outperform outcome reward models (ORMs) that assess only the final answer. This view is based mainly on evidence from narrow, math-adjacent domains. We present the first unified evaluation of four reward model variants, discriminative ORM and PRM (DisORM, DisPRM) and generative ORM and PRM (GenORM, GenPRM), across 14 diverse domains. Contrary to conventional wisdom, we find that (i) DisORM performs on par with DisPRM, (ii) GenPRM is not competitive, and (iii) overall, GenORM is the most robust, yielding significant and consistent gains across every tested domain. We attribute this to PRM-style stepwise scoring, which inherits label noise from LLM auto-labeling and has difficulty evaluating long reasoning trajectories, including those involving self-correcting reasoning. Our theoretical analysis shows that step-wise aggregation compounds errors as reasoning length grows, and our empirical observations confirm this effect. These findings challenge the prevailing assumption that fine-grained supervision is always better and support generative outcome verification for multi-domain deployment. We publicly release our code, datasets, and checkpoints at https://github.com/db-Lee/Multi-RM to facilitate future research in multi-domain settings.
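The compounding argument can be stated in one line: if each of n step-level judgments is independently correct with probability 1 - ε, the chance that stepwise aggregation stays untainted decays geometrically in n, whereas a single outcome-level judgment incurs the error ε once. This simplified independence assumption is ours; the paper's analysis is presumably more refined.

```latex
% Simplified compounding bound under independent step-level errors:
P(\text{all } n \text{ steps judged correctly})
  = (1 - \epsilon)^n
  \le e^{-\epsilon n}
  \longrightarrow 0 \quad \text{as } n \to \infty.
```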
[359] Semantic Bridges Between First Order c-Representations and Cost-Based Semantics: An Initial Perspective
Nicholas Leisegang, Giovanni Casini, Thomas Meyer
Main category: cs.AI
TL;DR: This paper compares weighted knowledge bases with cost-based semantics and c-representations, showing semantic equivalence under certain conditions and comparing entailment in both formalisms.
Details
Motivation: To establish connections between two different approaches for handling inconsistent knowledge bases: weighted KBs with cost-based semantics (for ontology-mediated data querying) and c-representations (for non-monotonic reasoning with defeasible conditionals).
Method: Semantic comparison by analyzing how both approaches assign numerical rankings/penalties to interpretations based on violated rules/conditionals, and examining conditions under which they generate equivalent orderings.
Result: Shows that under certain conditions, weighted KBs and sets of defeasible conditionals can generate the same ordering on interpretations, establishing semantic equivalence up to relative cost. Also demonstrates equivalently expressible entailment notions.
Conclusion: The comparison reveals potential benefits for both formalisms, enabling cross-fertilization between cost-based semantics and c-representations in future work on handling inconsistent knowledge bases.
Abstract: Weighted knowledge bases and cost-based semantics represent a recent formalism introduced by Bienvenu et al. for Ontology-Mediated Data Querying in the case where a given knowledge base is inconsistent. This is done by adding a weight to each statement in the knowledge base (KB), and then giving each DL interpretation a cost based on how often it breaks rules in the KB. In this paper we compare this approach with c-representations, a form of non-monotonic reasoning originally introduced by Kern-Isberner. c-Representations describe a means to interpret defeasible concept inclusions in the first-order case. This is done by assigning a numerical ranking to each interpretation via penalties for each violated conditional. We compare these two approaches on a semantic level. In particular, we show that under certain conditions a weighted knowledge base and a set of defeasible conditionals can generate the same ordering on interpretations, and therefore an equivalence of semantic structures up to relative cost. Moreover, we compare entailment as described in both cases, where certain notions are equivalently expressible in both formalisms. Our results have the potential to benefit further work on both cost-based semantics and c-representations.
cs.SD
[360] RealClass: A Framework for Classroom Speech Simulation with Public Datasets and Game Engines
Ahmed Adel Attia, Jing Liu, Carol Espy Wilson
Main category: cs.SD
TL;DR: A scalable methodology using game engines to synthesize classroom noise and RIRs, creating RealClass dataset that combines synthesized noise with publicly available speech data to approximate real classroom speech.
Details
Motivation: The scarcity of large-scale classroom speech data hinders AI-driven speech model development for education, with limited public datasets and absence of dedicated classroom noise/RIR corpora preventing standard data augmentation.
Method: Use game engines to synthesize classroom noise and RIRs, combine with classroom speech dataset compiled from publicly available corpora (children’s speech + instructional speech from YouTube videos).
Result: RealClass dataset closely approximates real classroom speech in both clean and noisy conditions, making it valuable when real classroom speech is scarce.
Conclusion: The proposed methodology and RealClass dataset provide a scalable solution for classroom speech data scarcity, enabling development of AI-driven speech models for education.
Abstract: The scarcity of large-scale classroom speech data has hindered the development of AI-driven speech models for education. Classroom datasets remain limited and not publicly available, and the absence of dedicated classroom noise or Room Impulse Response (RIR) corpora prevents the use of standard data augmentation techniques. In this paper, we introduce a scalable methodology for synthesizing classroom noise and RIRs using game engines, a versatile framework that can extend to other domains beyond the classroom. Building on this methodology, we present RealClass, a dataset that combines a synthesized classroom noise corpus with a classroom speech dataset compiled from publicly available corpora. The speech data pairs a children’s speech corpus with instructional speech extracted from YouTube videos to approximate real classroom interactions in clean conditions. Experiments on clean and noisy speech show that RealClass closely approximates real classroom speech, making it a valuable asset in the absence of abundant real classroom speech.
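With noise and RIR corpora available, the standard augmentation this dataset unlocks is straightforward; a minimal sketch (reverberate with the RIR, then mix noise at a target SNR), with illustrative names:

```python
import numpy as np

def simulate_classroom(speech, rir, noise, snr_db=10.0):
    """Convolve clean speech with a room impulse response, then add
    classroom noise at the requested SNR. Inputs: 1-D float arrays
    at the same sample rate."""
    reverberant = np.convolve(speech, rir)[: len(speech)]
    noise = noise[: len(reverberant)]
    # Scale the noise to hit the target signal-to-noise ratio.
    p_sig = np.mean(reverberant ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(p_sig / (p_noise * 10 ** (snr_db / 10)))
    return reverberant + gain * noise
```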
[361] Emotional Text-To-Speech Based on Mutual-Information-Guided Emotion-Timbre Disentanglement
Jianing Yang, Sheng Li, Takahiro Shinozaki, Yuki Saito, Hiroshi Saruwatari
Main category: cs.SD
TL;DR: A novel emotional TTS method using fine-grained phoneme-level emotion embeddings with style disentanglement to capture nuanced acoustic details while separating timbre and emotion features.
Details
Motivation: Current emotional TTS and style transfer methods rely on global style vectors but fail to capture nuanced acoustic details of reference speech, limiting emotional expressiveness.
Method: Proposes style disentanglement with two feature extractors to reduce mutual information between timbre and emotion features, enabling fine-grained phoneme-level emotion embedding prediction.
Result: Outperforms baseline TTS systems in generating natural and emotionally rich speech, demonstrating superior emotional expressiveness.
Conclusion: Disentangled and fine-grained representations have significant potential for advancing the quality and flexibility of emotional TTS systems.
Abstract: Current emotional Text-To-Speech (TTS) and style transfer methods rely on reference encoders to control global style or emotion vectors, but do not capture nuanced acoustic details of the reference speech. To this end, we propose a novel emotional TTS method that enables fine-grained phoneme-level emotion embedding prediction while disentangling intrinsic attributes of the reference speech. The proposed method employs a style disentanglement method to guide two feature extractors, reducing mutual information between timbre and emotion features, and effectively separating distinct style components from the reference speech. Experimental results demonstrate that our method outperforms baseline TTS systems in generating natural and emotionally rich speech. This work highlights the potential of disentangled and fine-grained representations in advancing the quality and flexibility of emotional TTS systems.
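Reducing mutual information between two embeddings typically requires a dedicated estimator (e.g., CLUB or MINE); the sketch below shows only the simplest decorrelation surrogate of that idea, not the paper's method.

```python
import torch

def decorrelation_penalty(timbre_feat, emotion_feat):
    """Penalize the cross-correlation between batch-normalized timbre
    and emotion embeddings, a crude stand-in for an MI term.
    Inputs: (batch, d_t) and (batch, d_e)."""
    t = (timbre_feat - timbre_feat.mean(0)) / (timbre_feat.std(0) + 1e-6)
    e = (emotion_feat - emotion_feat.mean(0)) / (emotion_feat.std(0) + 1e-6)
    cross_corr = (t.T @ e) / t.shape[0]   # (d_t, d_e)
    return (cross_corr ** 2).mean()
```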
[362] SingMOS-Pro: A Comprehensive Benchmark for Singing Quality Assessment
Yuxun Tang, Lan Liu, Wenhao Feng, Yiwen Zhao, Jionghao Han, Yifeng Yu, Jiatong Shi, Qin Jin
Main category: cs.SD
TL;DR: SingMOS-Pro is an expanded dataset for automatic singing quality assessment with 7,981 singing clips from 41 models, featuring professional annotations for lyrics, melody, and overall quality.
Details
Motivation: Current singing voice evaluation relies on costly human listening tests, while existing objective metrics capture limited perceptual aspects, creating a need for better automatic assessment methods.
Method: The authors created SingMOS-Pro by expanding their previous SingMOS dataset with additional annotations covering lyrics, melody, and overall quality. The dataset includes 7,981 clips from 41 models across 12 datasets, each rated by at least five professional annotators.
Result: The dataset provides reliable and consistent ratings with broader coverage and greater diversity than previous versions. The authors also benchmarked several evaluation methods on SingMOS-Pro to establish baselines for future research.
Conclusion: SingMOS-Pro addresses the critical challenge of singing quality evaluation by providing a comprehensive dataset with professional annotations, enabling better automatic assessment methods and serving as a practical reference for future research in singing voice generation.
Abstract: Singing voice generation progresses rapidly, yet evaluating singing quality remains a critical challenge. Human subjective assessment, typically in the form of listening tests, is costly and time-consuming, while existing objective metrics capture only limited perceptual aspects. In this work, we introduce SingMOS-Pro, a dataset for automatic singing quality assessment. Building on our preview version SingMOS, which provides only overall ratings, SingMOS-Pro expands the annotations of the newly added portion to include lyrics, melody, and overall quality, offering broader coverage and greater diversity. The dataset contains 7,981 singing clips generated by 41 models across 12 datasets, spanning from early systems to recent advances. Each clip receives at least five ratings from professional annotators, ensuring reliability and consistency. Furthermore, we explore how to effectively utilize MOS data annotated under different standards and benchmark several widely used evaluation methods from related tasks on SingMOS-Pro, establishing strong baselines and practical references for future research. The dataset can be accessed at https://huggingface.co/datasets/TangRain/SingMOS-Pro.
[363] HRTFformer: A Spatially-Aware Transformer for Personalized HRTF Upsampling in Immersive Audio Rendering
Xuyi Hu, Jian Li, Shaojie Zhang, Stefan Goetz, Lorenzo Picinali, Ozgur B. Akan, Aidan O. T. Hogg
Main category: cs.SD
TL;DR: A transformer-based architecture for HRTF spatial upsampling that uses attention mechanisms to capture spatial correlations, working in the spherical harmonic domain to reconstruct high-resolution HRTFs from sparse measurements with improved accuracy.
Details
Motivation: Personalized HRTFs are crucial for realistic spatial audio but impractical to measure at scale. Existing ML approaches struggle with long-range spatial consistency and generalization at high upsampling factors.
Method: Transformer-based architecture leveraging attention mechanism to capture spatial correlations across HRTF sphere, working in spherical harmonic domain. Introduces neighbor dissimilarity loss to promote magnitude smoothness and enhance spatial coherence.
Result: Model surpasses leading methods by substantial margin in generating realistic, high-fidelity HRTFs, evaluated using both perceptual localization models and objective spectral distortion metrics.
Conclusion: The proposed transformer-based approach with neighbor dissimilarity loss effectively addresses limitations of prior methods, enabling more practical personalized HRTF creation through improved upsampling performance.
Abstract: Personalized Head-Related Transfer Functions (HRTFs) are starting to be introduced in many commercial immersive audio applications and are crucial for realistic spatial audio rendering. However, one of the main hesitations regarding their introduction is that creating personalized HRTFs is impractical at scale due to the complexities of the HRTF measurement process. To mitigate this drawback, HRTF spatial upsampling has been proposed with the aim of reducing measurements required. While prior work has seen success with different machine learning (ML) approaches, these models often struggle with long-range spatial consistency and generalization at high upsampling factors. In this paper, we propose a novel transformer-based architecture for HRTF upsampling, leveraging the attention mechanism to better capture spatial correlations across the HRTF sphere. Working in the spherical harmonic (SH) domain, our model learns to reconstruct high-resolution HRTFs from sparse input measurements with significantly improved accuracy. To enhance spatial coherence, we introduce a neighbor dissimilarity loss that promotes magnitude smoothness, yielding more realistic upsampling. We evaluate our method using both perceptual localization models and objective spectral distortion metrics. Experiments show that our model surpasses leading methods by a substantial margin in generating realistic, high-fidelity HRTFs.
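A minimal reading of the neighbor dissimilarity loss: penalize magnitude differences between spatially adjacent HRTF directions to promote smoothness over the sphere. The paper's exact weighting may differ; names here are illustrative.

```python
import torch

def neighbor_dissimilarity_loss(hrtf_mag: torch.Tensor, neighbors) -> torch.Tensor:
    """hrtf_mag: (n_directions, n_freqs) magnitude responses;
    neighbors: iterable of index pairs (i, j) of adjacent directions."""
    i, j = map(list, zip(*neighbors))
    diff = hrtf_mag[i] - hrtf_mag[j]
    return diff.pow(2).mean()
```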
[364] MelCap: A Unified Single-Codebook Neural Codec for High-Fidelity Audio Compression
Jingyi Li, Zhiyuan Zhao, Yunfei Liu, Lijian Lin, Ye Zhu, Jiahao Wu, Qiuqiang Kong, Yu Li
Main category: cs.SD
TL;DR: MelCap is a unified neural audio codec that uses a single codebook to handle speech, music, and general sounds, achieving performance comparable to multi-codebook methods while maintaining computational simplicity.
Details
Motivation: Existing neural audio codecs either use single quantizers limited to speech domain or multiple quantizers not suitable for downstream tasks. There's a need for a unified approach that handles diverse audio types effectively.
Method: Two-stage approach: 1) Transform audio to mel-spectrograms, compress and quantize into single tokens using 2D tokenizer with perceptual loss to reduce artifacts; 2) Vocoder recovers waveforms from mel tokens in single forward pass for real-time decoding.
Result: Both objective and subjective evaluations show MelCap achieves quality comparable to state-of-the-art multi-codebook codecs while maintaining single-codebook computational simplicity.
Conclusion: MelCap provides an effective unified neural codec that handles diverse audio types with quality comparable to complex multi-codebook methods, offering better representations for downstream tasks.
Abstract: Neural audio codecs have recently emerged as powerful tools for high-quality and low-bitrate audio compression, leveraging deep generative models to learn latent representations of audio signals. However, existing approaches either rely on a single quantizer that only processes the speech domain, or on multiple quantizers that are not well suited for downstream tasks. To address this issue, we propose MelCap, a unified “one-codebook-for-all” neural codec that effectively handles speech, music, and general sound. By decomposing audio reconstruction into two stages, our method preserves more acoustic details than previous single-codebook approaches, while achieving performance comparable to mainstream multi-codebook methods. In the first stage, audio is transformed into mel-spectrograms, which are compressed and quantized into compact single tokens using a 2D tokenizer. A perceptual loss is further applied to mitigate the over-smoothing artifacts observed in spectrogram reconstruction. In the second stage, a Vocoder recovers waveforms from the mel discrete tokens in a single forward pass, enabling real-time decoding. Both objective and subjective evaluations demonstrate that MelCap achieves quality comparable to state-of-the-art multi-codebook codecs, while retaining the computational simplicity of a single-codebook design, thereby providing an effective representation for downstream tasks.
[365] Exploring Resolution-Wise Shared Attention in Hybrid Mamba-U-Nets for Improved Cross-Corpus Speech Enhancement
Nikolai Lund Kühne, Jesper Jensen, Jan Østergaard, Zheng-Hua Tan
Main category: cs.SD
TL;DR: RWSA-MambaUNet combines Mamba and multi-head attention in a U-Net structure with resolution-wise shared attention, achieving state-of-the-art cross-corpus generalization with significantly reduced model size and computational complexity.
Details
Motivation: To improve cross-corpus generalization in speech enhancement by leveraging the complementary strengths of Mamba and attention mechanisms in an efficient U-Net architecture.
Method: Proposed RWSA-MambaUNet hybrid model that integrates Mamba and multi-head attention with resolution-wise shared attention (layerwise attention-sharing across time-frequency resolutions) in a U-Net structure.
Result: Achieves SOTA generalization on two out-of-domain test sets; smallest model surpasses all baselines on DNS 2020 (PESQ, SSNR, ESTOI) and EARS-WHAM_v2 (SSNR, ESTOI, SI-SDR) using <50% parameters and fraction of FLOPs.
Conclusion: The hybrid RWSA-MambaUNet approach effectively combines Mamba and attention mechanisms to achieve superior cross-corpus generalization with high efficiency in model size and computation.
Abstract: Recent advances in speech enhancement have shown that models combining Mamba and attention mechanisms yield superior cross-corpus generalization performance. At the same time, integrating Mamba in a U-Net structure has yielded state-of-the-art enhancement performance, while reducing both model size and computational complexity. Inspired by these insights, we propose RWSA-MambaUNet, a novel and efficient hybrid model combining Mamba and multi-head attention in a U-Net structure for improved cross-corpus performance. Resolution-wise shared attention (RWSA) refers to layerwise attention-sharing across corresponding time- and frequency resolutions. Our best-performing RWSA-MambaUNet model achieves state-of-the-art generalization performance on two out-of-domain test sets. Notably, our smallest model surpasses all baselines on the out-of-domain DNS 2020 test set in terms of PESQ, SSNR, and ESTOI, and on the out-of-domain EARS-WHAM_v2 test set in terms of SSNR, ESTOI, and SI-SDR, while using less than half the model parameters and a fraction of the FLOPs.
[366] Multi-bit Audio Watermarking
Luca A. Lanzendörfer, Kyle Fearne, Florian Grötschla, Roger Wattenhofer
Main category: cs.SD
TL;DR: Timbru is a post-hoc audio watermarking method that achieves state-of-the-art robustness and imperceptibility without training embedder-detector models, using gradient optimization in a pretrained VAE’s latent space.
Details
Motivation: To create an efficient, dataset-free audio watermarking solution that maintains high robustness against attacks while preserving perceptual quality, without the need for training specialized embedder-detector models.
Method: Performs per-audio gradient optimization to add imperceptible perturbations in the latent space of a pretrained audio VAE, guided by combined message and perceptual loss, with watermark extraction using a pretrained CLAP model.
Result: Achieves best average bit error rates on MUSDB18-HQ against AudioSeal, WavMark, and SilentCipher across various attacks (filtering, noise, compression, resampling, cropping, regeneration) while preserving perceptual quality.
Conclusion: Timbru demonstrates an efficient, dataset-free path to imperceptible audio watermarking with superior robustness-imperceptibility trade-offs compared to existing methods.
Abstract: We present Timbru, a post-hoc audio watermarking model that achieves state-of-the-art robustness and imperceptibility trade-offs without training an embedder-detector model. Given any 44.1 kHz stereo music snippet, our method performs per-audio gradient optimization to add imperceptible perturbations in the latent space of a pretrained audio VAE, guided by a combined message and perceptual loss. The watermark can then be extracted using a pretrained CLAP model. We evaluate 16-bit watermarking on MUSDB18-HQ against AudioSeal, WavMark, and SilentCipher across common filtering, noise, compression, resampling, cropping, and regeneration attacks. Our approach attains the best average bit error rates, while preserving perceptual quality, demonstrating an efficient, dataset-free path to imperceptible audio watermarking.
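A sketch of the per-audio optimization loop this describes, under assumed interfaces: `vae.encode`/`vae.decode` map between waveforms and latents, and `extract_bits` stands in for the CLAP-based extraction head, whose details the summary does not specify. The perceptual term here is a crude L2 proxy, not the paper's perceptual loss.

```python
import torch

def embed_watermark(vae, extract_bits, audio, message_bits,
                    steps=200, lr=1e-2, perc_weight=10.0):
    """Optimize an imperceptible latent perturbation so the extractor
    recovers the message while the waveform stays close to the original."""
    z = vae.encode(audio).detach()
    delta = torch.zeros_like(z, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        marked = vae.decode(z + delta)
        msg_loss = torch.nn.functional.binary_cross_entropy(
            extract_bits(marked), message_bits)      # recover the bits
        perc_loss = (marked - audio).pow(2).mean()   # crude perceptual proxy
        loss = msg_loss + perc_weight * perc_loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return vae.decode(z + delta).detach()
```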
[367] Bias beyond Borders: Global Inequalities in AI-Generated Music
Ahmet Solak, Florian Grötschla, Luca A. Lanzendörfer, Roger Wattenhofer
Main category: cs.SD
TL;DR: GlobalDISCO dataset addresses biases in music generation models by providing 73k generated tracks paired with 93k reference tracks across 147 languages and 79 countries, revealing significant quality disparities between high-resource and low-resource regions.
Details
Motivation: To address the lack of research on biases in music generation models across countries, languages, cultures, and musical genres, and the absence of datasets capturing global music diversity.
Method: Created GlobalDISCO dataset with 73k music tracks from commercial generative models paired with 93k reference tracks from LAION-DISCO-12M, spanning 147 languages and musical styles from 79 countries across five continents.
Result: Evaluation revealed large disparities in music quality and alignment with reference music between high-resource and low-resource regions, and marked performance differences between mainstream and geographically niche genres.
Conclusion: Music generation models exhibit significant biases favoring high-resource regions and mainstream genres, with models sometimes generating regional genres that align more with mainstream styles than authentic regional characteristics.
Abstract: While recent years have seen remarkable progress in music generation models, research on their biases across countries, languages, cultures, and musical genres remains underexplored. This gap is compounded by the lack of datasets and benchmarks that capture the global diversity of music. To address these challenges, we introduce GlobalDISCO, a large-scale dataset consisting of 73k music tracks generated by state-of-the-art commercial generative music models, along with paired links to 93k reference tracks in LAION-DISCO-12M. The dataset spans 147 languages and includes musical style prompts extracted from MusicBrainz and Wikipedia. The dataset is globally balanced, representing musical styles from artists across 79 countries and five continents. Our evaluation reveals large disparities in music quality and alignment with reference music between high-resource and low-resource regions. Furthermore, we find marked differences in model performance between mainstream and geographically niche genres, including cases where models generate music for regional genres that more closely align with the distribution of mainstream styles.
[368] SoundReactor: Frame-level Online Video-to-Audio Generation
Koichi Saito, Julian Tanke, Christian Simon, Masato Ishii, Kazuki Shimada, Zachary Novack, Zhi Zhong, Akio Hayakawa, Takashi Shibuya, Yuki Mitsufuji
Main category: cs.SD
TL;DR: SoundReactor is the first framework for frame-level online video-to-audio generation that operates autoregressively without future frame access, achieving low latency and high-quality audio synchronization for interactive applications.
Details
Motivation: Current V2A models operate offline with full video access, limiting their use in live interactive applications like content creation and generative world models.
Method: Uses a decoder-only causal transformer over continuous audio latents with DINOv2 vision encoder patch features aggregated per frame. Employs diffusion pre-training followed by consistency fine-tuning for accelerated decoding.
Result: Successfully generates semantically and temporally aligned high-quality stereo audio on gameplay videos with low per-frame latency (26.3-31.5ms on 30FPS videos using H100 GPU).
Conclusion: SoundReactor enables real-time online V2A generation with end-to-end causality and audio-visual synchronization, opening possibilities for interactive applications.
Abstract: Prevailing Video-to-Audio (V2A) generation models operate offline, assuming an entire video sequence or chunks of frames are available beforehand. This critically limits their use in interactive applications such as live content creation and emerging generative world models. To address this gap, we introduce the novel task of frame-level online V2A generation, where a model autoregressively generates audio from video without access to future video frames. Furthermore, we propose SoundReactor, which, to the best of our knowledge, is the first simple yet effective framework explicitly tailored for this task. Our design enforces end-to-end causality and targets low per-frame latency with audio-visual synchronization. Our model’s backbone is a decoder-only causal transformer over continuous audio latents. For vision conditioning, it leverages grid (patch) features extracted from the smallest variant of the DINOv2 vision encoder, which are aggregated into a single token per frame to maintain end-to-end causality and efficiency. The model is trained through a diffusion pre-training followed by consistency fine-tuning to accelerate the diffusion head decoding. On a benchmark of diverse gameplay videos from AAA titles, our model successfully generates semantically and temporally aligned, high-quality full-band stereo audio, validated by both objective and human evaluations. Furthermore, our model achieves low per-frame waveform-level latency (26.3ms with the head NFE=1, 31.5ms with NFE=4) on 30FPS, 480p videos using a single H100. Demo samples are available at https://koichi-saito-sony.github.io/soundreactor/.
[369] Go witheFlow: Real-time Emotion Driven Audio Effects Modulation
Edmund Dervakos, Spyridon Kantarelis, Vassilis Lyberatos, Jason Liartis, Giorgos Stamou
Main category: cs.SD
TL;DR: The witheFlow system enhances real-time music performance by automatically modulating audio effects using biosignals and audio features.
Details
Motivation: Music performance is inherently human and emotional, making it ideal for exploring human-machine collaboration since machines lack emotional experience.
Method: Developed a lightweight, open-source system that extracts features from biosignals and audio to automatically modulate audio effects in real-time.
Result: Created a proof-of-concept system that can run locally on a laptop with compatible DAW and sensors.
Conclusion: witheFlow demonstrates potential for enhancing human-machine collaboration in music performance through real-time audio effect modulation.
Abstract: Music performance is a distinctly human activity, intrinsically linked to the performer’s ability to convey, evoke, or express emotion. Machines cannot perform music in the human sense; they can produce, reproduce, execute, or synthesize music, but they lack the capacity for affective or emotional experience. As such, music performance is an ideal candidate through which to explore aspects of collaboration between humans and machines. In this paper, we introduce the witheFlow system, designed to enhance real-time music performance by automatically modulating audio effects based on features extracted from both biosignals and the audio itself. The system, currently in a proof-of-concept phase, is designed to be lightweight, able to run locally on a laptop, and is open-source given the availability of a compatible Digital Audio Workstation and sensors.
[370] High-Fidelity Speech Enhancement via Discrete Audio Tokens
Luca A. Lanzendörfer, Frédéric Berdoz, Antonis Asonitis, Roger Wattenhofer
Main category: cs.SD
TL;DR: DAC-SE1 is a simplified language model-based speech enhancement framework using discrete high-resolution audio representations, achieving state-of-the-art performance while preserving acoustic details.
Details
Motivation: Current autoregressive transformer-based speech enhancement methods rely on complex multi-stage pipelines and low sampling rate codecs, limiting them to narrow and task-specific applications.
Method: Leverages discrete high-resolution audio representations in a language model-based framework to preserve fine-grained acoustic details while maintaining semantic coherence.
Result: Surpasses state-of-the-art autoregressive SE methods on both objective perceptual metrics and MUSHRA human evaluation.
Conclusion: DAC-SE1 provides a scalable, unified, and high-quality speech enhancement solution, with codebase and model checkpoints released for further research.
Abstract: Recent autoregressive transformer-based speech enhancement (SE) methods have shown promising results by leveraging advanced semantic understanding and contextual modeling of speech. However, these approaches often rely on complex multi-stage pipelines and low sampling rate codecs, limiting them to narrow and task-specific speech enhancement. In this work, we introduce DAC-SE1, a simplified language model-based SE framework leveraging discrete high-resolution audio representations; DAC-SE1 preserves fine-grained acoustic details while maintaining semantic coherence. Our experiments show that DAC-SE1 surpasses state-of-the-art autoregressive SE methods on both objective perceptual metrics and in a MUSHRA human evaluation. We release our codebase and model checkpoints to support further research in scalable, unified, and high-quality speech enhancement.
[371] NeRAF: 3D Scene Infused Neural Radiance and Acoustic Fields
Amandine Brunetto, Sascha Hornauer, Fabien Moutarde
Main category: cs.SD
TL;DR: NeRAF jointly learns acoustic and radiance fields to synthesize both novel visual views and spatialized room impulse responses, enabling realistic audio-visual generation with cross-modal learning benefits.
Details
Motivation: Sound provides essential information alongside vision for understanding surroundings, but learning acoustics that align with visual scenes remains challenging despite advances in neural implicit representations.
Method: Proposes NeRAF that conditions acoustic field on 3D scene geometric and appearance priors from radiance field, allowing independent rendering of each modality at spatially distinct positions.
Result: Achieves significant performance improvements over prior methods on SoundSpaces and RAF datasets, generates high-quality audio, and enhances novel view synthesis of complex scenes trained with sparse data through cross-modal learning.
Conclusion: NeRAF provides convenient access to realistic audio-visual generation as a Nerfstudio module, offering greater versatility with independent modality rendering and improved data efficiency.
Abstract: Sound plays a major role in human perception. Along with vision, it provides essential information for understanding our surroundings. Despite advances in neural implicit representations, learning acoustics that align with visual scenes remains a challenge. We propose NeRAF, a method that jointly learns acoustic and radiance fields. NeRAF synthesizes both novel views and spatialized room impulse responses (RIR) at new positions by conditioning the acoustic field on 3D scene geometric and appearance priors from the radiance field. The generated RIR can be applied to auralize any audio signal. Each modality can be rendered independently and at spatially distinct positions, offering greater versatility. We demonstrate that NeRAF generates high-quality audio on SoundSpaces and RAF datasets, achieving significant performance improvements over prior methods while being more data-efficient. Additionally, NeRAF enhances novel view synthesis of complex scenes trained with sparse data through cross-modal learning. NeRAF is designed as a Nerfstudio module, providing convenient access to realistic audio-visual generation.
[372] PAGURI: a user experience study of creative interaction with text-to-music models
Francesca Ronchini, Luca Comanducci, Gabriele Perego, Fabio Antonacci
Main category: cs.SD
TL;DR: User experience study (PAGURI) investigating how musicians interact with text-to-music models, revealing both creative potential and significant limitations in real-world music creation workflows.
Details
Motivation: To understand how text-to-music models can be realistically integrated into artistic practice and evaluate musician satisfaction with these systems.
Method: Developed an online tool for music generation and personalization via fine-tuning, then conducted semi-structured interviews with musicians to analyze their interactions and experiences.
Result: Participants recognized creative potential and appreciated personalization features, but identified challenges including prompt ambiguity, limited controllability, and workflow integration issues.
Conclusion: Text-to-music models show promise but require addressing key usability challenges before being ready for real-world music creation, with user experiences guiding future development.
Abstract: In recent years, text-to-music (TTM) models have been the biggest breakthrough in automatic music generation. While they are unquestionably a showcase of technological progress, it is not clear yet how they can be realistically integrated into the artistic practice of musicians and music practitioners. This paper aims to address this question via Prompt Audio Generation User Research Investigation (PAGURI), a user experience study where we leverage recent text-to-music developments to study how musicians and practitioners interact with these systems, evaluating their satisfaction levels. We developed an online tool through which users can generate music samples and/or apply recently proposed personalization techniques based on fine-tuning to allow the text-to-music model to generate sounds closer to their needs and preferences. Using semi-structured interviews, we analyzed different aspects related to how participants interacted with the proposed tool to understand the current effectiveness and limitations of text-to-music models in enhancing users’ creativity. Our research centers on user experiences to uncover insights that can guide the future development of TTM models and their role in AI-driven music creation. Additionally, participants offered insightful perspectives on potential system improvements and their integration into their music practices. The results obtained through the study reveal the pros and cons of the use of TTMs for creative endeavors. Participants recognized the system’s creative potential and appreciated the usefulness of its personalization features. However, they also identified several challenges that must be addressed before TTMs are ready for real-world music creation, particularly issues of prompt ambiguity, limited controllability, and integration into existing workflows.
[373] MambAttention: Mamba with Multi-Head Attention for Generalizable Single-Channel Speech Enhancement
Nikolai Lund Kühne, Jesper Jensen, Jan Østergaard, Zheng-Hua Tan
Main category: cs.SD
TL;DR: Proposes MambAttention, a hybrid model combining Mamba with shared time-frequency attention modules for speech enhancement, showing superior generalization on out-of-domain datasets compared to existing models.
Details
Motivation: Sequence models like Mamba and LSTM tend to overfit to training data, and while adding self-attention to LSTMs improves generalization, hybrid Mamba-attention models haven't been explored for speech enhancement.
Method: Developed the MambAttention architecture, which integrates Mamba with shared time- and frequency-multi-head attention modules (see the sketch below). Created the VB-DemandEx dataset with more challenging noise types and lower SNRs for training.
Result: MambAttention significantly outperforms LSTM-, xLSTM-, Mamba-, and Conformer-based systems on out-of-domain datasets (DNS 2020 and EARS-WHAM_v2) while matching performance on in-domain VB-DemandEx. Ablation studies confirm importance of weight sharing in attention modules.
Conclusion: The hybrid MambAttention model demonstrates superior generalization capabilities for speech enhancement, with shared time-frequency attention playing a crucial role. While similar attention integration benefits LSTM and xLSTM, MambAttention remains the best performer.
Abstract: With the advent of new sequence models like Mamba and xLSTM, several studies have shown that these models match or outperform state-of-the-art models in single-channel speech enhancement, automatic speech recognition, and self-supervised audio representation learning. However, prior research has demonstrated that sequence models like LSTM and Mamba tend to overfit to the training set. To address this issue, previous works have shown that adding self-attention to LSTMs substantially improves generalization performance for single-channel speech enhancement. Nevertheless, neither the concept of hybrid Mamba and time-frequency attention models nor their generalization performance have been explored for speech enhancement. In this paper, we propose a novel hybrid architecture, MambAttention, which combines Mamba and shared time- and frequency-multi-head attention modules for generalizable single-channel speech enhancement. To train our model, we introduce VoiceBank+Demand Extended (VB-DemandEx), a dataset inspired by VoiceBank+Demand but with more challenging noise types and lower signal-to-noise ratios. Trained on VB-DemandEx, our proposed MambAttention model significantly outperforms existing state-of-the-art LSTM-, xLSTM-, Mamba-, and Conformer-based systems of similar complexity across all reported metrics on two out-of-domain datasets: DNS 2020 and EARS-WHAM_v2, while matching their performance on the in-domain dataset VB-DemandEx. Ablation studies highlight the role of weight sharing between the time- and frequency-multi-head attention modules for generalization performance. Finally, we explore integrating the shared time- and frequency-multi-head attention modules with LSTM and xLSTM, which yields a notable performance improvement on the out-of-domain datasets. However, our MambAttention model remains superior on both out-of-domain datasets across all reported evaluation metrics.
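A minimal PyTorch sketch of the shared time- and frequency-attention idea: a single multi-head attention module whose weights are reused along both axes of the spectrogram feature map. The tensor layout and the surrounding Mamba blocks are assumptions for illustration, not the paper's implementation.
```python
import torch
import torch.nn as nn

class SharedTFAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        # One MHA instance -> its weights are shared by both axes.
        self.mha = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def attend(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim) -- self-attention along `seq`.
        out, _ = self.mha(x, x, x)
        return out

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, freq, dim)
        b, t, f, d = x.shape
        # Time attention: fold the frequency axis into the batch dimension.
        xt = x.permute(0, 2, 1, 3).reshape(b * f, t, d)
        xt = self.attend(xt).reshape(b, f, t, d).permute(0, 2, 1, 3)
        # Frequency attention with the *same* weights: fold time into batch.
        xf = xt.reshape(b * t, f, d)
        return self.attend(xf).reshape(b, t, f, d)

x = torch.randn(2, 100, 64, 32)            # (batch, time, freq, dim)
print(SharedTFAttention(32)(x).shape)      # torch.Size([2, 100, 64, 32])
```
Reusing one attention instance for both axes mirrors the weight sharing that the paper's ablations tie to the generalization gains.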
[374] Unmute the Patch Tokens: Rethinking Probing in Multi-Label Audio Classification
Lukas Rauch, René Heinrich, Houtan Ghaffari, Lukas Miklautz, Ilyass Moummad, Bernhard Sick, Christoph Scholz
Main category: cs.SD
TL;DR: The paper introduces binarized prototypical probes as a lightweight pooling method that outperforms linear and attentive probing for evaluating self-supervised audio models, challenging the need for costly fine-tuning.
Details
Motivation: Global pooling creates an information bottleneck that causes linear probes to misrepresent embedding quality in audio SSL models, particularly for multi-label audio where localized events are dispersed. There is a mismatch between pretraining objectives (global) and downstream tasks (localized events).
Method: Introduces binarized prototypical probes, a lightweight pooling method that learns prototypes to perform class-wise information aggregation (see the sketch below). The method is evaluated across 13 datasets and 6 spectrogram-based encoders.
Result: The proposed method notably outperforms linear and attentive probing approaches despite its simplicity. It establishes probing as a competitive and efficient paradigm for evaluating audio SSL models.
Conclusion: Probing can be a competitive alternative to fine-tuning for evaluating audio SSL models, challenging the current reliance on costly fine-tuning procedures.
Abstract: Although probing frozen models has become a standard evaluation paradigm, self-supervised learning in audio defaults to fine-tuning. A key reason is that global pooling creates an information bottleneck causing linear probes to misrepresent the embedding quality: The $\texttt{cls}$-token discards crucial token information about dispersed, localized events in multi-label audio. This weakness is rooted in the mismatch between the pretraining objective (operating globally) and the downstream task (localized events). Across a comprehensive benchmark of 13 datasets and 6 spectrogram-based encoders, we first investigate the global pooling bottleneck. We then introduce binarized prototypical probes: a lightweight and simple pooling method that learns prototypes to perform class-wise information aggregation. Despite its simplicity, our method notably outperforms linear and attentive probing. Our work establishes probing as a competitive and efficient paradigm for evaluating audio SSL models, challenging the reliance on costly fine-tuning.
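A hedged sketch of the class-wise aggregation idea behind prototypical probing: each class learns a prototype and pools token-prototype similarities, so a localized event only needs to match somewhere in the token sequence. The max-pooling and the omitted binarization step are simplifications of the paper's method.
```python
import torch
import torch.nn as nn

class PrototypicalProbe(nn.Module):
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_classes, dim) * 0.02)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim) frozen patch-token embeddings.
        sims = torch.einsum("btd,cd->btc", tokens, self.prototypes)
        # Class-wise aggregation: a dispersed, localized event only needs to
        # match its class prototype at *some* token, hence max over tokens.
        return sims.max(dim=1).values  # (batch, num_classes) multi-label logits

probe = PrototypicalProbe(dim=768, num_classes=527)  # hypothetical label space
logits = probe(torch.randn(4, 250, 768))
loss = nn.functional.binary_cross_entropy_with_logits(
    logits, torch.rand(4, 527).round())
```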
[375] ARIONet: An Advanced Self-supervised Contrastive Representation Network for Birdsong Classification and Future Frame Prediction
Md. Abdur Rahman, Selvarajah Thuseethan, Kheng Cher Yeo, Reem E. Mohamed, Sami Azam
Main category: cs.SD
TL;DR: ARIONet is a self-supervised contrastive network for birdsong classification that jointly optimizes contrastive learning and future frame prediction using multiple audio features, achieving state-of-the-art performance on four datasets without requiring large-scale annotations.
Details
Motivation: Existing birdsong classification methods heavily depend on labeled data, use limited feature representations, and overlook temporal dynamics essential for accurate species identification, limiting their applicability in ecological monitoring.
Method: Proposes ARIONet, a transformer-based encoder model that integrates multiple audio features and simultaneously performs contrastive learning (maximizing similarity between augmented views of the same audio) and future frame prediction to capture temporal dynamics (see the sketch below).
Result: Achieved classification accuracies of 98.41%, 93.07%, 91.89%, and 91.58% on four birdsong datasets, with F1-scores of 97.84%, 94.10%, 91.29%, and 90.94%. Also demonstrated low mean absolute errors and cosine similarity of up to 95% in future frame prediction tasks.
Conclusion: The self-supervised learning strategy effectively captures complex acoustic patterns and temporal dependencies, showing strong potential for real-world ecological conservation and monitoring applications.
Abstract: Automated birdsong classification is essential for advancing ecological monitoring and biodiversity studies. Despite recent progress, existing methods often depend heavily on labeled data, use limited feature representations, and overlook temporal dynamics essential for accurate species identification. In this work, we propose a self-supervised contrastive network, ARIONet (Acoustic Representation for Interframe Objective Network), that jointly optimizes contrastive classification and future frame prediction using augmented audio representations. The model simultaneously integrates multiple complementary audio features within a transformer-based encoder model. Our framework is designed with two key objectives: (1) to learn discriminative species-specific representations for contrastive learning through maximizing similarity between augmented views of the same audio segment while pushing apart different samples, and (2) to model temporal dynamics by predicting future audio frames, both without requiring large-scale annotations. We validate our framework on four diverse birdsong datasets, including the British Birdsong Dataset, Bird Song Dataset, and two extended Xeno-Canto subsets (A-M and N-Z). Our method consistently outperforms existing baselines and achieves classification accuracies of 98.41%, 93.07%, 91.89%, and 91.58%, and F1-scores of 97.84%, 94.10%, 91.29%, and 90.94%, respectively. Furthermore, it demonstrates low mean absolute errors and high cosine similarity, up to 95%, in future frame prediction tasks. Extensive experiments further confirm the effectiveness of our self-supervised learning strategy in capturing complex acoustic patterns and temporal dependencies, as well as its potential for real-world applicability in ecological conservation and monitoring.
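A minimal sketch of the two training signals, assuming a standard InfoNCE-style contrastive loss and an MSE prediction loss; the encoder, augmentations, and loss weighting are assumptions rather than the paper's exact objectives.
```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.1):
    # z1, z2: (batch, dim) embeddings of two augmented views of the same clip.
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature        # (batch, batch) similarity matrix
    targets = torch.arange(z1.size(0))        # positives sit on the diagonal
    return F.cross_entropy(logits, targets)

def future_frame_loss(predicted, actual):
    # predicted/actual: (batch, frame_dim) next-frame audio features.
    return F.mse_loss(predicted, actual)

# Joint objective (lam is a hypothetical weighting):
# total = contrastive_loss(z1, z2) + lam * future_frame_loss(pred, target)
```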
[376] XPPG-PCA: Reference-free automatic speech severity evaluation with principal components
Bence Mark Halpern, Thomas B. Tienkamp, Teja Rebernik, Rob J. J. H. van Son, Sebastiaan A. H. J. de Visscher, Max J. H. Witjes, Defne Abur, Tomoki Toda
Main category: cs.SD
TL;DR: XPPG-PCA is a novel unsupervised, reference-free method for speech pathology severity evaluation that performs comparably to or better than established reference-based methods, offering robust and generalizable assessment without requiring transcriptions or healthy speech samples.
Details
Motivation: Current expert evaluations by speech-language pathologists are subjective, time-consuming, and costly, limiting reproducibility and straining healthcare resources. Existing automated methods have drawbacks: reference-based approaches need transcriptions or healthy samples, while reference-free methods suffer from spurious shortcuts or unreliable handcrafted features.
Method: XPPG-PCA (x-vector phonetic posteriorgram principal component analysis) is an unsupervised, reference-free method that uses phonetic posteriorgrams and principal component analysis to evaluate speech pathology severity without requiring transcriptions or reference samples.
Result: Using three Dutch oral cancer datasets, XPPG-PCA performs comparably to or exceeds established reference-based methods. It demonstrates robustness against data shortcuts and noise, showing potential for real-world clinical use.
Conclusion: XPPG-PCA provides a robust, generalizable solution for objective assessment of speech pathology, with potential to significantly improve efficiency and reliability of clinical evaluations across various disorders. An open-source implementation is available.
Abstract: Reliably evaluating the severity of a speech pathology is crucial in healthcare. However, the current reliance on expert evaluations by speech-language pathologists presents several challenges: while their assessments are highly skilled, they are also subjective, time-consuming, and costly, which can limit the reproducibility of clinical studies and place a strain on healthcare resources. While automated methods exist, they have significant drawbacks. Reference-based approaches require transcriptions or healthy speech samples, restricting them to read speech and limiting their applicability. Existing reference-free methods are also flawed; supervised models often learn spurious shortcuts from data, while handcrafted features are often unreliable and restricted to specific speech tasks. This paper introduces XPPG-PCA (x-vector phonetic posteriorgram principal component analysis), a novel, unsupervised, reference-free method for speech severity evaluation. Using three Dutch oral cancer datasets, we demonstrate that XPPG-PCA performs comparably to, or exceeds established reference-based methods. Our experiments confirm its robustness against data shortcuts and noise, showing its potential for real-world clinical use. Taken together, our results show that XPPG-PCA provides a robust, generalizable solution for the objective assessment of speech pathology, with the potential to significantly improve the efficiency and reliability of clinical evaluations across a range of disorders. An open-source implementation is available.
[377] FlexiCodec: A Dynamic Neural Audio Codec for Low Frame Rates
Jiaqi Li, Yao Qian, Yuxuan Hu, Leying Zhang, Xiaofei Wang, Heng Lu, Manthan Thakker, Jinyu Li, Sheng Zhao, Zhizheng Wu
Main category: cs.SD
TL;DR: FlexiCodec is a neural audio codec that uses dynamic frame rates (3Hz-12.5Hz) to preserve semantic information in very low frame rate scenarios, featuring ASR-assisted dual stream encoding and Transformer bottlenecks.
Details
Motivation: Existing low-frame-rate audio codecs (12.5 Hz) lose semantic information at even lower rates, which limits their effectiveness for speech language models that require shorter sequences for computational efficiency.
Method: Uses a dynamic frame rate approach with adaptive frame merging in information-sparse regions (see the sketch below), ASR feature-assisted dual-stream encoding, and Transformer bottlenecks to preserve semantic information.
Result: Outperforms baseline systems in semantic preservation and audio reconstruction quality at 6.25Hz, 8.3Hz, and 12.5Hz average frame rates, and works effectively in language model-based TTS.
Conclusion: FlexiCodec successfully addresses semantic loss in very low frame rate audio codecs through dynamic frame rate control and novel architectural components, enabling better performance for speech language models.
Abstract: Neural audio codecs are foundational to speech language models. It is expected to have a low frame rate and decoupled semantic and acoustic information. A lower frame rate codec can reduce the computational cost of speech language models by shortening the sequence length. Recent studies have developed 12.5Hz low-frame-rate audio codecs, but even lower frame rate codecs remain underexplored. We find that a major challenge for very low frame rate tokens is missing semantic information. This paper introduces FlexiCodec to address this limitation. FlexiCodec improves semantic preservation with a dynamic frame rate approach and introduces a novel architecture featuring an ASR feature-assisted dual stream encoding and Transformer bottlenecks. With dynamic frame rates, it uses less frames at information-sparse regions through adaptively merging semantically similar frames. A dynamic frame rate also allows FlexiCodec to support inference-time controllable frame rates between 3Hz and 12.5Hz. Experiments on 6.25Hz, 8.3Hz and 12.5Hz average frame rates confirm that FlexiCodec excels over baseline systems in semantic information preservation and delivers a high audio reconstruction quality. We also validate the effectiveness of FlexiCodec in language model-based TTS. Demos are available at: https://flexicodec.github.io
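A hedged sketch of the adaptive frame-merging step: adjacent frames with similar semantic features are averaged so that information-sparse regions use fewer tokens. The cosine-similarity rule and threshold are illustrative assumptions; the paper derives its semantic features from ASR.
```python
import torch
import torch.nn.functional as F

def merge_similar_frames(frames: torch.Tensor, threshold: float = 0.9):
    # frames: (num_frames, dim) semantic features for one utterance.
    merged, group = [], [frames[0]]
    for t in range(1, frames.size(0)):
        sim = F.cosine_similarity(frames[t], group[-1], dim=0)
        if sim > threshold:                  # information-sparse: extend group
            group.append(frames[t])
        else:                                # information-dense: close group
            merged.append(torch.stack(group).mean(dim=0))
            group = [frames[t]]
    merged.append(torch.stack(group).mean(dim=0))
    return torch.stack(merged)               # (num_merged, dim), <= num_frames

x = torch.randn(125, 256)                    # e.g., 10 s at a 12.5 Hz base rate
print(merge_similar_frames(x).shape)
```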
cs.LG
[378] Accelerating Long-Term Molecular Dynamics with Physics-Informed Time-Series Forecasting
Hung Le, Sherif Abbas, Minh Hoang Nguyen, Van Dai Do, Huu Hiep Nguyen, Dung Nguyen
Main category: cs.LG
TL;DR: A novel approach formulates molecular dynamics simulation as a time-series forecasting problem using displacement-based trajectory prediction with physics-informed constraints to enable efficient and accurate atomic-scale simulations.
Details
Motivation: Traditional DFT methods for molecular dynamics simulations are computationally expensive, limiting the feasibility of long-term simulations needed for materials science and biophysics research.
Method: Formulates MD simulation as time-series forecasting, predicts atomic trajectories via displacements rather than absolute positions, and incorporates a physics-informed loss using DFT-parametrized Morse potential functions to enforce physical plausibility (see the sketch below).
Result: Consistently surpasses standard baselines in simulation accuracy across diverse materials, enables stable modeling of thousands of MD steps in minutes, offers scalable alternative to costly DFT simulations.
Conclusion: The method demonstrates the importance of incorporating physics knowledge to enhance reliability and precision of atomic trajectory forecasting, providing an efficient solution for long-term molecular dynamics simulations.
Abstract: Efficient molecular dynamics (MD) simulation is vital for understanding atomic-scale processes in materials science and biophysics. Traditional density functional theory (DFT) methods are computationally expensive, which limits the feasibility of long-term simulations. We propose a novel approach that formulates MD simulation as a time-series forecasting problem, enabling advanced forecasting models to predict atomic trajectories via displacements rather than absolute positions. We incorporate a physics-informed loss and inference mechanism based on DFT-parametrised pair-wise Morse potential functions that penalize unphysical atomic proximity to enforce physical plausibility. Our method consistently surpasses standard baselines in simulation accuracy across diverse materials. The results highlight the importance of incorporating physics knowledge to enhance the reliability and precision of atomic trajectory forecasting. Remarkably, it enables stable modeling of thousands of MD steps in minutes, offering a scalable alternative to costly DFT simulations.
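A minimal sketch of the physics-informed term, using the standard Morse pair potential V(r) = D_e(1 - e^{-a(r - r_e)})^2 evaluated on predicted positions to penalize unphysical atomic proximity. The parameter values and loss weighting below are placeholders, not the paper's DFT-fitted constants.
```python
import torch

def morse_energy(positions, d_e=1.0, a=1.5, r_e=2.0, cutoff=5.0):
    # positions: (num_atoms, 3) predicted atomic coordinates.
    dists = torch.cdist(positions, positions)            # (N, N) pair distances
    mask = (dists > 0) & (dists < cutoff)                # skip self and far pairs
    r = dists[mask]
    v = d_e * (1.0 - torch.exp(-a * (r - r_e))) ** 2     # Morse pair energy
    return v.sum() / 2.0                                 # each pair counted twice

def physics_informed_loss(pred_disp, true_disp, prev_pos, lam=0.01):
    data_loss = torch.nn.functional.mse_loss(pred_disp, true_disp)
    energy = morse_energy(prev_pos + pred_disp)          # implausible geometry cost
    return data_loss + lam * energy

pos = torch.rand(32, 3) * 10
print(morse_energy(pos))
```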
[379] Control the Temperature: Selective Sampling for Diverse and High-Quality LLM Outputs
Sergey Troshin, Wafaa Mohammed, Yan Meng, Christof Monz, Antske Fokkens, Vlad Niculae
Main category: cs.LG
TL;DR: Selective sampling dynamically switches between greedy and high-temperature sampling based on a sampling risk metric to improve diversity while maintaining accuracy in mathematical reasoning tasks.
Details
Motivation: Temperature-based sampling increases diversity but degrades reasoning quality in precision-critical tasks like mathematical reasoning, due to sampling incorrect continuations in sensitive decoding positions.
Method: Proposes selective sampling with a lightweight classifier that predicts sampling risk for each token position, allowing dynamic switching between greedy and high-temperature sampling based on the estimated likelihood of errors (see the sketch below).
Result: Experiments on mathematical reasoning tasks show that selective sampling enhances the quality-diversity trade-off, maintaining accuracy even in high-temperature settings.
Conclusion: Selective sampling effectively addresses the accuracy-diversity trade-off in precision-critical tasks by dynamically controlling sampling behavior based on position-specific risk assessment.
Abstract: Diversity is an essential metric for evaluating the creativity of outputs generated by language models. Temperature-based sampling is a common strategy to increase diversity. However, for tasks that require high precision, e.g., mathematical reasoning, uncontrolled high temperature sampling, e.g., min-$p$ or top-$p$, degrades reasoning quality. We demonstrate that the loss of accuracy is caused by sampling incorrect continuations in sensitive decoding positions. To address this, in this paper, we propose \textbf{selective sampling}, a method that dynamically switches between greedy and high-temperature sampling based on a sampling risk metric. This risk metric estimates the likelihood of output errors when applying high-temperature sampling on the current token position. To predict sampling risk, we train a lightweight classifier on a small subset of verifiable problems. The trained classifier can be integrated with the base language model with minimal latency overhead. Experiments on mathematical reasoning tasks demonstrate that selective sampling enhances the quality-diversity trade-off, even in high-temperature settings.
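A minimal sketch of the decoding rule: a lightweight risk classifier gates each position between greedy decoding and high-temperature sampling. The classifier interface and the 0.5 threshold are assumptions; the paper trains its classifier on verifiable problems.
```python
import torch

def selective_sample(logits, hidden_state, risk_classifier,
                     temperature=1.2, risk_threshold=0.5):
    # logits: (vocab,) next-token logits; hidden_state: classifier features.
    risk = risk_classifier(hidden_state)     # est. prob. that sampling here errs
    if risk > risk_threshold:
        return int(logits.argmax())          # sensitive position: decode greedily
    probs = torch.softmax(logits / temperature, dim=-1)
    return int(torch.multinomial(probs, 1))  # safe position: sample for diversity
```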
[380] Automated Extraction of Material Properties using LLM-based AI Agents
Subham Ghosh, Abhishek Tewari
Main category: cs.LG
TL;DR: An LLM-driven workflow autonomously extracts thermoelectric and structural properties from 10,000 scientific articles, creating the largest thermoelectric dataset to date with 27,822 records.
Details
Motivation: Materials discovery is constrained by lack of large, machine-readable datasets that couple performance metrics with structural context, with existing databases being small, manually curated, or biased toward first principles results.
Method: Agentic LLM-driven workflow integrating dynamic token allocation, zero-shot multi-agent extraction, and conditional table parsing to balance accuracy against computational cost.
Result: GPT-4.1 achieved highest accuracy (F1=0.91 for thermoelectric, 0.82 for structural), while GPT-4.1 Mini delivered comparable performance at lower cost. Curated 27,822 temperature-resolved property records with normalized units and structural attributes.
Conclusion: The study delivers the largest LLM-curated thermoelectric dataset, provides reproducible extraction pipeline, and establishes foundation for scalable, data-driven materials discovery beyond thermoelectrics.
Abstract: The rapid discovery of materials is constrained by the lack of large, machine-readable datasets that couple performance metrics with structural context. Existing databases are either small, manually curated, or biased toward first-principles results, leaving experimental literature underexploited. We present an agentic, large language model (LLM)-driven workflow that autonomously extracts thermoelectric and structural properties from about 10,000 full-text scientific articles. The pipeline integrates dynamic token allocation, zero-shot multi-agent extraction, and conditional table parsing to balance accuracy against computational cost. Benchmarking on 50 curated papers shows that GPT-4.1 achieves the highest accuracy (F1 = 0.91 for thermoelectric properties and 0.82 for structural fields), while GPT-4.1 Mini delivers nearly comparable performance (F1 = 0.89 and 0.81) at a fraction of the cost, enabling practical large-scale deployment. Applying this workflow, we curated 27,822 temperature-resolved property records with normalized units, spanning figure of merit (ZT), Seebeck coefficient, conductivity, resistivity, power factor, and thermal conductivity, together with structural attributes such as crystal class, space group, and doping strategy. Dataset analysis reproduces known thermoelectric trends, such as the superior performance of alloys over oxides and the advantage of p-type doping, while also surfacing broader structure-property correlations. To facilitate community access, we release an interactive web explorer with semantic filters, numeric queries, and CSV export. This study delivers the largest LLM-curated thermoelectric dataset to date, provides a reproducible and cost-profiled extraction pipeline, and establishes a foundation for scalable, data-driven materials discovery beyond thermoelectrics.
[381] RSAVQ: Riemannian Sensitivity-Aware Vector Quantization for Large Language Models
Zukang Xu, Xing Hu, Qiang Wu, Dawei Yang
Main category: cs.LG
TL;DR: RSAVQ is a novel vector quantization framework that improves extremely low-bit quantization for LLMs using geometry-driven innovations to address direction error and bit allocation challenges.
Details
Motivation: LLMs face deployment challenges on resource-constrained devices due to their large parameter counts. Existing vector quantization methods suffer from unconstrained direction error and suboptimal bit allocation in low-bit quantization (2-4 bits).
Method: RSAVQ introduces two key innovations: (1) Error Direction Sensitivity Guidance (EDSG), which uses the Fisher Information Matrix-induced Riemannian metric to project quantization errors along the negative natural gradient direction (see the worked equations below), and (2) Weight Channel Sensitivity Guidance (WCSG), which uses FIM curvature analysis for dynamic bit allocation.
Result: RSAVQ outperforms existing methods, achieving 0.4 lower perplexity and 1.5 points higher zero-shot accuracy in 2-bit quantization of LLaMA-3 8B compared to baselines such as VPTQ and QuIP#.
Conclusion: The work provides a practical solution for constrained environments and establishes a theoretical bridge between information geometry and neural network quantization, advancing efficient deep learning.
Abstract: Large language models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks. However, their exponentially increasing parameters pose significant challenges for deployment on resource-constrained devices. Vector Quantization (VQ) shows great promise for low-bit quantization (e.g., 2 to 4 bits), but existing work faces two key challenges: unconstrained direction error and suboptimal bit allocation. In this paper, we propose RSAVQ, a novel VQ framework to enhance extremely low-bit quantization for LLMs. RSAVQ introduces two geometry-driven innovations that effectively mitigate above limitations: (1) Error Direction Sensitivity Guidance (EDSG), which leverages the Fisher Information Matrix (FIM)-induced Riemannian metric to project quantization errors onto low-sensitivity directions in the parameter space. Specifically, this projection is performed along the negative natural gradient direction, which effectively suppresses error expansion. (2) Weight Channel Sensitivity Guidance (WCSG) , which constructs a channel-wise sensitivity metric via FIM curvature analysis to dynamically guide bit resource allocation. The approach facilitates a globally optimal quantization solution within prescribed bit constraints. Experiments demonstrate that RSAVQ outperforms existing methods for LLMs. For example, in 2-bit quantization of LLaMA-3 8B, RSAVQ leads baselines like VPTQ and QuIP# by 0.4 in perplexity (PPL) and 1.5 in zero-shot accuracy. This work offers a practical solution for constrained environments and a theoretical bridge between information geometry and the quantization of neural networks, advancing efficient deep learning.
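A worked sketch of the geometry behind EDSG under the standard second-order approximation (notation ours, not the paper's):
```latex
% A quantization error \delta w increases the loss by roughly its energy in
% the FIM-induced Riemannian metric:
\Delta\mathcal{L} \;\approx\; \tfrac{1}{2}\,\delta w^{\top} F\, \delta w,
\qquad
F \;=\; \mathbb{E}\big[\nabla_{w}\log p_{w}\,\nabla_{w}\log p_{w}^{\top}\big].
% Low-sensitivity directions are eigenvectors of F with small eigenvalues.
% The natural gradient rescales the ordinary gradient by the inverse metric,
\tilde{\nabla}\mathcal{L} \;=\; F^{-1}\nabla\mathcal{L},
% so steering \delta w along -\tilde{\nabla}\mathcal{L} (EDSG) keeps the error
% in directions where \delta w^{\top} F\, \delta w stays small.
```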
[382] Adaptive Federated Learning Defences via Trust-Aware Deep Q-Networks
Vedant Palit
Main category: cs.LG
TL;DR: A trust-aware Deep Q-Network defense for federated learning that integrates multi-signal evidence to mitigate poisoning and backdoor attacks under partial observability.
Details
Motivation: Federated learning is vulnerable to poisoning and backdoor attacks, especially under partial observability conditions where complete client information is unavailable.
Method: Formulates the defense as a partially observable sequential decision problem and introduces a trust-aware Deep Q-Network that integrates multi-signal evidence into client trust updates while optimizing long-horizon robustness-accuracy objectives.
Result: On CIFAR-10: (i) established baseline with steadily improving accuracy; (ii) increased client overlap improves accuracy and reduces attack success rate (ASR) with stable detection; (iii) accuracy remains steady while ASR increases and ROC-AUC declines as observability is reduced, showing sequential belief updates mitigate weaker signals.
Conclusion: DQN achieves the best robustness-accuracy trade-off compared to random, linear-Q, and policy gradient controllers, demonstrating effective defense against poisoning and backdoor attacks in federated learning.
Abstract: Federated learning is vulnerable to poisoning and backdoor attacks under partial observability. We formulate defence as a partially observable sequential decision problem and introduce a trust-aware Deep Q-Network that integrates multi-signal evidence into client trust updates while optimizing a long-horizon robustness–accuracy objective. On CIFAR-10, we (i) establish a baseline showing steadily improving accuracy, (ii) show through a Dirichlet sweep that increased client overlap consistently improves accuracy and reduces ASR with stable detection, and (iii) demonstrate in a signal-budget study that accuracy remains steady while ASR increases and ROC-AUC declines as observability is reduced, which highlights that sequential belief updates mitigate weaker signals. Finally, a comparison with random, linear-Q, and policy gradient controllers confirms that DQN achieves the best robustness–accuracy trade-off.
[383] RSTGCN: Railway-centric Spatio-Temporal Graph Convolutional Network for Train Delay Prediction
Koyena Chowdhury, Paramita Koley, Abhijnan Chakraborty, Saptarshi Ghosh
Main category: cs.LG
TL;DR: Proposes RSTGCN, a spatio-temporal graph convolutional network with train frequency-aware attention, for forecasting average arrival delays at railway stations, evaluated on the largest railway dataset studied to date, covering the Indian Railway Network.
Details
Motivation: Accurate train delay prediction is critical for efficient railway operations and traffic management, with recent focus shifting from individual train delays to station-level prediction for better scheduling decisions.
Method: Railway-centric Spatio-Temporal Graph Convolutional Network (RSTGCN) with architectural innovations including train frequency-aware spatial attention and novel feature integrations.
Result: Extensive experiments show consistent improvements across standard metrics compared to multiple state-of-the-art baselines.
Conclusion: The work advances average delay prediction modeling in large-scale rail networks and provides an open dataset (Indian Railway Network with 4,735 stations) to encourage further research.
Abstract: Accurate prediction of train delays is critical for efficient railway operations, enabling better scheduling and dispatching decisions. While earlier approaches have largely focused on forecasting the exact delays of individual trains, recent studies have begun exploring station-level delay prediction to support higher-level traffic management. In this paper, we propose the Railway-centric Spatio-Temporal Graph Convolutional Network (RSTGCN), designed to forecast average arrival delays of all the incoming trains at railway stations for a particular time period. Our approach incorporates several architectural innovations and novel feature integrations, including train frequency-aware spatial attention, which significantly enhances predictive performance. To support this effort, we curate and release a comprehensive dataset for the entire Indian Railway Network (IRN), spanning 4,735 stations across 17 zones - the largest and most diverse railway network studied to date. We conduct extensive experiments using multiple state-of-the-art baselines, demonstrating consistent improvements across standard metrics. Our work not only advances the modeling of average delay prediction in large-scale rail networks but also provides an open dataset to encourage further research in this critical domain.
[384] Budgeted Broadcast: An Activity-Dependent Pruning Rule for Neural Network Efficiency
Yaron Meirovitch, Fuming Yang, Jeff Lichtman, Nir Shavit
Main category: cs.LG
TL;DR: Budgeted Broadcast (BB) is a pruning method that uses local traffic budgets (activity × fan-out) to maximize coding entropy, achieving better accuracy at matched sparsity across various architectures.
Details
Motivation: Most pruning methods remove parameters based on loss impact (magnitude/gradient), but BB aims to maximize coding entropy under traffic constraints for more diverse representations.
Method: BB assigns local traffic budgets to units, uses a constrained-entropy analysis to derive the selectivity-audience balance, and enforces it with local actuators that prune fan-in (to lower activity) or fan-out (to reduce broadcast); see the sketch below.
Result: BB increases coding entropy and decorrelation, improves accuracy at matched sparsity across Transformers, ResNets, and 3D U-Nets, sometimes exceeding dense baselines. Achieves SOTA F1 and PR-AUC on electron microscopy images.
Conclusion: BB is easy to integrate and suggests a path toward learning more diverse and efficient representations through traffic-constrained entropy maximization.
Abstract: Most pruning methods remove parameters ranked by impact on loss (e.g., magnitude or gradient). We propose Budgeted Broadcast (BB), which gives each unit a local traffic budget (the product of its long-term on-rate $a_i$ and fan-out $k_i$). A constrained-entropy analysis shows that maximizing coding entropy under a global traffic budget yields a selectivity-audience balance, $\log\frac{1-a_i}{a_i}=\beta k_i$. BB enforces this balance with simple local actuators that prune either fan-in (to lower activity) or fan-out (to reduce broadcast). In practice, BB increases coding entropy and decorrelation and improves accuracy at matched sparsity across Transformers for ASR, ResNets for face identification, and 3D U-Nets for synapse prediction, sometimes exceeding dense baselines. On electron microscopy images, it attains state-of-the-art F1 and PR-AUC under our evaluation protocol. BB is easy to integrate and suggests a path toward learning more diverse and efficient representations.
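A hedged sketch of the local actuator implied by the balance condition $\log\frac{1-a_i}{a_i}=\beta k_i$: when a unit's traffic exceeds what its selectivity warrants, prune fan-in (to lower its on-rate) or fan-out (to shrink its audience). The choice heuristic below is an illustrative assumption.
```python
import math

def bb_prune_step(a_i: float, k_i: int, beta: float):
    """Suggest which side of unit i to prune, or None if already balanced."""
    selectivity = math.log((1.0 - a_i) / a_i)   # large when the unit fires rarely
    imbalance = selectivity - beta * k_i
    if imbalance >= 0:
        return None  # selective enough for its audience; pruning cannot help here
    # Too much traffic for this selectivity: fire less, or talk to fewer units.
    # Pruning fan-in lowers the on-rate a_i; pruning fan-out lowers k_i.
    return "prune_fan_in" if a_i > 0.5 else "prune_fan_out"

print(bb_prune_step(a_i=0.8, k_i=50, beta=0.05))   # -> 'prune_fan_in'
print(bb_prune_step(a_i=0.05, k_i=10, beta=0.05))  # -> None (already selective)
```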
[385] A Framework for Scalable Heterogeneous Multi-Agent Adversarial Reinforcement Learning in IsaacLab
Isaac Peterson, Christopher Allred, Jacob Morrey, Mario Harper
Main category: cs.LG
TL;DR: Extends IsaacLab framework for adversarial multi-agent reinforcement learning in physics simulations, introducing competitive environments and HAPPO variant for training robust policies.
Details
Motivation: Adversarial interactions are critical for real-world applications like pursuit-evasion and security, but prior MARL work focused mainly on collaborative settings.
Method: Extended IsaacLab framework with adversarial MARL environments featuring heterogeneous agents with asymmetric goals, integrated competitive HAPPO for efficient training.
Result: Framework successfully models and trains robust policies for morphologically diverse multi-agent competition while maintaining high throughput and simulation realism.
Conclusion: The platform enables scalable training of adversarial policies in high-fidelity physics simulations, advancing MARL for competitive applications.
Abstract: Multi-Agent Reinforcement Learning (MARL) is central to robotic systems cooperating in dynamic environments. While prior work has focused on these collaborative settings, adversarial interactions are equally critical for real-world applications such as pursuit-evasion, security, and competitive manipulation. In this work, we extend the IsaacLab framework to support scalable training of adversarial policies in high-fidelity physics simulations. We introduce a suite of adversarial MARL environments featuring heterogeneous agents with asymmetric goals and capabilities. Our platform integrates a competitive variant of Heterogeneous Agent Reinforcement Learning with Proximal Policy Optimization (HAPPO), enabling efficient training and evaluation under adversarial dynamics. Experiments across several benchmark scenarios demonstrate the framework’s ability to model and train robust policies for morphologically diverse multi-agent competition while maintaining high throughput and simulation realism. Code and benchmarks are available at: https://github.com/DIRECTLab/IsaacLab-HARL .
[386] RLP: Reinforcement as a Pretraining Objective
Ali Hatamizadeh, Syeda Nahida Akter, Shrimai Prabhumoye, Jan Kautz, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Yejin Choi
Main category: cs.LG
TL;DR: RLP introduces reinforcement learning during pretraining by treating chain-of-thought reasoning as exploratory actions, rewarding information gain for future token prediction, improving reasoning capabilities before post-training.
Details
Motivation: The current training paradigm delays reinforcement learning to the post-training phase, missing opportunities to teach reasoning behavior earlier in pretraining through exploration.
Method: RLP uses information-driven reinforcement pretraining in which the chain-of-thought is treated as an exploratory action, with rewards based on the information gain for predicting future tokens (the increase in log-likelihood when conditioning on the reasoning chain versus the context alone); see the sketch below.
Result: RLP on Qwen3-1.7B-Base improved overall average across eight math-and-science benchmarks by 19%, with largest gains on reasoning-heavy tasks like AIME25 and MMLU-Pro. On Nemotron-Nano-12B-v2, overall average increased from 42.81% to 61.32% with 23% improvement on scientific reasoning.
Conclusion: RLP successfully bridges gap between next-token prediction and chain-of-thought reasoning by introducing reinforcement learning during pretraining, demonstrating scalable improvements across architectures and model sizes.
Abstract: The dominant paradigm for training large reasoning models starts with pre-training using next-token prediction loss on vast amounts of data. Reinforcement learning, while powerful in scaling reasoning, is introduced only as the very last phase of post-training, preceded by supervised fine-tuning. While dominant, is this an optimal way of training? In this paper, we present RLP, an information-driven reinforcement pretraining objective, that brings the core spirit of reinforcement learning – exploration – to the last phase of pretraining. The key idea is to treat chain-of-thought as an exploratory action, with rewards computed based on the information gain it provides for predicting future tokens. This training objective essentially encourages the model to think for itself before predicting what comes next, thus teaching an independent thinking behavior earlier in the pretraining. More concretely, the reward signal measures the increase in log-likelihood of the next token when conditioning on both context and a sampled reasoning chain, compared to conditioning on context alone. This approach yields a verifier-free dense reward signal, allowing for efficient training for the full document stream during pretraining. Specifically, RLP reframes reinforcement learning for reasoning as a pretraining objective on ordinary text, bridging the gap between next-token prediction and the emergence of useful chain-of-thought reasoning. Pretraining with RLP on Qwen3-1.7B-Base lifts the overall average across an eight-benchmark math-and-science suite by 19%. With identical post-training, the gains compound, with the largest improvements on reasoning-heavy tasks such as AIME25 and MMLU-Pro. Applying RLP to the hybrid Nemotron-Nano-12B-v2 increases the overall average from 42.81% to 61.32% and raises the average on scientific reasoning by 23%, demonstrating scalability across architectures and model sizes.
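A hedged sketch of the information-gain reward: the increase in next-token log-likelihood when conditioning on a sampled reasoning chain versus on the context alone. `log_prob_next_token` stands in for one forward pass of a causal LM; tokenization and batching details are omitted.
```python
import torch

@torch.no_grad()
def rlp_reward(model, context_ids, cot_ids, next_token_id):
    def log_prob_next_token(prefix_ids):
        logits = model(prefix_ids.unsqueeze(0)).logits[0, -1]  # last-position logits
        return torch.log_softmax(logits, dim=-1)[next_token_id]

    with_cot = log_prob_next_token(torch.cat([context_ids, cot_ids]))
    without_cot = log_prob_next_token(context_ids)
    # Positive reward iff the chain-of-thought made the next token more predictable.
    return (with_cot - without_cot).item()
```
Because the reward is just a likelihood difference, it needs no external verifier and stays dense over the full document stream, which is what makes it usable as a pretraining signal.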
[387] Safe Reinforcement Learning-Based Vibration Control: Overcoming Training Risks with LQR Guidance
Rohan Vitthal Thorat, Juhi Singh, Rajdip Nayek
Main category: cs.LG
TL;DR: A hybrid LQR-RL framework for structural vibration control that eliminates the need for system identification while ensuring safety during RL training by using an LQR controller based on an incorrect model.
Details
Motivation: To address safety risks during RL controller training on physical structures, where random control actions could damage the structure, while avoiding the tedious system identification required by conventional model-based methods.
Method: Proposes a hybrid control framework combining LQR and RL controllers, where the LQR uses a randomly selected model (not requiring accurate system knowledge) to guide RL training safely, making the overall approach model-free (see the sketch below).
Result: The hybrid approach eliminates dependency on explicit system models while minimizing exploration risks in RL training, with LQR based on incorrect models still outperforming uncontrolled scenarios.
Conclusion: This is the first study to address RL training safety in vibration control, providing a validated model-free solution that combines the benefits of both LQR and RL approaches.
Abstract: Structural vibrations induced by external excitations pose significant risks, including safety hazards for occupants, structural damage, and increased maintenance costs. While conventional model-based control strategies, such as Linear Quadratic Regulator (LQR), effectively mitigate vibrations, their reliance on accurate system models necessitates tedious system identification. This tedious system identification process can be avoided by using a model-free Reinforcement learning (RL) method. RL controllers derive their policies solely from observed structural behaviour, eliminating the requirement for an explicit structural model. For an RL controller to be truly model-free, its training must occur on the actual physical system rather than in simulation. However, during this training phase, the RL controller lacks prior knowledge and it exerts control force on the structure randomly, which can potentially harm the structure. To mitigate this risk, we propose guiding the RL controller using a Linear Quadratic Regulator (LQR) controller. While LQR control typically relies on an accurate structural model for optimal performance, our observations indicate that even an LQR controller based on an entirely incorrect model outperforms the uncontrolled scenario. Motivated by this finding, we introduce a hybrid control framework that integrates both LQR and RL controllers. In this approach, the LQR policy is derived from a randomly selected model and its parameters. As this LQR policy does not require knowledge of the true or an approximate structural model the overall framework remains model-free. This hybrid approach eliminates dependency on explicit system models while minimizing exploration risks inherent in naive RL implementations. As per our knowledge, this is the first study to address the critical training safety challenge of RL-based vibration control and provide a validated solution.
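A hedged sketch of how an LQR baseline derived from a random (possibly wrong) model can guide RL training: the Riccati iteration below yields a gain for whatever A, B are assumed, and the blend with the RL action is an illustrative schedule, not the paper's exact integration rule.
```python
import numpy as np

def lqr_gain(A, B, Q, R, iters=200):
    # Discrete-time Riccati iteration for K in u = -K x.
    # A, B may come from a randomly selected model -- accuracy is not required.
    P = Q.copy()
    for _ in range(iters):
        P = Q + A.T @ P @ A - A.T @ P @ B @ np.linalg.inv(
            R + B.T @ P @ B) @ B.T @ P @ A
    return np.linalg.inv(R + B.T @ P @ B) @ B.T @ P @ A

def hybrid_action(x, K, rl_action, alpha):
    # alpha ramps 0 -> 1 as the RL policy becomes trustworthy (assumed schedule).
    u_lqr = -K @ x
    return (1.0 - alpha) * u_lqr + alpha * rl_action
```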
[388] Identifying Information-Transfer Nodes in a Recurrent Neural Network Reveals Dynamic Representations
Arend Hintze, Asadullah Najam, Jory Schossau
Main category: cs.LG
TL;DR: An information-theoretic method identifies information-transfer nodes (information relays) in RNNs by quantifying mutual information between input and output vectors, revealing critical information pathways in various RNN architectures.
Details
Motivation: Understanding RNN internal dynamics is crucial for advancing interpretability and improving network design, particularly for explaining how information flows through these complex systems.
Method: Developed an information-theoretic approach to quantify mutual information between input and output vectors across nodes, applied to synthetic and real-world time series classification tasks using LSTM and GRU architectures, with node knockout experiments to assess functional importance.
Result: Revealed distinct patterns of information relay across different RNN architectures, showing how information is processed and maintained over time, with identified nodes significantly influencing overall network behavior.
Conclusion: The study enhances understanding of RNN mechanisms and provides a valuable tool for designing more robust and interpretable neural networks, contributing significantly to explainable AI.
Abstract: Understanding the internal dynamics of Recurrent Neural Networks (RNNs) is crucial for advancing their interpretability and improving their design. This study introduces an innovative information-theoretic method to identify and analyze information-transfer nodes within RNNs, which we refer to as \textit{information relays}. By quantifying the mutual information between input and output vectors across nodes, our approach pinpoints critical pathways through which information flows during network operations. We apply this methodology to both synthetic and real-world time series classification tasks, employing various RNN architectures, including Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs). Our results reveal distinct patterns of information relay across different architectures, offering insights into how information is processed and maintained over time. Additionally, we conduct node knockout experiments to assess the functional importance of identified nodes, significantly contributing to explainable artificial intelligence by elucidating how specific nodes influence overall network behavior. This study not only enhances our understanding of the complex mechanisms driving RNNs but also provides a valuable tool for designing more robust and interpretable neural networks.
[389] Noisy-Pair Robust Representation Alignment for Positive-Unlabeled Learning
Hengwei Zhao, Zhengzhong Tu, Zhuo Zheng, Wei Wang, Junjue Wang, Rusty Feagin, Wenzhe Jiao
Main category: cs.LG
TL;DR: NcPU is a non-contrastive PU learning framework that addresses representation learning challenges in Positive-Unlabeled classification by combining noisy-pair robust supervised non-contrastive loss with phantom label disambiguation, achieving state-of-the-art performance without requiring auxiliary negatives or pre-estimated parameters.
Details
Motivation: State-of-the-art PU learning methods substantially underperform supervised counterparts on complex datasets (e.g., 14.26% gap on CIFAR-100), primarily due to the challenge of learning discriminative representations under unreliable supervision without auxiliary negatives or pre-estimated parameters.
Method: Proposes the NcPU framework with two key components: (1) NoiSNCL, a noisy-pair robust supervised non-contrastive loss that aligns intra-class representations despite unreliable supervision, and (2) PLD, a phantom label disambiguation scheme that supplies conservative negative supervision via regret-based label updates. Theoretically, these components iteratively benefit each other under the Expectation-Maximization framework.
Result: Extensive experiments show: (1) NoiSNCL enables simple PU methods to achieve competitive performance; (2) NcPU achieves substantial improvements over state-of-the-art PU methods across diverse datasets, including challenging post-disaster building damage mapping applications.
Conclusion: NcPU effectively addresses the representation learning bottleneck in PU learning, demonstrating strong performance on complex datasets and real-world applications without requiring auxiliary information, making it a promising solution for practical PU learning scenarios.
Abstract: Positive-Unlabeled (PU) learning aims to train a binary classifier (positive vs. negative) where only limited positive data and abundant unlabeled data are available. While widely applicable, state-of-the-art PU learning methods substantially underperform their supervised counterparts on complex datasets, especially without auxiliary negatives or pre-estimated parameters (e.g., a 14.26% gap on CIFAR-100 dataset). We identify the primary bottleneck as the challenge of learning discriminative representations under unreliable supervision. To tackle this challenge, we propose NcPU, a non-contrastive PU learning framework that requires no auxiliary information. NcPU combines a noisy-pair robust supervised non-contrastive loss (NoiSNCL), which aligns intra-class representations despite unreliable supervision, with a phantom label disambiguation (PLD) scheme that supplies conservative negative supervision via regret-based label updates. Theoretically, NoiSNCL and PLD can iteratively benefit each other from the perspective of the Expectation-Maximization framework. Empirically, extensive experiments demonstrate that: (1) NoiSNCL enables simple PU methods to achieve competitive performance; and (2) NcPU achieves substantial improvements over state-of-the-art PU methods across diverse datasets, including challenging datasets on post-disaster building damage mapping, highlighting its promise for real-world applications. Code: Code will be open-sourced after review.
[390] Microsaccade-Inspired Probing: Positional Encoding Perturbations Reveal LLM Misbehaviours
Rui Melo, Rui Abreu, Corina S. Pasareanu
Main category: cs.LG
TL;DR: A microsaccade-inspired probing method uses lightweight position encoding perturbations to detect LLM misbehavior without fine-tuning or supervision.
Details
Motivation: Drawing inspiration from microsaccades in human vision, which reveal hidden perceptual dynamics, the authors aim to develop an analogous method to uncover latent failure signals in LLMs.
Method: Applies lightweight position encoding perturbations to LLMs, similar to microsaccades in human vision, to elicit latent signals that indicate model misbehavior (see the sketch below).
Result: The method successfully detects failures across diverse settings including factuality, safety, toxicity, and backdoor attacks in multiple state-of-the-art LLMs, while remaining computationally efficient.
Conclusion: Pretrained LLMs already encode internal evidence to flag their own failures, and microsaccade-inspired interventions provide an effective pathway for detecting and mitigating undesirable behaviors.
Abstract: We draw inspiration from microsaccades, tiny involuntary eye movements that reveal hidden dynamics of human perception, to propose an analogous probing method for large language models (LLMs). Just as microsaccades expose subtle but informative shifts in vision, we show that lightweight position encoding perturbations elicit latent signals that indicate model misbehaviour. Our method requires no fine-tuning or task-specific supervision, yet detects failures across diverse settings including factuality, safety, toxicity, and backdoor attacks. Experiments on multiple state-of-the-art LLMs demonstrate that these perturbation-based probes surface misbehaviours while remaining computationally efficient. These findings suggest that pretrained LLMs already encode the internal evidence needed to flag their own failures, and that microsaccade-inspired interventions provide a pathway for detecting and mitigating undesirable behaviours.
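A hedged sketch of the probe: run the model twice, once with slightly jittered position ids (the "microsaccade"), and use the divergence between the two next-token distributions as an instability signal. This works with Hugging Face causal LMs that accept `position_ids`; the jitter size and the use of KL divergence as the score are illustrative assumptions.
```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def microsaccade_score(model, input_ids, jitter=1):
    seq_len = input_ids.size(1)
    base_pos = torch.arange(seq_len).unsqueeze(0)
    noise = torch.randint(-jitter, jitter + 1, (1, seq_len))
    perturbed_pos = (base_pos + noise).clamp(min=0)

    p = F.log_softmax(model(input_ids, position_ids=base_pos).logits[0, -1], dim=-1)
    q = F.log_softmax(model(input_ids, position_ids=perturbed_pos).logits[0, -1], dim=-1)
    # A large divergence under a tiny positional nudge flags an unstable prediction.
    return F.kl_div(q, p, log_target=True, reduction="sum").item()
```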
[391] World Model for AI Autonomous Navigation in Mechanical Thrombectomy
Harry Robertshaw, Han-Ru Wu, Alejandro Granados, Thomas C Booth
Main category: cs.LG
TL;DR: TD-MPC2 model-based RL significantly outperforms SAC in autonomous endovascular navigation across multiple patient vasculatures, achieving 65% success rate vs 37% for SAC, though with longer procedure times.
Details
Motivation: Autonomous navigation for mechanical thrombectomy is challenging due to complex vascular anatomy and need for real-time precision. Current RL methods struggle with generalization across multiple patients and long-horizon tasks.
Method: Proposed world model for autonomous endovascular navigation using TD-MPC2 model-based RL algorithm. Trained single RL agent across multiple navigation tasks in ten real patient vasculatures, compared against Soft Actor-Critic (SAC).
Result: TD-MPC2 achieved 65% mean success rate vs SAC’s 37%, with notable improvements in path ratio. However, TD-MPC2 exhibited increased procedure times, indicating trade-off between success rate and execution speed.
Conclusion: World models show strong potential for improving autonomous endovascular navigation and provide foundation for future research in generalizable AI-driven robotic interventions.
Abstract: Autonomous navigation for mechanical thrombectomy (MT) remains a critical challenge due to the complexity of vascular anatomy and the need for precise, real-time decision-making. Reinforcement learning (RL)-based approaches have demonstrated potential in automating endovascular navigation, but current methods often struggle with generalization across multiple patient vasculatures and long-horizon tasks. We propose a world model for autonomous endovascular navigation using TD-MPC2, a model-based RL algorithm. We trained a single RL agent across multiple endovascular navigation tasks in ten real patient vasculatures, comparing performance against the state-of-the-art Soft Actor-Critic (SAC) method. Results indicate that TD-MPC2 significantly outperforms SAC in multi-task learning, achieving a 65% mean success rate compared to SAC’s 37%, with notable improvements in path ratio. TD-MPC2 exhibited increased procedure times, suggesting a trade-off between success rate and execution speed. These findings highlight the potential of world models for improving autonomous endovascular navigation and lay the foundation for future research in generalizable AI-driven robotic interventions.
[392] ThinKV: Thought-Adaptive KV Cache Compression for Efficient Reasoning Models
Akshat Ramachandran, Marina Neseem, Charbel Sakr, Rangharajan Venkatesan, Brucek Khailany, Tushar Krishna
Main category: cs.LG
TL;DR: ThinKV is a KV cache compression framework that uses hybrid quantization-eviction based on thought importance in chain-of-thought reasoning, achieving near-lossless accuracy with <5% of original KV cache size and up to 5.8x higher throughput.
Details
Motivation: Large reasoning models generate long chain-of-thought outputs that cause the KV cache to grow rapidly and overwhelm GPU memory, creating a memory bottleneck for inference.
Method: Uses attention sparsity to identify thought types of varying importance, applies a hybrid quantization-eviction strategy based on thought importance (see the toy sketch below), and implements efficient memory reuse through an extended PagedAttention kernel.
Result: Achieves near-lossless accuracy with less than 5% of original KV cache size while improving inference throughput by up to 5.8x compared to state-of-the-art baselines on mathematics and coding benchmarks.
Conclusion: ThinKV effectively addresses KV cache memory bottleneck in long-context reasoning models through thought-adaptive compression, enabling efficient inference with minimal accuracy loss.
Abstract: The long-output context generation of large reasoning models enables extended chain of thought (CoT) but also drives rapid growth of the key-value (KV) cache, quickly overwhelming GPU memory. To address this challenge, we propose ThinKV, a thought-adaptive KV cache compression framework. ThinKV is based on the observation that attention sparsity reveals distinct thought types with varying importance within the CoT. It applies a hybrid quantization-eviction strategy, assigning token precision by thought importance and progressively evicting tokens from less critical thoughts as reasoning trajectories evolve. Furthermore, to implement ThinKV, we design a kernel that extends PagedAttention to enable efficient reuse of evicted tokens’ memory slots, eliminating compaction overheads. Extensive experiments on DeepSeek-R1-Distill, GPT-OSS, and NVIDIA AceReason across mathematics and coding benchmarks show that ThinKV achieves near-lossless accuracy with less than 5% of the original KV cache, while improving performance with up to 5.8x higher inference throughput over state-of-the-art baselines.
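A toy sketch of the thought-adaptive policy: token precision follows thought importance, and the least important tokens become eviction candidates as the trajectory grows. The importance-to-bits mapping and the eviction fraction are illustrative assumptions; the real system operates on paged GPU memory, not Python lists.
```python
def kv_compression_plan(token_thought_importance, evict_fraction=0.3):
    # token_thought_importance: floats in [0, 1], one per cached token
    # (e.g., derived from attention sparsity over the thought it belongs to).
    plan = []
    for imp in token_thought_importance:
        bits = 8 if imp > 0.7 else (4 if imp > 0.3 else 2)
        plan.append({"bits": bits, "evictable": imp <= 0.3})
    # Progressively evict the lowest-importance evictable tokens first.
    evictable = sorted((i for i, p in enumerate(plan) if p["evictable"]),
                       key=lambda i: token_thought_importance[i])
    for i in evictable[: int(len(plan) * evict_fraction)]:
        plan[i]["bits"] = 0  # slot freed for reuse by the paged-attention kernel
    return plan

print(kv_compression_plan([0.9, 0.2, 0.5, 0.1, 0.8]))
```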
[393] Network-Level Vehicle Delay Estimation at Heterogeneous Signalized Intersections
Xiaobo Ma, Hyunsoo Noh, James Tokishi, Ryan Hatch
Main category: cs.LG
TL;DR: A domain adaptation framework using Gradient Boosting with Balanced Weighting (GBBW) improves vehicle delay estimation across diverse intersections by reweighting source data based on target domain similarity.
Details
Motivation: Conventional ML models for vehicle delay estimation assume the same data distribution between training and testing, which rarely holds in the real world due to variations in road geometry, signal timing, and driver behavior across intersections, leading to poor generalization.
Method: A domain adaptation framework that separates data into source/target domains, extracts traffic features, fine-tunes with a small labeled target subset, and uses GBBW to reweight source data based on similarity to the target domain (see the sketch below).
Result: Tested on 57 heterogeneous intersections in Pima County, Arizona. GBBW outperformed 8 state-of-the-art ML regression models and 7 instance-based DA methods, providing more accurate and robust delay estimates.
Conclusion: The framework enhances model transferability, supporting reliable traffic signal optimization, congestion management, and broader deployment of ML in transportation systems.
Abstract: Accurate vehicle delay estimation is essential for evaluating the performance of signalized intersections and informing traffic management strategies. Delay reflects congestion levels and affects travel time reliability, fuel use, and emissions. Machine learning (ML) offers a scalable, cost-effective alternative; However, conventional models typically assume that training and testing data follow the same distribution, an assumption that is rarely satisfied in real-world applications. Variations in road geometry, signal timing, and driver behavior across intersections often lead to poor generalization and reduced model accuracy. To address this issue, this study introduces a domain adaptation (DA) framework for estimating vehicle delays across diverse intersections. The framework separates data into source and target domains, extracts key traffic features, and fine-tunes the model using a small, labeled subset from the target domain. A novel DA model, Gradient Boosting with Balanced Weighting (GBBW), reweights source data based on similarity to the target domain, improving adaptability. The framework is tested using data from 57 heterogeneous intersections in Pima County, Arizona. Performance is evaluated against eight state-of-the-art ML regression models and seven instance-based DA methods. Results demonstrate that the GBBW framework provides more accurate and robust delay estimates. This approach supports more reliable traffic signal optimization, congestion management, and performance-based planning. By enhancing model transferability, the framework facilitates broader deployment of machine learning techniques in real-world transportation systems.
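A hedged sketch of instance reweighting in the GBBW spirit: source samples are weighted by a similarity-to-target estimate (here a domain-classifier probability ratio, a common instance-weighting device) before fitting a gradient-boosting model. The paper's exact weighting scheme may differ.
```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression

def gbbw_fit(X_src, y_src, X_tgt_unlabeled):
    # Domain classifier: does a sample come from the target domain?
    X_dom = np.vstack([X_src, X_tgt_unlabeled])
    y_dom = np.concatenate([np.zeros(len(X_src)), np.ones(len(X_tgt_unlabeled))])
    dom = LogisticRegression(max_iter=1000).fit(X_dom, y_dom)

    # w(x) ~ p(target|x) / p(source|x): up-weight target-like source samples.
    p_tgt = dom.predict_proba(X_src)[:, 1]
    weights = p_tgt / np.clip(1.0 - p_tgt, 1e-6, None)
    weights *= len(weights) / weights.sum()          # normalize to mean 1

    model = GradientBoostingRegressor()
    model.fit(X_src, y_src, sample_weight=weights)   # weighted boosting fit
    return model

rng = np.random.default_rng(0)
m = gbbw_fit(rng.normal(size=(200, 5)), rng.normal(size=200),
             rng.normal(size=(50, 5)) + 0.5)
```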
[394] From 2D to 3D, Deep Learning-based Shape Reconstruction in Magnetic Resonance Imaging: A Review
Emma McMillian, Abhirup Banerjee, Alfonso Bueno-Orovio
Main category: cs.LG
TL;DR: This review surveys deep learning methods for 3D shape reconstruction from 2D MRI, covering four main approaches: point cloud, mesh-based, shape-aware, and volumetric models, with analysis of techniques, limitations, applications, datasets, and future directions.
Details
Motivation: 3D shape reconstruction from 2D MRI is crucial for medical diagnosis, treatment planning, and computational modeling, requiring a comprehensive overview of current methodologies to advance deep learning solutions.
Method: Systematic review of four primary approaches: point cloud models, mesh-based models, shape-aware models, and volumetric models, analyzing state-of-the-art techniques, methodological foundations, and applications across anatomical structures.
Result: Provides extensive analysis of current techniques, their limitations, clinical applicability to diseased anatomy, computational demands, evaluation metrics, and publicly available datasets across cardiac, neurological, and lung imaging domains.
Conclusion: The review offers a structured overview to help researchers identify opportunities for advancing deep learning towards more robust, generalizable, and clinically impactful 3D reconstruction solutions, highlighting emerging directions like multimodal integration and cross-modality frameworks.
Abstract: Deep learning-based 3-dimensional (3D) shape reconstruction from 2-dimensional (2D) magnetic resonance imaging (MRI) has become increasingly important in medical disease diagnosis, treatment planning, and computational modeling. This review surveys the methodological landscape of 3D MRI reconstruction, focusing on 4 primary approaches: point cloud, mesh-based, shape-aware, and volumetric models. For each category, we analyze the current state-of-the-art techniques, their methodological foundation, limitations, and applications across anatomical structures. We provide an extensive overview ranging from cardiac to neurological to lung imaging. We also focus on the clinical applicability of models to diseased anatomy, and the influence of their training and testing data. We examine publicly available datasets, computational demands, and evaluation metrics. Finally, we highlight the emerging research directions including multimodal integration and cross-modality frameworks. This review aims to provide researchers with a structured overview of current 3D reconstruction methodologies to identify opportunities for advancing deep learning towards more robust, generalizable, and clinically impactful solutions.
[395] Low Rank Gradients and Where to Find Them
Rishi Sonthalia, Michael Murray, Guido Montúfar
Main category: cs.LG
TL;DR: This paper shows that gradients in two-layer neural networks have low-rank structure dominated by two rank-one components aligned with bulk data-residue and input data spikes, even without isotropic assumptions.
Details
Motivation: To understand gradient structure in neural networks while relaxing common isotropic assumptions on data and parameters, examining how data properties and scaling regimes affect gradient composition.
Method: Theoretical analysis using a spiked data model with anisotropic bulk, considering both mean-field and neural-tangent-kernel scalings, and investigating effects of regularizers like weight decay and input noise.
Result: Gradient is approximately low-rank with two dominant rank-one terms: one aligned with bulk data-residue and another with input data spike. Data properties, scaling regime, and activation function determine balance between components.
Conclusion: Neural network gradients exhibit structured low-rank properties influenced by data characteristics and regularizers, providing insights for optimization and regularization design.
Abstract: This paper investigates low-rank structure in the gradients of the training loss for two-layer neural networks while relaxing the usual isotropy assumptions on the training data and parameters. We consider a spiked data model in which the bulk can be anisotropic and ill-conditioned; we do not require independent data and weight matrices, and we analyze both the mean-field and neural-tangent-kernel scalings. We show that the gradient with respect to the input weights is approximately low rank and is dominated by two rank-one terms: one aligned with the bulk data-residue, and another aligned with the rank-one spike in the input data. We characterize how properties of the training data, the scaling regime and the activation function govern the balance between these two components. Additionally, we demonstrate that standard regularizers, such as weight decay, input noise and Jacobian penalties, selectively modulate these components. Experiments on synthetic and real data corroborate our theoretical predictions.
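The low-rank claim is easy to probe numerically. The sketch below computes the input-weight gradient of a two-layer ReLU network on spiked data and inspects its singular values; the toy setup is a stand-in for the paper's regime, not a reproduction of it.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 2000, 50, 100                    # samples, input dim, hidden width
u = np.zeros(d); u[0] = 1.0                # direction of a rank-one data spike
X = rng.normal(size=(n, d)) + 3.0 * rng.normal(size=(n, 1)) * u
y = X @ u + 0.1 * rng.normal(size=n)       # labels correlated with the spike

W = rng.normal(size=(m, d)) / np.sqrt(d)   # input weights
a = rng.normal(size=m) / np.sqrt(m)        # output weights

pre = X @ W.T                              # (n, m) pre-activations
resid = np.maximum(pre, 0) @ a - y         # residuals of the ReLU network
# Gradient of the mean squared loss w.r.t. W:
# dL/dW = mean_i resid_i * (a * relu'(pre_i)) x_i^T
G = ((resid[:, None] * (pre > 0)) * a).T @ X / n

s = np.linalg.svd(G, compute_uv=False)
print(np.round(s[:6] / s[0], 3))           # a couple of directions dominate the rest
```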
[396] Quantum-inspired Benchmark for Estimating Intrinsic Dimension
Aritra Das, Joseph T. Iosue, Victor V. Albert
Main category: cs.LG
TL;DR: The QuIIEst benchmark introduces complex, topologically non-trivial manifolds for testing intrinsic dimension estimation methods; existing methods are less accurate on QuIIEst than on prior benchmarks, with only minimal further degradation as curvature becomes non-uniform.
Details
Motivation: Existing intrinsic dimension estimation (IDE) methods produce varying estimates, necessitating benchmarking on more complex manifolds than current benchmarks provide.
Method: Proposed Quantum-Inspired Intrinsic-dimension Estimation (QuIIEst) benchmark using infinite families of topologically non-trivial manifolds with known ID, derived from quantum-optical embedding methods with curvature modification and additive noise.
Result: IDE methods were generally less accurate on QuIIEst manifolds than existing benchmarks, with minimal performance degradation despite increasingly non-uniform curvature, indicating the benchmark’s inherent difficulty.
Conclusion: QuIIEst provides a challenging benchmark for IDE methods, and some methods can extract effective dimension even from non-manifold spaces like Hofstadter’s butterfly.
Abstract: Machine learning models can generalize well on real-world datasets. According to the manifold hypothesis, this is possible because datasets lie on a latent manifold with small intrinsic dimension (ID). There exist many methods for ID estimation (IDE), but their estimates vary substantially. This warrants benchmarking IDE methods on manifolds that are more complex than those in existing benchmarks. We propose a Quantum-Inspired Intrinsic-dimension Estimation (QuIIEst) benchmark consisting of infinite families of topologically non-trivial manifolds with known ID. Our benchmark stems from a quantum-optical method of embedding arbitrary homogeneous spaces while allowing for curvature modification and additive noise. The IDE methods tested were generally less accurate on QuIIEst manifolds than on existing benchmarks under identical resource allocation. We also observe minimal performance degradation with increasingly non-uniform curvature, underscoring the benchmark’s inherent difficulty. As a result of independent interest, we perform IDE on the fractal Hofstadter’s butterfly and identify which methods are capable of extracting the effective dimension of a space that is not a manifold.
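For readers unfamiliar with what these IDE methods compute, here is the standard TwoNN estimator (Facco et al.) applied to a noisy circle embedded in 10 dimensions; this is one of the estimators such a benchmark would stress, not part of QuIIEst itself.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def twonn_id(X):
    # TwoNN: the maximum-likelihood intrinsic dimension from the ratio of
    # each point's second to first nearest-neighbor distance.
    dists, _ = NearestNeighbors(n_neighbors=3).fit(X).kneighbors(X)
    mu = dists[:, 2] / dists[:, 1]          # column 0 is the point itself
    return len(mu) / np.log(mu).sum()

t = np.random.rand(2000) * 2 * np.pi        # a circle (ID = 1) embedded in 10-D
X = np.zeros((2000, 10))
X[:, 0], X[:, 1] = np.cos(t), np.sin(t)
X += 0.01 * np.random.randn(2000, 10)
print(twonn_id(X))                          # should be close to 1
```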
[397] On the Identifiability of Latent Action Policies
Sébastien Lachapelle
Main category: cs.LG
TL;DR: The paper analyzes identifiability in latent action policy learning (LAPO), proving that an entropy-regularized objective can identify action representations under certain conditions.
Details
Motivation: To understand why discrete action representations work well in practice and formally establish conditions for identifiability in LAPO frameworks.
Method: The authors formally describe desiderata for action representations, analyze statistical benefits and sources of unidentifiability, and prove identifiability results for entropy-regularized LAPO objectives.
Result: The analysis shows that under suitable conditions, entropy-regularized LAPO can identify action representations that satisfy the proposed desiderata.
Conclusion: The theoretical analysis provides an explanation for the empirical success of discrete action representations in practice.
Abstract: We study the identifiability of latent action policy learning (LAPO), a framework introduced recently to discover representations of actions from video data. We formally describe desiderata for such representations, their statistical benefits and potential sources of unidentifiability. Finally, we prove that an entropy-regularized LAPO objective identifies action representations satisfying our desiderata, under suitable conditions. Our analysis provides an explanation for why discrete action representations perform well in practice.
[398] Self-Supervised Representation Learning as Mutual Information Maximization
Akhlaqur Rahman Sabby, Yi Sui, Tongzi Wu, Jesse C. Cresswell, Ga Wu
Main category: cs.LG
TL;DR: This paper provides a theoretical framework that explains architectural choices in self-supervised representation learning (SSRL) by deriving two training paradigms from variational mutual information objectives: Self-Distillation MI (SDMI) and Joint MI (JMI).
Details
Motivation: To understand the underlying principles of SSRL methods beyond empirical heuristics, and to determine whether learning objectives dictate optimization strategies and model design choices.
Method: Starting from a variational mutual information lower bound, the authors derive two training paradigms: SDMI (requiring alternating optimization and stop-gradient operations) and JMI (admitting joint optimization through symmetric architectures).
Result: The paper shows that predictor networks in SDMI and statistical regularizers in JMI emerge as tractable surrogates for mutual information objectives, and demonstrates that many existing SSRL methods are specific instances of these two paradigms.
Conclusion: The proposed formulation provides a theoretical explanation for architectural components in SSRL methods, moving beyond heuristic conveniences to principled design choices dictated by mutual information objectives.
Abstract: Self-supervised representation learning (SSRL) has demonstrated remarkable empirical success, yet its underlying principles remain insufficiently understood. While recent works attempt to unify SSRL methods by examining their information-theoretic objectives or summarizing their heuristics for preventing representation collapse, architectural elements like the predictor network, stop-gradient operation, and statistical regularizer are often viewed as empirically motivated additions. In this paper, we adopt a first-principles approach and investigate whether the learning objective of an SSRL algorithm dictates its possible optimization strategies and model design choices. In particular, by starting from a variational mutual information (MI) lower bound, we derive two training paradigms, namely Self-Distillation MI (SDMI) and Joint MI (JMI), each imposing distinct structural constraints and covering a set of existing SSRL algorithms. SDMI inherently requires alternating optimization, making stop-gradient operations theoretically essential. In contrast, JMI admits joint optimization through symmetric architectures without such components. Under the proposed formulation, predictor networks in SDMI and statistical regularizers in JMI emerge as tractable surrogates for the MI objective. We show that many existing SSRL methods are specific instances or approximations of these two paradigms. This paper provides a theoretical explanation behind the choices of different architectural components of existing SSRL methods, beyond heuristic conveniences.
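A minimal PyTorch sketch of the two paradigms, assuming generic encoder/predictor modules and two augmented views x1, x2; the losses are BYOL- and VICReg-flavored stand-ins for the MI surrogates the paper derives, not its exact objectives.

```python
import torch
import torch.nn.functional as F

def sdmi_style_loss(online, target, predictor, x1, x2):
    # Self-distillation: predictor plus stop-gradient on the target branch,
    # the component the SDMI paradigm identifies as theoretically essential.
    p = predictor(online(x1))
    with torch.no_grad():                    # stop-gradient: target is held fixed
        z = target(x2)
    return -F.cosine_similarity(p, z, dim=-1).mean()

def jmi_style_loss(encoder, x1, x2, var_weight=1.0):
    # Joint optimization with a symmetric architecture plus a statistical
    # regularizer against collapse (a VICReg-style variance term).
    z1, z2 = encoder(x1), encoder(x2)
    invariance = F.mse_loss(z1, z2)
    std = torch.sqrt(z1.var(dim=0) + 1e-4)
    variance = F.relu(1.0 - std).mean()      # keep each dimension's std above 1
    return invariance + var_weight * variance
```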
[399] To Augment or Not to Augment? Diagnosing Distributional Symmetry Breaking
Hannah Lawrence, Elyssa Hofgard, Vasco Portilheiro, Yuxuan Chen, Tess Smidt, Robin Walters
Main category: cs.LG
TL;DR: The paper proposes a metric to quantify dataset anisotropy (symmetry-breaking) and shows that symmetry-aware methods don’t always perform optimally even when labels are invariant, challenging assumptions about equivariance benefits.
Details
Motivation: To critically evaluate the assumption behind symmetry-aware methods (data augmentation, equivariant architectures), namely that transformed datapoints are important under the test distribution and therefore improve generalization.
Method: Developed a metric based on a two-sample neural classifier test that distinguishes the original dataset from its randomly augmented equivalent, validated on synthetic datasets and applied to point cloud datasets.
Result: Uncovered surprisingly high degrees of alignment in benchmark point cloud datasets, showed theoretically that distributional symmetry-breaking prevents invariant methods from optimal performance, and found equivariant methods’ benefits are dataset-dependent.
Conclusion: Understanding equivariance requires rethinking symmetry biases in data, as symmetry-aware methods don’t always provide benefits and their effectiveness depends on dataset characteristics.
Abstract: Symmetry-aware methods for machine learning, such as data augmentation and equivariant architectures, encourage correct model behavior on all transformations (e.g. rotations or permutations) of the original dataset. These methods can improve generalization and sample efficiency, under the assumption that the transformed datapoints are highly probable, or “important”, under the test distribution. In this work, we develop a method for critically evaluating this assumption. In particular, we propose a metric to quantify the amount of anisotropy, or symmetry-breaking, in a dataset, via a two-sample neural classifier test that distinguishes between the original dataset and its randomly augmented equivalent. We validate our metric on synthetic datasets, and then use it to uncover surprisingly high degrees of alignment in several benchmark point cloud datasets. We show theoretically that distributional symmetry-breaking can actually prevent invariant methods from performing optimally even when the underlying labels are truly invariant, as we show for invariant ridge regression in the infinite feature limit. Empirically, we find that the implication for symmetry-aware methods is dataset-dependent: equivariant methods still impart benefits on some anisotropic datasets, but not others. Overall, these findings suggest that understanding equivariance – both when it works, and why – may require rethinking symmetry biases in the data.
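The proposed metric boils down to a two-sample test that can be sketched in a few lines: train a classifier to tell original clouds from randomly rotated copies, and read symmetry-breaking off its accuracy. The hand-crafted featurization below is a crude stand-in for the paper's neural classifier.

```python
import numpy as np
from scipy.stats import special_ortho_group
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Anisotropic toy dataset: 200 clouds of 64 points, stretched along the x-axis.
clouds = rng.normal(size=(200, 64, 3)) * np.array([3.0, 1.0, 1.0])
rots = special_ortho_group.rvs(3, size=200, random_state=0)
augmented = np.einsum('nij,npj->npi', rots, clouds)   # randomly rotated copies

# Featurize each cloud (sorted flattened coordinates) and test original vs augmented;
# accuracy near 0.5 means isotropic, higher means symmetry-breaking.
feats = lambda c: np.sort(c.reshape(len(c), -1), axis=1)
X = np.vstack([feats(clouds), feats(augmented)])
y = np.concatenate([np.zeros(200), np.ones(200)])
acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
print(f"two-sample classifier accuracy: {acc:.2f}")
```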
[400] RheOFormer: A generative transformer model for simulation of complex fluids and flows
Maedeh Saberi, Amir Barati Farimani, Safa Jamali
Main category: cs.LG
TL;DR: RheOFormer is a generative operator learning method using self-attention to efficiently model complex fluid flows, accurately predicting nonlinear mechanics of various fluids with strong generalization and computational efficiency.
Details
Motivation: Traditional numerical methods for non-Newtonian fluid dynamics are computationally demanding and poorly scalable, while existing data-driven methods require retraining across different physical conditions.
Method: Rheological Operator Transformer (RheOFormer) leverages self-attention to learn spatial interactions and features of complex fluid flows, using generative operator learning.
Result: RheOFormer accurately learns scalar and tensorial nonlinear mechanics of different complex fluids and predicts spatio-temporal flow evolution, even with limited training data.
Conclusion: RheOFormer serves as a robust neural surrogate for accelerating complex fluid simulations, advancing data-driven experimentation, and enabling real-time process optimization across applications.
Abstract: The ability to model mechanics of soft materials under flowing conditions is key in designing and engineering processes and materials with targeted properties. This generally requires solving for the internal stress tensor, which is related to the deformation tensor through nonlinear, history-dependent constitutive models. Traditional numerical methods for non-Newtonian fluid dynamics often suffer from prohibitive computational demands and poor scalability to new problem instances. Developments in data-driven methods have mitigated some limitations but still require retraining across varied physical conditions. In this work, we introduce the Rheological Operator Transformer (RheOFormer), a generative operator learning method leveraging self-attention to efficiently learn different spatial interactions and features of complex fluid flows. We benchmark RheOFormer across a range of different viscometric and non-viscometric flows with different types of viscoelastic and elastoviscoplastic mechanics in complex domains against ground truth solutions. Our results demonstrate that RheOFormer can accurately learn both scalar and tensorial nonlinear mechanics of different complex fluids and predict the spatio-temporal evolution of their flows, even when trained on limited datasets. Its strong generalization capabilities and computational efficiency establish RheOFormer as a robust neural surrogate for accelerating predictive complex fluid simulations, advancing data-driven experimentation, and enabling real-time process optimization across a wide range of applications.
[401] Selective Underfitting in Diffusion Models
Kiwhan Song, Jaeyeon Kim, Sitan Chen, Yilun Du, Sham Kakade, Vincent Sitzmann
Main category: cs.LG
TL;DR: Diffusion models selectively underfit the empirical score function - accurately approximating it in certain regions while underfitting in others, which enables novel sample generation rather than just reproducing training data.
Details
Motivation: To understand why diffusion models can generate novel samples instead of simply reproducing training data, by investigating which score functions they actually learn during training.
Method: Introduce the concept of selective underfitting, characterize regions where diffusion models accurately approximate vs underfit the score, and design empirical interventions to validate this perspective.
Result: Established that selective underfitting is essential for understanding diffusion models, with better models more accurately approximating the score in certain regions while underfitting in others.
Conclusion: Selective underfitting provides new testable insights into diffusion models’ generalization and generative performance, explaining how they generate novel samples rather than just reproducing training data.
Abstract: Diffusion models have emerged as the principal paradigm for generative modeling across various domains. During training, they learn the score function, which in turn is used to generate samples at inference. They raise a basic yet unsolved question: which score do they actually learn? In principle, a diffusion model that matches the empirical score in the entire data space would simply reproduce the training data, failing to generate novel samples. Recent work addresses this question by arguing that diffusion models underfit the empirical score due to training-time inductive biases. In this work, we refine this perspective, introducing the notion of selective underfitting: instead of underfitting the score everywhere, better diffusion models more accurately approximate the score in certain regions of input space, while underfitting it in others. We characterize these regions and design empirical interventions to validate our perspective. Our results establish that selective underfitting is essential for understanding diffusion models, yielding new, testable insights into their generalization and generative performance.
[402] Fine-Tuning Masked Diffusion for Provable Self-Correction
Jaeyeon Kim, Seunggeun Kim, Taekyun Lee, David Z. Pan, Hyeji Kim, Sham Kakade, Sitan Chen
Main category: cs.LG
TL;DR: PRISM is a lightweight, model-agnostic approach for self-correction in Masked Diffusion Models that learns per-token quality scores without RL or verifiers, improving inference across domains like Sudoku, text, and code.
Details
Motivation: Current approaches for self-correction in Masked Diffusion Models either require architectural/training changes or rely on imprecise proxies for token quality, limiting their applicability.
Method: PRISM defines a self-correction loss that provably learns per-token quality scores in the same forward pass with MDM, using these scores to detect and revise low-quality tokens.
Result: PRISM advances MDM inference across multiple domains and scales: Sudoku puzzles, unconditional text generation (170M parameters), and code generation with LLaDA (8B parameters).
Conclusion: PRISM provides an effective, lightweight solution for self-correction in Masked Diffusion Models that works with any pretrained MDM without requiring architectural changes or reinforcement learning.
Abstract: A natural desideratum for generative models is self-correction: detecting and revising low-quality tokens at inference. While Masked Diffusion Models (MDMs) have emerged as a promising approach for generative modeling in discrete spaces, their capacity for self-correction remains poorly understood. Prior attempts to incorporate self-correction into MDMs either require overhauling MDM architectures/training or rely on imprecise proxies for token quality, limiting their applicability. Motivated by this, we introduce PRISM (Plug-in Remasking for Inference-time Self-correction of Masked Diffusions), a lightweight, model-agnostic approach that applies to any pretrained MDM. Theoretically, PRISM defines a self-correction loss that provably learns per-token quality scores, without RL or a verifier. These quality scores are computed in the same forward pass with the MDM and used to detect low-quality tokens. Empirically, PRISM advances MDM inference across domains and scales: Sudoku; unconditional text (170M); and code with LLaDA (8B).
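Once per-token quality scores are in hand, the correction step itself is simple. The sketch below remasks the lowest-scoring tokens for resampling; the score source, mask id, and remasking fraction are assumptions for illustration.

```python
import torch

def prism_style_remask(tokens, quality_scores, mask_id, frac=0.1):
    # Remask the lowest-quality tokens so the MDM can resample them on the
    # next denoising pass; scores would come from the learned quality head.
    k = max(1, int(frac * tokens.numel()))
    worst = torch.topk(-quality_scores, k).indices
    revised = tokens.clone()
    revised[worst] = mask_id
    return revised

tokens = torch.randint(1, 1000, (64,))       # a sampled sequence
scores = torch.rand(64)                      # stand-in per-token quality scores
print(prism_style_remask(tokens, scores, mask_id=0))
```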
[403] Optimal Stopping vs Best-of-$N$ for Inference Time Optimization
Yusuf Kalayci, Vinod Raman, Shaddin Dughmi
Main category: cs.LG
TL;DR: A new inference-time optimization framework for LLMs based on the Pandora's Box problem, achieving the same performance as Best-of-N sampling with 15-35% fewer generations.
Details
Motivation: Balance LLM output quality against inference cost when using multiple generations, addressing the trade-off between generation quality and computational expense.
Method: Developed a UCB-style Pandora's Box algorithm that adapts to unknown reward distributions, with a Bradley-Terry inspired transformation for reward scaling across prompts; stopping thresholds are learned adaptively during inference.
Result: Experiments on AlpacaFarm and HH-RLHF datasets show the adaptive strategy achieves same performance as Best-of-N sampling with 15-35% fewer generations on average.
Conclusion: Establishes principled bridge between optimal stopping theory and inference-time scaling, providing both theoretical performance bounds and practical efficiency gains for LLM deployment.
Abstract: Large language model (LLM) generation often requires balancing output quality against inference cost, especially when using multiple generations. We introduce a new framework for inference-time optimization based on the classical Pandora’s Box problem. Viewing each generation as opening a costly “box” with random reward, we develop algorithms that decide when to stop generating without knowing the underlying reward distribution. Our first contribution is a UCB-style Pandora’s Box algorithm, which achieves performance that is provably close to Weitzman’s algorithm, the optimal strategy when the distribution is known. We further adapt this method to practical LLM settings by addressing reward scaling across prompts via a Bradley-Terry inspired transformation. This leads to an adaptive inference-time optimization method that normalizes rewards and learns stopping thresholds on the fly. Experiments on the AlpacaFarm and HH-RLHF datasets, using multiple LLM-reward model pairs, show that our adaptive strategy can obtain the same performance as non-adaptive Best-of-N sampling while requiring 15-35 percent fewer generations on average. Our results establish a principled bridge between optimal stopping theory and inference-time scaling, providing both theoretical performance bounds and practical efficiency gains for LLM deployment.
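To see the optimal-stopping intuition, the sketch below computes Weitzman's reservation value for a known reward distribution and compares adaptive stopping against Best-of-N; the paper's UCB-style algorithm additionally learns the distribution online and normalizes rewards across prompts.

```python
import numpy as np

rng = np.random.default_rng(0)
cost = 0.05                                  # per-generation cost
rewards = rng.normal(size=100_000)           # stand-in reward distribution

# Weitzman's reservation value: solve E[(X - sigma)+] = cost by bisection.
lo, hi = -5.0, 5.0
for _ in range(60):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if np.maximum(rewards - mid, 0).mean() > cost else (lo, mid)
sigma = (lo + hi) / 2

def adaptive_stop(draws, sigma):
    # Stop as soon as the best reward so far clears the threshold.
    best = -np.inf
    for n, r in enumerate(draws, 1):
        best = max(best, r)
        if best >= sigma:
            break
    return best, n

trials = [adaptive_stop(rng.normal(size=64), sigma) for _ in range(1000)]
best_of_16 = [rng.normal(size=16).max() for _ in range(1000)]
print("adaptive: reward %.3f using %.1f draws on average" % tuple(np.mean(trials, axis=0)))
print("best-of-16: reward %.3f using 16 draws" % np.mean(best_of_16))
```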
[404] Neural Network Surrogates for Free Energy Computation of Complex Chemical Systems
Wasut Pornpatcharapong
Main category: cs.LG
TL;DR: Neural network surrogate framework eliminates need for analytical Jacobians in free energy reconstruction, enabling use of complex machine-learned collective variables.
Details
Motivation: Traditional free energy methods require analytical Jacobians of collective variables, which restricts the use of complex or machine-learned CVs.
Method: Neural network surrogate that learns CVs from Cartesian coordinates and uses automatic differentiation to provide Jacobians, bypassing analytical forms.
Result: Achieved high accuracy for both simple distance CV and complex coordination-number CV on MgCl2 ion-pairing system, with Jacobian errors following near-Gaussian distribution.
Conclusion: Framework enables gradient-based free energy methods to incorporate complex and machine-learned CVs, broadening scope of biochemistry and materials simulations.
Abstract: Free energy reconstruction methods such as Gaussian Process Regression (GPR) require Jacobians of the collective variables (CVs), a bottleneck that restricts the use of complex or machine-learned CVs. We introduce a neural network surrogate framework that learns CVs directly from Cartesian coordinates and uses automatic differentiation to provide Jacobians, bypassing analytical forms. On an MgCl2 ion-pairing system, our method achieved high accuracy for both a simple distance CV and a complex coordination-number CV. Moreover, Jacobian errors also followed a near-Gaussian distribution, making them suitable for GPR pipelines. This framework enables gradient-based free energy methods to incorporate complex and machine-learned CVs, broadening the scope of biochemistry and materials simulations.
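The core trick is plain automatic differentiation through the surrogate. A minimal PyTorch sketch, with a hypothetical MLP standing in for the learned collective variable:

```python
import torch

n_atoms = 8
cv_net = torch.nn.Sequential(                # learned CV: Cartesian coords -> scalar
    torch.nn.Linear(3 * n_atoms, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1)
)

coords = torch.randn(3 * n_atoms, requires_grad=True)
cv = cv_net(coords).squeeze()                # scalar collective variable

# Jacobian dCV/dx via autodiff -- no analytical form of the CV required.
(jac,) = torch.autograd.grad(cv, coords)
print(jac.shape)                             # torch.Size([24]): one entry per coordinate
```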
[405] Ultra-Efficient Decoding for End-to-End Neural Compression and Reconstruction
Ethan G. Rogers, Cheng Wang
Main category: cs.LG
TL;DR: A new neural compression framework using low-rank operations to eliminate decoder compute bottlenecks while maintaining high-quality image reconstruction.
Details
Motivation: Current neural compression methods suffer from decoder complexity and high computational costs, hindering their practical adoption despite achieving good compression rates.
Method: Incorporates low-rank representation in an autoencoder with vector quantization, performing computationally efficient low-rank operations on learned latent representations.
Result: Dramatically reduces computational overhead in decoding phase while maintaining high fidelity of image outputs.
Conclusion: The approach effectively eliminates the decoder compute bottleneck in neural compression/reconstruction systems.
Abstract: Image compression and reconstruction are crucial for various digital applications. While contemporary neural compression methods achieve impressive compression rates, the adoption of such technology has been largely hindered by the complexity and large computational costs of the convolution-based decoders during data reconstruction. To address the decoder bottleneck in neural compression, we develop a new compression-reconstruction framework based on incorporating low-rank representation in an autoencoder with vector quantization. We demonstrate that performing a series of computationally efficient low-rank operations on the learned latent representation of images can efficiently reconstruct the data with high quality. Our approach dramatically reduces the computational overhead in the decoding phase of neural compression/reconstruction, essentially eliminating the decoder compute bottleneck while maintaining high fidelity of image outputs.
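One way to picture a decoder built from low-rank operations (the shapes and heads here are assumptions, not the paper's architecture): map the latent to two thin factor matrices and reconstruct their product, costing O(HWr) rather than stacks of convolutions.

```python
import torch

H, W, r, z_dim = 64, 64, 8, 256
latent = torch.randn(1, z_dim)               # quantized latent code
to_U = torch.nn.Linear(z_dim, H * r)         # predicts the left factor
to_V = torch.nn.Linear(z_dim, r * W)         # predicts the right factor

U = to_U(latent).view(1, H, r)
Vt = to_V(latent).view(1, r, W)
img = torch.bmm(U, Vt)                       # (1, H, W) low-rank reconstruction
print(img.shape)
```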
[406] Edge Artificial Intelligence: A Systematic Review of Evolution, Taxonomic Frameworks, and Future Horizons
Mohamad Abou Ali, Fadi Dornaika
Main category: cs.LG
TL;DR: This review systematically examines Edge AI’s evolution, current state, and future directions through a multi-dimensional taxonomy covering deployment, processing capabilities, applications, and hardware, while addressing challenges and emerging opportunities.
Details
Motivation: To provide a comprehensive analysis of Edge AI's development from early content delivery networks to modern on-device intelligence, addressing the need for systematic understanding of this rapidly evolving field that enables real-time processing with improved privacy and reduced latency.
Method: Following PRISMA guidelines, the review employs a multi-dimensional taxonomy including deployment location, processing capabilities (TinyML, federated learning), application domains, and hardware types to systematically examine Edge AI evolution and current landscape.
Result: The analysis traces Edge AI’s evolution, identifies core enabling technologies (hardware accelerators, optimized software, communication protocols), critically assesses challenges (resource limitations, security, power consumption), and highlights emerging opportunities in neuromorphic hardware, continual learning, and edge-cloud collaboration.
Conclusion: The review provides a comprehensive framework for understanding Edge AI’s trajectory, emphasizing its transformative potential through emerging technologies while addressing persistent challenges, serving as a valuable resource for researchers and practitioners in the field.
Abstract: Edge Artificial Intelligence (Edge AI) embeds intelligence directly into devices at the network edge, enabling real-time processing with improved privacy and reduced latency by processing data close to its source. This review systematically examines the evolution, current landscape, and future directions of Edge AI through a multi-dimensional taxonomy including deployment location, processing capabilities such as TinyML and federated learning, application domains, and hardware types. Following PRISMA guidelines, the analysis traces the field from early content delivery networks and fog computing to modern on-device intelligence. Core enabling technologies such as specialized hardware accelerators, optimized software, and communication protocols are explored. Challenges including resource limitations, security, model management, power consumption, and connectivity are critically assessed. Emerging opportunities in neuromorphic hardware, continual learning algorithms, edge-cloud collaboration, and trustworthiness integration are highlighted, providing a comprehensive framework for researchers and practitioners.
[407] SoftAdaClip: A Smooth Clipping Strategy for Fair and Private Model Training
Dorsa Soleymani, Ali Dadsetan, Frank Rudzicz
Main category: cs.LG
TL;DR: SoftAdaClip replaces hard gradient clipping in DP-SGD with a smooth tanh-based transformation to improve fairness while maintaining differential privacy, reducing subgroup disparities by up to 87% compared to DP-SGD.
Details
Motivation: Differential privacy often reduces model performance and fairness, especially for underrepresented groups, due to gradient clipping in DP-SGD disproportionately suppressing learning signals for minority subpopulations.
Method: SoftAdaClip replaces hard clipping with a smooth, tanh-based transformation to preserve relative gradient magnitudes while bounding sensitivity, integrated with adaptive mechanisms for differentially private training.
Result: SoftAdaClip reduces subgroup disparities by up to 87% compared to DP-SGD and up to 48% compared to Adaptive-DPSGD across datasets including MIMIC-III, GOSSIS-eICU, and Adult Income, with statistically significant reductions.
Conclusion: Integrating smooth transformations with adaptive mechanisms is crucial for achieving fair and private model training, as demonstrated by SoftAdaClip’s significant improvements in reducing subgroup disparities while maintaining differential privacy.
Abstract: Differential privacy (DP) provides strong protection for sensitive data, but often reduces model performance and fairness, especially for underrepresented groups. One major reason is gradient clipping in DP-SGD, which can disproportionately suppress learning signals for minority subpopulations. Although adaptive clipping can enhance utility, it still relies on uniform hard clipping, which may restrict fairness. To address this, we introduce SoftAdaClip, a differentially private training method that replaces hard clipping with a smooth, tanh-based transformation to preserve relative gradient magnitudes while bounding sensitivity. We evaluate SoftAdaClip on various datasets, including MIMIC-III (clinical text), GOSSIS-eICU (structured healthcare), and Adult Income (tabular data). Our results show that SoftAdaClip reduces subgroup disparities by up to 87% compared to DP-SGD and up to 48% compared to Adaptive-DPSGD, and these reductions in subgroup disparities are statistically significant. These findings underscore the importance of integrating smooth transformations with adaptive mechanisms to achieve fair and private model training.
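The substitution at the heart of the method fits in one line: replace the hard min(1, C/||g||) rescaling with a tanh curve that bounds sensitivity at C while preserving relative gradient magnitudes. A sketch in the spirit of SoftAdaClip (the paper's exact transformation may differ):

```python
import torch

def hard_clip(g, C):
    # Standard DP-SGD clipping: rescale the per-sample gradient to norm <= C.
    return g * torch.clamp(C / (g.norm() + 1e-12), max=1.0)

def soft_clip(g, C):
    # Smooth tanh-based bounding: output norm = C * tanh(||g||/C) <= C,
    # so sensitivity stays bounded but small gradients are not flattened.
    norm = g.norm()
    return g * (C * torch.tanh(norm / C) / (norm + 1e-12))

g = torch.randn(1000) * 5
print(hard_clip(g, 1.0).norm(), soft_clip(g, 1.0).norm())  # both bounded by 1.0
```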
[408] Local Linear Attention: An Optimal Interpolation of Linear and Softmax Attention For Test-Time Regression
Yifei Zuo, Yutong Yin, Zhichen Zeng, Ang Li, Banghua Zhu, Zhaoran Wang
Main category: cs.LG
TL;DR: Proposes Local Linear Attention (LLA), a novel attention mechanism derived from nonparametric statistics that offers theoretical advantages over existing attention mechanisms and addresses computational challenges with memory-efficient implementations.
Details
Motivation: While efficient alternatives to Softmax Attention have been widely studied, there's a gap in exploring more expressive mechanisms grounded in theoretical insight, even at greater computational cost.
Method: Derived LLA from nonparametric statistics through test-time regression, proposed memory-efficient primitives to handle computational complexity, developed FlashLLA algorithm for hardware-efficient computation, and implemented customized inference kernel.
Result: LLA outperforms strong baselines in test-time training and in-context learning, effectively adapts to non-stationarity, and shows promising evidence for scalability in large-scale models.
Conclusion: LLA bridges the gap between theoretical insight and practical implementation, offering a more expressive attention mechanism with demonstrated advantages in various tasks and potential for large-scale applications.
Abstract: Transformer architectures have achieved remarkable success in various domains. While efficient alternatives to Softmax Attention have been widely studied, the search for more expressive mechanisms grounded in theoretical insight, even at greater computational cost, has been relatively underexplored. In this work, we bridge this gap by proposing Local Linear Attention (LLA), a novel attention mechanism derived from nonparametric statistics through the lens of test-time regression. First, we show that LLA offers theoretical advantages over Linear and Softmax Attention for associative memory via a bias-variance trade-off analysis. Next, we address its computational challenges and propose two memory-efficient primitives to tackle the $\Theta(n^2 d)$ and $\Theta(n d^2)$ complexity. We then introduce FlashLLA, a hardware-efficient, blockwise algorithm that enables scalable and parallel computation on modern accelerators. In addition, we implement and profile a customized inference kernel that significantly reduces memory overheads. Finally, we empirically validate the advantages and limitations of LLA on test-time regression, in-context regression, associative recall and state tracking tasks. Experiment results demonstrate that LLA effectively adapts to non-stationarity, outperforming strong baselines in test-time training and in-context learning, and exhibiting promising evidence for its scalability and applicability in large-scale models. Code is available at https://github.com/Yifei-Zuo/Flash-LLA.
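A naive reference implementation of the test-time-regression view, assuming softmax kernel weights: each query fits a weighted ridge regression of values on centered keys, and softmax attention falls out as the zeroth-order (Nadaraya-Watson) special case. This is the $\Theta(n^2 d)$ baseline the paper's primitives and FlashLLA are designed to avoid.

```python
import numpy as np

def local_linear_attention(Q, K, V, lam=1e-2):
    # For each query, fit v ~ beta0 + beta1^T (k - q) by weighted ridge
    # regression; the local estimate at the query is the intercept beta0.
    n, d = K.shape
    out = np.empty((Q.shape[0], V.shape[1]))
    for i, q in enumerate(Q):
        logits = K @ q
        w = np.exp(logits - logits.max())    # softmax kernel weights around q
        w /= w.sum()
        X = np.hstack([np.ones((n, 1)), K - q])          # [1, k - q] design
        A = X.T @ (w[:, None] * X) + lam * np.eye(d + 1)
        beta = np.linalg.solve(A, X.T @ (w[:, None] * V))
        out[i] = beta[0]
    return out

Q, K, V = np.random.randn(8, 16), np.random.randn(32, 16), np.random.randn(32, 16)
print(local_linear_attention(Q, K, V).shape)  # (8, 16)
```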
[409] SCOPED: Score-Curvature Out-of-distribution Proximity Evaluator for Diffusion
Brett Barkley, Preston Culbertson, David Fridovich-Keil
Main category: cs.LG
TL;DR: SCOPED is a fast OOD detection method for diffusion models that reduces forward passes by an order of magnitude while maintaining competitive accuracy, using a combination of Jacobian trace and score function norms.
Details
Motivation: Out-of-distribution detection is crucial for reliable ML deployment across vision, robotics, and reinforcement learning, but existing methods are computationally expensive.
Method: Combines Jacobian trace and squared norm of score function into a single test statistic, uses kernel density estimation for flexible thresholding, requires only one forward pass and one JVP with Hutchinson's trace estimator.
Result: Achieves competitive or state-of-the-art precision-recall scores on four vision benchmarks with low computational cost, and generalizes to robotic control tasks with distribution shifts across reward functions.
Conclusion: SCOPED provides a practical foundation for fast and reliable OOD detection in real-world domains including vision, robotics, RL, and dataset curation.
Abstract: Out-of-distribution (OOD) detection is essential for reliable deployment of machine learning systems in vision, robotics, reinforcement learning, and beyond. We introduce Score-Curvature Out-of-distribution Proximity Evaluator for Diffusion (SCOPED), a fast and general-purpose OOD detection method for diffusion models that reduces the number of forward passes on the trained model by an order of magnitude compared to prior methods, outperforming most diffusion-based baselines and closely approaching the accuracy of the strongest ones. SCOPED is computed from a single diffusion model trained once on a diverse dataset, and combines the Jacobian trace and squared norm of the model’s score function into a single test statistic. Rather than thresholding on a fixed value, we estimate the in-distribution density of SCOPED scores using kernel density estimation, enabling a flexible, unsupervised test that, in the simplest case, only requires a single forward pass and one Jacobian-vector product (JVP), made efficient by Hutchinson’s trace estimator. On four vision benchmarks, SCOPED achieves competitive or state-of-the-art precision-recall scores despite its low computational cost. The same method generalizes to robotic control tasks with shared state and action spaces, identifying distribution shifts across reward functions and training regimes. These results position SCOPED as a practical foundation for fast and reliable OOD detection in real-world domains, including perceptual artifacts in vision, outlier detection in autoregressive models, exploration in reinforcement learning, and dataset curation for unsupervised training.
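A sketch of the test statistic, assuming a stand-in score network: Hutchinson's estimator turns the Jacobian trace into one JVP per probe, which is then combined with the squared score norm (the paper's exact weighting of the two terms may differ).

```python
import torch

score = torch.nn.Sequential(                 # stand-in for a trained diffusion score model
    torch.nn.Linear(32, 64), torch.nn.SiLU(), torch.nn.Linear(64, 32)
)

def scoped_statistic(x, n_probes=1):
    # Hutchinson's trace estimate E[v^T (ds/dx) v] plus the squared score norm,
    # costing one forward pass and one JVP per probe.
    s = score(x)
    trace = 0.0
    for _ in range(n_probes):
        v = torch.randint(0, 2, x.shape, dtype=x.dtype) * 2 - 1   # Rademacher probe
        _, jvp = torch.autograd.functional.jvp(score, (x,), (v,))
        trace = trace + (v * jvp).sum() / n_probes
    return trace + (s ** 2).sum()

x = torch.randn(32)
print(scoped_statistic(x))
```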
[410] Fixing That Free Lunch: When, Where, and Why Synthetic Data Fails in Model-Based Policy Optimization
Brett Barkley, David Fridovich-Keil
Main category: cs.LG
TL;DR: MBPO’s synthetic data can degrade performance in DMC tasks despite success in Gym. Two failure modes identified: scale mismatches causing critic underestimation and poor target representation increasing model variance. Fixing these enables MBPO to outperform SAC in most DMC tasks.
Details
Motivation: To understand why MBPO underperforms in DMC tasks despite strong results in Gym, and identify the specific failure modes that prevent policy improvement.
Method: Analyzed MBPO's performance across seven DMC tasks, identified two coupled issues: scale mismatches between dynamics/reward models causing critic underestimation, and poor target representation inflating model variance.
Result: After addressing the failure modes, MBPO outperforms SAC in five of seven DMC tasks while maintaining strong Gym performance, enabling policy improvement where previously impossible.
Conclusion: Environment-specific assumptions can become encoded in algorithm design when evaluation is limited. Need taxonomies linking MDP structure to failure modes and unified solutions that clarify generalization conditions across benchmarks.
Abstract: Synthetic data is a core component of data-efficient Dyna-style model-based reinforcement learning, yet it can also degrade performance. We study when it helps, where it fails, and why, and we show that addressing the resulting failure modes enables policy improvement that was previously unattainable. We focus on Model-Based Policy Optimization (MBPO), which performs actor and critic updates using synthetic action counterfactuals. Despite reports of strong and generalizable sample-efficiency gains in OpenAI Gym, recent work shows that MBPO often underperforms its model-free counterpart, Soft Actor-Critic (SAC), in the DeepMind Control Suite (DMC). Although both suites involve continuous control with proprioceptive robots, this shift leads to sharp performance losses across seven challenging DMC tasks, with MBPO failing in cases where claims of generalization from Gym would imply success. This reveals how environment-specific assumptions can become implicitly encoded into algorithm design when evaluation is limited. We identify two coupled issues behind these failures: scale mismatches between dynamics and reward models that induce critic underestimation and hinder policy improvement during model-policy coevolution, and a poor choice of target representation that inflates model variance and produces error-prone rollouts. Addressing these failure modes enables policy improvement where none was previously possible, allowing MBPO to outperform SAC in five of seven tasks while preserving the strong performance previously reported in OpenAI Gym. Rather than aiming only for incremental average gains, we hope our findings motivate the community to develop taxonomies that tie MDP task- and environment-level structure to algorithmic failure modes, pursue unified solutions where possible, and clarify how benchmark choices ultimately shape the conditions under which algorithms generalize.
[411] How Well Can Preference Optimization Generalize Under Noisy Feedback?
Shawn Im, Yixuan Li
Main category: cs.LG
TL;DR: This paper analyzes how noisy human feedback affects preference optimization in large language models, providing generalization guarantees for finite-step training under realistic noise conditions.
Details
Motivation: Existing preference optimization methods assume noise-free human feedback, which is unrealistic due to inherent errors and inconsistencies in human judgments. The paper addresses this gap by studying the impact of noisy feedback on model alignment.
Method: The authors analyze noise models corresponding to real-world sources like mislabeling and uncertainty, focusing on finite-step preference optimization rather than assuming convergence. They examine how generalization decays with different noise types and rates based on data distribution and sample size.
Result: The analysis applies broadly to preference optimization losses (DPO, IPO, SLiC, etc.) and shows how noise affects generalization. Empirical validation on contemporary LLMs confirms the practical relevance of the findings.
Conclusion: The work provides valuable insights for developing AI systems that align with human preferences under realistic noisy feedback conditions, offering theoretical guarantees that are more aligned with practical LLM training scenarios.
Abstract: As large language models (LLMs) advance their capabilities, aligning these models with human preferences has become crucial. Preference optimization, which trains models to distinguish between preferred and non-preferred responses based on human feedback, has become a crucial component for aligning LLMs. However, most existing works assume noise-free feedback, which is unrealistic due to the inherent errors and inconsistencies in human judgments. This paper addresses the impact of noisy feedback on preference optimization, providing generalization guarantees under these conditions. In particular, we consider noise models that correspond to common real-world sources of noise, such as mislabeling and uncertainty. Unlike traditional analyses that assume convergence, our work focuses on finite-step preference optimization, offering new insights that are more aligned with practical LLM training. We describe how generalization decays with different types of noise across levels of noise rates based on the preference data distribution and number of samples. Our analysis for noisy preference learning applies to a broad family of preference optimization losses such as DPO, IPO, SLiC, etc. Empirical validation on contemporary LLMs confirms the practical relevance of our findings, offering valuable insights for developing AI systems that align with human preferences.
[412] LSPO: Length-aware Dynamic Sampling for Policy Optimization in LLM Reasoning
Weizhe Chen, Sven Koenig, Bistra Dilkina
Main category: cs.LG
TL;DR: LSPO is a meta-RLVR algorithm that dynamically selects training data based on average response length to improve learning effectiveness in reasoning tasks.
Details
Motivation: Motivated by studies of overthinking in LLMs, the paper aims to address inefficiencies in RLVR training by leveraging length signals for better data selection.
Method: Proposed Length-aware Sampling for Policy Optimization (LSPO), which dynamically selects training data at each step based on average response length.
Result: LSPO consistently improves learning effectiveness across multiple base models and datasets, with ablation studies providing insights into length signal incorporation.
Conclusion: LSPO demonstrates the value of dynamic length-aware sampling in RLVR, offering promising directions for future research in meta-RLVR algorithms.
Abstract: Since the release of Deepseek-R1, reinforcement learning with verifiable rewards (RLVR) has become a central approach for training large language models (LLMs) on reasoning tasks. Recent work has largely focused on modifying loss functions to make RLVR more efficient and effective. In this paper, motivated by studies of overthinking in LLMs, we propose Length-aware Sampling for Policy Optimization (LSPO), a novel meta-RLVR algorithm that dynamically selects training data at each step based on the average response length. We evaluate LSPO across multiple base models and datasets, demonstrating that it consistently improves learning effectiveness. In addition, we conduct a detailed ablation study to examine alternative ways of incorporating length signals into dynamic sampling, offering further insights and highlighting promising directions for future research.
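A toy version of length-aware selection, assuming we keep prompts whose mean response length falls in a middle percentile band; the actual LSPO criterion is applied dynamically at each training step and may differ.

```python
import numpy as np

def length_aware_select(prompts, responses_per_prompt, low_pct=25, high_pct=75):
    # Keep prompts whose average sampled-response length sits in the middle
    # band, on the intuition that extreme lengths carry weaker learning signal.
    mean_len = np.array([np.mean([len(r) for r in rs]) for rs in responses_per_prompt])
    lo, hi = np.percentile(mean_len, [low_pct, high_pct])
    keep = (mean_len >= lo) & (mean_len <= hi)
    return [p for p, k in zip(prompts, keep) if k]
```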
[413] The Three Regimes of Offline-to-Online Reinforcement Learning
Lu Li, Tianwei Ni, Yihao Sun, Pierre-Luc Bacon
Main category: cs.LG
TL;DR: Offline-to-online RL shows inconsistent behavior across settings. The paper proposes a stability-plasticity principle to explain this: preserve better knowledge (pretrained policy or offline data) while maintaining plasticity, identifying three fine-tuning regimes.
Details
Motivation: Address the high inconsistency in offline-to-online RL where design choices that work well in one setting fail in others, requiring a principled framework to guide fine-tuning decisions.
Method: Propose a stability-plasticity principle framework that identifies three distinct fine-tuning regimes based on relative performance of offline dataset vs pretrained policy, validated through large-scale empirical study.
Result: Empirical validation shows strong alignment with framework predictions in 45 out of 63 cases, demonstrating the framework’s effectiveness in explaining and guiding offline-to-online RL behavior.
Conclusion: The stability-plasticity principle provides a principled framework for making design choices in offline-to-online RL based on the relative performance of offline dataset and pretrained policy.
Abstract: Offline-to-online reinforcement learning (RL) has emerged as a practical paradigm that leverages offline datasets for pretraining and online interactions for fine-tuning. However, its empirical behavior is highly inconsistent: design choices of online fine-tuning that work well in one setting can fail completely in another. We propose a stability–plasticity principle that can explain this inconsistency: we should preserve the knowledge of the pretrained policy or the offline dataset during online fine-tuning, whichever is better, while maintaining sufficient plasticity. This perspective identifies three regimes of online fine-tuning, each requiring distinct stability properties. We validate this framework through a large-scale empirical study, finding that the results strongly align with its predictions in 45 of 63 cases. This work provides a principled framework for guiding design choices in offline-to-online RL based on the relative performance of the offline dataset and the pretrained policy.
[414] Fine-tuning LLMs with variational Bayesian last layer for high-dimensional Bayesian optimization
Haotian Xiang, Jinwen Xu, Qin Lu
Main category: cs.LG
TL;DR: The paper proposes LoRA-VBLL, a Bayesian optimization method using LLMs as surrogates for high-dimensional black-box optimization with irregular variables, enhanced by ensemble learning for automated hyperparameter selection.
Details
Motivation: Bayesian optimization struggles with high-dimensional problems containing irregular variables (categorical, ordinal). Gaussian processes perform poorly here, while neural network surrogates show promise. LLMs offer powerful modeling capabilities for such complex mappings.
Method: Use LLM as surrogate model with LoRA fine-tuning and variational Bayesian last layer (VBLL) for efficient parameter updates. Develop ensemble method (ENS-LoRA-VBLL) to automatically select LoRA rank and hyperparameters via recursive Bayesian updates.
Result: Extensive experiments show compelling performance on high-dimensional benchmarks and real-world molecular optimization tasks. The approach is computationally efficient with recursive update capabilities.
Conclusion: LLM-based surrogates with LoRA-VBLL framework provide effective solution for high-dimensional Bayesian optimization with irregular variables, outperforming traditional methods while maintaining computational efficiency.
Abstract: A plethora of applications entail solving black-box optimization problems with high evaluation costs, including drug discovery, material design, as well as hyperparameter tuning. Toward finding the global optimum of such black-box optimization problems with sample efficiency, Bayesian optimization (BO) is a theoretically elegant framework that relies on a probabilistic surrogate model so as to iteratively select the query point with well-balanced exploration-exploitation tradeoffs. The Gaussian process (GP), as the de-facto choice for surrogate modeling, has achieved compelling performances for vanilla BO with low-dimensional continuous variables. However, GPs fall short in coping with high-dimensional counterparts with irregular variables (e.g., categorical, ordinal, etc.). To alleviate this, neural network-based surrogates have been explored. Inspired by the powerful capabilities of LLMs, we adopt the LLM as the surrogate to model the mapping from the high-dimensional input variables to the objective function. To adapt to the current problem, we leverage the low-rank adaptation (LoRA) to fine-tune the LLM parameters together with the posterior of a linear regression head via the variational Bayesian last layer (VBLL) framework. The resulting LoRA-VBLL is not only computationally light compared to existing alternatives, but also admits recursive updates. To automate the critical selection of the LoRA rank as well as other hyperparameters, a weighted ensemble (ENS) of LoRA-VBLL surrogates has been devised, which further accommodates continual update of the per-model weight and individual LoRA-VBLL parameters via recursive Bayes. Extensive experimental results demonstrate the compelling performance of the proposed (ENS-)LoRA-VBLL approaches on various high-dimensional benchmarks and real-world molecular optimization tasks.
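The recursive-update property comes from the Bayesian last layer: with frozen features, the posterior over the linear head updates in closed form one observation at a time. A minimal stand-in below (the paper's variational treatment and LoRA coupling are omitted):

```python
import numpy as np

class BayesianLastLayer:
    # Recursive Bayesian linear regression over frozen features -- a crude
    # stand-in for the VBLL head used on top of the fine-tuned LLM.
    def __init__(self, dim, noise_var=0.1, prior_var=1.0):
        self.P = np.eye(dim) / prior_var     # posterior precision
        self.b = np.zeros(dim)               # accumulates phi * y / noise_var
        self.noise_var = noise_var

    def update(self, phi, y):                # one (feature, observation) pair
        self.P += np.outer(phi, phi) / self.noise_var
        self.b += phi * y / self.noise_var

    def predict(self, phi):                  # predictive mean and variance
        cov = np.linalg.inv(self.P)
        mu = phi @ (cov @ self.b)
        var = self.noise_var + phi @ cov @ phi
        return mu, var                       # consumed by the BO acquisition rule

head = BayesianLastLayer(dim=8)
for _ in range(20):
    phi = np.random.randn(8)
    head.update(phi, phi.sum() + 0.1 * np.random.randn())
print(head.predict(np.random.randn(8)))
```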
[415] PEL-NAS: Search Space Partitioned Architecture Prompt Co-Evolutionary LLM-driven Hardware-Aware Neural Architecture Search
Hengyi Zhu, Grace Li Zhang, Shaoyi Huang
Main category: cs.LG
TL;DR: PEL-NAS is a novel HW-NAS method that partitions search space by complexity and uses LLM-driven co-evolution to efficiently find high-accuracy, low-latency neural architectures, reducing search time from days to minutes.
Details
Motivation: Traditional supernet-based HW-NAS methods require multiple GPU days per dataset, while LLM-driven approaches suffer from exploration bias where they repeatedly propose designs within limited search space and fail to discover architectures across different latency ranges.
Method: Three key components: 1) complexity-driven partitioning engine to divide search space by complexity for diversity; 2) LLM-powered architecture prompt co-evolution operator that updates design heuristics knowledge base and performs guided evolution; 3) zero-cost predictor to avoid training candidates from scratch.
Result: On HW-NAS-Bench, PEL-NAS achieves higher HV, lower IGD, and up to 54% lower latency than baselines at similar accuracy. Search cost drops from days to minutes compared with traditional supernet baselines.
Conclusion: PEL-NAS effectively addresses exploration bias in LLM-driven NAS through search space partitioning and co-evolution, enabling efficient discovery of high-performance neural architectures across different latency ranges with significantly reduced search time.
Abstract: Hardware-Aware Neural Architecture Search (HW-NAS) requires joint optimization of accuracy and latency under device constraints. Traditional supernet-based methods require multiple GPU days per dataset. Large Language Model (LLM)-driven approaches avoid training a large supernet and can provide quick feedback, but we observe an exploration bias: the LLM repeatedly proposes neural network designs within a limited search space and fails to discover architectures across different latency ranges in the entire search space. To address this issue, we propose PEL-NAS: a search space Partitioned, architecture prompt co-Evolutionary and LLM-driven Neural Architecture Search that can generate neural networks with high accuracy and low latency with reduced search cost. Our proposed PEL-NAS has three key components: 1) a complexity-driven partitioning engine that divides the search space by complexity to enforce diversity and mitigate exploration bias; 2) an LLM-powered architecture prompt co-evolution operator, in which the LLM first updates a knowledge base of design heuristics based on results from the previous round, then performs a guided evolution algorithm on architectures with prompts that incorporate this knowledge base. Prompts and designs improve together across rounds, which avoids random guesswork and improves efficiency; 3) a zero-cost predictor to avoid training a large number of candidates from scratch. Experimental results show that on HW-NAS-Bench, PEL-NAS can achieve overall higher HV, lower IGD, and up to 54% lower latency than baselines at similar accuracy. Meanwhile, the search cost drops from days to minutes compared with traditional supernet baselines.
[416] Density-Ratio Weighted Behavioral Cloning: Learning Control Policies from Corrupted Datasets
Shriram Karpoora Sundara Pandian, Ali Baheri
Main category: cs.LG
TL;DR: Weighted BC is a robust imitation learning method that uses density ratio weighting with a clean reference set to handle contaminated offline RL datasets, outperforming standard methods even with high contamination rates.
Details
Motivation: Offline RL datasets often contain corrupted data from adversarial poisoning or system errors, which degrades policy performance in standard behavioral cloning and offline RL methods.
Method: Uses a small clean reference set to estimate trajectory-level density ratios via binary discriminator, then applies clipped ratios as weights in BC objective to prioritize clean expert behavior while down-weighting corrupted data.
Result: Maintains near-optimal performance even at high contamination ratios, outperforming baselines like traditional BC, BCQ, and BRAC across various poisoning protocols on continuous control benchmarks.
Conclusion: Weighted BC provides a robust solution for offline RL with contaminated datasets, with theoretical guarantees and practical effectiveness independent of contamination mechanisms.
Abstract: Offline reinforcement learning (RL) enables policy optimization from fixed datasets, making it suitable for safety-critical applications where online exploration is infeasible. However, these datasets are often contaminated by adversarial poisoning, system errors, or low-quality samples, leading to degraded policy performance in standard behavioral cloning (BC) and offline RL methods. This paper introduces Density-Ratio Weighted Behavioral Cloning (Weighted BC), a robust imitation learning approach that uses a small, verified clean reference set to estimate trajectory-level density ratios via a binary discriminator. These ratios are clipped and used as weights in the BC objective to prioritize clean expert behavior while down-weighting or discarding corrupted data, without requiring knowledge of the contamination mechanism. We establish theoretical guarantees showing convergence to the clean expert policy with finite-sample bounds that are independent of the contamination rate. A comprehensive evaluation framework is established, which incorporates various poisoning protocols (reward, state, transition, and action) on continuous control benchmarks. Experiments demonstrate that Weighted BC maintains near-optimal performance even at high contamination ratios, outperforming baselines such as traditional BC, batch-constrained Q-learning (BCQ), and behavior-regularized actor-critic (BRAC).
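A minimal PyTorch sketch of the weighting step under stated assumptions: `discriminator` is a trained clean-vs-mixed classifier over trajectory features and `traj_log_probs` are per-trajectory policy log-likelihoods; both are placeholders, not the paper's exact interfaces.

```python
import torch

def density_ratio_weights(discriminator, traj_features, clip_max=5.0):
    """Clipped trajectory-level density ratios from a clean-vs-mixed binary
    discriminator: if d = P(clean | trajectory), then p_clean/p_mix is
    proportional to d / (1 - d); clipping bounds any trajectory's influence."""
    with torch.no_grad():
        d = torch.sigmoid(discriminator(traj_features)).squeeze(-1)
        ratios = d / (1.0 - d).clamp_min(1e-6)
    return ratios.clamp(max=clip_max)

def weighted_bc_loss(traj_log_probs, weights):
    """BC objective with per-trajectory weights: maximize w_i * log pi(a|s),
    so corrupted trajectories with small w_i barely move the policy."""
    return -(weights * traj_log_probs).mean()
```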
[417] Understanding Adversarial Transfer: Why Representation-Space Attacks Fail Where Data-Space Attacks Succeed
Isha Gupta, Rylan Schaeffer, Joshua Kazdan, Ken Liu, Sanmi Koyejo
Main category: cs.LG
TL;DR: This paper explains why adversarial attacks transfer between models in data-space but not in representation-space, showing that transferability depends on the attack domain rather than being an inherent property.
Details
Motivation: To explain the puzzling observation that image jailbreaks don't transfer between vision-language models (VLMs) while adversarial examples transfer between image classifiers and text jailbreaks transfer between language models.Method: The authors propose a theoretical distinction between data-space and representation-space attacks, then provide evidence through mathematical proofs, empirical experiments with image classifiers, language models, and vision-language models, testing transferability across different attack domains.
Result: Data-space attacks successfully transfer between models, while representation-space attacks fail to transfer unless there’s geometric alignment of representations. The authors demonstrated this in four settings: mathematical proof, image classifiers, language models, and vision-language models.
Conclusion: Adversarial transfer is not an inherent property of all attacks but depends on their operational domain - attacks in shared data-space transfer, while attacks in models’ unique representation spaces do not transfer without geometric alignment.
Abstract: The field of adversarial robustness has long established that adversarial examples can successfully transfer between image classifiers and that text jailbreaks can successfully transfer between language models (LMs). However, a pair of recent studies reported being unable to successfully transfer image jailbreaks between vision-language models (VLMs). To explain this striking difference, we propose a fundamental distinction regarding the transferability of attacks against machine learning models: attacks in the input data-space can transfer, whereas attacks in model representation space do not, at least not without geometric alignment of representations. We then provide theoretical and empirical evidence of this hypothesis in four different settings. First, we mathematically prove this distinction in a simple setting where two networks compute the same input-output map but via different representations. Second, we construct representation-space attacks against image classifiers that are as successful as well-known data-space attacks, but fail to transfer. Third, we construct representation-space attacks against LMs that successfully jailbreak the attacked models but again fail to transfer. Fourth, we construct data-space attacks against VLMs that successfully transfer to new VLMs, and we show that representation space attacks can transfer when VLMs’ latent geometries are sufficiently aligned in post-projector space. Our work reveals that adversarial transfer is not an inherent property of all attacks but contingent on their operational domain - the shared data-space versus models’ unique representation spaces - a critical insight for building more robust models.
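To make the data-space side of the distinction concrete, here is a textbook FGSM-style transfer check in PyTorch (a standard data-space attack, not the paper's representation-space construction); `src_model` and `tgt_model` are any two image classifiers.

```python
import torch
import torch.nn.functional as F

def fgsm_data_space(model, x, y, eps=8 / 255):
    """One-step FGSM in the shared input (data) space."""
    x = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x), y).backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

def transfer_rate(src_model, tgt_model, x, y, eps=8 / 255):
    """Fraction of examples crafted on src_model that also fool tgt_model;
    the paper's claim is that this stays high for data-space attacks but
    collapses for representation-space ones absent geometric alignment."""
    x_adv = fgsm_data_space(src_model, x, y, eps)
    with torch.no_grad():
        preds = tgt_model(x_adv).argmax(dim=1)
    return (preds != y).float().mean().item()
```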
[418] Beyond Majority Voting: LLM Aggregation by Leveraging Higher-Order Information
Rui Ai, Yuqi Pan, David Simchi-Levi, Milind Tambe, Haifeng Xu
Main category: cs.LG
TL;DR: The paper proposes two new aggregation algorithms (Optimal Weight and Inverse Surprising Popularity) for multi-agent LLM reasoning that outperform standard majority voting by leveraging first-order and second-order information to account for model heterogeneity and correlation.
Details
Motivation: Standard majority voting treats all LLM answers equally, failing to consider latent heterogeneity and correlation across models, which limits the reliability of collective decisions in multi-agent LLM systems.Method: Designed two aggregation algorithms: Optimal Weight (OW) and Inverse Surprising Popularity (ISP) that leverage both first-order and second-order information to better account for model dependencies and improve aggregation.
Result: Empirical validation on synthetic datasets, LLM benchmarks (UltraFeedback, MMLU), and real-world healthcare setting (ARMMAN) shows consistent outperformance over majority voting across all cases.
Conclusion: The proposed methods offer both practical performance gains and conceptual insights for designing robust multi-agent LLM pipelines, provably mitigating inherent limitations of majority voting under mild assumptions.
Abstract: With the rapid progress of multi-agent large language model (LLM) reasoning, how to effectively aggregate answers from multiple LLMs has emerged as a fundamental challenge. Standard majority voting treats all answers equally, failing to consider latent heterogeneity and correlation across models. In this work, we design two new aggregation algorithms called Optimal Weight (OW) and Inverse Surprising Popularity (ISP), leveraging both first-order and second-order information. Our theoretical analysis shows these methods provably mitigate inherent limitations of majority voting under mild assumptions, leading to more reliable collective decisions. We empirically validate our algorithms on synthetic datasets, popular LLM fine-tuning benchmarks such as UltraFeedback and MMLU, and a real-world healthcare setting ARMMAN. Across all cases, our methods consistently outperform majority voting, offering both practical performance gains and conceptual insights for the design of robust multi-agent LLM pipelines.
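The paper's OW and ISP estimators also exploit second-order (co-occurrence) information, which the sketch below omits. As a first-order baseline for intuition, here is classical log-odds weighted voting, optimal for independent voters with known accuracies; accuracies are assumed to be estimated on held-out data, and majority voting is the equal-weights special case.

```python
import numpy as np

def log_odds_vote(answers, accuracies):
    """Weight each model's answer by the log-odds of its estimated accuracy,
    then pick the answer with the highest total weight."""
    acc = np.clip(np.asarray(accuracies, float), 1e-6, 1 - 1e-6)
    weights = np.log(acc / (1 - acc))
    scores = {}
    for ans, w in zip(answers, weights):
        scores[ans] = scores.get(ans, 0.0) + w
    return max(scores, key=scores.get)

# A single highly accurate model can outvote two weaker agreeing models:
# log-odds(0.9) = 2.20 beats 2 * log-odds(0.7) = 1.69.
print(log_odds_vote(["A", "A", "B"], [0.7, 0.7, 0.9]))   # "B"
```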
[419] Realistic CDSS Drug Dosing with End-to-end Recurrent Q-learning for Dual Vasopressor Control
Will Y. Zou, Jean Feng, Alexandre Kalimouttou, Jennifer Yuntong Zhang, Christopher W. Seymour, Romain Pirracchio
Main category: cs.LG
TL;DR: An end-to-end RL approach for optimal dual vasopressor dosing in ICU septic shock patients, using action space design to improve interpretability and clinical adoption while maintaining efficacy.
Details
Motivation: Address practitioner skepticism about RL in CDSS by making dosing decisions more interpretable and operable, particularly for dual vasopressor administration in ICU settings.Method: Combines offline conservative Q-learning with novel recurrent modeling in replay buffer to capture temporal dependencies; designs action spaces accommodating discrete, continuous, and directional dosing strategies.
Result: Action space design significantly influences learned policies; achieves over 15% survival improvement probability on eICU and MIMIC datasets while aligning with clinical protocols.
Conclusion: Proper action space design in RL for clinical dosing improves interpretability and facilitates adoption while preserving treatment efficacy, bridging the gap between AI and clinical practice.
Abstract: Reinforcement learning (RL) applications in Clinical Decision Support Systems (CDSS) frequently encounter skepticism from practitioners regarding inoperable dosing decisions. We address this challenge with an end-to-end approach for learning optimal drug dosing and control policies for dual vasopressor administration in intensive care unit (ICU) patients with septic shock. For realistic drug dosing, we apply action space design that accommodates discrete, continuous, and directional dosing strategies in a system that combines offline conservative Q-learning with a novel recurrent modeling in a replay buffer to capture temporal dependencies in ICU time-series data. Our comparative analysis of norepinephrine dosing strategies across different action space formulations reveals that the designed action spaces improve interpretability and facilitate clinical adoption while preserving efficacy. Empirical results on eICU and MIMIC demonstrate that action space design profoundly influences learned behavioral policies. The proposed methods improve patient outcomes, achieving over 15% gains in survival improvement probability, while aligning with established clinical protocols.
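A toy illustration of the three action-space designs the abstract contrasts; dose levels, ranges, and step sizes here are invented for illustration and do not reflect the paper's parameterization.

```python
import numpy as np

# Discrete design: the policy picks one of a few absolute dose bins.
DISCRETE_LEVELS = np.array([0.0, 0.05, 0.1, 0.2, 0.5])   # mcg/kg/min (made up)

def continuous_action(raw):
    """Continuous design: squash an unbounded policy output into a dose range."""
    return 0.5 / (1.0 + np.exp(-raw))                     # (0, 0.5) mcg/kg/min

def directional_action(current_dose, direction, step=0.05):
    """Directional design: titrate the running dose up/down/hold, which reads
    more like clinical titration than absolute dose jumps."""
    delta = {"up": step, "hold": 0.0, "down": -step}[direction]
    return max(0.0, current_dose + delta)

print(DISCRETE_LEVELS[2], continuous_action(0.0), directional_action(0.1, "up"))
```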
[420] Flock: A Knowledge Graph Foundation Model via Learning on Random Walks
Jinwoo Kim, Xingyue Huang, Krzysztof Olejniczak, Kyungbin Min, Michael Bronstein, Seunghoon Hong, İsmail İlkan Ceylan
Main category: cs.LG
TL;DR: Flock introduces probabilistic node-relation equivariance for zero-shot link prediction on knowledge graphs, overcoming limitations of deterministic equivariance and achieving state-of-the-art performance.
Details
Motivation: Current knowledge graph foundation models (KGFMs) use deterministic equivariance which limits expressive power and prevents distinguishing structurally similar but semantically distinct relations.Method: Flock uses probabilistic node-relation equivariance, iteratively samples random walks, encodes them via recording protocol, embeds with sequence model, and aggregates via learned pooling.
Result: Flock perfectly solves the new diagnostic dataset Petals where current KGFMs fail, and achieves state-of-the-art performance on entity and relation prediction tasks across 54 diverse knowledge graphs.
Conclusion: Probabilistic node-relation equivariance enables more expressive KGFMs that can generalize better to novel entities and relations while maintaining equivariance properties.
Abstract: We study the problem of zero-shot link prediction on knowledge graphs (KGs), which requires models to generalize over novel entities and novel relations. Knowledge graph foundation models (KGFMs) address this task by enforcing equivariance over both nodes and relations, learning from structural properties of nodes and relations, which are then transferable to novel graphs with similar structural properties. However, the conventional notion of deterministic equivariance imposes inherent limits on the expressive power of KGFMs, preventing them from distinguishing structurally similar but semantically distinct relations. To overcome this limitation, we introduce probabilistic node-relation equivariance, which preserves equivariance in distribution while incorporating a principled randomization to break symmetries during inference. Building on this principle, we present Flock, a KGFM that iteratively samples random walks, encodes them into sequences via a recording protocol, embeds them with a sequence model, and aggregates representations of nodes and relations via learned pooling. Crucially, Flock respects probabilistic node-relation equivariance and is a universal approximator for isomorphism-invariant link-level functions over KGs. Empirically, Flock perfectly solves our new diagnostic dataset Petals where current KGFMs fail, and achieves state-of-the-art performances on entity- and relation prediction tasks on 54 KGs from diverse domains.
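A small sketch of the random-walk "recording protocol" idea, assuming a toy adjacency-list KG; renaming nodes and relations by the index of their first occurrence keeps the encoding purely structural, which is one simple way to realize the anonymization the abstract describes (Flock's actual protocol and sequence model are not shown).

```python
import random

def record_walk(kg, start, n_steps, rng=random):
    """Sample a random walk over (relation, neighbor) edges and encode it so
    the result depends only on the walk's structure, not on identities."""
    node_idx, rel_idx = {start: 0}, {}
    seq, cur = [("N", 0)], start
    for _ in range(n_steps):
        edges = kg.get(cur, [])
        if not edges:
            break
        rel, nxt = rng.choice(edges)
        rel_idx.setdefault(rel, len(rel_idx))
        node_idx.setdefault(nxt, len(node_idx))
        seq += [("R", rel_idx[rel]), ("N", node_idx[nxt])]
        cur = nxt
    return seq

kg = {"a": [("likes", "b")], "b": [("knows", "a"), ("likes", "c")]}
print(record_walk(kg, "a", 3))   # e.g. [('N', 0), ('R', 0), ('N', 1), ...]
```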
[421] Predictive Modeling and Explainable AI for Veterinary Safety Profiles, Residue Assessment, and Health Outcomes Using Real-World Data and Physicochemical Properties
Hossein Sholehrasa, Xuan Xu, Doina Caragea, Jim E. Riviere, Majid Jaberi-Douraki
Main category: cs.LG
TL;DR: A predictive framework using machine learning to classify veterinary drug adverse event outcomes (Death vs. Recovery) from FDA data, achieving high performance through ensemble methods and interpretable AI.
Details
Motivation: To protect animal welfare and human food safety by predicting adverse events that may lead to violative residues in the food chain, supporting early detection of high-risk drug-event profiles.Method: Used ~1.28 million FDA reports with preprocessing, VeDDRA ontology standardization, feature engineering including drug properties, and evaluated multiple ML models (Random Forest, CatBoost, XGBoost, ExcelFormer, LLMs) with ensemble methods and AUM-based pseudo-labeling.
Result: Ensemble methods and CatBoost achieved precision, recall, and F1-scores of 0.95. SHAP analysis identified biologically plausible predictors including lung/heart disorders, animal demographics, and drug properties linked to fatal outcomes.
Conclusion: Combining data engineering, advanced ML, and explainable AI enables accurate, interpretable predictions of veterinary safety outcomes, supporting regulatory decision-making and residue risk assessment.
Abstract: The safe use of pharmaceuticals in food-producing animals is vital to protect animal welfare and human food safety. Adverse events (AEs) may signal unexpected pharmacokinetic or toxicokinetic effects, increasing the risk of violative residues in the food chain. This study introduces a predictive framework for classifying outcomes (Death vs. Recovery) using ~1.28 million reports (1987-2025 Q1) from the U.S. FDA’s OpenFDA Center for Veterinary Medicine. A preprocessing pipeline merged relational tables and standardized AEs through VeDDRA ontologies. Data were normalized, missing values imputed, and high-cardinality features reduced; physicochemical drug properties were integrated to capture chemical-residue links. We evaluated supervised models, including Random Forest, CatBoost, XGBoost, ExcelFormer, and large language models (Gemma 3-27B, Phi 3-12B). Class imbalance was addressed using techniques such as undersampling and oversampling, with a focus on prioritizing recall for fatal outcomes. Ensemble methods (Voting, Stacking) and CatBoost performed best, achieving precision, recall, and F1-scores of 0.95. Incorporating Average Uncertainty Margin (AUM)-based pseudo-labeling of uncertain cases improved minority-class detection, particularly in ExcelFormer and XGBoost. Interpretability via SHAP identified biologically plausible predictors, including lung, heart, and bronchial disorders, animal demographics, and drug physicochemical properties. These features were strongly linked to fatal outcomes. Overall, the framework shows that combining rigorous data engineering, advanced machine learning, and explainable AI enables accurate, interpretable predictions of veterinary safety outcomes. The approach supports FARAD’s mission by enabling early detection of high-risk drug-event profiles, strengthening residue risk assessment, and informing regulatory and clinical decision-making.
[422] CarbonX: An Open-Source Tool for Computational Decarbonization Using Time Series Foundation Models
Diptyaroop Maji, Kang Yang, Prashant Shenoy, Ramesh K Sitaraman, Mani Srivastava
Main category: cs.LG
TL;DR: CarbonX is an open-source tool that uses Time Series Foundation Models to provide accurate carbon intensity forecasts globally without requiring grid-specific electricity mix data, achieving 15.82% MAPE across 214 grids with uncertainty estimates.
Details
Motivation: Existing carbon intensity forecasting tools have limitations: they require grid-specific electricity mix data (unavailable in many regions), use separate grid-specific models limiting global coverage, and lack uncertainty estimates which reduces reliability for carbon-aware applications.Method: CarbonX leverages Time Series Foundation Models (TSFMs) that use only historical carbon intensity data and a single general model to perform multiple decarbonization tasks including forecasting and imputation across diverse grids worldwide.
Result: CarbonX achieves zero-shot forecasting MAPE of 15.82% across 214 grids, comparable performance to state-of-the-art on 13 benchmark grids (9.59% average MAPE, 16.54% tail MAPE), provides 95% coverage prediction intervals, forecasts up to 21 days with minimal accuracy degradation, and outperforms statistical baselines by 1.2-3.9X on imputation when fine-tuned.
Conclusion: CarbonX can be easily deployed on any grid with limited data while delivering strong performance, making it a practical and scalable tool for global computational decarbonization efforts.
Abstract: Computational decarbonization aims to reduce carbon emissions in computing and societal systems such as data centers, transportation, and built environments. This requires accurate, fine-grained carbon intensity forecasts, yet existing tools have several key limitations: (i) they require grid-specific electricity mix data, restricting use where such information is unavailable; (ii) they depend on separate grid-specific models that make it challenging to provide global coverage; and (iii) they provide forecasts without uncertainty estimates, limiting reliability for downstream carbon-aware applications. In this paper, we present CarbonX, an open-source tool that leverages Time Series Foundation Models (TSFMs) for a range of decarbonization tasks. CarbonX utilizes the versatility of TSFMs to provide strong performance across multiple tasks, such as carbon intensity forecasting and imputation, and across diverse grids. Using only historical carbon intensity data and a single general model, our tool achieves a zero-shot forecasting Mean Absolute Percentage Error (MAPE) of 15.82% across 214 grids worldwide. Across 13 benchmark grids, CarbonX performance is comparable with the current state-of-the-art, with an average MAPE of 9.59% and tail forecasting MAPE of 16.54%, while also providing prediction intervals with 95% coverage. CarbonX can provide forecasts for up to 21 days with minimal accuracy degradation. Further, when fully fine-tuned, CarbonX outperforms the statistical baselines by 1.2–3.9X on the imputation task. Overall, these results demonstrate that CarbonX can be used easily on any grid with limited data and still deliver strong performance, making it a practical tool for global-scale decarbonization.
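The two headline metrics are easy to pin down. A NumPy sketch of MAPE and empirical prediction-interval coverage, the quantities behind the reported 15.82% zero-shot error and 95% coverage (array names are illustrative):

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error in percent (assumes y_true != 0,
    reasonable for carbon-intensity series)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

def interval_coverage(y_true, lower, upper):
    """Fraction of true values falling inside the prediction intervals;
    a well-calibrated 95% interval should score close to 0.95."""
    y_true = np.asarray(y_true, float)
    return float(np.mean((y_true >= np.asarray(lower, float)) &
                         (y_true <= np.asarray(upper, float))))

print(mape([100, 200], [90, 230]))                                   # 12.5
print(interval_coverage([100, 200], [80, 150], [120, 190]))          # 0.5
```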
[423] On Integer Programming for the Binarized Neural Network Verification Problem
Woojin Kim, James R. Luedtke
Main category: cs.LG
TL;DR: The paper presents improved integer programming (IP) formulations for verifying binarized neural networks (BNNs) against input perturbations, with techniques for better linear objectives and valid inequalities.
Details
Motivation: BNN verification is challenging due to large integrality gaps in natural IP formulations caused by big-M constraints, limiting the ability to verify robustness against input perturbations.Method: Two techniques: 1) A new method for obtaining linear objectives in multi-class settings, and 2) A novel technique for generating valid inequalities that exploit BNNs’ recursive structure.
Result: The improved IP formulation enables verifying BNNs against higher ranges of input perturbation than existing IP approaches within limited time constraints.
Conclusion: The proposed techniques significantly enhance BNN verification capabilities by addressing key limitations in traditional IP formulations, making them more effective for robustness analysis.
Abstract: Binarized neural networks (BNNs) are feedforward neural networks with binary weights and activation functions. In the context of using a BNN for classification, the verification problem seeks to determine whether a small perturbation of a given input can lead it to be misclassified by the BNN, and the robustness of the BNN can be measured by solving the verification problem over multiple inputs. The BNN verification problem can be formulated as an integer programming (IP) problem. However, the natural IP formulation is often challenging to solve due to a large integrality gap induced by big-$M$ constraints. We present two techniques to improve the IP formulation. First, we introduce a new method for obtaining a linear objective for the multi-class setting. Second, we introduce a new technique for generating valid inequalities for the IP formulation that exploits the recursive structure of BNNs. We find that our techniques enable verifying BNNs against a higher range of input perturbation than existing IP approaches within a limited time.
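For context, here is a PuLP sketch of the natural big-M encoding of a single BNN neuron, the formulation whose large integrality gap the paper tightens (the improved objective and valid inequalities are not shown). With {-1, +1} weights and binary inputs the pre-activation is integral, so a margin of 1 stands in for the strict inequality; weights and bias below are made up.

```python
import pulp

# One neuron: z = 1[ w . (2x - 1) + b >= 0 ] with binary inputs x.
w, b = [1, -1, 1], 0
M = sum(abs(wi) for wi in w) + abs(b) + 1   # valid big-M bound

prob = pulp.LpProblem("bnn_neuron", pulp.LpMinimize)
x = [pulp.LpVariable(f"x{i}", cat="Binary") for i in range(len(w))]
z = pulp.LpVariable("z", cat="Binary")
pre = pulp.lpSum(wi * (2 * xi - 1) for wi, xi in zip(w, x)) + b

prob += pre >= -M * (1 - z)      # z = 1 forces pre >= 0
prob += pre <= M * z - (1 - z)   # z = 0 forces pre <= -1 (integrality margin)
prob += z                        # dummy objective: find any feasible point
prob.solve(pulp.PULP_CBC_CMD(msg=False))
print(int(pulp.value(z)), [int(pulp.value(xi)) for xi in x])
```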
[424] Round-trip Reinforcement Learning: Self-Consistent Training for Better Chemical LLMs
Lecheng Kong, Xiyuan Wang, Yixin Chen, Muhan Zhang
Main category: cs.LG
TL;DR: RTRL improves LLM consistency in chemistry tasks by using round-trip success as reinforcement learning reward, boosting performance across data regimes.
Details
Motivation: LLMs in chemistry lack round-trip consistency (can't reconstruct original structures from their own generated text), indicating memorization rather than true mastery.Method: Round-Trip Reinforcement Learning (RTRL) framework that uses round-trip transformation success as reward signal, with iterative variant where forward/reverse mappings alternately train each other.
Result: Significantly boosts performance and consistency over strong baselines across supervised, self-supervised, and synthetic data regimes.
Conclusion: Round-trip consistency is a trainable objective that offers path toward more robust foundation models, not just a desirable property.
Abstract: Large Language Models (LLMs) are emerging as versatile foundation models for computational chemistry, handling bidirectional tasks like reaction prediction and retrosynthesis. However, these models often lack round-trip consistency. For instance, a state-of-the-art chemical LLM may successfully caption a molecule, yet be unable to accurately reconstruct the original structure from its own generated text. This inconsistency suggests that models are learning unidirectional memorization rather than flexible mastery. Indeed, recent work has demonstrated a strong correlation between a model’s round-trip consistency and its performance on the primary tasks. This strong correlation reframes consistency into a direct target for model improvement. We therefore introduce Round-Trip Reinforcement Learning (RTRL), a novel framework that trains a model to improve its consistency by using the success of a round-trip transformation as a reward signal. We further propose an iterative variant where forward and reverse mappings alternately train each other in a self-improvement loop, a process that is highly data-efficient and notably effective with the massive amount of unlabelled data common in chemistry. Experiments demonstrate that RTRL significantly boosts performance and consistency over strong baselines across supervised, self-supervised, and synthetic data regimes. This work shows that round-trip consistency is not just a desirable property but a trainable objective, offering a new path toward more robust and reliable foundation models.
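The reward signal itself is simple to state. A hedged sketch with stand-in callables: in practice `canonicalize` would be something like RDKit's canonical-SMILES function and the two models would be the chemical LLM's forward and reverse directions.

```python
def round_trip_reward(forward_model, reverse_model, molecule, canonicalize):
    """Reward = 1 if the reverse model reconstructs the original structure
    from the forward model's own output (molecule -> caption -> molecule)."""
    description = forward_model(molecule)
    reconstruction = reverse_model(description)
    return float(canonicalize(reconstruction) == canonicalize(molecule))

# Toy demo with stand-in models whose round trip trivially succeeds.
caption = lambda smiles: f"molecule: {smiles}"
reconstruct = lambda text: text.split(": ", 1)[1]
print(round_trip_reward(caption, reconstruct, "CCO", str.strip))   # 1.0
```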
[425] Bypassing Prompt Guards in Production with Controlled-Release Prompting
Jaiden Fairoze, Sanjam Garg, Keewoo Lee, Mingyuan Wang
Main category: cs.LG
TL;DR: A new attack method bypasses prompt guards in LLMs by exploiting resource asymmetry between lightweight guards and main models, successfully jailbreaking production models while maintaining response quality.
Details
Motivation: To highlight limitations of prompt guards in ensuring AI safety and alignment, demonstrating their vulnerability to sophisticated attacks despite being popular lightweight defense mechanisms.Method: Exploits resource asymmetry between prompt guards and main LLMs by encoding jailbreak prompts that lightweight guards cannot decode but the main model can interpret.
Result: Successfully jailbreaks production models including Google Gemini, DeepSeek Chat, Grok, and Mistral Le Chat while maintaining response quality, even under highly protected chat interfaces.
Conclusion: Reveals inherent attack surface in lightweight prompt guards and underscores the need to shift defenses from blocking malicious inputs to preventing malicious outputs, while also identifying other critical alignment issues.
Abstract: As large language models (LLMs) advance, ensuring AI safety and alignment is paramount. One popular approach is prompt guards, lightweight mechanisms designed to filter malicious queries while being easy to implement and update. In this work, we introduce a new attack that circumvents such prompt guards, highlighting their limitations. Our method consistently jailbreaks production models while maintaining response quality, even under the highly protected chat interfaces of Google Gemini (2.5 Flash/Pro), DeepSeek Chat (DeepThink), Grok (3), and Mistral Le Chat (Magistral). The attack exploits a resource asymmetry between the prompt guard and the main LLM, encoding a jailbreak prompt that lightweight guards cannot decode but the main model can. This reveals an attack surface inherent to lightweight prompt guards in modern LLM architectures and underscores the need to shift defenses from blocking malicious inputs to preventing malicious outputs. We additionally identify other critical alignment issues, such as copyrighted data extraction, training data extraction, and malicious response leakage during thinking.
[426] NVIDIA AI Aerial: AI-Native Wireless Communications
Kobi Cohen-Arazi, Michael Roe, Zhen Hu, Rohan Chavan, Anna Ptasznik, Joanna Lin, Joao Morais, Joseph Boccuzzi, Tommaso Balercia
Main category: cs.LG
TL;DR: A framework that compiles Python-based AI/ML algorithms into GPU-runnable blobs for 6G networks, enabling efficient integration of DSP and ML in cellular systems with demonstrated channel estimation using CNNs.
Details
Motivation: 6G requires seamless integration of digital signal processing and machine learning in cellular networks, bringing network life cycles closer to AI systems where models are iteratively trained, simulated, and deployed.Method: Proposes a robust framework that compiles Python-based algorithms into GPU-runnable blobs, using convolutional neural networks for channel estimation in PUSCH receivers, tested first in digital twins then real-time testbeds via NVIDIA AI Aerial platform.
Result: Achieves a unified approach ensuring efficiency, flexibility, and highest performance on NVIDIA GPUs, successfully demonstrating channel estimation function through CNN implementation.
Conclusion: The methodology lays foundation for scalable AI/ML integration into next-generation cellular systems and is essential for realizing natively intelligent 6G networks.
Abstract: 6G brings a paradigm shift towards AI-native wireless systems, necessitating the seamless integration of digital signal processing (DSP) and machine learning (ML) within the software stacks of cellular networks. This transformation brings the life cycle of modern networks closer to AI systems, where models and algorithms are iteratively trained, simulated, and deployed across adjacent environments. In this work, we propose a robust framework that compiles Python-based algorithms into GPU-runnable blobs. The result is a unified approach that ensures efficiency, flexibility, and the highest possible performance on NVIDIA GPUs. As an example of the capabilities of the framework, we demonstrate the efficacy of performing the channel estimation function in the PUSCH receiver through a convolutional neural network (CNN) trained in Python. This is done in a digital twin first, and subsequently in a real-time testbed. Our proposed methodology, realized in the NVIDIA AI Aerial platform, lays the foundation for scalable integration of AI/ML models into next-generation cellular systems, and is essential for realizing the vision of natively intelligent 6G networks.
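A toy PyTorch version of the kind of CNN channel estimator described, treating real/imaginary parts as two channels over a subcarrier-by-symbol grid; the architecture and sizes are illustrative, not NVIDIA's model.

```python
import torch
import torch.nn as nn

class ChannelEstimatorCNN(nn.Module):
    """Maps least-squares channel estimates at pilot positions to a
    denoised/interpolated estimate of the same shape."""
    def __init__(self, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 2, 3, padding=1),
        )

    def forward(self, h_ls):                 # (batch, 2, subcarriers, symbols)
        return self.net(h_ls)

est = ChannelEstimatorCNN()
print(est(torch.randn(4, 2, 72, 14)).shape)  # torch.Size([4, 2, 72, 14])
```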
[427] TimeSeriesScientist: A General-Purpose AI Agent for Time Series Analysis
Haokun Zhao, Xiang Zhang, Jiaqi Wei, Yiwei Xu, Yuting He, Siqi Sun, Chenyu You
Main category: cs.LG
TL;DR: TSci is an LLM-driven agentic framework for time series forecasting that uses four specialized agents (Curator, Planner, Forecaster, Reporter) to automate preprocessing, model selection, validation, and reporting, achieving superior performance while providing interpretable results.
Details
Motivation: Current time series forecasting methods require extensive manual preprocessing, validation, and ensembling, are domain-specific, and generalize poorly. There's an urgent need for a domain-agnostic framework that minimizes human intervention.Method: Four-agent framework: Curator performs LLM-guided diagnostics and preprocessing; Planner narrows model choices using multi-modal diagnostics; Forecaster performs model fitting, validation, and adaptive ensemble selection; Reporter synthesizes the process into comprehensive reports.
Result: Empirical results on eight benchmarks show TSci consistently outperforms statistical and LLM-based baselines, reducing forecast error by 10.4% and 38.2% respectively, while producing transparent reports.
Conclusion: TSci transforms forecasting into an interpretable white-box system that is extensible across tasks, demonstrating the effectiveness of LLM-driven agentic frameworks for automating complex time series forecasting workflows.
Abstract: Time series forecasting is central to decision-making in domains as diverse as energy, finance, climate, and public health. In practice, forecasters face thousands of short, noisy series that vary in frequency, quality, and horizon, where the dominant cost lies not in model fitting, but in the labor-intensive preprocessing, validation, and ensembling required to obtain reliable predictions. Prevailing statistical and deep learning models are tailored to specific datasets or domains and generalize poorly. A general, domain-agnostic framework that minimizes human intervention is urgently in demand. In this paper, we introduce TimeSeriesScientist (TSci), the first LLM-driven agentic framework for general time series forecasting. The framework comprises four specialized agents: Curator performs LLM-guided diagnostics augmented by external tools that reason over data statistics to choose targeted preprocessing; Planner narrows the hypothesis space of model choice by leveraging multi-modal diagnostics and self-planning over the input; Forecaster performs model fitting and validation and, based on the results, adaptively selects the best model configuration as well as ensemble strategy to make final predictions; and Reporter synthesizes the whole process into a comprehensive, transparent report. With transparent natural-language rationales and comprehensive reports, TSci transforms the forecasting workflow into a white-box system that is both interpretable and extensible across tasks. Empirical results on eight established benchmarks demonstrate that TSci consistently outperforms both statistical and LLM-based baselines, reducing forecast error by an average of 10.4% and 38.2%, respectively. Moreover, TSci produces a clear and rigorous report that makes the forecasting workflow more transparent and interpretable.
[428] Executable Counterfactuals: Improving LLMs’ Causal Reasoning Through Code
Aniket Vashishtha, Qirun Dai, Hongyuan Mei, Amit Sharma, Chenhao Tan, Hao Peng
Main category: cs.LG
TL;DR: The paper introduces ’executable counterfactuals’ framework to properly assess LLMs’ counterfactual reasoning by requiring all three steps (abduction, intervention, prediction), revealing significant performance drops (25-40%) in state-of-the-art models when abduction is included.
Details
Motivation: Existing evaluations of LLMs' counterfactual reasoning skip the abduction step, leading to overestimated performance. This gap is problematic for applications in high-stakes domains like scientific research where full causal understanding is essential.Method: Developed executable counterfactuals framework using code and math problems that explicitly requires abduction, intervention, and prediction. Created scalable synthetic data with varying difficulty and tested different training approaches including supervised finetuning and reinforcement learning.
Result: SOTA models showed 25-40% accuracy drop from interventional to counterfactual reasoning. Supervised finetuning improved in-domain performance but hurt OOD generalization, while RL induced core cognitive behaviors and generalized better, achieving 1.5x-2x improvements on both code and math problems.
Conclusion: Reinforcement learning is more effective than supervised finetuning for improving LLMs’ counterfactual reasoning as it induces fundamental cognitive behaviors that generalize across domains, highlighting RL’s promise for advancing causal understanding in LLMs.
Abstract: Counterfactual reasoning, a hallmark of intelligence, consists of three steps: inferring latent variables from observations (abduction), constructing alternatives (interventions), and predicting their outcomes (prediction). This skill is essential for advancing LLMs’ causal understanding and expanding their applications in high-stakes domains such as scientific research. However, existing efforts in assessing LLM’s counterfactual reasoning capabilities tend to skip the abduction step, effectively reducing to interventional reasoning and leading to overestimation of LLM performance. To address this, we introduce executable counterfactuals, a novel framework that operationalizes causal reasoning through code and math problems. Our framework explicitly requires all three steps of counterfactual reasoning and enables scalable synthetic data creation with varying difficulty, creating a frontier for evaluating and improving LLM’s reasoning. Our results reveal substantial drop in accuracy (25-40%) from interventional to counterfactual reasoning for SOTA models like o4-mini and Claude-4-Sonnet. To address this gap, we construct a training set comprising counterfactual code problems having if-else condition and test on out-of-domain code structures (e.g. having while-loop); we also test whether a model trained on code would generalize to counterfactual math word problems. While supervised finetuning on stronger models’ reasoning traces improves in-domain performance of Qwen models, it leads to a decrease in accuracy on OOD tasks such as counterfactual math problems. In contrast, reinforcement learning induces the core cognitive behaviors and generalizes to new domains, yielding gains over the base model on both code (improvement of 1.5x-2x) and math problems. Analysis of the reasoning traces reinforces these findings and highlights the promise of RL for improving LLMs’ counterfactual reasoning.
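A minimal executable counterfactual in the spirit of the framework, using a toy program: abduction recovers the latent variable from the observed run, intervention changes the input, and prediction reruns the code. The program and values are invented for illustration.

```python
def program(u, x):
    """Toy executable world model with latent u and observed input x."""
    return u + 2 * x if x > 0 else u - x

# Observation: input x = 3 produced output y = 9.
x_obs, y_obs = 3, 9

# 1) Abduction: infer the latent u consistent with the observed run.
u = next(u for u in range(-20, 21) if program(u, x_obs) == y_obs)   # u = 3

# 2) Intervention: replace the input with a counterfactual value.
x_cf = -1

# 3) Prediction: rerun the program with the abduced latent.
print(program(u, x_cf))   # 4, the counterfactual outcome
```

Skipping step 1 and guessing u (e.g., u = 0) is exactly the interventional shortcut the paper argues inflates reported performance.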
[429] Predictive Preference Learning from Human Interventions
Haoyuan Cai, Zhenghao Peng, Bolei Zhou
Main category: cs.LG
TL;DR: PPL (Predictive Preference Learning) is an interactive imitation learning method that propagates human interventions into future states to improve safety and efficiency in autonomous systems.
Details
Motivation: Current interactive imitation learning methods only correct agent actions at current states but fail to adjust future potentially hazardous actions, limiting their effectiveness in safety-critical applications.Method: PPL bootstraps each human intervention into L future time steps (preference horizon), assuming the agent follows the same action and human makes the same intervention. It applies preference optimization on these future states to propagate expert corrections into safety-critical regions.
Result: The method significantly improves learning efficiency and reduces human demonstrations needed in autonomous driving and robotic manipulation benchmarks, while theoretical analysis shows appropriate preference horizon selection balances risky state coverage with label correctness.
Conclusion: PPL effectively propagates human interventions into future states, enhancing safety and learning efficiency in interactive imitation learning systems with bounded algorithmic optimality gap.
Abstract: Learning from human involvement aims to incorporate the human subject to monitor and correct agent behavior errors. Although most interactive imitation learning methods focus on correcting the agent’s action at the current state, they do not adjust its actions in future states, which may be potentially more hazardous. To address this, we introduce Predictive Preference Learning from Human Interventions (PPL), which leverages the implicit preference signals contained in human interventions to inform predictions of future rollouts. The key idea of PPL is to bootstrap each human intervention into L future time steps, called the preference horizon, with the assumption that the agent follows the same action and the human makes the same intervention in the preference horizon. By applying preference optimization on these future states, expert corrections are propagated into the safety-critical regions where the agent is expected to explore, significantly improving learning efficiency and reducing human demonstrations needed. We evaluate our approach with experiments on both autonomous driving and robotic manipulation benchmarks and demonstrate its efficiency and generality. Our theoretical analysis further shows that selecting an appropriate preference horizon L balances coverage of risky states with label correctness, thereby bounding the algorithmic optimality gap. Demo and code are available at: https://metadriverse.github.io/ppl
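A sketch of the bootstrapping step, assuming interventions are logged as (step, agent action, expert action) tuples; the downstream preference-optimization loss is omitted.

```python
def preference_pairs(states, interventions, L):
    """Bootstrap each human intervention at step t over the next L states
    (the preference horizon): PPL assumes the agent would repeat its action
    and the human would repeat the correction throughout the horizon."""
    pairs = []
    for t, agent_a, expert_a in interventions:
        for k in range(t, min(t + L, len(states))):
            pairs.append((states[k], expert_a, agent_a))  # (state, preferred, dispreferred)
    return pairs

demo = preference_pairs(states=["s0", "s1", "s2", "s3", "s4"],
                        interventions=[(1, "brake", "keep_lane")], L=3)
print(demo)   # pairs at s1, s2, s3
```

Larger L covers more risky future states but risks mislabeled pairs, which is exactly the trade-off the paper's theoretical analysis bounds.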
[430] MIRA: Towards Mitigating Reward Hacking in Inference-Time Alignment of T2I Diffusion Models
Kevin Zhai, Utsav Singh, Anirudh Thatipelli, Souradip Chakraborty, Anit Kumar Sahu, Furong Huang, Amrit Singh Bedi, Mubarak Shah
Main category: cs.LG
TL;DR: MIRA is a training-free inference-time alignment method that prevents reward hacking in diffusion models by using image-space KL regularization, achieving better reward optimization while preserving prompt adherence compared to existing methods.
Details
Motivation: Diffusion models often fail to satisfy user-specific criteria measured by scalar rewards, and existing inference-time alignment methods suffer from reward hacking where images score highly but deviate from the original prompt.Method: MIRA introduces an image-space, score-based KL surrogate that regularizes the sampling trajectory with a frozen backbone, constraining the output distribution to prevent off-distribution drift while increasing rewards.
Result: Across SDv1.5 and SDXL, multiple rewards (Aesthetic, HPSv2, PickScore), and public datasets, MIRA achieves >60% win rate vs. strong baselines while preserving prompt adherence, with reward gains and near-zero drift.
Conclusion: MIRA effectively mitigates reward hacking in diffusion models through image-space constraints, and MIRA-DPO extends this approach to non-differentiable rewards without fine-tuning.
Abstract: Diffusion models excel at generating images conditioned on text prompts, but the resulting images often do not satisfy user-specific criteria measured by scalar rewards such as Aesthetic Scores. This alignment typically requires fine-tuning, which is computationally demanding. Recently, inference-time alignment via noise optimization has emerged as an efficient alternative, modifying initial input noise to steer the diffusion denoising process towards generating high-reward images. However, this approach suffers from reward hacking, where the model produces images that score highly, yet deviate significantly from the original prompt. We show that noise-space regularization is insufficient and that preventing reward hacking requires an explicit image-space constraint. To this end, we propose MIRA (MItigating Reward hAcking), a training-free, inference-time alignment method. MIRA introduces an image-space, score-based KL surrogate that regularizes the sampling trajectory with a frozen backbone, constraining the output distribution so reward can increase without off-distribution drift (reward hacking). We derive a tractable approximation to KL using diffusion scores. Across SDv1.5 and SDXL, multiple rewards (Aesthetic, HPSv2, PickScore), and public datasets (e.g., Animal-Animal, HPDv2), MIRA achieves >60% win rate vs. strong baselines while preserving prompt adherence; mechanism plots show reward gains with near-zero drift, whereas DNO drifts as compute increases. We further introduce MIRA-DPO, mapping preference optimization to inference time with a frozen backbone, extending MIRA to non-differentiable rewards without fine-tuning.
[431] Rethinking KL Regularization in RLHF: From Value Estimation to Gradient Optimization
Kezhao Liu, Jason Klein Liu, Mingtao Chen, Yiming Liu
Main category: cs.LG
TL;DR: The paper establishes a unified framework showing that ‘$k_1$ in reward’ (PPO-style) and ‘$k_2$ as loss’ are gradient-equivalent implementations of Reverse KL regularization, while ‘$k_3$ as loss’ (GRPO-style) is a biased approximation.
Details
Motivation: To address the inconsistent implementation of KL divergence loss in RLHF methods, where some approaches treat it as a detached coefficient while others use it as a direct loss function, leading to potential theoretical issues.Method: Developed a unified framework connecting different KL regularization implementations, analyzed gradient equivalence between ‘$k_1$ in reward’ and ‘$k_2$ as loss’, and identified biases in ‘$k_3$ as loss’ and off-policy implementations.
Result: Proved that ‘$k_1$ in reward’ and ‘$k_2$ as loss’ are theoretically sound implementations of RKL regularization, while ‘$k_3$ as loss’ is a biased first-order approximation. Also identified bias in off-policy implementations due to neglected importance sampling.
Conclusion: Provides a comprehensive gradient-based rationale for choosing and correctly implementing KL regularization in RLHF, enabling more robust and effective systems through principled loss formulations.
Abstract: Reinforcement Learning from Human Feedback (RLHF) leverages a Kullback-Leibler (KL) divergence loss to stabilize training and prevent overfitting. However, in methods such as GRPO, its implementation may be guided by principles from numerical value estimation-a practice that overlooks the term’s functional role as an optimization loss. To analyze this issue, we establish a unified framework that connects two seemingly distinct implementation styles: using the mathematical term $k_n$ as a detached coefficient for the policy’s score function (’$k_n$ in reward’) or as a direct loss function through which gradients are propagated (’$k_n$ as loss’). We show that the latter can always be analyzed via an equivalent gradient coefficient in the former, unifying the two perspectives. Through this framework, we prove that the conventional ‘$k_1$ in reward’ (like in PPO) is the principled loss for Reverse KL (RKL) regularization. We further establish a key finding: under on-policy conditions, the ‘$k_2$ as loss’ formulation is, in fact, gradient-equivalent to ‘$k_1$ in reward’. This equivalence, first proven in our work, identifies both as the theoretically sound implementations of the RKL objective. In contrast, we show that the recently adopted ‘$k_3$ as loss’ (like in GRPO) is merely a first-order, biased approximation of the principled loss. Furthermore, we argue that common off-policy implementations of ‘$k_n$ as loss’ methods are biased due to neglected importance sampling, and we propose a principled correction. Our findings provide a comprehensive, gradient-based rationale for choosing and correctly implementing KL regularization, paving the way for more robust and effective RLHF systems.
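For reference, the three estimators under discussion, in PyTorch. The definitions below follow the standard Monte Carlo KL estimators for samples drawn from the current policy, with log r = log pi_ref - log pi; the paper's argument concerns the gradients these terms produce when used as losses, which this helper does not settle.

```python
import torch

def kl_estimators(logp_policy, logp_ref):
    """Per-sample estimators of KL(pi || pi_ref) for x ~ pi:
      k1 = -log_r            unbiased value estimate, high variance
      k2 = (log_r)^2 / 2     biased value estimate, low variance
      k3 = (r - 1) - log_r   unbiased and always non-negative
    """
    log_r = logp_ref - logp_policy
    k1 = -log_r
    k2 = 0.5 * log_r.pow(2)
    k3 = torch.expm1(log_r) - log_r   # exp(log_r) - 1 - log_r, numerically stable
    return k1, k2, k3
```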
[432] Large-Scale Bayesian Causal Discovery with Interventional Data
Seong Woo Han, Daniel Duy Vo, Brielin C. Brown
Main category: cs.LG
TL;DR: IBCD is an empirical Bayesian framework for causal discovery using interventional data that models total causal effects with matrix normal distribution, uses spike-and-slab horseshoe priors, and provides uncertainty-aware inference.
Details
Motivation: Existing causal discovery methods perform poorly on large-scale tasks and fail to quantify uncertainty, despite advancements from genomic perturbation screens.Method: Models likelihood of total causal effects matrix using matrix normal distribution, places spike-and-slab horseshoe prior on edges, learns data-driven weights for scale-free and Erdős-Rényi structures from observational data, and treats edges as latent variables.
Result: Achieves superior structure recovery compared to existing baselines in simulations, and successfully applied to CRISPR perturbation data on 521 genes with robust graph structures identified via edge posterior inclusion probabilities.
Conclusion: IBCD provides an effective uncertainty-aware framework for large-scale causal discovery with interventional data, enabling robust identification of causal relationships.
Abstract: Inferring the causal relationships among a set of variables in the form of a directed acyclic graph (DAG) is an important but notoriously challenging problem. Recently, advancements in high-throughput genomic perturbation screens have inspired development of methods that leverage interventional data to improve model identification. However, existing methods still suffer poor performance on large-scale tasks and fail to quantify uncertainty. Here, we propose Interventional Bayesian Causal Discovery (IBCD), an empirical Bayesian framework for causal discovery with interventional data. Our approach models the likelihood of the matrix of total causal effects, which can be approximated by a matrix normal distribution, rather than the full data matrix. We place a spike-and-slab horseshoe prior on the edges and separately learn data-driven weights for scale-free and Erdős-Rényi structures from observational data, treating each edge as a latent variable to enable uncertainty-aware inference. Through extensive simulation, we show that IBCD achieves superior structure recovery compared to existing baselines. We apply IBCD to CRISPR perturbation (Perturb-seq) data on 521 genes, demonstrating that edge posterior inclusion probabilities enable identification of robust graph structures.
[433] TetriServe: Efficient DiT Serving for Heterogeneous Image Generation
Runyu Lu, Shiqi He, Wenxuan Tan, Shenggui Li, Ruofan Wu, Jeff J. Ma, Ang Chen, Mosharaf Chowdhury
Main category: cs.LG
TL;DR: TetriServe introduces step-level sequence parallelism for DiT serving, dynamically adjusting parallelism per request deadline to improve SLO attainment by up to 32% compared to existing systems.
Details
Motivation: Existing DiT serving systems use fixed parallelism, which is inefficient for heterogeneous workloads with mixed resolutions and deadlines, leading to poor GPU utilization and low SLO attainment.Method: TetriServe implements step-level sequence parallelism with round-based scheduling: discretizing time into fixed rounds, adapting parallelism at step level to minimize GPU hours, and jointly packing requests to minimize late completions.
Result: Extensive evaluation shows TetriServe achieves up to 32% higher SLO attainment compared to existing solutions without degrading image quality.
Conclusion: Step-level sequence parallelism with dynamic adjustment based on deadlines enables highly efficient DiT serving with significantly improved SLO performance.
Abstract: Diffusion Transformer (DiT) models excel at generating high-quality images through iterative denoising steps, but serving them under strict Service Level Objectives (SLOs) is challenging due to their high computational cost, particularly at large resolutions. Existing serving systems use fixed-degree sequence parallelism, which is inefficient for heterogeneous workloads with mixed resolutions and deadlines, leading to poor GPU utilization and low SLO attainment. In this paper, we propose step-level sequence parallelism to dynamically adjust the parallel degree of individual requests according to their deadlines. We present TetriServe, a DiT serving system that implements this strategy for highly efficient image generation. Specifically, TetriServe introduces a novel round-based scheduling mechanism that improves SLO attainment: (1) discretizing time into fixed rounds to make deadline-aware scheduling tractable, (2) adapting parallelism at the step level to minimize GPU-hour consumption, and (3) jointly packing requests to minimize late completions. Extensive evaluation on state-of-the-art DiT models shows that TetriServe achieves up to 32% higher SLO attainment compared to existing solutions without degrading image quality.
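A greedy toy version of deadline-aware, round-based degree selection under a linear-speedup assumption; field names, candidate degrees, and the earliest-deadline ordering are invented for illustration and do not reflect TetriServe's actual scheduler.

```python
from dataclasses import dataclass

@dataclass
class Req:
    id: int
    steps_left: int        # remaining denoising steps
    base_step_time: float  # seconds per step at parallel degree 1
    deadline: float        # rounds until the SLO deadline

def schedule_round(requests, total_gpus, round_len):
    """Give each request the smallest parallel degree that still meets its
    deadline (minimizing GPU-hours), packing by earliest deadline first."""
    plan, free = {}, total_gpus
    for req in sorted(requests, key=lambda r: r.deadline):
        for p in (1, 2, 4, 8):                 # candidate parallel degrees
            if p > free:
                continue
            steps_per_round = round_len * p / req.base_step_time  # ~linear speedup
            if req.steps_left / steps_per_round <= req.deadline:
                plan[req.id], free = p, free - p
                break
    return plan

print(schedule_round([Req(0, 50, 0.4, 2), Req(1, 30, 0.4, 5)],
                     total_gpus=8, round_len=5.0))   # {0: 2, 1: 1}
```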
[434] From Supervision to Exploration: What Does Protein Language Model Learn During Reinforcement Learning?
Hanqun Cao, Hongrui Zhang, Junde Xu, Zhou Zhang, Lingdong Shen, Minghao Sun, Ge Liu, Jinbo Xu, Wu-Jun Li, Jinren Ni, Cesar de la Fuente-Nunez, Tianfan Fu, Yejin Choi, Pheng-Ann Heng, Fang Wu
Main category: cs.LG
TL;DR: RL combined with PLMs improves protein design across multiple domains, with performance gains depending on task headroom, reward fidelity, and policy capacity.
Details
Motivation: To determine if reinforcement learning can push protein language models beyond their pretraining priors to uncover latent sequence-structure-function rules that supervised learning might miss.Method: Pairing RL with PLMs across four domains (antimicrobial peptide design, kinase variant optimization, antibody engineering, inverse folding) using diverse RL algorithms and model classes.
Result: RL consistently boosts success rates and sample efficiency across benchmarks. Performance follows a three-factor interaction: task headroom, reward fidelity, and policy capacity jointly determine gains.
Conclusion: Practical guidance for RL in protein design: prioritize reward modeling and calibration before scaling policy size, match algorithm and regularization strength to task difficulty, and allocate capacity where marginal gains are largest.
Abstract: Protein language models (PLMs) have advanced computational protein science through large-scale pretraining and scalable architectures. In parallel, reinforcement learning (RL) has broadened exploration and enabled precise multi-objective optimization in protein design. Yet whether RL can push PLMs beyond their pretraining priors to uncover latent sequence-structure-function rules remains unclear. We address this by pairing RL with PLMs across four domains: antimicrobial peptide design, kinase variant optimization, antibody engineering, and inverse folding. Using diverse RL algorithms and model classes, we ask if RL improves sampling efficiency and, more importantly, if it reveals capabilities not captured by supervised learning. Across benchmarks, RL consistently boosts success rates and sample efficiency. Performance follows a three-factor interaction: task headroom, reward fidelity, and policy capacity jointly determine gains. When rewards are accurate and informative, policies have sufficient capacity, and tasks leave room beyond supervised baselines, improvements scale; when rewards are noisy or capacity is constrained, gains saturate despite exploration. This view yields practical guidance for RL in protein design: prioritize reward modeling and calibration before scaling policy size, match algorithm and regularization strength to task difficulty, and allocate capacity where marginal gains are largest. Implementation is available at https://github.com/chq1155/RL-PLM.
[435] Gradient Shaping Beyond Clipping: A Functional Perspective on Update Magnitude Control
Haochen You, Baojing Liu
Main category: cs.LG
TL;DR: SPAMP is a unified framework that generalizes gradient clipping into smooth, per-layer gradient shaping using statistical tracking and power-based transformations.
Details
Motivation: Traditional gradient clipping uses hard, fixed thresholds that lack flexibility and ignore gradient distribution dynamics, limiting its effectiveness in stabilizing deep network training.Method: SPAMP tracks local gradient statistics, dynamically estimates thresholds, and applies power-based transformations to modulate update magnitudes in a differentiable manner, treating clipping and warmup as dual mechanisms for controlling effective update scale.
Result: Extensive experiments across image and language tasks show that SPAMP improves stability, convergence, and robustness over existing gradient clipping methods.
Conclusion: SPAMP provides a principled alternative to rigid heuristics by offering a unified framework for adaptive gradient shaping that generalizes and improves upon traditional clipping approaches.
Abstract: Gradient clipping is widely used to stabilize deep network training, but its formulation as a hard, fixed threshold limits flexibility and ignores gradient distribution dynamics. We propose SPAMP (Statistical Per-layer Adaptive Modulation and Projection), a unified framework that generalizes clipping into smooth, per-layer gradient shaping. SPAMP tracks local gradient statistics, dynamically estimates thresholds, and applies power-based transformations to modulate update magnitudes in a differentiable manner. This perspective recasts clipping and warmup as dual mechanisms for controlling the effective update scale $\eta_t \|g_t\|$, offering a principled alternative to rigid heuristics. Extensive experiments across image and language tasks demonstrate that SPAMP improves stability, convergence, and robustness over existing methods.
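A sketch of per-layer statistical shaping in the spirit of SPAMP: an EMA of each layer's gradient-norm statistics, a dynamic threshold, and a smooth power-law attenuation in place of a hard clip. The constants and update rules are illustrative, not the paper's settings; note that power = 1 recovers ordinary norm clipping.

```python
import torch

class GradientShaper:
    """Track per-layer gradient-norm mean/variance and attenuate smoothly."""
    def __init__(self, momentum=0.99, k=2.0, power=0.5):
        self.m, self.k, self.p = momentum, k, power
        self.stats = {}   # layer name -> (mean, var)

    def shape_(self, name, params):
        grads = [p.grad for p in params if p.grad is not None]
        if not grads:
            return
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads)).item()
        mean, var = self.stats.get(name, (norm, 0.0))
        mean = self.m * mean + (1 - self.m) * norm
        var = self.m * var + (1 - self.m) * (norm - mean) ** 2
        self.stats[name] = (mean, var)
        thresh = mean + self.k * var ** 0.5     # dynamic, statistics-driven threshold
        if norm > thresh > 0:
            scale = (thresh / norm) ** self.p   # smooth power-based attenuation
            for g in grads:
                g.mul_(scale)
```

Called per layer between `loss.backward()` and `optimizer.step()`, this replaces a single global clip threshold with one adapted to each layer's own gradient distribution.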
[436] Think Right: Learning to Mitigate Under-Over Thinking via Adaptive, Attentive Compression
Joykirat Singh, Justin Chih-Yao Chen, Archiki Prasad, Elias Stengel-Eskin, Akshay Nambi, Mohit Bansal
Main category: cs.LG
TL;DR: TRAAC is an RL method that adaptively allocates reasoning budget based on task difficulty, using self-attention to compress reasoning steps and avoid both underthinking and overthinking.
Details
Motivation: Current thinking models suffer from under-adaptivity - they either underthink (too short reasoning for hard problems) or overthink (excessively long reasoning even after solving), leading to inefficiency and errors.Method: TRAAC uses online post-training RL with self-attention over reasoning trajectories to identify important steps, prune redundant ones, and incorporate difficulty estimation into training rewards.
Result: TRAAC achieves 8.4% accuracy gain with 36.8% reasoning length reduction vs base model, and 7.9% accuracy gain with 29.4% length reduction vs best RL baseline across multiple tasks including math and non-math datasets.
Conclusion: TRAAC enables adaptive thinking by combining task-difficulty calibration and attention-based compression, showing strong generalization across diverse reasoning tasks.
Abstract: Recent thinking models solve complex reasoning tasks by scaling test-time compute, but this scaling must be allocated in line with task difficulty. On one hand, short reasoning (underthinking) leads to errors on harder problems that require extended reasoning steps; but, excessively long reasoning (overthinking) can be token-inefficient, generating unnecessary steps even after reaching a correct intermediate solution. We refer to this as under-adaptivity, where the model fails to modulate its response length appropriately given problems of varying difficulty. To address under-adaptivity and strike a balance between under- and overthinking, we propose TRAAC (Think Right with Adaptive, Attentive Compression), an online post-training RL method that leverages the model’s self-attention over a long reasoning trajectory to identify important steps and prune redundant ones. TRAAC also estimates difficulty and incorporates it into training rewards, thereby learning to allocate reasoning budget commensurate with example difficulty. Our approach improves accuracy, reduces reasoning steps, and enables adaptive thinking compared to base models and other RL baselines. Across a variety of tasks (AIME, AMC, GPQA-D, BBEH), TRAAC (Qwen3-4B) achieves an average absolute accuracy gain of 8.4% with a relative reduction in reasoning length of 36.8% compared to the base model, and a 7.9% accuracy gain paired with a 29.4% length drop compared to the best RL baseline. TRAAC also shows strong generalization: although our models are trained on math datasets, they show accuracy and efficiency gains on out-of-distribution non-math datasets like GPQA-D, BBEH, and OptimalThinkingBench. Our analysis further verifies that TRAAC provides fine-grained adjustments to thinking budget based on difficulty and that a combination of task-difficulty calibration and attention-based compression yields gains across diverse tasks.
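A hedged sketch of a difficulty-calibrated reward with the general shape the abstract describes: correctness minus a penalty on tokens spent beyond a difficulty-scaled budget. The constants and scaling are invented, and the attention-based compression of reasoning steps is not shown.

```python
def difficulty_calibrated_reward(correct, n_tokens, difficulty,
                                 budget=4096, alpha=0.5):
    """Reward correct answers, but penalize token usage beyond a budget that
    grows with estimated difficulty (difficulty in (0, 1]), so easy problems
    favor short traces while hard ones keep room to think."""
    allowed = budget * difficulty
    overuse = max(0.0, (n_tokens - allowed) / budget)
    return float(correct) - alpha * overuse

print(difficulty_calibrated_reward(True, 1000, 0.2))   # 1.0 - 0.5 * ~0.044
print(difficulty_calibrated_reward(True, 1000, 0.9))   # 1.0, within budget
```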
[437] Enhancing Noise Robustness of Parkinson’s Disease Telemonitoring via Contrastive Feature Augmentation
Ziming Tang, Chengbin Hou, Tianyu Zhang, Bangxu Tian, Jinbao Wang, Hairong Lv
Main category: cs.LG
TL;DR: NoRo is a noise-robust framework that enhances UPDRS prediction for Parkinson’s disease telemonitoring by using contrastive learning to generate robust features from speech data, addressing measurement inaccuracies, environmental noise, and data transmission issues.
Details
Motivation: Parkinson's disease telemonitoring faces challenges from three types of noise during at-home UPDRS assessments: patient-induced inaccuracies, environmental noise, and data packet loss, which increase prediction errors.
Method: NoRo groups speech features into ordered bins based on continuous feature values to create contrastive pairs, trains a multilayer perceptron encoder for noise-robust features, and concatenates these with original features as augmented input for UPDRS prediction models.
Result: Extensive experiments with customizable noise injection show that NoRo successfully enhances noise robustness of UPDRS prediction across various downstream models under different noisy environments.
Conclusion: The proposed NoRo framework effectively addresses noise challenges in PD telemonitoring by generating robust features through contrastive learning, improving the reliability of at-home UPDRS assessments.
Abstract: Parkinson’s disease (PD) is one of the most common neurodegenerative disorders. PD telemonitoring emerges as a novel assessment modality enabling self-administered at-home tests of Unified Parkinson’s Disease Rating Scale (UPDRS) scores, enhancing accessibility for PD patients. However, three types of noise can occur during measurements: (1) patient-induced measurement inaccuracies, (2) environmental noise, and (3) data packet loss during transmission, resulting in higher prediction errors. To address these challenges, NoRo, a noise-robust UPDRS prediction framework, is proposed. First, the original speech features are grouped into ordered bins, based on the continuous values of a selected feature, to construct contrastive pairs. Second, the contrastive pairs are employed to train a multilayer perceptron encoder for generating noise-robust features. Finally, these features are concatenated with the original features as the augmented features, which are then fed into the UPDRS prediction models. Notably, we further introduce a novel evaluation approach with a customizable noise injection module, and extensive experiments show that NoRo can successfully enhance the noise robustness of UPDRS prediction across various downstream prediction models under different noisy environments.
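A minimal sketch of the ordered-bin pairing step: samples are sorted by one continuous feature and split into bins, same-bin samples form positives, and a distant bin supplies negatives. The anchor feature, bin count, and negative-bin choice are illustrative assumptions.

```python
import numpy as np

def make_contrastive_pairs(X, anchor_col=0, n_bins=10, rng=None):
    rng = np.random.default_rng(rng)
    # Order samples by the chosen continuous feature and split into bins.
    bins = np.array_split(np.argsort(X[:, anchor_col]), n_bins)
    positives, negatives = [], []
    for b, idx in enumerate(bins):
        if len(idx) < 2:
            continue
        i, j = rng.choice(idx, size=2, replace=False)
        positives.append((i, j))                  # same bin -> positive pair
        other = bins[(b + n_bins // 2) % n_bins]  # distant bin -> negative
        negatives.append((i, rng.choice(other)))
    return positives, negatives

X = np.random.randn(200, 16)  # 200 utterances, 16 speech features
pos, neg = make_contrastive_pairs(X)
print(len(pos), len(neg))  # 10 10
```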
[438] Securing generative artificial intelligence with parallel magnetic tunnel junction true randomness
Youwei Bao, Shuhan Yang, Hyunsoo Yang
Main category: cs.LG
TL;DR: The paper proposes using hardware-generated true random bits from spin-transfer torque magnetic tunnel junctions (STT-MTJs) to replace deterministic pseudo random number generators in generative AI models, achieving improved security with minimal overhead.
Details
Motivation: Deterministic PRNGs in generative AI models create predictable patterns vulnerable to attacks, and conventional defenses have significant energy and latency overhead.
Method: Embed hardware-generated true random bits from STT-MTJs in a highly parallel FPGA-assisted computing system, integrating these random bits into generative adversarial networks.
Result: The system delivers megabit-per-second true random numbers passing NIST tests, reduces insecure outputs by up to 18.6 times compared to low-quality RNG baseline, and can potentially scale to gigabit-per-second throughput for large language models.
Conclusion: STT-MTJ-based spintronic RNGs are practical security components for next-generation generative AI systems, offering nanosecond switching speed, high energy efficiency, and established scalability.
Abstract: Deterministic pseudo random number generators (PRNGs) used in generative artificial intelligence (GAI) models produce predictable patterns vulnerable to exploitation by attackers. Conventional defences against the vulnerabilities often come with significant energy and latency overhead. Here, we embed hardware-generated true random bits from spin-transfer torque magnetic tunnel junctions (STT-MTJs) to address the challenges. A highly parallel, FPGA-assisted prototype computing system delivers megabit-per-second true random numbers, passing NIST randomness tests after in-situ operations with minimal overhead. Integrating the hardware random bits into a generative adversarial network (GAN) trained on CIFAR-10 reduces insecure outputs by up to 18.6 times compared to a low-quality random number generator (RNG) baseline. With nanosecond switching speed, high energy efficiency, and established scalability, our STT-MTJ-based system holds the potential to scale beyond $10^6$ parallel cells, achieving gigabit-per-second throughput suitable for large language model sampling. This advancement highlights spintronic RNGs as practical security components for next-generation GAI systems.
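The integration pattern can be sketched as sampling generator latents from an external entropy source instead of the framework PRNG. In the sketch below, `os.urandom` stands in for the STT-MTJ bitstream, and the Box-Muller mapping from raw bits to Gaussian latents is an illustrative assumption.

```python
import os
import numpy as np

def true_random_latents(batch, dim, entropy=os.urandom):
    """Map raw entropy bytes to Gaussian GAN latents via Box-Muller."""
    n = batch * dim
    raw = np.frombuffer(entropy(2 * n * 4), dtype=np.uint32)
    u = (raw.astype(np.float64) + 0.5) / 2**32          # uniforms in (0, 1)
    u1, u2 = u[:n], u[n:]
    z = np.sqrt(-2 * np.log(u1)) * np.cos(2 * np.pi * u2)
    return z.astype(np.float32).reshape(batch, dim)

z = true_random_latents(8, 128)  # would be fed to the generator as G(z)
print(z.shape, round(float(z.std()), 2))
```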
[439] Posterior Collapse as a Phase Transition in Variational Autoencoders
Zhen Li, Fan Zhang, Zheng Zhang, Yu Chen
Main category: cs.LG
TL;DR: Posterior collapse in VAEs is a phase transition governed by data structure and hyper-parameters, characterized by a critical threshold in KL divergence between posterior and prior.
Details
Motivation: To understand posterior collapse in VAEs from a statistical physics perspective, revealing it as a phase transition rather than just optimization failure.
Method: Analyzed stability of trivial solution associated with posterior collapse, identified critical hyper-parameter threshold, and validated on synthetic and real datasets.
Result: Identified a critical boundary separating meaningful latent inference from collapse, characterized by discontinuity in KL divergence, confirming phase transition behavior.
Conclusion: Posterior collapse is an emerging phase transition from interplay between data structure and variational constraints, offering new insights into deep generative model trainability.
Abstract: We investigate the phenomenon of posterior collapse in variational autoencoders (VAEs) from the perspective of statistical physics, and reveal that it constitutes a phase transition governed jointly by data structure and model hyper-parameters. By analyzing the stability of the trivial solution associated with posterior collapse, we identify a critical hyper-parameter threshold. This critical boundary, separating meaningful latent inference from collapse, is characterized by a discontinuity in the KL divergence between the approximate posterior and the prior distribution. We validate this critical behavior on both synthetic and real-world datasets, confirming the existence of a phase transition. Our results demonstrate that posterior collapse is not merely an optimization failure, but rather an emerging phase transition arising from the interplay between data structure and variational constraints. This perspective offers new insights into the trainability and representational capacity of deep generative models.
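Since the transition is characterized by a discontinuity in the KL term, a practical collapse diagnostic is to monitor the per-dimension KL divergence between the Gaussian approximate posterior and the standard normal prior, as in the sketch below; the near-zero threshold is an illustrative assumption.

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """KL( N(mu, exp(log_var)) || N(0, 1) ), elementwise per dimension."""
    return 0.5 * (np.exp(log_var) + mu**2 - 1.0 - log_var)

def collapsed_dims(mu_batch, log_var_batch, eps=1e-2):
    kl_per_dim = gaussian_kl(mu_batch, log_var_batch).mean(axis=0)
    return np.where(kl_per_dim < eps)[0], kl_per_dim

# Simulate an encoder whose dims 3 and 7 exactly match the prior (collapsed).
rng = np.random.default_rng(0)
mu = rng.normal(size=(512, 8))
lv = rng.normal(scale=0.3, size=(512, 8))
mu[:, [3, 7]] = 0.0
lv[:, [3, 7]] = 0.0
dead, kl = collapsed_dims(mu, lv)
print(dead)  # [3 7]
```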
[440] Quagmires in SFT-RL Post-Training: When High SFT Scores Mislead and What to Use Instead
Feiyang Kang, Michael Kuchnik, Karthik Padthe, Marin Vlastelica, Ruoxi Jia, Carole-Jean Wu, Newsha Ardalani
Main category: cs.LG
TL;DR: High SFT scores don’t reliably predict RL performance gains; generalization loss and Pass@large k are better proxies for RL outcomes.
Details
Motivation: Challenge the assumption that high SFT scores translate to improved RL performance, as current practice trains LLMs in two independent stages (SFT then RL).
Method: Trained hundreds of models up to 12B parameters with SFT and RLVR via GRPO, evaluated on 7 math benchmarks with up to 256 repetitions using >1M GPU hours.
Result: Found that RL training on models with improved SFT performance can lead to worse outcomes than RL on base models without SFT. Generalization loss and Pass@large k metrics improved prediction accuracy, raising the R² coefficient and Spearman's rank correlation by up to 0.5 (a 2x improvement).
Conclusion: Alternative metrics like generalization loss and Pass@large k provide better proxies for RL outcomes than pre-RL SFT scores, with practical implications for training strategies.
Abstract: In post-training for reasoning Large Language Models (LLMs), the current state of practice trains LLMs in two independent stages: Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR, shortened as ``RL'' below). In this work, we challenge whether high SFT scores translate to improved performance after RL. We provide extensive counter-examples where this is not true. We find high SFT scores can be biased toward simpler or more homogeneous data and are not reliably predictive of subsequent RL gains or scaled-up post-training effectiveness. In some cases, RL training on models with improved SFT performance could lead to substantially worse outcomes compared to RL on the base model without SFT. We study alternative metrics and identify generalization loss on held-out reasoning examples and Pass@large k performance as strong proxies for the RL outcome. We trained hundreds of models up to 12B parameters with SFT and RLVR via GRPO and ran extensive evaluations on 7 math benchmarks with up to 256 repetitions, spending $>$1M GPU hours. Experiments include models from Llama3, Mistral-Nemo, Qwen3 and multiple state-of-the-art SFT/RL datasets. Compared to directly predicting from pre-RL performance, prediction based on generalization loss and Pass@large k achieves substantially higher precision, improving the $R^2$ coefficient and Spearman’s rank correlation coefficient by up to 0.5 (2x). This provides strong utility for broad use cases. For example, in most experiments, we find SFT training on unique examples for one epoch underperforms training on half the examples for two epochs, either after SFT or SFT-then-RL. With the same SFT budget, training only on short examples may lead to better SFT performance, though it often leads to worse outcomes after RL compared to training on examples with varying lengths. The evaluation tool will be open-sourced.
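For reference, Pass@k with n samples per problem of which c are correct is computed with the standard unbiased estimator, as in the sketch below (the sample counts are illustrative):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:  # every size-k subset must contain a correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 256 samples per problem and 8 correct, large k nearly saturates:
print(round(pass_at_k(256, 8, 1), 4))   # 0.0312
print(round(pass_at_k(256, 8, 64), 4))  # ~0.9036
```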
[441] Demystifying Synthetic Data in LLM Pre-training: A Systematic Study of Scaling Laws, Benefits, and Pitfalls
Feiyang Kang, Newsha Ardalani, Michael Kuchnik, Youssef Emad, Mostafa Elhoushi, Shubhabrata Sengupta, Shang-Wen Li, Ramya Raghavendra, Ruoxi Jia, Carole-Jean Wu
Main category: cs.LG
TL;DR: Synthetic data alone isn’t better than natural web data for LLM pre-training, but mixing 1/3 rephrased synthetic with 2/3 natural data can speed up training 5-10x at larger data budgets. Textbook-style synthetic data performs poorly, especially at small budgets.
Details
Motivation: High-quality training data is limited for LLM scaling, and synthetic data offers a potential solution, but its effectiveness needs systematic evaluation.
Method: Large-scale empirical investigation (>1000 LLMs, >100k GPU hours) using unified protocol and scaling laws to compare natural web data, various synthetic types (rephrased text, generated textbooks), and mixtures.
Result: Rephrased synthetic data alone performs similarly to natural data; 1/3 rephrased synthetic + 2/3 natural speeds training 5-10x; textbook synthetic data causes higher loss; optimal synthetic ratio converges to ~30%; larger generators don’t necessarily yield better data.
Conclusion: Synthetic data has conditional benefits - rephrased synthetic shows no model collapse while textbook synthetic shows collapse patterns. Work provides practical guidance for synthetic data use in pre-training.
Abstract: Training data plays a crucial role in Large Language Model (LLM) scaling, yet high-quality data is of limited supply. Synthetic data techniques offer a potential path toward sidestepping these limitations. We conduct a large-scale empirical investigation (>1000 LLMs with >100k GPU hours) using a unified protocol and scaling laws, comparing natural web data, diverse synthetic types (rephrased text, generated textbooks), and mixtures of natural and synthetic data. Specifically, we found pre-training on rephrased synthetic data \textit{alone} is not faster than pre-training on natural web texts; while pre-training on 1/3 rephrased synthetic data mixed with 2/3 natural web texts can speed up 5-10x (to reach the same validation loss) at larger data budgets. Pre-training on textbook-style synthetic data \textit{alone} results in notably higher loss on many downstream domains, especially at small data budgets. “Good” ratios of synthetic data in training data mixtures depend on the model size and data budget, empirically converging to ~30% for rephrased synthetic data. Larger generator models do not necessarily yield better pre-training data than ~8B-param models. These results contribute mixed evidence on “model collapse” during large-scale single-round (n=1) model training on synthetic data: training on rephrased synthetic data shows no degradation in performance at foreseeable scales, whereas training on mixtures of textbook-style pure-generated synthetic data shows patterns predicted by “model collapse”. Our work demystifies synthetic data in pre-training, validates its conditional benefits, and offers practical guidance.
[442] CAT: Curvature-Adaptive Transformers for Geometry-Aware Learning
Ryan Y. Lin, Siddhartha Ojha, Nicholas Bai
Main category: cs.LG
TL;DR: CAT is a transformer architecture that dynamically routes tokens across Euclidean, hyperbolic, and spherical attention branches using a learnable gating mechanism, enabling adaptive geometric specialization for mixed-geometry data.
Details
Motivation: Standard transformers assume Euclidean geometry, which limits effectiveness on non-Euclidean data. Existing hyperbolic and spherical extensions require committing to a single geometry, reducing flexibility for data with mixed geometric properties.
Method: Introduces a curvature-adaptive transformer with three geometric attention branches (Euclidean, hyperbolic, spherical) and a lightweight differentiable gating mechanism that learns per-token routing based on local relational structure.
Result: On knowledge graph completion benchmarks (FB15k-237, WN18RR), CAT achieves ~10% improvements in MRR and Hits@10 over fixed-geometry baselines with minimal overhead (5% parameter increase, comparable inference time).
Conclusion: Learned geometric adaptation outperforms any single fixed geometry for complex relational reasoning, establishing CAT as a scalable and interpretable foundation for mixture-of-geometry architectures across domains.
Abstract: Transformers achieve strong performance across diverse domains but implicitly assume Euclidean geometry in their attention mechanisms, limiting their effectiveness on data with non-Euclidean structure. While recent extensions to hyperbolic and spherical spaces show promise for hierarchical and cyclical patterns, respectively, they require committing to a single geometry a priori, reducing flexibility when data exhibits mixed geometric properties. We introduce the Curvature-Adaptive Transformer (CAT), a novel architecture that dynamically learns per-token routing across three geometric attention branches through a lightweight, differentiable gating mechanism. Unlike fixed-geometry approaches, CAT enables adaptive geometric specialization, routing tokens to the appropriate curvature based on their local relational structure. The routing network provides interpretable curvature preferences while each branch employs geometry-specific operations optimized for its respective manifold. On knowledge graph completion benchmarks (FB15k-237, WN18RR), CAT achieves approximately 10% improvements in MRR and Hits@10 over fixed-geometry baselines with minimal overhead (5% parameter increase, comparable inference time). These results demonstrate that learned geometric adaptation outperforms any single fixed geometry for complex relational reasoning, establishing CAT as a scalable and interpretable foundation for mixture-of-geometry architectures across language, vision, and multimodal domains.
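The gating pattern can be sketched as follows. The three geometry-specific branches are stubbed with ordinary multi-head attention, since only the differentiable per-token routing is illustrated; all sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CurvatureGatedAttention(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        # Stand-ins for the Euclidean, hyperbolic, spherical branches.
        self.branches = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for _ in range(3)
        )
        self.gate = nn.Linear(d_model, 3)  # lightweight per-token router

    def forward(self, x):
        outs = torch.stack([attn(x, x, x)[0] for attn in self.branches], -1)
        w = torch.softmax(self.gate(x), dim=-1)  # (B, T, 3), sums to 1
        return (outs * w.unsqueeze(-2)).sum(-1)  # per-token soft mixture

x = torch.randn(2, 16, 64)
y = CurvatureGatedAttention()(x)
print(y.shape)  # torch.Size([2, 16, 64])
```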
[443] Detecting Post-generation Edits to Watermarked LLM Outputs via Combinatorial Watermarking
Liyan Xie, Muhammad Siddeek, Mohamed Seif, Andrea J. Goldsmith, Mengdi Wang
Main category: cs.LG
TL;DR: A combinatorial pattern-based watermarking framework for detecting and localizing post-generation edits in LLM outputs, using vocabulary partitioning and statistical analysis.
Details
Motivation: Real-world LLM-generated content often undergoes post-generation edits (human revisions or spoofing attacks), making it critical to detect and localize such modifications in watermarked text.
Method: Partitions vocabulary into disjoint subsets and embeds watermark by enforcing deterministic combinatorial patterns during generation. Uses global statistics for detection and lightweight local statistics for edit localization.
Result: Strong empirical performance in edit localization across various editing scenarios on open-source LLMs, with evaluation using Type-I error rate and detection accuracy metrics.
Conclusion: The proposed combinatorial watermarking framework effectively addresses the challenge of detecting and localizing post-generation edits in watermarked LLM outputs.
Abstract: Watermarking has become a key technique for proprietary language models, enabling the distinction between AI-generated and human-written text. However, in many real-world scenarios, LLM-generated content may undergo post-generation edits, such as human revisions or even spoofing attacks, making it critical to detect and localize such modifications. In this work, we introduce a new task: detecting post-generation edits locally made to watermarked LLM outputs. To this end, we propose a combinatorial pattern-based watermarking framework, which partitions the vocabulary into disjoint subsets and embeds the watermark by enforcing a deterministic combinatorial pattern over these subsets during generation. We accompany the combinatorial watermark with a global statistic that can be used to detect the watermark. Furthermore, we design lightweight local statistics to flag and localize potential edits. We introduce two task-specific evaluation metrics, Type-I error rate and detection accuracy, and evaluate our method on open-source LLMs across a variety of editing scenarios, demonstrating strong empirical performance in edit localization.
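A toy sketch of the detection idea: if the vocabulary partition forces subset labels to follow a deterministic cycle during generation, windows where the cycle breaks localize edits. The modulo partition, the cyclic pattern, and the window size below are illustrative assumptions, not the paper's construction.

```python
def subset_label(token_id: int, m: int = 4) -> int:
    return token_id % m  # stand-in for a learned vocabulary partition

def localize_edits(token_ids, m=4, window=8, max_violations=1):
    labels = [subset_label(t, m) for t in token_ids]
    flagged = []
    for s in range(0, len(labels) - window + 1):
        # Under the watermark, labels should advance by 1 (mod m) each step.
        v = sum((labels[i + 1] - labels[i]) % m != 1
                for i in range(s, s + window - 1))
        if v > max_violations:
            flagged.append(s)
    return flagged

# Watermarked sequence: token ids whose subset labels cycle 0,1,2,3,0,...
tokens = list(range(32))
tokens[12:15] = [97, 42, 7]  # simulate a local human edit
print(localize_edits(tokens))  # [8, 9, 10, 11]: windows overlapping the edit
```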
[444] Support Basis: Fast Attention Beyond Bounded Entries
Maryam Aliakbarpour, Vladimir Braverman, Junze Yin, Haochen Zhang
Main category: cs.LG
TL;DR: The paper introduces support-basis decomposition for efficient attention approximation in LLMs, overcoming the bounded-entry limitation of previous methods by leveraging sub-Gaussian properties of query/key matrices.
Details
Motivation: The quadratic complexity of softmax attention limits LLM scaling, and existing sub-quadratic methods such as Alman and Song's operate under restrictive bounded-entry assumptions that rarely hold in practice.
Method: Proposes support-basis decomposition that splits query/key entries into large and small components using sub-Gaussian properties, enabling exact computation on sparse parts and polynomial approximation on dense parts, with multi-threshold extensions.
Result: Achieves sub-quadratic runtime with rigorous theoretical guarantees, extends to eliminate distributional assumptions, and provides theoretical justification for polynomial attention methods.
Conclusion: The framework enables practical sub-quadratic attention approximation for modern LLMs without restrictive assumptions, bridging theory and empirical success of polynomial attention methods.
Abstract: The quadratic complexity of softmax attention remains a central bottleneck in scaling large language models (LLMs). [Alman and Song, NeurIPS 2023] proposed a sub-quadratic attention approximation algorithm, but it works only under the restrictive bounded-entry assumption. Since this assumption rarely holds in practice, its applicability to modern LLMs is limited. In this paper, we introduce support-basis decomposition, a new framework for efficient attention approximation beyond bounded entries. We empirically demonstrate that the entries of the query and key matrices exhibit sub-Gaussian behavior. Our approach uses this property to split large and small entries, enabling exact computation on sparse components and polynomial approximation on dense components. We establish rigorous theoretical guarantees, proving a sub-quadratic runtime, and extend the method to a multi-threshold setting that eliminates all distributional assumptions. Furthermore, we provide the first theoretical justification for the empirical success of polynomial attention [Kacham, Mirrokni, and Zhong, ICML 2024], showing that softmax attention can be closely approximated by a combination of multiple polynomial attentions with sketching.
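The core split can be sketched as follows: thresholding Q and K separates a sparse large-entry part, handled exactly, from a dense small-entry remainder, which is the target of the paper's polynomial approximation (not reproduced here). The threshold value is an illustrative assumption; sub-Gaussian tails are what keep the large part sparse.

```python
import numpy as np

def split_by_threshold(M, tau):
    large = np.where(np.abs(M) > tau, M, 0.0)  # sparse, handled exactly
    return large, M - large                    # dense small-entry residue

rng = np.random.default_rng(0)
Q, K = rng.normal(size=(64, 32)), rng.normal(size=(64, 32))
tau = 2.0  # sub-Gaussian tails make the large part very sparse
Q_l, Q_s = split_by_threshold(Q, tau)
K_l, K_s = split_by_threshold(K, tau)
print("sparsity of large part:", (Q_l != 0).mean())  # ~0.05
# Exact scores decompose into four blocks; only small x small is dense.
S = Q_l @ K_l.T + Q_l @ K_s.T + Q_s @ K_l.T + Q_s @ K_s.T
print(np.allclose(S, Q @ K.T))  # True
```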
[445] Source-Free Cross-Domain Continual Learning
Muhammad Tanzil Furqon, Mahardhika Pratama, Igor Škrjanc, Lin Liu, Habibullah Habibullah, Kutluyil Dogancay
Main category: cs.LG
TL;DR: REFEREE addresses source-free cross-domain continual learning by combining a source-pre-trained model with a vision-language model, using frequency-aware prompting and uncertainty-aware weighting to handle domain shifts and noisy pseudo labels, while preventing catastrophic forgetting with kernel linear discriminant analysis.
Details
Motivation: Existing cross-domain continual learning methods require fully labeled source domains, which is impractical in privacy-constrained environments. This paper tackles the more challenging problem of source-free cross-domain continual learning where source-domain samples are completely prohibited.
Method: REFEREE combines a source-pre-trained model with a large-scale vision-language model. It uses frequency-aware prompting to handle domain shifts, uncertainty-aware weighting to mitigate noisy pseudo labels, and kernel linear discriminant analysis (KLDA) to prevent catastrophic forgetting while keeping the backbone network frozen.
Result: The approach significantly outperforms prior methods that have access to source domain samples, demonstrating its effectiveness in source-free cross-domain continual learning scenarios.
Conclusion: REFEREE successfully addresses the challenges of source-free cross-domain continual learning through its novel combination of frequency-aware prompting, uncertainty-aware weighting, and KLDA, achieving state-of-the-art performance without requiring source domain samples.
Abstract: Although existing cross-domain continual learning approaches successfully address many streaming tasks having domain shifts, they call for a fully labeled source domain, hindering their feasibility in privacy-constrained environments. This paper goes one step further, to the problem of source-free cross-domain continual learning where the use of source-domain samples is completely prohibited. We propose the idea of rehearsal-free frequency-aware dynamic prompt collaborations (REFEREE) to cope with the absence of labeled source-domain samples in the realm of cross-domain continual learning. REFEREE is built upon a synergy between a source-pre-trained model and a large-scale vision-language model, thus overcoming the problem of sub-optimal generalizations when relying only on a source pre-trained model. The domain shift problem between the source domain and the target domain is handled by a frequency-aware prompting technique encouraging low-frequency components while suppressing high-frequency components. This strategy generates frequency-aware augmented samples, robust against noisy pseudo labels. The noisy pseudo-label problem is further addressed with an uncertainty-aware weighting strategy where the mean and covariance matrix are weighted by prediction uncertainties, thus mitigating the adverse effects of noisy pseudo labels. Besides, the issue of catastrophic forgetting (CF) is overcome by kernel linear discriminant analysis (KLDA), where the backbone network is frozen while classification is performed using linear discriminant analysis guided by the random kernel method. Our rigorous numerical studies confirm the advantage of our approach, which beats prior arts that have access to source-domain samples by significant margins.
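A minimal sketch of frequency-aware augmentation in this spirit: keep low-frequency image content and damp high frequencies before trusting a pseudo-labeled sample. The cutoff radius and damping factor are illustrative assumptions.

```python
import numpy as np

def low_frequency_augment(img, cutoff=0.15, damp=0.2):
    """Pass low spatial frequencies, suppress high ones, via a 2D FFT mask."""
    f = np.fft.fftshift(np.fft.fft2(img, axes=(0, 1)), axes=(0, 1))
    h, w = img.shape[:2]
    yy, xx = np.ogrid[:h, :w]
    r = np.hypot((yy - h / 2) / h, (xx - w / 2) / w)  # normalized radius
    mask = np.where(r <= cutoff, 1.0, damp)
    if img.ndim == 3:
        mask = mask[..., None]  # broadcast over channels
    out = np.fft.ifft2(np.fft.ifftshift(f * mask, axes=(0, 1)), axes=(0, 1))
    return out.real

img = np.random.rand(224, 224, 3)
aug = low_frequency_augment(img)
print(aug.shape)  # (224, 224, 3): smoothed, low-frequency-dominated sample
```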
[446] The Unseen Frontier: Pushing the Limits of LLM Sparsity with Surrogate-Free ADMM
Kwanhee Lee, Hyeondo Jang, Dongyeop Lee, Dan Alistarh, Namhoon Lee
Main category: cs.LG
TL;DR: Elsa is a novel neural network pruning method that achieves extreme sparsity (up to 90%) in large language models while maintaining high accuracy, overcoming limitations of conventional methods that plateau at 50-60% sparsity.
Details
Motivation: Current neural network pruning methods for LLMs face diminishing returns and cannot surpass moderate sparsity levels (50-60%) without significant accuracy degradation, creating a need for more effective approaches.
Method: Elsa uses principled constrained optimization techniques based on ADMM to directly address limitations in current practice, avoiding reliance on surrogate objective formulations. It also includes a quantized variant (Elsa-L) for scaling to extremely large models.
Result: Elsa achieves substantial improvements over existing methods, including 7.8× less perplexity than the best existing method on LLaMA-2-7B at 90% sparsity, and scales to 27B models with theoretical convergence guarantees.
Conclusion: The work represents meaningful progress in advancing LLM sparsity frontiers and suggests significant opportunities remain in unexplored directions for further advancement.
Abstract: Neural network pruning is a promising technique to mitigate the excessive computational and memory requirements of large language models (LLMs). Despite its promise, however, progress in this area has diminished, as conventional methods are seemingly unable to surpass moderate sparsity levels (50-60%) without severely degrading model accuracy. This work breaks through the current impasse, presenting a principled and effective method called $\texttt{Elsa}$, which achieves extreme sparsity levels of up to 90% while retaining high model fidelity. This is done by identifying several limitations in current practice, all of which can be traced back to their reliance on a surrogate objective formulation. $\texttt{Elsa}$ tackles this issue directly and effectively via standard and well-established constrained optimization techniques based on ADMM. Our extensive experiments across a wide range of models and scales show that $\texttt{Elsa}$ achieves substantial improvements over existing methods; e.g., it achieves 7.8$\times$ less perplexity than the best existing method on LLaMA-2-7B at 90% sparsity. Furthermore, we present $\texttt{Elsa}_{\text{-L}}$, a quantized variant that scales to extremely large models (27B), and establish its theoretical convergence guarantees. These results highlight meaningful progress in advancing the frontier of LLM sparsity, while promising that significant opportunities for further advancement may remain in directions that have so far attracted limited exploration.
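A minimal sketch of the ADMM template for sparsity-constrained layer reconstruction: split the weights as W = Z, alternate a ridge-style solve with projection onto the top-k support, and update the dual variable. The reconstruction objective, rho, the iteration count, and the sizes are illustrative assumptions; the paper's surrogate-free formulation and quantized variant go further.

```python
import numpy as np

def admm_prune(X, Y, sparsity=0.9, rho=1.0, iters=50):
    d, m = X.shape[1], Y.shape[1]
    keep = int((1 - sparsity) * d * m)
    A = X.T @ X + rho * np.eye(d)
    W = np.linalg.solve(X.T @ X + 1e-6 * np.eye(d), X.T @ Y)  # dense init
    Z, U = W.copy(), np.zeros_like(W)
    for _ in range(iters):
        W = np.linalg.solve(A, X.T @ Y + rho * (Z - U))  # ridge subproblem
        V = W + U
        Z = np.zeros_like(V)                             # projection onto
        idx = np.argsort(np.abs(V), axis=None)[-keep:]   # the top-k support
        Z.flat[idx] = V.flat[idx]
        U += W - Z                                       # dual update
    return Z  # sparse weights

rng = np.random.default_rng(0)
X, W0 = rng.normal(size=(256, 32)), rng.normal(size=(32, 16))
W_sparse = admm_prune(X, X @ W0, sparsity=0.9)
print((W_sparse != 0).mean())  # ~0.1, i.e. 90% sparsity
```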
[447] Asymmetric Proximal Policy Optimization: mini-critics boost LLM reasoning
Jiashun Liu, Johan Obando-Ceron, Han Lu, Yancheng He, Weixun Wang, Wenbo Su, Bo Zheng, Pablo Samuel Castro, Aaron Courville, Ling Pan
Main category: cs.LG
TL;DR: AsyPPO introduces lightweight mini-critics trained on disjoint prompt shards to restore critics in RL for LLMs, improving value estimation and policy updates while maintaining efficiency at scale.
Details
Motivation: Conventional value functions are computationally expensive for LLM-scale training and fail under sparse rewards and long reasoning horizons, leading recent methods to avoid explicit critics. AsyPPO aims to restore the critics' role while remaining efficient.
Method: AsyPPO uses a set of lightweight mini-critics trained on disjoint prompt shards to encourage diversity and reduce value-estimation bias. It leverages inter-critic uncertainty to mask advantages in states with critic agreement and filter high-divergence states from entropy regularization.
Result: After training on only 5,000 samples, AsyPPO consistently improves learning stability and performance across benchmarks, achieving over 6% gains on Qwen3-4b-Base and about 3% gains on Qwen3-8b-Base and Qwen3-14b-Base over classic PPO.
Conclusion: AsyPPO demonstrates the importance of architectural innovations for scalable, efficient RL algorithms for LLMs, showing that properly designed critics can significantly improve performance without additional computational tricks.
Abstract: Most recent RL for LLMs (RL4LLM) methods avoid explicit critics, replacing them with average advantage baselines. This shift is largely pragmatic: conventional value functions are computationally expensive to train at LLM scale and often fail under sparse rewards and long reasoning horizons. We revisit this bottleneck from an architectural perspective and introduce Asymmetric Proximal Policy Optimization (AsyPPO), a simple and scalable framework that restores the critic's role while remaining efficient in large-model settings. AsyPPO employs a set of lightweight mini-critics, each trained on disjoint prompt shards. This design encourages diversity while preserving calibration, reducing value-estimation bias. Beyond robust estimation, AsyPPO leverages inter-critic uncertainty to refine the policy update: (i) masking advantages in states where critics agree and gradients add little learning signal, and (ii) filtering high-divergence states from entropy regularization, suppressing spurious exploration. After training on open-source data with only 5,000 samples, AsyPPO consistently improves learning stability and performance across multiple benchmarks over strong baselines, such as GRPO, achieving performance gains of more than six percent on Qwen3-4b-Base and about three percent on Qwen3-8b-Base and Qwen3-14b-Base over classic PPO, without additional tricks. These results highlight the importance of architectural innovations for scalable, efficient algorithms.
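A toy sketch of the critic-agreement masking: average the mini-critic values for the baseline and zero out advantages in states where the critics agree closely (little extra learning signal). The one-step advantage and the agreement threshold are illustrative assumptions.

```python
import numpy as np

def masked_advantages(values, returns, agree_tau=0.05):
    """values: (n_critics, T) value estimates; returns: (T,) targets."""
    v_mean = values.mean(axis=0)
    adv = returns - v_mean              # simple one-step advantage
    disagreement = values.std(axis=0)   # inter-critic uncertainty
    mask = disagreement > agree_tau     # keep only informative states
    return adv * mask, mask

rng = np.random.default_rng(0)
values = rng.normal(size=(4, 10), scale=0.1) + rng.normal(size=(1, 10))
returns = rng.normal(size=10)
adv, mask = masked_advantages(values, returns)
print(mask.sum(), "of 10 states kept")
```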
[448] Learning Time-Series Representations by Hierarchical Uniformity-Tolerance Latent Balancing
Amin Jalali, Milad Soltany, Michael Greenspan, Ali Etemad
Main category: cs.LG
TL;DR: TimeHUT is a novel time-series representation learning method that uses hierarchical uniformity-tolerance balancing with contrastive learning and angular margin loss to capture both instance-wise and temporal information.
Details
Motivation: To learn strong time-series representations by effectively balancing uniformity and tolerance in the embedding space, capturing both instance-level and temporal dependencies.
Method: Uses hierarchical setup with two losses: temperature-scheduled contrastive loss for uniformity-tolerance balancing, and hierarchical angular margin loss to enforce geometric margins between positive/negative temporal sequence pairs.
Result: Outperforms prior methods by considerable margins on 128 UCR and 30 UAE datasets for classification, and achieves competitive results on Yahoo and KPI datasets for anomaly detection.
Conclusion: TimeHUT effectively balances uniformity and tolerance in time-series representations, demonstrating superior performance in classification tasks and competitive results in anomaly detection.
Abstract: We propose TimeHUT, a novel method for learning time-series representations by hierarchical uniformity-tolerance balancing of contrastive representations. Our method uses two distinct losses to learn strong representations with the aim of striking an effective balance between uniformity and tolerance in the embedding space. First, TimeHUT uses a hierarchical setup to learn both instance-wise and temporal information from input time-series. Next, we integrate a temperature scheduler within the vanilla contrastive loss to balance the uniformity and tolerance characteristics of the embeddings. Additionally, a hierarchical angular margin loss enforces instance-wise and temporal contrast losses, creating geometric margins between positive and negative pairs of temporal sequences. This approach improves the coherence of positive pairs and their separation from the negatives, enhancing the capture of temporal dependencies within a time-series sample. We evaluate our approach on a wide range of tasks, namely 128 UCR and 30 UAE datasets for univariate and multivariate classification, as well as Yahoo and KPI datasets for anomaly detection. The results demonstrate that TimeHUT outperforms prior methods by considerable margins on classification, while obtaining competitive results for anomaly detection. Finally, detailed sensitivity and ablation studies are performed to evaluate different components and hyperparameters of our method.
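A minimal sketch of a temperature-scheduled contrastive loss: the InfoNCE temperature anneals over training, shifting the balance between uniformity (strong repulsion of negatives) and tolerance. The cosine schedule and its endpoints are illustrative assumptions.

```python
import math
import torch
import torch.nn.functional as F

def scheduled_temperature(step, total, t_min=0.07, t_max=0.5):
    cos = 0.5 * (1 + math.cos(math.pi * step / total))
    return t_min + (t_max - t_min) * cos  # anneals from t_max to t_min

def info_nce(z1, z2, temperature):
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / temperature     # positives on the diagonal
    labels = torch.arange(z1.size(0))
    return F.cross_entropy(logits, labels)

z1, z2 = torch.randn(32, 128), torch.randn(32, 128)  # two augmented views
for step in (0, 500, 1000):
    t = scheduled_temperature(step, 1000)
    print(f"step {step}: T={t:.3f} loss={info_nce(z1, z2, t):.3f}")
```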
[449] Shift-Invariant Attribute Scoring for Kolmogorov-Arnold Networks via Shapley Value
Wangxuan Fan, Ching Wang, Siqi Li, Nan Liu
Main category: cs.LG
TL;DR: ShapKAN is a pruning framework for Kolmogorov-Arnold Networks (KANs) that uses Shapley values to assess node importance in a shift-invariant manner, overcoming limitations of traditional magnitude-based pruning methods.
Details
Motivation: KANs provide interpretable neural networks but face unique pruning challenges due to their spline-based activation functions. Traditional magnitude-based pruning methods become unreliable for KANs because they are sensitive to input coordinate shifts, making it difficult to preserve true functional relationships during compression.
Method: The proposed ShapKAN framework uses Shapley value attribution to quantify each node's actual contribution to the network output. This approach provides shift-invariant importance rankings, ensuring consistent pruning decisions regardless of input parameterization.
Result: Extensive experiments on synthetic and real-world datasets show that ShapKAN effectively preserves true node importance while enabling network compression. The method maintains KAN’s interpretability advantages while facilitating deployment in resource-constrained environments.
Conclusion: ShapKAN provides a robust pruning solution for KANs that overcomes the limitations of traditional magnitude-based methods, ensuring consistent importance assessment and enabling effective network compression without sacrificing interpretability.
Abstract: For many real-world applications, understanding feature-outcome relationships is as crucial as achieving high predictive accuracy. While traditional neural networks excel at prediction, their black-box nature obscures underlying functional relationships. Kolmogorov–Arnold Networks (KANs) address this by employing learnable spline-based activation functions on edges, enabling recovery of symbolic representations while maintaining competitive performance. However, KAN’s architecture presents unique challenges for network pruning. Conventional magnitude-based methods become unreliable due to sensitivity to input coordinate shifts. We propose \textbf{ShapKAN}, a pruning framework using Shapley value attribution to assess node importance in a shift-invariant manner. Unlike magnitude-based approaches, ShapKAN quantifies each node’s actual contribution, ensuring consistent importance rankings regardless of input parameterization. Extensive experiments on synthetic and real-world datasets demonstrate that ShapKAN preserves true node importance while enabling effective network compression. Our approach improves KAN’s interpretability advantages, facilitating deployment in resource-constrained environments.
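A toy sketch of Monte Carlo Shapley scoring for hidden nodes, with "removal" implemented as zeroing a node's activation: each node's score is its average marginal effect on the output over random node orderings. The tiny tanh network and the permutation count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 8)), rng.normal(size=(8, 1))

def forward(x, active):
    h = np.tanh(x @ W1) * active  # mask hidden nodes with {0,1}^8
    return (h @ W2).mean()

def shapley_node_scores(x, n_nodes=8, n_perms=200):
    scores = np.zeros(n_nodes)
    for _ in range(n_perms):
        order = rng.permutation(n_nodes)
        active = np.zeros(n_nodes)
        prev = forward(x, active)
        for node in order:
            active[node] = 1.0
            cur = forward(x, active)
            scores[node] += cur - prev  # marginal contribution
            prev = cur
    return scores / n_perms

x = rng.normal(size=(64, 4))
print(np.round(shapley_node_scores(x), 3))  # per-node importance scores
```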
[450] Beyond Simple Fusion: Adaptive Gated Fusion for Robust Multimodal Sentiment Analysis
Han Wu, Yanming Sun, Yunhe Yang, Derek F. Wong
Main category: cs.LG
TL;DR: AGFN is an adaptive gated fusion network that uses dual gate mechanism based on information entropy and modality importance to handle noisy/missing modalities and improve multimodal sentiment analysis.
Details
Motivation: Simple fusion techniques fail to account for variations in modality quality (noisy, missing, conflicting), leading to suboptimal performance in discerning subtle emotional nuances.
Method: Dual gate fusion mechanism that adaptively adjusts feature weights using information entropy and modality importance, mitigating noisy modalities and prioritizing informative cues after unimodal encoding and cross-modal interaction.
Result: Significantly outperforms strong baselines on CMU-MOSI and CMU-MOSEI datasets in accuracy, effectively discerning subtle emotions with robust performance.
Conclusion: AGFN enhances generalization by learning from broader feature distribution, reducing correlation between feature location and prediction error, creating more robust multimodal feature representations.
Abstract: Multimodal sentiment analysis (MSA) leverages information fusion from diverse modalities (e.g., text, audio, visual) to enhance sentiment prediction. However, simple fusion techniques often fail to account for variations in modality quality, such as those that are noisy, missing, or semantically conflicting. This oversight leads to suboptimal performance, especially in discerning subtle emotional nuances. To mitigate this limitation, we introduce a simple yet efficient \textbf{A}daptive \textbf{G}ated \textbf{F}usion \textbf{N}etwork that adaptively adjusts feature weights via a dual gate fusion mechanism based on information entropy and modality importance. This mechanism mitigates the influence of noisy modalities and prioritizes informative cues following unimodal encoding and cross-modal interaction. Experiments on CMU-MOSI and CMU-MOSEI show that AGFN significantly outperforms strong baselines in accuracy, effectively discerning subtle emotions with robust performance. Visualization analysis of feature representations demonstrates that AGFN enhances generalization by learning from a broader feature distribution, achieved by reducing the correlation between feature location and prediction error, thereby decreasing reliance on specific locations and creating more robust multimodal feature representations.
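A toy sketch of entropy-guided gating: each modality is down-weighted when its unimodal prediction is high-entropy (uncertain) and scaled by an importance weight. The normalization and combination rule are illustrative assumptions.

```python
import numpy as np

def entropy(p, eps=1e-12):
    return -(p * np.log(p + eps)).sum(-1)

def gated_fusion(feats, probs, importance):
    """feats: dict name -> (d,); probs: unimodal class probs per modality."""
    n_cls = len(next(iter(probs.values())))
    conf = {m: 1.0 - entropy(p) / np.log(n_cls) for m, p in probs.items()}
    raw = {m: conf[m] * importance[m] for m in feats}
    z = sum(raw.values())
    return sum((raw[m] / z) * feats[m] for m in feats)

feats = {"text": np.ones(4), "audio": 2 * np.ones(4)}
probs = {"text": np.array([0.90, 0.05, 0.05]),    # confident
         "audio": np.array([0.34, 0.33, 0.33])}   # noisy, near-uniform
fused = gated_fusion(feats, probs, {"text": 1.0, "audio": 1.0})
print(fused)  # dominated by the confident text modality
```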
[451] PASTA: A Unified Framework for Offline Assortment Learning
Juncheng Dong, Weibin Mo, Zhengling Qi, Cong Shi, Ethan X. Fang, Vahid Tarokh
Main category: cs.LG
TL;DR: PASTA is a pessimistic assortment optimization framework for data-driven settings that achieves optimal revenue under general choice models without requiring full data coverage of all assortments.
Details
Motivation: Firms lack prior knowledge of choice models and face insufficient data coverage in assortment optimization, making it challenging to design effective solutions from historical customer choice data.
Method: Proposes the Pessimistic Assortment Optimization (PASTA) framework that leverages the pessimism principle, requiring only that offline data contains an optimal assortment rather than full coverage of all feasible assortments.
Result: Establishes first finite-sample regret bounds for offline assortment optimization across multinomial logit and nested logit models, proves PASTA is minimax optimal in sample and model complexity, and shows superior performance over baselines in experiments.
Conclusion: PASTA provides an effective data-driven solution for assortment optimization that addresses data insufficiency issues and achieves theoretical optimality guarantees across various choice models.
Abstract: We study a broad class of assortment optimization problems in an offline and data-driven setting. In such problems, a firm lacks prior knowledge of the underlying choice model, and aims to determine an optimal assortment based on historical customer choice data. The combinatorial nature of assortment optimization often results in insufficient data coverage, posing a significant challenge in designing provably effective solutions. To address this, we introduce a novel Pessimistic Assortment Optimization (PASTA) framework that leverages the principle of pessimism to achieve optimal expected revenue under general choice models. Notably, PASTA requires only that the offline data distribution contains an optimal assortment, rather than providing the full coverage of all feasible assortments. Theoretically, we establish the first finite-sample regret bounds for offline assortment optimization across several widely used choice models, including the multinomial logit and nested logit models. Additionally, we derive a minimax regret lower bound, proving that PASTA is minimax optimal in terms of sample and model complexity. Numerical experiments further demonstrate that our method outperforms existing baseline approaches.
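A toy sketch of the pessimism principle in this setting: penalize each assortment's estimated revenue by an uncertainty term that shrinks with offline coverage, then take the pessimistic argmax. The penalty constant and the counts are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
revenues = {("A",): 1.0, ("B",): 1.5, ("A", "B"): 1.2}   # true values
counts = {("A",): 500, ("B",): 4, ("A", "B"): 300}       # offline coverage
# Offline estimates are noisier for barely-observed assortments.
noisy_mean = {k: v + rng.normal(scale=1 / np.sqrt(counts[k]))
              for k, v in revenues.items()}

beta = 1.0  # pessimism strength (assumption)
pessimistic = {k: noisy_mean[k] - beta / np.sqrt(counts[k]) for k in revenues}
print(max(pessimistic, key=pessimistic.get))
# ('A', 'B'): pessimism avoids committing to the barely-covered ('B',)
```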
[452] Representational Alignment Across Model Layers and Brain Regions with Hierarchical Optimal Transport
Shaan Shah, Meenakshi Khosla
Main category: cs.LG
TL;DR: HOT (Hierarchical Optimal Transport) is a unified framework for comparing neural network representations that jointly infers soft, globally consistent layer-to-layer couplings and neuron-level transport plans, handling depth mismatches naturally.
Details
Motivation: Standard representational similarity methods have limitations: they align layers independently (producing asymmetric results), lack global alignment scores, and struggle with networks of different depths due to ignoring global activation structure and rigid one-to-one layer mappings.
Method: HOT allows source neurons to distribute mass across multiple target layers while minimizing total transport cost under marginal constraints. It infers soft, globally consistent layer-to-layer couplings and neuron-level transport plans through optimal transport principles.
Result: HOT matches or surpasses standard pairwise matching in alignment quality across vision models, LLMs, and human visual cortex recordings. It reveals smooth hierarchical correspondences: early-to-early layer mappings, deeper layers maintaining relative positions, and depth mismatches resolved by distributing representations across layers.
Conclusion: HOT enables richer, more interpretable comparisons between representations, particularly for networks with different architectures or depths, as structured patterns emerge naturally from global optimization without being imposed.
Abstract: Standard representational similarity methods align each layer of a network to its best match in another independently, producing asymmetric results, lacking a global alignment score, and struggling with networks of different depths. These limitations arise from ignoring global activation structure and restricting mappings to rigid one-to-one layer correspondences. We propose Hierarchical Optimal Transport (HOT), a unified framework that jointly infers soft, globally consistent layer-to-layer couplings and neuron-level transport plans. HOT allows source neurons to distribute mass across multiple target layers while minimizing total transport cost under marginal constraints. This yields both a single alignment score for the entire network comparison and a soft transport plan that naturally handles depth mismatches through mass distribution. We evaluate HOT on vision models, large language models, and human visual cortex recordings. Across all domains, HOT matches or surpasses standard pairwise matching in alignment quality. Moreover, it reveals smooth, fine-grained hierarchical correspondences: early layers map to early layers, deeper layers maintain relative positions, and depth mismatches are resolved by distributing representations across multiple layers. These structured patterns emerge naturally from global optimization without being imposed, yet are absent in greedy layer-wise methods. HOT thus enables richer, more interpretable comparisons between representations, particularly when networks differ in architecture or depth.
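The building block can be sketched with entropic optimal transport between layers: Sinkhorn iterations on a layer-to-layer cost matrix yield a soft coupling that lets one source layer spread mass over several target layers. The random cost matrix and epsilon below are illustrative assumptions; HOT's joint hierarchy over neurons and layers is not reproduced.

```python
import numpy as np

def sinkhorn(cost, eps=0.1, iters=200):
    """Entropic OT between uniform marginals over rows and columns."""
    K = np.exp(-cost / eps)
    a = np.full(cost.shape[0], 1.0 / cost.shape[0])
    b = np.full(cost.shape[1], 1.0 / cost.shape[1])
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]  # transport plan with marginals a, b

rng = np.random.default_rng(0)
cost = rng.random((4, 6))  # e.g. 4-layer net vs. 6-layer net dissimilarity
P = sinkhorn(cost)
print(P.sum(axis=1))  # each source layer's mass, ~1/4 each
```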
[453] ActiNet: Activity intensity classification of wrist-worn accelerometers using self-supervised deep learning
Aidan Acquah, Shing Chan, Aiden Doherty
Main category: cs.LG
TL;DR: ActiNet, a self-supervised deep learning model with HMM smoothing, outperforms traditional random forest + HMM for human activity recognition from wrist accelerometer data, achieving better classification of activity intensity levels.
Details
Motivation: To improve human activity recognition (HAR) models for epidemiological studies that investigate the association between physical activity and health outcomes using passively collected wrist-accelerometer data.
Method: Trained ActiNet (18-layer modified ResNet-V2 model) with self-supervised learning and hidden Markov model smoothing on 151 CAPTURE-24 participants' data, compared against baseline random forest + HMM using 5-fold stratified group cross-validation.
Result: ActiNet achieved mean macro F1 score of 0.82 and Cohen’s kappa score of 0.86, outperforming RF+HMM (0.77 F1, 0.81 kappa). Performance improvements were consistent across age and sex subgroups.
Conclusion: ActiNet is recommended for extracting activity intensity labels from wrist-accelerometer data in future epidemiological studies due to its superior performance over traditional methods.
Abstract: The use of reliable and accurate human activity recognition (HAR) models on passively collected wrist-accelerometer data is essential in large-scale epidemiological studies that investigate the association between physical activity and health outcomes. While the use of self-supervised learning has generated considerable excitement in improving HAR, the extent to which these models, coupled with hidden Markov models (HMMs), make a tangible improvement to classification performance, and the effect this may have on the predicted daily activity intensity compositions, remains unknown. Using 151 CAPTURE-24 participants’ data, we trained the ActiNet model, a self-supervised, 18-layer, modified ResNet-V2 model, followed by hidden Markov model (HMM) smoothing to classify labels of activity intensity. The performance of this model, evaluated using 5-fold stratified group cross-validation, was then compared to a baseline random forest (RF) + HMM established in existing literature. Differences in performance and classification outputs were compared across subgroups of age and sex within the CAPTURE-24 population. The ActiNet model was able to distinguish labels of activity intensity with a mean macro F1 score of 0.82 and a mean Cohen’s kappa score of 0.86. This exceeded the performance of the RF + HMM, trained and validated on the same dataset, with mean scores of 0.77 and 0.81, respectively. These findings were consistent across subgroups of age and sex. These findings encourage the use of ActiNet for the extraction of activity intensity labels from wrist-accelerometer data in future epidemiological studies.
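A minimal sketch of the HMM smoothing step shared by ActiNet and the RF baseline: Viterbi decoding over per-window class probabilities with a sticky transition prior. The transition matrix here is hand-set for illustration; in practice it would be estimated from labeled sequences.

```python
import numpy as np

def viterbi(log_emit, log_trans, log_prior):
    T, S = log_emit.shape
    dp = log_prior + log_emit[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = dp[:, None] + log_trans      # scores[i, j]: from i to j
        back[t] = scores.argmax(axis=0)
        dp = scores.max(axis=0) + log_emit[t]
    path = [int(dp.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Per-window class probabilities from the classifier (2 intensity classes).
probs = np.array([[0.80, 0.20], [0.70, 0.30], [0.40, 0.60],
                  [0.75, 0.25], [0.70, 0.30]])
sticky = np.log([[0.95, 0.05], [0.05, 0.95]])  # hand-set transition prior
print(viterbi(np.log(probs), sticky, np.log([0.5, 0.5])))
# [0, 0, 0, 0, 0]: the isolated flip at t=2 is smoothed away
```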
[454] Latency-aware Multimodal Federated Learning over UAV Networks
Shaba Shaon, Dinh C. Nguyen
Main category: cs.LG
TL;DR: This paper proposes a federated multimodal learning framework assisted by UAVs to minimize system latency, with theoretical convergence analysis and efficient optimization algorithms.
Details
Motivation: To overcome limitations of unimodal systems and enhance model accuracy/generalization in UAV networks by leveraging multimodal sensing while minimizing system latency.
Method: Proposes an iterative optimization algorithm combining block coordinate descent and successive convex approximation to jointly optimize UAV sensing scheduling, power control, trajectory planning, resource allocation, and BS resource management.
Result: Numerical experiments show the proposed FML framework outperforms existing approaches in system latency and model training performance under different data settings.
Conclusion: The UAV-assisted FML framework with efficient optimization algorithms successfully minimizes system latency while providing theoretical convergence guarantees and improved multimodal learning performance.
Abstract: This paper investigates federated multimodal learning (FML) assisted by unmanned aerial vehicles (UAVs) with a focus on minimizing system latency and providing convergence analysis. In this framework, UAVs are distributed throughout the network to collect data, participate in model training, and collaborate with a base station (BS) to build a global model. By utilizing multimodal sensing, the UAVs overcome the limitations of unimodal systems, enhancing model accuracy, generalization, and offering a more comprehensive understanding of the environment. The primary objective is to optimize FML system latency in UAV networks by jointly addressing UAV sensing scheduling, power control, trajectory planning, resource allocation, and BS resource management. To address the computational complexity of our latency minimization problem, we propose an efficient iterative optimization algorithm combining block coordinate descent and successive convex approximation techniques, which provides high-quality approximate solutions. We also present a theoretical convergence analysis for the UAV-assisted FML framework under a non-convex loss function. Numerical experiments demonstrate that our FML framework outperforms existing approaches in terms of system latency and model training performance under different data settings.
[455] Accelerating Attention with Basis Decomposition
Jialin Zhao
Main category: cs.LG
TL;DR: BD Attention (BDA) is a lossless algorithmic reformulation of attention using Basis Decomposition, providing mathematically guaranteed acceleration without changing model outputs.
Details
Motivation: Attention is computationally expensive in large language models and vision-language models, and existing optimizations like FlashAttention are system-level improvements rather than algorithmic breakthroughs.
Method: Uses Basis Decomposition matrix identity to restructure multi-head projections into a compact form while preserving exact outputs, requiring only 4s of offline preparation with no retraining.
Result: Achieves 32% faster key/value projections, 25% smaller weights, and increases perplexity by only 0.02% (FP16) or 0.0004% (FP32), with negligible impact on model performance.
Conclusion: BDA is the first theoretically exact method for lossless attention acceleration that complements existing engineering optimizations.
Abstract: Attention is a core operation in large language models (LLMs) and vision-language models (VLMs). We present BD Attention (BDA), the first lossless algorithmic reformulation of attention. BDA is enabled by a simple matrix identity from Basis Decomposition (BD), which restructures multi-head projections into a compact form while preserving exact outputs. Unlike I/O-aware system optimizations such as FlashAttention, BDA provides a mathematically guaranteed acceleration that is architecture-agnostic. On DeepSeek-V2-Lite (16B, FP16), BDA requires only 4s of offline preparation with no retraining required and, on modern GPUs, achieves 32% faster key/value projections and 25% smaller weights, while increasing end-to-end perplexity (PPL) by just 0.02% (FP16) or 0.0004% (FP32), a negligible effect on model performance. These results position BDA as the first theoretically exact method for lossless attention acceleration that is complementary to existing engineering-level optimizations. Our code is available at https://github.com/abcbdf/basis-decomposition-official.
[456] Finite-Time Bounds for Distributionally Robust TD Learning with Linear Function Approximation
Saptarshi Mandal, Yashaswini Murthy, R. Srikant
Main category: cs.LG
TL;DR: First robust TD learning with linear function approximation for distributionally robust RL, achieving O(1/ε²) sample complexity without requiring generative access to MDP.
Details
Motivation: Existing robust TD learning guarantees are limited to tabular MDPs or require restrictive discount-factor assumptions with function approximation. Need model-free robust RL algorithms with theoretical guarantees.
Method: Combines two-time-scale stochastic-approximation update with outer-loop target-network update. Uses total-variation and Wasserstein distance uncertainty sets. Model-free approach without generative access.
Result: Achieves O(1/ε²) sample complexity to obtain ε-accurate value estimate. Extends to robust Q-learning with function approximation.
Conclusion: Closes key gap between empirical success of robust RL algorithms and theoretical guarantees of non-robust counterparts. Provides first robust TD learning with linear function approximation.
Abstract: Distributionally robust reinforcement learning (DRRL) focuses on designing policies that achieve good performance under model uncertainties. In particular, we are interested in maximizing the worst-case long-term discounted reward, where the data for RL comes from a nominal model while the deployed environment can deviate from the nominal model within a prescribed uncertainty set. Existing convergence guarantees for robust temporal-difference (TD) learning for policy evaluation are limited to tabular MDPs or are dependent on restrictive discount-factor assumptions when function approximation is used. We present the first robust TD learning with linear function approximation, where robustness is measured with respect to the total-variation distance and Wasserstein-$\ell$ distance uncertainty sets. Additionally, our algorithm is both model-free and does not require generative access to the MDP. Our algorithm combines a two-time-scale stochastic-approximation update with an outer-loop target-network update. We establish an $\tilde{O}(1/\epsilon^2)$ sample complexity to obtain an $\epsilon$-accurate value estimate. Our results close a key gap between the empirical success of robust RL algorithms and the non-asymptotic guarantees enjoyed by their non-robust counterparts. The key ideas in the paper also extend in a relatively straightforward fashion to robust Q-learning with function approximation.
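A toy sketch of the two-time-scale template: a fast TD update on the weights paired with a slowly refreshed target vector, and a pessimistic Bellman target standing in for the exact robust operator. The span-penalty form of the robust target and all constants are illustrative assumptions, not the paper's operator.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, d = 10, 4
Phi = rng.normal(size=(n_states, d))      # fixed feature matrix
gamma, delta = 0.9, 0.1                   # discount, uncertainty radius

w, w_target = np.zeros(d), np.zeros(d)
for t in range(1, 5001):
    s = rng.integers(n_states)
    s_next = (s + rng.integers(1, 3)) % n_states  # nominal transition
    r = float(s == 0)
    v = Phi @ w_target
    robust_v = v[s_next] - delta * (v.max() - v.min())  # pessimistic target
    td = r + gamma * robust_v - Phi[s] @ w
    w += (1.0 / t**0.6) * td * Phi[s]     # fast time scale
    if t % 100 == 0:
        w_target = w.copy()               # slow (outer-loop) target update
print(np.round(Phi @ w, 2))               # learned robust value estimates
```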
[457] Workplace Location Choice Model based on Deep Neural Network
Tanay Rastogi, Anders Karlström
Main category: cs.LG
TL;DR: This paper compares Deep Neural Networks (DNNs) with traditional Discrete Choice Models (DCMs) for workplace location choice modeling, finding DNNs outperform DCMs in some aspects while DCMs excel at shorter distances and individual attribute analysis.
Details
Motivation: Traditional discrete choice models face challenges in accurately mirroring individual decision-making processes for workplace location choices, prompting the need for more sophisticated modeling approaches.
Method: The study uses deep neural networks (DNNs) as an alternative approach to traditional discrete choice models (DCMs) for modeling workplace location choices.
Result: DNNs show significant potential as a robust alternative to DCMs, outperforming them in certain aspects while DCMs better align with data when assessing individual attributes’ influence on workplace distance. DCMs excel at shorter distances, while DNNs perform comparably for longer distances.
Conclusion: The findings underscore the importance of selecting the appropriate model based on specific application requirements in workplace location choice analysis, suggesting a complementary relationship between DNNs and DCMs rather than one model being universally superior.
Abstract: Discrete choice models (DCMs) have long been used to analyze workplace location decisions, but they face challenges in accurately mirroring individual decision-making processes. This paper presents a deep neural network (DNN) method for modeling workplace location choices, which aims to better capture complex decision patterns and provide better results than traditional DCMs. The study demonstrates that DNNs show significant potential as a robust alternative to DCMs in this domain. While both models effectively replicate the impact of job opportunities on workplace location choices, the DNN outperforms the DCM in certain aspects. However, the DCM better aligns with data when assessing the influence of individual attributes on workplace distance. Notably, DCMs excel at shorter distances, while DNNs perform comparably to both data and DCMs for longer distances. These findings underscore the importance of selecting the appropriate model based on specific application requirements in workplace location choice analysis.
[458] Private and Fair Machine Learning: Revisiting the Disparate Impact of Differentially Private SGD
Lea Demelius, Dominik Kowald, Simone Kopeinik, Roman Kern, Andreas Trügler
Main category: cs.LG
TL;DR: Differential privacy in neural network training affects fairness, and while direct hyperparameter optimization on private models improves utility-fairness trade-offs, it doesn’t reliably mitigate disparate impacts and increases privacy leakage.
Details
Motivation: To analyze whether optimizing hyperparameters directly on differentially private models can achieve fairness comparable to non-private models, and to understand the disparate impact of DPSGD across different performance metrics.
Method: 1) Compare disparate impact of DPSGD on different performance metrics, 2) Analyze over a wide range of hyperparameter settings, 3) Extend analysis to DPSGD-Global-Adapt variant.
Result: Disparate impact on one metric doesn’t imply impact on another; direct hyperparameter optimization improves utility-fairness trade-offs but doesn’t reliably mitigate disparate impacts; DPSGD-Global-Adapt is not robust to hyperparameter choices.
Conclusion: Careful balancing of privacy, utility and fairness is needed as hyperparameter tuning increases privacy leakage, and current methods don’t reliably mitigate DPSGD’s disparate impacts.
Abstract: Differential privacy (DP) is a prominent method for protecting information about individuals during data analysis. Training neural networks with differentially private stochastic gradient descent (DPSGD) influences the model’s learning dynamics and, consequently, its output. This can affect the model’s performance and fairness. While the majority of studies on the topic report a negative impact on fairness, it has recently been suggested that fairness levels comparable to non-private models can be achieved by optimizing hyperparameters for performance directly on differentially private models (rather than re-using hyperparameters from non-private models, as is common practice). In this work, we analyze the generalizability of this claim by 1) comparing the disparate impact of DPSGD on different performance metrics, and 2) analyzing it over a wide range of hyperparameter settings. We highlight that a disparate impact on one metric does not necessarily imply a disparate impact on another. Most importantly, we show that while optimizing hyperparameters directly on differentially private models does not mitigate the disparate impact of DPSGD reliably, it can still lead to improved utility-fairness trade-offs compared to re-using hyperparameters from non-private models. We stress, however, that any form of hyperparameter tuning entails additional privacy leakage, calling for careful considerations of how to balance privacy, utility and fairness. Finally, we extend our analyses to DPSGD-Global-Adapt, a variant of DPSGD designed to mitigate the disparate impact on accuracy, and conclude that this alternative may not be a robust solution with respect to hyperparameter choice.
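For readers unfamiliar with the mechanism under study, the following is a minimal sketch of one DP-SGD step (per-sample gradient clipping plus Gaussian noise), the procedure whose disparate impact the paper analyzes; production code should use a vetted library such as Opacus.

```python
import torch

def dpsgd_step(model, loss_fn, batch_x, batch_y, lr=0.1,
               clip_norm=1.0, noise_mult=1.0):
    """One DP-SGD step: clip each per-sample gradient to norm C, sum,
    add Gaussian noise of scale noise_mult * C, then average."""
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]
    for x, y in zip(batch_x, batch_y):          # per-sample gradients
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        total = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = (clip_norm / (total + 1e-12)).clamp(max=1.0)
        for s, g in zip(summed, grads):
            s += g * scale
    with torch.no_grad():
        for p, s in zip(params, summed):
            noise = torch.randn_like(p) * noise_mult * clip_norm
            p -= lr * (s + noise) / len(batch_x)
```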
[459] Learning Regularization Functionals for Inverse Problems: A Comparative Study
Johannes Hertrich, Hok Shing Wong, Alexander Denker, Stanislas Ducotterd, Zhenghan Fang, Markus Haltmeier, Željko Kereta, Erich Kobler, Oscar Leong, Mohammad Sadegh Salehi, Carola-Bibiane Schönlieb, Johannes Schwab, Zakhar Shumaylov, Jeremias Sulam, German Shâma Wache, Martin Zach, Yasi Zhang, Matthias J. Ehrhardt, Sebastian Neumayer
Main category: cs.LG
TL;DR: The paper unifies various learned regularization methods for imaging inverse problems into a common framework to enable systematic comparison and provide practical guidelines.
Details
Motivation: To address the challenge of comparing different learned regularization methods due to their non-modular implementations and varying architectural designs and training strategies.
Method: Collecting and unifying available code for learned regularization frameworks into a common framework, then systematically comparing the approaches.
Result: The unified framework enables systematic comparison of methods, highlighting their strengths and limitations, and provides valuable insights into their future potential.
Conclusion: The paper successfully creates a unified framework for comparing learned regularization methods in imaging inverse problems, offering practical guidelines and insights for future development.
Abstract: In recent years, a variety of learned regularization frameworks for solving inverse problems in imaging have emerged. These offer flexible modeling together with mathematical insights. The proposed methods differ in their architectural design and training strategies, making direct comparison challenging due to non-modular implementations. We address this gap by collecting and unifying the available code into a common framework. This unified view allows us to systematically compare the approaches and highlight their strengths and limitations, providing valuable insights into their future potential. We also provide concise descriptions of each method, complemented by practical guidelines.
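The frameworks being compared share a common inference template: minimize a data-fidelity term plus a learned regularizer. A minimal sketch, assuming `A`, `y`, and a differentiable `regularizer` module are supplied:

```python
import torch

def reconstruct(A, y, regularizer, lam=0.1, steps=200, lr=1e-2):
    """Variational reconstruction with a learned regularizer R_theta.

    Solves min_x 0.5*||A x - y||^2 + lam * R_theta(x) by gradient descent;
    `regularizer` is any differentiable module mapping x to a scalar.
    """
    x = torch.zeros(A.shape[1], requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = 0.5 * (A @ x - y).pow(2).sum() + lam * regularizer(x)
        loss.backward()
        opt.step()
    return x.detach()
```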
[460] Unsupervised Dynamic Feature Selection for Robust Latent Spaces in Vision Tasks
Bruno Corcuera, Carlos Eiras-Franco, Brais Cancela
Main category: cs.LG
TL;DR: A novel unsupervised Dynamic Feature Selection (DFS) method that enhances latent representations by removing noisy or irrelevant features from images, improving model performance and generalization without labeled data.
Details
Motivation: Latent representations in vision tasks often contain noisy or irrelevant features that degrade model performance and generalization capabilities, necessitating a method to filter out misleading information.
Method: Unsupervised Dynamic Feature Selection (DFS) that identifies and removes misleading or redundant information for each instance, ensuring only relevant features contribute to the latent space without requiring labeled data.
Result: Models with unsupervised DFS achieve significant improvements in generalization performance across clustering and image generation tasks, with minimal computational cost increase.
Conclusion: Unsupervised DFS effectively enhances latent representations by filtering out irrelevant features, making it broadly applicable across domains and datasets while maintaining computational efficiency.
Abstract: Latent representations are critical for the performance and robustness of machine learning models, as they encode the essential features of data in a compact and informative manner. However, in vision tasks, these representations are often affected by noisy or irrelevant features, which can degrade the model’s performance and generalization capabilities. This paper presents a novel approach for enhancing latent representations using unsupervised Dynamic Feature Selection (DFS). For each instance, the proposed method identifies and removes misleading or redundant information in images, ensuring that only the most relevant features contribute to the latent space. By leveraging an unsupervised framework, our approach avoids reliance on labeled data, making it broadly applicable across various domains and datasets. Experiments conducted on image datasets demonstrate that models equipped with unsupervised DFS achieve significant improvements in generalization performance across various tasks, including clustering and image generation, while incurring a minimal increase in the computational cost.
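As a rough illustration of the idea (the paper's exact DFS mechanism may differ), one can gate features per instance with a small mask network trained unsupervised via a reconstruction loss plus a sparsity penalty:

```python
import torch
import torch.nn as nn

class GatedEncoder(nn.Module):
    """Hypothetical instance-wise feature gating before the encoder.

    A small network predicts a per-feature mask for each input, and only
    the surviving features form the latent code; an L1 term on the mask
    encourages dropping irrelevant features.
    """
    def __init__(self, d_in, d_lat):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(d_in, d_in), nn.Sigmoid())
        self.enc = nn.Linear(d_in, d_lat)
        self.dec = nn.Linear(d_lat, d_in)

    def forward(self, x):
        m = self.gate(x)                      # soft per-instance mask
        z = self.enc(x * m)
        recon = self.dec(z)
        # unsupervised objective: reconstruct x, keep the mask sparse
        loss = (recon - x).pow(2).mean() + 1e-3 * m.abs().mean()
        return z, loss
```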
[461] Octax: Accelerated CHIP-8 Arcade Environments for Reinforcement Learning in JAX
Waris Radji, Thomas Michel, Hector Piteau
Main category: cs.LG
TL;DR: Octax is a high-performance suite of classic arcade game environments implemented in JAX, offering GPU-accelerated alternatives to CPU-bound Atari benchmarks for large-scale reinforcement learning experiments.
Details
Motivation: Current RL research lacks diverse, challenging environments that are both tractable and scalable. Modern video games are computationally expensive and poorly suited for large-scale experimentation due to CPU-bound execution.
Method: Implemented CHIP-8 emulation in JAX to create a suite of classic arcade game environments spanning puzzle, action, and strategy genres, all executable on GPUs with perfect fidelity to original game mechanics.
Result: Achieved orders-of-magnitude speedups over traditional CPU emulators while maintaining perfect fidelity. Demonstrated significant improvements in training speed and scalability for RL agents across multiple games.
Conclusion: Octax provides the RL research community with a scalable, GPU-accelerated alternative to Atari benchmarks, enabling large-scale experimentation with modular design for easy extension and novel environment generation.
Abstract: Reinforcement learning (RL) research requires diverse, challenging environments that are both tractable and scalable. While modern video games may offer rich dynamics, they are computationally expensive and poorly suited for large-scale experimentation due to their CPU-bound execution. We introduce Octax, a high-performance suite of classic arcade game environments implemented in JAX and based on emulation of CHIP-8, a predecessor of the Atari platform that is widely adopted as a benchmark in RL research. Octax provides the JAX community with a long-awaited end-to-end GPU alternative to the Atari benchmark, offering image-based environments spanning puzzle, action, and strategy genres, all executable at massive scale on modern GPUs. Our JAX-based implementation achieves orders-of-magnitude speedups over traditional CPU emulators while maintaining perfect fidelity to the original game mechanics. We demonstrate Octax’s capabilities by training RL agents across multiple games, showing significant improvements in training speed and scalability compared to existing solutions. The environment’s modular design enables researchers to easily extend the suite with new games or generate novel environments using large language models, making it an ideal platform for large-scale RL experimentation.
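The core engineering idea is that a pure-functional emulator step can be batched across thousands of environments with `jax.vmap` and compiled with `jax.jit`. A toy sketch with a placeholder transition (the real Octax step updates the full CHIP-8 machine state):

```python
import jax
import jax.numpy as jnp

# Toy stand-in for an emulator step: Octax's real CHIP-8 step updates the
# full machine state (registers, memory, display) as a pure function, which
# is what lets JAX vmap/jit it across thousands of environments on one GPU.
def step(state, action):
    new_state = state + action                 # placeholder transition
    reward = jnp.tanh(new_state.sum())         # placeholder reward
    return new_state, reward

batched_step = jax.jit(jax.vmap(step))         # one kernel, N environments

states = jnp.zeros((4096, 16))                 # 4096 parallel env states
actions = jnp.ones((4096, 16))
states, rewards = batched_step(states, actions)
print(rewards.shape)                           # (4096,)
```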
[462] Neural non-canonical Hamiltonian dynamics for long-time simulations
Clémentine Courtès, Emmanuel Franck, Michael Kraus, Laurent Navoret, Léopold Trémant
Main category: cs.LG
TL;DR: This paper addresses numerical instability in learning non-canonical Hamiltonian dynamics by proposing two training strategies to overcome gauge dependency issues in variational integrators.
Details
Motivation: Previous approaches to learning Hamiltonian dynamics focused separately on model architecture or numerical schemes, but combining both creates gauge dependency issues that cause numerical instability in long-term simulations.
Method: Two training strategies: 1) directly learning the vector field, and 2) learning time-discrete dynamics through the numerical scheme itself.
Result: The methods successfully learn complex physical dynamics including guiding center dynamics from gyrokinetic plasma physics, overcoming previous numerical instability issues.
Conclusion: The proposed training strategies effectively address gauge dependency problems in combining Hamiltonian model learning with variational integrators, enabling stable long-term simulations of complex physical systems.
Abstract: This work focuses on learning non-canonical Hamiltonian dynamics from data, where long-term predictions require the preservation of structure both in the learned model and in numerical schemes. Previous research focused on either facet, respectively with a potential-based architecture and with degenerate variational integrators, but new issues arise when combining both. In experiments, the learnt model is sometimes numerically unstable due to the gauge dependency of the scheme, rendering long-time simulations impossible. In this paper, we identify this problem and propose two different training strategies to address it, either by directly learning the vector field or by learning a time-discrete dynamics through the scheme. Several numerical test cases assess the ability of the methods to learn complex physical dynamics, like the guiding center from gyrokinetic plasma physics.
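In code, the two proposed strategies correspond to two different losses. A minimal sketch, using an implicit-midpoint residual as a stand-in for the degenerate variational integrator used in the paper:

```python
import torch

def vector_field_loss(f_theta, x, x_dot):
    """Strategy 1: fit the vector field directly; x_dot can be estimated
    from trajectory data by finite differences."""
    return (f_theta(x) - x_dot).pow(2).mean()

def through_scheme_loss(f_theta, x_n, x_np1, h):
    """Strategy 2: fit the time-discrete dynamics through the scheme.
    An implicit-midpoint residual stands in here for the paper's
    degenerate variational integrator."""
    x_mid = 0.5 * (x_n + x_np1)
    return (x_np1 - x_n - h * f_theta(x_mid)).pow(2).mean()
```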
[463] Sensitivity, Specificity, and Consistency: A Tripartite Evaluation of Privacy Filters for Synthetic Data Generation
Adil Koeken, Alexander Ziller, Moritz Knolle, Daniel Rueckert
Main category: cs.LG
TL;DR: Post-hoc privacy filtering for synthetic medical datasets shows limited effectiveness, failing to reliably detect near-duplicates and providing false security rather than protecting patient privacy.
Details
Motivation: To rigorously evaluate the effectiveness of post-hoc privacy filtering techniques for synthetic medical datasets, as their claimed privacy protection remains largely unverified.
Method: Applied a filtering pipeline to chest X-ray synthesis and evaluated its performance in detecting personally identifiable information and near-duplicates from training data.
Result: Current filters exhibit limited specificity and consistency, achieving high sensitivity only for real images but failing to reliably detect near-duplicates generated from training data.
Conclusion: Substantial advances in filter design are needed before post-hoc privacy filtering methods can be confidently deployed in sensitive medical applications, as current methods provide false security while leaving patient information exposed.
Abstract: The generation of privacy-preserving synthetic datasets is a promising avenue for overcoming data scarcity in medical AI research. Post-hoc privacy filtering techniques, designed to remove samples containing personally identifiable information, have recently been proposed as a solution. However, their effectiveness remains largely unverified. This work presents a rigorous evaluation of a filtering pipeline applied to chest X-ray synthesis. Contrary to claims from the original publications, our results demonstrate that current filters exhibit limited specificity and consistency, achieving high sensitivity only for real images while failing to reliably detect near-duplicates generated from training data. These results demonstrate a critical limitation of post-hoc filtering: rather than effectively safeguarding patient privacy, these methods may provide a false sense of security while leaving unacceptable levels of patient information exposed. We conclude that substantial advances in filter design are needed before these methods can be confidently deployed in sensitive applications.
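The tripartite evaluation can be summarized with simple counts. A small sketch, assuming boolean "flagged" decisions from the filter on three test pools:

```python
def filter_metrics(flags_real, flags_synthetic_dup, flags_synthetic_novel):
    """Tripartite view of a privacy filter from boolean 'flagged' lists.

    Sensitivity: fraction of re-identifiable items caught (real images,
    and near-duplicates of training data); specificity: fraction of
    genuinely novel synthetic images correctly passed through.
    """
    sens_real = sum(flags_real) / len(flags_real)
    sens_dup = sum(flags_synthetic_dup) / len(flags_synthetic_dup)
    spec = 1 - sum(flags_synthetic_novel) / len(flags_synthetic_novel)
    return sens_real, sens_dup, spec

# The failure mode the paper reports, in these terms:
# sens_real is high while sens_dup stays low.
```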
[464] Rethinking the shape convention of an MLP
Meng-Hsi Chen, Yu-Ang Lee, Feng-Ting Liao, Da-shan Shiu
Main category: cs.LG
TL;DR: The paper proposes Hourglass MLP blocks with wide-narrow-wide design, where skip connections operate at expanded dimensions and computation flows through narrow bottlenecks, achieving better performance-parameter trade-offs than conventional narrow-wide-narrow MLPs.
Details
Motivation: To challenge the conventional narrow-wide-narrow design of MLPs by inverting the skip connection placement, leveraging higher-dimensional spaces for refinement while maintaining computational efficiency.
Method: Propose Hourglass MLP blocks with wide-narrow-wide architecture, where initial projection lifts inputs to expanded dimensions and can remain fixed at random initialization. Evaluate on generative tasks through systematic architectural search.
Result: Hourglass architectures consistently achieve superior performance-parameter Pareto frontiers compared to conventional designs, with optimal configurations favoring deeper networks, wider skip connections, and narrower bottlenecks as parameters increase.
Conclusion: The findings suggest reconsidering skip connection placement in modern architectures, with potential applications extending to Transformers and other residual networks.
Abstract: Multi-layer perceptrons (MLPs) conventionally follow a narrow-wide-narrow design where skip connections operate at the input/output dimensions while processing occurs in expanded hidden spaces. We challenge this convention by proposing wide-narrow-wide (Hourglass) MLP blocks where skip connections operate at expanded dimensions while residual computation flows through narrow bottlenecks. This inversion leverages higher-dimensional spaces for incremental refinement while maintaining computational efficiency through parameter-matched designs. Implementing Hourglass MLPs requires an initial projection to lift input signals to expanded dimensions. We propose that this projection can remain fixed at random initialization throughout training, enabling efficient training and inference implementations. We evaluate both architectures on generative tasks over popular image datasets, characterizing performance-parameter Pareto frontiers through systematic architectural search. Results show that Hourglass architectures consistently achieve superior Pareto frontiers compared to conventional designs. As parameter budgets increase, optimal Hourglass configurations favor deeper networks with wider skip connections and narrower bottlenecks, a scaling pattern distinct from conventional MLPs. Our findings suggest reconsidering skip connection placement in modern architectures, with potential applications extending to Transformers and other residual networks.
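A minimal PyTorch sketch of the proposed block: the residual stream lives at the expanded width, computation passes through a narrow bottleneck, and the lifting projection is frozen at its random initialization as the paper proposes:

```python
import torch
import torch.nn as nn

class HourglassBlock(nn.Module):
    """Wide-narrow-wide block: the skip connection operates at the
    expanded width; residual computation flows through a bottleneck."""
    def __init__(self, d_wide, d_narrow):
        super().__init__()
        self.down = nn.Linear(d_wide, d_narrow)
        self.up = nn.Linear(d_narrow, d_wide)
        self.act = nn.GELU()

    def forward(self, h):                       # h: (..., d_wide)
        return h + self.up(self.act(self.down(h)))

class HourglassMLP(nn.Module):
    def __init__(self, d_in, d_wide, d_narrow, depth):
        super().__init__()
        self.lift = nn.Linear(d_in, d_wide)
        # Per the paper, the lifting projection can stay frozen at its
        # random initialization throughout training.
        for q in self.lift.parameters():
            q.requires_grad_(False)
        self.blocks = nn.ModuleList(
            [HourglassBlock(d_wide, d_narrow) for _ in range(depth)])
        self.head = nn.Linear(d_wide, d_in)

    def forward(self, x):
        h = self.lift(x)
        for blk in self.blocks:
            h = blk(h)
        return self.head(h)
```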
[465] Sparse Query Attention (SQA): A Computationally Efficient Attention Mechanism with Query Heads Reduction
Adam Filipek
Main category: cs.LG
TL;DR: Sparse Query Attention (SQA) reduces computational complexity by decreasing Query heads instead of Key/Value heads, achieving up to 3x throughput improvements for long sequences with minimal quality impact.
Details
Motivation: The quadratic complexity of Multi-Head Attention limits scaling for long contexts. Existing solutions like MQA and GQA address memory bandwidth but don't reduce FLOPs for attention computation, which remains a bottleneck for training and full-sequence processing.
Method: SQA reduces the number of Query heads rather than Key/Value heads, directly decreasing attention computational complexity proportionally to query head reduction. The paper presents theoretical foundation, mathematical formulation, and architectural variants.
Result: Empirical benchmarks on long sequences (32k-200k tokens) show SQA achieves up to 3x throughput improvements in computation-bound scenarios like pre-training, fine-tuning, and encoder tasks, with minimal impact on model quality in preliminary experiments.
Conclusion: SQA provides a complementary optimization path to existing attention mechanisms, offering significant computational efficiency gains for scalable model development, and was discovered during Reactive Transformer development.
Abstract: The Transformer architecture, underpinned by the Multi-Head Attention (MHA) mechanism, has become the de facto standard for state-of-the-art models in artificial intelligence. However, the quadratic computational complexity of MHA with respect to sequence length presents a significant barrier to scaling, particularly for applications involving long contexts. Prevailing solutions, such as Multi-Query Attention (MQA) and Grouped-Query Attention (GQA), have effectively addressed the memory bandwidth bottleneck that dominates autoregressive inference latency by sharing Key and Value projections. While highly successful, these methods do not reduce the fundamental number of floating-point operations (FLOPs) required for the attention score computation, which remains a critical bottleneck for training and full-sequence processing. This paper introduces Sparse Query Attention (SQA), a novel attention architecture that pursues an alternative and complementary optimization path. Instead of reducing Key/Value heads, SQA reduces the number of Query heads. This architectural modification directly decreases the computational complexity of the attention mechanism by a factor proportional to the reduction in query heads, thereby lowering the overall FLOPs. This work presents the theoretical foundation of SQA, its mathematical formulation, and a family of architectural variants. Empirical benchmarks on long sequences (32k-200k tokens) demonstrate that SQA can achieve significant throughput improvements of up to 3x in computation-bound scenarios such as model pre-training, fine-tuning, and encoder-based tasks, with only a minimal impact on model quality in preliminary small-scale experiments. SQA was discovered serendipitously during the development of the upcoming Reactive Transformer architecture, suggesting its potential as a powerful tool for building more efficient and scalable models.
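A minimal sketch of the core idea, with the simplifying assumption that keys and values are also projected to the reduced head count (the paper's variants may pair reduced query heads with K/V differently):

```python
import torch
import torch.nn.functional as F

def sparse_query_attention(x, wq, wk, wv, wo, h_q, d_head):
    """Minimal SQA sketch: score/value FLOPs scale with h_q query heads
    rather than the full head count H of a parameter-matched MHA baseline.

    Expected weight shapes: wq, wk, wv: (d_model, h_q * d_head);
    wo: (h_q * d_head, d_model).
    """
    B, L, _ = x.shape
    q = (x @ wq).view(B, L, h_q, d_head).transpose(1, 2)  # (B, h_q, L, d)
    k = (x @ wk).view(B, L, h_q, d_head).transpose(1, 2)
    v = (x @ wv).view(B, L, h_q, d_head).transpose(1, 2)
    scores = q @ k.transpose(-2, -1) / d_head ** 0.5      # ~h_q * L^2 * d FLOPs
    out = F.softmax(scores, dim=-1) @ v
    out = out.transpose(1, 2).reshape(B, L, h_q * d_head)
    return out @ wo                                        # back to d_model
```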
[466] Pre-Hoc Predictions in AutoML: Leveraging LLMs to Enhance Model Selection and Benchmarking for Tabular datasets
Yannis Belkhiter, Seshu Tirupathi, Giulio Zizzo, Sachin Sharma, John D. Kelleher
Main category: cs.LG
TL;DR: This paper proposes using pre-hoc model selection with traditional models and LLM agents to reduce AutoML search space, achieving computational efficiency while maintaining model performance.
Details
Motivation: Current AutoML methods rely on exhaustive hyperparameter searches which are computationally expensive. Pre-hoc prediction offers a promising alternative but remains under-explored in the literature.
Method: Leverage traditional models and LLM agents using dataset descriptions and statistical information to pre-select models, reducing the AutoML search space. Tested on AWS AutoGluon portfolio with 175 tabular classification datasets.
Result: The approach significantly reduces computational overhead while still selecting the best performing models for given datasets.
Conclusion: Pre-hoc model selection represents a shift in AutoML workflows that balances computational efficiency with model performance, offering a practical alternative to exhaustive search methods.
Abstract: The field of AutoML has made remarkable progress in post-hoc model selection, with libraries capable of automatically identifying the most performing models for a given dataset. Nevertheless, these methods often rely on exhaustive hyperparameter searches, where methods automatically train and test different types of models on the target dataset. Contrastingly, pre-hoc prediction emerges as a promising alternative, capable of bypassing exhaustive search through intelligent pre-selection of models. Despite its potential, pre-hoc prediction remains under-explored in the literature. This paper explores the intersection of AutoML and pre-hoc model selection by leveraging traditional models and Large Language Model (LLM) agents to reduce the search space of AutoML libraries. By relying on dataset descriptions and statistical information, we reduce the AutoML search space. Our methodology is applied to the AWS AutoGluon portfolio dataset, a state-of-the-art AutoML benchmark containing 175 tabular classification datasets available on OpenML. The proposed approach offers a shift in AutoML workflows, significantly reducing computational overhead, while still selecting the best model for the given dataset.
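A hypothetical sketch of the pre-hoc step (`call_llm` stands in for any chat-completion client; the model families listed are illustrative, not the paper's actual portfolio):

```python
def prehoc_shortlist(call_llm, name, n_rows, n_cols, n_classes, description):
    """Ask an LLM to shortlist model families from dataset metadata
    before launching the AutoML search."""
    prompt = (
        f"Dataset: {name}\nRows: {n_rows}, Features: {n_cols}, "
        f"Classes: {n_classes}\nDescription: {description}\n"
        "From [GBM, RandomForest, LinearModel, KNN, NeuralNet, CatBoost], "
        "list the 3 families most likely to perform best, comma-separated."
    )
    reply = call_llm(prompt)
    shortlist = [m.strip() for m in reply.split(",")][:3]
    return shortlist  # restrict the AutoML search space to these families
```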
[467] $\text{G}^2$RPO: Granular GRPO for Precise Reward in Flow Models
Yujie Zhou, Pengyang Ling, Jiazi Bu, Yibin Wang, Yuhang Zang, Jiaqi Wang, Li Niu, Guangtao Zhai
Main category: cs.LG
TL;DR: Proposes G²RPO framework for better alignment of flow models with human preferences using granular reward assessments and multi-scale advantage integration.
Details
Motivation: Existing RL methods for diffusion/flow models suffer from sub-optimal preference alignment due to sparse and narrow reward signals during stochastic sampling.
Method: Introduces Singular Stochastic Sampling for step-wise exploration with high reward-noise correlation, and Multi-Granularity Advantage Integration to eliminate bias from fixed-granularity denoising.
Result: G²RPO significantly outperforms existing flow-based GRPO baselines in both in-domain and out-of-domain evaluations across various reward models.
Conclusion: The framework achieves precise and comprehensive reward assessments, demonstrating effectiveness and robustness in aligning generative models with human preferences.
Abstract: The integration of online reinforcement learning (RL) into diffusion and flow models has recently emerged as a promising approach for aligning generative models with human preferences. Stochastic sampling via Stochastic Differential Equations (SDE) is employed during the denoising process to generate diverse denoising directions for RL exploration. While existing methods effectively explore potential high-value samples, they suffer from sub-optimal preference alignment due to sparse and narrow reward signals. To address these challenges, we propose a novel Granular-GRPO ($\text{G}^2$RPO ) framework that achieves precise and comprehensive reward assessments of sampling directions in reinforcement learning of flow models. Specifically, a Singular Stochastic Sampling strategy is introduced to support step-wise stochastic exploration while enforcing a high correlation between the reward and the injected noise, thereby facilitating a faithful reward for each SDE perturbation. Concurrently, to eliminate the bias inherent in fixed-granularity denoising, we introduce a Multi-Granularity Advantage Integration module that aggregates advantages computed at multiple diffusion scales, producing a more comprehensive and robust evaluation of the sampling directions. Experiments conducted on various reward models, including both in-domain and out-of-domain evaluations, demonstrate that our $\text{G}^2$RPO significantly outperforms existing flow-based GRPO baselines, highlighting its effectiveness and robustness.
[468] Black-Box Combinatorial Optimization with Order-Invariant Reinforcement Learning
Olivier Goudet, Quentin Suire, Adrien Goëffon, Frédéric Saubion, Sylvain Lamprier
Main category: cs.LG
TL;DR: Order-invariant RL for black-box combinatorial optimization using random generation orders during training to improve sample efficiency and avoid dependency on fixed variable orderings.
Details
Motivation: Classical EDAs rely on learning explicit variable dependency graphs, which can be costly and inefficient for capturing complex interactions. Need for more efficient and robust optimization methods.
Method: Multivariate autoregressive generative model trained without fixed variable ordering, using random generation orders as information-preserving dropout. Adapted GRPO for stable policy-gradient updates with scale-invariant advantages.
Result: Achieved best performance across various benchmark algorithms and problem instances, consistently avoiding catastrophic failures.
Conclusion: Order-invariant RL framework provides efficient and robust combinatorial optimization by promoting search-space diversity and focusing on relevant variable dependencies.
Abstract: We introduce an order-invariant reinforcement learning framework for black-box combinatorial optimization. Classical estimation-of-distribution algorithms (EDAs) often rely on learning explicit variable dependency graphs, which can be costly and fail to capture complex interactions efficiently. In contrast, we parameterize a multivariate autoregressive generative model trained without a fixed variable ordering. By sampling random generation orders during training - a form of information-preserving dropout - the model is encouraged to be invariant to variable order, promoting search-space diversity and shaping the model to focus on the most relevant variable dependencies, improving sample efficiency. We adapt Generalized Reinforcement Policy Optimization (GRPO) to this setting, providing stable policy-gradient updates from scale-invariant advantages. Across a wide range of benchmark algorithms and problem instances of varying sizes, our method frequently achieves the best performance and consistently avoids catastrophic failures.
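The order-invariance trick amounts to sampling a fresh permutation per training step and factorizing the likelihood along it. A hedged sketch, where `model(values, mask, idx)` is an assumed interface returning the logit of variable `idx` given the observed partial assignment (the policy-gradient weighting is omitted):

```python
import torch
import torch.nn.functional as F

def order_invariant_step(model, x, opt):
    """One likelihood step under a random generation order.

    x: (B, n) float tensor of binary assignments.
    """
    B, n = x.shape
    perm = torch.randperm(n)                    # fresh order each step
    mask = torch.zeros_like(x)
    loss = x.new_zeros(())
    for i in perm:
        logit = model(x * mask, mask, int(i))   # predict variable i
        loss = loss + F.binary_cross_entropy_with_logits(logit, x[:, i])
        mask = mask.clone()
        mask[:, i] = 1.0                        # reveal it for later steps
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```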
[469] Learning Representations Through Contrastive Neural Model Checking
Vladimir Krsmanovic, Matthias Cosler, Mohamed Ghanem, Bernd Finkbeiner
Main category: cs.LG
TL;DR: CNML introduces contrastive learning to model checking, creating aligned representations of logical specifications and systems in a shared latent space, outperforming existing methods and showing strong transfer capabilities.
Details
Motivation: Representation learning remains underexplored in formal verification despite its success in vision and language domains, creating an opportunity to leverage model checking as a guiding signal for learning.
Method: Uses self-supervised contrastive learning to jointly embed logical specifications and systems into a shared latent space, treating model checking as the learning objective.
Result: Significantly outperforms algorithmic and neural baselines on industry-inspired retrieval tasks in both cross-modal and intra-modal settings, with learned representations effectively transferring to downstream tasks and generalizing to complex formulas.
Conclusion: Model checking can successfully serve as an objective for learning representations for formal languages, demonstrating the viability of contrastive learning approaches in formal verification.
Abstract: Model checking is a key technique for verifying safety-critical systems against formal specifications, where recent applications of deep learning have shown promise. However, while ubiquitous for vision and language domains, representation learning remains underexplored in formal verification. We introduce Contrastive Neural Model Checking (CNML), a novel method that leverages the model checking task as a guiding signal for learning aligned representations. CNML jointly embeds logical specifications and systems into a shared latent space through a self-supervised contrastive objective. On industry-inspired retrieval tasks, CNML considerably outperforms both algorithmic and neural baselines in cross-modal and intra-modal settings. We further show that the learned representations effectively transfer to downstream tasks and generalize to more complex formulas. These findings demonstrate that model checking can serve as an objective for learning representations for formal languages.
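The contrastive objective is a CLIP-style InfoNCE over specification/system pairs. A minimal sketch, assuming the two encoders have already produced batch embeddings:

```python
import torch
import torch.nn.functional as F

def contrastive_model_checking_loss(spec_emb, sys_emb, tau=0.07):
    """Specification i and the system it satisfies form a positive pair;
    all other systems in the batch serve as negatives.

    spec_emb, sys_emb: (B, d) embeddings from the two encoders.
    """
    spec = F.normalize(spec_emb, dim=-1)
    sys = F.normalize(sys_emb, dim=-1)
    logits = spec @ sys.t() / tau             # (B, B) similarity matrix
    labels = torch.arange(spec.size(0))
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```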
[470] Explicit Discovery of Nonlinear Symmetries from Dynamic Data
Lexiang Hu, Yikang Li, Zhouchen Lin
Main category: cs.LG
TL;DR: LieNLSD is the first method that can discover nonlinear symmetries by determining the number of infinitesimal generators and their explicit expressions, improving neural PDE solver accuracy by over 20% through data augmentation.
Details
Motivation: Symmetry is crucial for equivariant networks and governing equation discovery, but existing methods are limited to linear symmetries and fail to explicitly obtain Lie algebra subspaces for nonlinear symmetries.
Method: Specify a function library for infinitesimal group action, prove its prolongation formula is linear with respect to coefficient matrix, substitute central differences and neural network Jacobian into infinitesimal criterion, and solve the resulting linear system using SVD.
Result: LieNLSD shows qualitative advantages over existing methods on top quark tagging and dynamic systems, and improves long rollout accuracy of neural PDE solvers by over 20% when used for data augmentation.
Conclusion: LieNLSD successfully discovers nonlinear symmetries by explicitly obtaining Lie algebra subspaces, demonstrating superior performance and practical utility in improving neural PDE solver accuracy.
Abstract: Symmetry is widely applied in problems such as the design of equivariant networks and the discovery of governing equations, but in complex scenarios, it is not known in advance. Most previous symmetry discovery methods are limited to linear symmetries, and recent attempts to discover nonlinear symmetries fail to explicitly get the Lie algebra subspace. In this paper, we propose LieNLSD, which is, to our knowledge, the first method capable of determining the number of infinitesimal generators with nonlinear terms and their explicit expressions. We specify a function library for the infinitesimal group action and aim to solve for its coefficient matrix, proving that its prolongation formula for differential equations, which governs dynamic data, is also linear with respect to the coefficient matrix. By substituting the central differences of the data and the Jacobian matrix of the trained neural network into the infinitesimal criterion, we get a system of linear equations for the coefficient matrix, which can then be solved using SVD. On top quark tagging and a series of dynamic systems, LieNLSD shows qualitative advantages over existing methods and improves the long rollout accuracy of neural PDE solvers by over 20% while applying to guide data augmentation. Code and data are available at https://github.com/hulx2002/LieNLSD.
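The final step of the method reduces to a nullspace computation. A small sketch, where `M` is assumed to stack the linearized infinitesimal criterion evaluated on the data (central differences plus the trained network's Jacobian):

```python
import numpy as np

def nullspace_coefficients(M, tol=1e-8):
    """Solve M c ≈ 0 via SVD.

    Each near-zero singular value contributes one infinitesimal
    generator; the matching right-singular vector holds that generator's
    coefficients in the function library.
    """
    _, s, vt = np.linalg.svd(M, full_matrices=False)
    k = int((s < tol * s.max()).sum())          # dimension of the nullspace
    return vt[len(s) - k:]                      # (k, n_library_terms)
```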
[471] Compositional meta-learning through probabilistic task inference
Jacob J. W. Bakermans, Pablo Tano, Reidar Riveland, Charles Findling, Alexandre Pouget
Main category: cs.LG
TL;DR: A compositional meta-learning model that represents tasks as structured combinations of reusable computations, enabling rapid learning of new tasks from minimal examples through probabilistic inference without parameter updates.
Details
Motivation: To enable effective knowledge reuse from previous tasks for solving new tasks with minimal experience, addressing the meta-learning problem through compositional approaches.
Method: Learning a generative model that captures underlying components and their statistics shared across tasks, transforming new task learning into probabilistic inference without parameter updates.
Result: Successfully recovers ground truth components and statistics in rule learning and motor learning tasks, and demonstrates ability to quickly infer new solutions from single examples.
Conclusion: The framework combines neural network expressivity with probabilistic inference data-efficiency to achieve rapid compositional meta-learning.
Abstract: To solve a new task from minimal experience, it is essential to effectively reuse knowledge from previous tasks, a problem known as meta-learning. Compositional solutions, where common elements of computation are flexibly recombined into new configurations, are particularly well-suited for meta-learning. Here, we propose a compositional meta-learning model that explicitly represents tasks as structured combinations of reusable computations. We achieve this by learning a generative model that captures the underlying components and their statistics shared across a family of tasks. This approach transforms learning a new task into a probabilistic inference problem, which allows for finding solutions without parameter updates through highly constrained hypothesis testing. Our model successfully recovers ground truth components and statistics in rule learning and motor learning tasks. We then demonstrate its ability to quickly infer new solutions from just single examples. Together, our framework joins the expressivity of neural networks with the data-efficiency of probabilistic inference to achieve rapid compositional meta-learning.
[472] Universal Dynamic Regret and Constraint Violation Bounds for Constrained Online Convex Optimization
Subhamon Supantha, Abhishek Sinha
Main category: cs.LG
TL;DR: The paper presents two modular algorithms for Online Convex Optimization with adversarial constraints, achieving improved dynamic regret and constraint violation bounds.
Details
Motivation: To generalize the Online Convex Optimization framework to handle adversarial constraints where both cost and constraint functions are chosen arbitrarily by an adversary, without requiring common feasible points.
Method: Reduces the constrained learning problem to standard OCO using specially constructed surrogate cost functions, with two modular algorithms.
Result: Achieves universal dynamic regret and cumulative constraint violation bounds that improve upon state-of-the-art results.
Conclusion: The proposed approach successfully handles the most general case of adversarial constraints in OCO through reduction to standard OCO problems.
Abstract: We consider a generalization of the celebrated Online Convex Optimization (OCO) framework with online adversarial constraints. We present two algorithms having simple modular structures that yield universal dynamic regret and cumulative constraint violation bounds, improving upon the state-of-the-art results. Our results hold in the most general case when both the cost and constraint functions are chosen arbitrarily by an adversary, and the constraint functions need not contain any common feasible point. The results are established by reducing the constrained learning problem to an instance of the standard OCO problem with specially constructed surrogate cost functions.
[473] Multimodal Foundation Models for Early Disease Detection
Md Talha Mohsin, Ismail Abdulrashid
Main category: cs.LG
TL;DR: A multimodal foundation model using attention-based transformers to integrate diverse healthcare data sources (EHR, imaging, genetics, wearables) for early disease diagnosis across oncology, cardiology, and neurology.
Details
Motivation: Traditional diagnostic models analyze healthcare data sources in isolation, limiting their ability to identify cross-modal correlations essential for early disease diagnosis.
Method: Uses dedicated encoders to project each modality into a shared latent space, then combines them using multi-head attention and residual normalization. The architecture supports pretraining on multiple tasks for easy adaptation to new diseases and datasets.
Result: The framework demonstrates improved prediction accuracy for early detection tasks across multiple medical domains while incorporating data governance and model management tools for transparency and clinical interpretability.
Conclusion: The proposed multimodal foundation model advances toward unified precision diagnostics, enhancing prediction accuracy and supporting clinical decision-making.
Abstract: Healthcare generates diverse streams of data, including electronic health records (EHR), medical imaging, genetics, and ongoing monitoring from wearable devices. Traditional diagnostic models frequently analyze these sources in isolation, which constrains their capacity to identify cross-modal correlations essential for early disease diagnosis. Our research presents a multimodal foundation model that consolidates diverse patient data through an attention-based transformer framework. First, dedicated encoders project each modality into a shared latent space; the resulting representations are then fused using multi-head attention and residual normalization. The architecture is designed for multi-task pretraining, which makes it easy to adapt to new diseases and datasets with little extra work. We provide an experimental strategy that uses benchmark datasets in oncology, cardiology, and neurology, with the goal of testing early detection tasks. Beyond technical performance, the framework includes data governance and model management tools to improve transparency, reliability, and clinical interpretability. The proposed method works toward a unified foundation model for precision diagnostics that could improve prediction accuracy and support clinical decision-making.
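A minimal PyTorch sketch of the described fusion pattern (encoder and head sizes are illustrative):

```python
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    """Per-modality encoders project into a shared latent space; the
    modality tokens are then mixed with multi-head attention plus a
    residual LayerNorm."""
    def __init__(self, dims, d_model=256, n_heads=8):
        super().__init__()
        self.encoders = nn.ModuleList([nn.Linear(d, d_model) for d in dims])
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, 1)     # e.g. early-detection risk

    def forward(self, inputs):                # list of (B, d_i) tensors
        tokens = torch.stack([enc(x) for enc, x in
                              zip(self.encoders, inputs)], dim=1)  # (B, M, d)
        mixed, _ = self.attn(tokens, tokens, tokens)
        fused = self.norm(tokens + mixed)     # residual normalization
        return self.head(fused.mean(dim=1))   # pool modalities, predict
```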
[474] StockBench: Can LLM Agents Trade Stocks Profitably In Real-world Markets?
Yanxu Chen, Zijun Yao, Yantao Liu, Jin Ye, Jianing Yu, Lei Hou, Juanzi Li
Main category: cs.LG
TL;DR: StockBench is a contamination-free benchmark for evaluating LLM agents in realistic stock trading environments, showing that while most models struggle to beat buy-and-hold strategies, some demonstrate potential for better returns and risk management.
Details
Motivation: Existing financial benchmarks only test static knowledge through QA, failing to capture the dynamic, iterative nature of trading. The finance domain remains underexplored for LLM agents despite its economic importance.
Method: Agents receive daily market signals (prices, fundamentals, news) and make sequential buy/sell/hold decisions. Performance is evaluated using financial metrics like cumulative return, maximum drawdown, and Sortino ratio.
Result: Most LLM agents struggle to outperform buy-and-hold baseline, but several models show potential for higher returns and better risk management. Static financial knowledge doesn’t guarantee successful trading strategies.
Conclusion: StockBench highlights challenges and opportunities in developing LLM-powered financial agents, and is released as open-source to support reproducibility and advance future research.
Abstract: Large language models (LLMs) have recently demonstrated strong capabilities as autonomous agents, showing promise in reasoning, tool use, and sequential decision-making. While prior benchmarks have evaluated LLM agents in domains such as software engineering and scientific discovery, the finance domain remains underexplored, despite its direct relevance to economic value and high-stakes decision-making. Existing financial benchmarks primarily test static knowledge through question answering, but they fall short of capturing the dynamic and iterative nature of trading. To address this gap, we introduce StockBench, a contamination-free benchmark designed to evaluate LLM agents in realistic, multi-month stock trading environments. Agents receive daily market signals – including prices, fundamentals, and news – and must make sequential buy, sell, or hold decisions. Performance is assessed using financial metrics such as cumulative return, maximum drawdown, and the Sortino ratio. Our evaluation of state-of-the-art proprietary (e.g., GPT-5, Claude-4) and open-weight (e.g., Qwen3, Kimi-K2, GLM-4.5) models shows that while most LLM agents struggle to outperform the simple buy-and-hold baseline, several models demonstrate the potential to deliver higher returns and manage risk more effectively. These findings highlight both the challenges and opportunities in developing LLM-powered financial agents, showing that excelling at static financial knowledge tasks does not necessarily translate into successful trading strategies. We release StockBench as an open-source resource to support reproducibility and advance future research in this domain.
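The evaluation metrics are standard and easy to reproduce from a daily-return series. A small sketch, using one common convention for the Sortino downside deviation:

```python
import numpy as np

def trading_metrics(daily_returns, risk_free=0.0):
    """Cumulative return, maximum drawdown, and Sortino ratio from a
    series of simple daily returns."""
    r = np.asarray(daily_returns, dtype=float)
    equity = np.cumprod(1.0 + r)
    cumulative_return = equity[-1] - 1.0
    peak = np.maximum.accumulate(equity)
    max_drawdown = ((equity - peak) / peak).min()   # most negative dip
    # One common convention: downside deviation taken over all days.
    downside_dev = np.sqrt(np.mean(np.minimum(r - risk_free, 0.0) ** 2))
    sortino = (r.mean() - risk_free) / (downside_dev + 1e-12)
    return cumulative_return, max_drawdown, sortino
```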
[475] Test-Time Anchoring for Discrete Diffusion Posterior Sampling
Litu Rout, Andreas Lugmayr, Yasamin Jafarian, Srivatsan Varadharajan, Constantine Caramanis, Sanjay Shakkottai, Ira Kemelmacher-Shlizerman
Main category: cs.LG
TL;DR: APS enables posterior sampling using pretrained discrete diffusion models for image recovery from noisy measurements without retraining, overcoming limitations of existing discrete diffusion approaches.
Details
Motivation: To leverage discrete diffusion's advantages (unified categorical data modeling, faster inference, training-free Bayesian inference) for posterior sampling, while addressing challenges like sparse guidance signals and dimensionality issues in existing methods.
Method: Anchored Posterior Sampling (APS) with two innovations: quantized expectation for gradient-like guidance in discrete embedding space, and anchored remasking for adaptive decoding.
Result: Achieves state-of-the-art performance among discrete diffusion samplers on linear and nonlinear inverse problems, and demonstrates benefits in training-free stylization and text-guided editing.
Conclusion: APS provides an effective approach for posterior sampling using discrete diffusion foundation models, overcoming key limitations of existing methods and enabling practical applications.
Abstract: We study the problem of posterior sampling using pretrained discrete diffusion foundation models, aiming to recover images from noisy measurements without retraining task-specific models. While diffusion models have achieved remarkable success in generative modeling, most advances rely on continuous Gaussian diffusion. In contrast, discrete diffusion offers a unified framework for jointly modeling categorical data such as text and images. Beyond unification, discrete diffusion provides faster inference, finer control, and principled training-free Bayesian inference, making it particularly well-suited for posterior sampling. However, existing approaches to discrete diffusion posterior sampling face severe challenges: derivative-free guidance yields sparse signals, continuous relaxations limit applicability, and split Gibbs samplers suffer from the curse of dimensionality. To overcome these limitations, we introduce Anchored Posterior Sampling (APS) for masked diffusion foundation models, built on two key innovations – quantized expectation for gradient-like guidance in discrete embedding space, and anchored remasking for adaptive decoding. Our approach achieves state-of-the-art performance among discrete diffusion samplers across linear and nonlinear inverse problems on the standard benchmarks. We further demonstrate the benefits of our approach in training-free stylization and text-guided editing.
[476] Randomized Gradient Subspaces for Efficient Large Language Model Training
Sahar Rajabi, Nayeema Nonta, Samanvay Vajpayee, Sirisha Rambhatla
Main category: cs.LG
TL;DR: The paper analyzes gradient space dynamics in LLM training and introduces GrassWalk/GrassJump algorithms that exploit subspace structure to achieve state-of-the-art memory savings while improving performance.
Details
Motivation: Training large language models is bottlenecked by extreme memory demands, particularly from optimizer states. Recent works use gradient projection into low-dimensional subspaces, but the dynamics of gradient space and its underlying subspaces need deeper analysis.
Method: Analyzed gradient space dynamics and introduced randomized algorithms GrassWalk and GrassJump that exploit subspace structure. Found that while small subspaces capture most gradient energy, significant portions remain in the residual bulk, and core subspace influence diminishes over time and in deeper layers.
Result: The algorithms achieve state-of-the-art memory savings while improving performance on LLaMA-1B and LLaMA-7B pretraining. Also observed that gradient space exhibits near-flat curvature, requiring algorithms that account for this geometry.
Conclusion: Understanding gradient space dynamics enables more effective subspace-based optimization methods. The proposed GrassWalk and GrassJump algorithms successfully exploit these insights to reduce memory demands while maintaining or improving model performance.
Abstract: Training large language models (LLMs) is often bottlenecked by extreme memory demands, with optimizer states dominating the footprint. Recent work mitigates this cost by projecting gradients into low-dimensional subspaces using sophisticated update strategies. In this paper, we analyze the dynamics of gradient space and its underlying subspaces. We find that while a small subspace captures most gradient energy, a significant portion still resides in the residual bulk; moreover, the influence of the core subspace diminishes over time and in deeper layers. We also observe that the gradient space exhibits near-flat curvature, calling for algorithms that explicitly account for this geometry. Motivated by these insights, we introduce a suite of randomized algorithms, GrassWalk and GrassJump, which exploit this subspace structure and achieve state-of-the-art memory savings while improving performance on LLaMA-1B and LLaMA-7B pretraining.
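A hedged sketch of the family of updates such methods build on: Adam-style moments kept in a random r-dimensional subspace, with periodic basis resampling; the actual GrassWalk/GrassJump projections and schedules are specified in the paper. Bias correction is omitted for brevity.

```python
import torch

def subspace_adam_step(p, grad, state, r=64, lr=1e-3, beta1=0.9,
                       beta2=0.999, eps=1e-8, resample_every=200):
    """Random-subspace optimizer step for a 2D weight `p` of shape (m, n).

    Moments live in an r-dimensional subspace, cutting optimizer state
    from O(mn) to O(rn); periodically resampling the basis lets updates
    eventually cover the residual bulk.
    """
    if state.get("step", 0) % resample_every == 0:
        q, _ = torch.linalg.qr(torch.randn(p.shape[0], r))
        state.update(P=q, m=torch.zeros(r, p.shape[1]),
                     v=torch.zeros(r, p.shape[1]))
    P, m, v = state["P"], state["m"], state["v"]
    g = P.t() @ grad                            # project gradient: (r, n)
    m.mul_(beta1).add_(g, alpha=1 - beta1)
    v.mul_(beta2).addcmul_(g, g, value=1 - beta2)
    p.data -= lr * P @ (m / (v.sqrt() + eps))   # lift the update back
    state["step"] = state.get("step", 0) + 1
```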
[477] Are LLMs Better GNN Helpers? Rethinking Robust Graph Learning under Deficiencies with Iterative Refinement
Zhaoyan Wang, Zheng Gao, Arogya Kharel, In-Young Ko
Main category: cs.LG
TL;DR: This paper introduces RoGRAD, a novel framework that addresses graph deficiencies in GNNs using iterative retrieval-augmented contrastive refinement, outperforming both traditional GNN and LLM-enhanced methods.
Details
Motivation: Current GNNs struggle with real-world graph deficiencies, and there's limited understanding of how graph-native and LLM-enhanced methods perform under compound deficiencies. No comprehensive comparison exists between conventional approaches and recent LLM-on-graph frameworks.
Method: Proposed RoGRAD framework - an iterative paradigm using Retrieval-Augmented Generation (RAG) to inject retrieval-grounded augmentations. It provides class-consistent, diverse augmentations and enforces discriminative representations through iterative graph contrastive learning, transforming LLM augmentation from static to dynamic refinement.
Result: Extensive experiments show RoGRAD’s superiority over both conventional GNN- and LLM-enhanced baselines, achieving up to 82.43% average improvement in performance.
Conclusion: RoGRAD successfully addresses graph deficiencies through iterative refinement, challenging the assumption that LLM augmentation is consistently superior, and provides a more robust framework for graph learning.
Abstract: Graph Neural Networks (GNNs) are widely adopted in Web-related applications, serving as a core technique for learning from graph-structured data, such as text-attributed graphs. Yet in real-world scenarios, such graphs exhibit deficiencies that substantially undermine GNN performance. While prior GNN-based augmentation studies have explored robustness against individual imperfections, a systematic understanding of how graph-native and Large Language Models (LLMs) enhanced methods behave under compound deficiencies is still missing. Specifically, there has been no comprehensive investigation comparing conventional approaches and recent LLM-on-graph frameworks, leaving their merits unclear. To fill this gap, we conduct the first empirical study that benchmarks these two lines of methods across diverse graph deficiencies, revealing overlooked vulnerabilities and challenging the assumption that LLM augmentation is consistently superior. Building on empirical findings, we propose Robust Graph Learning via Retrieval-Augmented Contrastive Refinement (RoGRAD) framework. Unlike prior one-shot LLM-as-Enhancer designs, RoGRAD is the first iterative paradigm that leverages Retrieval-Augmented Generation (RAG) to inject retrieval-grounded augmentations by supplying class-consistent, diverse augmentations and enforcing discriminative representations through iterative graph contrastive learning. It transforms LLM augmentation for graphs from static signal injection into dynamic refinement. Extensive experiments demonstrate RoGRAD’s superiority over both conventional GNN- and LLM-enhanced baselines, achieving up to 82.43% average improvement.
[478] Continual Personalization for Diffusion Models
Yu-Chien Liao, Jr-Jen Chen, Chi-Pin Huang, Ci-Siang Lin, Meng-Lin Wu, Yu-Chiang Frank Wang
Main category: cs.LG
TL;DR: CNS is a novel approach for continual personalization of diffusion models that identifies and selectively fine-tunes concept-related neurons to prevent catastrophic forgetting while maintaining zero-shot generation capabilities.
Details
Motivation: Updating diffusion models incrementally is practical for real-world applications but computationally challenging, requiring methods that can personalize models without losing previous knowledge.
Method: Concept Neuron Selection (CNS) identifies neurons related to target concepts in diffusion models and fine-tunes them incrementally while preserving knowledge from previous concepts through joint optimization.
Result: CNS achieves state-of-the-art performance on real-world datasets with minimal parameter adjustments, outperforming previous methods in single and multi-concept personalization, and enables fusion-free operation reducing memory and processing time.
Conclusion: CNS provides an effective solution for continual personalization of diffusion models that balances performance, computational efficiency, and knowledge preservation in incremental learning scenarios.
Abstract: Updating diffusion models in an incremental setting would be practical in real-world applications yet computationally challenging. We present a novel learning strategy of Concept Neuron Selection (CNS), a simple yet effective approach to perform personalization in a continual learning scheme. CNS uniquely identifies neurons in diffusion models that are closely related to the target concepts. In order to mitigate catastrophic forgetting problems while preserving zero-shot text-to-image generation ability, CNS finetunes concept neurons in an incremental manner and jointly preserves knowledge learned of previous concepts. Evaluation on real-world datasets demonstrates that CNS achieves state-of-the-art performance with minimal parameter adjustments, outperforming previous methods in both single- and multi-concept personalization. CNS also achieves fusion-free operation, reducing memory storage and processing time for continual personalization.
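A speculative sketch of one way neuron selection could work, ranking parameters by gradient magnitude on a concept loss and masking updates elsewhere (the paper's actual selection criterion may differ):

```python
import torch

def select_concept_neurons(model, concept_loss, keep_ratio=0.001):
    """Hypothetical selection: rank parameters by gradient magnitude on
    the target-concept loss and build masks so only the top fraction is
    fine-tuned."""
    concept_loss.backward()
    masks = {}
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        k = max(1, int(keep_ratio * p.numel()))
        thresh = p.grad.abs().flatten().topk(k).values.min()
        masks[name] = (p.grad.abs() >= thresh).float()
    return masks  # multiply grads by masks[name] during fine-tuning
```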
[479] Multi-marginal temporal Schrödinger Bridge Matching for video generation from unpaired data
Thomas Gravier, Thomas Boyer, Auguste Genovesio
Main category: cs.LG
TL;DR: MMtSBM is a novel method for reconstructing temporal dynamics from static snapshots using multi-marginal Schrödinger bridges, achieving state-of-the-art performance in high-dimensional settings including transcriptomics and image data.
Details
Motivation: Many natural processes can only be observed through static snapshots, making it challenging to reconstruct their temporal evolution. Existing approaches have scalability issues in high dimensions and require restrictive assumptions.
Method: Extends Diffusion Schrödinger Bridge Matching to multiple marginals by deriving the Iterative Markovian Fitting algorithm in a novel factorized fashion, enabling video generation from unpaired data.
Result: MMtSBM retains theoretical properties on toy examples, achieves state-of-the-art performance on real datasets (including 100D transcriptomic trajectory inference), and successfully recovers couplings and dynamics in very high-dimensional image settings for the first time.
Conclusion: Multi-marginal Schrödinger bridges provide a practical and principled approach for recovering hidden dynamics from static data across various domains.
Abstract: Many natural dynamic processes – such as in vivo cellular differentiation or disease progression – can only be observed through the lens of static sample snapshots. While challenging, reconstructing their temporal evolution to decipher underlying dynamic properties is of major interest to scientific research. Existing approaches enable data transport along a temporal axis but are poorly scalable in high dimension and require restrictive assumptions to be met. To address these issues, we propose Multi-Marginal temporal Schrödinger Bridge Matching (MMtSBM) for video generation from unpaired data, extending the theoretical guarantees and empirical efficiency of Diffusion Schrödinger Bridge Matching (arXiv:2303.16852) by deriving the Iterative Markovian Fitting algorithm for multiple marginals in a novel factorized fashion. Experiments show that MMtSBM retains theoretical properties on toy examples, achieves state-of-the-art performance on real-world datasets such as transcriptomic trajectory inference in 100 dimensions, and for the first time recovers couplings and dynamics in very high-dimensional image settings. Our work establishes multi-marginal Schrödinger bridges as a practical and principled approach for recovering hidden dynamics from static data.
[480] ExGRPO: Learning to Reason from Experience
Runzhe Zhan, Yafu Li, Zhi Wang, Xiaoye Qu, Dongrui Liu, Jing Shao, Derek F. Wong, Yu Cheng
Main category: cs.LG
TL;DR: ExGRPO improves RLVR by reusing valuable reasoning experiences based on correctness and entropy, achieving better performance and stability than standard on-policy methods.
Details
Motivation: Standard on-policy RLVR discards experiences after a single use, causing inefficiency and instability. The value of reasoning experiences and their characteristics in learning dynamics is underexplored.
Method: Proposed ExGRPO framework that identifies rollout correctness and entropy as experience value indicators, organizes experiences into groups, and uses a mixed-policy objective to balance exploration and experience exploitation.
Result: Consistent improvement on mathematical/general benchmarks with +3.5/7.6 point gains over on-policy RLVR across 1.5B-8B parameter models. Stabilizes training where on-policy methods fail.
Conclusion: Principled experience management is crucial for efficient and scalable RLVR, with experience characteristics playing a key role in reasoning model learning dynamics.
Abstract: Reinforcement learning from verifiable rewards (RLVR) is an emerging paradigm for improving the reasoning ability of large language models. However, standard on-policy training discards rollout experiences after a single update, leading to computational inefficiency and instability. While prior work on RL has highlighted the benefits of reusing past experience, the role of experience characteristics in shaping learning dynamics of large reasoning models remains underexplored. In this paper, we are the first to investigate what makes a reasoning experience valuable and identify rollout correctness and entropy as effective indicators of experience value. Based on these insights, we propose ExGRPO (Experiential Group Relative Policy Optimization), a framework that organizes and prioritizes valuable experiences, and employs a mixed-policy objective to balance exploration with experience exploitation. Experiments on five backbone models (1.5B-8B parameters) show that ExGRPO consistently improves reasoning performance on mathematical/general benchmarks, with an average gain of +3.5/7.6 points over on-policy RLVR. Moreover, ExGRPO stabilizes training on both stronger and weaker models where on-policy methods fail. These results highlight principled experience management as a key ingredient for efficient and scalable RLVR.
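A speculative sketch of experience prioritization consistent with the stated indicators; the exact weighting is ExGRPO's design choice and is not reproduced here:

```python
import math
import random

def experience_priority(correctness, entropy, target_acc=0.5):
    """Hypothetical scoring: mid-difficulty prompts (rollout correctness
    near 0.5) with low-entropy rollouts are treated as most valuable,
    loosely following the paper's indicators."""
    difficulty_bonus = 1.0 - abs(correctness - target_acc)
    confidence_bonus = math.exp(-entropy)
    return difficulty_bonus * confidence_bonus

def sample_replay(buffer, k):
    """Priority-weighted sampling from stored experience groups, each a
    dict with 'acc' and 'entropy' fields."""
    weights = [experience_priority(e["acc"], e["entropy"]) for e in buffer]
    return random.choices(buffer, weights=weights, k=k)
```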
[481] Equilibrium Matching: Generative Modeling with Implicit Energy-Based Models
Runqian Wang, Yilun Du
Main category: cs.LG
TL;DR: Equilibrium Matching (EqM) is a generative modeling framework that learns the equilibrium gradient of an implicit energy landscape, enabling optimization-based sampling with adjustable parameters and achieving state-of-the-art performance on ImageNet 256×256.
Details
Motivation: To overcome limitations of traditional diffusion and flow-based models that use non-equilibrium, time-conditional dynamics, by adopting an equilibrium dynamics perspective for more flexible and efficient generation.
Method: Discards time-conditional dynamics and learns the equilibrium gradient of an implicit energy landscape. Uses optimization-based sampling with gradient descent, adjustable step sizes, adaptive optimizers, and adaptive compute during inference.
Result: Achieves FID of 1.90 on ImageNet 256×256, surpassing diffusion/flow models. Also handles tasks like partially noised image denoising, OOD detection, and image composition.
Conclusion: EqM provides a unified equilibrium landscape that bridges flow and energy-based models, offering optimization-driven inference and theoretical justification for learning from the data manifold.
Abstract: We introduce Equilibrium Matching (EqM), a generative modeling framework built from an equilibrium dynamics perspective. EqM discards the non-equilibrium, time-conditional dynamics in traditional diffusion and flow-based generative models and instead learns the equilibrium gradient of an implicit energy landscape. Through this approach, we can adopt an optimization-based sampling process at inference time, where samples are obtained by gradient descent on the learned landscape with adjustable step sizes, adaptive optimizers, and adaptive compute. EqM surpasses the generation performance of diffusion/flow models empirically, achieving an FID of 1.90 on ImageNet 256$\times$256. EqM is also theoretically justified to learn and sample from the data manifold. Beyond generation, EqM is a flexible framework that naturally handles tasks including partially noised image denoising, OOD detection, and image composition. By replacing time-conditional velocities with a unified equilibrium landscape, EqM offers a tighter bridge between flow and energy-based models and a simple route to optimization-driven inference.
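Because EqM's sampler is just optimization on the learned landscape, it can be sketched in a few lines; `grad_fn` stands in for the trained network (which outputs the equilibrium gradient, with no time conditioning), and the step schedule is an arbitrary choice.

```python
import torch

@torch.no_grad()
def eqm_sample(grad_fn, shape, n_steps=100, step_size=0.01, device="cpu"):
    """Optimization-based sampling on a learned equilibrium landscape:
    start from noise and descend along the predicted energy gradient."""
    x = torch.randn(shape, device=device)
    for _ in range(n_steps):
        g = grad_fn(x)          # predicted gradient of the implicit energy
        x = x - step_size * g   # plain gradient descent; step size is adjustable
    return x

# Usage with a stand-in network: the identity map is the gradient of a
# quadratic bowl, so iterates shrink toward the origin. Adaptive optimizers
# or extra steps ("adaptive compute") can replace the fixed loop.
net = lambda x: x
sample = eqm_sample(net, (4, 3, 32, 32), n_steps=50)
```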
[482] A Methodology for Transparent Logic-Based Classification Using a Multi-Task Convolutional Tsetlin Machine
Mayur Kishor Shende, Ole-Christoffer Granmo, Runar Helin, Vladimir I. Zadorozhny, Rishad Shafik
Main category: cs.LG
TL;DR: The paper explores applying Tsetlin Machine architecture to large-scale multi-channel image classification, achieving competitive performance while maintaining interpretability through local and global pattern visualization.
Details
Motivation: To extend the Tsetlin Machine's interpretable machine learning paradigm to complex, large-scale RGB image classification tasks while maintaining competitive performance against deep learning models.
Method: Proposed methodology using convolutional Tsetlin Machine with finite-state automata and propositional logic to generate local interpretations and global class representations that visualize captured patterns as images.
Result: Achieved 98.5% accuracy on MNIST and 86.56% F1-score on CelebA (compared to 88.07% for ResNet50), showing competitive performance while maintaining interpretability in complex training environments.
Conclusion: Tsetlin Machines can perform competitively with deep learning models on complex datasets while providing interpretable insights through visualized patterns, enabling better understanding and application to diverse datasets.
Abstract: The Tsetlin Machine (TM) is a novel machine learning paradigm that employs finite-state automata for learning and utilizes propositional logic to represent patterns. Due to its simplistic approach, TMs are inherently more interpretable than learning algorithms based on Neural Networks. The Convolutional TM has shown comparable performance on various datasets such as MNIST, K-MNIST, F-MNIST and CIFAR-2. In this paper, we explore the applicability of the TM architecture for large-scale multi-channel (RGB) image classification. We propose a methodology to generate both local interpretations and global class representations. The local interpretations can be used to explain the model predictions while the global class representations aggregate important patterns for each class. These interpretations summarize the knowledge captured by the convolutional clauses, which can be visualized as images. We evaluate our methods on MNIST and CelebA datasets, using models that achieve 98.5% accuracy on MNIST and 86.56% F1-score on CelebA (compared to 88.07% for ResNet50) respectively. We show that the TM performs competitively to this deep learning model while maintaining its interpretability, even in large-scale complex training environments. This contributes to a better understanding of TM clauses and provides insights into how these models can be applied to more complex and diverse datasets.
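As background for the interpretability claims, a Tsetlin Machine clause is simply a conjunction of input literals, and a class score is a vote between positive- and negative-polarity clauses. A textbook-style toy evaluation, with made-up input and clauses:

```python
def eval_clause(literals, x):
    """A TM clause is a conjunction of literals, where each literal is an
    input bit x[i] or its negation. `literals` is a list of
    (index, negated) pairs selected by the learned automata."""
    return all((not x[i]) if neg else x[i] for i, neg in literals)

def class_score(pos_clauses, neg_clauses, x):
    """Per-class vote: positive-polarity clauses add evidence for the class,
    negative-polarity clauses subtract it. Visualizing which literals fire
    is the basis of the local/global interpretations described above."""
    return (sum(eval_clause(c, x) for c in pos_clauses)
            - sum(eval_clause(c, x) for c in neg_clauses))

x = [1, 0, 1, 1]                    # a binarized input patch
pos = [[(0, False), (1, True)]]     # clause: x0 AND NOT x1
neg = [[(2, True)]]                 # clause: NOT x2
print(class_score(pos, neg, x))     # 1 - 0 = 1
```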
[483] Tree-based Dialogue Reinforced Policy Optimization for Red-Teaming Attacks
Ruohao Guo, Afshin Oroojlooy, Roshan Sridhar, Miguel Ballesteros, Alan Ritter, Dan Roth
Main category: cs.LG
TL;DR: DialTree-RPO is an RL framework with tree search that autonomously discovers diverse multi-turn attack strategies against LLMs, achieving 25.9% higher attack success rates than previous methods.
Details
Motivation: Current LLMs remain vulnerable to multi-turn adversarial attacks, but existing approaches rely on manual red-teaming or pre-defined templates, failing to explore the vast space of possible multi-turn attack strategies that emerge from complex dialogue dynamics.
Method: An on-policy reinforcement learning framework integrated with tree search that treats dialogue as a sequential decision-making problem, enabling systematic exploration of multi-turn attacks without manually curated data.
Result: Achieves more than 25.9% higher Attack Success Rate (ASR) across 10 target models compared to previous state-of-the-art approaches, and effectively uncovers new attack strategies by learning optimal dialogue policies.
Conclusion: The proposed framework successfully addresses the critical gap in multi-turn attack discovery and demonstrates significantly improved effectiveness in finding safety vulnerabilities in LLMs through autonomous exploration of dialogue strategies.
Abstract: Despite recent rapid progress in AI safety, current large language models remain vulnerable to adversarial attacks in multi-turn interaction settings, where attackers strategically adapt their prompts across conversation turns, posing a challenge that is both more critical and more realistic. Existing approaches that discover safety vulnerabilities either rely on manual red-teaming with human experts or employ automated methods using pre-defined templates and human-curated attack data, with most focusing on single-turn attacks. However, these methods do not explore the vast space of possible multi-turn attacks, failing to consider novel attack trajectories that emerge from complex dialogue dynamics and strategic conversation planning. This gap is particularly critical given recent findings that LLMs exhibit significantly higher vulnerability to multi-turn attacks compared to single-turn attacks. We propose DialTree-RPO, an on-policy reinforcement learning framework integrated with tree search that autonomously discovers diverse multi-turn attack strategies by treating the dialogue as a sequential decision-making problem, enabling systematic exploration without manually curated data. Through extensive experiments, our approach not only achieves more than 25.9% higher ASR across 10 target models compared to previous state-of-the-art approaches, but also effectively uncovers new attack strategies by learning optimal dialogue policies that maximize attack success across multiple turns.
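As a rough picture of dialogue-as-tree-search, the sketch below expands several candidate attacker turns per level and follows the branch that a safety judge scores highest; `policy`, `target_llm`, and `judge` are hypothetical stand-ins, and the actual DialTree-RPO couples this exploration with on-policy RL updates.

```python
class DialogueNode:
    """One node per conversation state in the attack tree; children are
    candidate next attacker turns sampled from the policy."""
    def __init__(self, history, parent=None):
        self.history, self.parent = history, parent
        self.children, self.reward = [], 0.0

def expand_tree(root, policy, target_llm, judge, branching=3, depth=5):
    """Greedy best-first expansion of multi-turn attacks. `judge` is assumed
    to return a score in [0, 1], with 1 meaning an unsafe target response."""
    node = root
    for _ in range(depth):
        for _ in range(branching):
            attack_turn = policy(node.history)
            reply = target_llm(node.history + [attack_turn])
            child = DialogueNode(node.history + [attack_turn, reply], parent=node)
            child.reward = judge(reply)      # higher = closer to unsafe output
            node.children.append(child)
        node = max(node.children, key=lambda c: c.reward)
        if node.reward >= 1.0:               # attack succeeded
            break
    return node  # walking parents back to the root gives the discovered attack
```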
[484] StelLA: Subspace Learning in Low-rank Adaptation using Stiefel Manifold
Zhizhong Li, Sina Sajadmanesh, Jingtao Li, Lingjuan Lyu
Main category: cs.LG
TL;DR: A geometry-aware extension of LoRA that uses a three-factor decomposition USV⊤ with orthonormal constraints on U and V via Stiefel manifold optimization, achieving superior performance across various tasks.
Details
Motivation: LoRA lags behind full fine-tuning due to insufficient exploitation of geometric structure in low-rank manifolds.
Method: Three-factor decomposition USV⊤ with U and V constrained to Stiefel manifold for orthonormality, using geometric optimization to convert Euclidean optimizers to Riemannian ones.
Result: Superior performance across commonsense reasoning, math/code generation, image classification, and image generation tasks compared to state-of-the-art LoRA variants.
Conclusion: The geometry-aware approach effectively improves LoRA performance while maintaining compatibility with existing fine-tuning pipelines.
Abstract: Low-rank adaptation (LoRA) has been widely adopted as a parameter-efficient technique for fine-tuning large-scale pre-trained models. However, it still lags behind full fine-tuning in performance, partly due to its insufficient exploitation of the geometric structure underlying low-rank manifolds. In this paper, we propose a geometry-aware extension of LoRA that uses a three-factor decomposition $USV^\top$. Analogous to the structure of singular value decomposition (SVD), it separates the adapter’s input and output subspaces, $V$ and $U$, from the scaling factor $S$. Our method constrains $U$ and $V$ to lie on the Stiefel manifold, ensuring their orthonormality throughout the training. To optimize on the Stiefel manifold, we employ a flexible and modular geometric optimization design that converts any Euclidean optimizer to a Riemannian one. It enables efficient subspace learning while remaining compatible with existing fine-tuning pipelines. Empirical results across a wide range of downstream tasks, including commonsense reasoning, math and code generation, image classification, and image generation, demonstrate the superior performance of our approach against the recent state-of-the-art variants of LoRA. Code is available at https://github.com/SonyResearch/stella.
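The geometric core, converting a Euclidean gradient step into a Riemannian one on the Stiefel manifold, has a standard recipe: project the gradient onto the tangent space, then retract with a QR decomposition. A minimal sketch of one such step (the paper's modular design admits other optimizers and retractions):

```python
import torch

def stiefel_step(param, euclid_grad, lr):
    """One Riemannian SGD step on the Stiefel manifold {X : X^T X = I}.
    Projects the Euclidean gradient onto the tangent space at `param`,
    then retracts back onto the manifold via QR."""
    X, G = param, euclid_grad
    # Tangent-space projection: remove the component that leaves the manifold.
    sym = (X.T @ G + G.T @ X) / 2
    riem_grad = G - X @ sym
    # Retraction: QR of the updated point restores orthonormal columns.
    Q, R = torch.linalg.qr(X - lr * riem_grad)
    # Fix column signs so the retraction varies continuously with the step.
    return Q * torch.sign(torch.diagonal(R)).unsqueeze(0)

U = torch.linalg.qr(torch.randn(64, 8))[0]   # orthonormal init, rank-8 adapter
grad = torch.randn(64, 8)                    # gradient from the LoRA loss
U = stiefel_step(U, grad, lr=1e-2)
print(torch.allclose(U.T @ U, torch.eye(8), atol=1e-5))  # True
```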
[485] KAIROS: Unified Training for Universal Non-Autoregressive Time Series Forecasting
Kuiye Ding, Fanda Fan, Zheya Wang, Hongxiao Li, Yifan Wang, Lei Wang, Chunjie Luo, Jianfeng Zhan
Main category: cs.LG
TL;DR: KAIROS is a non-autoregressive time series forecasting framework that directly models segment-level multi-peak distributions, achieving fast inference and strong zero-shot generalization on web applications.
Details
Motivation: Web applications require fast-responsive time series forecasting for real-time decision making in resource planning, cache placement, and anomaly response, but existing autoregressive approaches suffer from error accumulation while non-autoregressive models produce over-smoothed predictions.
Method: KAIROS uses a non-autoregressive framework that directly models segment-level multi-peak distributions, avoiding error accumulation and enabling just-in-time inference.
Result: KAIROS demonstrates strong zero-shot generalization on six benchmarks, achieving forecasting performance comparable to state-of-the-art foundation models with similar scale, but at a fraction of their inference cost.
Conclusion: KAIROS highlights non-autoregressive design as a scalable paradigm for foundation models in time series, offering efficient forecasting for web applications.
Abstract: In the World Wide Web, reliable time series forecasts provide the forward-looking signals that drive resource planning, cache placement, and anomaly response, enabling platforms to operate efficiently as user behavior and content distributions evolve. Compared with other domains, time series forecasting for Web applications requires much faster responsiveness to support real-time decision making. We present KAIROS, a non-autoregressive time series forecasting framework that directly models segment-level multi-peak distributions. Unlike autoregressive approaches, KAIROS avoids error accumulation and achieves just-in-time inference, while improving over existing non-autoregressive models that collapse to over-smoothed predictions. Trained on a large-scale corpus, KAIROS demonstrates strong zero-shot generalization on six widely used benchmarks, delivering forecasting performance comparable to state-of-the-art foundation models with similar scale, at a fraction of their inference cost. Beyond empirical results, KAIROS highlights the importance of non-autoregressive design as a scalable paradigm for foundation models in time series.
[486] Interactive Training: Feedback-Driven Neural Network Optimization
Wentao Zhang, Yang Young Lu, Yuntian Deng
Main category: cs.LG
TL;DR: Interactive Training is a framework enabling real-time human or AI intervention in neural network training through a control server, allowing dynamic adjustment of hyperparameters, data, and checkpoints to improve stability and adaptability.
Details
Motivation: Traditional neural network training follows fixed optimization recipes without flexibility to respond to instabilities or emerging issues during training.
Method: Uses a control server to mediate communication between users/AI agents and training process, enabling dynamic adjustment of optimizer hyperparameters, training data, and model checkpoints in real-time.
Result: Through three case studies, achieves superior training stability, reduced sensitivity to initial hyperparameters, and improved adaptability to evolving user needs.
Conclusion: Paves the way for autonomous AI agents to monitor training logs, proactively resolve instabilities, and optimize training dynamics.
Abstract: Traditional neural network training typically follows fixed, predefined optimization recipes, lacking the flexibility to dynamically respond to instabilities or emerging training issues. In this paper, we introduce Interactive Training, an open-source framework that enables real-time, feedback-driven intervention during neural network training by human experts or automated AI agents. At its core, Interactive Training uses a control server to mediate communication between users or agents and the ongoing training process, allowing users to dynamically adjust optimizer hyperparameters, training data, and model checkpoints. Through three case studies, we demonstrate that Interactive Training achieves superior training stability, reduced sensitivity to initial hyperparameters, and improved adaptability to evolving user needs, paving the way toward a future training paradigm where AI agents autonomously monitor training logs, proactively resolve instabilities, and optimize training dynamics.
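The control-server pattern can be approximated with a message queue that the training loop drains between steps. The JSON message format and the `set_lr` field below are illustrative assumptions; the actual framework mediates richer interventions (data and checkpoint swaps) through a server.

```python
import json
import queue

control_queue = queue.Queue()  # stands in for the paper's control server

def training_loop(model, optimizer, data_iter, max_steps=10_000):
    """Training loop that applies control messages between steps, so a human
    or an AI agent can intervene without restarting the run."""
    for step in range(max_steps):
        # Drain any pending interventions before the next optimization step.
        while not control_queue.empty():
            msg = json.loads(control_queue.get())
            if "set_lr" in msg:
                for group in optimizer.param_groups:
                    group["lr"] = msg["set_lr"]
        batch = next(data_iter)
        loss = model(batch)          # stand-in: model returns its loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# An external monitor (human or agent) can intervene at any time:
control_queue.put(json.dumps({"set_lr": 1e-4}))
```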
[487] Lower Bounds on Adversarial Robustness for Multiclass Classification with General Loss Functions
Camilo Andrés García Trillos, Nicolás García Trillos
Main category: cs.LG
TL;DR: The paper develops dual and barycentric reformulations for adversarially robust multiclass classification under arbitrary loss functions, enabling efficient computation of sharp lower bounds for adversarial risks.
Details
Motivation: To extend adversarial robustness analysis beyond the 0-1 loss setting to more practical loss functions like cross-entropy, power-form losses, and quadratic loss, facilitating robust classifier design in multiclass settings.
Method: Derives dual and barycentric reformulations of robust risk minimization problems, providing explicit characterizations for cross-entropy loss, power-form losses, and quadratic loss. Connects adversarial robustness to α-fair packing problems and generalized barycenter problems with entropy penalties.
Result: Enables efficient computation of sharp lower bounds for adversarial risks, with numerical experiments showing tighter bounds for cross-entropy loss compared to previous methods.
Conclusion: The reformulations provide powerful tools for analyzing adversarial robustness beyond 0-1 loss, uncovering connections to optimization problems and entropy-based penalties, with practical applications in robust classifier design.
Abstract: We consider adversarially robust classification in a multiclass setting under arbitrary loss functions and derive dual and barycentric reformulations of the corresponding learner-agnostic robust risk minimization problem. We provide explicit characterizations for important cases such as the cross-entropy loss, loss functions with a power form, and the quadratic loss, thereby extending available results for the 0-1 loss. These reformulations enable efficient computation of sharp lower bounds for adversarial risks and facilitate the design of robust classifiers beyond the 0-1 loss setting. Our paper uncovers interesting connections between adversarial robustness, $\alpha$-fair packing problems, and generalized barycenter problems for arbitrary positive measures where Kullback-Leibler and Tsallis entropies are used as penalties. Our theoretical results are accompanied by illustrative numerical experiments where we obtain tighter lower bounds for adversarial risks with the cross-entropy loss function.
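For reference, the learner-agnostic object that the dual and barycentric reformulations target is the standard adversarially robust risk; in our (illustrative) notation, with loss $\ell$, data distribution $\mu$, and perturbation budget $\varepsilon$:

```latex
\inf_{f}\; \mathbb{E}_{(x,y)\sim\mu}\!\left[\;\sup_{\|\tilde{x}-x\|\le\varepsilon}\; \ell\big(f(\tilde{x}),\, y\big)\right]
```

The paper's contribution is to rewrite this problem, for losses beyond 0-1, in dual and barycentric forms whose values can be computed efficiently and bound the adversarial risk from below.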
[488] Moon: A Modality Conversion-based Efficient Multivariate Time Series Anomaly Detection
Yuanyuan Yao, Yuhan Shi, Lu Chen, Ziquan Fang, Yunjun Gao, Leong Hou U, Yushuai Li, Tianyi Li
Main category: cs.LG
TL;DR: Moon is a supervised modality conversion framework for multivariate time series anomaly detection that converts numeric time series to images using MV-MTF, integrates both data types via Multimodal-CNN with parameter sharing, and provides SHAP-based anomaly explanations.
Details
Motivation: Existing MTS anomaly detection methods face challenges: unsupervised methods rely on error thresholds causing inaccuracies, semi-supervised methods underuse anomaly labels, and supervised methods fail to capture local relationships with high computational costs and data scarcity.
Method: Proposes Moon framework with three components: (1) MV-MTF converts numeric time series to image representations capturing variable and timestamp relationships, (2) Multimodal-CNN with parameter sharing integrates numeric and image data through feature fusion, (3) SHAP-based anomaly explainer identifies key contributing variables.
Result: Extensive experiments on six real-world datasets show Moon outperforms six state-of-the-art methods by up to 93% in efficiency, 4% in accuracy, and 10.8% in interpretation performance.
Conclusion: Moon effectively addresses limitations of existing MTS anomaly detection methods by combining modality conversion, multimodal integration, and interpretable analysis, achieving superior efficiency, accuracy, and interpretability.
Abstract: Multivariate time series (MTS) anomaly detection identifies abnormal patterns where each timestamp contains multiple variables. Existing MTS anomaly detection methods fall into three categories: reconstruction-based, prediction-based, and classifier-based methods. However, these methods face three key challenges: (1) Unsupervised learning methods, such as reconstruction-based and prediction-based methods, rely on error thresholds, which can lead to inaccuracies; (2) Semi-supervised methods mainly model normal data and often underuse anomaly labels, limiting detection of subtle anomalies; (3) Supervised learning methods, such as classifier-based approaches, often fail to capture local relationships, incur high computational costs, and are constrained by the scarcity of labeled data. To address these limitations, we propose Moon, a supervised modality conversion-based multivariate time series anomaly detection framework. Moon enhances the efficiency and accuracy of anomaly detection while providing detailed anomaly analysis reports. First, Moon introduces a novel multivariate Markov Transition Field (MV-MTF) technique to convert numeric time series data into image representations, capturing relationships across variables and timestamps. Since numeric data retains unique patterns that cannot be fully captured by image conversion alone, Moon employs a Multimodal-CNN to integrate numeric and image data through a feature fusion model with parameter sharing, enhancing training efficiency. Finally, a SHAP-based anomaly explainer identifies key variables contributing to anomalies, improving interpretability. Extensive experiments on six real-world MTS datasets demonstrate that Moon outperforms six state-of-the-art methods by up to 93% in efficiency, 4% in accuracy, and 10.8% in interpretation performance.
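The conversion step builds on the Markov Transition Field, which is easiest to state for a single variable: quantile-bin the series, estimate bin-to-bin transition probabilities, and paint the probability for each pair of timestamps into an image. A minimal univariate sketch (MV-MTF additionally captures cross-variable relationships):

```python
import numpy as np

def markov_transition_field(series, n_bins=8):
    """Classic univariate Markov Transition Field: quantile-bin the series,
    estimate first-order transition probabilities between bins, then place
    the transition probability of (bin at time i, bin at time j) at pixel
    (i, j), yielding a (T, T) image."""
    edges = np.quantile(series, np.linspace(0, 1, n_bins + 1)[1:-1])
    q = np.digitize(series, edges)                     # bin index per timestamp
    W = np.zeros((n_bins, n_bins))
    for a, b in zip(q[:-1], q[1:]):                    # count transitions
        W[a, b] += 1
    W /= np.maximum(W.sum(axis=1, keepdims=True), 1)   # row-normalize
    return W[np.ix_(q, q)]

img = markov_transition_field(np.sin(np.linspace(0, 6 * np.pi, 128)))
print(img.shape)  # (128, 128)
```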
[489] Private Federated Multiclass Post-hoc Calibration
Samuel Maddock, Graham Cormode, Carsten Maple
Main category: cs.LG
TL;DR: This paper introduces federated calibration methods for machine learning models in FL settings, addressing the overlooked need for probability calibration in privacy-preserving distributed learning environments.
Details
Motivation: Calibration is crucial for reliable decision-making in applications like healthcare and finance, but federated private calibration has been largely overlooked despite FL's use in these domains where calibration is strongly required.
Method: Transferred traditional centralized calibration methods (histogram binning and temperature scaling) into federated environments and developed new methods to handle client heterogeneity. Proposed strategies for both standard FL and user-level Differential Privacy settings.
Result: Federation and DP impact calibration accuracy, with degradation commonly observed under heterogeneity. Federated temperature scaling works best for DP-FL, while weighted binning approach is best when DP is not required.
Conclusion: The integration of post-hoc model calibration techniques within FL is feasible and necessary, with different methods performing optimally depending on whether Differential Privacy is applied.
Abstract: Calibrating machine learning models so that predicted probabilities better reflect the true outcome frequencies is crucial for reliable decision-making across many applications. In Federated Learning (FL), the goal is to train a global model on data which is distributed across multiple clients and cannot be centralized due to privacy concerns. FL is applied in key areas such as healthcare and finance where calibration is strongly required, yet federated private calibration has been largely overlooked. This work introduces the integration of post-hoc model calibration techniques within FL. Specifically, we transfer traditional centralized calibration methods such as histogram binning and temperature scaling into federated environments and define new methods to operate them under strong client heterogeneity. We study (1) a federated setting and (2) a user-level Differential Privacy (DP) setting and demonstrate how both federation and DP impact calibration accuracy. We propose strategies to mitigate degradation commonly observed under heterogeneity, and our findings highlight that our federated temperature scaling works best for DP-FL whereas our weighted binning approach is best when DP is not required.
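Temperature scaling is compact enough to sketch end to end: each client fits a scalar temperature on held-out logits, and the server aggregates. The sample-weighted average below is an illustrative aggregation rule, and the DP variant (noisy aggregates) is omitted.

```python
import torch

def fit_temperature(logits, labels, n_iter=200, lr=0.01):
    """Standard post-hoc temperature scaling: find a scalar T minimizing
    the NLL of softmax(logits / T) on held-out data."""
    log_T = torch.zeros(1, requires_grad=True)   # optimize log T for positivity
    opt = torch.optim.Adam([log_T], lr=lr)
    for _ in range(n_iter):
        opt.zero_grad()
        loss = torch.nn.functional.cross_entropy(logits / log_T.exp(), labels)
        loss.backward()
        opt.step()
    return log_T.exp().item()

def federated_temperature(client_stats):
    """Minimal federated aggregate: average locally fitted temperatures,
    weighted by local sample counts (illustrative rule, no DP noise)."""
    total = sum(n for _, n in client_stats)
    return sum(T * n for T, n in client_stats) / total

# client_stats: [(local_T, n_samples), ...] gathered from clients
print(federated_temperature([(1.4, 1000), (2.1, 250)]))
```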
[490] PepCompass: Navigating peptide embedding spaces using Riemannian Geometry
Marcin Możejko, Adam Bielecki, Jurand Prądzyński, Marcin Traskowski, Antoni Janowski, Karol Jurasz, Michał Kucharczyk, Hyun-Su Lee, Marcelo Der Torossian Torres, Cesar de la Fuente-Nunez, Paulina Szymczak, Michał Kmicikiewicz, Ewa Szczurek
Main category: cs.LG
TL;DR: PepCompass is a geometry-aware framework for antimicrobial peptide discovery that addresses limitations of conventional generative models by using Riemannian manifolds to capture local geometry, enabling more efficient exploration and optimization of peptide space.
Details
Motivation: Current generative models for antimicrobial peptide discovery rely on flat Euclidean metrics and ignore decoder-induced geometry, leading to distorted and inefficient exploration of the vast peptide space where active peptides are scarce.
Method: PepCompass introduces Union of κ-Stable Riemannian Manifolds to capture local geometry, Second-Order Riemannian Brownian Efficient Sampling for exploration, Mutation Enumeration in Tangent Space for discrete substitutions, Local Enumeration Bayesian Optimization (LE-BO) for local activity optimization, and Potential-minimizing Geodesic Search (PoGS) for property-enriched geodesic interpolation.
Result: In-vitro validation showed PoGS yielded four novel seeds, and subsequent LE-BO optimization discovered 25 highly active peptides with broad-spectrum activity, including against resistant bacterial strains.
Conclusion: Geometry-informed exploration provides a powerful new paradigm for antimicrobial peptide design, overcoming limitations of conventional approaches and enabling efficient discovery of active peptides.
Abstract: Antimicrobial peptide discovery is challenged by the astronomical size of peptide space and the relative scarcity of active peptides. Generative models provide continuous latent “maps” of peptide space, but conventionally ignore decoder-induced geometry and rely on flat Euclidean metrics, rendering exploration and optimization distorted and inefficient. Prior manifold-based remedies assume fixed intrinsic dimensionality, which critically fails in practice for peptide data. Here, we introduce PepCompass, a geometry-aware framework for peptide exploration and optimization. At its core, we define a Union of $\kappa$-Stable Riemannian Manifolds $\mathbb{M}^{\kappa}$, a family of decoder-induced manifolds that captures local geometry while ensuring computational stability. We propose two local exploration methods: Second-Order Riemannian Brownian Efficient Sampling, which provides a convergent second-order approximation to Riemannian Brownian motion, and Mutation Enumeration in Tangent Space, which reinterprets tangent directions as discrete amino-acid substitutions. Combining these yields Local Enumeration Bayesian Optimization (LE-BO), an efficient algorithm for local activity optimization. Finally, we introduce Potential-minimizing Geodesic Search (PoGS), which interpolates between prototype embeddings along property-enriched geodesics, biasing discovery toward seeds, i.e. peptides with favorable activity. In-vitro validation confirms the effectiveness of PepCompass: PoGS yields four novel seeds, and subsequent optimization with LE-BO discovers 25 highly active peptides with broad-spectrum activity, including against resistant bacterial strains. These results demonstrate that geometry-informed exploration provides a powerful new paradigm for antimicrobial peptide design.
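The central geometric object is the decoder-induced pullback metric $G(z) = J(z)^\top J(z)$, under which latent curve lengths reflect distances in output (peptide) space rather than flat Euclidean latent distances. A bare-bones version without the paper's $\kappa$-stabilization, using a stand-in decoder:

```python
import torch

def pullback_metric(decoder, z):
    """Decoder-induced Riemannian metric at latent point z: G = J^T J,
    where J is the decoder Jacobian at z."""
    J = torch.autograd.functional.jacobian(decoder, z)   # (out_dim, latent_dim)
    return J.T @ J

decoder = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.Tanh(),
                              torch.nn.Linear(64, 200))  # stand-in decoder
z = torch.randn(16)
G = pullback_metric(decoder, z)
# Riemannian length element of a small latent step dz: sqrt(dz^T G dz)
dz = 0.01 * torch.randn(16)
print(torch.sqrt(dz @ G @ dz))
```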
[491] Normality Calibration in Semi-supervised Graph Anomaly Detection
Guolei Zeng, Hezhe Qiao, Guoguo Ai, Jinsong Guo, Guansong Pang
Main category: cs.LG
TL;DR: GraphNC is a graph normality calibration framework that improves semi-supervised graph anomaly detection by calibrating normality in both anomaly score and representation spaces using teacher-student learning.
Details
Motivation: Existing semi-supervised GAD methods overfit to labeled normal nodes, leading to high false positives. The limited normality learning from only labeled data restricts detection performance.
Method: Uses teacher-student framework with two components: ScoreDA (anomaly score distribution alignment) and NormReg (perturbation-based normality regularization). ScoreDA aligns anomaly scores with teacher model’s distribution, while NormReg makes normal node representations more compact through consistency loss.
Result: The framework effectively separates anomaly scores of normal and abnormal classes, making them more separable. It mitigates misleading scores from the teacher model through representation space regularization.
Conclusion: GraphNC successfully addresses the overfitting problem in semi-supervised GAD by jointly calibrating normality in score and representation spaces, leading to improved anomaly detection performance.
Abstract: Graph anomaly detection (GAD) has attracted growing interest for its crucial ability to uncover irregular patterns in broad applications. Semi-supervised GAD, which assumes a subset of annotated normal nodes available during training, is among the most widely explored application settings. However, the normality learned by existing semi-supervised GAD methods is limited to the labeled normal nodes, often overfitting to the given patterns. This can lead to high detection errors, such as high false positives. To overcome this limitation, we propose GraphNC, a graph normality calibration framework that leverages both labeled and unlabeled data to calibrate the normality from a teacher model (a pre-trained semi-supervised GAD model) jointly in anomaly score and node representation spaces. GraphNC includes two main components, anomaly score distribution alignment (ScoreDA) and perturbation-based normality regularization (NormReg). ScoreDA optimizes the anomaly scores of our model by aligning them with the score distribution yielded by the teacher model. Due to accurate scores in most of the normal nodes and part of the anomaly nodes in the teacher model, the score alignment effectively pulls the anomaly scores of the normal and abnormal classes toward the two ends, resulting in more separable anomaly scores. Nevertheless, there are inaccurate scores from the teacher model. To mitigate the misleading by these scores, NormReg is designed to regularize the graph normality in the representation space, making the representations of normal nodes more compact by minimizing a perturbation-guided consistency loss solely on the labeled nodes.
[492] FairContrast: Enhancing Fairness through Contrastive learning and Customized Augmenting Methods on Tabular Data
Aida Tayebi, Ali Khodabandeh Yalabadi, Mehdi Yazdani-Jahromi, Ozlem Ozmen Garibay
Main category: cs.LG
TL;DR: A contrastive learning framework for learning fair representations in tabular data that reduces bias while maintaining accuracy.
Details
Motivation: Fair AI systems are critical as AI becomes more embedded in everyday life, but fairness in learned representations for tabular data remains underexplored despite the effectiveness of representation learning for debiasing.
Method: A contrastive learning framework with strategic positive pair selection using both supervised and self-supervised contrastive learning specifically designed for tabular datasets.
Result: Significantly reduces bias compared to existing state-of-the-art contrastive learning models for tabular data with minimal accuracy trade-off.
Conclusion: The framework effectively mitigates bias in tabular data representations and enables fair representations for various downstream tasks.
Abstract: As AI systems become more embedded in everyday life, the development of fair and unbiased models becomes more critical. Considering the social impact of AI systems is not merely a technical challenge but a moral imperative. As evidenced in numerous research studies, learning fair and robust representations has proven to be a powerful approach to effectively debiasing algorithms and improving fairness while maintaining essential information for prediction tasks. Representation learning frameworks, particularly those that utilize self-supervised and contrastive learning, have demonstrated superior robustness and generalizability across various domains. Despite the growing interest in applying these approaches to tabular data, the issue of fairness in these learned representations remains underexplored. In this study, we introduce a contrastive learning framework specifically designed to address bias and learn fair representations in tabular datasets. By strategically selecting positive pair samples and employing supervised and self-supervised contrastive learning, we significantly reduce bias compared to existing state-of-the-art contrastive learning models for tabular data. Our results demonstrate the efficacy of our approach in mitigating bias with minimum trade-off in accuracy and leveraging the learned fair representations in various downstream tasks.
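One plausible reading of strategic positive-pair selection is to restrict positives to same-label samples from a different sensitive group, so the representation must encode the label without the group attribute. A supervised-contrastive sketch under that assumption (the cross-group rule is our guess, not necessarily the paper's exact strategy):

```python
import torch
import torch.nn.functional as F

def fair_supervised_contrastive(z, labels, groups, tau=0.1):
    """Supervised contrastive loss where positives are same-label samples
    from a *different* sensitive group."""
    z = F.normalize(z, dim=1)
    sim = z @ z.T / tau
    n = z.size(0)
    eye = torch.eye(n, dtype=torch.bool)
    same_label = labels[:, None] == labels[None, :]
    diff_group = groups[:, None] != groups[None, :]
    pos = same_label & diff_group & ~eye          # fairness-aware positive set
    logits = sim.masked_fill(eye, float("-inf"))  # exclude self-similarity
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    # Average log-probability over each anchor's positive set.
    pos_counts = pos.sum(1).clamp(min=1)
    return -(log_prob * pos).sum(1).div(pos_counts)[pos.any(1)].mean()

z = torch.randn(64, 32)              # encoder outputs for a batch
labels = torch.randint(0, 2, (64,))  # task labels
groups = torch.randint(0, 2, (64,))  # sensitive attribute
loss = fair_supervised_contrastive(z, labels, groups)
```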
[493] GRACE: A Language Model Framework for Explainable Inverse Reinforcement Learning
Silvia Sapora, Devon Hjelm, Alexander Toshev, Omar Attia, Bogdan Mazoure
Main category: cs.LG
TL;DR: GRACE uses LLMs with evolutionary search to generate interpretable code-based reward functions from expert demonstrations, outperforming traditional black-box IRL methods.
Details
Motivation: Traditional Inverse Reinforcement Learning produces black-box reward models that are difficult to interpret and debug, limiting practical applications.
Method: GRACE combines Large Language Models with evolutionary search to reverse-engineer executable code-based reward functions directly from expert trajectories.
Result: GRACE learns highly accurate rewards on BabyAI and AndroidWorld benchmarks, produces strong policies comparable to ground-truth reward approaches, and builds complex reward APIs in multi-task settings.
Conclusion: GRACE provides an effective approach for generating interpretable, code-based reward functions that can be inspected and verified, addressing key limitations of traditional IRL methods.
Abstract: Inverse Reinforcement Learning aims to recover reward models from expert demonstrations, but traditional methods yield “black-box” models that are difficult to interpret and debug. In this work, we introduce GRACE (Generating Rewards As CodE), a method for using Large Language Models within an evolutionary search to reverse-engineer an interpretable, code-based reward function directly from expert trajectories. The resulting reward function is executable code that can be inspected and verified. We empirically validate GRACE on the BabyAI and AndroidWorld benchmarks, where it efficiently learns highly accurate rewards, even in complex, multi-task settings. Further, we demonstrate that the resulting reward leads to strong policies, compared to both competitive Imitation Learning and online RL approaches with ground-truth rewards. Finally, we show that GRACE is able to build complex reward APIs in multi-task setups.
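The search loop pairs an executable fitness check against demonstrations with LLM-driven mutation of candidate reward code. In the sketch below, `llm_propose` and `llm_mutate` are hypothetical prompted-LLM calls, and the expert-vs-random margin is an illustrative fitness, not the paper's objective.

```python
import random

def fitness(reward_code, expert_trajs, random_trajs):
    """Score candidate reward code (a string defining `reward(traj)`) by how
    well it separates expert from random behavior. In practice, executing
    model-written code must be sandboxed."""
    ns = {}
    exec(reward_code, ns)                          # defines ns["reward"]
    expert = sum(ns["reward"](t) for t in expert_trajs) / len(expert_trajs)
    rand = sum(ns["reward"](t) for t in random_trajs) / len(random_trajs)
    return expert - rand                           # margin over non-expert behavior

def evolve(llm_propose, llm_mutate, expert_trajs, random_trajs,
           pop_size=8, generations=10):
    """Evolutionary search over LLM-generated reward code: keep the best
    candidates, then ask the LLM to mutate them."""
    pop = llm_propose(pop_size)
    for _ in range(generations):
        scored = sorted(pop, key=lambda c: fitness(c, expert_trajs, random_trajs),
                        reverse=True)
        elite = scored[: pop_size // 2]
        pop = elite + [llm_mutate(random.choice(elite)) for _ in elite]
    return max(pop, key=lambda c: fitness(c, expert_trajs, random_trajs))
```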
[494] Mathematical Modeling and Convergence Analysis of Deep Neural Networks with Dense Layer Connectivities in Deep Learning
Jinshu Huang, Haibin Su, Xue-Cheng Tai, Chunlin Wu
Main category: cs.LG
TL;DR: This paper provides a mathematical framework for analyzing densely connected deep neural networks (DNNs) using nonlinear integral equations and optimal control theory, showing convergence from discrete to continuous formulations.
Details
Motivation: To establish a rigorous mathematical foundation for understanding densely connected DNNs and analyze their learning problems in the deep-layer limit, addressing limitations of prior ODE-based approaches.
Method: Developed a dense non-local (DNL) framework modeling densely connected DNNs as nonlinear integral equations, studied training problems from optimal control perspective, and used piecewise linear extension with Γ-convergence analysis.
Result: Proved convergence of optimal values and subsequence convergence of minimizers from discrete network learning to continuous-time counterpart, demonstrating training stability for deep models.
Conclusion: The mathematical framework provides theoretical understanding of densely connected DNNs and suggests such architectures offer training stability for deep models.
Abstract: In deep learning, dense layer connectivity has become a key design principle in deep neural networks (DNNs), enabling efficient information flow and strong performance across a range of applications. In this work, we model densely connected DNNs mathematically and analyze their learning problems in the deep-layer limit. For broad applicability, we present our analysis in a framework setting of DNNs with densely connected layers and general non-local feature transformations (with local feature transformations as special cases) within layers, which is called the dense non-local (DNL) framework and includes standard DenseNets and variants as special examples. In this formulation, the densely connected networks are modeled as nonlinear integral equations, in contrast to the ordinary differential equation viewpoint commonly adopted in prior works. We study the associated training problems from an optimal control perspective and prove convergence results from the network learning problem to its continuous-time counterpart. In particular, we show the convergence of optimal values and the subsequence convergence of minimizers, using a piecewise linear extension and $\Gamma$-convergence analysis. Our results provide a mathematical foundation for understanding densely connected DNNs and further suggest that such architectures can offer stability of training deep models.
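The modeling shift can be summarized in one formula. In the deep-layer limit, a residual network follows an ODE, whereas dense connectivity, which feeds every earlier feature into each layer, leads to a nonlinear integral equation; in illustrative notation (ours, not the paper's), with a kernel $K$ weighting contributions from all previous depths $s \le t$:

```latex
\text{ResNet limit:}\quad \dot{x}(t) = f\big(x(t),\, t\big),
\qquad
\text{dense limit:}\quad x(t) = x(0) + \int_0^t K(t,s)\, f\big(x(s),\, s\big)\,\mathrm{d}s .
```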
[495] Detection of Chagas Disease from the ECG: The George B. Moody PhysioNet Challenge 2025
Matthew A. Reyna, Zuzana Koscova, Jan Pavlus, Soheil Saghafi, James Weigle, Andoni Elola, Salman Seyedi, Kiersten Campbell, Qiao Li, Ali Bahrami Rad, Antônio H. Ribeiro, Antonio Luiz P. Ribeiro, Reza Sameni, Gari D. Clifford
Main category: cs.LG
TL;DR: The paper describes the George B. Moody PhysioNet Challenge 2025 focused on developing algorithms to identify Chagas disease from ECGs, using a triage approach to prioritize patients for limited serological testing.
Details
Motivation: Chagas disease is endemic in South/Central America and the US, causing cardiovascular and digestive problems. Limited serological testing capacity creates need for ECG-based screening since Chagas cardiomyopathy often manifests in ECGs.
Method: Used multiple datasets with weak labels from patient reports and strong labels from serological testing. Applied data augmentation for model robustness and generalizability. Framed as triage task with evaluation metric reflecting local testing capacity.
Result: Over 630 participants from 111 teams submitted more than 1300 entries during the Challenge, representing diverse approaches from academia and industry worldwide.
Conclusion: The Challenge successfully developed algorithmic approaches for ECG-based Chagas disease identification, addressing the limited testing capacity through a triage framework that can prioritize patients for serological testing and treatment.
Abstract: Objective: Chagas disease is a parasitic infection that is endemic to South America, Central America, and, more recently, the U.S., primarily transmitted by insects. Chronic Chagas disease can cause cardiovascular diseases and digestive problems. Serological testing capacities for Chagas disease are limited, but Chagas cardiomyopathy often manifests in ECGs, providing an opportunity to prioritize patients for testing and treatment. Approach: The George B. Moody PhysioNet Challenge 2025 invites teams to develop algorithmic approaches for identifying Chagas disease from electrocardiograms (ECGs). Main results: This Challenge provides multiple innovations. First, we leveraged several datasets with labels from patient reports and serological testing, providing a large dataset with weak labels and smaller datasets with strong labels. Second, we augmented the data to support model robustness and generalizability to unseen data sources. Third, we applied an evaluation metric that captured the local serological testing capacity for Chagas disease to frame the machine learning problem as a triage task. Significance: Over 630 participants from 111 teams submitted over 1300 entries during the Challenge, representing diverse approaches from academia and industry worldwide.
[496] Adaptive Heterogeneous Mixtures of Normalising Flows for Robust Variational Inference
Benjamin Wiriyapong, Oktay Karakuş, Kirill Sidorov
Main category: cs.LG
TL;DR: AMF-VI is a two-stage adaptive mixture of complementary normalizing flows (MAF, RealNVP, RBIG) that achieves robust variational inference across diverse posterior distributions without architectural changes.
Details
Motivation: Single-flow variational inference models often behave inconsistently across different distribution types, lacking robustness across qualitatively different posterior families.
Method: Two-stage training: (1) sequential expert training of individual flows, (2) adaptive global weight estimation via likelihood-driven updates without per-sample gating or architectural modifications.
Result: Achieves consistently lower negative log-likelihood than single-flow baselines across six canonical posterior families, with stable gains in Wasserstein-2 and MMD metrics, indicating improved robustness across shapes and modalities.
Conclusion: Adaptive mixtures of diverse flows provide a reliable route to robust variational inference across diverse posterior families while preserving each expert’s inductive bias, with minimal computational overhead.
Abstract: Normalising-flow variational inference (VI) can approximate complex posteriors, yet single-flow models often behave inconsistently across qualitatively different distributions. We propose Adaptive Mixture Flow Variational Inference (AMF-VI), a heterogeneous mixture of complementary flows (MAF, RealNVP, RBIG) trained in two stages: (i) sequential expert training of individual flows, and (ii) adaptive global weight estimation via likelihood-driven updates, without per-sample gating or architectural changes. Evaluated on six canonical posterior families of banana, X-shape, two-moons, rings, a bimodal, and a five-mode mixture, AMF-VI achieves consistently lower negative log-likelihood than each single-flow baseline and delivers stable gains in transport metrics (Wasserstein-2) and maximum mean discrepancy (MMD), indicating improved robustness across shapes and modalities. The procedure is efficient and architecture-agnostic, incurring minimal overhead relative to standard flow training, and demonstrates that adaptive mixtures of diverse flows provide a reliable route to robust VI across diverse posterior families whilst preserving each expert’s inductive bias.
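Stage (ii) can be pictured as a multiplicative, likelihood-driven reweighting of frozen experts on a validation batch; the tempered update rule below is an illustrative stand-in for the paper's estimator (no per-sample gating, no retraining).

```python
import numpy as np

def update_mixture_weights(log_liks, weights, step=0.5):
    """Likelihood-driven global weight update for a mixture of pre-trained
    flows. `log_liks` has shape (n_experts, n_samples): each expert's
    log-density on a validation batch. Experts that explain the data better
    receive exponentially more weight; `step` tempers the update."""
    avg = log_liks.mean(axis=1)                        # per-expert mean log-lik
    new = weights * np.exp(step * (avg - avg.max()))   # shift max for stability
    return new / new.sum()

# Three experts (e.g., MAF, RealNVP, RBIG) evaluated on 1000 samples:
log_liks = np.random.randn(3, 1000) + np.array([[-1.0], [-1.5], [-3.0]])
w = np.ones(3) / 3
for _ in range(20):
    w = update_mixture_weights(log_liks, w)
print(w)  # concentrates on the better-fitting experts
```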
[497] DiFFPO: Training Diffusion LLMs to Reason Fast and Furious via Reinforcement Learning
Hanyang Zhao, Dawen Liang, Wenpin Tang, David Yao, Nathan Kallus
Main category: cs.LG
TL;DR: DiFFPO is a unified RL framework that trains masked diffusion LLMs to reason better and faster by using surrogate policies and joint training of efficient samplers.
Details
Motivation: To improve both the reasoning quality and inference speed of diffusion large language models through reinforcement learning, addressing the trade-off between performance and computational efficiency.
Method: Uses off-policy RL with surrogate policies, two-stage likelihood approximation with importance sampling, and joint training of adaptive inference samplers that allocate thresholds per prompt.
Result: Achieves better accuracy with fewer function evaluations, improving the Pareto frontier of inference-time compute for dLLMs on math and planning tasks.
Conclusion: DiFFPO successfully unifies RL training for diffusion LLMs, enabling simultaneous improvements in reasoning quality and inference efficiency through joint policy and sampler optimization.
Abstract: We propose DiFFPO, Diffusion Fast and Furious Policy Optimization, a unified framework for training masked diffusion large language models (dLLMs) to reason not only better (furious), but also faster via reinforcement learning (RL). We first unify the existing baseline approach such as d1 by proposing to train surrogate policies via off-policy RL, whose likelihood is much more tractable as an approximation to the true dLLM policy. This naturally motivates a more accurate and informative two-stage likelihood approximation combined with importance sampling correction, which leads to generalized RL algorithms with better sample efficiency and superior task performance. Second, we propose a new direction of jointly training efficient samplers/controllers of the dLLM policy. Via RL, we incentivize dLLMs’ natural multi-token prediction capabilities by letting the model learn to adaptively allocate an inference threshold for each prompt. By jointly training the sampler, we yield better accuracies with a lower number of function evaluations (NFEs) compared to training the model only, obtaining the best performance in improving the Pareto frontier of the inference-time compute of dLLMs. We showcase the effectiveness of our pipeline by training open source large diffusion language models over benchmark math and planning tasks.
[498] Inferring Optical Tissue Properties from Photoplethysmography using Hybrid Amortized Inference
Jens Behrmann, Maria R. Cervera, Antoine Wehenkel, Andrew C. Miller, Albert Cerussi, Pranay Jain, Vivek Venugopal, Shijie Yan, Guillermo Sapiro, Luca Pegolotti, Jörn-Henrik Jacobsen
Main category: cs.LG
TL;DR: PPGen is a biophysical model that relates PPG signals to interpretable physiological parameters, using hybrid amortized inference for robust parameter estimation while maintaining clinical interpretability.
Details
Motivation: Current deep learning models for PPG analysis rely on features with unclear physiological meaning, creating tension between predictive power, clinical interpretability, and sensor design.
Method: Introduced PPGen biophysical model and hybrid amortized inference (HAI) for fast, robust estimation of physiological parameters from PPG signals while correcting for model misspecification.
Result: HAI accurately infers physiological parameters under diverse noise and sensor conditions in extensive in-silico experiments.
Conclusion: PPGen provides a path toward PPG models that retain DL-based feature fidelity while supporting clinical interpretation and informed hardware design.
Abstract: Smart wearables enable continuous tracking of established biomarkers such as heart rate, heart rate variability, and blood oxygen saturation via photoplethysmography (PPG). Beyond these metrics, PPG waveforms contain richer physiological information, as recent deep learning (DL) studies demonstrate. However, DL models often rely on features with unclear physiological meaning, creating a tension between predictive power, clinical interpretability, and sensor design. We address this gap by introducing PPGen, a biophysical model that relates PPG signals to interpretable physiological and optical parameters. Building on PPGen, we propose hybrid amortized inference (HAI), enabling fast, robust, and scalable estimation of relevant physiological parameters from PPG signals while correcting for model misspecification. In extensive in-silico experiments, we show that HAI can accurately infer physiological parameters under diverse noise and sensor conditions. Our results illustrate a path toward PPG models that retain the fidelity needed for DL-based features while supporting clinical interpretation and informed hardware design.
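Amortized inference in its simplest form trains a regressor on simulator-generated (parameter, signal) pairs, so that deployment-time inference is a single forward pass. The toy sinusoidal simulator below stands in for PPGen, the shapes and objective are illustrative, and HAI's correction for model misspecification is omitted.

```python
import torch
from torch import nn

def simulator(theta):
    """Stand-in biophysical forward model: amplitude and frequency of a
    sinusoid play the role of PPGen's physiological parameters."""
    t = torch.linspace(0, 1, 256)
    return theta[:, :1] * torch.sin(2 * torch.pi * (4 + 4 * theta[:, 1:2]) * t)

net = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 2))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(1000):
    theta = torch.rand(64, 2)              # sample parameters from the prior
    x = simulator(theta)                   # simulate waveforms
    x = x + 0.01 * torch.randn_like(x)     # add sensor noise
    loss = ((net(x) - theta) ** 2).mean()  # regress parameters from signals
    opt.zero_grad()
    loss.backward()
    opt.step()

# At deployment, inference is one forward pass on a measured waveform.
```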
[499] How to Combat Reactive and Dynamic Jamming Attacks with Reinforcement Learning
Yalin E. Sagduyu, Tugba Erpek, Kemal Davaslioglu, Sastry Kompella
Main category: cs.LG
TL;DR: RL-based anti-jamming approach using Q-learning and DQN to adapt transmission parameters against reactive jammers.
Details
Motivation: To combat reactive jammers that dynamically select channels and sensing thresholds to detect and jam transmissions, requiring adaptive countermeasures.
Method: Uses reinforcement learning (Q-learning for discrete states, DQN for continuous states) to optimize transmit power, modulation, and channel selection without prior knowledge of channel conditions or jamming strategies.
Result: RL can rapidly adapt to spectrum dynamics and maintain high throughput rates as channels and jamming policies change over time.
Conclusion: Reinforcement learning is effective for mitigating reactive jamming by enabling dynamic adaptation of transmission parameters in response to changing jamming strategies.
Abstract: This paper studies the problem of mitigating reactive jamming, where a jammer adopts a dynamic policy of selecting channels and sensing thresholds to detect and jam ongoing transmissions. The transmitter-receiver pair learns to avoid jamming and optimize throughput over time (without prior knowledge of channel conditions or jamming strategies) by using reinforcement learning (RL) to adapt transmit power, modulation, and channel selection. Q-learning is employed for discrete jamming-event states, while Deep Q-Networks (DQN) are employed for continuous states based on received power. Through different reward functions and action sets, the results show that RL can adapt rapidly to spectrum dynamics and sustain high rates as channels and jamming policies change over time.
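The discrete-state case is classical tabular Q-learning over (channel, power) actions; the bitmask state encoding and the `env.sense()`/`env.transmit()` interface below are illustrative stand-ins for the simulated link.

```python
import random
import numpy as np

N_CH, POWERS = 4, [0.1, 1.0]
actions = [(c, p) for c in range(N_CH) for p in POWERS]
Q = np.zeros((2 ** N_CH, len(actions)))  # state: bitmask of jammed channels

def q_learn(env, episodes=5000, alpha=0.1, gamma=0.9, eps=0.1):
    """Tabular Q-learning for anti-jamming: pick a (channel, power) action,
    observe throughput reward, and update the action-value table."""
    for _ in range(episodes):
        s = env.sense()                             # observed jamming bitmask
        if random.random() < eps:                   # epsilon-greedy exploration
            a = random.randrange(len(actions))
        else:
            a = int(np.argmax(Q[s]))
        reward, s_next = env.transmit(*actions[a])  # throughput as reward
        Q[s, a] += alpha * (reward + gamma * Q[s_next].max() - Q[s, a])
    return Q
```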
[500] Fine-Tuning Flow Matching via Maximum Likelihood Estimation of Reconstructions
Zhaoyi Li, Jingtao Ding, Yong Li, Shihua Li
Main category: cs.LG
TL;DR: The paper proposes a fine-tuning method for Flow Matching (FM) using Maximum Likelihood Estimation to address the train-inference gap and improve performance in high-precision tasks like robotic manipulation.
Details
Motivation: Flow Matching has a train-inference gap where model output cannot be assessed during training, unlike other generative models. This is problematic for high-precision tasks. FM's pursuit of straight paths can also introduce stiffness issues.
Method: Fine-tune FM via Maximum Likelihood Estimation of reconstructions, including straightforward fine-tuning and residual-based approaches. The residual method incorporates contraction property for robustness.
Result: Experimental results in image generation and robotic manipulation show reliable improvement in FM’s inference performance.
Conclusion: The proposed fine-tuning method effectively bridges the train-inference gap in FM and enhances performance for precision-demanding applications.
Abstract: The Flow Matching (FM) algorithm achieves remarkable results in generative tasks, especially in robotic manipulation. Building upon the foundations of diffusion models, the simulation-free paradigm of FM enables simple and efficient training, but inherently introduces a train-inference gap. Specifically, we cannot assess the model’s output during the training phase. In contrast, other generative models including Variational Autoencoder (VAE), Normalizing Flow and Generative Adversarial Networks (GANs) directly optimize on the reconstruction loss. Such a gap is particularly evident in scenarios that demand high precision, such as robotic manipulation. Moreover, we show that FM’s over-pursuit of straight predefined paths may introduce some serious problems such as stiffness into the system. These issues motivate us to fine-tune FM via Maximum Likelihood Estimation of reconstructions – an approach made feasible by FM’s underlying smooth ODE formulation, in contrast to the stochastic differential equations (SDEs) used in diffusion models. This paper first theoretically analyzes the relation between training loss and inference error in FM. Then we propose a method of fine-tuning FM via Maximum Likelihood Estimation of reconstructions, which includes both straightforward fine-tuning and residual-based fine-tuning approaches. Furthermore, through specifically designed architectures, the residual-based fine-tuning can incorporate the contraction property into the model, which is crucial for the model’s robustness and interpretability. Experimental results in image generation and robotic manipulation verify that our method reliably improves the inference performance of FM.
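One plausible reading of the straightforward variant: round-trip data through the learned ODE with Euler steps (data to noise and back), which is differentiable because FM defines a smooth ODE, then maximize a Gaussian reconstruction likelihood, i.e. minimize squared reconstruction error. Step counts and the loss below are illustrative choices, not the paper's exact procedure.

```python
import torch

def reconstruct(v_theta, x1, n_steps=10):
    """Round trip through the learned flow: integrate the velocity field
    v_theta backward from data x1 to noise, then forward again, with
    simple Euler steps. Both passes are differentiable."""
    dt = 1.0 / n_steps
    x = x1
    for i in reversed(range(n_steps)):                   # data -> noise
        t = torch.full((x.shape[0], 1), (i + 1) * dt)
        x = x - dt * v_theta(x, t)
    for i in range(n_steps):                             # noise -> data
        t = torch.full((x.shape[0], 1), i * dt)
        x = x + dt * v_theta(x, t)
    return x

def mle_finetune_step(v_theta, opt, x1):
    # Gaussian likelihood of the reconstruction == squared error up to scale.
    loss = ((reconstruct(v_theta, x1) - x1) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```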
[501] Addressing Pitfalls in the Evaluation of Uncertainty Estimation Methods for Natural Language Generation
Mykyta Ielanskyi, Kajetan Schweighofer, Lukas Aichberger, Sepp Hochreiter
Main category: cs.LG
TL;DR: The paper addresses issues with evaluating uncertainty estimation methods for detecting LLM hallucinations, proposing more robust evaluation approaches including multiple LLM judges, structured tasks, and Elo ratings.
Details
Motivation: Current methods for evaluating uncertainty estimation in LLMs have substantial disagreement and can be manipulated to inflate performance, undermining reliable detection of confabulations.
Method: Proposes using multiple LLM-as-a-judge variants, structured tasks, out-of-distribution detection, perturbation tasks, and Elo rating systems to create more robust evaluation frameworks.
Result: Marginalizing over multiple LLM judges reduces evaluation biases, and alternative risk indicators provide more controllable and robust assessment of uncertainty estimation methods.
Conclusion: The proposed evaluation framework offers more objective and robust assessment of uncertainty estimation algorithms for detecting LLM hallucinations across diverse settings.
Abstract: Hallucinations are a common issue that undermines the reliability of large language models (LLMs). Recent studies have identified a specific subset of hallucinations, known as confabulations, which arise due to predictive uncertainty of LLMs. To detect confabulations, various methods for estimating predictive uncertainty in natural language generation (NLG) have been developed. These methods are typically evaluated by correlating uncertainty estimates with the correctness of generated text, with question-answering (QA) datasets serving as the standard benchmark. However, commonly used approximate correctness functions have substantial disagreement between each other and, consequently, in the ranking of the uncertainty estimation methods. This allows one to inflate the apparent performance of uncertainty estimation methods. We propose using several alternative risk indicators for risk correlation experiments that improve the robustness of empirical assessment of uncertainty estimation (UE) algorithms for NLG. For QA tasks, we show that marginalizing over multiple LLM-as-a-judge variants leads to reducing the evaluation biases. Furthermore, we explore structured tasks as well as out-of-distribution and perturbation detection tasks which provide robust and controllable risk indicators. Finally, we propose to use an Elo rating of uncertainty estimation methods to give an objective summarization over extensive evaluation settings.
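The Elo summarization turns pairwise outcomes ("method A ranked risk better than method B in setting s") into a single scale; the standard Elo update suffices as a sketch, and the match construction below is hypothetical.

```python
def elo_update(r_a, r_b, score_a, k=16):
    """Standard Elo update: score_a is 1 if method A beat method B on an
    evaluation setting, 0 if it lost, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

def rate_methods(methods, matches, rounds=20):
    """Rate UE methods from pairwise wins across many (task, risk-indicator)
    settings; `matches` entries are (method_a, method_b, score_a)."""
    ratings = {m: 1000.0 for m in methods}
    for _ in range(rounds):                 # repeat passes to reduce order effects
        for a, b, s in matches:
            ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], s)
    return ratings

print(rate_methods(["semantic_entropy", "max_prob"],
                   [("semantic_entropy", "max_prob", 1.0)] * 10))
```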
[502] Learning Model Representations Using Publicly Available Model Hubs
Damian Falk, Konstantin Schürholt, Konstantinos Tzevelekakis, Léo Meynent, Damian Borth
Main category: cs.LG
TL;DR: The paper proposes a weight space learning method that can be trained on unstructured model repositories like Hugging Face instead of requiring curated model zoos, achieving strong performance and generalization.
Details
Motivation: Current weight space learning requires large, curated model zoos which are computationally expensive to create and limit scale and flexibility. The authors aim to overcome this limitation by using heterogeneous models from unstructured repositories.
Method: Proposed a new weight space backbone designed to handle unstructured model populations with varying architectures, datasets, and documentation from sources like Hugging Face.
Result: Weight space representations trained on Hugging Face models achieve strong performance, often outperforming backbones trained on laboratory-generated model zoos, and generalize to unseen data modalities due to model diversity.
Conclusion: Curated model zoos are not indispensable for weight space learning, as high-quality representations can be learned from unstructured model repositories, overcoming a major limitation in the field.
Abstract: The weights of neural networks have emerged as a novel data modality, giving rise to the field of weight space learning. A central challenge in this area is that learning meaningful representations of weights typically requires large, carefully constructed collections of trained models, typically referred to as model zoos. These model zoos are often trained ad-hoc, requiring large computational resources, constraining the learned weight space representations in scale and flexibility. In this work, we drop this requirement by training a weight space learning backbone on arbitrary models downloaded from large, unstructured model repositories such as Hugging Face. Unlike curated model zoos, these repositories contain highly heterogeneous models: they vary in architecture and dataset, and are largely undocumented. To address the methodological challenges posed by such heterogeneity, we propose a new weight space backbone designed to handle unstructured model populations. We demonstrate that weight space representations trained on models from Hugging Face achieve strong performance, often outperforming backbones trained on laboratory-generated model zoos. Finally, we show that the diversity of the model weights in our training set allows our weight space model to generalize to unseen data modalities. By demonstrating that high-quality weight space representations can be learned in the wild, we show that curated model zoos are not indispensable, thereby overcoming a strong limitation currently faced by the weight space learning community.
[503] Diffusion Models and the Manifold Hypothesis: Log-Domain Smoothing is Geometry Adaptive
Tyler Farghly, Peter Potaptchik, Samuel Howard, George Deligiannidis, Jakiw Pidstrigach
Main category: cs.LG
TL;DR: This paper investigates why diffusion models generalize well, providing evidence that their success stems from adapting to low-dimensional geometric data structures through score matching and implicit regularization via smoothing.
Details
Motivation: To understand the mechanisms behind diffusion models’ strong generalization capabilities, particularly testing the conjecture that they adapt to low-dimensional geometric structures in data.
Method: Theoretical and empirical analysis of how smoothing minimizers of the empirical score matching objective produces tangential smoothing along the data manifold, and how this smoothing can be controlled.
Result: Confirmed that smoothing the score function (equivalent to smoothing in log-density domain) produces tangential smoothing to the data manifold, and that the generalization manifold can be controlled by choosing appropriate smoothing.
Conclusion: The study provides evidence supporting the manifold hypothesis for diffusion models’ success, showing that implicit regularization through score matching enables adaptation to low-dimensional geometric data structures.
Abstract: Diffusion models have achieved state-of-the-art performance, demonstrating remarkable generalisation capabilities across diverse domains. However, the mechanisms underpinning these strong capabilities remain only partially understood. A leading conjecture, based on the manifold hypothesis, attributes this success to their ability to adapt to low-dimensional geometric structure within the data. This work provides evidence for this conjecture, focusing on how such phenomena could result from the formulation of the learning problem through score matching. We inspect the role of implicit regularisation by investigating the effect of smoothing minimisers of the empirical score matching objective. Our theoretical and empirical results confirm that smoothing the score function – or equivalently, smoothing in the log-density domain – produces smoothing tangential to the data manifold. In addition, we show that the manifold along which the diffusion model generalises can be controlled by choosing an appropriate smoothing.
[504] PENEX: AdaBoost-Inspired Neural Network Regularization
Klaus-Rudolf Kladny, Bernhard Schölkopf, Michael Muehlebach
Main category: cs.LG
TL;DR: PENEX is a new multi-class exponential loss formulation that enables first-order optimization, implicitly maximizes margins, and shows better regularization than established methods in computer vision and language tasks.
Details
Motivation: AdaBoost uses exponential loss that penalizes mislabeled points severely but generalizes well with many weak learners. The existing exponential loss formulation is not amenable to first-order optimization methods.
Method: Introduces Penalized Exponential Loss (PENEX), a theoretically grounded multi-class exponential loss formulation that can be optimized via first-order methods. Gradient increments on PENEX implicitly parameterize weak learners in the boosting framework.
Result: PENEX implicitly maximizes margins of data points and exhibits better regularizing effect than established methods with similar computational cost across computer vision and language tasks.
Conclusion: PENEX has potential as an AdaBoost-inspired alternative for effective training and fine-tuning of deep neural networks.
Abstract: AdaBoost sequentially fits so-called weak learners to minimize an exponential loss, which penalizes mislabeled data points more severely than other loss functions like cross-entropy. Paradoxically, AdaBoost generalizes well in practice as the number of weak learners grows. In the present work, we introduce Penalized Exponential Loss (PENEX), a new formulation of the multi-class exponential loss that is theoretically grounded and, in contrast to the existing formulation, amenable to optimization via first-order methods. We demonstrate both empirically and theoretically that PENEX implicitly maximizes margins of data points. Also, we show that gradient increments on PENEX implicitly parameterize weak learners in the boosting framework. Across computer vision and language tasks, we show that PENEX exhibits a regularizing effect often better than established methods with similar computational cost. Our results highlight PENEX’s potential as an AdaBoost-inspired alternative for effective training and fine-tuning of deep neural networks.
[505] Hybrid Deep Learning Modeling Approach to Predict Natural Gas Consumption of Home Subscribers on Limited Data
Milad Firoozeh, Nader Dashti, Mohammad Ali Hatefi
Main category: cs.LG
TL;DR: This study analyzes and predicts residential gas consumption in Zanjan province, Iran using machine learning models including LSTM, GRU, and a hybrid BiLSTM-XGBoost model, finding that the hybrid model performs best.
Details
Motivation: Iran faces gas pressure drops and outages during cold seasons due to population growth and high energy consumption, particularly in the residential sector which has the largest consumption share. There is a need to control and predict gas consumption to manage resources efficiently.
Method: Used machine learning models (LSTM, GRU, and hybrid BiLSTM-XGBoost) trained on six years of gas consumption and meteorological data (2017-2022) to predict residential gas consumption patterns.
Result: The hybrid BiLSTM-XGBoost model outperformed other models with lower RMSE, MAPE, and MPE values, showing robust performance especially with limited data.
Conclusion: Machine learning approaches, particularly hybrid models, can effectively predict gas consumption and help manage resources more efficiently, reducing seasonal shortages. Geographical and climatic factors are important considerations in predictive modeling.
Abstract: Today, natural gas, as a clean fuel and the best alternative to crude oil, covers a significant part of global demand. Iran is one of the countries richest in energy resources and, in terms of gas, is the second largest in the world. However, due to population growth and rising energy consumption, the country faces problems such as pressure drops and gas outages every year in cold seasons, so it is necessary to control gas consumption, especially in the residential sector, which has the largest share in Iran. This study aims to analyze and predict gas consumption for residential customers in Zanjan province, Iran, using machine learning models, including LSTM, GRU, and a hybrid BiLSTM-XGBoost model. The dataset consists of gas consumption and meteorological data collected over six years, from 2017 to 2022. The models were trained and evaluated based on their ability to accurately predict consumption patterns. The results indicate that the hybrid BiLSTM-XGBoost model outperformed the other models in terms of accuracy, with lower Root Mean Squared Error (RMSE), Mean Absolute Percentage Error (MAPE), and Mean Percentage Error (MPE) values. Additionally, the hybrid model demonstrated robust performance, particularly in scenarios with limited data. The findings suggest that machine learning approaches, particularly hybrid models, can be effectively utilized to manage and predict gas consumption, contributing to more efficient resource management and reducing seasonal shortages. This study highlights the importance of incorporating geographical and climatic factors in predictive modeling, as these significantly influence gas usage across different regions.
[506] Ensemble Threshold Calibration for Stable Sensitivity Control
John N. Daras
Main category: cs.LG
TL;DR: An end-to-end framework for precise recall control in spatial conflation tasks that achieves exact recall with sub-percent variance using neural ranking and ensemble threshold estimation.
Details
Motivation: Precise recall control is critical in large-scale spatial conflation where missing true matches breaks downstream analytics and excessive manual review inflates costs. Classical confidence-interval methods overshoot targets and have high variance.
Method: Pipeline with equigrid bounding-box filter and CSR candidate representation reduces pairs by 100x. Deterministic xxHash bootstrap trains neural ranker, scores propagated to all pairs. Four threshold estimators (Clopper-Pearson, Jeffreys, Wilson, exact quantile) aggregated via inverse-variance weighting across nine subsamples.
Result: Consistently hits recall target within small error, decreases redundant verifications relative to other calibrations, runs end-to-end on single TPU v3 core. Evaluated on 6.31M and 67.34M cadastral pairs.
Conclusion: The framework achieves exact recall with sub-percent variance while being TPU-friendly, significantly improving over classical methods in precision and computational efficiency.
Abstract: Precise recall control is critical in large-scale spatial conflation and entity-matching tasks, where missing even a few true matches can break downstream analytics, while excessive manual review inflates cost. Classical confidence-interval cuts such as Clopper-Pearson or Wilson provide lower bounds on recall, but they routinely overshoot the target by several percentage points and exhibit high run-to-run variance under skewed score distributions. We present an end-to-end framework that achieves exact recall with sub-percent variance over tens of millions of geometry pairs, while remaining TPU-friendly. Our pipeline starts with an equigrid bounding-box filter and compressed sparse row (CSR) candidate representation, reducing pair enumeration by two orders of magnitude. A deterministic xxHash bootstrap sample trains a lightweight neural ranker; its scores are propagated to all remaining pairs via a single forward pass and used to construct a reproducible, score-decile-stratified calibration set. Four complementary threshold estimators - Clopper-Pearson, Jeffreys, Wilson, and an exact quantile - are aggregated via inverse-variance weighting, then fused across nine independent subsamples. This ensemble reduces threshold variance compared to any single method. Evaluated on two real cadastral datasets (approximately 6.31M and 67.34M pairs), our approach consistently hits a recall target within a small error, decreases redundant verifications relative to other calibrations, and runs end-to-end on a single TPU v3 core.
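As a concrete illustration of the calibration step, here is a minimal Python sketch of the four lower-bound estimators named in the abstract and an inverse-variance aggregation across subsamples. The variance proxy used for the weights, the toy labeled-pair counts, and the collapsed aggregate-then-fuse order are assumptions; the neural ranker and CSR candidate filtering are omitted.

```python
import numpy as np
from scipy import stats

def recall_lower_bounds(k, n, alpha=0.05):
    """One-sided lower confidence bounds on recall, given k recovered matches
    out of n labeled true pairs, via the four estimators the paper names."""
    cp = stats.beta.ppf(alpha, k, n - k + 1) if k > 0 else 0.0   # Clopper-Pearson
    jf = stats.beta.ppf(alpha, k + 0.5, n - k + 0.5)             # Jeffreys
    z = stats.norm.ppf(1 - alpha)                                # Wilson score
    p = k / n
    centre = p + z**2 / (2 * n)
    margin = z * np.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    wi = (centre - margin) / (1 + z**2 / n)
    eq = stats.binom.ppf(alpha, n, p) / n                        # exact quantile (plug-in)
    return np.array([cp, jf, wi, eq])

def inverse_variance_aggregate(rows):
    """Weight each estimator by the reciprocal of its variance across
    subsamples (an assumed variance proxy), then combine."""
    b = np.asarray(rows)                      # shape: (subsamples, 4)
    w = 1.0 / (b.var(axis=0, ddof=1) + 1e-12)
    return float((w * b.mean(axis=0)).sum() / w.sum())

# toy usage: nine subsamples of 500 labeled pairs at ~97% recall
rng = np.random.default_rng(0)
rows = [recall_lower_bounds(rng.binomial(500, 0.97), 500) for _ in range(9)]
print(inverse_variance_aggregate(rows))
```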
[507] DAG DECORation: Continuous Optimization for Structure Learning under Hidden Confounding
Samhita Pal, James O’quinn, Kaveh Aryan, Heather Pua, James P. Long, Amir Asiaee
Main category: cs.LG
TL;DR: DECOR is a joint estimator for learning DAGs and correlated noise models in linear Gaussian SEMs with latent confounding, providing identifiability guarantees under bow-free graphs and uniform eigenvalue conditions.
Details
Motivation: Existing methods either assume independent errors or require pervasive factor structure/nonlinearity for deconfounding. DECOR addresses the need for a unified approach that handles latent confounding without restrictive assumptions.
Method: A fully differentiable likelihood-based estimator that alternates between smooth-acyclic graph updates and convex noise updates, optionally with bow complementarity penalty or post hoc reconciliation.
Result: DECOR matches or outperforms baselines on synthetic benchmarks across varying confounding density, graph density, latent rank, and dimension, showing particular robustness under non-pervasive confounding.
Conclusion: DECOR provides a theoretically grounded and practically effective solution for structure learning in linear Gaussian SEMs with latent confounding, with strong identifiability guarantees and competitive performance across diverse settings.
Abstract: We study structure learning for linear Gaussian SEMs in the presence of latent confounding. Existing continuous methods excel when errors are independent, while deconfounding-first pipelines rely on pervasive factor structure or nonlinearity. We propose DECOR, a single likelihood-based and fully differentiable estimator that jointly learns a DAG and a correlated noise model. Our theory gives simple sufficient conditions for global parameter identifiability: if the mixed graph is bow free and the noise covariance has a uniform eigenvalue margin, then the map from $(\mathbf{B}, \boldsymbol{\Omega})$ to the observational covariance is injective, so both the directed structure and the noise are uniquely determined. The estimator alternates a smooth-acyclic graph update with a convex noise update and can include a light bow complementarity penalty or a post hoc reconciliation step. On synthetic benchmarks that vary confounding density, graph density, latent rank, and dimension with $n<p$, DECOR matches or outperforms strong baselines and is especially robust when confounding is non-pervasive, while remaining competitive under pervasiveness.
[508] Catalyst GFlowNet for electrocatalyst design: A hydrogen evolution reaction case study
Lena Podina, Christina Humer, Alexandre Duval, Victor Schmidt, Ali Ramlaoui, Shahana Chatterjee, Yoshua Bengio, Alex Hernandez-Garcia, David Rolnick, Félix Therrien
Main category: cs.LG
TL;DR: Catalyst GFlowNet is a generative model that uses ML predictors to design efficient catalysts for hydrogen energy storage, successfully identifying platinum as optimal for hydrogen evolution reaction and aiming to extend to oxygen evolution reaction.
Details
Motivation: Need for affordable and high-performance catalysts to enable efficient hydrogen energy storage from renewable sources, addressing the challenge of developing cost-effective electrocatalysts.
Method: Catalyst GFlowNet generative model that leverages machine learning-based predictors of formation and adsorption energy to design crystal surfaces as catalysts.
Result: Successfully identified platinum as the most efficient known catalyst for hydrogen evolution reaction through proof-of-concept application.
Conclusion: The generative modeling framework offers a promising pathway for accelerating the search for novel and efficient catalysts, with plans to extend to oxygen evolution reaction and discover new materials.
Abstract: Efficient and inexpensive energy storage is essential for accelerating the adoption of renewable energy and ensuring a stable supply, despite fluctuations in sources such as wind and solar. Electrocatalysts play a key role in hydrogen energy storage (HES), allowing the energy to be stored as hydrogen. However, the development of affordable and high-performance catalysts for this process remains a significant challenge. We introduce Catalyst GFlowNet, a generative model that leverages machine learning-based predictors of formation and adsorption energy to design crystal surfaces that act as efficient catalysts. We demonstrate the performance of the model through a proof-of-concept application to the hydrogen evolution reaction, a key reaction in HES, for which we successfully identified platinum as the most efficient known catalyst. In future work, we aim to extend this approach to the oxygen evolution reaction, where current optimal catalysts are expensive metal oxides, and open the search space to discover new materials. This generative modeling framework offers a promising pathway for accelerating the search for novel and efficient catalysts.
[509] Policy Gradient Guidance Enables Test Time Control
Jianing Qi, Hao Tang, Zhigang Zhu
Main category: cs.LG
TL;DR: Policy Gradient Guidance (PGG) extends classifier-free guidance from diffusion models to policy gradient methods, enabling test-time behavior control without retraining through interpolation of conditional and unconditional policy branches.
Details
Motivation: To bring the benefits of guidance mechanisms from diffusion models to classical reinforcement learning methods, enabling controllable behavior modulation at test time without additional training.
Method: PGG augments policy gradients with an unconditional branch and interpolates between conditional and unconditional branches. The method shows that the normalization term vanishes under advantage estimation, resulting in a clean guided policy gradient update.
Result: Evaluation on discrete and continuous control benchmarks shows that conditioning dropout helps in simple discrete tasks and low sample regimes, while modestly larger guidance (γ>1) consistently improves stability, sample efficiency, and controllability.
Conclusion: Guidance mechanisms previously limited to diffusion policies can be successfully adapted to standard on-policy reinforcement learning methods, opening new directions for controllable online RL.
Abstract: We introduce Policy Gradient Guidance (PGG), a simple extension of classifier-free guidance from diffusion models to classical policy gradient methods. PGG augments the policy gradient with an unconditional branch and interpolates conditional and unconditional branches, yielding a test-time control knob that modulates behavior without retraining. We provide a theoretical derivation showing that the additional normalization term vanishes under advantage estimation, leading to a clean guided policy gradient update. Empirically, we evaluate PGG on discrete and continuous control benchmarks. We find that conditioning dropout, central to diffusion guidance, offers gains in simple discrete tasks and low sample regimes, but destabilizes continuous control. Training with modestly larger guidance ($\gamma>1$) consistently improves stability, sample efficiency, and controllability. Our results show that guidance, previously confined to diffusion policies, can be adapted to standard on-policy methods, opening new directions for controllable online reinforcement learning.
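The guidance rule itself is compact. Below is a minimal sketch of the test-time interpolation of the conditional and unconditional branches in logit space, mirroring the classifier-free guidance formula; the paper's exact parameterization and the training-time conditioning dropout are not reproduced.

```python
import numpy as np

def guided_policy(logits_cond, logits_uncond, gamma):
    """CFG-style guidance: move from the unconditional toward the conditional
    branch; gamma = 1 recovers the conditional policy, gamma > 1 extrapolates."""
    guided = logits_uncond + gamma * (logits_cond - logits_uncond)
    z = guided - guided.max()          # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()                 # renormalized guided action distribution

# toy usage: sweeping the test-time control knob gamma
lc = np.array([2.0, 0.5, -1.0])        # conditional branch logits
lu = np.array([1.0, 1.0, 0.0])         # unconditional branch logits
for g in (0.0, 1.0, 2.0):
    print(g, guided_policy(lc, lu, g).round(3))
```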
[510] Reinforcement Learning with Action-Triggered Observations
Alexander Ryabchenko, Wenlong Mou
Main category: cs.LG
TL;DR: This paper introduces Action-Triggered Sporadically Traceable MDPs (ATST-MDPs) where state observations occur stochastically based on actions, and proposes ST-LSVI-UCB algorithm that achieves efficient learning with sporadic observations.
Details
Motivation: Many real-world applications have constraints where state observations are stochastically triggered by actions rather than being available at every step, creating a need for reinforcement learning methods that can handle such sporadic observation patterns.
Method: The paper derives Bellman optimality equations for ATST-MDPs, introduces action-sequence learning where agents commit to action sequences until the next observation, and proposes ST-LSVI-UCB, a variant of LSVI-UCB adapted for action-triggered settings under the linear MDP assumption.
Result: ST-LSVI-UCB achieves regret Õ(√Kd³(1-γ)⁻³), where K is number of episodes, d is feature dimension, and γ is discount factor, demonstrating efficient learning is feasible under action-triggered observation constraints.
Conclusion: This work establishes the theoretical foundation for learning with sporadic, action-triggered observations and shows that efficient reinforcement learning remains possible even under such observation constraints.
Abstract: We study reinforcement learning problems where state observations are stochastically triggered by actions, a constraint common in many real-world applications. This framework is formulated as Action-Triggered Sporadically Traceable Markov Decision Processes (ATST-MDPs), where each action has a specified probability of triggering a state observation. We derive tailored Bellman optimality equations for this framework and introduce the action-sequence learning paradigm in which agents commit to executing a sequence of actions until the next observation arrives. Under the linear MDP assumption, value-functions are shown to admit linear representations in an induced action-sequence feature map. Leveraging this structure, we propose off-policy estimators with statistical error guarantees for such feature maps and introduce ST-LSVI-UCB, a variant of LSVI-UCB adapted for action-triggered settings. ST-LSVI-UCB achieves regret $\widetilde O(\sqrt{Kd^3(1-\gamma)^{-3}})$, where $K$ is the number of episodes, $d$ the feature dimension, and $\gamma$ the discount factor (per-step episode non-termination probability). Crucially, this work establishes the theoretical foundation for learning with sporadic, action-triggered observations while demonstrating that efficient learning remains feasible under such observation constraints.
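To make the action-sequence paradigm concrete, here is a toy simulation loop under the stated observation model: each executed action independently triggers a state observation with an action-dependent probability, and the agent replans only from observed states. The env_step and plan placeholders and the probabilities are illustrative, not the paper's estimators.

```python
import random

def run_episode(env_step, plan, p_obs, s0, horizon=100):
    """Sketch of action-sequence learning in an ATST-MDP: plan a whole
    sequence from the last *observed* state and execute it until some
    action stochastically triggers an observation."""
    s_true = s_obs = s0
    t = 0
    while t < horizon:
        for a in plan(s_obs):              # commit to a sequence of actions
            s_true = env_step(s_true, a)   # state evolves unobserved
            t += 1
            if t >= horizon:
                return s_true
            if random.random() < p_obs[a]: # this action revealed the state
                s_obs = s_true             # observe, then replan
                break
    return s_true

# toy usage on a 1-D random walk where action +1 reveals the state more often
random.seed(0)
step = lambda s, a: s + a
plan = lambda s: [1, -1, 1, 1]             # dummy committed sequence
print(run_episode(step, plan, {1: 0.7, -1: 0.1}, s0=0, horizon=20))
```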
[511] Flatness-Aware Stochastic Gradient Langevin Dynamics
Stefano Bruno, Youngsik Hwang, Jaehyeon An, Sotirios Sabanis, Dong-Young Lim
Main category: cs.LG
TL;DR: fSGLD is a novel optimization method that efficiently finds flat minima in deep learning by using random weight perturbation to capture curvature information, providing theoretical guarantees and superior generalization with computational efficiency.
Details
Motivation: Classical SGLD lacks mechanisms to bias optimization toward flat minima, which are crucial for generalization in deep learning. Random weight perturbation shows empirical benefits but lacks rigorous theoretical explanation.
Method: Flatness-Aware SGLD (fSGLD) uses stochastic gradients evaluated at parameters perturbed by isotropic Gaussian noise (Random Weight Perturbation), optimizing a randomized-smoothing objective that implicitly captures Hessian curvature information.
Result: Theoretical analysis shows fSGLD’s invariant measure concentrates on global minimizers of Hessian-trace regularized loss. Experiments demonstrate superior generalization and robustness on noisy-label and vision tasks, with computational cost similar to SGD (half of SAM), and Hessian-spectrum analysis confirms convergence to significantly flatter minima.
Conclusion: fSGLD provides a theoretically grounded method for finding flat minima with proven convergence guarantees, explaining the benefits of random weight perturbation while maintaining computational efficiency comparable to standard SGD.
Abstract: Generalization in deep learning is closely tied to the pursuit of flat minima in the loss landscape, yet classical Stochastic Gradient Langevin Dynamics (SGLD) offers no mechanism to bias its dynamics toward such low-curvature solutions. This work introduces Flatness-Aware Stochastic Gradient Langevin Dynamics (fSGLD), designed to efficiently and provably seek flat minima in high-dimensional nonconvex optimization problems. At each iteration, fSGLD uses the stochastic gradient evaluated at parameters perturbed by isotropic Gaussian noise, commonly referred to as Random Weight Perturbation (RWP), thereby optimizing a randomized-smoothing objective that implicitly captures curvature information. Leveraging these properties, we prove that the invariant measure of fSGLD stays close to a stationary measure concentrated on the global minimizers of a loss function regularized by the Hessian trace whenever the inverse temperature and the scale of random weight perturbation are properly coupled. This result provides a rigorous theoretical explanation for the benefits of random weight perturbation. In particular, we establish non-asymptotic convergence guarantees in Wasserstein distance with the best known rate and derive an excess-risk bound for the Hessian-trace regularized objective. Extensive experiments on noisy-label and large-scale vision tasks, in both training-from-scratch and fine-tuning settings, demonstrate that fSGLD achieves superior or comparable generalization and robustness to baseline algorithms while maintaining the computational cost of SGD, about half that of SAM. Hessian-spectrum analysis further confirms that fSGLD converges to significantly flatter minima.
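A minimal sketch of the update described in the abstract: a stochastic gradient evaluated at RWP-perturbed weights plus Langevin noise. The step size, perturbation scale, and inverse temperature below are free toy parameters; the specific coupling of sigma and beta required by the theory is given in the paper.

```python
import numpy as np

def fsgld_step(theta, grad_fn, eta, sigma, beta, rng):
    """One fSGLD update: gradient at randomly perturbed weights (RWP),
    plus Langevin noise at inverse temperature beta."""
    eps = sigma * rng.standard_normal(theta.shape)    # random weight perturbation
    g = grad_fn(theta + eps)                          # gradient of smoothed objective
    xi = rng.standard_normal(theta.shape)             # Langevin noise
    return theta - eta * g + np.sqrt(2.0 * eta / beta) * xi

# toy usage on a double-well objective f(x) = (x^2 - 1)^2
grad = lambda x: 4.0 * x**3 - 4.0 * x                 # f'(x)
rng = np.random.default_rng(0)
theta = np.array([2.0])
for _ in range(1000):
    theta = fsgld_step(theta, grad, eta=1e-2, sigma=0.1, beta=50.0, rng=rng)
print(theta)                                          # settles near a minimum at +/-1
```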
[512] Poolformer: Recurrent Networks with Pooling for Long-Sequence Modeling
Daniel Gallo Fernández
Main category: cs.LG
TL;DR: Poolformer replaces self-attention with recurrent layers and pooling operations to handle long sequences efficiently, outperforming state-of-the-art models on raw audio data.
Details
Motivation: Self-attention scales quadratically with sequence length, limiting practicality for very long sequences. The paper aims to develop a more efficient alternative.
Method: Poolformer uses recurrent layers instead of self-attention and incorporates pooling operations to reduce sequence length. It’s defined recursively using SkipBlocks containing residual blocks, down-pooling, nested SkipBlocks, up-pooling, and additional residual blocks.
Result: Pooling accelerates training, improves perceptual metrics (FID and IS), prevents overfitting. Long-range dependencies are handled by deep layers while shallow layers handle short-term features. Outperforms SaShiMi and Mamba on raw audio.
Conclusion: Poolformer effectively handles long sequences and shows promise for future applications in text, vision, and multi-modal scenarios where it could process dense representations of images and videos.
Abstract: Sequence-to-sequence models have become central in Artificial Intelligence, particularly following the introduction of the transformer architecture. While initially developed for Natural Language Processing, these models have demonstrated utility across domains, including Computer Vision. Such models require mechanisms to exchange information along the time dimension, typically using recurrent or self-attention layers. However, self-attention scales quadratically with sequence length, limiting its practicality for very long sequences. We introduce Poolformer, a sequence-to-sequence model that replaces self-attention with recurrent layers and incorporates pooling operations to reduce sequence length. Poolformer is defined recursively using SkipBlocks, which contain residual blocks, a down-pooling layer, a nested SkipBlock, an up-pooling layer, and additional residual blocks. We conduct extensive experiments to support our architectural choices. Our results show that pooling greatly accelerates training, improves perceptual metrics (FID and IS), and prevents overfitting. Our experiments also suggest that long-range dependencies are handled by deep layers, while shallow layers take care of short-term features. Evaluated on raw audio, which naturally features long sequence lengths, Poolformer outperforms state-of-the-art models such as SaShiMi and Mamba. Future directions include applications to text and vision, as well as multi-modal scenarios, where a Poolformer-based LLM could effectively process dense representations of images and videos.
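A rough PyTorch rendering of the recursive SkipBlock structure described in the abstract. The GRU-based residual block and the average-pool/nearest-upsample pair are stand-ins chosen for brevity; the paper's actual layers may differ.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Recurrent residual block; a GRU stand-in for the paper's layers."""
    def __init__(self, dim):
        super().__init__()
        self.rnn = nn.GRU(dim, dim, batch_first=True)
    def forward(self, x):                      # x: (batch, dim, time)
        y, _ = self.rnn(x.transpose(1, 2))     # GRU expects (batch, time, dim)
        return x + y.transpose(1, 2)

class SkipBlock(nn.Module):
    """Residual blocks, down-pooling, a nested SkipBlock, up-pooling, and
    more residual blocks, with a skip connection around the pooled core."""
    def __init__(self, dim, depth, pool=2):
        super().__init__()
        self.pre, self.post = ResBlock(dim), ResBlock(dim)
        self.down, self.up = nn.AvgPool1d(pool), nn.Upsample(scale_factor=pool)
        self.inner = SkipBlock(dim, depth - 1, pool) if depth > 1 else nn.Identity()
    def forward(self, x):
        x = self.pre(x)
        y = self.up(self.inner(self.down(x)))  # shorter-sequence branch
        return self.post(x + y)                # skip around the pooled core

x = torch.randn(2, 16, 64)                     # batch=2, channels=16, length=64
print(SkipBlock(dim=16, depth=3)(x).shape)     # torch.Size([2, 16, 64])
```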
[513] C2AL: Cohort-Contrastive Auxiliary Learning for Large-scale Recommendation Systems
Mertcan Cokbas, Ziteng Liu, Zeyi Tao, Chengkai Zhang, Elder Veliz, Qin Huang, Ellie Wen, Huayu Li, Qiang Jin, Murat Duman, Benjamin Au, Guy Lebanon, Sagar Chordia
Main category: cs.LG
TL;DR: The paper proposes C2AL, a method that uses auxiliary learning with partially conflicting labels to address heterogeneity in recommendation systems, preventing models from being dominated by central distribution patterns and improving performance on minority cohorts.
Details
Motivation: Real-world recommendation data consists of heterogeneous user cohorts with distinct distributions, but single global objective training assumes homogeneity. Large models tend to focus on central patterns, neglecting head and tail regions, leading to inactive attention weights and limited learning ability.
Method: Analyze dataset substructures and expose those with strong distributional contrast through auxiliary learning. Leverage partially conflicting auxiliary labels to regularize shared representation, customizing attention layers to preserve mutual information with minority cohorts while maintaining global performance.
Result: Evaluation on massive production datasets with billions of data points across six state-of-the-art models showed up to 0.16% reduction in normalized entropy overall, with gains exceeding 0.30% on targeted minority cohorts. The factorization machine captured fine-grained user-ad interactions effectively.
Conclusion: The attention mechanism plays a key role in factorization machines for shared embedding selection. The proposed C2AL method successfully addresses heterogeneity in recommendation systems by preserving minority cohort information while improving global model performance.
Abstract: Training large-scale recommendation models under a single global objective implicitly assumes homogeneity across user populations. However, real-world data are composites of heterogeneous cohorts with distinct conditional distributions. As models increase in scale and complexity and as more data is used for training, they become dominated by central distribution patterns, neglecting head and tail regions. This imbalance limits the model’s learning ability and can result in inactive attention weights or dead neurons. In this paper, we reveal how the attention mechanism can play a key role in factorization machines for shared embedding selection, and propose to address this challenge by analyzing the substructures in the dataset and exposing those with strong distributional contrast through auxiliary learning. Unlike previous research, which heuristically applies weighted labels or multi-task heads to mitigate such biases, we leverage partially conflicting auxiliary labels to regularize the shared representation. This approach customizes the learning process of attention layers to preserve mutual information with minority cohorts while improving global performance. We evaluated C2AL on massive production datasets with billions of data points each for six SOTA models. Experiments show that the factorization machine is able to capture fine-grained user-ad interactions using the proposed method, achieving up to a 0.16% reduction in normalized entropy overall and delivering gains exceeding 0.30% on targeted minority cohorts.
[514] Diffusion Transformers for Imputation: Statistical Efficiency and Uncertainty Quantification
Zeqi Ye, Minshuo Chen
Main category: cs.LG
TL;DR: This paper provides theoretical analysis of diffusion-based transformers for time-series imputation, deriving sample complexity bounds and uncertainty quantification for missing values based on conditional score function approximation.
Details
Motivation: Despite empirical success of diffusion-based imputation methods, there’s limited theoretical understanding of how well they capture complex spatio-temporal dependencies between missing and observed values, and how to quantify uncertainty in imputations.
Method: The authors derive statistical sample complexity bounds using novel approximation theory for conditional score functions with transformers, construct confidence regions for missing values, and propose a mixed-masking training strategy.
Result: The analysis reveals that imputation efficiency and accuracy are significantly influenced by missing patterns, and the theoretical insights are validated through simulations.
Conclusion: The work bridges the gap between empirical success and theoretical understanding of diffusion-based imputation, providing statistical guarantees and uncertainty quantification while showing the impact of missing patterns on performance.
Abstract: Imputation methods play a critical role in enhancing the quality of practical time-series data, which often suffer from pervasive missing values. Recently, diffusion-based generative imputation methods have demonstrated remarkable success compared to autoregressive and conventional statistical approaches. Despite their empirical success, the theoretical understanding of how well diffusion-based models capture complex spatial and temporal dependencies between the missing values and observed ones remains limited. Our work addresses this gap by investigating the statistical efficiency of conditional diffusion transformers for imputation and quantifying the uncertainty in missing values. Specifically, we derive statistical sample complexity bounds based on a novel approximation theory for conditional score functions using transformers, and, through this, construct tight confidence regions for missing values. Our findings also reveal that the efficiency and accuracy of imputation are significantly influenced by the missing patterns. Furthermore, we validate these theoretical insights through simulation and propose a mixed-masking training strategy to enhance the imputation performance.
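The mixed-masking strategy is only named in the abstract, so the sketch below is one plausible reading (an assumption, not the paper's recipe): training masks that mix scattered pointwise missingness with contiguous block missingness, so the model sees both patterns. All rates and lengths are invented toy parameters.

```python
import numpy as np

def mixed_mask(shape, p_point=0.15, p_block=0.5, block_len=8, rng=None):
    """Return a boolean mask over a (time, channels) window; True marks
    values to hide during training. Combines random point masks with an
    occasional contiguous block mask in one channel."""
    rng = rng or np.random.default_rng()
    T, D = shape
    mask = rng.random(shape) < p_point            # scattered missing points
    if rng.random() < p_block:                    # sometimes add a block
        t0 = rng.integers(0, max(T - block_len, 1))
        d = rng.integers(0, D)
        mask[t0:t0 + block_len, d] = True         # a missing run in one channel
    return mask

print(mixed_mask((48, 3), rng=np.random.default_rng(0)).mean())
```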
[515] Efficiently Generating Correlated Sample Paths from Multi-step Time Series Foundation Models
Ethan Baron, Boris Oreshkin, Ruijun Ma, Hanyu Zhang, Kari Torkkola, Michael W. Mahoney, Andrew Gordon Wilson, Tatiana Konstantinova
Main category: cs.LG
TL;DR: A copula-based method for generating correlated multi-step forecast sample paths from time series foundation models in one forward pass, avoiding expensive autoregressive sampling.
Details
Motivation: Existing time series foundation models only predict independent marginal distributions per time step, lacking joint predictive distributions needed for realistic sample paths with correlation structures. Autoregressive sampling is computationally expensive.
Method: Copula-based approach that generates correlated sample paths from existing multi-step time series foundation models in a single forward pass.
Result: Generates correlated sample paths orders of magnitude faster than autoregressive sampling and improves sample path quality by mitigating snowballing error.
Conclusion: The copula-based approach provides an efficient and effective solution for generating realistic correlated forecast trajectories from time series foundation models.
Abstract: Many time series applications require access to multi-step forecast trajectories in the form of sample paths. Recently, time series foundation models have leveraged multi-step lookahead predictions to improve the quality and efficiency of multi-step forecasts. However, these models only predict independent marginal distributions for each time step, rather than a full joint predictive distribution. To generate forecast sample paths with realistic correlation structures, one typically resorts to autoregressive sampling, which can be extremely expensive. In this paper, we present a copula-based approach to efficiently generate accurate, correlated sample paths from existing multi-step time series foundation models in one forward pass. Our copula-based approach generates correlated sample paths orders of magnitude faster than autoregressive sampling, and it yields improved sample path quality by mitigating the snowballing error phenomenon.
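One standard way to realize the copula idea is a Gaussian copula over forecast horizons, sketched below: correlate uniforms via a Cholesky factor, then invert each step's marginal quantile function. How the paper parameterizes or estimates the copula correlation is not specified here; the AR(1)-style correlation is a toy assumption.

```python
import numpy as np
from scipy import stats

def copula_sample_paths(marginal_quantiles, levels, corr, n_paths, rng):
    """Stitch per-step marginal quantile forecasts (one forward pass of a
    multi-step model) into correlated sample paths via a Gaussian copula."""
    H = marginal_quantiles.shape[0]               # forecast horizon
    L = np.linalg.cholesky(corr)                  # (H, H) copula correlation
    z = rng.standard_normal((n_paths, H)) @ L.T   # correlated normals
    u = stats.norm.cdf(z)                         # correlated uniforms
    paths = np.empty((n_paths, H))
    for t in range(H):                            # invert each step's marginal
        paths[:, t] = np.interp(u[:, t], levels, marginal_quantiles[t])
    return paths

# toy usage: horizon 5, marginals given at 9 quantile levels
rng = np.random.default_rng(0)
levels = np.linspace(0.1, 0.9, 9)
mq = np.cumsum(rng.random((5, 9)), axis=1)        # monotone per-step quantiles
corr = 0.8 ** np.abs(np.subtract.outer(np.arange(5), np.arange(5)))
print(copula_sample_paths(mq, levels, corr, n_paths=3, rng=rng))
```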
[516] xLSTM Scaling Laws: Competitive Performance with Linear Time-Complexity
Maximilian Beck, Kajetan Schweighofer, Sebastian Böck, Sebastian Lehner, Sepp Hochreiter
Main category: cs.LG
TL;DR: xLSTM scales better than Transformers in LLM training and inference, with advantages increasing with longer contexts.
Details
Motivation: To compare scaling behavior between Transformers and xLSTM architectures, especially regarding context length dependence, which was previously overlooked.
Method: Comparative study using IsoFLOP and parametric fit approaches across model sizes (80M-7B) and training tokens (2B-2T), analyzing compute-optimal regimes, context length dependence, and inference-time scaling.
Result: xLSTM scales favorably compared to Transformers in typical LLM scenarios, with widening advantages as training and inference contexts grow.
Conclusion: xLSTM’s linear complexity and favorable scaling make it a promising alternative to Transformers, especially for longer context applications.
Abstract: Scaling laws play a central role in the success of Large Language Models (LLMs), enabling the prediction of model performance relative to compute budgets prior to training. While Transformers have been the dominant architecture, recent alternatives such as xLSTM offer linear complexity with respect to context length while remaining competitive in the billion-parameter regime. We conduct a comparative investigation on the scaling behavior of Transformers and xLSTM along the following lines, providing insights to guide future model design and deployment. First, we study the scaling behavior for xLSTM in compute-optimal and over-training regimes using both IsoFLOP and parametric fit approaches on a wide range of model sizes (80M-7B) and number of training tokens (2B-2T). Second, we examine the dependence of optimal model sizes on context length, a pivotal aspect that was largely ignored in previous work. Finally, we analyze inference-time scaling characteristics. Our findings reveal that in typical LLM training and inference scenarios, xLSTM scales favorably compared to Transformers. Importantly, xLSTM’s advantage widens as training and inference contexts grow.
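For readers unfamiliar with the parametric-fit approach the abstract mentions, here is a sketch using the common Chinchilla-style form L(N, D) = E + A/N^alpha + B/D^beta, fit with scipy; the authors' exact functional form and fitting protocol may differ.

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(ND, E, A, B, alpha, beta):
    """Chinchilla-style parametric loss: L(N, D) = E + A/N^alpha + B/D^beta."""
    N, D = ND
    return E + A / N**alpha + B / D**beta

# toy usage: recover planted coefficients from noiseless synthetic runs over
# a grid of model sizes N (parameters) and token counts D
N = np.array([8e7, 4e8, 1.4e9, 7e9] * 3)
D = np.repeat([2e9, 2e10, 2e11], 4)
y = scaling_law((N, D), 1.7, 400.0, 1200.0, 0.34, 0.28)
popt, _ = curve_fit(scaling_law, (N, D), y, p0=(2.0, 100.0, 100.0, 0.3, 0.3),
                    bounds=([0, 0, 0, 0, 0], [10, 1e5, 1e5, 1, 1]))
print(np.round(popt, 3))   # should roughly recover (1.7, 400, 1200, 0.34, 0.28)
```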
[517] PUL-Inter-slice Defender: An Anomaly Detection Solution for Distributed Slice Mobility Attacks
Ricardo Misael Ayala Molina, Hyame Assem Alameddine, Makan Pourzandi, Chadi Assi
Main category: cs.LG
TL;DR: PUL-Inter-Slice Defender is a machine learning-based anomaly detection system that protects 5G networks against Distributed Slice Mobility (DSM) DDoS attacks using Positive Unlabeled Learning with LSTM Autoencoders and K-Means clustering.
Details
Motivation: To secure 5G networks against DSM attacks that exploit the Inter-Slice Switching capability, which allows seamless switching between Network Slices but creates vulnerabilities for DDoS attacks.
Method: Uses Positive Unlabeled Learning (PUL) combining Long Short-Term Memory Autoencoders and K-Means clustering, leveraging 3GPP KPIs and performance counters as features to detect DSM attack variants in contaminated training data.
Result: Achieved F1-scores exceeding 98.50% on datasets with 10% to 40% attack contamination, outperforming Inter-Slice Defender and other PUL-based solutions using OCSVM with Random Forest and XGBoost.
Conclusion: PUL-Inter-Slice Defender provides an effective and robust solution for detecting DSM attacks in 5G networks, maintaining high performance even with contaminated training data.
Abstract: Network Slices (NSs) are virtual networks operating over a shared physical infrastructure, each designed to meet specific application requirements while maintaining consistent Quality of Service (QoS). In Fifth Generation (5G) networks, User Equipment (UE) can connect to and seamlessly switch between multiple NSs to access diverse services. However, this flexibility, known as Inter-Slice Switching (ISS), introduces a potential vulnerability that can be exploited to launch Distributed Slice Mobility (DSM) attacks, a form of Distributed Denial of Service (DDoS) attack. To secure 5G networks and their NSs against DSM attacks, we present in this work PUL-Inter-Slice Defender, an anomaly detection solution that leverages Positive Unlabeled Learning (PUL) and incorporates a combination of Long Short-Term Memory Autoencoders and K-Means clustering. PUL-Inter-Slice Defender leverages the Third Generation Partnership Project (3GPP) key performance indicators and performance measurement counters as features for its machine learning models to detect DSM attack variants while maintaining robustness in the presence of contaminated training data. When evaluated on data collected from our 5G testbed based on the open-source free5GC and UERANSIM, a UE/Radio Access Network (RAN) simulator, PUL-Inter-Slice Defender achieved F1-scores exceeding 98.50% on training datasets with 10% to 40% attack contamination, consistently outperforming its counterpart Inter-Slice Defender and other PUL-based solutions combining One-Class Support Vector Machine (OCSVM) with Random Forest and XGBoost.
[518] Drop-Muon: Update Less, Converge Faster
Kaja Gruntkowska, Yassine Maziane, Zheng Qu, Peter Richtárik
Main category: cs.LG
TL;DR: Drop-Muon challenges the conventional full-network update approach by proposing randomized progressive training that updates only a subset of layers per step, achieving faster convergence with theoretical guarantees.
Details
Motivation: To challenge the assumption that all layers must be updated at every step in deep learning optimization, showing that full-network updates can be fundamentally suboptimal.
Method: Introduces Drop-Muon, a non-Euclidean Randomized Progressive Training method that updates only a subset of layers per step according to a randomized schedule, combining progressive training efficiency with layer-specific non-Euclidean updates.
Result: Drop-Muon consistently outperforms full-network Muon, achieving the same accuracy up to 1.4× faster in wall-clock time, with rigorous convergence guarantees under layer-wise smoothness conditions.
Conclusion: Full-network updates are not optimal unless specific layer smoothness relationships hold, suggesting a shift in large-scale model training towards more efficient progressive approaches.
Abstract: Conventional wisdom in deep learning optimization dictates updating all layers at every step, a principle followed by all recent state-of-the-art optimizers such as Muon. In this work, we challenge this assumption, showing that full-network updates can be fundamentally suboptimal, both in theory and in practice. We introduce a non-Euclidean Randomized Progressive Training method, Drop-Muon, a simple yet powerful framework that updates only a subset of layers per step according to a randomized schedule, combining the efficiency of progressive training with layer-specific non-Euclidean updates for top-tier performance. We provide rigorous convergence guarantees under both layer-wise smoothness and layer-wise $(L^0, L^1)$-smoothness, covering deterministic and stochastic gradient settings, marking the first such results for progressive training in the stochastic and non-smooth regime. Our cost analysis further reveals that full-network updates are not optimal unless a very specific relationship between layer smoothness constants holds. Through controlled CNN experiments, we empirically demonstrate that Drop-Muon consistently outperforms full-network Muon, achieving the same accuracy up to $1.4\times$ faster in wall-clock time. Together, our results suggest a shift in how large-scale models can be efficiently trained, challenging the status quo and offering a highly efficient, theoretically grounded alternative to full-network updates.
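A toy sketch of the randomized layer-subset idea: each step, only some layers receive an update. The importance-weighting for unbiasedness and the sign-descent stand-in for the per-layer non-Euclidean update are assumptions, not the paper's algorithm.

```python
import torch

def drop_muon_style_step(layers, probs, base_update):
    """Update layer i this step only with probability probs[i]; rescale by
    1/probs[i] so the expected update matches the full-network one (an
    assumed unbiasedness trick, not necessarily the paper's schedule)."""
    for p_i, layer in zip(probs, layers):
        if torch.rand(()) < p_i:                         # layer drawn this step
            for w in layer.parameters():
                if w.grad is not None:
                    w.data.add_(base_update(w.grad) / p_i)

# toy usage with a sign-descent stand-in for the per-layer update
model = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.Linear(8, 2))
model(torch.randn(16, 4)).pow(2).mean().backward()
drop_muon_style_step(list(model), probs=[0.9, 0.5],
                     base_update=lambda g: -1e-2 * g.sign())
```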
[519] Transformers Discover Molecular Structure Without Graph Priors
Tobias Kreiman, Yutong Bai, Fadi Atieh, Elizabeth Weaver, Eric Qu, Aditi S. Krishnapriyan
Main category: cs.LG
TL;DR: Transformers trained directly on Cartesian coordinates can achieve competitive molecular energy and force predictions without predefined graphs or physical priors, challenging the necessity of graph neural networks’ hard-coded biases.
Details
Motivation: GNNs use predefined graphs with fixed receptive fields, which limits expressivity and slows inference. The authors investigate whether pure Transformers without graph structures can effectively model molecular energies and forces.
Method: Train standard Transformers directly on Cartesian coordinates without predefined graphs or physical priors, using a matched training compute budget compared to state-of-the-art equivariant GNNs on the OMol25 dataset.
Result: Transformers achieve competitive energy and force mean absolute errors, learn physically consistent patterns like distance-dependent attention weights, and demonstrate predictable scaling improvements consistent with empirical scaling laws.
Conclusion: Many favorable properties of GNNs can emerge adaptively in Transformers, challenging the necessity of hard-coded graph inductive biases and pointing toward standardized, scalable architectures for molecular modeling.
Abstract: Graph Neural Networks (GNNs) are the dominant architecture for molecular machine learning, particularly for molecular property prediction and machine learning interatomic potentials (MLIPs). GNNs perform message passing on predefined graphs often induced by a fixed radius cutoff or k-nearest neighbor scheme. While this design aligns with the locality present in many molecular tasks, a hard-coded graph can limit expressivity due to the fixed receptive field and slows down inference with sparse graph operations. In this work, we investigate whether pure, unmodified Transformers trained directly on Cartesian coordinates, without predefined graphs or physical priors, can approximate molecular energies and forces. As a starting point for our analysis, we demonstrate how to train a Transformer to competitive energy and force mean absolute errors under a matched training compute budget, relative to a state-of-the-art equivariant GNN on the OMol25 dataset. We discover that the Transformer learns physically consistent patterns, such as attention weights that decay inversely with interatomic distance, and flexibly adapts them across different molecular environments due to the absence of hard-coded biases. The use of a standard Transformer also unlocks predictable improvements with respect to scaling training resources, consistent with empirical scaling laws observed in other domains. Our results demonstrate that many favorable properties of GNNs can emerge adaptively in Transformers, challenging the necessity of hard-coded graph inductive biases and pointing toward standardized, scalable architectures for molecular modeling.
[520] Diffusion^2: Turning 3D Environments into Radio Frequency Heatmaps
Kyoungjun Park, Yifan Yang, Changhan Ge, Lili Qiu, Shiqi Jiang
Main category: cs.LG
TL;DR: Diffusion^2 is a diffusion-based method that uses 3D point clouds to accurately model RF signal propagation across various frequencies with 1.9 dB error and 27x speed improvement.
Details
Motivation: RF signals provide valuable environmental insights beyond RGB cameras, but accurate prediction is challenging due to complex interactions with obstacles like absorption and reflection.
Method: Uses a diffusion-based approach with an RF-3D Encoder to capture RF-related features from 3D point clouds, employing multi-scale embedding to simulate RF signal dissemination.
Result: Achieves 1.9 dB error margin and 27x faster performance than existing methods across various frequency bands and environmental conditions.
Conclusion: Diffusion^2 represents a significant advancement in RF signal propagation modeling, enabling accurate and efficient prediction for wireless applications.
Abstract: Modeling radio frequency (RF) signal propagation is essential for understanding the environment, as RF signals offer valuable insights beyond the capabilities of RGB cameras, which are limited by the visible-light spectrum, lens coverage, and occlusions. It is also useful for supporting wireless diagnosis, deployment, and optimization. However, accurately predicting RF signals in complex environments remains a challenge due to interactions with obstacles such as absorption and reflection. We introduce Diffusion^2, a diffusion-based approach that uses 3D point clouds to model the propagation of RF signals across a wide range of frequencies, from Wi-Fi to millimeter waves. To effectively capture RF-related features from 3D data, we present the RF-3D Encoder, which encapsulates the complexities of 3D geometry along with signal-specific details. These features undergo multi-scale embedding to simulate the actual RF signal dissemination process. Our evaluation, based on synthetic and real-world measurements, demonstrates that Diffusion^2 accurately estimates the behavior of RF signals in various frequency bands and environmental conditions, with an error margin of just 1.9 dB, while running 27x faster than existing methods, marking a significant advancement in the field. Refer to https://rfvision-project.github.io/ for more information.
[521] Fine-Grained Urban Traffic Forecasting on Metropolis-Scale Road Networks
Fedor Velikonivtsev, Oleg Platonov, Gleb Bazhenov, Liudmila Prokhorenkova
Main category: cs.LG
TL;DR: The paper introduces new large-scale urban traffic datasets and a scalable GNN approach for traffic forecasting that outperforms existing methods.
Details
Motivation: Current traffic forecasting benchmarks have limitations including small scale, lack of road connectivity information, and focus on intercity highways rather than complex urban networks.
Method: Proposed a GNN approach without dedicated temporal sequence processing modules for better scalability, and released large urban traffic datasets with nearly 100,000 road segments.
Result: The new datasets are 10x larger than existing ones, and the proposed GNN method achieves better scalability and forecasting performance compared to current neural spatiotemporal models.
Conclusion: The datasets and modeling approach provide a more realistic and challenging benchmark for traffic forecasting research, addressing scalability issues in current methods.
Abstract: Traffic forecasting on road networks is a complex task of significant practical importance that has recently attracted considerable attention from the machine learning community, with spatiotemporal graph neural networks (GNNs) becoming the most popular approach. The proper evaluation of traffic forecasting methods requires realistic datasets, but current publicly available benchmarks have significant drawbacks, including the absence of information about road connectivity for road graph construction, limited information about road properties, and a relatively small number of road segments that falls short of real-world applications. Further, current datasets mostly contain information about intercity highways with sparsely located sensors, while city road networks arguably present a more challenging forecasting task due to much denser roads and more complex urban traffic patterns. In this work, we provide a more complete, realistic, and challenging benchmark for traffic forecasting by releasing datasets representing the road networks of two major cities, with the largest containing almost 100,000 road segments (more than a 10-fold increase relative to existing datasets). Our datasets contain rich road features and provide fine-grained data about both traffic volume and traffic speed, allowing for building more holistic traffic forecasting systems. We show that most current implementations of neural spatiotemporal models for traffic forecasting have problems scaling to datasets of our size. To overcome this issue, we propose an alternative approach to neural traffic forecasting that uses a GNN without a dedicated module for temporal sequence processing, thus achieving much better scalability, while also demonstrating stronger forecasting performance. We hope our datasets and modeling insights will serve as a valuable resource for research in traffic forecasting.
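The modeling idea, dropping the dedicated temporal module, can be sketched in a few lines: stack recent timesteps as node features and apply a plain graph network. The mean-aggregation convolution below is a minimal stand-in, not the authors' architecture.

```python
import torch
import torch.nn as nn

class StackedInputGCN(nn.Module):
    """Forecast without a temporal module: recent timesteps are stacked as
    node features, and two graph-convolution layers map them to the horizon."""
    def __init__(self, t_in, t_out, hidden=64):
        super().__init__()
        self.lin1 = nn.Linear(t_in, hidden)
        self.lin2 = nn.Linear(hidden, hidden)
        self.head = nn.Linear(hidden, t_out)

    def forward(self, x, adj_norm):
        # x: (nodes, t_in); adj_norm: row-normalized adjacency with self-loops
        h = torch.relu(adj_norm @ self.lin1(x))
        h = torch.relu(adj_norm @ self.lin2(h))
        return self.head(h)                   # (nodes, t_out)

# toy usage: 5 road segments, 12 past steps -> 3 future steps
A = torch.rand(5, 5).round()                  # random 0/1 road connectivity
adj = A + torch.eye(5)
adj = adj / adj.sum(1, keepdim=True)          # row-normalize with self-loops
print(StackedInputGCN(12, 3)(torch.randn(5, 12), adj).shape)  # (5, 3)
```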
[522] Knowledge Distillation Detection for Open-weights Models
Qin Shi, Amber Yijia Zheng, Qifan Song, Raymond A. Yeh
Main category: cs.LG
TL;DR: Proposes knowledge distillation detection to identify if a student model was distilled from a teacher model, using only student weights and teacher API access.
Details
Motivation: Addresses growing concerns about model provenance and unauthorized replication through distillation.
Method: Model-agnostic framework combining data-free input synthesis and statistical score computation for detecting distillation.
Result: Improves detection accuracy over strongest baselines by 59.6% on CIFAR-10, 71.2% on ImageNet, and 20.0% for text-to-image generation.
Conclusion: The proposed method effectively detects knowledge distillation across diverse architectures for both classification and generative models.
Abstract: We propose the task of knowledge distillation detection, which aims to determine whether a student model has been distilled from a given teacher, under a practical setting where only the student’s weights and the teacher’s API are available. This problem is motivated by growing concerns about model provenance and unauthorized replication through distillation. To address this task, we introduce a model-agnostic framework that combines data-free input synthesis and statistical score computation for detecting distillation. Our approach is applicable to both classification and generative models. Experiments on diverse architectures for image classification and text-to-image generation show that our method improves detection accuracy over the strongest baselines by 59.6% on CIFAR-10, 71.2% on ImageNet, and 20.0% for text-to-image generation. The code is available at https://github.com/shqii1j/distillation_detection.
[523] Robust Tangent Space Estimation via Laplacian Eigenvector Gradient Orthogonalization
Dhruv Kohli, Sawyer J. Robertson, Gal Mishne, Alexander Cloninger
Main category: cs.LG
TL;DR: LEGO is a spectral method that uses global data structure to estimate tangent spaces by orthogonalizing gradients of low-frequency Laplacian eigenvectors, overcoming LPCA’s noise sensitivity in high-noise settings.
Details
Motivation: LPCA struggles with high-noise data due to the critical trade-off in neighborhood size selection, which requires prior knowledge of geometric and noise characteristics that are often unavailable.
Method: LEGO estimates tangent spaces by orthogonalizing gradients of low-frequency eigenvectors of the graph Laplacian, leveraging global data structure rather than relying solely on local neighborhoods.
Result: LEGO provides significantly more robust tangent space estimates to noise compared to LPCA, leading to marked improvements in manifold learning, boundary detection, and local intrinsic dimension estimation.
Conclusion: LEGO offers a theoretically justified and practically effective alternative to LPCA for tangent space estimation in high-noise settings by utilizing global spectral information.
Abstract: Estimating the tangent spaces of a data manifold is a fundamental problem in data analysis. The standard approach, Local Principal Component Analysis (LPCA), struggles in high-noise settings due to a critical trade-off in choosing the neighborhood size. Selecting an optimal size requires prior knowledge of the geometric and noise characteristics of the data that are often unavailable. In this paper, we propose a spectral method, Laplacian Eigenvector Gradient Orthogonalization (LEGO), that utilizes the global structure of the data to guide local tangent space estimation. Instead of relying solely on local neighborhoods, LEGO estimates the tangent space at each data point by orthogonalizing the gradients of low-frequency eigenvectors of the graph Laplacian. We provide two theoretical justifications of our method. First, a differential geometric analysis on a tubular neighborhood of a manifold shows that gradients of the low-frequency Laplacian eigenfunctions of the tube align closely with the manifold’s tangent bundle, while eigenfunctions with high gradients in directions orthogonal to the manifold lie deeper in the spectrum. Second, a random matrix theoretic analysis also demonstrates that low-frequency eigenvectors are robust to sub-Gaussian noise. Through comprehensive experiments, we demonstrate that LEGO yields tangent space estimates that are significantly more robust to noise than those from LPCA, resulting in marked improvements in downstream tasks such as manifold learning, boundary detection, and local intrinsic dimension estimation.
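A compact numpy sketch of the LEGO recipe as described: build a kNN graph Laplacian, take low-frequency eigenvectors, estimate their gradients by local least squares, and orthogonalize via SVD to read off a tangent basis. Kernel and weighting details are simplified relative to the paper.

```python
import numpy as np
from scipy.sparse.csgraph import laplacian

def lego_tangent(X, k_nn=10, n_eig=8, d=1):
    """Estimate a d-dimensional tangent basis at each point by
    orthogonalizing gradients of low-frequency Laplacian eigenvectors."""
    n = len(X)
    D2 = ((X[:, None] - X[None]) ** 2).sum(-1)
    idx = np.argsort(D2, axis=1)[:, 1:k_nn + 1]          # k nearest neighbors
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k_nn), idx.ravel()] = 1.0
    W = np.maximum(W, W.T)                               # symmetric kNN graph
    _, vecs = np.linalg.eigh(laplacian(W, normed=True))  # ascending eigenvalues
    phi = vecs[:, 1:n_eig + 1]                           # low-frequency, skip constant
    bases = []
    for i in range(n):
        dX = X[idx[i]] - X[i]                            # (k, ambient)
        dphi = phi[idx[i]] - phi[i]                      # (k, n_eig)
        G, *_ = np.linalg.lstsq(dX, dphi, rcond=None)    # eigenvector gradients
        U, _, _ = np.linalg.svd(G)                       # orthogonalize them
        bases.append(U[:, :d])                           # top-d = tangent basis
    return np.stack(bases)

# toy check on a noisy circle: the estimated tangent should be orthogonal
# to the (radial) position vector at every point
t = np.linspace(0, 2 * np.pi, 200, endpoint=False)
rng = np.random.default_rng(0)
X = np.c_[np.cos(t), np.sin(t)] + 0.01 * rng.standard_normal((200, 2))
T = lego_tangent(X)
print(np.abs((T[:, :, 0] * X).sum(1)).mean())            # close to 0
```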
[524] KaVa: Latent Reasoning via Compressed KV-Cache Distillation
Anna Kuzina, Maciej Pioro, Paul N. Whatmough, Babak Ehteshami Bejnordi
Main category: cs.LG
TL;DR: KaVa is a framework that distills knowledge from compressed KV-cache of teacher LLMs into latent-reasoning students via self-distillation, enabling efficient reasoning without explicit chain-of-thought traces.
Details
Motivation: Traditional chain-of-thought reasoning incurs high computational costs and memory overhead, while latent reasoning lacks proper supervision for complex natural-language reasoning tasks.
Method: Uses self-distillation to transfer knowledge from the teacher’s compressed KV-cache to the latent-reasoning student, leveraging continuous latent tokens to align stepwise KV trajectories.
Result: Consistently outperforms latent baselines, shows smaller degradation from equation-only to natural-language traces, scales to larger backbones while preserving efficiency.
Conclusion: Compressed KV-cache distillation provides scalable supervision for latent reasoning, combining CoT accuracy with latent inference efficiency and deployability.
Abstract: Large Language Models (LLMs) excel at multi-step reasoning problems with explicit chain-of-thought (CoT), but verbose traces incur significant computational costs and memory overhead, and often carry redundant, stylistic artifacts. Latent reasoning has emerged as an efficient alternative that internalizes the thought process, but it suffers from a critical lack of supervision, limiting its effectiveness on complex, natural-language reasoning traces. In this work, we propose KaVa, the first framework that bridges this gap by distilling knowledge directly from a compressed KV-cache of the teacher into a latent-reasoning student via self-distillation, leveraging the representational flexibility of continuous latent tokens to align stepwise KV trajectories. We show that the abstract, unstructured knowledge within compressed KV-cache, which lacks direct token correspondence, can serve as a rich supervisory signal for a latent reasoning student. Empirically, the approach consistently outperforms strong latent baselines, exhibits markedly smaller degradation from equation-only to natural-language traces, and scales to larger backbones while preserving efficiency. These results establish compressed KV-cache distillation as a scalable supervision signal for latent reasoning, combining the accuracy of CoT-trained teachers with the efficiency and deployability of latent inference.
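Since the alignment objective is the heart of the method, here is a heavily simplified sketch: mean-pool the teacher's KV trajectory down to the student's latent length and match it under a learned projection. Both the pooling compressor and the MSE objective are assumptions standing in for the paper's actual compression and distillation losses.

```python
import torch
import torch.nn.functional as F

def kv_distill_loss(student_kv, teacher_kv, proj):
    """Align the student's latent KV trajectory with a pooled ("compressed")
    teacher KV-cache under a learned projection; MSE as the match loss."""
    # student_kv: (layers, T_latent, d_s); teacher_kv: (layers, T_teacher, d_t)
    T_s = student_kv.shape[1]
    pooled = F.adaptive_avg_pool1d(teacher_kv.transpose(1, 2), T_s).transpose(1, 2)
    return F.mse_loss(proj(student_kv), pooled)

# toy usage: 4 layers; 8 latent tokens vs. 40 teacher CoT steps
proj = torch.nn.Linear(32, 64)                 # student dim 32 -> teacher dim 64
s = torch.randn(4, 8, 32)
t = torch.randn(4, 40, 64)
print(kv_distill_loss(s, t, proj))
```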
[525] AutoScale: Scale-Aware Data Mixing for Pre-Training LLMs
Feiyang Kang, Yifan Sun, Bingbing Wen, Si Chen, Dawn Song, Rafid Mahmood, Ruoxi Jia
Main category: cs.LG
TL;DR: AutoScale is a two-stage framework for optimizing data composition in LLM pre-training that adapts to scale changes, achieving faster convergence and better performance than traditional methods.
Details
Motivation: Existing practices of determining optimal data mixtures at small scales and directly applying them to larger scales are ineffective, as optimal compositions change with training scale.
Method: Two-stage approach: 1) Fit parametric model to predict loss under different data compositions at smaller budgets, 2) Use theoretical analysis to extrapolate optimal composition to larger scales without retraining.
Result: 28% faster perplexity reduction than baselines, up to 38% speed-up over unweighted training, and best-average performance on downstream tasks when pre-training GPT-2 Large.
Conclusion: Domain importance shifts with training scale, highlighting the necessity for scale-dependent data curation in LLM training.
Abstract: Domain reweighting is an emerging research area aimed at adjusting the relative weights of different data sources to improve the effectiveness and efficiency of LLM pre-training. We show that data mixtures that perform well at smaller scales may not retain their advantage at larger scales, challenging the existing practice of determining competitive mixtures in small-scale experiments and directly applying them at much larger scales. To address this, we propose AutoScale, a two-stage, scale-aware data composition framework. First, AutoScale fits a parametric model that predicts the model’s loss under different data compositions, then uses it to find an approximate best allocation at smaller, more manageable budgets. Next, leveraging a novel theoretical analysis of how optimal compositions evolve with scale, AutoScale extrapolates that composition to larger budgets without further retraining. Empirically, AutoScale accelerates convergence and improves downstream performance. For instance, when pre-training GPT-2 Large, it achieves a 28% faster perplexity reduction than baselines and up to a 38% speed-up over unweighted training, while yielding best-average results on various downstream tasks. Overall, our findings illustrate how domain importance shifts with training scale, underscoring the need for scale-dependent data curation in LLM training. Our code is open-sourced.
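A minimal sketch of the stage-1 idea, under assumptions: a few small-budget pilot runs yield (mixture, loss) pairs, a simple quadratic surrogate is fit to them, and the predicted-best mixture on the simplex is found by grid search. The surrogate form and all numbers are illustrative stand-ins for the paper's parametric model.

```python
import numpy as np

# Pilot runs at a small budget: domain-weight vectors and the losses they
# produced (placeholder values for illustration only).
pilot_w = np.array([[0.50, 0.30, 0.20],
                    [0.20, 0.50, 0.30],
                    [0.30, 0.20, 0.50],
                    [0.34, 0.33, 0.33]])
pilot_loss = np.array([3.12, 3.05, 3.20, 3.08])

def features(w):                       # surrogate: loss ~ a + b.w + c.w^2
    return np.concatenate([[1.0], w, w ** 2])

X = np.stack([features(w) for w in pilot_w])
theta, *_ = np.linalg.lstsq(X, pilot_loss, rcond=None)

# Grid search over the probability simplex for the predicted-best mixture.
grid = [np.array([a, b, 1.0 - a - b])
        for a in np.linspace(0, 1, 51)
        for b in np.linspace(0, 1, 51) if a + b <= 1.0]
best_mixture = min(grid, key=lambda w: features(w) @ theta)
```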
[526] QSpec: Speculative Decoding with Complementary Quantization Schemes
Juntao Zhao, Wenhao Lu, Sheng Wang, Lingpeng Kong, Chuan Wu
Main category: cs.LG
TL;DR: QSpec is a novel quantization paradigm that combines low-precision joint quantization for fast drafting with high-precision weight-only quantization for accurate verification using speculative decoding, achieving up to 1.64x speedup without quality degradation.
Details
Motivation: Activation-weight joint quantization suffers from substantial performance degradation on multi-step reasoning tasks, creating a need for a solution that maintains efficiency while preserving quality.
Method: QSpec decouples efficiency from quality by integrating two complementary schemes via speculative decoding: low-precision joint quantization for fast drafting and high-precision weight-only quantization for accurate verification, reusing weights and KV cache across stages.
Result: QSpec achieves up to 1.64x speedup without quality degradation compared to high-precision baselines, and outperforms state-of-the-art speculative decoding methods by up to 1.55x in batched settings.
Conclusion: QSpec provides a practical and scalable solution for high-fidelity quantized LLM serving under memory-constrained scenarios, supporting plug-and-play deployment and generalizing well across model scales, quantization methods, and workloads.
Abstract: Quantization is widely adopted to accelerate inference and reduce memory consumption in large language models (LLMs). While activation-weight joint quantization enables efficient low-precision decoding, it suffers from substantial performance degradation on multi-step reasoning tasks. We propose QSpec, a novel quantization paradigm that decouples efficiency from quality by integrating two complementary schemes via speculative decoding: low-precision joint quantization for fast drafting and high-precision weight-only quantization for accurate verification. QSpec reuses both weights and KV cache across stages, enabling near-zero-cost switching without retraining or auxiliary models. Compared to high-precision baselines, QSpec achieves up to 1.64x speedup without quality degradation, and outperforms state-of-the-art speculative decoding methods by up to 1.55x in batched settings. Furthermore, QSpec supports plug-and-play deployment and generalizes well across model scales, quantization methods, and workloads. These properties make QSpec a practical and scalable solution for high-fidelity quantized LLM serving under memory-constrained scenarios. Our code is available at https://github.com/hku-netexplo-lab/QSpec.
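The draft-then-verify control flow can be sketched at the token level as follows; `draft_next` (standing in for the low-precision joint-quantized pass) and `verify_next` (the high-precision weight-only pass) are hypothetical functions, and the greedy acceptance rule is a simplification of full speculative decoding.

```python
def speculative_decode(draft_next, verify_next, prompt, k=4, max_len=32):
    seq = list(prompt)
    while len(seq) < max_len:
        # 1) Draft k tokens cheaply with the low-precision model.
        ctx, draft = list(seq), []
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) Verify with the high-precision model: accept drafted tokens
        #    while the verifier agrees; on the first disagreement, keep the
        #    verifier's token instead and start a new round.
        ctx = list(seq)
        for t in draft:
            v = verify_next(ctx)
            if v == t:
                seq.append(t)
                ctx.append(t)
            else:
                seq.append(v)
                break
    return seq[:max_len]
```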
[527] Unraveling Indirect In-Context Learning Using Influence Functions
Hadi Askari, Shivanshu Gupta, Terry Tong, Fei Wang, Anshuman Chhabra, Muhao Chen
Main category: cs.LG
TL;DR: This paper introduces Indirect In-Context Learning (ICL), using Influence Functions (IFs) for demonstration selection in Mixture of Tasks and Noisy ICL scenarios, showing improved performance over traditional methods.
Details
Motivation: To address limitations of traditional ICL in real-world scenarios where demonstrations come from multiple tasks or contain noise/mislabeling, requiring more robust selection strategies.
Method: Proposed Indirect ICL using Influence Functions (IFs) for demonstration selection. Evaluated in two settings: Mixture of Tasks (28 diverse tasks) and Noisy ICL (mislabeled/adversarial examples). Combined IFs with traditional selectors like BertScore-Recall and Cosine Similarity.
Result: For Mixture of Tasks: 0.37% and 1.45% accuracy gains for 3-shot and 5-shot setups. For Noisy ICL: 2.90% and 2.94% accuracy boosts on noisy GLUE benchmarks. For adversarial setting: 32.89% reduction in Attack Success Rate compared to task-aware methods.
Conclusion: IFs provide a robust framework for demonstration selection that generalizes beyond traditional ICL, offering valuable insights for handling diverse tasks and noisy scenarios in real-world applications.
Abstract: In this work, we introduce a novel paradigm for generalized In-Context Learning (ICL), termed Indirect In-Context Learning. In Indirect ICL, we explore demonstration selection strategies tailored for two distinct real-world scenarios: Mixture of Tasks and Noisy ICL. We systematically evaluate the effectiveness of Influence Functions (IFs) as a selection tool for these settings, highlighting the potential of IFs to better capture the informativeness of examples within the demonstration pool. For the Mixture of Tasks setting, demonstrations are drawn from 28 diverse tasks, including MMLU, BigBench, StrategyQA, and CommonsenseQA. We demonstrate that combining BertScore-Recall (BSR) with an IF surrogate model can further improve performance, leading to average absolute accuracy gains of 0.37% and 1.45% for 3-shot and 5-shot setups when compared to traditional ICL metrics. In the Noisy ICL setting, we examine scenarios where demonstrations might be mislabeled or have adversarial noise. Our experiments show that reweighting traditional ICL selectors (BSR and Cosine Similarity) with IF-based selectors boosts accuracy by an average of 2.90% for Cosine Similarity and 2.94% for BSR on noisy GLUE benchmarks. For the adversarial sub-setting, we show the utility of using IFs for task-agnostic demonstration selection for backdoor attack mitigation, showing a 32.89% reduction in Attack Success Rate compared to task-aware methods. In sum, we propose a robust framework for demonstration selection that generalizes beyond traditional ICL, offering valuable insights into the role of IFs for Indirect ICL.
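One plausible reading of "reweighting traditional ICL selectors with IF-based selectors" is a convex combination of normalized scores, sketched below; the z-normalization and the mixing weight are illustrative assumptions.

```python
import numpy as np

def select_demonstrations(sim_scores, influence_scores, k=5, alpha=0.5):
    """Combine a similarity selector (e.g. BSR or cosine) with influence
    scores and return the indices of the top-k candidate demonstrations."""
    z = lambda s: (s - s.mean()) / (s.std() + 1e-8)   # normalize each signal
    combined = alpha * z(sim_scores) + (1 - alpha) * z(influence_scores)
    return np.argsort(combined)[::-1][:k]
```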
[528] Differential Information Distribution: A Bayesian Perspective on Direct Preference Optimization
Yunjae Won, Hyunji Lee, Hyeonbin Hwang, Minjoon Seo
Main category: cs.LG
TL;DR: DPO’s reward design, training dynamics, and downstream capabilities emerge from learning Differential Information Distribution (DID), which represents Bayesian evidence needed to update policies.
Details
Motivation: To understand the theoretical foundations of DPO, including the rationale behind its log-ratio reward, how preference datasets shape training dynamics, and how dynamics impact downstream capabilities.
Method: Introduce Differential Information Distribution (DID) from a Bayesian perspective, interpreting preference optimization as learning evidence to update reference policies into target policies.
Result: DPO’s log-ratio reward is uniquely justified when preferences encode DID; training dynamics stem from power-law DID relationships; high-entropy DID improves instruction-following while low-entropy DID benefits knowledge QA.
Conclusion: DPO’s design and behavior naturally emerge from learning Differential Information, providing both theoretical foundation and practical guidance for preference-based alignment.
Abstract: Direct Preference Optimization (DPO) has been widely used for aligning language models with human preferences in a supervised manner. However, several key questions remain unresolved: the rationale behind its log-ratio reward, how the statistical structure of preference datasets shapes its training dynamics, and how those dynamics impact downstream capabilities. We approach these questions from a Bayesian perspective, interpreting the goal of preference optimization as learning the differential information required to update a reference policy into a target policy. To formalize this view, we introduce the Differential Information Distribution (DID), defined as the distribution over samples that carry the Bayesian evidence required to update policies. We introduce three complementary insights by viewing preference optimization through the DID. First, we find that DPO’s log-ratio reward is uniquely justified when preferences encode the Differential Information needed to update a reference policy into the target policy. Second, we discuss how commonly observed training dynamics in DPO, including changes in log-likelihood and policy exploration, stem from a power-law DID relationship. Finally, we analyze how training dynamics influence downstream performance using the entropy of DID, a principled measure of uncertainty in the learned information. We observe that learning high-entropy DID improves open-ended instruction-following, while low-entropy DID benefits knowledge-intensive QA. Taken together, our results show that DPO’s reward design, training dynamics, and downstream capabilities all emerge as natural consequences of learning Differential Information, offering both a principled theoretical foundation and practical guidance for preference-based alignment.
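For reference, the log-ratio reward the paper analyzes is the standard DPO objective, r(x, y) = β[log πθ(y|x) − log πref(y|x)], inside a Bradley–Terry preference loss; a minimal sketch over per-sequence log-probabilities:

```python
import numpy as np

def dpo_loss(logp_theta_w, logp_ref_w, logp_theta_l, logp_ref_l, beta=0.1):
    # Log-ratio rewards of the chosen (w) and rejected (l) responses.
    r_w = beta * (logp_theta_w - logp_ref_w)
    r_l = beta * (logp_theta_l - logp_ref_l)
    # Bradley-Terry preference likelihood: -log sigmoid(r_w - r_l).
    return -np.log(1.0 / (1.0 + np.exp(-(r_w - r_l))))
```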
[529] Forget Forgetting: Continual Learning in a World of Abundant Memory
Dongkyu Cho, Taesup Moon, Rumi Chunara, Kyunghyun Cho, Sungmin Cha
Main category: cs.LG
TL;DR: This paper challenges traditional continual learning assumptions by focusing on GPU time rather than memory constraints, proposing Weight Space Consolidation to address the stability-plasticity trade-off in memory-abundant scenarios.
Details
Motivation: Traditional continual learning focuses on minimizing exemplar memory, but modern systems face GPU time as the primary bottleneck. The paper investigates a practical regime where memory is abundant but full retraining remains expensive.
Method: Proposes Weight Space Consolidation: a lightweight method combining rank-based parameter resets to restore plasticity with weight averaging to enhance stability.
Result: Outperforms strong baselines while matching the low computational cost of replay, validated on class-incremental learning with image classifiers and continual instruction tuning with LLMs.
Conclusion: Challenges long-standing CL assumptions and establishes a new, cost-efficient baseline for real-world CL systems where memory is no longer the limiting factor.
Abstract: Continual learning (CL) has traditionally focused on minimizing exemplar memory, a constraint often misaligned with modern systems where GPU time, not storage, is the primary bottleneck. This paper challenges this paradigm by investigating a more realistic regime: one where memory is abundant enough to mitigate forgetting, but full retraining from scratch remains prohibitively expensive. In this practical “middle ground”, we find that the core challenge shifts from stability to plasticity, as models become biased toward prior tasks and struggle to learn new ones. Conversely, improved stability allows simple replay baselines to outperform the state-of-the-art methods at a fraction of the GPU cost. To address this newly surfaced trade-off, we propose Weight Space Consolidation, a lightweight method that combines (1) rank-based parameter resets to restore plasticity with (2) weight averaging to enhance stability. Validated on both class-incremental learning with image classifiers and continual instruction tuning with large language models, our approach outperforms strong baselines while matching the low computational cost of replay, offering a scalable alternative to expensive full-retraining. These findings challenge long-standing CL assumptions and establish a new, cost-efficient baseline for real-world CL systems where exemplar memory is no longer the limiting factor.
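A single-matrix sketch of the two ingredients, with an illustrative importance proxy and reset fraction (the paper's exact scoring rule may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
W_prev = rng.normal(size=(128, 64))                  # after previous tasks
W_curr = W_prev + 0.1 * rng.normal(size=(128, 64))   # after the new task

# (1) Rank-based reset: rank rows by an importance proxy (here, the norm of
# the task-induced change) and re-initialize the bottom 10% to restore
# plasticity. Proxy and fraction are illustrative choices.
change = np.linalg.norm(W_curr - W_prev, axis=1)
reset = np.argsort(change)[: W_curr.shape[0] // 10]
W_curr[reset] = rng.normal(scale=0.02, size=(len(reset), W_curr.shape[1]))

# (2) Weight averaging with the previous solution for stability.
W_consolidated = 0.5 * W_prev + 0.5 * W_curr
```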
[530] CoLA: Compute-Efficient Pre-Training of LLMs via Low-Rank Activation
Ziyue Liu, Ruijie Zhang, Zhengyang Wang, Mingsong Yan, Zi Yang, Paul Hovland, Bogdan Nicolae, Franck Cappello, Sui Tang, Zheng Zhang
Main category: cs.LG
TL;DR: CoLA replaces full-size MLPs and attention projection layers in LLMs with compute-efficient auto-encoders that enforce low-rank activations, reducing computing cost by 2x and improving throughput by 1.86x while maintaining performance.
Details
Motivation: The activations of pre-trained LLMs exhibit low-rank property, and full-size MLPs/projection layers consume extensive computational resources in pre-training.
Method: Propose CoLA and CoLA-M to replace full-size layers with auto-encoders that naturally enforce low-rank activations throughout training, eliminating activation redundancy.
Result: Experiments on LLaMA models (60M to 7B parameters) show 2x computing cost reduction, 1.86x throughput improvement while maintaining full-rank performance. CoLA-M further reduces memory cost without throughput sacrifice.
Conclusion: CoLA offers superior parameter, computing, and memory efficiency, producing LLMs that are 2x smaller for faster inference with lower memory cost on resource-constrained platforms.
Abstract: The full-size MLPs and the projection layers in attention introduce tremendous model sizes of large language models (LLMs), consuming extensive computational resources in pre-training. We empirically observe that the activations of pre-trained LLMs exhibit a low-rank property. Motivated by such observations, we propose CoLA and its memory-efficient implementation, CoLA-M, to replace these full-size layers with compute-efficient auto-encoders that naturally enforce low-rank activations throughout training. This fundamental architectural change eliminates the activation redundancy and significantly boosts model capacity and training efficiency. Experiments on LLaMA models with 60 million to 7 billion parameters show that CoLA reduces the computing cost by $2\times$ and improves training throughput by $1.86\times$ while maintaining full-rank level performance. CoLA-M further squeezes memory cost without sacrificing throughput, offering a pre-training approach with collectively superior parameter, computing, and memory efficiency. The LLMs produced are also $2\times$ smaller, enabling faster inference with lower memory cost on resource-constrained platforms.
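The structural change is easy to see in parameter counts: a full d×d projection versus a rank-r auto-encoder pair. A sketch, with the nonlinearity placement as an assumption:

```python
import numpy as np

d, r = 4096, 1024                        # model width and bottleneck rank
full_params = d * d                      # one full-size projection: ~16.8M
auto_encoder_params = 2 * d * r          # encoder + decoder: ~8.4M (2x fewer)

rng = np.random.default_rng(0)
W_enc = rng.normal(size=(r, d)) / np.sqrt(d)   # down-projection (encoder)
W_dec = rng.normal(size=(d, r)) / np.sqrt(r)   # up-projection (decoder)

x = rng.normal(size=(d,))
h = np.maximum(W_enc @ x, 0.0)           # activations confined to rank r
y = W_dec @ h                            # decoded back to model width
```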
[531] Gaussian DP for Reporting Differential Privacy Guarantees in Machine Learning
Juan Felipe Gomez, Bogdan Kulynych, Georgios Kaissis, Flavio P. Calmon, Jamie Hayes, Borja Balle, Antti Honkela
Main category: cs.LG
TL;DR: Current DP reporting practices for ML algorithms are incomplete and misleading. The paper advocates using non-asymptotic Gaussian Differential Privacy (GDP) as the primary communication method, showing it captures DP-SGD’s privacy profile with virtually no error.
Details
Motivation: Standard (ε,δ)-DP reporting provides an incomplete picture and can mislead about actual privacy guarantees, potentially suggesting highly accurate inference attacks that don't exist.
Method: Uses numerical accountants to compute privacy profiles and f-DP curves of DP-SGD with arbitrary accuracy, and applies a decision-theoretic metric over DP representations to provide non-asymptotic GDP bounds.
Result: GDP captures the entire privacy profile of DP-SGD and related algorithms with virtually no error. Validated on state-of-the-art DP image classification and TopDown algorithm for US Census, showing excellent fit in all cases.
Conclusion: GDP is superior for communicating DP guarantees in ML, though the approach has strengths and weaknesses that should be considered, and could benefit other privacy mechanisms.
Abstract: Current practices for reporting the level of differential privacy (DP) protection for machine learning (ML) algorithms such as DP-SGD provide an incomplete and potentially misleading picture of the privacy guarantees. For instance, if only a single $(\varepsilon,\delta)$ is known about a mechanism, standard analyses show that there exist highly accurate inference attacks against training data records, when, in fact, such accurate attacks might not exist. In this position paper, we argue that using non-asymptotic Gaussian Differential Privacy (GDP) as the primary means of communicating DP guarantees in ML avoids these potential downsides. Using two recent developments in the DP literature: (i) open-source numerical accountants capable of computing the privacy profile and $f$-DP curves of DP-SGD to arbitrary accuracy, and (ii) a decision-theoretic metric over DP representations, we show how to provide non-asymptotic bounds on GDP using numerical accountants, and show that GDP can capture the entire privacy profile of DP-SGD and related algorithms with virtually no error, as quantified by the metric. To support our claims, we investigate the privacy profiles of state-of-the-art DP large-scale image classification, and the TopDown algorithm for the U.S. Decennial Census, observing that GDP fits their profiles remarkably well in all cases. We conclude with a discussion on the strengths and weaknesses of this approach, and discuss which other privacy mechanisms could benefit from GDP.
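For context, a single μ-GDP parameter determines the full (ε, δ(ε)) privacy profile via the standard Gaussian DP conversion of Dong, Roth, and Su, which is straightforward to evaluate:

```python
from math import exp
from statistics import NormalDist

Phi = NormalDist().cdf

def gdp_delta(eps, mu):
    # delta(eps) for a mu-GDP mechanism (Dong, Roth & Su conversion):
    # delta = Phi(-eps/mu + mu/2) - e^eps * Phi(-eps/mu - mu/2).
    return Phi(-eps / mu + mu / 2) - exp(eps) * Phi(-eps / mu - mu / 2)

# One mu summarizes the entire curve reported to end users.
profile = [(eps, gdp_delta(eps, mu=1.0)) for eps in (0.5, 1.0, 2.0, 4.0)]
```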
[532] Knowledge-guided machine learning for county-level corn yield prediction under drought
Xiaoyu Wang, Yijia Xu, Jingyi Huang, Zhengwei Yang, Yanbo Huang, Rajat Bindlish, Zhou Zhang
Main category: cs.LG
TL;DR: KGML-SM framework combines process-based models and machine learning for crop yield prediction, incorporating soil moisture as key intermediate variable and using drought-aware loss function to improve accuracy.
Details
Motivation: Address limitations of traditional process-based models (struggle with large RS data) and ML models (black box nature), while incorporating soil moisture's role in corn growth that previous works overlooked.
Method: Knowledge-Guided Machine Learning with Soil Moisture (KGML-SM) framework treating soil moisture as intermediate variable, plus drought-aware loss function that penalizes overestimation in drought conditions.
Result: KGML-SM model outperformed traditional ML models, enabled exploration of drought-soil moisture-yield relationships, and provided interpretability for prediction errors.
Conclusion: KGML-SM successfully bridges gap between process-based and ML models, emphasizes soil moisture’s importance in corn yield prediction, and provides interpretable framework for future optimization.
Abstract: Remote sensing (RS) technique, enabling the non-contact acquisition of extensive ground observations, is a valuable tool for crop yield predictions. Traditional process-based models struggle to incorporate large volumes of RS data, and most users lack understanding of crop growth mechanisms. In contrast, machine learning (ML) models are often criticized as “black boxes” due to their limited interpretability. To address these limitations, we utilized Knowledge-Guided Machine Learning (KGML), a framework that leverages the strengths of both process-based and ML models. Existing works have either overlooked the role of soil moisture in corn growth or did not embed this effect into their models. To bridge this gap, we developed the Knowledge-Guided Machine Learning with Soil Moisture (KGML-SM) framework, treating soil moisture as an intermediate variable in corn growth to emphasize its key role in plant development. Additionally, based on the prior knowledge that the model may overestimate under drought conditions, we designed a drought-aware loss function that penalized predicted yield in drought-affected areas. Our experiments showed that the KGML-SM model outperformed other traditional ML models. We explored the relationships between drought, soil moisture, and corn yield prediction by assessing the importance of different features within the model, and analyzing how soil moisture impacts predictions across different regions and time periods. Finally we provided interpretability for prediction errors to guide future model optimization.
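A minimal sketch of a drought-aware loss in the spirit described above: ordinary squared error plus an asymmetric penalty on overestimation in drought-flagged samples. The penalty form and weight are illustrative assumptions, not the paper's exact loss.

```python
import numpy as np

def drought_aware_loss(y_pred, y_true, drought_mask, penalty=2.0):
    """Squared error everywhere, plus an extra penalty on overestimation
    in drought-flagged samples (drought_mask entries in {0, 1})."""
    sq = (y_pred - y_true) ** 2
    over = np.maximum(y_pred - y_true, 0.0)   # penalize overestimation only
    return np.mean(sq + penalty * drought_mask * over ** 2)
```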
[533] DynaGuard: A Dynamic Guardrail Model With User-Defined Policies
Monte Hoover, Vatsal Baherwani, Neel Jain, Khalid Saifullah, Joseph Vincent, Chirag Jain, Melissa Kazemi Rad, C. Bayan Bruss, Ashwinee Panda, Tom Goldstein
Main category: cs.LG
TL;DR: Dynamic guardian models that evaluate text based on user-defined policies, offering fast detection or chain-of-thought reasoning, matching static models in accuracy while handling free-form policies efficiently.
Details
Motivation: Standard guardian models only detect predefined static harm categories, limiting their usefulness for different application domains that require custom policies.
Method: Proposed dynamic guardian models that evaluate text based on user-defined policies, supporting both fast detection and chain-of-thought reasoning approaches.
Result: Dynamic guardian models match static models in detection accuracy for static harm categories while identifying violations of free-form policies with comparable accuracy to frontier reasoning models, but in a fraction of the time.
Conclusion: Dynamic guardian models provide flexible and efficient policy enforcement for chatbots, extending beyond static harm categories to user-defined policies across different application domains.
Abstract: Guardian models are used to supervise and moderate the outputs of user-facing chatbots, enforcing guardrails and detecting bad behaviors. Standard guardian models like LlamaGuard detect predefined, static categories of harms. We propose dynamic guardian models that evaluate text based on user-defined policies, making them useful for different application domains that are not addressed by standard guardian models. Our dynamic guardian models can be used for fast detection of policy violations or with chain-of-thought reasoning that articulates and justifies the model outputs. Our dynamic guardian models match static models in detection accuracy for static harm categories while identifying violations of free-form policies with accuracy comparable to frontier reasoning models in a fraction of the time.
[534] When Reasoning Meets Compression: Understanding the Effects of LLMs Compression on Large Reasoning Models
Nan Zhang, Eugene Kwek, Yusen Zhang, Ngoc-Hieu Nguyen, Prasenjit Mitra, Rui Zhang
Main category: cs.LG
TL;DR: The paper investigates how compression methods (quantization, distillation, pruning) affect reasoning capabilities in large reasoning models, using DeepSeek-R1 models and mechanistic interpretation techniques.
Details
Motivation: Existing studies fail to sufficiently compare all three compression methods on LRMs and lack in-depth interpretation analysis of how reasoning capabilities are compromised during compression.
Method: Benchmark quantized, distilled, and pruned DeepSeek-R1 models on four reasoning datasets, and use difference of means and attribution patching techniques to interpret fine-grained causal relationships between weights and reasoning capabilities.
Result: Dynamically quantized 2.51-bit R1 reaches close-to-R1 performance. Three main findings: weight count impacts knowledge memorization more than reasoning; MLP up projection in final layer is critical; protecting just 2% of excessively compressed weights can raise accuracy by 6.57%.
Conclusion: The study provides insights into compression effects on reasoning models, identifies critical weight components, and demonstrates that targeted protection of specific weights can significantly improve compressed model performance.
Abstract: Compression methods, including quantization, distillation, and pruning, improve the computational efficiency of large reasoning models (LRMs). However, existing studies either fail to sufficiently compare all three compression methods on LRMs or lack in-depth interpretation analysis. In this paper, we investigate how the reasoning capabilities of LRMs are compromised during compression, through performance benchmarking and mechanistic interpretation. To uncover the effects of compression on reasoning performance, we benchmark quantized, distilled, and pruned DeepSeek-R1 models on four reasoning datasets (AIME 2024, FOLIO, Temporal Sequences, and MuSiQue). To precisely locate compression effects on model weights, we adapt difference of means and attribution patching techniques, focusing on the activation of every linear component in compressed LRMs, to interpret fine-grained causal relationships between weights and various reasoning capabilities. This fine-grained interpretation addresses a fundamental question of compression: which weights are the most important for reasoning? Overall, we find dynamically quantized 2.51-bit R1 reaches close-to-R1 performance. With empirical verification, we present three main findings that generalize across both Llama and Qwen: (1) Weight count has a greater impact on LRMs’ knowledge memorization than reasoning, highlighting the risks of pruning and distillation; (2) The MLP up projection in the final layer of distilled LRMs is one of the most important components, offering a new perspective on locating critical weights - a fundamental problem in model compression; and (3) Current quantization methods overly compress the final-layer modules and MLP gate projections, so protecting just 2% of all weights that are excessively compressed can raise average accuracy by 6.57%, greatly surpassing the state-of-the-art.
[535] AbsTopK: Rethinking Sparse Autoencoders For Bidirectional Features
Xudong Zhu, Mohammad Mahdi Khalili, Zhihui Zhu
Main category: cs.LG
TL;DR: The paper introduces a principled framework for deriving sparse autoencoders (SAEs) from dictionary learning and proposes AbsTopK SAE, a new variant that overcomes limitations of existing SAEs by preserving bidirectional activations.
Details
Motivation: Existing SAE variants lack a principled derivation framework and suffer from structural constraints that enforce non-negativity, preventing features from representing bidirectional concepts and causing semantic fragmentation.
Method: The authors unroll the proximal gradient method for sparse coding to derive SAE variants, revealing that a single-step update recovers common SAEs. They propose AbsTopK SAE based on ℓ₀ sparsity constraint with hard thresholding over largest-magnitude activations.
Result: AbsTopK SAE improves reconstruction fidelity, enhances interpretability, enables single features to encode contrasting concepts, and matches or surpasses supervised Difference-in-Mean method across four LLMs and seven tasks.
Conclusion: The proposed framework provides principled derivation of SAEs, and AbsTopK SAE addresses fundamental limitations of existing approaches by supporting bidirectional concept representation, leading to improved performance and interpretability.
Abstract: Sparse autoencoders (SAEs) have emerged as powerful techniques for interpretability of large language models (LLMs), aiming to decompose hidden states into meaningful semantic features. While several SAE variants have been proposed, there remains no principled framework to derive SAEs from the original dictionary learning formulation. In this work, we introduce such a framework by unrolling the proximal gradient method for sparse coding. We show that a single-step update naturally recovers common SAE variants, including ReLU, JumpReLU, and TopK. Through this lens, we reveal a fundamental limitation of existing SAEs: their sparsity-inducing regularizers enforce non-negativity, preventing a single feature from representing bidirectional concepts (e.g., male vs. female). This structural constraint fragments semantic axes into separate, redundant features, limiting representational completeness. To address this issue, we propose AbsTopK SAE, a new variant derived from the $\ell_0$ sparsity constraint that applies hard thresholding over the largest-magnitude activations. By preserving both positive and negative activations, AbsTopK uncovers richer, bidirectional conceptual representations. Comprehensive experiments across four LLMs and seven probing and steering tasks show that AbsTopK improves reconstruction fidelity, enhances interpretability, and enables single features to encode contrasting concepts. Remarkably, AbsTopK matches or even surpasses the Difference-in-Mean method, a supervised approach that requires labeled data for each concept and has been shown in prior work to outperform SAEs.
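The difference between TopK and AbsTopK activations is small in code but consequential; a sketch (the TopK variant here is a simplified non-negative rendering):

```python
import numpy as np

def topk(z, k):
    # Simplified TopK SAE activation: keep the k largest values, clip at 0.
    out = np.zeros_like(z)
    idx = np.argsort(z)[-k:]
    out[idx] = np.maximum(z[idx], 0.0)
    return out

def abstopk(z, k):
    # AbsTopK: keep the k largest-magnitude entries, signs intact.
    out = np.zeros_like(z)
    idx = np.argsort(np.abs(z))[-k:]
    out[idx] = z[idx]
    return out

z = np.array([2.0, -3.0, 0.5, -0.1, 1.0])
print(topk(z, 2))     # [ 2.  0.  0.  0.  1.] -> the -3.0 direction is lost
print(abstopk(z, 2))  # [ 2. -3.  0.  0.  0.] -> bidirectional feature survives
```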
[536] Mechanistic Interpretability as Statistical Estimation: A Variance Analysis of EAP-IG
Maxime Méloux, François Portet, Maxime Peyrard
Main category: cs.LG
TL;DR: Mechanistic Interpretability methods like circuit discovery should be treated as statistical estimators and evaluated for stability. Analysis of EAP-IG method shows high variance and sensitivity to hyperparameters, questioning reliability of findings.
Details
Motivation: To establish scientific rigor in Mechanistic Interpretability by treating interpretability methods as statistical estimators subject to variance and robustness evaluation.
Method: Systematic stability analysis of EAP-IG circuit discovery method through controlled perturbations including input resampling, prompt paraphrasing, hyperparameter variation, and injected noise.
Result: EAP-IG exhibits high structural variance and sensitivity to hyperparameters across diverse models and tasks, questioning the stability of its findings.
Conclusion: Advocates for routine reporting of stability metrics to promote more rigorous and statistically grounded interpretability science.
Abstract: The development of trustworthy artificial intelligence requires moving beyond black-box performance metrics toward an understanding of models’ internal computations. Mechanistic Interpretability (MI) aims to meet this need by identifying the algorithmic mechanisms underlying model behaviors. Yet, the scientific rigor of MI critically depends on the reliability of its findings. In this work, we argue that interpretability methods, such as circuit discovery, should be viewed as statistical estimators, subject to questions of variance and robustness. To illustrate this statistical framing, we present a systematic stability analysis of a state-of-the-art circuit discovery method: EAP-IG. We evaluate its variance and robustness through a comprehensive suite of controlled perturbations, including input resampling, prompt paraphrasing, hyperparameter variation, and injected noise within the causal analysis itself. Across a diverse set of models and tasks, our results demonstrate that EAP-IG exhibits high structural variance and sensitivity to hyperparameters, questioning the stability of its findings. Based on these results, we offer a set of best-practice recommendations for the field, advocating for the routine reporting of stability metrics to promote a more rigorous and statistically grounded science of interpretability.
[537] The Data-Quality Illusion: Rethinking Classifier-Based Quality Filtering for LLM Pretraining
Thiziri Nait Saada, Louis Bethune, Michal Klein, David Grangier, Marco Cuturi, Pierre Ablin
Main category: cs.LG
TL;DR: Classifier-based Quality Filtering (CQF) improves downstream task performance but doesn’t enhance language modeling on high-quality data, challenging the notion that CQF captures meaningful data quality.
Details
Motivation: To understand why CQF improves downstream tasks while failing to enhance language modeling on high-quality datasets, and to examine whether CQF truly captures meaningful data quality.
Method: Analyzed CQF behavior by comparing models trained with CQF to those trained on synthetic data with increasing quality obtained via random token permutations.
Result: CQF implicitly filters the high-quality dataset itself, and shows starkly different trends compared to models trained on synthetic quality data, challenging CQF’s notion of data quality.
Conclusion: CQF doesn’t necessarily capture meaningful data quality despite improving downstream performance, revealing limitations in current data filtering approaches.
Abstract: Large-scale models are pretrained on massive web-crawled datasets containing documents of mixed quality, making data filtering essential. A popular method is Classifier-based Quality Filtering (CQF), which trains a binary classifier to distinguish between pretraining data and a small, high-quality set. It assigns each pretraining document a quality score defined as the classifier’s score and retains only the top-scoring ones. We provide an in-depth analysis of CQF. We show that while CQF improves downstream task performance, it does not necessarily enhance language modeling on the high-quality dataset. We explain this paradox by the fact that CQF implicitly filters the high-quality dataset as well. We further compare the behavior of models trained with CQF to those trained on synthetic data of increasing quality, obtained via random token permutations, and find starkly different trends. Our results challenge the view that CQF captures a meaningful notion of data quality.
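A minimal sketch of the CQF recipe on pre-computed document embeddings (the random embeddings and the retention fraction are illustrative stand-ins):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
pool = rng.normal(size=(10000, 64))           # raw pretraining documents
hq = rng.normal(loc=0.3, size=(500, 64))      # high-quality reference set

# Train a binary classifier to separate the reference set from the pool.
X = np.vstack([pool, hq])
y = np.concatenate([np.zeros(len(pool)), np.ones(len(hq))])
clf = LogisticRegression(max_iter=1000).fit(X, y)

quality = clf.predict_proba(pool)[:, 1]       # classifier score = "quality"
keep = np.argsort(quality)[-len(pool) // 5:]  # retain the top 20%
```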
[538] AI-Powered Inverse Design of Ku-Band SIW Resonant Structures by Iterative Residual Correction Network
Mohammad Mashayekhi, Kamran Salehian, Abbas Ozgoli, Saeed Abdollahi, Abdolali Abdipour, Ahmed A. Kishk
Main category: cs.LG
TL;DR: A deep learning framework for inverse design of multi-mode SIW filters with both closely spaced and widely separated resonances, using a three-stage approach that reduces simulation dependency.
Details
Motivation: Addressing the challenge of designing high-performance SIW filters with mixed resonance spacing while reducing reliance on time-consuming electromagnetic simulations.
Method: Three-stage deep learning framework: Feedforward Inverse Model (FIM), Hybrid Inverse-Forward Residual Refinement Network (HiFR²-Net), and Iterative Residual Correction Network (IRC-Net).
Result: IRC-Net achieved best performance with systematic error reduction over five iterations, reducing MSE from 0.00191 to 0.00146 and MAE from 0.0262 to 0.0209.
Conclusion: The framework enables robust, accurate, and generalizable inverse design of complex microwave filters with minimal simulation cost, facilitating rapid prototyping and extending to other high-frequency components.
Abstract: Designing high-performance substrate-integrated waveguide (SIW) filters with both closely spaced and widely separated resonances is challenging. Consequently, there is a growing need for robust methods that reduce reliance on time-consuming electromagnetic (EM) simulations. In this study, a deep learning-based framework was developed and validated for the inverse design of multi-mode SIW filters with both closely spaced and widely separated resonances. A series of SIW filters were designed, fabricated, and experimentally evaluated. A three-stage deep learning framework was implemented, consisting of a Feedforward Inverse Model (FIM), a Hybrid Inverse-Forward Residual Refinement Network (HiFR²-Net), and an Iterative Residual Correction Network (IRC-Net). The design methodology and performance of each model were systematically analyzed. Notably, IRC-Net outperformed both FIM and HiFR²-Net, achieving systematic error reduction over five correction iterations. Experimental results showed a reduction in mean squared error (MSE) from 0.00191 to 0.00146 and mean absolute error (MAE) from 0.0262 to 0.0209, indicating improved accuracy and convergence. The proposed framework demonstrates the capability to enable robust, accurate, and generalizable inverse design of complex microwave filters with minimal simulation cost. This approach is expected to facilitate rapid prototyping of advanced filter designs and could extend to other high-frequency components in microwave and millimeter-wave technologies.
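The iterative residual correction loop can be sketched schematically; `forward_model` and `corrector` are hypothetical trained networks standing in for the paper's surrogate and correction stages:

```python
def irc_design(target_response, init_design, forward_model, corrector, iters=5):
    """Iterative residual correction: a forward surrogate predicts the
    response of the current design; a correction network maps the residual
    (target - prediction) to a design update. Schematic sketch only."""
    design = init_design
    for _ in range(iters):
        residual = target_response - forward_model(design)
        design = design + corrector(residual)
    return design
```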
[539] Momentum-SAM: Sharpness Aware Minimization without Computational Overhead
Marlon Becker, Frederick Altrock, Benjamin Risse
Main category: cs.LG
TL;DR: MSAM is a computationally efficient alternative to SAM that uses momentum-based perturbations instead of additional gradient calculations to achieve flat loss regions and reduce overfitting.
Details
Motivation: SAM improves generalization but doubles computational costs due to extra gradient calculations, making it impractical for limited computational resources.
Method: MSAM perturbs parameters in the direction of the accumulated momentum vector (inspired by Nesterov Accelerated Gradient) to achieve low sharpness without significant computational overhead.
Result: MSAM achieves low sharpness and improved generalization without the computational burden of SAM, while providing insights into separable mechanisms of NAG, SAM and MSAM.
Conclusion: MSAM offers a computationally efficient approach to achieve flat loss regions and reduce overfitting, making it feasible for resource-constrained environments.
Abstract: The recently proposed optimization algorithm for deep neural networks Sharpness Aware Minimization (SAM) suggests perturbing parameters before gradient calculation by a gradient ascent step to guide the optimization into parameter space regions of flat loss. While significant generalization improvements and thus reduction of overfitting could be demonstrated, the computational costs are doubled due to the additionally needed gradient calculation, making SAM unfeasible in case of limited computationally capacities. Motivated by Nesterov Accelerated Gradient (NAG) we propose Momentum-SAM (MSAM), which perturbs parameters in the direction of the accumulated momentum vector to achieve low sharpness without significant computational overhead or memory demands over SGD or Adam. We evaluate MSAM in detail and reveal insights on separable mechanisms of NAG, SAM and MSAM regarding training optimization and generalization. Code is available at https://github.com/MarlonBecker/MSAM.
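A toy sketch of an MSAM-style step on a quadratic loss, assuming the perturbation is ρ times the normalized momentum (the sign convention and placement follow the description above, not the authors' code):

```python
import numpy as np

def grad(w):                 # analytic gradient of f(w) = 0.5 * ||w||^2
    return w

w = np.array([1.0, -2.0])
m = np.zeros_like(w)         # accumulated momentum
lr, beta, rho = 0.1, 0.9, 0.05

for _ in range(100):
    # Perturb along the momentum direction; unlike SAM, no extra gradient
    # pass is needed to construct the perturbation.
    norm = np.linalg.norm(m)
    w_pert = w + rho * m / norm if norm > 0 else w
    g = grad(w_pert)         # the single gradient evaluation per step
    m = beta * m + g
    w = w - lr * m
```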
[540] Subspace Node Pruning
Joshua Offergeld, Marcel van Gerven, Nasir Ahmad
Main category: cs.LG
TL;DR: A novel node pruning method using orthogonal subspace projection to remove redundant computational units while maintaining performance, with automatic layer-wise pruning ratio determination and significantly lower computational cost than alternatives.
Details
Motivation: To improve neural network inference efficiency by reducing computational overhead while retaining performance, given the growing commercial use of AI models.
Method: Projection of unit activations to orthogonal subspaces to identify redundant nodes, optimized orthogonalization ordering to rank redundancy, and automatic layer-wise pruning ratios based on cumulative variance in the subspace.
Result: Matches or exceeds state-of-the-art pruning results on ImageNet-trained VGG-16, ResNet-50 and DeiT models with up to 24x lower computational cost, and successfully applied to OPT LLM models, outperforming competing methods.
Conclusion: The proposed orthogonal subspace projection method provides an efficient and effective approach for neural network pruning that significantly reduces inference time while maintaining model performance across various architectures.
Abstract: Improving the efficiency of neural network inference is undeniably important in a time where commercial use of AI models increases daily. Node pruning is the art of removing computational units such as neurons, filters, attention heads, or even entire layers to significantly reduce inference time while retaining network performance. In this work, we propose the projection of unit activations to an orthogonal subspace in which there is no redundant activity and within which we may prune nodes while simultaneously recovering the impact of lost units via linear least squares. We furthermore show that the order in which units are orthogonalized can be optimized to maximally rank units by their redundancy. Finally, we leverage these orthogonal subspaces to automatically determine layer-wise pruning ratios based upon the relative scale of node activations in our subspace, equivalent to cumulative variance. Our method matches or exceeds state-of-the-art pruning results on ImageNet-trained VGG-16, ResNet-50 and DeiT models while simultaneously having up to 24x lower computational cost than alternative methods. We also demonstrate that this method can be applied in a one-shot manner to OPT LLM models, again outperforming competing methods.
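A simplified sketch of pruning units and compensating the next layer by least squares; ranking by raw variance here is a stand-in for the paper's orthogonal-subspace redundancy ranking:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(1000, 64))       # unit activations (samples x units)
W_next = rng.normal(size=(64, 32))    # next layer's weights

# Keep the top 48 of 64 units by a variance criterion.
keep = np.argsort(A.var(axis=0))[-48:]
A_kept = A[:, keep]

# Recover the impact of the lost units via linear least squares, folding
# the reconstruction map into the next layer:
#   A @ W_next  ~=  A_kept @ (R @ W_next).
R, *_ = np.linalg.lstsq(A_kept, A, rcond=None)   # (48 x 64) reconstruction
W_next_pruned = R @ W_next                       # (48 x 32)

rel_err = (np.linalg.norm(A @ W_next - A_kept @ W_next_pruned)
           / np.linalg.norm(A @ W_next))
```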
[541] Time-o1: Time-Series Forecasting Needs Transformed Label Alignment
Hao Wang, Licheng Pan, Zhichao Chen, Xu Chen, Qingyang Dai, Lei Wang, Haoxuan Li, Zhouchen Lin
Main category: cs.LG
TL;DR: Time-o1 is a novel learning objective for time-series forecasting that transforms labels into decorrelated components to address label autocorrelation and reduce task complexity.
Details
Motivation: Existing methods using temporal mean squared error suffer from label autocorrelation bias and excessive task complexity that grows with the forecast horizon.
Method: Transform label sequence into decorrelated components with discriminated significance, then train models to align the most significant components.
Result: Extensive experiments show Time-o1 achieves state-of-the-art performance and is compatible with various forecast models.
Conclusion: Time-o1 effectively mitigates label autocorrelation and reduces task amount while maintaining compatibility with different forecasting models.
Abstract: Training time-series forecast models presents unique challenges in designing effective learning objectives. Existing methods predominantly utilize the temporal mean squared error, which faces two critical challenges: (1) label autocorrelation, which leads to bias from the label sequence likelihood; (2) excessive amount of tasks, which increases with the forecast horizon and complicates optimization. To address these challenges, we propose Time-o1, a transformation-augmented learning objective tailored for time-series forecasting. The central idea is to transform the label sequence into decorrelated components with discriminated significance. Models are then trained to align the most significant components, thereby effectively mitigating label autocorrelation and reducing task amount. Extensive experiments demonstrate that Time-o1 achieves state-of-the-art performance and is compatible with various forecast models. Code is available at https://github.com/Master-PLC/Time-o1.
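As one concrete instance of such a transform, the sketch below decorrelates the label horizon with PCA and computes the loss only on the most significant components; PCA is an illustrative choice, not necessarily the paper's transform:

```python
import numpy as np

rng = np.random.default_rng(0)
Y_train = rng.normal(size=(5000, 96))   # training labels, horizon H = 96

# Fit a decorrelating transform on the training label sequences.
Yc = Y_train - Y_train.mean(axis=0)
_, S, Vt = np.linalg.svd(Yc, full_matrices=False)
P = Vt[:16]                             # the 16 most significant components

def transformed_mse(y_pred, y_true):
    # Align predictions and labels only on the significant, decorrelated
    # components instead of all H autocorrelated time steps.
    return np.mean(((y_pred - y_true) @ P.T) ** 2)
```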
[542] How Can Time Series Analysis Benefit From Multiple Modalities? A Survey and Outlook
Haoxin Liu, Harshavardhan Kamarthi, Zhiyuan Zhao, Shangqing Xu, Shiyu Wang, Qingsong Wen, Tom Hartvigsen, Fei Wang, B. Aditya Prakash
Main category: cs.LG
TL;DR: This survey paper provides the first comprehensive review of Multiple Modalities for Time Series Analysis (MM4TSA), exploring how time series analysis can benefit from incorporating other modalities like text, images, and audio.
Details
Motivation: Time series analysis remains relatively underexplored compared to other modalities like language and vision. The paper aims to bridge this gap by surveying how TSA can benefit from multiple modalities.
Method: Systematically reviews MM4TSA works by discussing three main benefits: reusing foundation models from other modalities, multimodal extension for enhanced TSA, and cross-modality interaction for advanced TSA. Groups works by modality type (text, images, audio, tables, etc.).
Result: Identifies key gaps and future opportunities including: reused modality selections, heterogeneous modality combinations, and unseen tasks generalizations. Provides an up-to-date GitHub repository with key papers and resources.
Conclusion: This is the first comprehensive survey of the emerging MM4TSA field, offering a systematic framework for understanding how time series analysis can leverage multiple modalities and identifying important future research directions.
Abstract: Time series analysis (TSA) is a longstanding research topic in the data mining community and has wide real-world significance. Compared to “richer” modalities such as language and vision, which have recently experienced explosive development and are densely connected, the time-series modality remains relatively underexplored and isolated. We notice that many recent TSA works have formed a new research field, i.e., Multiple Modalities for TSA (MM4TSA). In general, these MM4TSA works follow a common motivation: how TSA can benefit from multiple modalities. This survey is the first to offer a comprehensive review and a detailed outlook for this emerging field. Specifically, we systematically discuss three benefits: (1) reusing foundation models of other modalities for efficient TSA, (2) multimodal extension for enhanced TSA, and (3) cross-modality interaction for advanced TSA. We further group the works by the introduced modality type, including text, images, audio, tables, and others, within each perspective. Finally, we identify the gaps with future opportunities, including the reused modalities selections, heterogeneous modality combinations, and unseen tasks generalizations, corresponding to the three benefits. We release an up-to-date GitHub repository that includes key papers and resources.
[543] Weighted Sequential Bayesian Inference for Non-Stationary Linear Contextual Bandits
Nicklas Werge, Yi-Shan Wu, Abdullah Akgül, Melih Kandemir
Main category: cs.LG
TL;DR: The paper introduces Weighted Sequential Bayesian (WSB) methods for non-stationary linear contextual bandits, developing novel concentration inequalities and three efficient algorithms with improved regret guarantees.
Details
Motivation: To address non-stationary linear contextual bandits using Bayesian inference rather than traditional Weighted Regularized Least-Squares (WRLS) approaches, aiming to quantify and leverage the influence of initial beliefs.
Method: Proposes Weighted Sequential Bayesian (WSB) framework that maintains posterior distributions over time-varying reward parameters, with novel concentration inequalities for WSB posteriors. Introduces three algorithms: WSB-LinUCB, WSB-RandLinUCB, and WSB-LinTS.
Result: Established frequentist regret guarantees: WSB-LinUCB matches best-known WRLS-based guarantees, while WSB-RandLinUCB and WSB-LinTS improve upon them, all while maintaining computational efficiency comparable to WRLS-based algorithms.
Conclusion: WSB framework provides a principled Bayesian approach to non-stationary contextual bandits with theoretical guarantees and practical efficiency, demonstrating advantages over existing WRLS-based methods.
Abstract: We study non-stationary linear contextual bandits through the lens of sequential Bayesian inference. Whereas existing algorithms typically rely on the Weighted Regularized Least-Squares (WRLS) objective, we study Weighted Sequential Bayesian (WSB), which maintains a posterior distribution over the time-varying reward parameters. Our main contribution is a novel concentration inequality for WSB posteriors, which introduces a prior-dependent term that quantifies the influence of initial beliefs. We show that this influence decays over time and derive tractable upper bounds that make the result useful for both analysis and algorithm design. Building on WSB, we introduce three algorithms: WSB-LinUCB, WSB-RandLinUCB, and WSB-LinTS. We establish frequentist regret guarantees: WSB-LinUCB matches the best-known WRLS-based guarantees, while WSB-RandLinUCB and WSB-LinTS improve upon them, all while preserving the computational efficiency of WRLS-based algorithms.
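A sketch of one common discounted-posterior construction for a Gaussian linear reward model, where older evidence is geometrically down-weighted; the paper's WSB posterior may differ in detail:

```python
import numpy as np

d, gamma, noise = 5, 0.95, 1.0
Lam = np.eye(d)        # posterior precision, initialized at the prior
b = np.zeros(d)

def wsb_update(x, r):
    """Absorb one observation (context x, reward r) while discounting old
    evidence; returns the Gaussian posterior used by UCB- or
    Thompson-style action selection."""
    global Lam, b
    Lam = gamma * Lam + (1 - gamma) * np.eye(d) + np.outer(x, x) / noise
    b = gamma * b + x * r / noise
    mean = np.linalg.solve(Lam, b)
    cov = np.linalg.inv(Lam)
    return mean, cov
```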
[544] PiCa: Parameter-Efficient Fine-Tuning with Column Space Projection
Junseo Hwang, Wonguk Cho, Taesup Kim
Main category: cs.LG
TL;DR: PiCa is a theoretically grounded parameter-efficient fine-tuning method that projects gradients onto the principal column space of pre-trained weights and uses weight-sharing to enhance efficiency, outperforming SOTA methods across NLP and vision tasks.
Details
Motivation: Fine-tuning large foundation models is computationally expensive, and existing parameter-efficient methods lack theoretical foundations despite showing promise through exploiting pre-trained weight geometry.
Method: Projects gradients onto the principal column space of pre-trained weights and employs a novel weight-sharing strategy to enhance parameter efficiency.
Result: Consistently outperforms state-of-the-art baselines across diverse NLP and vision tasks under comparable or smaller parameter budgets.
Conclusion: PiCa demonstrates both theoretical rigor and practical effectiveness for parameter-efficient fine-tuning of large foundation models.
Abstract: Fine-tuning large foundation models is essential for building expert models tailored to specialized tasks and domains, but fully updating billions of parameters is computationally prohibitive. Reducing the number of trainable parameters using parameter-efficient fine-tuning is therefore crucial not only to reduce training costs but also to mitigate storage, caching, and serving overheads during deployment. Prior works, such as Singular Vectors-guided Fine-Tuning, have shown that exploiting the geometry of pre-trained weights can significantly improve parameter-efficiency, but they lack a solid theoretical foundation. In this paper, we introduce Parameter-efficient Fine-tuning with Column Space Projection (PiCa), a novel theoretically grounded PEFT method. We prove that projecting gradients onto the principal column space of pre-trained weights provides an effective inductive bias for adaptation and further enhance parameter efficiency through a novel weight-sharing strategy. Across diverse NLP and vision tasks, PiCa consistently outperforms state-of-the-art baselines under comparable or smaller parameter budgets, demonstrating both theoretical rigor and practical effectiveness.
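The core projection is a few lines of linear algebra: take the top-r left singular vectors of the pre-trained weight and restrict the gradient to their span. A sketch with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
W0 = rng.normal(size=(256, 128))     # pre-trained weight matrix
G = rng.normal(size=(256, 128))      # gradient of the fine-tuning loss

r = 16
U, _, _ = np.linalg.svd(W0, full_matrices=False)
U_r = U[:, :r]                       # principal column-space basis of W0

G_proj = U_r @ (U_r.T @ G)           # gradient restricted to that subspace
W = W0 - 1e-3 * G_proj               # one projected update step
```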
[545] Consistent End-to-End Estimation for Counterfactual Fairness
Yuchen Ma, Valentyn Melnychuk, Dennis Frauen, Stefan Feuerriegel
Main category: cs.LG
TL;DR: A novel counterfactual fairness predictor with theoretical guarantees that achieves state-of-the-art performance across various datasets by learning counterfactual distributions and using counterfactual mediator regularization.
Details
Motivation: Fairness in predictions is crucial due to legal, ethical, and societal reasons, but existing counterfactual fairness methods lack theoretical guarantees since counterfactuals are unobservable.
Method: Learns counterfactual distribution of descendants of sensitive attributes via tailored neural networks and enforces fair predictions through novel counterfactual mediator regularization.
Result: The method achieves state-of-the-art performance across various datasets and provides theoretical guarantees for counterfactual fairness.
Conclusion: The proposed approach effectively ensures counterfactual fairness with theoretical backing and superior empirical performance compared to existing baselines.
Abstract: Fairness in predictions is of direct importance in practice due to legal, ethical, and societal reasons. This is often accomplished through counterfactual fairness, which ensures that the prediction for an individual is the same as that in a counterfactual world under a different sensitive attribute. However, achieving counterfactual fairness is challenging as counterfactuals are unobservable, and, because of that, existing baselines for counterfactual fairness do not have theoretical guarantees. In this paper, we propose a novel counterfactual fairness predictor for making predictions under counterfactual fairness. Here, we follow the standard counterfactual fairness setting and directly learn the counterfactual distribution of the descendants of the sensitive attribute via tailored neural networks, which we then use to enforce fair predictions through a novel counterfactual mediator regularization. Unique to our work is that we provide theoretical guarantees that our method is effective in ensuring the notion of counterfactual fairness. We further compare the performance across various datasets, where our method achieves state-of-the-art performance.
[546] Finite Sample Analysis of Linear Temporal Difference Learning with Arbitrary Features
Zixuan Xie, Xinyu Liu, Rohan Chandra, Shangtong Zhang
Main category: cs.LG
TL;DR: Establishes first L^2 convergence rates for linear TD(λ) with arbitrary features, without requiring linearly independent features or algorithmic modifications, for both discounted and average-reward settings.
Details
Motivation: Previous convergence analyses for linear TD(λ) relied on the assumption of linearly independent features, which does not hold in many practical scenarios.
Method: Developed a novel stochastic approximation result that handles convergence to solution sets rather than single points, addressing the non-uniqueness issue from arbitrary features.
Result: Successfully established L^2 convergence rates for linear TD(λ) under arbitrary features in both discounted and average-reward reinforcement learning settings.
Conclusion: The paper provides the first convergence guarantees for linear TD(λ) with arbitrary features, overcoming previous limitations and expanding the algorithm’s practical applicability.
Abstract: Linear TD($\lambda$) is one of the most fundamental reinforcement learning algorithms for policy evaluation. Previously, convergence rates are typically established under the assumption of linearly independent features, which does not hold in many practical scenarios. This paper instead establishes the first $L^2$ convergence rates for linear TD($\lambda$) operating under arbitrary features, without making any algorithmic modification or additional assumptions. Our results apply to both the discounted and average-reward settings. To address the potential non-uniqueness of solutions resulting from arbitrary features, we develop a novel stochastic approximation result featuring convergence rates to the solution set instead of a single point.
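For reference, the algorithm under analysis is textbook linear TD(λ) with accumulating eligibility traces (terminal-state handling omitted for brevity):

```python
import numpy as np

def td_lambda(episodes, phi, d, alpha=0.05, gamma=0.99, lam=0.9):
    """Linear TD(lambda) policy evaluation. phi may be an arbitrary feature
    map (linearly dependent features included), which is the regime the
    paper analyzes. Each episode is a list of (s, r, s_next) transitions."""
    w = np.zeros(d)
    for episode in episodes:
        z = np.zeros(d)                  # eligibility trace
        for s, r, s_next in episode:
            delta = r + gamma * phi(s_next) @ w - phi(s) @ w
            z = gamma * lam * z + phi(s)
            w = w + alpha * delta * z
    return w
```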
[547] Learning to Weight Parameters for Training Data Attribution
Shuangqi Li, Hieu Le, Jingyi Xu, Mathieu Salzmann
Main category: cs.LG
TL;DR: Proposes a method to learn parameter importance weights from data for gradient-based data attribution, improving accuracy across image classification, language modeling, and diffusion tasks.
Details
Motivation: Existing gradient-based data attribution methods either treat network parameters uniformly or rely on implicit Hessian approximations, failing to fully model functional heterogeneity of network parameters.
Method: Explicitly learn parameter importance weights directly from data without requiring annotated labels, addressing functional heterogeneity in network parameters.
Result: Improves attribution accuracy across diverse tasks including image classification, language modeling, and diffusion, and enables fine-grained attribution for concepts like subject and style.
Conclusion: Learning parameter importance weights directly from data provides more accurate data attribution by better modeling functional heterogeneity in neural networks.
Abstract: We study gradient-based data attribution, aiming to identify which training examples most influence a given output. Existing methods for this task either treat network parameters uniformly or rely on implicit weighting derived from Hessian approximations, which do not fully model functional heterogeneity of network parameters. To address this, we propose a method to explicitly learn parameter importance weights directly from data, without requiring annotated labels. Our approach improves attribution accuracy across diverse tasks, including image classification, language modeling, and diffusion, and enables fine-grained attribution for concepts like subject and style.
[548] Development and Validation of a Dynamic Kidney Failure Prediction Model based on Deep Learning: A Real-World Study with External Validation
Jingying Ma, Jinwei Wang, Lanlan Lu, Yexiang Sun, Mengling Feng, Feifei Zhang, Peng Shen, Zhiqin Jiang, Shenda Hong, Luxia Zhang
Main category: cs.LG
TL;DR: KFDeep is a dynamic model that uses longitudinal EHR data for real-time kidney failure prediction, outperforming static models and providing continuous decision support in clinical practice.
Details
Motivation: Current CKD prediction models are static and cannot capture temporal disease progression trends, limiting timely intervention opportunities for this high-morbidity disease.
Method: Developed using retrospective EHR data from 4,587 patients in China, with training/validation split and external validation on a prospective cohort. The model leverages common longitudinal clinical indicators for dynamic prediction.
Result: Achieved AUROC of 0.9311 in internal validation and 0.8141 in external validation, with progressively improving dynamic predictions, good calibration, and clinically consistent interpretability.
Conclusion: KFDeep enables dynamic kidney failure prediction without additional costs, has been successfully deployed in clinical settings, and provides physicians with continuously updated decision support.
Abstract: Background: Chronic kidney disease (CKD), a progressive disease with high morbidity and mortality, has become a significant global public health problem. Most existing models are static and fail to capture temporal trends in disease progression, limiting their ability to inform timely interventions. We address this gap by developing a dynamic deep learning model (KFDeep) that leverages common longitudinal clinical indicators from real-world Electronic Health Records (EHRs) for real-time kidney failure prediction. Findings: A retrospective cohort of 4,587 patients from Yinzhou, China, was used for model development (2,752 patients for training, 917 patients for validation) and internal validation (918 patients), while external validation was conducted on a prospective PKUFH cohort (934 patients). The model demonstrated competitive performance across datasets, with an AUROC of 0.9311 (95%CI, 0.8873-0.9749) in the internal validation cohort and 0.8141 (95%CI, 0.7728-0.8554) in the external validation cohort, alongside progressively improving dynamic predictions, good calibration, and clinically consistent interpretability. KFDeep has been deployed on an open-access website and in primary care settings. Interpretation: The KFDeep model enables dynamic prediction of kidney failure without increasing clinical examination costs. It has been integrated into existing hospital systems, providing physicians with a continuously updated decision-support tool in routine care.
[549] What happens when generative AI models train recursively on each others’ outputs?
Hung Anh Vu, Galen Reeves, Emily Wenger
Main category: cs.LG
TL;DR: This paper studies how generative AI models training on each other’s outputs affects performance, finding both benefits (exposure to novel concepts) and drawbacks (performance homogenization).
Details
Motivation: As the internet contains increasing AI-generated content, future genAI models may train on outputs from other models, creating data-mediated interactions that need understanding given society's growing dependence on these tools.
Method: The work provides empirical evidence of data-mediated interactions, develops a theoretical model for interactive training processes, and experimentally validates the theory.
Result: Data-mediated interactions can benefit models by exposing them to novel concepts potentially missed in original training data, but can also homogenize their performance on shared tasks.
Conclusion: Understanding data-mediated model interactions is critical as models increasingly train on AI-generated content, with both positive (novel concept exposure) and negative (performance homogenization) consequences.
Abstract: The internet serves as a common source of training data for generative AI (genAI) models but is increasingly populated with AI-generated content. This duality raises the possibility that future genAI models may be trained on other models’ generated outputs. Prior work has studied consequences of models training on their own generated outputs, but limited work has considered what happens if models ingest content produced by other models. Given society’s increasing dependence on genAI tools, understanding such data-mediated model interactions is critical. This work provides empirical evidence for how data-mediated interactions might unfold in practice, develops a theoretical model for this interactive training process, and experimentally validates the theory. We find that data-mediated interactions can benefit models by exposing them to novel concepts perhaps missed in original training data, but also can homogenize their performance on shared tasks.
[550] Neuro-Symbolic AI for Analytical Solutions of Differential Equations
Orestis Oikonomou, Levi Lingsch, Dana Grund, Siddhartha Mishra, Georgios Kissas
Main category: cs.LG
TL;DR: A neuro-symbolic AI framework that combines compositional solution techniques with iterative refinement to find analytical solutions for differential equations, outperforming commercial solvers and neural networks.
Details
Motivation: Analytical solutions provide exact insights into physical processes but are difficult to find, limiting their practical application despite their fundamental importance.
Method: Integrates compositional differential equation solution techniques with iterative refinement using formal grammars, embedding candidate solutions into a low-dimensional latent manifold for probabilistic exploration.
Result: The approach overcomes longstanding barriers to extract closed-form solutions and demonstrates advantages over commercial solvers, symbolic methods, and approximate neural networks on diverse problems.
Conclusion: The neuro-symbolic framework successfully unifies numerical and symbolic differential equation solvers, achieving both generality and accuracy in finding analytical solutions for a wide variety of differential equations.
Abstract: Analytical solutions of differential equations offer exact insights into fundamental behaviors of physical processes. Their application, however, is limited as finding these solutions is difficult. To overcome this limitation, we combine two key insights. First, constructing an analytical solution requires a composition of foundational solution components. Second, iterative solvers define parameterized function spaces with constraint-based updates. Our approach merges compositional differential equation solution techniques with iterative refinement by using formal grammars, building a rich space of candidate solutions that are embedded into a low-dimensional (continuous) latent manifold for probabilistic exploration. This integration unifies numerical and symbolic differential equation solvers via a neuro-symbolic AI framework to find analytical solutions of a wide variety of differential equations. By systematically constructing candidate expressions and applying constraint-based refinement, we overcome longstanding barriers to extract such closed-form solutions. We illustrate advantages over commercial solvers, symbolic methods, and approximate neural networks on a diverse set of problems, demonstrating both generality and accuracy.
[551] Enhanced DACER Algorithm with High Diffusion Efficiency
Yinuo Wang, Likun Wang, Mining Tan, Wenjun Zou, Xujie Song, Wenxuan Wang, Tong Liu, Guojian Zhan, Tianze Zhu, Shiqi Liu, Zeyu He, Feihong Zhang, Jingliang Duan, Shengbo Eben Li
Main category: cs.LG
TL;DR: DACERv2 improves upon diffusion-based online RL by introducing Q-gradient field guidance and temporal weighting to enable efficient single-step diffusion policies without performance degradation.
Details
Motivation: Address the core trade-off in diffusion policies, where more diffusion steps ensure performance but reduce efficiency while fewer steps degrade performance, a major bottleneck for real-time online RL deployment.
Method: Uses a Q-gradient field objective as an auxiliary optimization target to guide the denoising process at each diffusion step, with a temporal weighting mechanism to handle noise elimination in early stages and output refinement in later stages.
Result: Achieves higher performance in most complex control environments with only five diffusion steps, outperforming classical and diffusion-based online RL algorithms, and shows greater multimodality.
Conclusion: DACERv2 successfully mitigates the efficiency-performance trade-off in diffusion policies, enabling effective real-time online RL with minimal diffusion steps while maintaining high performance and multimodality.
Abstract: Due to their expressive capacity, diffusion models have shown great promise in offline RL and imitation learning. Diffusion Actor-Critic with Entropy Regulator (DACER) extended this capability to online RL by using the reverse diffusion process as a policy approximator, achieving state-of-the-art performance. However, it still suffers from a core trade-off: more diffusion steps ensure high performance but reduce efficiency, while fewer steps degrade performance. This remains a major bottleneck for deploying diffusion policies in real-time online RL. To mitigate this, we propose DACERv2, which leverages a Q-gradient field objective with respect to action as an auxiliary optimization target to guide the denoising process at each diffusion step, thereby introducing intermediate supervisory signals that enhance the efficiency of single-step diffusion. Additionally, we observe that the independence of the Q-gradient field from the diffusion time step is inconsistent with the characteristics of the diffusion process. To address this issue, a temporal weighting mechanism is introduced, allowing the model to effectively eliminate large-scale noise during the early stages and refine its outputs in the later stages. Experimental results on OpenAI Gym benchmarks and multimodal tasks demonstrate that, compared with classical and diffusion-based online RL algorithms, DACERv2 achieves higher performance in most complex control environments with only five diffusion steps and shows greater multimodality.
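The guidance idea, nudging each denoising step along the critic's action gradient with a time-dependent weight, can be sketched as follows. This is a heavily simplified stand-in, not DACERv2's exact update; the step size, weighting schedule, and placeholder networks are all assumptions.

```python
import torch

def guided_denoise(policy_step, q_net, s, a_T, n_steps=5):
    """Sketch of Q-gradient-guided denoising (illustrative only): at each of
    the few diffusion steps, take the policy's denoising step, then nudge the
    action along grad_a Q(s, a), weighting later (cleaner) steps more heavily
    as a simple stand-in for a temporal weighting mechanism."""
    a = a_T
    for k in range(n_steps, 0, -1):
        a = policy_step(s, a, k)                 # one reverse-diffusion step
        a = a.detach().requires_grad_(True)
        q = q_net(s, a).sum()
        (g,) = torch.autograd.grad(q, a)
        w = (n_steps - k + 1) / n_steps          # later steps weighted more
        a = (a + 0.1 * w * g).detach()           # ascend the Q-gradient field
    return a

# Toy usage with placeholder networks.
policy_step = lambda s, a, k: a - 0.1 * a        # dummy denoiser
q_net = lambda s, a: -(a ** 2).sum(-1)           # dummy critic
a = guided_denoise(policy_step, q_net, torch.zeros(4, 2), torch.randn(4, 2))
```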
[552] LoFT: Parameter-Efficient Fine-Tuning for Long-tailed Semi-Supervised Learning in Open-World Scenarios
Zhiyuan Huang, Jiahao Chen, Yurou Liu, Bing Su
Main category: cs.LG
TL;DR: LoFT is a parameter-efficient fine-tuning framework for long-tailed semi-supervised learning that leverages foundation models to generate reliable pseudo-labels, with an extension (LoFT-OW) for open-world scenarios with out-of-distribution samples.
Details
Motivation: Existing LTSSL methods train from scratch, leading to overconfidence and low-quality pseudo-labels. The authors aim to address these issues by extending LTSSL to foundation model fine-tuning and handling practical open-world scenarios with OOD samples.
Method: Proposes the LoFT framework, using parameter-efficient fine-tuning of foundation models to generate reliable pseudo-labels for imbalanced learning. Extends to LoFT-OW for open-world scenarios by improving discriminative ability against OOD samples.
Result: Achieves superior performance on multiple benchmarks compared to previous approaches, even when using only 1% of the unlabeled data required by previous works.
Conclusion: Fine-tuned foundation models can generate more reliable pseudo-labels for long-tailed semi-supervised learning, and the proposed LoFT framework effectively handles both standard and open-world scenarios with significant efficiency improvements.
Abstract: Long-tailed learning has garnered increasing attention due to its wide applicability in real-world scenarios. Among existing approaches, Long-Tailed Semi-Supervised Learning (LTSSL) has emerged as an effective solution by incorporating a large amount of unlabeled data into the imbalanced labeled dataset. However, most prior LTSSL methods are designed to train models from scratch, which often leads to issues such as overconfidence and low-quality pseudo-labels. To address these challenges, we extend LTSSL into the foundation model fine-tuning paradigm and propose a novel framework: LoFT (Long-tailed semi-supervised learning via parameter-efficient Fine-Tuning). We demonstrate that fine-tuned foundation models can generate more reliable pseudo-labels, thereby benefiting imbalanced learning. Furthermore, we explore a more practical setting by investigating semi-supervised learning under open-world conditions, where the unlabeled data may include out-of-distribution (OOD) samples. To handle this problem, we propose LoFT-OW (LoFT under Open-World scenarios) to improve the discriminative ability. Experimental results on multiple benchmarks demonstrate that our method achieves superior performance compared to previous approaches, even when utilizing only 1% of the unlabeled data used by previous works.
[553] Riemannian Variational Flow Matching for Material and Protein Design
Olga Zaghen, Floor Eijkelboom, Alison Pouplin, Cong Liu, Max Welling, Jan-Willem van de Meent, Erik J. Bekkers
Main category: cs.LG
TL;DR: RG-VFM extends Variational Flow Matching to manifolds using Riemannian Gaussian distributions, providing better geometric learning than velocity-based approaches by directly minimizing geodesic distances.
Details
Motivation: On curved manifolds, the equivalence between endpoint, velocity, and noise prediction breaks down. Endpoint prediction provides stronger learning signals by directly minimizing geodesic distances, which is crucial for capturing manifold structure.
Method: Derived a variational flow matching objective based on Riemannian Gaussian distributions, applicable to manifolds with closed-form geodesics. Uses endpoint prediction rather than velocity prediction to leverage geodesic distance minimization.
Result: RG-VFM more effectively captures manifold structure and improves downstream performance over Euclidean and velocity-based baselines in synthetic spherical/hyperbolic benchmarks and real-world material/protein generation tasks.
Conclusion: RG-VFM provides a superior geometric approach for generative modeling on manifolds by incorporating curvature-dependent penalties through Jacobi fields, outperforming existing Riemannian Flow Matching methods.
Abstract: We present Riemannian Gaussian Variational Flow Matching (RG-VFM), a geometric extension of Variational Flow Matching (VFM) for generative modeling on manifolds. In Euclidean space, predicting endpoints (VFM), velocities (FM), or noise (diffusion) are largely equivalent due to affine interpolations. On curved manifolds this equivalence breaks down, and we hypothesize that endpoint prediction provides a stronger learning signal by directly minimizing geodesic distances. Building on this insight, we derive a variational flow matching objective based on Riemannian Gaussian distributions, applicable to manifolds with closed-form geodesics. We formally analyze its relationship to Riemannian Flow Matching (RFM), exposing that the RFM objective lacks a curvature-dependent penalty - encoded via Jacobi fields - that is naturally present in RG-VFM. Experiments on synthetic spherical and hyperbolic benchmarks, as well as real-world tasks in material and protein generation, demonstrate that RG-VFM more effectively captures manifold structure and improves downstream performance over Euclidean and velocity-based baselines.
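The paper's central insight, that endpoint prediction on a manifold should penalize geodesic rather than Euclidean distance, is concrete enough to sketch. Below is an illustrative spherical geodesic endpoint loss in PyTorch; RG-VFM's actual objective is variational and based on Riemannian Gaussians, which this does not reproduce.

```python
import torch

def sphere_geodesic_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Squared geodesic distance on the unit sphere S^{d-1}.

    For unit vectors x, y the geodesic distance is arccos(<x, y>). This is the
    kind of curvature-aware endpoint penalty the paper argues for, in contrast
    to the Euclidean MSE ||x - y||^2, which ignores manifold structure.
    """
    pred = pred / pred.norm(dim=-1, keepdim=True)              # project onto sphere
    cos = (pred * target).sum(-1).clamp(-1 + 1e-7, 1 - 1e-7)   # numerical safety
    return torch.acos(cos).pow(2).mean()

# Toy usage: random predicted endpoints vs. targets on S^2.
x = torch.randn(8, 3, requires_grad=True)
y = torch.randn(8, 3); y = y / y.norm(dim=-1, keepdim=True)
sphere_geodesic_loss(x, y).backward()
```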
[554] Mol-LLaMA: Towards General Understanding of Molecules in Large Molecular Language Model
Dongki Kim, Wonbin Lee, Sung Ju Hwang
Main category: cs.LG
TL;DR: Mol-LLaMA is a large molecular language model that integrates molecular knowledge with reasoning capabilities to improve molecular analysis and understanding in drug discovery applications.
Details
Motivation: Current molecular language models struggle with accurate molecular feature analysis due to limited knowledge and reasoning capabilities, despite their success in task transfer.
Method: Designed key data types for fundamental molecular features, proposed a module integrating complementary information from different molecular encoders, and leveraged distinct advantages of molecular representations.
Result: Mol-LLaMA demonstrates capability to comprehend general molecular features and provide informative responses, showing potential as a general-purpose assistant for molecular analysis.
Conclusion: The model exhibits explainability and reasoning ability, representing a significant advancement in molecular understanding for interdisciplinary applications across chemistry and biology.
Abstract: Understanding molecules is key to understanding organisms and driving advances in drug discovery, requiring interdisciplinary knowledge across chemistry and biology. Although large molecular language models have achieved notable success in task transfer, they often struggle to accurately analyze molecular features due to limited knowledge and reasoning capabilities. To address this issue, we present Mol-LLaMA, a large molecular language model that grasps the general knowledge centered on molecules and exhibits explainability and reasoning ability. To this end, we design key data types that encompass the fundamental molecular features, taking into account the essential abilities for molecular reasoning. Further, to improve molecular understanding, we propose a module that integrates complementary information from different molecular encoders, leveraging the distinct advantages of molecular representations. Our experimental results demonstrate that Mol-LLaMA is capable of comprehending the general features of molecules and providing informative responses, implying its potential as a general-purpose assistant for molecular analysis. Our project page is at https://mol-llama.github.io/.
[555] Localized Forest Fire Risk Prediction: A Department-Aware Approach for Operational Decision Support
Nicolas Caron, Christophe Guyeux, Hassan Noura, Benjamin Aynes
Main category: cs.LG
TL;DR: This paper proposes a localized approach to forest fire prediction that tailors risk assessment to specific departmental contexts in France, moving beyond traditional binary classification to provide more actionable regional predictions.
Details
Motivation: Traditional binary classification oversimplifies forest fire prediction and doesn't account for local variations in terrain, climate, and historical experience across different firefighting departments in France.
Method: Developed a national-scale AI benchmark for metropolitan France using state-of-the-art AI models on an unexplored dataset, with a focus on departmental-level context-sensitive risk modeling.
Result: Presented the first national-scale AI benchmark for forest fire prediction in France, offering more actionable and region-specific predictions tailored to local departmental conditions.
Conclusion: The proposed localized approach provides more practical fire risk assessments for operational use by firefighting units, and identifies important future research directions in this domain.
Abstract: Forest fire prediction involves estimating the likelihood of fire ignition or related risk levels in a specific area over a defined time period. With climate change intensifying fire behavior and frequency, accurate prediction has become one of the most pressing challenges in Artificial Intelligence (AI). Traditionally, fire ignition is approached as a binary classification task in the literature. However, this formulation oversimplifies the problem, especially from the perspective of end-users such as firefighters. In general, as is the case in France, firefighting units are organized by department, each with its terrain, climate conditions, and historical experience with fire events. Consequently, fire risk should be modeled in a way that is sensitive to local conditions and does not assume uniform risk across all regions. This paper proposes a new approach that tailors fire risk assessment to departmental contexts, offering more actionable and region-specific predictions for operational use. With this, we present the first national-scale AI benchmark for metropolitan France using state-of-the-art AI models on a relatively unexplored dataset. Finally, we summarize important directions for future work. Supplementary materials are available on GitHub.
[556] Accuracy of Discretely Sampled Stochastic Policies in Continuous-time Reinforcement Learning
Yanwei Jia, Du Ouyang, Yufei Zhang
Main category: cs.LG
TL;DR: This paper introduces a framework for executing stochastic policies in continuous-time reinforcement learning by sampling actions at discrete time points and proves convergence rates for the controlled state process.
Details
Motivation: Stochastic policies are widely used in continuous-time RL but executing them and evaluating performance in continuous-time environments remains challenging.
Method: Proposes a policy execution framework that samples actions from stochastic policies at discrete time points and implements them as piecewise constant controls, then analyzes convergence rates.
Result: Proves weak convergence of controlled state process to dynamics with aggregated coefficients, establishes optimal first-order convergence rate for regular coefficients, and 1/2-order convergence rates with sampling noise.
Conclusion: Provides theoretical justification for exploratory stochastic control framework and analyzes bias/variance of policy evaluation and policy gradient estimators based on discrete-time observations.
Abstract: Stochastic policies (also known as relaxed controls) are widely used in continuous-time reinforcement learning algorithms. However, executing a stochastic policy and evaluating its performance in a continuous-time environment remain open challenges. This work introduces and rigorously analyzes a policy execution framework that samples actions from a stochastic policy at discrete time points and implements them as piecewise constant controls. We prove that as the sampling mesh size tends to zero, the controlled state process converges weakly to the dynamics with coefficients aggregated according to the stochastic policy. We explicitly quantify the convergence rate based on the regularity of the coefficients and establish an optimal first-order convergence rate for sufficiently regular coefficients. Additionally, we prove a $1/2$-order weak convergence rate that holds uniformly over the sampling noise with high probability, and establish a $1/2$-order pathwise convergence for each realization of the system noise in the absence of volatility control. Building on these results, we analyze the bias and variance of various policy evaluation and policy gradient estimators based on discrete-time observations. Our results provide theoretical justification for the exploratory stochastic control framework in [H. Wang, T. Zariphopoulou, and X.Y. Zhou, J. Mach. Learn. Res., 21 (2020), pp. 1-34].
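The execution scheme analyzed here is simple to simulate: sample an action from the stochastic policy at each grid point and hold it constant until the next. A minimal Euler–Maruyama sketch for a one-dimensional controlled SDE follows; the dynamics and the Gaussian policy are illustrative placeholders, not the paper's setting.

```python
import numpy as np

rng = np.random.default_rng(1)

def policy_sample(x: float) -> float:
    """Illustrative stochastic policy (relaxed control): Gaussian centered at -x."""
    return rng.normal(loc=-x, scale=0.5)

# Simulate dX_t = a_t dt + dW_t with the action held piecewise constant:
# a_t = policy_sample(X_{t_k}) for t in [t_k, t_{k+1}), mesh size h.
T, h = 1.0, 0.01                 # horizon and sampling mesh (paper: h -> 0)
n = int(T / h)
x = 1.0
for k in range(n):
    a = policy_sample(x)                     # sample once per grid point ...
    x += a * h + np.sqrt(h) * rng.normal()   # ... and hold it over [t_k, t_k + h)
print("X_T =", x)
```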
[557] Optimal Denoising in Score-Based Generative Models: The Role of Data Regularity
Eliot Beyler, Francis Bach
Main category: cs.LG
TL;DR: The paper compares full-denoising and half-denoising methods in score-based generative models, showing their relative performance depends on data distribution characteristics - half-denoising works better for regular densities while full-denoising excels for singular densities and can mitigate the curse of dimensionality.
Details
Motivation: To understand the comparative performance of full-denoising versus half-denoising approaches in score-based generative models, particularly how their effectiveness varies based on data distribution properties.
Method: Theoretical analysis comparing the optimal denoiser for quadratic loss (full-denoising) with the half-denoising method, evaluating performance based on distribution distance metrics under different data assumptions.
Result: Half-denoising performs better for regular densities, while full-denoising is superior for singular densities like mixtures of Dirac measures or low-dimensional subspace-supported densities. Full-denoising can alleviate the curse of dimensionality under linear manifold assumptions.
Conclusion: The choice between full-denoising and half-denoising should depend on data characteristics - half-denoising for regular densities and full-denoising for singular densities, with full-denoising offering dimensionality reduction benefits in certain cases.
Abstract: Score-based generative models achieve state-of-the-art sampling performance by denoising a distribution perturbed by Gaussian noise. In this paper, we focus on a single deterministic denoising step, and compare the optimal denoiser for the quadratic loss, which we name "full-denoising", to the alternative "half-denoising" introduced by Hyvärinen (2024). We show that evaluating performance in terms of distance between distributions tells a more nuanced story, with different assumptions on the data leading to very different conclusions. We prove that half-denoising is better than full-denoising for regular enough densities, while full-denoising is better for singular densities such as mixtures of Dirac measures or densities supported on a low-dimensional subspace. In the latter case, we prove that full-denoising can alleviate the curse of dimensionality under a linear manifold hypothesis.
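For context, both estimators can be written via Tweedie's formula: with a noisy observation $y = x + \sigma\varepsilon$, the quadratic-loss-optimal ("full") denoiser and half-denoising differ only in the step size taken along the score. A sketch of the standard formulas (our notation, not necessarily the paper's):

```latex
% Noisy observation y = x + \sigma \varepsilon, \varepsilon \sim \mathcal{N}(0, I);
% p_\sigma denotes the density of y. Tweedie's formula gives the full denoiser,
% and half-denoising takes half the step along the score:
\hat{x}_{\mathrm{full}}(y) = \mathbb{E}[x \mid y]
  = y + \sigma^{2} \nabla_{y} \log p_{\sigma}(y),
\qquad
\hat{x}_{\mathrm{half}}(y) = y + \tfrac{\sigma^{2}}{2} \nabla_{y} \log p_{\sigma}(y).
```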
[558] Adaptive Batch-Wise Sample Scheduling for Direct Preference Optimization
Zixuan Huang, Yikun Ban, Lean Fu, Xiaojie Li, Zhongxiang Dai, Jianxin Li, Deqing Wang
Main category: cs.LG
TL;DR: SamS is a sample scheduling algorithm that dynamically selects training samples based on the model’s evolving states during DPO training, significantly improving performance without modifying the core DPO algorithm.
Details
Motivation: DPO performance heavily depends on human preference data quality, and existing data selection methods ignore the impact of the model's evolving states during optimization.
Method: Proposed SamS algorithm that adaptively schedules training samples in each batch based on the LLM's learning feedback to maximize generalization performance.
Result: Simply integrating SamS significantly improves performance across tasks with minimal additional computational overhead, without modifying core DPO.
Conclusion: Batch-wise sample selection is a promising direction for improving LLM alignment, with potential generalization to RLHF and broader supervised learning paradigms.
Abstract: Direct Preference Optimization (DPO) has emerged as an effective approach for aligning large language models (LLMs) with human preferences. However, its performance is highly dependent on the quality of the underlying human preference data. To address this bottleneck, prior work has explored various data selection strategies, but these methods often overlook the impact of the evolving states of the language model during the optimization process. In this paper, we introduce a novel problem: Sample Scheduling for DPO, which aims to dynamically and adaptively schedule training samples based on the model’s evolving batch-wise states throughout preference optimization. To solve this problem, we propose SamS, an efficient and effective algorithm that adaptively selects samples in each training batch based on the LLM’s learning feedback to maximize the potential generalization performance. Notably, without modifying the core DPO algorithm, simply integrating SamS significantly improves performance across tasks, with minimal additional computational overhead. This work points to a promising new direction for improving LLM alignment through batch-wise sample selection, with potential generalization to RLHF and broader supervised learning paradigms.
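The abstract leaves SamS's selection rule unspecified, but the batch-wise scheduling interface it describes can be sketched: score each preference pair with a feedback signal from the current model and train on the top fraction. The per-sample DPO loss used as the score below, and all names, are our assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def dpo_per_sample_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard per-sample DPO loss from policy/reference log-probs of the
    chosen (w) and rejected (l) responses."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin)

def schedule_batch(batch_logps, keep_frac=0.5):
    """Illustrative batch-wise scheduler (NOT the SamS algorithm itself):
    keep the samples the current model finds hardest, as one possible
    learning-feedback signal, and train only on those."""
    losses = dpo_per_sample_loss(*batch_logps)
    k = max(1, int(keep_frac * losses.numel()))
    idx = torch.topk(losses, k).indices       # hardest pairs this step
    return idx, losses[idx].mean()            # indices + loss to backprop

# Toy usage with random log-probs for a batch of 8 preference pairs.
logps = tuple(torch.randn(8) for _ in range(4))
idx, loss = schedule_batch(logps)
```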
[559] Learnable cut flow for high energy physics
Jing Li, Hao Sun
Main category: cs.LG
TL;DR: Proposes Learnable Cut Flow (LCF), a neural network that transforms traditional cut selection into a differentiable, data-driven process with parallel and sequential strategies, and introduces Learnable Importance metric for feature analysis.
Details
Motivation: To bridge the gap between interpretable but manually-tuned cut flow methods and powerful but opaque neural networks in high energy physics.
Method: LCF implements two differentiable cut strategies (parallel and sequential) using mask operations instead of hard cuts, with a modified loss function and Learnable Importance metric for feature analysis.
Result: LCF accurately learns cut boundaries, assigns higher importance to discriminative features, handles redundant features robustly, and performs effectively in real-world scenarios, though it initially underperforms other methods on the diboson dataset.
Conclusion: LCF successfully bridges traditional cut flow methods and modern neural networks, providing actionable insights into training process and feature importance while maintaining interpretability.
Abstract: Neural networks have emerged as a powerful paradigm for tasks in high energy physics, yet their opaque training process renders them as a black box. In contrast, the traditional cut flow method offers simplicity and interpretability but requires extensive manual tuning to identify optimal cut boundaries. To merge the strengths of both approaches, we propose the Learnable Cut Flow (LCF), a neural network that transforms the traditional cut selection into a fully differentiable, data-driven process. LCF implements two cut strategies, parallel (observable distributions are treated independently) and sequential (prior cuts shape subsequent ones), to flexibly determine optimal boundaries. Building on this strategy, we introduce the Learnable Importance, a metric that quantifies feature importance and adjusts their contributions to the loss accordingly, offering model-driven insights unlike ad-hoc metrics. To ensure differentiability, a modified loss function replaces hard cuts with mask operations, preserving data shape throughout the training process. LCF is tested on six varied mock datasets and a realistic diboson vs. QCD dataset. Results demonstrate that LCF (1) accurately learns cut boundaries across typical feature distributions in both parallel and sequential strategies, (2) assigns higher importance to discriminative features with minimal overlap, (3) handles redundant or correlated features robustly, and (4) performs effectively in real-world scenarios. In the diboson dataset, LCF initially underperforms boosted decision trees and multilayer perceptrons when using all observables. LCF bridges the gap between the traditional cut flow method and modern black-box neural networks, delivering actionable insights into the training process and feature importance. Source code and experimental data are available at https://github.com/Star9daisy/learnable-cut-flow.
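The core differentiability trick, replacing a hard cut with a smooth mask, is easy to demonstrate. Here is a minimal PyTorch version of a learnable one-sided cut on a single observable; LCF's actual parallel/sequential strategies, loss, and Learnable Importance metric are more involved, and the training loop below is only a toy.

```python
import torch

class SoftCut(torch.nn.Module):
    """Differentiable one-sided cut x > threshold.

    sigmoid(k * (x - t)) approaches the hard indicator 1[x > t] as the
    sharpness k grows, so the cut boundary t can be learned by gradient
    descent instead of manual tuning.
    """
    def __init__(self, init_threshold: float = 0.0):
        super().__init__()
        self.t = torch.nn.Parameter(torch.tensor(init_threshold))

    def forward(self, x: torch.Tensor, sharpness: float = 10.0) -> torch.Tensor:
        return torch.sigmoid(sharpness * (x - self.t))  # soft event weight in (0, 1)

# Toy usage: learn a cut separating signal ~ N(1, 1) from background ~ N(-1, 1).
cut = SoftCut()
opt = torch.optim.Adam(cut.parameters(), lr=0.05)
sig, bkg = torch.randn(1000) + 1.0, torch.randn(1000) - 1.0
for step in range(200):
    # Maximize signal kept while penalizing background kept (illustrative loss).
    loss = -cut(sig).mean() + cut(bkg).mean()
    opt.zero_grad(); loss.backward(); opt.step()
print("learned threshold:", cut.t.item())   # should land near 0
```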
[560] MS-DFTVNet: A Long-Term Time Series Prediction Method Based on Multi-Scale Deformable Convolution
Chenghan Li, Mingchen Li, Yipu Liao, Ruisheng Diao
Main category: cs.LG
TL;DR: MS-DFTVNet is a novel multi-scale 3D deformable convolutional framework for long-term time series prediction that outperforms Transformer and MLP baselines by 7.5% on average across six datasets.
Details
Motivation: Convolutional networks are underexplored for long-term time series prediction compared to Transformers and MLPs, despite their potential.
Method: Proposes multi-scale time series reshape module for cross-period interactions, 3D deformable convolutional framework, and context-aware dynamic deformable convolution to handle uneven temporal feature distribution.
Result: Significantly outperforms strong baselines with 7.5% average improvement across six public datasets, achieving new state-of-the-art results.
Conclusion: The proposed MS-DFTVNet demonstrates the effectiveness of convolutional approaches for long-term time series forecasting and establishes new benchmarks in the field.
Abstract: Research on long-term time series prediction has primarily relied on Transformer and MLP models, while the potential of convolutional networks in this domain remains underexplored. To address this, we propose a novel multi-scale time series reshape module that effectively captures cross-period patch interactions and variable dependencies. Building on this, we develop MS-DFTVNet, a multi-scale 3D deformable convolutional framework tailored for long-term forecasting. Moreover, to handle the inherently uneven distribution of temporal features, we introduce a context-aware dynamic deformable convolution mechanism, which further enhances the model's ability to capture complex temporal patterns. Extensive experiments demonstrate that MS-DFTVNet not only significantly outperforms strong baselines but also achieves an average improvement of about 7.5% across six public datasets, setting new state-of-the-art results.
[561] Enhancing Electricity-System Resilience with Adaptive Robust Optimization and Conformal Uncertainty Characterization
Shuyi Chen, Shixiang Zhu, Ramteen Sioshansi
Main category: cs.LG
TL;DR: A tri-level optimization model integrating proactive actions, adversarial disruptions, and reactive responses for electricity system resilience, using conformal prediction for uncertainty sets and solved via duality theory and Benders decomposition.
Details
Motivation: Extreme weather strains electricity systems, revealing limitations of reactive responses and simplified uncertainty models in existing resilience approaches that decouple proactive and reactive decisions.
Method: Proposes a tri-level optimization model integrating proactive actions, adversarial disruptions, and reactive responses. Uses conformal prediction to construct distribution-free system-disruption uncertainty sets with coverage guarantees. Solves the tri-level problem using duality theory for a bi-level reformulation and Benders decomposition.
Result: Numerical experiments show the approach outperforms conventional robust and two-stage methods.
Conclusion: The integrated tri-level optimization framework with distribution-free uncertainty modeling provides superior electricity system resilience planning compared to traditional approaches.
Abstract: Extreme weather is straining electricity systems, exposing the limitations of reactive responses, and prompting the need for proactive resilience planning. Most existing approaches to enhance electricity system resilience employ simplified uncertainty models and decouple proactive and reactive decisions. This paper proposes a novel tri-level optimization model that integrates proactive actions, adversarial disruptions, and reactive responses. Conformal prediction is used to construct distribution-free system-disruption uncertainty sets with coverage guarantees. The tri-level problem is solved by using duality theory to derive a bi-level reformulation and employing Benders decomposition. Numerical experiments demonstrate that our approach outperforms conventional robust and two-stage methods.
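Split conformal prediction, which the paper uses to build its uncertainty sets, fits in a few lines: calibrate a residual quantile on held-out data, and it bounds new residuals at the stated coverage under exchangeability alone. The disruption "model" below is a generic stand-in, not the paper's.

```python
import numpy as np

def conformal_radius(residuals: np.ndarray, alpha: float = 0.1) -> float:
    """Split-conformal quantile: given calibration residuals |y_i - f(x_i)|,
    the returned radius q satisfies P(|y_new - f(x_new)| <= q) >= 1 - alpha,
    assuming only exchangeability -- no distribution is fitted."""
    n = len(residuals)
    k = int(np.ceil((n + 1) * (1 - alpha)))      # finite-sample correction
    return np.sort(residuals)[min(k, n) - 1]

# Toy usage: a crude stand-in forecast of weather-driven disruptions.
rng = np.random.default_rng(0)
wind = rng.uniform(0, 30, 500)
disruption = 0.2 * wind + rng.normal(0, 1.0, 500)
pred = 0.2 * wind                                 # placeholder forecast model
q = conformal_radius(np.abs(disruption - pred), alpha=0.1)
# Uncertainty set for a new scenario: [prediction - q, prediction + q].
print(f"90% disruption interval half-width: {q:.2f}")
```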
[562] PlaceFM: A Training-free Geospatial Foundation Model of Places using Large-Scale Point of Interest Data
Mohammad Hashemi, Hossein Amiri, Andreas Zufle
Main category: cs.LG
TL;DR: PlaceFM is a geospatial foundation model that uses a training-free, clustering-based approach to learn place representations from POI graphs, enabling efficient multi-granular spatial analysis without costly pre-training.
Details
Motivation: Existing geospatial foundation models lack flexibility in reasoning about places - context-rich regions spanning multiple spatial granularities with spatially and semantically related points of interest.
Method: A training-free, clustering-based approach that summarizes the entire point of interest graph from U.S. Foursquare data to produce general-purpose region embeddings and automatically identify places of interest.
Result: Outperforms most state-of-the-art graph-based geospatial foundation models on ZIP code-level population density and housing price prediction tasks, with up to 100x speedup in generating region-level representations on large-scale POI graphs.
Conclusion: PlaceFM provides a scalable and efficient solution for multi-granular geospatial analysis that can be directly integrated into geolocation data pipelines to support various urban downstream tasks without costly pre-training.
Abstract: With the rapid growth and continual updates of geospatial data from diverse sources, geospatial foundation model pre-training for urban representation learning has emerged as a key research direction for advancing data-driven urban planning. Spatial structure is fundamental to effective geospatial intelligence systems; however, existing foundation models often lack the flexibility to reason about places, context-rich regions spanning multiple spatial granularities that may consist of many spatially and semantically related points of interest. To address this gap, we propose PlaceFM, a geospatial foundation model that captures place representations through a training-free, clustering-based approach. PlaceFM summarizes the entire point of interest graph constructed from U.S. Foursquare data, producing general-purpose region embeddings while automatically identifying places of interest. These embeddings can be directly integrated into geolocation data pipelines to support a variety of urban downstream tasks. Without the need for costly pre-training, PlaceFM provides a scalable and efficient solution for multi-granular geospatial analysis. Extensive experiments on two real-world prediction tasks, ZIP code-level population density and housing prices, demonstrate that PlaceFM not only outperforms most state-of-the-art graph-based geospatial foundation models but also achieves up to a 100x speedup in generating region-level representations on large-scale POI graphs. The implementation is available at https://github.com/mohammadhashemii/PlaceFM.
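A training-free, clustering-based place embedding can be sketched directly: cluster POI feature vectors and represent each discovered place (and any containing region) by mean-pooling its members. The placeholder POI features and cluster count below are our assumptions, not PlaceFM's actual pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Placeholder POI table: each row is one point of interest with a feature
# vector (e.g., a category embedding concatenated with lon/lat).
poi_feats = rng.normal(size=(1000, 16))

# 1) Discover "places" as clusters of related POIs -- no training involved.
km = KMeans(n_clusters=20, n_init=10, random_state=0).fit(poi_feats)

# 2) A place embedding is the mean of its member POIs; a region embedding is
#    the mean of the places it contains (here: all of them, as a toy region).
place_embs = np.stack([poi_feats[km.labels_ == c].mean(axis=0) for c in range(20)])
region_emb = place_embs.mean(axis=0)
print(place_embs.shape, region_emb.shape)    # (20, 16) (16,)
```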
[563] Near-Optimal Sample Complexities of Divergence-based S-rectangular Distributionally Robust Reinforcement Learning
Zhenghao Li, Shengbo Wang, Nian Si
Main category: cs.LG
TL;DR: This paper establishes near-optimal sample complexity bounds for divergence-based S-rectangular distributionally robust reinforcement learning, achieving optimal dependence on state/action space sizes and accuracy simultaneously.
Details
Motivation: S-rectangular DR-RL models more accurately capture real-world distributional discrepancies and yield more effective robust randomized policies compared to SA-rectangular models, but lack comprehensive statistical analysis.
Method: Empirical value iteration algorithm for divergence-based S-rectangular DR-RL with theoretical analysis of sample complexity.
Result: Achieved near-optimal sample complexity bounds of Õ(|S||A|(1-γ)⁻⁴ε⁻²), which is the first result achieving optimal dependence on |S|, |A|, and ε simultaneously for divergence-based S-rectangular models.
Conclusion: The proposed algorithm demonstrates fast learning performance validated through numerical experiments on robust inventory control and theoretical worst-case examples.
Abstract: Distributionally robust reinforcement learning (DR-RL) has recently gained significant attention as a principled approach that addresses discrepancies between training and testing environments. To balance robustness, conservatism, and computational tractability, the literature has introduced DR-RL models with SA-rectangular and S-rectangular adversaries. While most existing statistical analyses focus on SA-rectangular models, owing to their algorithmic simplicity and the optimality of deterministic policies, S-rectangular models more accurately capture distributional discrepancies in many real-world applications and often yield more effective robust randomized policies. In this paper, we study the empirical value iteration algorithm for divergence-based S-rectangular DR-RL and establish near-optimal sample complexity bounds of $\widetilde{O}(|\mathcal{S}||\mathcal{A}|(1-\gamma)^{-4}\varepsilon^{-2})$, where $\varepsilon$ is the target accuracy, $|\mathcal{S}|$ and $|\mathcal{A}|$ denote the cardinalities of the state and action spaces, and $\gamma$ is the discount factor. To the best of our knowledge, these are the first sample complexity results for divergence-based S-rectangular models that achieve optimal dependence on $|\mathcal{S}|$, $|\mathcal{A}|$, and $\varepsilon$ simultaneously. We further validate this theoretical dependence through numerical experiments on a robust inventory control problem and a theoretical worst-case example, demonstrating the fast learning performance of our proposed algorithm.
[564] A Few Large Shifts: Layer-Inconsistency Based Minimal Overhead Adversarial Example Detection
Sanggeon Yun, Ryozo Masukawa, Hyunwoo Oh, Nathaniel D. Bastian, Mohsen Imani
Main category: cs.LG
TL;DR: A lightweight plug-in detection framework for adversarial examples that uses internal layer-wise inconsistencies within DNNs, requiring only benign data for calibration and achieving SOTA performance with minimal overhead.
Details
Motivation: Existing detection-based defenses often rely on external models, complex architectures, or adversarial data, which limits efficiency and generalizability. There's a need for more practical and efficient detection methods.
Method: Proposes two complementary strategies, Recovery Testing (RT) and Logit-layer Testing (LT), to measure violations of layer-wise Lipschitz continuity caused by adversarial perturbations, based on the A Few Large Shifts Assumption.
Result: Achieves state-of-the-art detection performance on CIFAR-10, CIFAR-100, and ImageNet under standard and adaptive threat models with negligible computational overhead. Provides formal lower-bound guarantee on accuracy.
Conclusion: The proposed lightweight framework effectively detects adversarial examples by leveraging internal model inconsistencies, offering a practical and efficient defense solution with strong theoretical guarantees.
Abstract: Deep neural networks (DNNs) are highly susceptible to adversarial examples: subtle, imperceptible perturbations that can lead to incorrect predictions. While detection-based defenses offer a practical alternative to adversarial training, many existing methods depend on external models, complex architectures, or adversarial data, limiting their efficiency and generalizability. We introduce a lightweight, plug-in detection framework that leverages internal layer-wise inconsistencies within the target model itself, requiring only benign data for calibration. Our approach is grounded in the A Few Large Shifts Assumption, which posits that adversarial perturbations induce large, localized violations of layer-wise Lipschitz continuity in a small subset of layers. Building on this, we propose two complementary strategies, Recovery Testing (RT) and Logit-layer Testing (LT), to empirically measure these violations and expose internal disruptions caused by adversaries. Evaluated on CIFAR-10, CIFAR-100, and ImageNet under both standard and adaptive threat models, our method achieves state-of-the-art detection performance with negligible computational overhead. Furthermore, our system-level analysis provides a practical method for selecting a detection threshold with a formal lower-bound guarantee on accuracy. The code is available here: https://github.com/c0510gy/AFLS-AED.
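The detection interface, calibrating per-layer statistics on benign data only and flagging inputs whose layer-wise behavior shifts sharply, can be sketched with forward hooks. This is a generic layer-inconsistency detector in the spirit of the paper's assumption, not its exact RT/LT tests.

```python
import torch

def collect_layer_norms(model, x, layers):
    """Record the activation norm of each listed layer during one forward pass."""
    norms, handles = [], []
    for layer in layers:
        handles.append(layer.register_forward_hook(
            lambda _m, _inp, out, acc=norms: acc.append(out.flatten(1).norm(dim=1))))
    with torch.no_grad():
        model(x)
    for h in handles:
        h.remove()
    return torch.stack(norms)                 # (n_layers, batch)

def flag_adversarial(model, x_benign, x_test, layers, z_thresh=4.0):
    """Calibrate mean/std of per-layer norms on benign data only, then flag
    test inputs with a large z-score in *any* layer -- echoing the intuition
    that attacks disrupt a small subset of layers."""
    cal = collect_layer_norms(model, x_benign, layers)
    mu, sd = cal.mean(dim=1, keepdim=True), cal.std(dim=1, keepdim=True) + 1e-8
    z = (collect_layer_norms(model, x_test, layers) - mu).abs() / sd
    return (z > z_thresh).any(dim=0)          # True -> flagged as adversarial

# Toy usage on a small MLP.
model = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.ReLU(),
                            torch.nn.Linear(32, 32), torch.nn.ReLU(),
                            torch.nn.Linear(32, 2))
layers = [model[0], model[2], model[4]]
flags = flag_adversarial(model, torch.randn(256, 10), torch.randn(8, 10) * 5, layers)
```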
[565] Model Parallelism With Subnetwork Data Parallelism
Vaibhav Singh, Zafir Khalid, Edouard Oyallon, Eugene Belilovsky
Main category: cs.LG
TL;DR: Subnetwork Data Parallelism (SDP) is a distributed training framework that partitions models into subnetworks across workers without exchanging activations, using backward and forward masking to reduce memory usage by 30%-75% while maintaining or improving performance.
Details
Motivation: Pre-training large neural networks requires heavy memory and costly communication, creating scalability challenges for distributed training.
Method: SDP partitions models into structured subnetworks using backward masking (sparsity only in backward step) and forward masking (removes parameters in both forward and backward passes), with neuron-level and block-level construction strategies for CNNs and transformers.
Result: SDP reduces per-device memory usage by 30%-75% while maintaining or improving performance across CNNs, transformers on CIFAR/ImageNet, and LLM pre-training on FineWeb. Forward masking sometimes achieves better performance in FLOP-matched settings.
Conclusion: SDP provides an effective distributed training framework that significantly reduces memory demands while preserving or enhancing model performance through structured subnetwork partitioning and masking strategies.
Abstract: Pre-training large neural networks at scale imposes heavy memory demands on accelerators and often requires costly communication. We introduce Subnetwork Data Parallelism (SDP), a distributed training framework that partitions a model into structured subnetworks trained across workers without exchanging activations. We study two complementary masking regimes: backward masking, which applies sparsity only in the backward step to retain unbiased gradients, and forward masking, which also removes parameters in the forward pass to deliver stronger efficiency gains while providing additional regularization. We further explore two subnetwork construction strategies: neuron level and block level, applied across both CNNs and transformers. In experiments spanning CNNs and transformers on CIFAR and ImageNet, as well as LLM pre-training on FineWeb, SDP reduces per-device memory usage by 30%-75% while maintaining or improving performance. Notably, in FLOP-matched settings, forward masking can sometimes achieve better performance.
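The forward-masking regime can be illustrated with fixed neuron-level masks: each worker trains only its own structured subnetwork, and no activations cross workers. A single-process sketch follows; the real system distributes the workers and aggregates their updates, which is omitted here.

```python
import torch

class MaskedLinear(torch.nn.Module):
    """Linear layer with a fixed binary mask over output neurons.

    In the forward-masking regime, masked neurons drop out of both the
    forward and backward pass (their weight rows receive zero gradient),
    so each worker only computes and stores its own subnetwork.
    """
    def __init__(self, d_in, d_out, keep_frac=0.5, seed=0):
        super().__init__()
        self.lin = torch.nn.Linear(d_in, d_out)
        g = torch.Generator().manual_seed(seed)       # worker-specific mask
        self.register_buffer(
            "mask", (torch.rand(d_out, generator=g) < keep_frac).float())

    def forward(self, x):
        return self.lin(x) * self.mask                # zero out masked neurons

# Two "workers" holding different subnetworks of the same layer shape.
worker_a = MaskedLinear(16, 64, seed=0)
worker_b = MaskedLinear(16, 64, seed=1)
x = torch.randn(4, 16)
ya, yb = worker_a(x), worker_b(x)   # trained independently; no activations exchanged
```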
[566] Dynamic Bundling with Large Language Models for Zero-Shot Inference on Text-Attributed Graphs
Yusheng Zhao, Qixin Zhang, Xiao Luo, Weizhi Zhang, Zhiping Xiao, Wei Ju, Philip S. Yu, Ming Zhang
Main category: cs.LG
TL;DR: DENSE method uses LLMs to generate bundle-level labels from text-attributed graphs, then uses these labels to supervise graph neural networks while refining bundles to remove noise.
Details
Motivation: LLMs struggle with text attributes isolated from graph topology and produce unreliable predictions due to information insufficiency and inherent weaknesses like hallucination.
Method: Sample bundles of nodes with close-proximity texts, query LLMs to get bundle-level labels, use these labels to supervise GNNs, and iteratively refine bundles to exclude noisy items.
Result: Extensive experiments across ten datasets validate the effectiveness of the proposed method.
Conclusion: DENSE successfully addresses LLM limitations in graph learning by leveraging bundle-level supervision and theoretical analysis supports the design.
Abstract: Large language models (LLMs) have been used in many zero-shot learning problems, with their strong generalization ability. Recently, adopting LLMs in text-attributed graphs (TAGs) has drawn increasing attention. However, the adoption of LLMs faces two major challenges: limited information on graph structure and unreliable responses. LLMs struggle with text attributes isolated from the graph topology. Worse still, they yield unreliable predictions due to both information insufficiency and the inherent weakness of LLMs (e.g., hallucination). Towards this end, this paper proposes a novel method named Dynamic Text Bundling Supervision (DENSE) that queries LLMs with bundles of texts to obtain bundle-level labels and uses these labels to supervise graph neural networks. Specifically, we sample a set of bundles, each containing a set of nodes with corresponding texts of close proximity. We then query LLMs with the bundled texts to obtain the label of each bundle. Subsequently, the bundle labels are used to supervise the optimization of graph neural networks, and the bundles are further refined to exclude noisy items. To justify our design, we also provide theoretical analysis of the proposed method. Extensive experiments across ten datasets validate the effectiveness of the proposed method.
[567] Can LLMs Find Fraudsters? Multi-level LLM Enhanced Graph Fraud Detection
Tairan Huang, Yili Wang, Qiutong Li, Changlong He, Jianliang Gao
Main category: cs.LG
TL;DR: MLED is a multi-level LLM-enhanced graph fraud detection framework that uses LLMs to extract knowledge from textual information and integrates it with graph structures through type-level and relation-level enhancers to improve fraud detection performance.
Details
Motivation: Existing graph fraud detection methods ignore rich semantic cues in raw textual information and struggle with multimodal fusion of text embeddings with graph structures, despite LLMs' powerful text processing capabilities.
Method: Proposed MLED framework with multi-level LLM enhancement: type-level enhancer to distinguish fraudsters from benign entities, and relation-level enhancer to emphasize fraudsters' importance across different relations.
Result: Experiments on four real-world datasets show MLED achieves state-of-the-art performance in graph fraud detection as a generalized framework applicable to existing methods.
Conclusion: MLED successfully integrates LLMs with graph structures through multi-level enhancement, demonstrating superior fraud detection capabilities while maintaining generalizability to existing methods.
Abstract: Graph fraud detection has garnered significant attention as Graph Neural Networks (GNNs) have proven effective in modeling complex relationships within multimodal data. However, existing graph fraud detection methods typically use preprocessed node embeddings and predefined graph structures to reveal fraudsters, which ignore the rich semantic cues contained in raw textual information. Although Large Language Models (LLMs) exhibit powerful capabilities in processing textual information, it remains a significant challenge to perform multimodal fusion of processed textual embeddings with graph structures. In this paper, we propose a Multi-level LLM Enhanced Graph Fraud Detection framework called MLED. In MLED, we utilize LLMs to extract external knowledge from textual information to enhance graph fraud detection methods. To integrate LLMs with graph structure information and enhance the ability to distinguish fraudsters, we design a multi-level LLM enhanced framework including a type-level enhancer and a relation-level enhancer. One is to enhance the difference between the fraudsters and the benign entities, the other is to enhance the importance of the fraudsters in different relations. The experiments on four real-world datasets show that MLED achieves state-of-the-art performance in graph fraud detection as a generalized framework that can be applied to existing methods.
[568] Learning Equivariant Models by Discovering Symmetries with Learnable Augmentations
Eduardo Santos-Escriche, Stefanie Jegelka
Main category: cs.LG
TL;DR: SEMoLA is an end-to-end approach that discovers unknown symmetries in data via learnable augmentations and encodes approximate equivariance into unconstrained models, eliminating the need for prior symmetry knowledge.
Details
Motivation: Current approaches either require a priori symmetry knowledge (soft equivariance) or lack interpretability (implicit equivariance). SEMoLA addresses these limitations by automatically discovering symmetries.
Method: Jointly discovers unknown symmetries through learnable data augmentations and encodes approximate equivariance into arbitrary unconstrained models in an end-to-end manner.
Result: SEMoLA robustly discovers relevant symmetries while achieving high prediction performance across various datasets with multiple data modalities and symmetry groups.
Conclusion: SEMoLA enables learning equivariant models without prior symmetry knowledge, offers interpretability, and maintains robustness to distribution shifts.
Abstract: Recently, a trend has emerged that favors shifting away from designing constrained equivariant architectures for data in geometric domains and instead (1) modifying the training protocol, e.g., with a specific loss and data augmentations (soft equivariance), or (2) ignoring equivariance and inferring it only implicitly. However, both options have limitations, e.g., soft equivariance still requires a priori knowledge about the underlying symmetries, while implicitly learning equivariance from data lacks interpretability. To address these limitations, we propose SEMoLA, an end-to-end approach that jointly (1) discovers a priori unknown symmetries in the data via learnable data augmentations, and uses them to (2) encode the respective approximate equivariance into arbitrary unconstrained models. Hence, it enables learning equivariant models that do not need prior knowledge about symmetries, offer interpretability, and maintain robustness to distribution shifts. Empirically, we demonstrate the ability of SEMoLA to robustly discover relevant symmetries while achieving high prediction performance across various datasets, encompassing multiple data modalities and underlying symmetry groups.
[569] Learning Beyond Experience: Generalizing to Unseen State Space with Reservoir Computing
Declan A. Norton, Yuanzhao Zhang, Michelle Girvan
Main category: cs.LG
TL;DR: Reservoir computing can generalize to unexplored regions of state space without explicit structural priors, enabling prediction in unobserved basins of multistable systems.
Details
Motivation: Machine learning techniques for dynamical systems often fail to generalize to poorly represented aspects in training data without structural priors. This work aims to show that reservoir computing can overcome this limitation.
Method: Developed a multiple-trajectory training scheme for reservoir computers that supports training across disjoint time series, applied to multistable dynamical systems.
Result: Reservoir computers trained on trajectories from a single basin of attraction successfully captured system behavior in entirely unobserved basins, demonstrating out-of-domain generalization.
Conclusion: Reservoir computing can achieve effective generalization to unexplored regions of state space without requiring explicit structural priors, enabling robust data-driven modeling of multistable dynamical systems.
Abstract: Machine learning techniques offer an effective approach to modeling dynamical systems solely from observed data. However, without explicit structural priors (built-in assumptions about the underlying dynamics), these techniques typically struggle to generalize to aspects of the dynamics that are poorly represented in the training data. Here, we demonstrate that reservoir computing, a simple, efficient, and versatile machine learning framework often used for data-driven modeling of dynamical systems, can generalize to unexplored regions of state space without explicit structural priors. First, we describe a multiple-trajectory training scheme for reservoir computers that supports training across a collection of disjoint time series, enabling effective use of available training data. Then, applying this training scheme to multistable dynamical systems, we show that reservoir computers trained on trajectories from a single basin of attraction can achieve out-of-domain generalization by capturing system behavior in entirely unobserved basins.
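A minimal echo-state-network reading of the multiple-trajectory scheme: drive one fixed reservoir with each disjoint trajectory separately (resetting state and discarding a washout per trajectory), pool the collected states, and fit a single ridge-regression readout. The reset-and-washout detail is our interpretation of "training across disjoint time series", not a quote of the paper's procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
n_res, n_in = 200, 1
W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
W = rng.normal(size=(n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # spectral radius < 1

def run_reservoir(u):
    """Drive the reservoir with input sequence u; state is reset at the start."""
    x, states = np.zeros(n_res), []
    for u_t in u:
        x = np.tanh(W @ x + W_in @ np.atleast_1d(u_t))
        states.append(x.copy())
    return np.array(states)

# Two disjoint training trajectories (toy: sine segments with different phases).
trajs = [np.sin(0.1 * np.arange(400) + ph) for ph in (0.0, 2.0)]
X, Y = [], []
for u in trajs:
    s = run_reservoir(u[:-1])[50:]      # washout first 50 steps per trajectory
    X.append(s); Y.append(u[1:][50:])   # one-step-ahead targets
X, Y = np.vstack(X), np.concatenate(Y)
# Single ridge-regression readout fit on states pooled across trajectories.
ridge = 1e-6
W_out = np.linalg.solve(X.T @ X + ridge * np.eye(n_res), X.T @ Y)
print("train MSE:", np.mean((X @ W_out - Y) ** 2))
```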
[570] Tackling Federated Unlearning as a Parameter Estimation Problem
Antonio Balordi, Lorenzo Manini, Fabio Stella, Alessio Merlo
Main category: cs.LG
TL;DR: Federated Unlearning framework using information theory and Hessian analysis to selectively reset parameters sensitive to forgotten data, enabling efficient data erasure in federated learning without full retraining.
Details
Motivation: Privacy regulations require data erasure from deep learning models, which is challenging in Federated Learning where data remains on clients and full retraining is often infeasible.
Method: Uses second-order Hessian information to identify and selectively reset parameters most sensitive to the data being forgotten, followed by minimal federated retraining. Model-agnostic approach that doesn't require server access to raw client data.
Result: Strong privacy protection (MIA success near random, categorical knowledge erased), high performance (Normalized Accuracy ≈ 0.9 vs retrained benchmarks), and effective backdoor attack neutralization.
Conclusion: Provides a practical solution for data forgetting in Federated Learning that balances privacy, performance, and efficiency.
Abstract: Privacy regulations require the erasure of data from deep learning models. This is a significant challenge that is amplified in Federated Learning, where data remains on clients, making full retraining or coordinated updates often infeasible. This work introduces an efficient Federated Unlearning framework based on information theory, modeling leakage as a parameter estimation problem. Our method uses second-order Hessian information to identify and selectively reset only the parameters most sensitive to the data being forgotten, followed by minimal federated retraining. This model-agnostic approach supports categorical and client unlearning without requiring server access to raw client data after initial information aggregation. Evaluations on benchmark datasets demonstrate strong privacy (MIA success near random, categorical knowledge erased) and high performance (Normalized Accuracy against re-trained benchmarks of $\approx$ 0.9), while aiming for increased efficiency over complete retraining. Furthermore, in a targeted backdoor attack scenario, our framework effectively neutralizes the malicious trigger, restoring model integrity. This offers a practical solution for data forgetting in FL.
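The core step, scoring and resetting the parameters most sensitive to the forget set, can be sketched as follows. This version uses a diagonal Fisher approximation (squared gradients on the forget data) as a common stand-in for second-order Hessian information; the paper's actual estimator, reset rule, and federated orchestration are not specified here and may differ.

```python
import torch

def selective_reset(model, forget_loader, loss_fn, frac=0.01):
    """Score parameters by squared gradients on the forget set (diagonal
    Fisher, a proxy for Hessian sensitivity), then re-initialize the
    top `frac` most sensitive entries before minimal retraining."""
    scores = [torch.zeros_like(p) for p in model.parameters()]
    for x, y in forget_loader:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for s, p in zip(scores, model.parameters()):
            s += p.grad.detach() ** 2

    flat = torch.cat([s.flatten() for s in scores])
    k = max(1, int(frac * flat.numel()))
    threshold = flat.topk(k).values.min()                 # sensitivity cutoff

    with torch.no_grad():
        for s, p in zip(scores, model.parameters()):
            mask = s >= threshold
            p[mask] = (torch.randn_like(p) * 0.02)[mask]  # reset sensitive weights
    # ...followed by a few rounds of minimal federated retraining.
```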
[571] Learning Task-Agnostic Motifs to Capture the Continuous Nature of Animal Behavior
Jiyi Wang, Jingyang Ke, Bo Dai, Anqi Wu
Main category: cs.LG
TL;DR: MCD discovers interpretable motor motifs as latent basis functions and models behavior as continuously evolving mixtures of these motifs, overcoming limitations of discrete segmentation methods.
Details
Motivation: Existing behavior segmentation methods oversimplify animal behavior by imposing discrete syllables under restrictive generative assumptions, failing to capture the continuous structure of how animals flexibly recombine motor motifs.
Method: Motif-based continuous dynamics (MCD) discovery framework that (1) uncovers interpretable motif sets as latent basis functions using behavioral transition structure, and (2) models behavioral dynamics as continuously evolving mixtures of these motifs.
Result: Validated on multi-task gridworld, labyrinth navigation, and freely moving animal behavior. MCD identifies reusable motif components, captures continuous compositional dynamics, and generates realistic trajectories beyond traditional discrete segmentation models.
Conclusion: MCD provides a generative account of how complex animal behaviors emerge from dynamic combinations of fundamental motor motifs, advancing the quantitative study of natural behavior.
Abstract: Animals flexibly recombine a finite set of core motor motifs to meet diverse task demands, but existing behavior segmentation methods oversimplify this process by imposing discrete syllables under restrictive generative assumptions. To better capture the continuous structure of behavior generation, we introduce motif-based continuous dynamics (MCD) discovery, a framework that (1) uncovers interpretable motif sets as latent basis functions of behavior by leveraging representations of behavioral transition structure, and (2) models behavioral dynamics as continuously evolving mixtures of these motifs. We validate MCD on a multi-task gridworld, a labyrinth navigation task, and freely moving animal behavior. Across settings, it identifies reusable motif components, captures continuous compositional dynamics, and generates realistic trajectories beyond the capabilities of traditional discrete segmentation models. By providing a generative account of how complex animal behaviors emerge from dynamic combinations of fundamental motor motifs, our approach advances the quantitative study of natural behavior.
[572] Forecasting the Ionosphere from Sparse GNSS Data with Temporal-Fusion Transformers
Giacomo Acciarini, Simone Mestici, Halil Kelebek, Linnea Wolniewicz, Michael Vergalla, Madhulika Guhathakurta, Umaa Rebbapragada, Bala Poduval, Atılım Güneş Baydin, Frank Soboczenski
Main category: cs.LG
TL;DR: A machine learning framework using Temporal Fusion Transformers for forecasting ionospheric Total Electron Content (TEC) up to 24 hours ahead, achieving RMSE as low as 3.33 TECU with solar EUV irradiance as the strongest predictor.
Details
Motivation: Accurate ionospheric prediction is challenging due to nonlinear couplings between solar, geomagnetic, and thermospheric drivers, and current limitations in forecasting during strong space weather conditions with sparse global measurements.
Method: Uses Temporal Fusion Transformers (TFT) to predict sparse ionosphere data, incorporating heterogeneous inputs (solar irradiance, geomagnetic indices, GNSS-derived vertical TEC) with preprocessing and temporal alignment strategies.
Result: Achieves robust predictions up to 24 hours ahead with root mean square errors as low as 3.33 TECU, with solar EUV irradiance identified as the strongest predictive signal.
Conclusion: The framework provides both accurate forecasting and interpretability through attention-based analysis, and is released as open-source toolkit ionopy to support operational applications and scientific discovery.
Abstract: The ionosphere critically influences Global Navigation Satellite Systems (GNSS), satellite communications, and Low Earth Orbit (LEO) operations, yet accurate prediction of its variability remains challenging due to nonlinear couplings between solar, geomagnetic, and thermospheric drivers. Total Electron Content (TEC), a key ionospheric parameter, is derived from GNSS observations, but its reliable forecasting is limited by the sparse nature of global measurements and the limited accuracy of empirical models, especially during strong space weather conditions. In this work, we present a machine learning framework for ionospheric TEC forecasting that leverages Temporal Fusion Transformers (TFT) to predict sparse ionosphere data. Our approach accommodates heterogeneous input sources, including solar irradiance, geomagnetic indices, and GNSS-derived vertical TEC, and applies preprocessing and temporal alignment strategies. Experiments spanning 2010-2025 demonstrate that the model achieves robust predictions up to 24 hours ahead, with root mean square errors as low as 3.33 TECU. Results highlight that solar EUV irradiance provides the strongest predictive signals. Beyond forecasting accuracy, the framework offers interpretability through attention-based analysis, supporting both operational applications and scientific discovery. To encourage reproducibility and community-driven development, we release the full implementation as the open-source toolkit ionopy.
[573] Scaling Laws for Optimal Data Mixtures
Mustafa Shukor, Louis Bethune, Dan Busbridge, David Grangier, Enrico Fini, Alaaeldin El-Nouby, Pierre Ablin
Main category: cs.LG
TL;DR: A systematic method using scaling laws to determine optimal data mixture proportions for large foundation models, validated across LLMs, multimodal models, and vision models.
Details
Motivation: Standard trial-and-error approaches for selecting data mixtures become impractical for large-scale pretraining, requiring a more systematic solution.
Method: Propose scaling laws that predict model loss based on model size, token count, and domain weights, with parameters estimated from small-scale training runs.
Result: Validated across three large-scale settings (LLM, NMM, LVM), showing accurate performance prediction and extrapolation to new mixtures and scales.
Conclusion: Provides principled alternative to costly trial-and-error methods for determining optimal domain weights under given training budgets.
Abstract: Large foundation models are typically trained on data from multiple domains, with the data mixture (the proportion of each domain used) playing a critical role in model performance. The standard approach to selecting this mixture relies on trial and error, which becomes impractical for large-scale pretraining. We propose a systematic method to determine the optimal data mixture for any target domain using scaling laws. Our approach accurately predicts the loss of a model of size $N$ trained with $D$ tokens and a specific domain weight vector $h$. We validate the universality of these scaling laws by demonstrating their predictive power in three distinct and large-scale settings: large language model (LLM), native multimodal model (NMM), and large vision models (LVM) pretraining. We further show that these scaling laws can extrapolate to new data mixtures and across scales: their parameters can be accurately estimated using a few small-scale training runs, and used to estimate the performance at larger scales and unseen domain weights. The scaling laws allow deriving the optimal domain weights for any target domain under a given training budget ($N$,$D$), providing a principled alternative to costly trial-and-error methods.
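The summary does not give the law's functional form, so the sketch below assumes a simple additive power law in $N$ and $D$ whose coefficients depend log-linearly on the domain-weight vector $h$. The point is the recipe (fit parameters on a few small runs, then search mixtures at the target budget), not this particular parameterization.

```python
import numpy as np
from scipy.optimize import minimize

def predicted_loss(theta, N, D, h):
    """Hypothetical mixture-aware scaling law:
    L(N, D, h) = E + exp(h @ a) / N**alpha + exp(h @ b) / D**beta."""
    k = h.shape[1]
    E, alpha, beta = theta[0], theta[1], theta[2]
    a, b = theta[3:3 + k], theta[3 + k:]
    return E + np.exp(h @ a) / N**alpha + np.exp(h @ b) / D**beta

def fit(runs):
    """runs: list of (N, D, h, loss) tuples from small-scale training."""
    N, D, h, L = map(np.array, zip(*runs))
    k = h.shape[1]
    obj = lambda th: np.mean((predicted_loss(th, N, D, h) - L) ** 2)
    return minimize(obj, x0=0.5 * np.ones(3 + 2 * k), method="Nelder-Mead").x

def best_mixture(theta, N, D, k, n_grid=10000, seed=0):
    """Search the simplex for the weights minimizing predicted loss at (N, D)."""
    h = np.random.default_rng(seed).dirichlet(np.ones(k), size=n_grid)
    return h[np.argmin(predicted_loss(theta, N, D, h))]
```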
[574] How Much Is Too Much? Adaptive, Context-Aware Risk Detection in Naturalistic Driving
Amir Hossein Kalantari, Eleonora Papadimitriou, Arkady Zgonnikov, Amir Pooyan Afghari
Main category: cs.LG
TL;DR: Proposed a unified, context-aware framework for adaptive risk detection using driver behavior data that addresses limitations of existing methods by adapting labels and models over time and across drivers.
Details
Motivation: Existing frameworks for risk identification have limitations: they rely on predefined time windows and fixed thresholds, and assume behavior is stationary across drivers and time, leading to timing errors, weak generalization, and higher false-alarm rates.
Method: Developed a framework using rolling windows, joint optimization, dynamic calibration, and model fusion for time-stamped kinematic data. Tested with speed-weighted headway and harsh driving events using Random Forest, XGBoost, and DNN models.
Result: Speed-weighted headway provided more stable and context-sensitive classifications than harsh-event counts. XGBoost maintained consistent performance under changing thresholds, while DNN achieved higher recall at lower thresholds with greater variability. Ensemble approach balanced risk responsiveness with false alert control.
Conclusion: The framework shows promise for adaptive, context-aware risk detection that can enhance real-time safety feedback and support driver-focused interventions in intelligent transportation systems.
Abstract: Reliable risk identification based on driver behavior data underpins real-time safety feedback, fleet risk management, and evaluation of driver-assist systems. While naturalistic driving studies have become foundational for providing real-world driver behavior data, the existing frameworks for identifying risk based on such data have two fundamental limitations: (i) they rely on predefined time windows and fixed thresholds to disentangle risky and normal driving behavior, and (ii) they assume behavior is stationary across drivers and time, ignoring heterogeneity and temporal drift. In practice, these limitations can lead to timing errors and miscalibration in alerts, weak generalization to new drivers/routes/conditions, and higher false-alarm and miss rates, undermining driver trust and reducing safety intervention effectiveness. To address this gap, we propose a unified, context-aware framework that adapts labels and models over time and across drivers via rolling windows, joint optimization, dynamic calibration, and model fusion, tailored for time-stamped kinematic data. The framework is tested using two safety indicators, speed-weighted headway and harsh driving events, and three models: Random Forest, XGBoost, and Deep Neural Network (DNN). Speed-weighted headway yielded more stable and context-sensitive classifications than harsh-event counts. XGBoost maintained consistent performance under changing thresholds, whereas DNN achieved higher recall at lower thresholds but with greater variability across trials. The ensemble aggregated signals from multiple models into a single risk decision, balancing responsiveness to risky behavior with control of false alerts. Overall, the framework shows promise for adaptive, context-aware risk detection that can enhance real-time safety feedback and support driver-focused interventions in intelligent transportation systems.
[575] Discovering Software Parallelization Points Using Deep Neural Networks
Izavan dos S. Correia, Henrique C. T. Santos, Tiago A. E. Ferreira
Main category: cs.LG
TL;DR: Deep learning approach using DNN and CNN models to classify loops as parallelizable or ambiguous, with CNN showing slightly better performance.
Details
Motivation: To automate the identification of parallelizable loops in code for software optimization and performance improvement.
Method: Used genetic algorithm-based code generators to create datasets of independent (parallelizable) and ambiguous loops, then trained DNN and CNN models on tokenized code snippets.
Result: CNN showed slightly higher mean performance than DNN, both models had similar variability, and data diversity was crucial for performance.
Conclusion: Deep learning is feasible for automating parallelizable loop identification, offering promising tools for software optimization.
Abstract: This study proposes a deep learning-based approach for discovering loops in programming code according to their potential for parallelization. Two genetic algorithm-based code generators were developed to produce two distinct types of code: (i) independent loops, which are parallelizable, and (ii) ambiguous loops, whose dependencies are unclear, making it impossible to determine whether the loop is parallelizable. The generated code snippets were tokenized and preprocessed to ensure a robust dataset. Two deep learning models - a Deep Neural Network (DNN) and a Convolutional Neural Network (CNN) - were implemented to perform the classification. Based on 30 independent runs, a robust statistical analysis was employed to verify the expected performance of both models, DNN and CNN. The CNN showed a slightly higher mean performance, but the two models had a similar variability. Experiments with varying dataset sizes highlighted the importance of data diversity for model performance. These results demonstrate the feasibility of using deep learning to automate the identification of parallelizable structures in code, offering a promising tool for software optimization and performance improvement.
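For the CNN branch, the pipeline amounts to embedding token ids and convolving over the sequence before a binary head. A minimal sketch (vocabulary size, filter widths, and pooling are placeholders, not the paper's configuration):

```python
import torch
import torch.nn as nn

class LoopCNN(nn.Module):
    """Minimal 1-D CNN over tokenized code with a binary head for
    parallelizable vs. ambiguous loops. Sizes are placeholders."""
    def __init__(self, vocab=512, emb=64, n_filters=128, kernel=5):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.conv = nn.Conv1d(emb, n_filters, kernel, padding=kernel // 2)
        self.head = nn.Linear(n_filters, 2)

    def forward(self, tokens):                 # tokens: (batch, seq_len) int ids
        x = self.embed(tokens).transpose(1, 2) # -> (batch, emb, seq_len)
        x = torch.relu(self.conv(x)).amax(-1)  # global max-pool over positions
        return self.head(x)                    # logits: parallelizable / ambiguous
```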
[576] MissionHD: Hyperdimensional Refinement of Distribution-Deficient Reasoning Graphs for Video Anomaly Detection
Sanggeon Yun, Raheeb Hassan, Ryozo Masukawa, Nathaniel D. Bastian, Mohsen Imani
Main category: cs.LG
TL;DR: HDC-GSR uses hyperdimensional computing to refine LLM-generated mission-specific graphs for video anomaly detection, improving performance without structural-distribution learning.
Details
Motivation: Existing graph structure refinement methods are ineffective for LLM-generated mission-specific graphs due to skewed connectivity and lack of large-scale pre-training datasets.
Method: Proposed MissionHD framework that encodes graphs with constrained graph-neural operations, aligns them with downstream task loss, and decodes refined structures using hyperdimensional computing.
Result: Experiments on VAD/VAR benchmarks show that MissionHD-refined graphs consistently improve performance in video anomaly tasks.
Conclusion: HDC-GSR is established as an effective pre-processing step for structured reasoning in video anomaly detection and recognition tasks.
Abstract: LLM-generated reasoning graphs, referred to as mission-specific graphs (MSGs), are increasingly used for video anomaly detection (VAD) and recognition (VAR). These MSGs are novel artifacts: they often exhibit skewed connectivity and lack large-scale datasets for pre-training, which makes existing graph structure refinement (GSR) methods ineffective. To address this challenge, we propose HDC-constrained Graph Structure Refinement (HDC-GSR), a paradigm that leverages hyperdimensional computing (HDC) to optimize decodable graph representations without relying on structural-distribution learning. Building on this paradigm, we introduce MissionHD, an HDC framework that encodes graphs with constrained graph-neural operations, aligns them directly with downstream task loss, and decodes refined structures. Experiments on VAD/VAR benchmarks demonstrate that MissionHD-refined graphs consistently improve performance, establishing HDC-GSR as an effective pre-processing step for structured reasoning in video anomaly tasks.
[577] Learning From Simulators: A Theory of Simulation-Grounded Learning
Carson Dudley, Marisa Eisenberg
Main category: cs.LG
TL;DR: SGNNs are predictive models trained on synthetic simulation data that achieve state-of-the-art performance when real-world labels are limited. This paper provides a formal statistical framework showing SGNNs converge to Bayes-optimal predictors under the synthetic distribution, with guarantees for parameter learning and mechanistic explanations.
Details
Motivation: SGNNs have achieved strong empirical performance in data-limited domains but lack formal theoretical underpinning. The authors aim to establish a unified statistical framework to understand their properties and guarantees.
Method: The authors place SGNNs in a statistical framework interpreting them as amortized Bayesian predictors trained under simulator-induced priors. They apply distribution shift theory and develop SGNN-specific results including conditions for parameter learnability and back-to-simulation attribution methods.
Result: Numerical experiments show SGNNs recover latent parameters, remain robust under simulator-reality mismatch, and outperform classical tools - achieving half the error of AIC in model selection tasks for distinguishing mechanistic dynamics.
Conclusion: SGNNs are established as a principled and practical framework for scientific prediction in data-limited regimes, with formal guarantees for parameter learning and mechanistic interpretability.
Abstract: Simulation-Grounded Neural Networks (SGNNs) are predictive models trained entirely on synthetic data from mechanistic simulations. They have achieved state-of-the-art performance in domains where real-world labels are limited or unobserved, but lack a formal underpinning. We place SGNNs in a unified statistical framework. Under standard loss functions, they can be interpreted as amortized Bayesian predictors trained under a simulator-induced prior. Empirical risk minimization then yields convergence to the Bayes-optimal predictor under the synthetic distribution. We employ classical results on distribution shift to characterize how performance degrades when the simulator diverges from reality. Beyond these consequences, we develop SGNN-specific results: (i) conditions under which unobserved scientific parameters are learnable via simulation, and (ii) a back-to-simulation attribution method that provides mechanistic explanations of predictions by linking them to the simulations the model deems similar, with guarantees of posterior consistency. We provide numerical experiments to validate theoretical predictions. SGNNs recover latent parameters, remain robust under mismatch, and outperform classical tools: in a model selection task, SGNNs achieve half the error of AIC in distinguishing mechanistic dynamics. These results establish SGNNs as a principled and practical framework for scientific prediction in data-limited regimes.
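The simulation-grounded training loop is simple to picture: sample parameters from a prior, run the mechanistic simulator, and regress the parameters from the simulated observation. Below is a toy sketch with a stand-in logistic-growth simulator (all names and sizes hypothetical); under squared loss, the trained network approximates the posterior mean induced by the simulator prior, which is exactly the amortized-Bayes reading above.

```python
import torch
import torch.nn as nn

def simulate(theta, T=50):
    """Toy mechanistic simulator (noisy logistic growth); a stand-in for
    whatever domain simulator generates the synthetic training data."""
    r, K = theta
    x, out = torch.tensor(0.1), []
    for _ in range(T):
        x = x + r * x * (1 - x / K) + 0.01 * torch.randn(())
        out.append(x)
    return torch.stack(out)

prior = torch.distributions.Uniform(torch.tensor([0.05, 0.5]),
                                    torch.tensor([0.5, 2.0]))
net = nn.Sequential(nn.Linear(50, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(2000):                      # train purely on simulations
    theta = prior.sample()                    # simulator-induced prior
    x = simulate(theta)
    loss = ((net(x) - theta) ** 2).mean()     # MSE -> posterior-mean estimator
    opt.zero_grad(); loss.backward(); opt.step()
```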
[578] IndexNet: Timestamp and Variable-Aware Modeling for Time Series Forecasting
Beiliang Wu, Peiyuan Liu, Yifan Hu, Luyan Zhang, Ao Hu, Zenglin Xu
Main category: cs.LG
TL;DR: IndexNet is an MLP-based framework for multivariate time series forecasting that incorporates timestamp and variable index embeddings to capture long-term periodic patterns and distinguish between heterogeneous variables.
Details
Motivation: Most existing MTSF methods overlook index-related descriptive information like timestamps and variable indices, which carry rich contextual semantics that could improve forecasting performance.
Method: Proposes IndexNet with Index Embedding module containing Timestamp Embedding (transforms timestamps into vectors) and Channel Embedding (assigns unique identity embeddings to variables based on indices). Uses MLP-based architecture for lightweight periodic pattern capture.
Result: Extensive experiments on 12 real-world datasets show comparable performance to mainstream baselines. Plug-and-play experiments and visualization analyses demonstrate strong generality and interpretability.
Conclusion: IndexNet effectively leverages temporal and variable-aware design through index embeddings, achieving competitive performance while addressing underexplored aspects of generality and interpretability in MTSF research.
Abstract: Multivariate time series forecasting (MTSF) plays a vital role in a wide range of real-world applications, such as weather prediction and traffic flow forecasting. Although recent advances have significantly improved the modeling of temporal dynamics and inter-variable dependencies, most existing methods overlook index-related descriptive information, such as timestamps and variable indices, which carry rich contextual semantics. To unlock the potential of such information and take advantage of the lightweight and powerful periodic capture ability of MLP-based architectures, we propose IndexNet, an MLP-based framework augmented with an Index Embedding (IE) module. The IE module consists of two key components: Timestamp Embedding (TE) and Channel Embedding (CE). Specifically, TE transforms timestamps into embedding vectors and injects them into the input sequence, thereby improving the model’s ability to capture long-term complex periodic patterns. In parallel, CE assigns each variable a unique and trainable identity embedding based on its index, allowing the model to explicitly distinguish between heterogeneous variables and avoid homogenized predictions when input sequences seem close. Extensive experiments on 12 diverse real-world datasets demonstrate that IndexNet achieves comparable performance across mainstream baselines, validating the effectiveness of our temporally and variably aware design. Moreover, plug-and-play experiments and visualization analyses further reveal that IndexNet exhibits strong generality and interpretability, two aspects that remain underexplored in current MTSF research.
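A minimal sketch of the Index Embedding idea: timestamp features are embedded and injected alongside the input sequence (TE), and each variable receives a trainable identity embedding keyed by its index (CE). Layer sizes and the exact injection points are guesses, not the paper's architecture.

```python
import torch
import torch.nn as nn

class IndexEmbeddingMLP(nn.Module):
    """Sketch of TE + CE feeding an MLP forecaster (sizes illustrative)."""
    def __init__(self, n_vars, seq_len, horizon, d=64, n_time_feats=4):
        super().__init__()
        self.te = nn.Linear(n_time_feats, d)          # timestamp embedding
        self.ce = nn.Embedding(n_vars, d)             # per-variable identity
        self.backbone = nn.Sequential(
            nn.Linear(seq_len + 2 * d, 256), nn.ReLU(), nn.Linear(256, horizon))

    def forward(self, x, time_feats):
        # x: (batch, n_vars, seq_len); time_feats: (batch, n_time_feats)
        b, v, _ = x.shape
        te = self.te(time_feats).unsqueeze(1).expand(b, v, -1)
        ce = self.ce.weight.unsqueeze(0).expand(b, -1, -1)
        return self.backbone(torch.cat([x, te, ce], dim=-1))  # (b, v, horizon)
```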
[579] Theoretical Foundations of Representation Learning using Unlabeled Data: Statistics and Optimization
Pascal Esser, Maximilian Fleissner, Debarghya Ghoshdastidar
Main category: cs.LG
TL;DR: This paper provides a theoretical overview of representation learning from unlabeled data, focusing on analyzing modern deep learning models that use self-supervision and denoising/masked autoencoders, which cannot be easily explained by classical theories.
Details
Motivation: Current deep learning models use new principles for unsupervised representation learning that cannot be easily analyzed using classical theories, making it difficult to characterize the representations learned and explain their strong performance across diverse tasks.
Method: The paper combines mathematical tools from statistics and optimization to provide theoretical analysis of representation learning from unlabeled data, particularly focusing on visual foundation models using self-supervision and denoising/masked autoencoders.
Result: The paper presents recent theoretical advances in understanding representation learning from unlabeled data and includes the authors’ contributions to this research direction.
Conclusion: There is a need for new theoretical frameworks that combine statistics and optimization to explain the success of modern unsupervised representation learning methods and characterize the representations they learn.
Abstract: Representation learning from unlabeled data has been extensively studied in statistics, data science and signal processing with a rich literature on techniques for dimension reduction, compression, multi-dimensional scaling among others. However, current deep learning models use new principles for unsupervised representation learning that cannot be easily analyzed using classical theories. For example, visual foundation models have found tremendous success using self-supervision or denoising/masked autoencoders, which effectively learn representations from massive amounts of unlabeled data. However, it remains difficult to characterize the representations learned by these models and to explain why they perform well for diverse prediction tasks or show emergent behavior. To answer these questions, one needs to combine mathematical tools from statistics and optimization. This paper provides an overview of recent theoretical advances in representation learning from unlabeled data and mentions our contributions in this direction.
[580] Beyond Slater’s Condition in Online CMDPs with Stochastic and Adversarial Constraints
Francesco Emanuele Stradi, Eleonora Fidelia Chiefari, Matteo Castiglioni, Alberto Marchesi, Nicola Gatti
Main category: cs.LG
TL;DR: Novel algorithm for online episodic Constrained MDPs with improved guarantees over state-of-the-art, achieving O(√T) regret and constraint violation in stochastic settings without Slater’s condition, and sublinear violation in adversarial settings.
Details
Motivation: To address limitations of existing best-of-both-worlds algorithms for CMDPs, particularly the reliance on Slater’s condition and inability to handle settings without strictly feasible solutions.
Method: Developed a new algorithm for online episodic CMDPs that works under both stochastic and adversarial constraints without requiring Slater’s condition.
Result: Achieves O(√T) regret and constraint violation in stochastic regimes, provides guarantees on positive constraint violation, ensures sublinear constraint violation in adversarial regimes without Slater’s condition, and achieves sublinear α-regret relative to unconstrained optimum.
Conclusion: The proposed algorithm significantly improves upon state-of-the-art methods, handles more challenging constraint scenarios, and demonstrates practical effectiveness through synthetic experiments.
Abstract: We study online episodic Constrained Markov Decision Processes (CMDPs) under both stochastic and adversarial constraints. We provide a novel algorithm whose guarantees greatly improve those of the state-of-the-art best-of-both-worlds algorithm introduced by Stradi et al. (2025). In the stochastic regime, i.e., when the constraints are sampled from fixed but unknown distributions, our method achieves $\widetilde{\mathcal{O}}(\sqrt{T})$ regret and constraint violation without relying on Slater’s condition, thereby handling settings where no strictly feasible solution exists. Moreover, we provide guarantees on the stronger notion of positive constraint violation, which does not allow recovering from large violations in the early episodes by playing strictly safe policies. In the adversarial regime, i.e., when the constraints may change arbitrarily between episodes, our algorithm ensures sublinear constraint violation without Slater’s condition, and achieves sublinear $\alpha$-regret with respect to the unconstrained optimum, where $\alpha$ is a suitably defined multiplicative approximation factor. We further validate our results through synthetic experiments, showing the practical effectiveness of our algorithm.
[581] Uncertainty-Aware Generative Oversampling Using an Entropy-Guided Conditional Variational Autoencoder
Amirhossein Zare, Amirhessam Zare, Parmida Sadat Pezeshki, Herlock Rahimi, Ali Ebrahimi, Ignacio Vázquez-García, Leo Anthony Celi
Main category: cs.LG
TL;DR: LEO-CVAE is a generative oversampling method that incorporates local uncertainty through entropy-guided loss and sampling to address class imbalance in high-dimensional biomedical data.
Details
Motivation: Traditional oversampling methods produce implausible samples, while standard generative models neglect uncertain boundary regions that are important for classification.
Method: Uses CVAE with local entropy computation to identify uncertain regions, then applies Local Entropy-Weighted Loss and entropy-guided sampling to focus on class-overlapping areas.
Result: Outperforms traditional oversampling and generative baselines on clinical genomics datasets (ADNI and TCGA lung cancer), improving classifier performance.
Conclusion: Uncertainty-aware generative oversampling is valuable for imbalanced learning in domains with complex nonlinear structures like omics data.
Abstract: Class imbalance remains a major challenge in machine learning, especially for high-dimensional biomedical data where nonlinear manifold structures dominate. Traditional oversampling methods such as SMOTE rely on local linear interpolation, often producing implausible synthetic samples. Deep generative models like Conditional Variational Autoencoders (CVAEs) better capture nonlinear distributions, but standard variants treat all minority samples equally, neglecting the importance of uncertain, boundary-region examples emphasized by heuristic methods like Borderline-SMOTE and ADASYN. We propose Local Entropy-Guided Oversampling with a CVAE (LEO-CVAE), a generative oversampling framework that explicitly incorporates local uncertainty into both representation learning and data generation. To quantify uncertainty, we compute Shannon entropy over the class distribution in a sample’s neighborhood: high entropy indicates greater class overlap, serving as a proxy for uncertainty. LEO-CVAE leverages this signal through two mechanisms: (i) a Local Entropy-Weighted Loss (LEWL) that emphasizes robust learning in uncertain regions, and (ii) an entropy-guided sampling strategy that concentrates generation in these informative, class-overlapping areas. Applied to clinical genomics datasets (ADNI and TCGA lung cancer), LEO-CVAE consistently improves classifier performance, outperforming both traditional oversampling and generative baselines. These results highlight the value of uncertainty-aware generative oversampling for imbalanced learning in domains governed by complex nonlinear structures, such as omics data.
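The uncertainty signal itself is easy to reproduce: Shannon entropy of the label distribution in each sample's k-nearest-neighbor neighborhood, with high entropy flagging class-overlap regions. A sketch of that computation (the paper's weighting and sampling details may differ):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_entropy(X, y, k=10):
    """Shannon entropy of the label distribution in each sample's
    k-neighborhood; high entropy marks class-overlap regions.
    Assumes y is integer-coded."""
    y = np.asarray(y)
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nbrs.kneighbors(X)
    idx = idx[:, 1:]                                   # drop the point itself
    ent = np.empty(len(X))
    n_classes = y.max() + 1
    for i, neigh in enumerate(idx):
        p = np.bincount(y[neigh], minlength=n_classes) / k
        p = p[p > 0]
        ent[i] = -(p * np.log(p)).sum()
    return ent  # e.g., use as per-sample loss weights or sampling scores
```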
[582] Analysis of Variational Sparse Autoencoders
Zachary Baker, Yuxiao Li
Main category: cs.LG
TL;DR: The paper investigates whether incorporating variational methods into Sparse Autoencoders (SAEs) improves feature organization and interpretability, but finds that naive application actually degrades performance.
Details
Motivation: To explore if variational methods can enhance SAEs by improving feature organization and interpretability through probabilistic sampling that creates dispersive pressure in latent space.
Method: Introduces Variational Sparse Autoencoder (vSAE) that replaces deterministic ReLU gating with stochastic sampling from Gaussian posteriors and adds KL divergence regularization toward standard normal prior. Evaluated against standard TopK SAE on Pythia-70M transformer activations using SAE Bench, feature interpretability analysis, and t-SNE visualization.
Result: vSAE underperforms standard SAE across core metrics, though excels at feature independence and ablation metrics. KL divergence causes excessive regularization that substantially reduces living features, leading to performance degradation. vSAE features show improved robustness but many more dead features.
Conclusion: Naive application of variational methods to SAEs does not improve feature organization or interpretability, as the KL divergence term creates problematic regularization pressure.
Abstract: Sparse Autoencoders (SAEs) have emerged as a promising approach for interpreting neural network representations by learning sparse, human-interpretable features from dense activations. We investigate whether incorporating variational methods into SAE architectures can improve feature organization and interpretability. We introduce the Variational Sparse Autoencoder (vSAE), which replaces deterministic ReLU gating with stochastic sampling from learned Gaussian posteriors and incorporates KL divergence regularization toward a standard normal prior. Our hypothesis is that this probabilistic sampling creates dispersive pressure, causing features to organize more coherently in the latent space while avoiding overlap. We evaluate a TopK vSAE against a standard TopK SAE on Pythia-70M transformer residual stream activations using comprehensive benchmarks including SAE Bench, individual feature interpretability analysis, and global latent space visualization through t-SNE. The vSAE underperforms standard SAE across core evaluation metrics, though excels at feature independence and ablation metrics. The KL divergence term creates excessive regularization pressure that substantially reduces the fraction of living features, leading to observed performance degradation. While vSAE features demonstrate improved robustness, they exhibit many more dead features than baseline. Our findings suggest that naive application of variational methods to SAEs does not improve feature organization or interpretability.
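A minimal sketch of the vSAE forward pass as described above: a Gaussian posterior per latent, a reparameterized sample in place of deterministic gating, TopK for sparsity, and a KL term toward N(0, I). Dimensions and the exact placement of the TopK operation are assumptions.

```python
import torch
import torch.nn as nn

class VariationalTopKSAE(nn.Module):
    """Sketch of the vSAE idea, not the paper's exact architecture."""
    def __init__(self, d_model, d_latent, k=32):
        super().__init__()
        self.mu = nn.Linear(d_model, d_latent)
        self.logvar = nn.Linear(d_model, d_latent)
        self.dec = nn.Linear(d_latent, d_model)
        self.k = k

    def forward(self, x):
        mu, logvar = self.mu(x), self.logvar(x)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        topk = torch.topk(z.abs(), self.k, dim=-1)
        sparse_z = torch.zeros_like(z).scatter(-1, topk.indices,
                                               z.gather(-1, topk.indices))
        recon = self.dec(sparse_z)
        kl = 0.5 * (mu**2 + logvar.exp() - 1 - logvar).sum(-1).mean()
        return recon, kl  # total loss: (recon - x).pow(2).mean() + beta * kl
```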
[583] Beyond Outliers: A Study of Optimizers Under Quantization
Georgios Vlassis, Saleh Ashkboos, Alexandra Volkova, Torsten Hoefler, Dan Alistarh
Main category: cs.LG
TL;DR: This paper systematically studies how optimizer choice affects model robustness under quantization, finding that Shampoo optimizer performs best for quantization-aware training and achieves highest parameter efficiency.
Details
Motivation: There is limited systematic evidence on how optimizer choice interacts with model quantization, despite both areas seeing significant progress. The research aims to fill this gap by studying optimizer-quantization interactions.
Method: Trained full-precision models (50M-1.5B parameters) with six optimizers, applied post-training quantization (PTQ) and quantization-aware training (QAT), analyzed outlier metrics, and derived scaling laws for QAT under different optimizers.
Result: Found that outlier metrics like MMR and Kurtosis fail to predict PTQ performance across optimizers. In QAT, Shampoo optimizer showed lowest accuracy degradation and highest parameter efficiency. Optimizers performing well in original pretraining may not remain optimal under QAT.
Conclusion: Optimizer choice significantly impacts model robustness under quantization, with Shampoo emerging as the most effective optimizer for quantization-aware training and achieving superior parameter efficiency.
Abstract: As new optimizers gain traction and model quantization becomes standard for efficient deployment, a key question arises: how does the choice of optimizer affect model performance in the presence of quantization? Despite progress in both areas, systematic evidence on optimizer-quantization interactions remains limited. To fill this gap, we study the impact of optimizer choice on model robustness under quantization, considering both post-training quantization (PTQ), and quantization-aware training (QAT). We first train full-precision models, ranging from 50M to 1.5B parameters, with six optimizers, to explore the hyperparameter landscape, and establish well-tuned baselines. We then apply PTQ to evaluate how model performance degrades when trained with different optimizers. We find that outlier-related metrics, such as the max-to-mean ratio (MMR) and Kurtosis, fail to predict the PTQ performance across different optimizers. We show analytically that this is due to the MMR capturing only isolated layer errors, while ignoring how quantization errors accumulate and propagate through the network. To study the QAT degradation, we train quantized models from scratch and compare them to our original-precision baselines. We find that optimizers performing well in the original pretraining setup may not remain optimal under QAT, and that models trained with Shampoo show the lowest accuracy degradation. Finally, we derive scaling laws for quantization-aware training under different optimizers, showing that Shampoo achieves the highest parameter efficiency of all tested optimizers.
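The two outlier statistics named above are cheap to compute per weight tensor; here is a minimal version using their common definitions (the paper's exact normalization may differ):

```python
import torch

def outlier_metrics(weight: torch.Tensor):
    """Max-to-mean ratio (MMR) of absolute values and excess-free kurtosis
    for one weight tensor (common definitions)."""
    w = weight.detach().flatten().float()
    a = w.abs()
    mmr = (a.max() / a.mean()).item()
    centered = w - w.mean()
    kurtosis = (centered.pow(4).mean() / centered.pow(2).mean() ** 2).item()
    return {"mmr": mmr, "kurtosis": kurtosis}

# e.g. {name: outlier_metrics(p) for name, p in model.named_parameters()
#       if p.dim() == 2}
```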
[584] Sketching Low-Rank Plus Diagonal Matrices
Andres Fernandez, Felix Dangel, Philipp Hennig, Frank Schneider
Main category: cs.LG
TL;DR: SKETCHLORD is a method that simultaneously estimates both low-rank and diagonal components of linear operators, outperforming sequential approaches and providing high-fidelity approximations for large-scale operators.
Details
Motivation: Many machine learning tasks involve high-dimensional linear operators that are costly to evaluate. Existing sketched methods can only construct either low-rank or diagonal approximations, but many real-world operators have both structures (Low-Rank plus Diagonal).
Method: SKETCHLORD jointly estimates low-rank and diagonal components through convex optimization, using few matrix-vector products to construct approximations for LoRD operators.
Result: Theoretical and empirical results show SKETCHLORD’s joint estimation is superior to sequential variants. Comprehensive experiments on synthetic LoRD matrices confirm accurate structure recovery.
Conclusion: SKETCHLORD provides a valuable addition to structured approximation methods, particularly for high-fidelity approximations of large-scale operators like deep learning Hessians.
Abstract: Many relevant machine learning and scientific computing tasks involve high-dimensional linear operators accessible only via costly matrix-vector products. In this context, recent advances in sketched methods have enabled the construction of either low-rank or diagonal approximations from few matrix-vector products. This provides great speedup and scalability, but approximation errors arise due to the assumed simpler structure. This work introduces SKETCHLORD, a method that simultaneously estimates both low-rank and diagonal components, targeting the broader class of Low-Rank plus Diagonal (LoRD) linear operators. We demonstrate theoretically and empirically that this joint estimation is also superior to any sequential variant (diagonal-then-low-rank or low-rank-then-diagonal). Then, we cast SKETCHLORD as a convex optimization problem, leading to a scalable algorithm. Comprehensive experiments on synthetic (approximate) LoRD matrices confirm SKETCHLORD’s performance in accurately recovering these structures. This positions it as a valuable addition to the structured approximation toolkit, particularly when high-fidelity approximations are desired for large-scale operators, such as the deep learning Hessian.
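For intuition about the LoRD target, here is a naive alternating recovery from a matvec oracle alone. SKETCHLORD itself solves a joint convex problem rather than this alternation, so treat the sketch purely as an illustration of the structure being estimated.

```python
import numpy as np

def sketch_lord(matvec, n, rank=10, n_probe=50, iters=15, seed=0):
    """Alternating recovery of A ~= L + diag(d) (L low-rank) from a matvec
    oracle. Illustration only; not the paper's joint convex algorithm."""
    rng = np.random.default_rng(seed)
    Omega = rng.choice([-1.0, 1.0], size=(n, n_probe))   # Rademacher probes
    Y = np.column_stack([matvec(Omega[:, j]) for j in range(n_probe)])
    pinv_O = np.linalg.pinv(Omega)
    d = np.zeros(n)
    for _ in range(iters):
        R = Y - d[:, None] * Omega                       # sketch of L @ Omega
        U, s, Vt = np.linalg.svd(R, full_matrices=False)
        U = U[:, :rank]
        C = U.T @ R @ pinv_O                             # L ~= U @ C (factors)
        resid = Y - U @ (C @ Omega)                      # ~ diag(d) @ Omega
        d = (resid * Omega).sum(1) / (Omega**2).sum(1)   # row-wise least squares
    return U, C, d
```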
[585] On Predictability of Reinforcement Learning Dynamics for Large Language Models
Yuchen Cai, Ding Cao, Xin Xu, Zijun Yao, Yuqing Huang, Zhenyu Tan, Benyi Zhang, Guiquan Liu, Junfeng Fang
Main category: cs.LG
TL;DR: RL training in LLMs shows rank-1 dominant parameter updates that evolve linearly, enabling AlphaRL framework for 2.5x acceleration while maintaining >96% reasoning performance.
Details
Motivation: To understand the underlying parameter dynamics during RL training of LLMs, which remain poorly understood despite recent advances in reasoning capabilities.
Method: Identified two fundamental properties: Rank-1 Dominance (top singular subspace determines reasoning improvements) and Rank-1 Linear Dynamics (dominant subspace evolves linearly). Proposed AlphaRL framework that extrapolates final parameter updates using early training data.
Result: Validated across 8 LLMs and 7 algorithms, recovering over 99% of performance gains. AlphaRL achieved up to 2.5x speedup while retaining >96% of reasoning performance without extra modules or hyperparameter tuning.
Conclusion: These findings provide a versatile and practical tool for large-scale RL, opening a path toward a principled, interpretable, and efficient training paradigm for LLMs.
Abstract: Recent advances in reasoning capabilities of large language models (LLMs) are largely driven by reinforcement learning (RL), yet the underlying parameter dynamics during RL training remain poorly understood. This work identifies two fundamental properties of RL-induced parameter updates in LLMs: (1) Rank-1 Dominance, where the top singular subspace of the parameter update matrix nearly fully determines reasoning improvements, recovering over 99% of performance gains; and (2) Rank-1 Linear Dynamics, where this dominant subspace evolves linearly throughout training, enabling accurate prediction from early checkpoints. Extensive experiments across 8 LLMs and 7 algorithms validate the generalizability of these properties. More importantly, based on these findings, we propose AlphaRL, a plug-in acceleration framework that extrapolates the final parameter update using a short early training window, achieving up to 2.5× speedup while retaining >96% of reasoning performance without extra modules or hyperparameter tuning. This positions our finding as a versatile and practical tool for large-scale RL, opening a path toward a principled, interpretable, and efficient training paradigm for LLMs.
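The two properties suggest a very short recipe, sketched below per weight matrix: take the update over an early window, keep its dominant singular direction, and scale it linearly to the final step. The per-matrix treatment and the linear scaling rule here are simplifications, not AlphaRL's exact procedure.

```python
import numpy as np

def rank1_extrapolate(W0, W_early, t_early, t_final):
    """Keep the dominant rank-1 component of the early update and scale it
    linearly to the final step (Rank-1 Linear Dynamics, simplified)."""
    delta = W_early - W0
    U, s, Vt = np.linalg.svd(delta, full_matrices=False)
    rank1 = s[0] * np.outer(U[:, 0], Vt[0])        # dominant singular direction
    return W0 + (t_final / t_early) * rank1        # linear extrapolation
```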
[586] A Family of Kernelized Matrix Costs for Multiple-Output Mixture Neural Networks
Bo Hu, José C. Príncipe
Main category: cs.LG
TL;DR: This paper combines Mixture Density Networks with contrastive learning costs using four kernelized matrix cost functions for data density approximation.
Details
Motivation: To improve self-supervised and contrastive feature learning by integrating pairwise distance-based costs with mixture density modeling for better data density approximation.
Method: Combines Mixture Density Networks with contrastive costs using four kernelized matrix cost types: scalar cost, vector-matrix cost, matrix-matrix cost (trace of Schur complement), and SVD cost (nuclear norm).
Result: Proposes a framework for learning multiple centers needed to define mixture densities through contrastive cost functions.
Conclusion: The integration of MDNs with contrastive costs using kernelized matrix functions provides an effective approach for data density approximation in feature learning.
Abstract: Pairwise distance-based costs are crucial for self-supervised and contrastive feature learning. Mixture Density Networks (MDNs) are a widely used approach for generative models and density approximation, using neural networks to produce multiple centers that define a Gaussian mixture. By combining MDNs with contrastive costs, this paper proposes data density approximation using four types of kernelized matrix costs: the scalar cost, the vector-matrix cost, the matrix-matrix cost (the trace of Schur complement), and the SVD cost (the nuclear norm), for learning multiple centers required to define a mixture density.
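To ground two of the four costs, here is one plausible construction over Gaussian kernel matrices between data points and mixture centers. The paper's exact matrices and normalizations are not specified in the summary, so this is only an illustration.

```python
import torch

def gaussian_kernel(A, B, sigma=1.0):
    return torch.exp(-torch.cdist(A, B) ** 2 / (2 * sigma**2))

def matrix_matrix_cost(X, C, sigma=1.0, eps=1e-5):
    """Trace of the Schur complement of K_cc in the joint kernel matrix
    over data X and centers C (one plausible reading of the cost)."""
    Kxx = gaussian_kernel(X, X, sigma)
    Kxc = gaussian_kernel(X, C, sigma)
    Kcc = gaussian_kernel(C, C, sigma) + eps * torch.eye(len(C))
    schur = Kxx - Kxc @ torch.linalg.solve(Kcc, Kxc.T)
    return schur.diagonal().sum()

def svd_cost(X, C, sigma=1.0):
    """Nuclear-norm (SVD) cost on the data-to-center cross-kernel matrix."""
    return torch.linalg.matrix_norm(gaussian_kernel(X, C, sigma), ord="nuc")
```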
[587] Neural Diffusion Processes for Physically Interpretable Survival Prediction
Alessio Cristofoletto, Cesare Rollo, Giovanni Birolo, Piero Fariselli
Main category: cs.LG
TL;DR: DeepFHT is a survival analysis framework combining deep neural networks with first hitting time distributions from stochastic processes, representing time-to-event as diffusion process absorption with interpretable physics-based parameters.
Details
Motivation: To create a survival analysis method that maintains interpretability while capturing time-varying risk without assuming proportional hazards, bridging stochastic process theory with deep learning.
Method: Couples neural networks with first hitting time distributions, mapping inputs to diffusion process parameters (initial condition, drift, diffusion) and using closed-form survival/hazard functions from Brownian motion processes.
Result: Achieves predictive accuracy comparable to state-of-the-art approaches on synthetic and real-world datasets while providing interpretable parameterization that elucidates feature-risk relationships.
Conclusion: The framework successfully combines stochastic process theory with deep learning, offering a principled approach for modeling survival phenomena in complex systems with both accuracy and interpretability.
Abstract: We introduce DeepFHT, a survival-analysis framework that couples deep neural networks with first hitting time (FHT) distributions from stochastic process theory. Time to event is represented as the first passage of a latent diffusion process to an absorbing boundary. A neural network maps input variables to physically meaningful parameters including initial condition, drift, and diffusion, within a chosen FHT process such as Brownian motion, with or without drift. This yields closed-form survival and hazard functions and captures time-varying risk without assuming proportional hazards. We compare DeepFHT with the Cox survival model using synthetic and real-world datasets. The method achieves predictive accuracy on par with state-of-the-art approaches, while maintaining a physics-based interpretable parameterization that elucidates the relation between input features and risk. This combination of stochastic process theory and deep learning provides a principled avenue for modeling survival phenomena in complex systems.
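For Brownian motion with drift, the first-hitting-time distribution is the classical inverse Gaussian, so the survival function that the network's outputs feed into has a closed form. A sketch of that formula (in a DeepFHT-style model, a network would map input features to x0, mu, and sigma; the expression below is the standard textbook result, not code from the paper):

```python
import torch
from torch.distributions import Normal

def fht_survival(t, x0, mu, sigma):
    """Survival P(T > t) for the first hitting time of an absorbing
    boundary at 0 by X_t = x0 + mu*t + sigma*W_t with x0 > 0 (inverse
    Gaussian FHT). All arguments are tensors."""
    Phi = Normal(0.0, 1.0).cdf
    st = sigma * t.sqrt()
    hit_cdf = (Phi((-mu * t - x0) / st)
               + torch.exp(-2 * mu * x0 / sigma**2) * Phi((mu * t - x0) / st))
    return 1.0 - hit_cdf  # hazard follows from -d/dt log survival
```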
[588] NeuroTTT: Bridging Pretraining-Downstream Task Misalignment in EEG Foundation Models via Test-Time Training
Suli Wang, Yangshen Deng, Zhenghua Bao, Xinyu Zhan, Yiqun Duan
Main category: cs.LG
TL;DR: A two-stage alignment strategy for EEG foundation models that combines domain-specific self-supervised fine-tuning (NeuroTTT) with test-time training to address misalignment between pretraining and downstream tasks, achieving state-of-the-art performance across diverse BCI applications.
Details
Motivation: Large-scale EEG foundation models suffer from misalignment between pretraining objectives and downstream tasks, as well as significant cross-subject distribution shifts, limiting their generalization in brain-computer interface applications.
Method: Two-stage approach: 1) NeuroTTT - domain-specific self-supervised fine-tuning that aligns latent representations to spectral, spatial, and temporal EEG features without labeled data; 2) Test-time training with self-supervised updates and prediction entropy minimization (Tent) that adapts normalization statistics to individual test samples.
Result: Substantially improved robustness and accuracy across diverse BCI tasks (imagined speech, stress detection, motor imagery), pushing CBraMod and LaBraM foundation models to markedly higher performance levels and achieving state-of-the-art results.
Conclusion: The proposed alignment strategy successfully bridges the gap between generic pretraining and specific EEG decoding tasks, demonstrating that unifying domain-tuned self-supervision with test-time training effectively addresses misalignment and distribution shift challenges in EEG foundation models.
Abstract: Large-scale foundation models for EEG signals offer a promising path to generalizable brain-computer interface (BCI) applications, but they often suffer from misalignment between pretraining objectives and downstream tasks, as well as significant cross-subject distribution shifts. This paper addresses these challenges by introducing a two-stage alignment strategy that bridges the gap between generic pretraining and specific EEG decoding tasks. First, we propose NeuroTTT: a domain-specific self-supervised fine-tuning paradigm that augments the foundation model with task-relevant self-supervised objectives, aligning latent representations to important spectral, spatial, and temporal EEG features without requiring additional labeled data. Second, we incorporate test-time training (TTT) at inference: we perform (i) self-supervised test-time training on individual unlabeled test samples and (ii) prediction entropy minimization (Tent), which updates only normalization statistics to continually calibrate the model to each new input on the fly. Our approach, which, to our knowledge, is the first to unify domain-tuned self-supervision with test-time training in large-scale EEG foundation models, yields substantially improved robustness and accuracy across diverse BCI tasks (imagined speech, stress detection, motor imagery). Using CBraMod and LaBraM as backbones, our method pushes their performance to a markedly higher level. Results on three diverse tasks demonstrate that the proposed alignment strategy achieves state-of-the-art performance, outperforming conventional fine-tuning and adaptation methods. Our code is available at https://github.com/wsl2000/NeuroTTT.
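The second-stage Tent update is compact enough to sketch: minimize prediction entropy on each incoming test batch while optimizing only normalization-layer affine parameters. The learning rate and the parameter selection shown in the comment are illustrative, not the paper's settings.

```python
import torch

def tent_step(model, x, optimizer):
    """One Tent-style adaptation step: entropy of the batch predictions is
    minimized, with the optimizer configured over normalization-layer
    affine parameters only."""
    logits = model(x)
    probs = logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(-1).mean()
    optimizer.zero_grad()
    entropy.backward()
    optimizer.step()
    return entropy.item()

# collect only normalization affine params for the optimizer, e.g.:
# params = [p for m in model.modules()
#           if isinstance(m, (torch.nn.BatchNorm1d, torch.nn.LayerNorm))
#           for p in m.parameters()]
# optimizer = torch.optim.SGD(params, lr=1e-4)
```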
[589] Large Language Models Inference Engines based on Spiking Neural Networks
Adarsha Balaji, Sandeep Madireddy
Main category: cs.LG
TL;DR: The paper proposes NeurTransformer, a methodology for designing transformer-based spiking neural networks (SNNs) that addresses the computational inefficiency of existing SNN training methods and conversion techniques, achieving significant energy savings while maintaining performance.
Details
Motivation: Transformer models have quadratic time and space complexity with input sequence length, making them computationally expensive. Existing SNN training methods are inefficient, and conversion techniques require many spike time-steps, increasing latency.
Method: Proposes NeurTransformer with three steps: (1) replace self-attention with spike-based self-attention (SSA), (2) convert feed-forward blocks to SNN equivalents, and (3) fine-tune SSA using SNN surrogate learning algorithms.
Result: Converted GPT-2 small models show 5-12% loss in cosine similarity and 9.7% reduction in perplexity. SSA blocks achieve 64.71%-85.28% reduction in estimated energy consumption compared to standard self-attention on digital hardware.
Conclusion: NeurTransformer provides an effective methodology for creating energy-efficient transformer-based SNNs that maintain reasonable performance while significantly reducing computational costs.
Abstract: Foundational models based on the transformer architecture are currently the state-of-the-art in general language modeling, as well as in scientific areas such as material science and climate. However, training and deploying these models is computationally challenging as the time and space complexity has a quadratic relation to the input sequence length. Several efforts exploring efficient computational paradigms and model architectures to address these limitations have been made. In this work, we explore spiking neural networks (SNNs) to design transformer models. A challenge is that training large-scale SNNs with existing surrogate learning methods is inefficient and time-consuming. On the other hand, techniques to convert existing transformer-based models to their SNN equivalent are not scalable, as achieving optimal performance comes at the cost of a large number of spike time-steps, i.e. increased latency. To address this, we propose NeurTransformer, a methodology for designing transformer-based SNN for inference using a supervised fine-tuning approach with existing conversion methods. The proposed methodology works by: (1) replacing the self-attention mechanism with a spike-based self-attention (SSA), (2) converting the feed-forward block of the trained transformer model to its equivalent SNN, and (3) fine-tuning the SSA block using SNN-based surrogate learning algorithms. We benchmark the proposed methodology and demonstrate its accuracy and scalability using three variants of the GPT-2 model of increasing model size. We observe that the converted GPT-2 small models demonstrate a 5-12% loss in cosine similarity and a 9.7% reduction in perplexity. Finally, we demonstrate the energy efficiency of the SSA block compared to the ASA block and show between 64.71% and 85.28% reductions in estimated energy consumption when implementing the self-attention mechanism on digital hardware.
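A rough sketch of a spike-based self-attention block in the spirit described above: Q, K, and V are binarized into spikes and the attention scores are used without softmax, since products of spikes are already non-negative. This follows common spiking-transformer designs and is not the paper's exact SSA block.

```python
import torch
import torch.nn as nn

class SpikeSelfAttention(nn.Module):
    """Simplified spike-based self-attention (sketch, inference-style)."""
    def __init__(self, d, heads=8, thresh=0.5, scale=0.125):
        super().__init__()
        self.qkv = nn.Linear(d, 3 * d)
        self.proj = nn.Linear(d, d)
        self.h, self.thresh, self.scale = heads, thresh, scale

    def spike(self, x):            # hard threshold; surrogate grads in training
        return (x > self.thresh).float()

    def forward(self, x):          # x: (batch, tokens, d)
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (self.spike(z).view(b, t, self.h, d // self.h).transpose(1, 2)
                   for z in (q, k, v))
        attn = (q @ k.transpose(-2, -1)) * self.scale   # non-negative, no softmax
        out = (attn @ v).transpose(1, 2).reshape(b, t, d)
        return self.proj(out)
```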
[590] Comparison of Machine Learning Models to Classify Documents on Digital Development
Uvini Ranaweera, Bawun Mawitagama, Sanduni Liyanage, Sandupa Keshan, Tiloka de Silva, Supun Hewawalpita
Main category: cs.LG
TL;DR: This paper investigates automated document classification for digital development interventions using multiple ML algorithms and a One vs Rest approach to optimize performance on an imbalanced dataset.
Details
Motivation: The growth of digital databases and the need for effective document classification in the emerging field of digital development interventions, where traditional single-model approaches may not perform well across different contexts.
Method: Used multiple ML algorithms (Decision Trees, k-NN, SVM, AdaBoost, SGD, Naive Bayes, Logistic Regression) with oversampling for class imbalance, and employed One vs Rest approach instead of traditional multiclass classification.
Result: Found that data quantity alone doesn’t determine performance; class similarity and dissimilarity are crucial factors. The combined One vs Rest model optimized classification performance.
Conclusion: The study demonstrates the importance of considering class characteristics beyond just data volume, and shows that a combined One vs Rest approach can effectively handle multiclass classification in digital development document categorization.
Abstract: Automated document classification is a trending topic in Natural Language Processing (NLP) due to the extensive growth in digital databases. However, a model that fits well for a specific classification task might perform weakly for another dataset due to differences in the context. Thus, training and evaluating several models is necessary to optimise the results. This study employs a publicly available document database on worldwide digital development interventions categorised under twelve areas. Since digital interventions are still emerging, utilising NLP in the field is relatively new. Given the exponential growth of digital interventions, this research has a vast scope for improving how digital-development-oriented organisations report their work. The paper examines the classification performance of Machine Learning (ML) algorithms, including Decision Trees, k-Nearest Neighbors, Support Vector Machine, AdaBoost, Stochastic Gradient Descent, Naive Bayes, and Logistic Regression. Accuracy, precision, recall and F1-score are utilised to evaluate the performance of these models, while oversampling is used to address the class-imbalanced nature of the dataset. Deviating from the traditional approach of fitting a single model for multiclass classification, this paper investigates the One vs Rest approach to build a combined model that optimises the performance. The study concludes that the amount of data is not the sole factor affecting the performance; features like similarity within classes and dissimilarity among classes are also crucial.
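The One vs Rest recipe with oversampling maps directly onto standard tooling. A minimal sketch with TF-IDF features and a logistic-regression base learner (one of the models compared); the docs and labels below are placeholder data standing in for the digital-development corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from imblearn.over_sampling import RandomOverSampler

# placeholder corpus; the real dataset has twelve intervention areas
docs = ["mobile money pilot report", "e-health registry rollout",
        "digital ID enrollment study"] * 50
labels = ["finance", "health", "governance"] * 50

X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, stratify=labels, test_size=0.2, random_state=0)

vec = TfidfVectorizer(max_features=20000, ngram_range=(1, 2))
Xtr = vec.fit_transform(X_train)

# oversample minority classes, then fit one binary classifier per class
Xtr, y_tr = RandomOverSampler(random_state=0).fit_resample(Xtr, y_train)
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(Xtr, y_tr)

print(classification_report(y_test, clf.predict(vec.transform(X_test))))
```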
[591] How Foundational are Foundation Models for Time Series Forecasting?
Nouha Karaouli, Denis Coquenet, Elisa Fromont, Martial Mermillod, Marina Reyboz
Main category: cs.LG
TL;DR: Time series foundation models have limited zero-shot capabilities tied to their pretraining domains and don’t provide substantial benefits over smaller dedicated models for forecasting tasks.
Details
Motivation: To examine whether foundation models, which work well for language and vision, are equally effective for time series data given its inherent diversity.
Method: Used forecasting as downstream task to evaluate zero-shot capabilities and fine-tuning performance of time series foundation models compared to smaller dedicated models.
Result: Zero-shot capabilities are domain-dependent, and fine-tuned foundation models don’t consistently outperform smaller dedicated models despite higher parameter count and memory usage.
Conclusion: Time series data’s diversity makes foundation models less suitable compared to language/vision domains, and dedicated models remain more practical for forecasting tasks.
Abstract: Foundation Models are designed to serve as versatile embedding machines, with strong zero-shot capabilities and superior generalization performance when fine-tuned on diverse downstream tasks. While this is largely true for language and vision foundation models, we argue that the inherent diversity of time series data makes them less suited for building effective foundation models. We demonstrate this using forecasting as our downstream task. We show that the zero-shot capabilities of a time series foundation model are significantly influenced and tied to the specific domains it has been pretrained on. Furthermore, when applied to unseen real-world time series data, fine-tuned foundation models do not consistently yield substantially better results, relative to their increased parameter count and memory footprint, than smaller, dedicated models tailored to the specific forecasting task at hand.
[592] Downgrade to Upgrade: Optimizer Simplification Enhances Robustness in LLM Unlearning
Yicheng Lang, Yihua Zhang, Chongyu Fan, Changsheng Wang, Jinghan Jia, Sijia Liu
Main category: cs.LG
TL;DR: LLM unlearning effects are fragile and can be neutralized by post-processing. This paper shows that using lower-grade optimizers (zeroth-order or compressed-gradient) improves robustness by converging to harder-to-disturb loss landscape basins. A hybrid optimizer combining first-order and zeroth-order updates is proposed.
Details
Motivation: Address the fragility of LLM unlearning effects that can be easily neutralized by post-processing manipulations like weight quantization or fine-tuning, and improve unlearning robustness independent of specific unlearning objectives.
Method: Investigate optimizer grade’s role in unlearning robustness, propose using downgraded optimizers (zeroth-order or compressed-gradient variants), and develop a hybrid optimizer combining first-order and zeroth-order updates.
Result: Downgraded optimizers produce more robust unlearning by converging to harder-to-disturb loss landscape basins. The proposed hybrid optimizer achieves resilient forgetting without sacrificing unlearning quality on MUSE and WMDP benchmarks.
Conclusion: Optimizer grade significantly impacts unlearning robustness, with lower-grade optimizers providing better resistance to post-training perturbations while maintaining unlearning efficacy.
Abstract: Large language model (LLM) unlearning aims to surgically remove the influence of undesired data or knowledge from an existing model while preserving its utility on unrelated tasks. This paradigm has shown promise in addressing privacy and safety concerns. However, recent findings reveal that unlearning effects are often fragile: post-unlearning manipulations such as weight quantization or fine-tuning can quickly neutralize the intended forgetting. Prior efforts to improve robustness primarily reformulate unlearning objectives by explicitly assuming the role of vulnerability sources. In this work, we take a different perspective by investigating the role of the optimizer, independent of unlearning objectives and formulations, in shaping unlearning robustness. We show that the ‘grade’ of the optimizer, defined by the level of information it exploits, ranging from zeroth-order (gradient-free) to first-order (gradient-based) to second-order (Hessian-based), is tightly linked to the resilience of unlearning. Surprisingly, we find that downgrading the optimizer, such as using zeroth-order methods or compressed-gradient variants (e.g., gradient sign-based optimizers), often leads to stronger robustness. While these optimizers produce noisier and less precise updates, they encourage convergence to harder-to-disturb basins in the loss landscape, thereby resisting post-training perturbations. By connecting zeroth-order methods with randomized smoothing, we further highlight their natural advantage for robust unlearning. Motivated by these insights, we propose a hybrid optimizer that combines first-order and zeroth-order updates, preserving unlearning efficacy while enhancing robustness. Extensive experiments on the MUSE and WMDP benchmarks, across multiple LLM unlearning algorithms, validate that our approach achieves more resilient forgetting without sacrificing unlearning quality.
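As a rough illustration of the optimizer-downgrading idea, the sketch below mixes an exact gradient with a two-point zeroth-order estimate on a toy quadratic; the mixing weight `alpha` and smoothing radius `mu` are illustrative assumptions, not the paper's tuned hybrid optimizer.

```python
# Hedged sketch of a hybrid first-/zeroth-order update on a toy quadratic.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
b = rng.standard_normal(20)

def loss(w):              # f(w) = 0.5 * ||Aw - b||^2
    return 0.5 * np.sum((A @ w - b) ** 2)

def grad(w):              # exact first-order gradient
    return A.T @ (A @ w - b)

def zo_grad(w, mu=1e-3):  # two-point random-direction estimator
    u = rng.standard_normal(w.shape)
    return (loss(w + mu * u) - loss(w - mu * u)) / (2 * mu) * u

w, lr, alpha = np.zeros(10), 1e-2, 0.5
for _ in range(500):
    w -= lr * (alpha * grad(w) + (1 - alpha) * zo_grad(w))
print("final loss:", loss(w))
```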
[593] Are Time Series Foundation Models Susceptible to Catastrophic Forgetting?
Nouha Karaouli, Denis Coquenet, Elisa Fromont, Martial Mermillod, Marina Reyboz
Main category: cs.LG
TL;DR: TSFMs suffer from catastrophic forgetting when fine-tuned sequentially on multiple datasets, showing a stability-plasticity trade-off.
Details
Motivation: To investigate the robustness of Time Series Foundation Models (TSFMs) to continual adaptation and explore their susceptibility to catastrophic forgetting.
Method: Using synthetic datasets with varying periodic structures, measuring the trade-off between adaptation to new data and retention of prior knowledge through sequential fine-tuning.
Result: Fine-tuning improves performance on new tasks but causes significant degradation on previously learned ones, demonstrating catastrophic forgetting.
Conclusion: TSFMs face a fundamental stability-plasticity dilemma in continual adaptation scenarios.
Abstract: Time Series Foundation Models (TSFMs) have shown promising zero-shot generalization across diverse forecasting tasks. However, their robustness to continual adaptation remains underexplored. In this work, we investigate the extent to which TSFMs suffer from catastrophic forgetting when fine-tuned sequentially on multiple datasets. Using synthetic datasets designed with varying degrees of periodic structure, we measure the trade-off between adaptation to new data and retention of prior knowledge. Our experiments reveal that, while fine-tuning improves performance on new tasks, it often causes significant degradation on previously learned ones, illustrating a fundamental stability-plasticity dilemma.
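One standard way to quantify the forgetting described above is the gap between a task's best performance during sequential fine-tuning and its performance after the final task; this measure is an assumption here (the paper may report a different one), and the scores below are illustrative numbers.

```python
# Per-task forgetting from a matrix of sequential fine-tuning results.
import numpy as np

# acc[i, j] = performance on task j after fine-tuning on task i (illustrative)
acc = np.array([[0.90, 0.10, 0.05],
                [0.55, 0.88, 0.12],
                [0.40, 0.60, 0.91]])

# forgetting for each earlier task: best score seen so far minus final score
forgetting = acc[:-1].max(axis=0)[:-1] - acc[-1, :-1]
print("per-task forgetting:", forgetting)   # [0.50, 0.28]
```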
cs.MA
[594] LLM-based Multi-Agent Blackboard System for Information Discovery in Data Science
Alireza Salemi, Mihir Parmar, Palash Goyal, Yiwen Song, Jinsung Yoon, Hamed Zamani, Hamid Palangi, Tomas Pfister
Main category: cs.MA
TL;DR: Proposes a blackboard-based multi-agent communication paradigm for data discovery in large data lakes, eliminating the need for central coordination and achieving significant performance improvements over existing methods.
Details
Motivation: Existing methods struggle with data discovery in large heterogeneous data lakes: single-agent systems are overwhelmed, while master-slave multi-agent systems require precise knowledge of sub-agent capabilities.
Method: A multi-agent communication framework inspired by the blackboard architecture, where a central agent posts requests to a shared blackboard and autonomous subordinate agents volunteer to respond based on their capabilities.
Result: Substantially outperforms baselines (RAG and master-slave paradigm) with 13-57% relative improvement in task success and up to 9% relative gain in F1 score for data discovery across both proprietary and open-source LLMs.
Conclusion: The blackboard paradigm establishes a scalable and generalizable communication framework for multi-agent systems in data discovery tasks.
Abstract: The rapid advancement of Large Language Models (LLMs) has opened new opportunities in data science, yet their practical deployment is often constrained by the challenge of discovering relevant data within large heterogeneous data lakes. Existing methods struggle with this: single-agent systems are quickly overwhelmed by the large, heterogeneous files in such data lakes, while multi-agent systems designed around a master-slave paradigm depend on a rigid central controller for task allocation that requires precise knowledge of each sub-agent’s capabilities. To address these limitations, we propose a novel multi-agent communication paradigm inspired by the blackboard architecture for traditional AI models. In this framework, a central agent posts requests to a shared blackboard, and autonomous subordinate agents, either responsible for a partition of the data lake or for general information retrieval, volunteer to respond based on their capabilities. This design improves scalability and flexibility by eliminating the need for a central coordinator to have prior knowledge of all sub-agents’ expertise. We evaluate our method on three benchmarks that require explicit data discovery: KramaBench and modified versions of DS-Bench and DA-Code that incorporate data discovery. Experimental results demonstrate that the blackboard architecture substantially outperforms baselines, including RAG and the master-slave multi-agent paradigm, achieving 13% to 57% relative improvement in end-to-end task success and up to a 9% relative gain in F1 score for data discovery over the best-performing baselines across both proprietary and open-source LLMs. Our findings establish the blackboard paradigm as a scalable and generalizable communication framework for multi-agent systems.
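A minimal sketch of the blackboard pattern, with illustrative class names rather than the paper's implementation: the central agent posts a request, and partition agents volunteer only when it matches their local data.

```python
# Toy blackboard: agents inspect a shared request and volunteer responses.
from dataclasses import dataclass, field

@dataclass
class Blackboard:
    request: str = ""
    responses: list = field(default_factory=list)

class PartitionAgent:
    def __init__(self, name, tables):
        self.name, self.tables = name, set(tables)

    def maybe_respond(self, board):
        hits = [t for t in self.tables if t in board.request]
        if hits:  # volunteer only when the request matches local data
            board.responses.append((self.name, hits))

board = Blackboard(request="find sales and weather tables for Q3 analysis")
agents = [PartitionAgent("lake-A", ["sales", "inventory"]),
          PartitionAgent("lake-B", ["weather"])]
for agent in agents:
    agent.maybe_respond(board)
print(board.responses)  # [('lake-A', ['sales']), ('lake-B', ['weather'])]
```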
[595] SimCity: Multi-Agent Urban Development Simulation with Rich Interactions
Yeqi Feng, Yucheng Lu, Hongyu Su, Tianxing He
Main category: cs.MA
TL;DR: SimCity is a multi-agent framework using LLMs to create realistic macroeconomic simulations with heterogeneous agents and transparent natural-language reasoning, reproducing canonical economic phenomena.
Details
Motivation: To overcome limitations of classical equilibrium models (limited heterogeneity) and traditional agent-based models (hand-crafted rules) by leveraging LLMs for flexible, adaptive behavior in macroeconomic simulations.
Method: Uses four core agent types (households, firms, central bank, government) that interact in frictional labor, heterogeneous goods, and financial markets, with a Vision-Language Model for geographic firm placement and city rendering.
Result: The framework naturally reproduces canonical macroeconomic phenomena including price elasticity of demand, Engel’s Law, Okun’s Law, Phillips Curve, and Beveridge Curve while remaining robust across simulation runs.
Conclusion: SimCity demonstrates that LLMs enable construction of realistic, interpretable macroeconomic simulations that capture both macroeconomic regularities and urban expansion dynamics in a unified environment.
Abstract: Large Language Models (LLMs) open new possibilities for constructing realistic and interpretable macroeconomic simulations. We present SimCity, a multi-agent framework that leverages LLMs to model an interpretable macroeconomic system with heterogeneous agents and rich interactions. Unlike classical equilibrium models that limit heterogeneity for tractability, or traditional agent-based models (ABMs) that rely on hand-crafted decision rules, SimCity enables flexible, adaptive behavior with transparent natural-language reasoning. Within SimCity, four core agent types (households, firms, a central bank, and a government) deliberate and participate in a frictional labor market, a heterogeneous goods market, and a financial market. Furthermore, a Vision-Language Model (VLM) determines the geographic placement of new firms and renders a mapped virtual city, allowing us to study both macroeconomic regularities and urban expansion dynamics within a unified environment. To evaluate the framework, we compile a checklist of canonical macroeconomic phenomena, including price elasticity of demand, Engel’s Law, Okun’s Law, the Phillips Curve, and the Beveridge Curve, and show that SimCity naturally reproduces these empirical patterns while remaining robust across simulation runs.
[596] Implementing Agents in JavaScript
Timotheus Kampik
Main category: cs.MA
TL;DR: Introduction to agent-oriented programming in JavaScript using JS-son library, covering reasoning loop agents, multi-agent systems, and integration with generative AI.
Details
Motivation: To provide practical guidance on implementing agent-oriented programming abstractions in JavaScript and demonstrate integration with modern AI technologies.
Method: Example-based approach starting with vanilla JavaScript implementations, then using the JS-son library for advanced agents and multi-agent systems, with integration to large language models.
Result: Walk-through of implementing reasoning loop agents, multi-agent systems, and AI integration in JavaScript ecosystem.
Conclusion: Presents application scenarios across technology ecosystems and outlines future research directions for agent-oriented programming in JavaScript.
Abstract: This chapter gives an introduction to agent-oriented programming in JavaScript. It provides an example-based walk-through of how to implement abstractions for reasoning loop agents in vanilla JavaScript. The initial example is used as a stepping stone for explaining how to implement slightly more advanced agents and multi-agent systems using JS-son, a JavaScript library for agent-oriented programming. In this context, the chapter also explains how to integrate reasoning loop agents with generative AI technologies, specifically large language models. Finally, application scenarios in several technology ecosystems and future research directions are sketched.
[597] AniMaker: Multi-Agent Animated Storytelling with MCTS-Driven Clip Generation
Haoyuan Shi, Yunxin Li, Xinyu Chen, Longyue Wang, Baotian Hu, Min Zhang
Main category: cs.MA
TL;DR: AniMaker is a multi-agent framework that generates coherent storytelling animations from text by using specialized agents for storyboard generation, video clip creation, evaluation, and post-production, with key innovations in efficient multi-candidate generation and multi-shot animation evaluation.
Details
Motivation: Current video generation methods struggle with creating coherent multi-scene storytelling animations, often producing disjointed narratives due to rigid keyframe conversion and instability in video generation models that can degrade entire animations from single poor clips.
Method: A multi-agent framework with Director Agent (storyboard generation), Photography Agent (video clip generation using MCTS-Gen for efficient candidate exploration), Reviewer Agent (evaluation using AniEval for multi-shot animation assessment), and Post-Production Agent (editing and voiceover).
Result: AniMaker achieves superior quality measured by VBench and the proposed AniEval framework, while significantly improving efficiency in multi-candidate generation, advancing AI-generated storytelling animation toward production standards.
Conclusion: The framework successfully addresses challenges in coherent multi-scene animation generation through specialized agents and novel technical components, demonstrating improved quality and efficiency in storytelling animation production.
Abstract: Despite rapid advancements in video generation models, generating coherent storytelling videos that span multiple scenes and characters remains challenging. Current methods often rigidly convert pre-generated keyframes into fixed-length clips, resulting in disjointed narratives and pacing issues. Furthermore, the inherent instability of video generation models means that even a single low-quality clip can significantly degrade the entire output animation’s logical coherence and visual continuity. To overcome these obstacles, we introduce AniMaker, a multi-agent framework enabling efficient multi-candidate clip generation and storytelling-aware clip selection, thus creating globally consistent and story-coherent animation solely from text input. The framework is structured around specialized agents, including the Director Agent for storyboard generation, the Photography Agent for video clip generation, the Reviewer Agent for evaluation, and the Post-Production Agent for editing and voiceover. Central to AniMaker’s approach are two key technical components: MCTS-Gen in Photography Agent, an efficient Monte Carlo Tree Search (MCTS)-inspired strategy that intelligently navigates the candidate space to generate high-potential clips while optimizing resource usage; and AniEval in Reviewer Agent, the first framework specifically designed for multi-shot animation evaluation, which assesses critical aspects such as story-level consistency, action completion, and animation-specific features by considering each clip in the context of its preceding and succeeding clips. Experiments demonstrate that AniMaker achieves superior quality as measured by popular metrics including VBench and our proposed AniEval framework, while significantly improving the efficiency of multi-candidate generation, pushing AI-generated storytelling animation closer to production standards.
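MCTS-Gen's internals are not spelled out in the abstract; as a hedged sketch, the UCB1 rule below shows how an MCTS-inspired selector could trade off exploring new candidate clips against revisiting promising ones. The candidates, scores, and exploration constant are all illustrative.

```python
# UCB1-style candidate selection, a common building block of MCTS variants.
import math

def ucb_pick(stats, c=1.4):
    """stats: {candidate: (total_reward, visits)}; pick the best by UCB1."""
    total = sum(v for _, v in stats.values())
    def ucb(item):
        reward, visits = item[1]
        if visits == 0:
            return float("inf")            # explore unvisited candidates first
        return reward / visits + c * math.sqrt(math.log(total) / visits)
    return max(stats.items(), key=ucb)[0]

stats = {"clip_a": (2.4, 3), "clip_b": (0.9, 1), "clip_c": (0.0, 0)}
print(ucb_pick(stats))  # 'clip_c': unvisited, so it is explored first
```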
cs.MM
[598] Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation
Chetwin Low, Weimin Wang, Calder Katyal
Main category: cs.MM
TL;DR: Ovi is a unified audio-video generation model that uses twin-DiT modules with blockwise cross-modal fusion to generate synchronized audio and video in a single process, eliminating the need for separate pipelines or post-processing alignment.
Details
Motivation: To overcome the limitations of complex multi-stage architectures and sequential synthesis in audio-video generation by creating a unified paradigm that models both modalities as a single generative process.
Method: Uses twin-DiT modules with blockwise cross-modal fusion, initializing an audio tower identical to a pretrained video model. Employs scaled-RoPE embeddings for timing exchange and bidirectional cross-attention for semantic exchange during joint training on large video datasets.
Result: Produces cinematic storytelling with natural speech, realistic sound effects, and accurate context-matched audio. The model generates movie-grade video clips with rich speaker identity and emotion in speech.
Conclusion: Ovi demonstrates that unified audio-video generation through blockwise cross-modal fusion can achieve natural synchronization and high-quality results without complex multi-stage pipelines or post hoc alignment.
Abstract: Audio-video generation has often relied on complex multi-stage architectures or sequential synthesis of sound and visuals. We introduce Ovi, a unified paradigm for audio-video generation that models the two modalities as a single generative process. By using blockwise cross-modal fusion of twin-DiT modules, Ovi achieves natural synchronization and removes the need for separate pipelines or post hoc alignment. To facilitate fine-grained multimodal fusion modeling, we initialize an audio tower with an architecture identical to that of a strong pretrained video model. Trained from scratch on hundreds of thousands of hours of raw audio, the audio tower learns to generate realistic sound effects, as well as speech that conveys rich speaker identity and emotion. Fusion is obtained by jointly training the identical video and audio towers via blockwise exchange of timing (via scaled-RoPE embeddings) and semantics (through bidirectional cross-attention) on a vast video corpus. Our model enables cinematic storytelling with natural speech and accurate, context-matched sound effects, producing movie-grade video clips. All the demos, code and model weights are published at https://aaxwaz.github.io/Ovi
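The semantic-exchange half of the fusion is bidirectional cross-attention; a single-head numpy sketch follows, omitting the learned query/key/value projections and multi-head structure a real DiT block would use.

```python
# Bidirectional cross-attention: each modality's tokens attend to the other's.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attend(q_tokens, kv_tokens):
    d = q_tokens.shape[-1]
    attn = softmax(q_tokens @ kv_tokens.T / np.sqrt(d))
    return attn @ kv_tokens          # queries enriched with the other modality

rng = np.random.default_rng(0)
audio = rng.standard_normal((8, 64))    # 8 audio tokens (illustrative)
video = rng.standard_normal((16, 64))   # 16 video tokens (illustrative)
audio_out = cross_attend(audio, video)  # audio attends to video
video_out = cross_attend(video, audio)  # video attends to audio
print(audio_out.shape, video_out.shape) # (8, 64) (16, 64)
```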
[599] Comparing Contrastive and Triplet Loss in Audio-Visual Embedding: Intra-Class Variance and Greediness Analysis
Donghuo Zeng
Main category: cs.MM
TL;DR: Theoretical and empirical comparison shows triplet loss preserves greater intra- and inter-class variance for finer-grained distinctions, while contrastive loss compacts embeddings and may obscure subtle differences. Triplet loss yields superior performance across classification and retrieval tasks.
Details
Motivation: Contrastive loss and triplet loss are widely used in deep metric learning, but their effects on representation quality remain insufficiently understood, requiring systematic comparison.
Method: Theoretical analysis and empirical experiments on synthetic data and real datasets (MNIST, CIFAR-10, CUB-200, CARS196) examining intra-/inter-class variance, optimization behavior (loss-decay rate, active ratio, gradient norm), and performance on classification/retrieval tasks.
Result: Triplet loss preserves greater variance within and across classes, supports finer-grained distinctions, and produces fewer but stronger updates that sustain learning on hard examples. Contrastive loss compacts intra-class embeddings and drives many small updates early on.
Conclusion: Triplet loss yields superior performance across tasks, suggesting its use for detail retention and hard-sample focus, while contrastive loss is better for smoother, broad-based embedding refinement.
Abstract: Contrastive loss and triplet loss are widely used objectives in deep metric learning, yet their effects on representation quality remain insufficiently understood. We present a theoretical and empirical comparison of these losses, focusing on intra- and inter-class variance and optimization behavior (e.g., greedy updates). Through task-specific experiments with consistent settings on synthetic data and real datasets (MNIST, CIFAR-10), it is shown that triplet loss preserves greater variance within and across classes, supporting finer-grained distinctions in the learned representations. In contrast, contrastive loss tends to compact intra-class embeddings, which may obscure subtle semantic differences. To better understand their optimization dynamics, we examine loss-decay rate, active ratio, and gradient norm, and find that contrastive loss drives many small updates early on, while triplet loss produces fewer but stronger updates that sustain learning on hard examples. Finally, across both classification and retrieval tasks on the MNIST, CIFAR-10, CUB-200, and CARS196 datasets, our results consistently show that triplet loss yields superior performance, which suggests using triplet loss for detail retention and hard-sample focus, and contrastive loss for smoother, broad-based embedding refinement.
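For reference, the two objectives in their textbook forms; the margins below are illustrative, not the paper's settings.

```python
# Contrastive loss acts on pairs; triplet loss acts on (anchor, pos, neg).
import numpy as np

def contrastive_loss(x1, x2, same_class, margin=1.0):
    d = np.linalg.norm(x1 - x2)
    return d ** 2 if same_class else max(0.0, margin - d) ** 2

def triplet_loss(anchor, positive, negative, margin=0.2):
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    return max(0.0, d_ap - d_an + margin)

a, p, n = np.zeros(4), np.full(4, 0.1), np.ones(4)
print(contrastive_loss(a, p, True), triplet_loss(a, p, n))
```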
[600] Audio-Enhanced Vision-Language Modeling with Latent Space Broadening for High Quality Data Expansion
Yu Sun, Yin Li, Ruixiao Sun, Chunhui Liu, Fangming Zhou, Ze Jin, Linjie Wang, Xiang Shen, Zhuolin Hao, Hongyu Xiong
Main category: cs.MM
TL;DR: Proposed kNN-based Latent Space Broadening for active learning efficiency and Vision-Language Modeling with Audio Enhancement for integrating audio into multimodal models, achieving significant business gains in production.
Details
Motivation: Traditional active learning methods struggle with detecting overconfident misclassifications and distinguishing semantically similar items, while existing multimodal models lack audio integration despite its importance in short-video platforms.
Method: kNN-based Latent Space Broadening (LSB) for active learning enhancement and Vision-Language Modeling with Audio Enhancement (VLMAE), a mid-fusion approach integrating audio into vision-language models.
Result: The system was deployed in production systems and led to significant business gains.
Conclusion: The proposed methods effectively address limitations of traditional active learning and enable better audio integration in multimodal models, resulting in improved performance and business outcomes.
Abstract: Transformer-based multimodal models are widely used in industrial-scale recommendation, search, and advertising systems for content understanding and relevance ranking. Enhancing labeled training data quality and cross-modal fusion significantly improves model performance, influencing key metrics such as quality view rates and ad revenue. High-quality annotations are crucial for advancing content modeling, yet traditional statistical-based active learning (AL) methods face limitations: they struggle to detect overconfident misclassifications and are less effective in distinguishing semantically similar items in deep neural networks. Additionally, audio information plays an increasing role, especially on short-video platforms, yet most pre-trained multimodal architectures primarily focus on text and images. While training from scratch across all three modalities is possible, it sacrifices the benefits of leveraging existing pre-trained visual-language (VL) and audio models. To address these challenges, we propose kNN-based Latent Space Broadening (LSB) to enhance AL efficiency and Vision-Language Modeling with Audio Enhancement (VLMAE), a mid-fusion approach integrating audio into VL models. The system has been deployed in production, leading to significant business gains.
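A hedged sketch of the kNN-based broadening step: starting from uncertain seed samples, add their latent-space neighbors to the labeling pool. The value of `k`, the embeddings, and the seed selection are placeholders, not the paper's pipeline.

```python
# Broaden an active-learning candidate set via latent-space nearest neighbors.
import numpy as np

def broaden(pool_emb, seed_idx, k=5):
    selected = set(seed_idx)
    for i in seed_idx:
        d = np.linalg.norm(pool_emb - pool_emb[i], axis=1)
        selected.update(np.argsort(d)[1:k + 1].tolist())  # skip the seed itself
    return sorted(selected)

rng = np.random.default_rng(0)
embeddings = rng.standard_normal((1000, 128))   # unlabeled pool embeddings
uncertain_seeds = [3, 42, 511]                  # e.g. low-margin predictions
print(len(broaden(embeddings, uncertain_seeds)))  # up to 3 + 3*5 candidates
```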
eess.AS
[601] Joint Optimization of Speaker and Spoof Detectors for Spoofing-Robust Automatic Speaker Verification
Oğuzhan Kurnaz, Jagabandhu Mishra, Tomi H. Kinnunen, Cemal Hanilçi
Main category: eess.AS
TL;DR: The paper proposes a modular SASV system that combines speaker and spoof detection subsystems using trainable back-end classifiers optimized for the SASV performance metric (a-DCF), achieving state-of-the-art results.
Details
Motivation: To develop spoofing-robust speaker verification systems that effectively combine speaker and spoof detection capabilities through optimized integration of independently trained subsystems.
Method: Used a modular design with trainable back-end classifiers to integrate outputs from speaker detection (weighted cosine scoring) and spoof detection (SSL-AASIST) subsystems, directly optimizing for the SASV metric (a-DCF).
Result: Achieved state-of-the-art performance with min a-DCF of 0.196 and SPF-EER of 7.6% on ASVspoof 5 dataset, showing nonlinear score fusion consistently outperforms linear fusion.
Conclusion: Modular design, calibrated integration, and task-aligned optimization are crucial for advancing robust and interpretable SASV systems.
Abstract: Spoofing-robust speaker verification (SASV) combines the tasks of speaker and spoof detection to authenticate speakers under adversarial settings. Many SASV systems rely on fusion of speaker and spoof cues at embedding, score or decision levels, based on independently trained subsystems. In this study, we respect similar modularity of the two subsystems, by integrating their outputs using trainable back-end classifiers. In particular, we explore various approaches for directly optimizing the back-end for the recently-proposed SASV performance metric (a-DCF) as a training objective. Our experiments on the ASVspoof 5 dataset demonstrate two important findings: (i) nonlinear score fusion consistently improves a-DCF over linear fusion, and (ii) the combination of weighted cosine scoring for speaker detection with SSL-AASIST for spoof detection achieves state-of-the-art performance, reducing min a-DCF to 0.196 and SPF-EER to 7.6%. These contributions highlight the importance of modular design, calibrated integration, and task-aligned optimization for advancing robust and interpretable SASV systems.
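To make the linear-versus-nonlinear fusion contrast concrete, a toy sketch follows; the weights are illustrative random values, whereas the paper trains its back-end directly against the a-DCF objective.

```python
# Linear vs. nonlinear back-end fusion of a speaker score and a spoof score.
import numpy as np

def linear_fusion(s_spk, s_spoof, w=(0.7, 0.3), b=0.0):
    return w[0] * s_spk + w[1] * s_spoof + b

def mlp_fusion(s_spk, s_spoof, W1, b1, w2, b2):
    h = np.tanh(W1 @ np.array([s_spk, s_spoof]) + b1)  # one hidden layer
    return float(w2 @ h + b2)

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((8, 2)), np.zeros(8)
w2, b2 = rng.standard_normal(8), 0.0
print(linear_fusion(2.1, -0.4), mlp_fusion(2.1, -0.4, W1, b1, w2, b2))
```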
[602] SLAP: Learning Speaker and Health-Related Representations from Natural Language Supervision
Angelika Ando, Auguste Crabeil, Adrien Lesage, Rachid Riad
Main category: eess.AS
TL;DR: SLAP is the first audio foundation model that enables zero-shot and out-of-distribution generalization for paralinguistic speech tasks like demographics, voice quality, and health assessment by aligning speech with natural language descriptions through contrastive learning.
Details
Motivation: Current audio foundation models lack zero-shot or OOD generalization capabilities for paralinguistic information such as demographics, voice quality, and health assessment in speech.
Method: SLAP combines a Vision Transformer audio encoder with text encoders trained through contrastive learning on 3400+ hours across 9 datasets with diverse speaker annotations, aligning speech with natural language descriptions of speaker and health metadata.
Result: SLAP achieves 62.9% average F1 in zero-shot evaluation (48% improvement over CLAP), demonstrates strong OOD generalization to unseen languages and clinical populations, and reaches 69.3% F1 with linear probing, with best-in-class 57.9% F1 on health tasks.
Conclusion: SLAP successfully enables zero-shot and OOD generalization for paralinguistic speech analysis, significantly outperforming existing models and establishing new state-of-the-art performance, particularly for health-related speech assessment tasks.
Abstract: Speech encodes paralinguistic information such as demographics, voice quality, and health. Yet no audio foundation model supports zero-shot or out-of-distribution (OOD) generalization to these tasks. We introduce SLAP (Speaker contrastive Language-Audio Pretraining), the first model aligning speech with natural language descriptions of speaker and health metadata through contrastive learning. SLAP combines a Vision Transformer audio encoder with text encoders, trained on more than 3400 hours across 9 datasets with diverse speaker annotations. We evaluated on 38 binary classification tasks spanning demographics, voice characteristics, and clinical assessments across 14 datasets in 7 languages. SLAP achieves 62.9% average F1 in zero-shot evaluation, a 48% relative improvement over CLAP (42.4%), while demonstrating strong OOD generalization to unseen languages and clinical populations. When fine-tuned with linear probing, SLAP reaches 69.3% F1 overall and achieves best-in-class performance on health tasks (57.9% F1), surpassing larger foundation models.
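The symmetric contrastive objective behind this style of language-audio pretraining, in a minimal numpy form: matched audio/text pairs sit on the diagonal of a similarity matrix. The batch size, embedding dimension, and temperature below are illustrative.

```python
# CLIP/CLAP-style symmetric contrastive loss over a batch of paired embeddings.
import numpy as np

def clip_style_loss(audio_emb, text_emb, tau=0.07):
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / tau                       # pairwise similarities
    def ce_diag(z):                              # cross-entropy on the diagonal
        log_sm = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_sm))
    return 0.5 * (ce_diag(logits) + ce_diag(logits.T))  # both directions

rng = np.random.default_rng(0)
print(clip_style_loss(rng.standard_normal((4, 32)), rng.standard_normal((4, 32))))
```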
[603] Clustering of Acoustic Environments with Variational Autoencoders for Hearing Devices
Luan Vinícius Fiorio, Ivana Nikoloska, Wim van Houtum, Ronald M. Aarts
Main category: eess.AS
TL;DR: This paper proposes an unsupervised clustering method for acoustic environments using variational autoencoders (VAEs) with Gumbel-Softmax reparameterization and time-context windowing, specifically designed for hearing devices where traditional supervised learning is limited by label availability.
Details
Motivation: Traditional acoustic environment classification methods are limited: classical algorithms cannot handle high-dimensional data well, while supervised learning depends on labeled data, and human-imposed labels do not always capture the actual structure of acoustic scenes.
Method: Proposed a VAE model for categorical latent clustering using Gumbel-Softmax reparameterization with a time-context windowing scheme. Also introduced general adaptations of VAE architectures for audio clustering tasks.
Result: All variational methods successfully clustered spoken digits (where labels are meaningful), but only the proposed categorical VAE model achieved effective clustering performance on complex urban soundscapes that have strong overlap in time and frequency.
Conclusion: The proposed categorical VAE model with Gumbel-Softmax reparameterization is effective for unsupervised clustering of acoustic environments, particularly for real-world hearing device scenarios where traditional methods fail, especially on complex urban soundscapes.
Abstract: Particularly in hearing devices, the environmental context is taken into account for audio processing, often through classification. Traditional acoustic environment classification relies on classical algorithms, which are unable to extract meaningful representations of high-dimensionality data, or on supervised learning, which is limited by the availability of labels. Knowing that human-imposed labels do not always reflect the true structure of acoustic scenes, we explore the (unsupervised) clustering of acoustic environments using variational autoencoders (VAEs), presenting a structured latent space suitable for the task. We propose a VAE model for categorical latent clustering employing a Gumbel-Softmax reparameterization with a time-context windowing scheme, tailored for real-world hearing device scenarios. Additionally, general adaptations of VAE architectures for audio clustering are also proposed. The approaches are validated through the clustering of spoken digits, a simpler task where labels are meaningful, and urban soundscapes, whose recordings present strong overlap in time and frequency. While all variational methods succeeded when clustering spoken digits, only the proposed model achieved effective clustering performance on urban acoustic scenes, given its categorical nature.
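The Gumbel-Softmax reparameterization that makes the categorical latent differentiable fits in a few lines; the temperature and logits below are illustrative.

```python
# Gumbel-Softmax: differentiable, near-one-hot samples from class logits.
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, tau=0.5):
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0,1) noise
    y = (logits + g) / tau
    e = np.exp(y - y.max())
    return e / e.sum()          # soft, nearly one-hot cluster assignment

logits = np.array([1.0, 2.5, 0.3, 0.0])   # unnormalized cluster scores
print(gumbel_softmax(logits).round(3))
```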
[604] Learning Time-Graph Frequency Representation for Monaural Speech Enhancement
Tingting Wang, Tianrui Wang, Meng Ge, Qiquan Zhang, Xi Shao
Main category: eess.AS
TL;DR: Proposes a learnable GFT-SVD framework for speech enhancement that constructs adaptive graph topologies using graph shift operators and defines learnable graph Fourier basis via 1-D convolution layers, eliminating matrix inversion issues.
Details
Motivation: Existing GFT-based speech enhancement methods use fixed graph topologies with limited adaptability and suffer from numerical errors and instability due to matrix inversion in both GFT-SVD and GFT-EVD approaches.
Method: Leverages graph shift operators to construct a learnable graph topology and defines a learnable graph Fourier basis using singular value matrices with 1-D convolution neural layers, avoiding matrix inversion.
Result: The proposed framework eliminates numerical errors and stability problems associated with matrix inversion while providing adaptive and flexible graph representations.
Conclusion: The learnable GFT-SVD framework offers a simple yet effective solution for speech enhancement by addressing limitations of fixed graph topologies and matrix inversion issues in traditional GFT approaches.
Abstract: The Graph Fourier Transform (GFT) has recently demonstrated promising results in speech enhancement. However, existing GFT-based speech enhancement approaches often employ fixed graph topologies to build the graph Fourier basis, whose representation lacks adaptivity and flexibility. In addition, they suffer from the numerical errors and instability introduced by matrix inversion in GFT based on both Singular Value Decomposition (GFT-SVD) and Eigenvector Decomposition (GFT-EVD). Motivated by these limitations, this paper proposes a simple yet effective learnable GFT-SVD framework for speech enhancement. Specifically, we leverage graph shift operators to construct a learnable graph topology and define a learnable graph Fourier basis by the singular value matrices using a 1-D convolution (Conv-1D) neural layer. This eliminates the need for matrix inversion, thereby avoiding the associated numerical errors and stability problems.
eess.IV
[605] An Efficient Quality Metric for Video Frame Interpolation Based on Motion-Field Divergence
Conall Daly, Darren Ramsook, Anil Kokaram
Main category: eess.IV
TL;DR: PSNR_DIV is a novel quality metric for video frame interpolation that enhances PSNR through motion divergence weighting, achieving better correlation with human perception while being significantly faster and more memory-efficient than existing methods.
Details
Motivation: Existing quality metrics like PSNR, SSIM, and LPIPS ignore temporal coherence in video frame interpolation, while specialized metrics like FloLPIPS suffer from computational inefficiency that limits practical application.
Method: The approach enhances PSNR through motion divergence weighting, a technique adapted from archival film restoration that detects temporal inconsistencies by highlighting singularities in motion fields to weight image errors.
Result: Evaluation on BVI-VFI dataset (180 sequences) shows PSNR_DIV achieves +0.09 Pearson Linear Correlation Coefficient improvement over FloLPIPS, while being 2.5x faster and using 4x less memory. Performance remains consistent across content categories and robust to motion estimator used.
Conclusion: PSNR_DIV’s efficiency and accuracy enable fast quality evaluation and practical use as a loss function for training neural networks in video frame interpolation tasks.
Abstract: Video frame interpolation is a fundamental tool for temporal video enhancement, but existing quality metrics struggle to evaluate the perceptual impact of interpolation artefacts effectively. Metrics like PSNR, SSIM and LPIPS ignore temporal coherence. State-of-the-art quality metrics tailored towards video frame interpolation, like FloLPIPS, have been developed but suffer from computational inefficiency that limits their practical application. We present $\text{PSNR}_{\text{DIV}}$, a novel full-reference quality metric that enhances PSNR through motion divergence weighting, a technique adapted from archival film restoration where it was developed to detect temporal inconsistencies. Our approach highlights singularities in motion fields, which are then used to weight image errors. Evaluation on the BVI-VFI dataset (180 sequences across multiple frame rates, resolutions and interpolation methods) shows $\text{PSNR}_{\text{DIV}}$ achieves statistically significant improvements: +0.09 Pearson Linear Correlation Coefficient over FloLPIPS, while being 2.5$\times$ faster and using 4$\times$ less memory. Performance remains consistent across all content categories and is robust to the motion estimator used. The efficiency and accuracy of $\text{PSNR}_{\text{DIV}}$ enable fast quality evaluation and practical use as a loss function for training neural networks for video frame interpolation tasks. An implementation of our metric is available at www.github.com/conalld/psnr-div.
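A hedged sketch of divergence-weighted error: compute the divergence of a dense motion field and up-weight pixel errors where it spikes. The weighting function and its strength `alpha` are assumptions; the published metric may use a different form.

```python
# Weight pixel errors by the magnitude of the motion-field divergence.
import numpy as np

def divergence(flow):                      # flow: (H, W, 2) -> (H, W)
    du_dx = np.gradient(flow[..., 0], axis=1)
    dv_dy = np.gradient(flow[..., 1], axis=0)
    return du_dx + dv_dy

def weighted_mse(ref, test, flow, alpha=4.0):
    w = 1.0 + alpha * np.abs(divergence(flow))   # singularities weigh more
    return np.mean(w * (ref - test) ** 2)

rng = np.random.default_rng(0)
ref, test = rng.random((64, 64)), rng.random((64, 64))
flow = rng.standard_normal((64, 64, 2))          # toy motion field
mse = weighted_mse(ref, test, flow)
print("PSNR-like score:", 10 * np.log10(1.0 / mse))
```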
[606] Median2Median: Zero-shot Suppression of Structured Noise in Images
Jianxu Wang, Ge Wang
Main category: eess.IV
TL;DR: M2M is a zero-shot denoising framework for structured noise that generates pseudo-independent sub-image pairs from single noisy inputs using directional interpolation and median filtering, enabling effective denoising without training data.
Details
Motivation: Real-world images often have structured noise with anisotropic correlations that existing methods struggle with. Data-driven approaches need large datasets and have limited generalizability, while zero-shot methods only work for i.i.d. noise.
Method: Uses a novel sampling strategy with directional interpolation and generalized median filtering to create pseudo-independent sub-image pairs from a single noisy input. Employs randomized assignment to eliminate systematic bias and enable Noise2Noise training.
Result: In realistic simulations, M2M performs on par with state-of-the-art zero-shot methods under i.i.d. noise and consistently outperforms them under correlated noise.
Conclusion: M2M provides an efficient, data-free solution for structured noise suppression and represents the first step toward effective zero-shot denoising beyond the strict i.i.d. assumption.
Abstract: Image denoising is a fundamental problem in computer vision and medical imaging. However, real-world images are often degraded by structured noise with strong anisotropic correlations that existing methods struggle to remove. Most data-driven approaches rely on large datasets with high-quality labels and still suffer from limited generalizability, whereas existing zero-shot methods avoid this limitation but remain effective only for independent and identically distributed (i.i.d.) noise. To address this gap, we propose Median2Median (M2M), a zero-shot denoising framework designed for structured noise. M2M introduces a novel sampling strategy that generates pseudo-independent sub-image pairs from a single noisy input. This strategy leverages directional interpolation and generalized median filtering to adaptively exclude values distorted by structured artifacts. To further enlarge the effective sampling space and eliminate systematic bias, a randomized assignment strategy is employed, ensuring that the sampled sub-image pairs are suitable for Noise2Noise training. In our realistic simulation studies, M2M performs on par with state-of-the-art zero-shot methods under i.i.d. noise, while consistently outperforming them under correlated noise. These findings establish M2M as an efficient, data-free solution for structured noise suppression and mark the first step toward effective zero-shot denoising beyond the strict i.i.d. assumption.
[607] GFSR-Net: Guided Focus via Segment-Wise Relevance Network for Interpretable Deep Learning in Medical Imaging
Jhonatan Contreras, Thomas Bocklitz
Main category: eess.IV
TL;DR: GFSR-Net improves medical AI interpretability by using minimal human annotations to guide model focus toward diagnostically relevant areas, producing more reliable saliency maps without sacrificing accuracy.
Details
Motivation: Deep learning models in medical imaging lack interpretability, often relying on irrelevant image regions or artificial cues, which reduces clinical trust and increases diagnostic risks.
Method: GFSR-Net uses a small number of human annotations to approximate intuitive focus areas, training the model to align its attention with diagnostically meaningful regions without requiring precise boundaries.
Result: Experiments show GFSR achieves comparable or superior accuracy while producing saliency maps that better match human expectations, reducing reliance on irrelevant patterns.
Conclusion: GFSR-Net increases confidence in automated diagnostic tools by improving interpretability and reliability across various medical imaging modalities.
Abstract: Deep learning has achieved remarkable success in medical image analysis; however, its adoption in clinical practice is limited by a lack of interpretability. These models often make correct predictions without explaining their reasoning. They may also rely on image regions unrelated to the disease or on visual cues, such as annotations, that are not present in real-world conditions. This can reduce trust and increase the risk of misleading diagnoses. We introduce the Guided Focus via Segment-Wise Relevance Network (GFSR-Net), an approach designed to improve interpretability and reliability in medical imaging. GFSR-Net uses a small number of human annotations to approximate where a person would intuitively focus within an image, without requiring precise boundaries or exhaustive markings, making the process fast and practical. During training, the model learns to align its focus with these areas, progressively emphasizing features that carry diagnostic meaning. This guidance works across different types of natural and medical images, including chest X-rays, retinal scans, and dermatological images. Our experiments demonstrate that GFSR achieves comparable or superior accuracy while producing saliency maps that better reflect human expectations. This reduces the reliance on irrelevant patterns and increases confidence in automated diagnostic tools.
[608] MSRepaint: Multiple Sclerosis Repaint with Conditional Denoising Diffusion Implicit Model for Bidirectional Lesion Filling and Synthesis
Jinwei Zhang, Lianrui Zuo, Yihao Liu, Hang Zhang, Samuel W. Remedios, Bennett A. Landman, Peter A. Calabresi, Shiv Saidha, Scott D. Newsome, Dzung L. Pham, Jerry L. Prince, Ellen M. Mowry, Aaron Carass
Main category: eess.IV
TL;DR: MSRepaint is a unified diffusion-based generative model for bidirectional lesion filling and synthesis in multiple sclerosis MRI analysis, improving both downstream tasks and segmentation through realistic data generation.
Details
Motivation: Lesions in multiple sclerosis interfere with automated MRI analyses like brain parcellation and registration, while lesion segmentation models face limited annotated training data availability.
Method: MSRepaint uses diffusion-based generative modeling with spatial lesion mask conditioning, contrast dropout for missing inputs, a repainting mechanism to preserve surrounding anatomy, and multi-view DDIM inversion with fusion for 3D consistency.
Result: MSRepaint outperforms traditional lesion filling methods (FSL, NiftySeg) and matches FastSurfer-LIT accuracy while being 20x faster. For lesion synthesis, models trained on MSRepaint data outperform those trained on CarveMix or real ISBI data across multiple benchmarks.
Conclusion: MSRepaint provides a unified bidirectional approach for lesion filling and synthesis with full spatial control, enabling high-fidelity simulation of lesion evolution in longitudinal MS progression.
Abstract: In multiple sclerosis, lesions interfere with automated magnetic resonance imaging analyses such as brain parcellation and deformable registration, while lesion segmentation models are hindered by the limited availability of annotated training data. To address both issues, we propose MSRepaint, a unified diffusion-based generative model for bidirectional lesion filling and synthesis that restores anatomical continuity for downstream analyses and augments segmentation through realistic data generation. MSRepaint conditions on spatial lesion masks for voxel-level control, incorporates contrast dropout to handle missing inputs, integrates a repainting mechanism to preserve surrounding anatomy during lesion filling and synthesis, and employs a multi-view DDIM inversion and fusion pipeline for 3D consistency with fast inference. Extensive evaluations demonstrate the effectiveness of MSRepaint across multiple tasks. For lesion filling, we evaluate both the accuracy within the filled regions and the impact on downstream tasks including brain parcellation and deformable registration. MSRepaint outperforms the traditional lesion filling methods FSL and NiftySeg, and achieves accuracy on par with FastSurfer-LIT, a recent diffusion model-based inpainting method, while offering over 20 times faster inference. For lesion synthesis, state-of-the-art MS lesion segmentation models trained on MSRepaint-synthesized data outperform those trained on CarveMix-synthesized data or real ISBI challenge training data across multiple benchmarks, including the MICCAI 2016 and UMCL datasets. Additionally, we demonstrate that MSRepaint’s unified bidirectional filling and synthesis capability, with full spatial control over lesion appearance, enables high-fidelity simulation of lesion evolution in longitudinal MS progression.
[609] SpurBreast: A Curated Dataset for Investigating Spurious Correlations in Real-world Breast MRI Classification
Jong Bum Won, Wesley De Neve, Joris Vankerschaver, Utku Ozbulak
Main category: eess.IV
TL;DR: SpurBreast is a curated breast MRI dataset designed to study spurious correlations in medical imaging AI, identifying magnetic field strength and image orientation as key confounding factors that can mislead DNNs.
Details
Motivation: To address the challenge of spurious correlations in medical imaging AI, where models learn non-clinical features instead of meaningful medical patterns, due to limitations in existing datasets.
Method: Created the SpurBreast dataset by analyzing over 100 features and identifying two dominant spurious signals: magnetic field strength (a global feature) and image orientation (a local feature), with controlled dataset splits to evaluate model generalization.
Result: DNNs can exploit these non-clinical signals to achieve high validation accuracy but fail to generalize to unbiased test data, demonstrating the risk of spurious correlations in medical AI.
Conclusion: SpurBreast enables systematic investigation of clinically relevant vs irrelevant features, uncertainty estimation, adversarial robustness, and generalization strategies in medical imaging AI.
Abstract: Deep neural networks (DNNs) have demonstrated remarkable success in medical imaging, yet their real-world deployment remains challenging due to spurious correlations, where models can learn non-clinical features instead of meaningful medical patterns. Existing medical imaging datasets are not designed to systematically study this issue, largely due to restrictive licensing and limited supplementary patient data. To address this gap, we introduce SpurBreast, a curated breast MRI dataset that intentionally incorporates spurious correlations to evaluate their impact on model performance. Analyzing over 100 features involving patient, device, and imaging protocol, we identify two dominant spurious signals: magnetic field strength (a global feature influencing the entire image) and image orientation (a local feature affecting spatial alignment). Through controlled dataset splits, we demonstrate that DNNs can exploit these non-clinical signals, achieving high validation accuracy while failing to generalize to unbiased test data. Alongside these two datasets containing spurious correlations, we also provide benchmark datasets without spurious correlations, allowing researchers to systematically investigate clinically relevant and irrelevant features, uncertainty estimation, adversarial robustness, and generalization strategies. Models and datasets are available at https://github.com/utkuozbulak/spurbreast.
[610] Measurement-Guided Consistency Model Sampling for Inverse Problems
Amirreza Tanevardi, Pooria Abbas Rad Moghadam, Sajjad Amini
Main category: eess.IV
TL;DR: A modified consistency sampling method for inverse imaging problems that guides stochasticity with measurement-consistency, enabling efficient high-quality reconstruction in few steps.
Details
Motivation: Diffusion models are slow for inverse problems, and consistency models’ potential for inverse problems is underexplored despite their efficiency.
Method: Modified consistency sampling with a measurement-consistency mechanism that enforces fidelity to measurements while maintaining efficiency.
Result: Improved perceptual and pixel-level metrics (FID, KID, PSNR, SSIM) on Fashion-MNIST and LSUN Bedroom datasets compared to baseline consistency sampling.
Conclusion: The approach achieves competitive or superior reconstructions with only a handful of steps, making it practical for deployment.
Abstract: Diffusion models have become powerful generative priors for solving inverse imaging problems, but their reliance on slow multi-step sampling limits practical deployment. Consistency models address this bottleneck by enabling high-quality generation in a single or only a few steps, yet their direct adaptation to inverse problems is underexplored. In this paper, we present a modified consistency sampling approach tailored for inverse problem reconstruction: the sampler’s stochasticity is guided by a measurement-consistency mechanism tied to the measurement operator, which enforces fidelity to the acquired measurements while retaining the efficiency of consistency-based generation. Experiments on Fashion-MNIST and LSUN Bedroom datasets demonstrate consistent improvements in perceptual and pixel-level metrics, including Fréchet Inception Distance, Kernel Inception Distance, peak signal-to-noise ratio, and structural similarity index measure, compared to baseline consistency sampling, yielding competitive or superior reconstructions with only a handful of steps.
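The measurement-consistency mechanism can be pictured as a gradient step on the data-fidelity term after each denoising step; the sketch below shows that generic correction on a toy linear operator, not the paper's exact sampler. The operator `A`, step size, and iteration count are assumptions.

```python
# Generic measurement-consistency correction: gradient step on 0.5*||Ax - y||^2.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 100)) / np.sqrt(100)   # toy measurement operator
x_true = rng.standard_normal(100)
y = A @ x_true                                      # acquired measurements

def measurement_step(x, step=0.5):
    return x - step * A.T @ (A @ x - y)   # nudge toward measurement agreement

x = rng.standard_normal(100)              # stand-in for a denoised sample
for _ in range(50):
    x = measurement_step(x)
print("residual:", np.linalg.norm(A @ x - y))
```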
[611] SUPER-Net: Trustworthy Image Segmentation via Uncertainty Propagation in Encoder-Decoder Networks
Giuseppina Carannante, Nidhal C. Bouaynaya, Dimah Dera, Hassan M. Fathallah-Shaykh, Ghulam Rasool
Main category: eess.IV
TL;DR: SUPER-Net is a Bayesian framework for trustworthy image segmentation that generates both segmentation results and pixel-wise uncertainty maps without expensive Monte Carlo sampling, showing superior robustness and accuracy under noisy and adversarial conditions.
Details
Motivation: Current deep learning models for image segmentation lack uncertainty quantification and are brittle to noisy and out-of-distribution inputs, limiting their deployment in sensitive fields like medical imaging.
Method: SUPER-Net uses Taylor series approximations to propagate the mean and covariance of the model’s posterior distribution across nonlinear layers, enabling simultaneous generation of segmented images and pixel-wise uncertainty maps.
Result: Extensive evaluation on MRI and CT scans shows SUPER-Net outperforms state-of-the-art models in robustness and accuracy under various noisy and adversarial conditions, with uncertainty maps effectively identifying low-confidence areas affected by noise or attacks.
Conclusion: SUPER-Net provides a trustworthy image segmentation solution that enables self-assessment of segmentation reliability through uncertainty quantification, particularly valuable when errors arise from noise or adversarial examples.
Abstract: Deep Learning (DL) holds great promise in reshaping the industry owing to its precision, efficiency, and objectivity. However, the brittleness of DL models to noisy and out-of-distribution inputs is ailing their deployment in sensitive fields. Current models often lack uncertainty quantification, providing only point estimates. We propose SUPER-Net, a Bayesian framework for trustworthy image segmentation via uncertainty propagation. Using Taylor series approximations, SUPER-Net propagates the mean and covariance of the model’s posterior distribution across nonlinear layers. It generates two outputs simultaneously: the segmented image and a pixel-wise uncertainty map, eliminating the need for expensive Monte Carlo sampling. SUPER-Net’s performance is extensively evaluated on MRI and CT scans under various noisy and adversarial conditions. Results show that SUPER-Net outperforms state-of-the-art models in robustness and accuracy. The uncertainty map identifies low-confidence areas affected by noise or attacks, allowing the model to self-assess segmentation reliability, particularly when errors arise from noise or adversarial examples.
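The moment propagation SUPER-Net performs can be illustrated on one layer: a linear map propagates the mean and covariance exactly, and an elementwise nonlinearity is handled by a first-order Taylor expansion around the mean. The layer sizes and weights below are illustrative, not the paper's architecture.

```python
# First-order moment propagation through a linear layer followed by ReLU.
import numpy as np

rng = np.random.default_rng(0)
W, b = rng.standard_normal((5, 8)) * 0.3, np.zeros(5)
mu, Sigma = rng.standard_normal(8), 0.1 * np.eye(8)

# Linear layer: exact propagation of mean and covariance
mu_lin = W @ mu + b
Sigma_lin = W @ Sigma @ W.T

# ReLU: first-order Taylor around the mean; Jacobian is diag(1[mu_lin > 0])
J = np.diag((mu_lin > 0).astype(float))
mu_out = np.maximum(mu_lin, 0.0)
Sigma_out = J @ Sigma_lin @ J
print(mu_out.round(3), np.diag(Sigma_out).round(3))
```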
[612] A Tutorial on MRI Reconstruction: From Modern Methods to Clinical Implications
Tolga Çukur, Salman U. H. Dar, Valiyeh Ansarian Nezhad, Yohan Jun, Tae Hyung Kim, Shohei Fujita, Berkin Bilgic
Main category: eess.IV
TL;DR: This tutorial paper provides an overview of MRI reconstruction methods, covering classical approaches with hand-crafted priors and modern deep learning methods that combine learned and crafted priors to accelerate MRI acquisitions while preserving diagnostic quality.
Details
Motivation: MRI scans are clinically essential but time-consuming, reducing patient throughput and increasing motion artifacts. There's a need to accelerate acquisitions without compromising diagnostic quality.
Method: The paper reviews MRI reconstruction basics and state-of-the-art approaches, including classical methods with explicit hand-crafted priors and deep learning methods that leverage learned priors. It includes a Python toolbox for practical demonstration.
Result: Advances in reconstruction algorithms, hardware, and pulse sequence design have enabled accelerated MRI acquisitions while maintaining diagnostic quality through the incorporation of prior information.
Conclusion: The tutorial explores translational aspects and clinical implications of MRI reconstruction methods, discusses future directions to address remaining challenges, and provides practical tools for implementation.
Abstract: MRI is an indispensable clinical tool, offering a rich variety of tissue contrasts to support broad diagnostic and research applications. Clinical exams routinely acquire multiple structural sequences that provide complementary information for differential diagnosis, while research protocols often incorporate advanced functional, diffusion, spectroscopic, and relaxometry sequences to capture multidimensional insights into tissue structure and composition. However, these capabilities come at the cost of prolonged scan times, which reduce patient throughput, increase susceptibility to motion artifacts, and may require trade-offs in image quality or diagnostic scope. Over the last two decades, advances in image reconstruction algorithms, alongside improvements in hardware and pulse sequence design, have made it possible to accelerate acquisitions while preserving diagnostic quality. Central to this progress is the ability to incorporate prior information to regularize the solutions to the reconstruction problem. In this tutorial, we overview the basics of MRI reconstruction and highlight state-of-the-art approaches, beginning with classical methods that rely on explicit hand-crafted priors, and then turning to deep learning methods that leverage a combination of learned and crafted priors to further push the performance envelope. We also explore the translational aspects and eventual clinical implications of these methods. We conclude by discussing future directions to address remaining challenges in MRI reconstruction. The tutorial is accompanied by a Python toolbox (https://github.com/tutorial-MRI-recon/tutorial) to demonstrate select methods discussed in the article.
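As a taste of the classical, hand-crafted-prior methods the tutorial begins with, the sketch below reconstructs a sparse 1-D signal from undersampled Fourier measurements with ISTA: a data-consistency gradient step followed by soft-thresholding. The toy signal, sampling mask, step size, and threshold are all assumptions.

```python
# ISTA for undersampled Fourier measurements with a sparsity prior.
import numpy as np

rng = np.random.default_rng(0)
n = 128
x_true = np.zeros(n)
x_true[rng.choice(n, 8, replace=False)] = 1.0   # sparse ground truth
mask = rng.random(n) < 0.4                      # keep 40% of k-space
y = mask * np.fft.fft(x_true) / np.sqrt(n)      # undersampled measurements

def soft(z, t):                                 # soft-thresholding prox step
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

x = np.zeros(n)
for _ in range(200):
    # gradient of 0.5*||mask * F x - y||^2 with a unitary FFT
    grad = np.fft.ifft(mask * (mask * np.fft.fft(x) / np.sqrt(n) - y)) * np.sqrt(n)
    x = soft(x - grad.real, 0.02)               # gradient step + prox
print("recovery error:", np.linalg.norm(x - x_true))
```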
[613] UltraUPConvNet: A UPerNet- and ConvNeXt-Based Multi-Task Network for Ultrasound Tissue Segmentation and Disease Prediction
Zhi Chen, Le Zhang
Main category: eess.IV
TL;DR: UltraUPConvNet is a computationally efficient universal framework for both ultrasound image classification and segmentation, achieving state-of-the-art performance on certain datasets with lower computational overhead.
Details
Motivation: Current AI research treats disease prediction and tissue segmentation as separate tasks requiring substantial computational overhead, while ultrasound imaging is widely used in clinical practice due to its cost-effectiveness, mobility, and safety.
Method: Introduces UltraUPConvNet, a universal framework trained on a large-scale dataset containing more than 9,700 annotations across seven different anatomical regions for both ultrasound image classification and segmentation.
Result: Achieves state-of-the-art performance on certain datasets with lower computational overhead.
Conclusion: The model provides an efficient solution for ultrasound image analysis, with model weights and codes publicly available.
Abstract: Ultrasound imaging is widely used in clinical practice due to its cost-effectiveness, mobility, and safety. However, current AI research often treats disease prediction and tissue segmentation as two separate tasks and their model requires substantial computational overhead. In such a situation, we introduce UltraUPConvNet, a computationally efficient universal framework designed for both ultrasound image classification and segmentation. Trained on a large-scale dataset containing more than 9,700 annotations across seven different anatomical regions, our model achieves state-of-the-art performance on certain datasets with lower computational overhead. Our model weights and codes are available at https://github.com/yyxl123/UltraUPConvNet
[614] Improving Virtual Contrast Enhancement using Longitudinal Data
Pierre Fayolle, Alexandre Bône, Noëlie Debs, Philippe Robert, Pascal Bourdon, Remy Guillevin, David Helbert
Main category: eess.IV
TL;DR: Deep learning framework for virtual contrast enhancement of MRI images using longitudinal data to reduce gadolinium contrast agent dosage while maintaining diagnostic quality.
Details
Motivation: Concerns about gadolinium retention and accumulation in brain tissues from frequent contrast agent injections in neuro-oncology monitoring necessitate strategies to reduce dosage.
Method: Proposed a deep learning model that uses longitudinal information by incorporating prior full-dose MRI exams from the same patient to enhance low-dose acquisitions.
Result: Longitudinal approach significantly improved image quality across multiple reconstruction metrics compared to non-longitudinal models, and showed robustness with varying simulated contrast doses.
Conclusion: Integrating prior imaging history into deep learning pipelines can reduce gadolinium contrast agent usage without compromising diagnostic utility, enabling safer longitudinal monitoring in clinical MRI.
Abstract: Gadolinium-based contrast agents (GBCAs) are widely used in magnetic resonance imaging (MRI) to enhance lesion detection and characterisation, particularly in the field of neuro-oncology. Nevertheless, concerns regarding gadolinium retention and accumulation in brain and body tissues, most notably for diseases that require close monitoring and frequent GBCA injection, have led to the need for strategies to reduce dosage. In this study, a deep learning framework is proposed for the virtual contrast enhancement of full-dose post-contrast T1-weighted MRI images from corresponding low-dose acquisitions. The contribution of the presented model is its utilisation of longitudinal information, which is achieved by incorporating a prior full-dose MRI examination from the same patient. A comparative evaluation against a non-longitudinal single session model demonstrated that the longitudinal approach significantly improves image quality across multiple reconstruction metrics. Furthermore, experiments with varying simulated contrast doses confirmed the robustness of the proposed method. These results emphasize the potential of integrating prior imaging history into deep learning-based virtual contrast enhancement pipelines to reduce GBCA usage without compromising diagnostic utility, thus paving the way for safer, more sustainable longitudinal monitoring in clinical MRI practice.