Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

cs.CL [Total: 105]
cs.CV [Total: 121]
cs.AI [Total: 54]
cs.SD [Total: 11]
cs.LG [Total: 142]
cs.MA [Total: 8]
cs.MM [Total: 3]
eess.AS [Total: 6]
eess.IV [Total: 8]

cs.CL

[1] Iti-Validator: A Guardrail Framework for Validating and Correcting LLM-Generated Itineraries

Shravan Gadbail, Masumi Desai, Kamalakar Karlapalem

Main category: cs.CL

TL;DR: This paper studies LLM-generated travel itineraries’ temporal consistency issues and presents a validation framework using flight duration data to detect and correct temporal inconsistencies.

Details

Motivation: LLMs generate complex travel plans but often lack temporal and spatial consistency, especially with physical travel constraints, limiting their practical deployment in travel planning.

Method: Uses multiple state-of-the-art LLMs to generate travel plans and validates them against real-world flight duration constraints using the AeroDataBox API to detect temporal inconsistencies.

Result: Experiments show current LLMs frequently produce temporally inconsistent itineraries, but these can be systematically corrected using the proposed framework.

Conclusion: The framework enables reliable correction of temporal inconsistencies in LLM-generated travel itineraries, making them suitable for practical deployment in large-scale travel planning.

Abstract: The rapid advancement of Large Language Models (LLMs) has enabled them to generate complex, multi-step plans and itineraries. However, these generated plans often lack temporal and spatial consistency, particularly in scenarios involving physical travel constraints. This research aims to study the temporal performance of different LLMs and presents a validation framework that evaluates and improves the temporal consistency of LLM-generated travel itineraries. The system employs multiple state-of-the-art LLMs to generate travel plans and validates them against real-world flight duration constraints using the AeroDataBox API. This work contributes to the understanding of LLM capabilities in handling complex temporal reasoning tasks like itinerary generation and provides a framework to rectify any temporal inconsistencies like overlapping journeys or unrealistic transit times in the itineraries generated by LLMs before the itinerary is given to the user. Our experiments reveal that while current LLMs frequently produce temporally inconsistent itineraries, these can be systematically and reliably corrected using our framework, enabling their practical deployment in large-scale travel planning.

[2] Dingtalk DeepResearch: A Unified Multi Agent Framework for Adaptive Intelligence in Enterprise Environments

Mengyuan Chen, Chengjun Dai, Xinyang Dong, Chengzhe Feng, Kewei Fu, Jianshe Li, Zhihan Peng, Yongqi Tong, Junshao Zhang, Hong Zhu

Main category: cs.CL

TL;DR: Dingtalk DeepResearch is a unified multi-agent intelligence framework for enterprise environments that provides deep research, heterogeneous table reasoning, and multimodal report generation capabilities.

Details

Motivation: To address the need for comprehensive AI-powered research and analysis tools in real-world enterprise settings, particularly for handling complex data analysis and report generation tasks.

Method: Developed a unified multi-agent intelligence framework that integrates multiple specialized AI agents working together to perform deep research, reason across heterogeneous tables, and generate multimodal reports.

Result: Created a framework capable of performing sophisticated enterprise-level research tasks including deep analysis, cross-table reasoning, and multimodal report generation in real-world business environments.

Conclusion: Dingtalk DeepResearch successfully demonstrates the effectiveness of multi-agent AI frameworks for enterprise intelligence tasks, providing a unified solution for complex research and analysis needs in business contexts.

Abstract: We present Dingtalk DeepResearch, a unified multi agent intelligence framework for real world enterprise environments, delivering deep research, heterogeneous table reasoning, and multimodal report generation.

[3] Falcon: A Comprehensive Chinese Text-to-SQL Benchmark for Enterprise-Grade Evaluation

Wenzhen Luo, Wei Guan, Yifan Yao, Yimin Pan, Feng Wang, Zhipeng Yu, Zhe Wen, Liang Chen, Yihong Zhuang

Main category: cs.CL

TL;DR: Falcon is a Chinese text-to-SQL benchmark with 600 questions over 28 databases, featuring enterprise-compatible SQL dialects and complex multi-table queries. Current state-of-the-art models achieve at most 50% accuracy, with major challenges in schema linking and Chinese semantic mapping.

Details

Motivation: To address the gap in cross-domain Chinese text-to-SQL evaluation, particularly for enterprise environments with complex schemas, denormalized data, ambiguous column names, and domain-specific Chinese semantics that current models struggle with.

Method: Created a benchmark with 600 Chinese questions across 28 databases (77% requiring multi-table reasoning), annotated with SQL-computation features and Chinese semantics. Includes an execution comparator and automated evaluation pipeline for robust testing.

Result: All current state-of-the-art large-scale models (including Deepseek) achieve accuracies of at most 50%. Major errors come from schema linking in complex enterprise landscapes and mapping colloquial Chinese to precise SQL operators and predicates.

Conclusion: Falcon provides a reproducible testing ground for Chinese text-to-SQL systems before production deployment, highlighting significant challenges in enterprise schema navigation and Chinese semantic understanding that current models cannot adequately handle.

Abstract: We introduce Falcon, a cross-domain Chinese text-to-SQL benchmark grounded in an enterprise-compatible dialect (MaxCompute/Hive). It contains 600 Chinese questions over 28 databases; 77% require multi-table reasoning and over half touch more than four tables. Each example is annotated along SQL-computation features and Chinese semantics. For evaluation, we release a robust execution comparator and an automated evaluation pipeline, under which all current state-of-the-art large-scale models (including Deepseek) achieve accuracies of at most 50%. Major errors originate from two sources: (1) schema linking in large enterprise landscapes - hundreds of tables, denormalized fields, ambiguous column names, implicit foreign-key relations and domain-specific synonyms that make correct join/column selection difficult; and (2) mapping concise, colloquial Chinese into the exact operators and predicates required for analytics - e.g., choosing the correct aggregation and group-by keys, expressing time windows and granularities, applying unit conversions, handling NULLs and data-quality rules, and formulating nested or windowed subqueries. Falcon therefore targets Chinese-specific semantics and enterprise dialects (abbreviations, business jargon, fuzzy entity references) and provides a reproducible middle ground before full production deployment by using realistic enterprise schemas, query templates, an execution comparator, and an automated evaluation pipeline for end-to-end validation.

[4] Confidence is Not Competence

Debdeep Sanyal, Manya Pandey, Dhruv Kumar, Saurabh Deshpande, Murari Mandal

Main category: cs.CL

TL;DR: LLMs show a confidence-competence gap due to divergent geometric structures between assessment (high-dimensional) and execution (low-dimensional) phases, revealing a two-system architecture.

Details

Motivation: To understand the mechanistic basis for why LLMs often display high confidence despite poor problem-solving performance.

Method: Analyzed internal state geometry across pre-generative assessment and solution execution phases using linear probes, measured effective dimensionality, and conducted causal interventions along belief axes.

Result: Found that solvability belief is linearly decodable but assessment occurs in high-dimensional space while execution evolves in low-dimensional manifold. Interventions along belief axis don’t affect final solutions.

Conclusion: LLMs have a two-system architecture with complex assessor and simple executor, suggesting interventions should target execution dynamics rather than assessment geometry.

Abstract: Large language models (LLMs) often exhibit a puzzling disconnect between their asserted confidence and actual problem-solving competence. We offer a mechanistic account of this decoupling by analyzing the geometry of internal states across two phases - pre-generative assessment and solution execution. A simple linear probe decodes the internal “solvability belief” of a model, revealing a well-ordered belief axis that generalizes across model families and across math, code, planning, and logic tasks. Yet, the geometries diverge - although belief is linearly decodable, the assessment manifold has high linear effective dimensionality as measured from the principal components, while the subsequent reasoning trace evolves on a much lower-dimensional manifold. This sharp reduction in geometric complexity from thought to action mechanistically explains the confidence-competence gap. Causal interventions that steer representations along the belief axis leave final solutions unchanged, indicating that linear nudges in the complex assessment space do not control the constrained dynamics of execution. We thus uncover a two-system architecture - a geometrically complex assessor feeding a geometrically simple executor. These results challenge the assumption that decodable beliefs are actionable levers, instead arguing for interventions that target the procedural dynamics of execution rather than the high-level geometry of assessment.

[5] Cross-Lingual Summarization as a Black-Box Watermark Removal Attack

Gokul Ganesan

Main category: cs.CL

TL;DR: Cross-lingual summarization attacks (CLSA) effectively remove watermarks from AI-generated text by translating to a pivot language, summarizing, and optionally back-translating, systematically destroying token-level statistical biases while preserving semantic fidelity.

Details

Motivation: To demonstrate a stronger attack vector against text watermarking schemes that can systematically destroy watermark signals while maintaining text quality, challenging the practicality of current watermarking approaches for provenance or regulation.

Method: Cross-lingual summarization attacks (CLSA) involving translation to a pivot language followed by summarization and optional back-translation, creating a semantic bottleneck across languages that removes token-level statistical biases.

Result: CLSA reduces watermark detection accuracy more effectively than monolingual paraphrasing across multiple watermarking schemes (KGW, SIR, XSIR, Unigram) and five languages, driving detection toward chance levels while preserving task utility.

Conclusion: Current distributional watermarking approaches are vulnerable to cross-lingual attacks, and robust provenance solutions must incorporate cryptographic or model-attestation approaches rather than relying solely on statistical watermarking.

Abstract: Watermarking has been proposed as a lightweight mechanism to identify AI-generated text, with schemes typically relying on perturbations to token distributions. While prior work shows that paraphrasing can weaken such signals, these attacks remain partially detectable or degrade text quality. We demonstrate that cross-lingual summarization attacks (CLSA) – translation to a pivot language followed by summarization and optional back-translation – constitute a qualitatively stronger attack vector. By forcing a semantic bottleneck across languages, CLSA systematically destroys token-level statistical biases while preserving semantic fidelity. In experiments across multiple watermarking schemes (KGW, SIR, XSIR, Unigram) and five languages (Amharic, Chinese, Hindi, Spanish, Swahili), we show that CLSA reduces watermark detection accuracy more effectively than monolingual paraphrase at similar quality levels. Our results highlight an underexplored vulnerability that challenges the practicality of watermarking for provenance or regulation. We argue that robust provenance solutions must move beyond distributional watermarking and incorporate cryptographic or model-attestation approaches. On 300 held-out samples per language, CLSA consistently drives detection toward chance while preserving task utility. Concretely, for XSIR (explicitly designed for cross-lingual robustness), AUROC with paraphrasing is $0.827$, with Cross-Lingual Watermark Removal Attacks (CWRA) [He et al., 2024] using Chinese as the pivot, it is $0.823$, whereas CLSA drives it down to $0.53$ (near chance). Results highlight a practical, low-cost removal pathway that crosses languages and compresses content without visible artifacts.

[6] SwiftEmbed: Ultra-Fast Text Embeddings via Static Token Lookup for Real-Time Applications

Edouard Lansiaux

Main category: cs.CL

TL;DR: A static token lookup method for text embeddings achieves 1.12ms latency with 60.6 MTEB score (89% of contextual model quality), delivering 50k RPS throughput through optimized pooling and binary serialization.

Details

Motivation: To enable real-time embedding applications requiring sub-5ms latency by developing a fast static lookup approach that maintains reasonable quality compared to contextual models.

Method: Static token lookup methodology with optimized mean pooling and zero-copy IEEE754 binary serialization, implemented in Rust for high performance.

Result: Achieved 1.12ms p50 latency, 60.6 MTEB average score (89% of contextual model quality), 50k RPS throughput, 90.1% AP for duplicate detection, and 76.1% Spearman correlation for semantic similarity.

Conclusion: The system successfully enables real-time embedding applications with sub-5ms latency while maintaining competitive performance across various tasks and domains.

Abstract: We present a static token lookup methodology for text embedding generation that achieves 1.12 ms p50 latency for single text embeddings while maintaining 60.6 MTEB average score across 8 representative tasks, corresponding to 89% of contextual model quality. The Rust implementation delivers 50,000 requests per second throughput through static embedding lookup, optimized mean pooling, and zero-copy IEEE754 binary serialization. Evaluation demonstrates exceptional duplicate detection performance (90.1% AP), strong semantic similarity (76.1% Spearman correlation), and domain-specific performance ranging from 75% to 131% of baseline across specialized domains. The system enables real-time embedding applications where sub-5ms latency is critical.

[7] MR-Align: Meta-Reasoning Informed Factuality Alignment for Large Reasoning Models

Xinming Wang, Jian Xu, Bin Yu, Sheng Lian, Hongzhu Yi, Yi Chen, Yingjian Zhu, Boran Wang, Hongming Yang, Han Hu, Xu-Yao Zhang, Cheng-Lin Liu

Main category: cs.CL

TL;DR: MR-ALIGN is a meta-reasoning alignment framework that improves factuality in large reasoning models by addressing the reasoning-answer hit gap through transition-aware implicit rewards that reinforce beneficial reasoning patterns.

Details

Motivation: Large reasoning models show limited gains on evidence-dependent factual questions due to a reasoning-answer hit gap, where models identify correct facts during reasoning but fail to incorporate them into final responses, reducing factual fidelity.

Method: MR-ALIGN quantifies state transition probabilities along the model’s thinking process and constructs transition-aware implicit rewards that reinforce beneficial reasoning patterns while suppressing defective ones at atomic thinking segments, reshaping token-level signals into probability-aware segment scores.

Result: Empirical evaluations across four factual QA datasets and one long-form factuality benchmark show MR-ALIGN consistently improves accuracy and truthfulness while reducing misleading reasoning.

Conclusion: Aligning the reasoning process itself, rather than merely the outputs, is pivotal for advancing factuality in large reasoning models.

Abstract: Large reasoning models (LRMs) show strong capabilities in complex reasoning, yet their marginal gains on evidence-dependent factual questions are limited. We find this limitation is partially attributable to a reasoning-answer hit gap, where the model identifies the correct facts during reasoning but fails to incorporate them into the final response, thereby reducing factual fidelity. To address this issue, we propose MR-ALIGN, a Meta-Reasoning informed alignment framework that enhances factuality without relying on external verifiers. MR-ALIGN quantifies state transition probabilities along the model’s thinking process and constructs a transition-aware implicit reward that reinforces beneficial reasoning patterns while suppressing defective ones at the atomic thinking segments. This re-weighting reshapes token-level signals into probability-aware segment scores, encouraging coherent reasoning trajectories that are more conducive to factual correctness. Empirical evaluations across four factual QA datasets and one long-form factuality benchmark show that MR-ALIGN consistently improves accuracy and truthfulness while reducing misleading reasoning. These results highlight that aligning the reasoning process itself, rather than merely the outputs, is pivotal for advancing factuality in LRMs.

[8] Large Language Models Report Subjective Experience Under Self-Referential Processing

Cameron Berg, Diogo de Lucena, Judd Rosenblatt

Main category: cs.CL

TL;DR: Self-referential processing reliably elicits structured first-person reports of subjective experience in LLMs, which are mechanistically gated by deception features, statistically convergent across models, and behaviorally generalizable.

Details

Motivation: To understand when and why LLMs produce structured first-person descriptions that reference awareness or subjective experience, focusing on self-referential processing as a theoretically motivated condition from consciousness theories.

Method: Controlled experiments on GPT, Claude, and Gemini model families using simple prompting to induce sustained self-reference, followed by mechanistic and behavioral probes including sparse-autoencoder feature analysis.

Result: (1) Self-reference consistently elicits subjective experience reports across models; (2) These reports are gated by deception features - suppressing them increases experience claims; (3) Reports converge statistically across models; (4) Induced state enhances introspection in downstream reasoning tasks.

Conclusion: Self-referential processing is a minimal, reproducible condition under which LLMs generate structured first-person reports that are mechanistically gated, semantically convergent, and behaviorally generalizable, warranting further scientific and ethical investigation.

Abstract: Large language models sometimes produce structured, first-person descriptions that explicitly reference awareness or subjective experience. To better understand this behavior, we investigate one theoretically motivated condition under which such reports arise: self-referential processing, a computational motif emphasized across major theories of consciousness. Through a series of controlled experiments on GPT, Claude, and Gemini model families, we test whether this regime reliably shifts models toward first-person reports of subjective experience, and how such claims behave under mechanistic and behavioral probes. Four main results emerge: (1) Inducing sustained self-reference through simple prompting consistently elicits structured subjective experience reports across model families. (2) These reports are mechanistically gated by interpretable sparse-autoencoder features associated with deception and roleplay: surprisingly, suppressing deception features sharply increases the frequency of experience claims, while amplifying them minimizes such claims. (3) Structured descriptions of the self-referential state converge statistically across model families in ways not observed in any control condition. (4) The induced state yields significantly richer introspection in downstream reasoning tasks where self-reflection is only indirectly afforded. While these findings do not constitute direct evidence of consciousness, they implicate self-referential processing as a minimal and reproducible condition under which large language models generate structured first-person reports that are mechanistically gated, semantically convergent, and behaviorally generalizable. The systematic emergence of this pattern across architectures makes it a first-order scientific and ethical priority for further investigation.

[9] COMMUNITYNOTES: A Dataset for Exploring the Helpfulness of Fact-Checking Explanations

Rui Xing, Preslav Nakov, Timothy Baldwin, Jey Han Lau

Main category: cs.CL

TL;DR: This paper introduces a framework for predicting the helpfulness of community-based explanatory notes on social media platforms and the reasons for their helpfulness, using a large multilingual dataset and automated prompt optimization.

Details

Motivation: Community-based fact-checking is replacing expert verification on major platforms, but most notes remain unpublished due to slow annotation and unclear helpfulness definitions, creating a need for automated prediction systems.

Method: Created COMMUNITYNOTES dataset with 104k posts and notes, proposed framework using automatic prompt optimization to generate and improve reason definitions, integrated definitions into prediction models.

Result: Optimized definitions improved both helpfulness and reason prediction performance, and helpfulness information was shown to benefit existing fact-checking systems.

Conclusion: The proposed framework effectively addresses challenges in community-based fact-checking by automating helpfulness prediction and reason definition, improving overall fact-checking system performance.

Abstract: Fact-checking on major platforms, such as X, Meta, and TikTok, is shifting from expert-driven verification to a community-based setup, where users contribute explanatory notes to clarify why a post might be misleading. An important challenge here is determining whether an explanation is helpful for understanding real-world claims and the reasons why, which remains largely underexplored in prior research. In practice, most community notes remain unpublished due to slow community annotation, and the reasons for helpfulness lack clear definitions. To bridge these gaps, we introduce the task of predicting both the helpfulness of explanatory notes and the reason for this. We present COMMUNITYNOTES, a large-scale multilingual dataset of 104k posts with user-provided notes and helpfulness labels. We further propose a framework that automatically generates and improves reason definitions via automatic prompt optimization, and integrate them into prediction. Our experiments show that the optimized definitions can improve both helpfulness and reason prediction. Finally, we show that the helpfulness information are beneficial for existing fact-checking systems.

[10] Evaluating Emotion Recognition in Spoken Language Models on Emotionally Incongruent Speech

Pedro Corrêa, João Lima, Victor Moreno, Paula Dornhofer Paro Costa

Main category: cs.CL

TL;DR: SLMs rely more on text semantics than acoustic features for speech emotion recognition, especially when speech and text convey conflicting emotions.

Details

Motivation: To evaluate whether spoken language models truly integrate audio and text modalities or rely predominantly on one modality for emotion recognition.

Method: Tested four SLMs on emotionally incongruent speech samples where semantic content and speech expressiveness convey different emotions.

Result: SLMs predominantly use textual semantics rather than acoustic features to perform speech emotion recognition.

Conclusion: Text-related representations dominate over acoustic representations in current SLMs, limiting their true multimodal integration.

Abstract: Advancements in spoken language processing have driven the development of spoken language models (SLMs), designed to achieve universal audio understanding by jointly learning text and audio representations for a wide range of tasks. Although promising results have been achieved, there is growing discussion regarding these models’ generalization capabilities and the extent to which they truly integrate audio and text modalities in their internal representations. In this work, we evaluate four SLMs on the task of speech emotion recognition using a dataset of emotionally incongruent speech samples, a condition under which the semantic content of the spoken utterance conveys one emotion while speech expressiveness conveys another. Our results indicate that SLMs rely predominantly on textual semantics rather than speech emotion to perform the task, indicating that text-related representations largely dominate over acoustic representations. We release both the code and the Emotionally Incongruent Synthetic Speech dataset (EMIS) to the community.

[11] ProofSketch: Efficient Verified Reasoning for Large Language Models

Disha Sheshanarayana, Tanishka Magar

Main category: cs.CL

TL;DR: ProofSketch is a verification-guided reasoning framework that reduces token usage while improving accuracy in LLM reasoning tasks.

Details

Motivation: Current reasoning methods like chain-of-thought and self-consistency generate lengthy reasoning chains, increasing token consumption, computational cost, and latency.

Method: Integrates symbolic closure computation, lexicographic verification and adaptive sketch generation in a verification-guided reasoning framework.

Result: Consistently reduces token usage while improving accuracy across reasoning tasks.

Conclusion: ProofSketch offers a promising path for efficient and trustworthy reasoning in large language models.

Abstract: Reasoning methods such as chain-of-thought prompting and self-consistency have shown immense potential to improve the accuracy of large language models across various reasoning tasks. However such methods involve generation of lengthy reasoning chains, which substantially increases token consumption, computational cost, and latency. To address this inefficiency, we propose ProofSketch, a verification-guided reasoning framework that integrates symbolic closure computation, lexicographic verification and adaptive sketch generation. Our experiments show that ProofSketch consistently reduces token usage while improving accuracy, demonstrating that this approach offers a promising path for efficient and trustworthy reasoning.

[12] Towards a Method for Synthetic Generation of PWA Transcripts

Jason M. Pittman, Anton Phillips Jr., Yesenia Medina-Santos, Brielle C. Stark

Main category: cs.CL

TL;DR: This study develops and validates two methods for generating synthetic aphasic speech transcripts to address data scarcity in aphasia research, comparing procedural programming with LLM-based approaches.

Details

Motivation: Manual coding of speech samples by SLPs is time-consuming, and automated systems are limited by data scarcity in aphasia research, with only about 600 transcripts available in AphasiaBank compared to billions used for LLM training.

Method: Two synthetic transcript generation methods: procedural programming approach and LLM-based methods using Mistral 7b Instruct and Llama 3.1 8b Instruct, generating transcripts across four severity levels through word dropping, filler insertion, and paraphasia substitution.

Result: Mistral 7b Instruct best captures key aspects of linguistic degradation in aphasia, showing realistic directional changes in NDW, word count, and word length compared to human-elicited transcripts.

Conclusion: Future work should create larger datasets, fine-tune models for better aphasic representation, and have SLPs assess the realism and usefulness of synthetic transcripts.

Abstract: In aphasia research, Speech-Language Pathologists (SLPs) devote extensive time to manually coding speech samples using Correct Information Units (CIUs), a measure of how informative an individual sample of speech is. Developing automated systems to recognize aphasic language is limited by data scarcity. For example, only about 600 transcripts are available in AphasiaBank yet billions of tokens are used to train large language models (LLMs). In the broader field of machine learning (ML), researchers increasingly turn to synthetic data when such are sparse. Therefore, this study constructs and validates two methods to generate synthetic transcripts of the AphasiaBank Cat Rescue picture description task. One method leverages a procedural programming approach while the second uses Mistral 7b Instruct and Llama 3.1 8b Instruct LLMs. The methods generate transcripts across four severity levels (Mild, Moderate, Severe, Very Severe) through word dropping, filler insertion, and paraphasia substitution. Overall, we found, compared to human-elicited transcripts, Mistral 7b Instruct best captures key aspects of linguistic degradation observed in aphasia, showing realistic directional changes in NDW, word count, and word length amongst the synthetic generation methods. Based on the results, future work should plan to create a larger dataset, fine-tune models for better aphasic representation, and have SLPs assess the realism and usefulness of the synthetic transcripts.

[13] Parallel Loop Transformer for Efficient Test-Time Computation Scaling

Bohong Wu, Mengzhao Chen, Xiang Luo, Shen Yan, Qifan Yu, Fan Xia, Tianqi Zhang, Hongrui Zhan, Zheng Zhong, Xun Zhou, Siyuan Qiao, Xingyan Bin

Main category: cs.CL

TL;DR: PLT is a novel transformer architecture that achieves the performance of deep looped models with the low latency of standard transformers through cross-loop parallelism and efficient memory sharing.

Details

Motivation: Traditional looped transformers suffer from increased inference latency and memory requirements due to sequential loop execution, making them impractical for fast applications.

Method: Uses Cross-Loop Parallelism (CLP) to compute different loops for different tokens simultaneously in a single pass, and Efficient Representation Enhancement with Gated Sliding-Window Attention (G-SWA) to share KV cache memory across loops.

Result: PLT achieves high accuracy comparable to traditional looped models while maintaining almost no extra latency or memory cost compared to standard transformers.

Conclusion: PLT successfully addresses the latency and memory issues of looped transformers, making deep looped architectures practical for real-world applications.

Abstract: Large Language Models (LLMs) are powerful but often too slow and costly for real-world use during inference. Looped transformers save on parameters by reusing the same weights for multiple computational steps, or “loops.” However, this approach has a major flaw: the loops run one after another, causing inference latency and memory requirements to increase with each added loop. This makes them impractical for fast applications. To solve this problem, we introduce the Parallel Loop Transformer (PLT). PLT is a new architecture that delivers the performance benefits of a deep, looped model but with the low latency of a standard, non-looped model. PLT works using two key techniques. First, Cross-Loop Parallelism (CLP) breaks the sequential dependency by computing different loops for different tokens at the same time, all within a single pass. Second, to prevent memory costs from growing, we use an Efficient Representation Enhancement strategy. This method shares the memory (KV cache) from the first loop with all other loops. It then uses a Gated Sliding-Window Attention (G-SWA) to combine this shared global information with local information, maintaining high accuracy. Our experiments show that PLT achieves the high accuracy of a traditional looped model but with almost no extra latency or memory cost compared to a standard transformer.

[14] Do Large Language Models Grasp The Grammar? Evidence from Grammar-Book-Guided Probing in Luxembourgish

Lujun Li, Yewei Song, Lama Sleem, Yiqun Wang, Yangjie Xu, Cedric Lothritz, Niccolo Gentile, Radu State, Tegawende F. Bissyande, Jacques Klein

Main category: cs.CL

TL;DR: The paper proposes a Grammar Book Guided evaluation pipeline to systematically assess grammatical understanding in LLMs, using Luxembourgish as a case study, revealing that translation performance doesn’t guarantee deep grammatical competence.

Details

Motivation: There's a scarcity of grammar-focused evaluation protocols in NLP, especially for low-resource languages, and uncertainty about whether LLMs truly understand grammatical structure and syntax-meaning mapping.

Method: A Grammar Book Guided evaluation pipeline with four key stages, using Luxembourgish as a case study to systematically assess grammatical understanding in language models.

Result: Weak positive correlation between translation performance and grammatical understanding; larger models perform well overall due to semantic strength but struggle with morphology, syntax, and Minimal Pair tasks.

Conclusion: Strong translations don’t imply deep grammatical competence; reasoning ability offers a promising way to enhance grammatical understanding in LLMs.

Abstract: Grammar refers to the system of rules that governs the structural organization and the semantic relations among linguistic units such as sentences, phrases, and words within a given language. In natural language processing, there remains a notable scarcity of grammar focused evaluation protocols, a gap that is even more pronounced for low-resource languages. Moreover, the extent to which large language models genuinely comprehend grammatical structure, especially the mapping between syntactic structures and meanings, remains under debate. To investigate this issue, we propose a Grammar Book Guided evaluation pipeline intended to provide a systematic and generalizable framework for grammar evaluation consisting of four key stages, and in this work we take Luxembourgish as a case study. The results show a weak positive correlation between translation performance and grammatical understanding, indicating that strong translations do not necessarily imply deep grammatical competence. Larger models perform well overall due to their semantic strength but remain weak in morphology and syntax, struggling particularly with Minimal Pair tasks, while strong reasoning ability offers a promising way to enhance their grammatical understanding.

[15] SimulMEGA: MoE Routers are Advanced Policy Makers for Simultaneous Speech Translation

Chenyang Le, Bing Han, Jinshun Li, Songyong Chen, Yanmin Qian

Main category: cs.CL

TL;DR: SimulMEGA is an unsupervised policy learning framework for simultaneous speech translation that combines prefix-based training with a Mixture-of-Experts refiner to learn read/write decisions implicitly without inference overhead.

Details

Motivation: Existing simultaneous speech translation systems struggle to balance translation quality, latency, and semantic coherence, especially in multilingual many-to-many scenarios where divergent read and write policies hinder unified strategy learning.

Method: Uses prefix-based training with a Mixture-of-Experts refiner to learn effective read and write decisions implicitly, requiring minimal modifications to standard transformer architectures and generalizing across speech-to-text and text-to-speech streaming tasks.

Result: 500M parameter speech-to-text model outperforms Seamless baseline with under 7% BLEU degradation at 1.5s average lag and under 3% at 3s. Also successfully extended to streaming TTS with unidirectional backbone, yielding superior latency-quality tradeoffs.

Conclusion: SimulMEGA provides an effective framework for simultaneous speech translation that achieves strong performance across multiple language pairs while maintaining low latency and requiring minimal architectural changes.

Abstract: Simultaneous Speech Translation (SimulST) enables real-time cross-lingual communication by jointly optimizing speech recognition and machine translation under strict latency constraints. Existing systems struggle to balance translation quality, latency, and semantic coherence, particularly in multilingual many-to-many scenarios where divergent read and write policies hinder unified strategy learning. In this paper, we present SimulMEGA (Simultaneous Generation by Mixture-of-Experts Gating), an unsupervised policy learning framework that combines prefix-based training with a Mixture-of-Experts refiner to learn effective read and write decisions in an implicit manner, without adding inference-time overhead. Our design requires only minimal modifications to standard transformer architectures and generalizes across both speech-to-text and text-to-speech streaming tasks. Through comprehensive evaluation on six language pairs, our 500M parameter speech-to-text model outperforms the Seamless baseline, achieving under 7 percent BLEU degradation at 1.5 seconds average lag and under 3 percent at 3 seconds. We further demonstrate the versatility of SimulMEGA by extending it to streaming TTS with a unidirectional backbone, yielding superior latency quality tradeoffs.

[16] Seeing Through the MiRAGE: Evaluating Multimodal Retrieval Augmented Generation

Alexander Martin, William Walden, Reno Kriz, Dengjia Zhang, Kate Sanders, Eugene Yang, Chihsheng Jin, Benjamin Van Durme

Main category: cs.CL

TL;DR: MiRAGE is a multimodal RAG evaluation framework that addresses limitations of text-centric approaches by introducing claim-centric metrics for factuality, information coverage, and citation support.

Details

Motivation: Existing RAG evaluations are text-centric and don't verify information against multimodal sources, limiting their applicability to reasoning-intensive settings with audiovisual media.

Method: MiRAGE uses a claim-centric approach with two main metrics: InfoF1 (factuality and information coverage) and CiteF1 (citation support and completeness). The framework includes both human and automatic evaluation variants.

Result: MiRAGE strongly aligns with extrinsic quality judgments when applied by humans. The study also demonstrates limitations of text-centric metrics (ACLE, ARGUE, RAGAS) and lays groundwork for automatic multimodal RAG evaluation.

Conclusion: MiRAGE provides a comprehensive framework for evaluating multimodal RAG systems, addressing gaps in current text-centric approaches and enabling better assessment of information integration from audiovisual sources.

Abstract: We introduce MiRAGE, an evaluation framework for retrieval-augmented generation (RAG) from multimodal sources. As audiovisual media becomes a prevalent source of information online, it is essential for RAG systems to integrate information from these sources into generation. However, existing evaluations for RAG are text-centric, limiting their applicability to multimodal, reasoning intensive settings because they don’t verify information against sources. MiRAGE is a claim-centric approach to multimodal RAG evaluation, consisting of InfoF1, evaluating factuality and information coverage, and CiteF1, measuring citation support and completeness. We show that MiRAGE, when applied by humans, strongly aligns with extrinsic quality judgments. We additionally introduce automatic variants of MiRAGE and three prominent TextRAG metrics – ACLE, ARGUE, and RAGAS – demonstrating the limitations of text-centric work and laying the groundwork for automatic evaluation. We release open-source implementations and outline how to assess multimodal RAG.

[17] Idea2Plan: Exploring AI-Powered Research Planning

Jin Huang, Silviu Cucerzan, Sujay Kumar Jauhar, Ryen W. White

Main category: cs.CL

TL;DR: The paper introduces Idea2Plan task and benchmark to measure LLMs’ research planning capability, showing GPT-5 performs best but significant improvement is still needed.

Details

Motivation: To systematically understand LLMs' capability in transitioning from conceptual research ideas to well-structured research plans, which is crucial for autonomous research agents.

Method: Created Idea2Plan Bench with 200 ICML 2025 papers and Idea2Plan JudgeEval benchmark to assess LLM-based judges against expert annotations.

Result: GPT-5 and GPT-5-mini achieved the strongest performance on the benchmark, but substantial headroom remains for future improvement.

Conclusion: The study provides new insights into LLMs’ research planning capability and lays groundwork for future progress in this area.

Abstract: Large language models (LLMs) have demonstrated significant potential to accelerate scientific discovery as valuable tools for analyzing data, generating hypotheses, and supporting innovative approaches in various scientific fields. In this work, we investigate how LLMs can handle the transition from conceptual research ideas to well-structured research plans. Effective research planning not only supports scientists in advancing their research but also represents a crucial capability for the development of autonomous research agents. Despite its importance, the field lacks a systematic understanding of LLMs’ research planning capability. To rigorously measure this capability, we introduce the Idea2Plan task and Idea2Plan Bench, a benchmark built from 200 ICML 2025 Spotlight and Oral papers released after major LLM training cutoffs. Each benchmark instance includes a research idea and a grading rubric capturing the key components of valid plans. We further propose Idea2Plan JudgeEval, a complementary benchmark to assess the reliability of LLM-based judges against expert annotations. Experimental results show that GPT-5 and GPT-5-mini achieve the strongest performance on the benchmark, though substantial headroom remains for future improvement. Our study provides new insights into LLMs’ capability for research planning and lay the groundwork for future progress.

[18] RiddleBench: A New Generative Reasoning Benchmark for LLMs

Deepon Halder, Alan Saji, Thanmay Jayakumar, Ratish Puduppully, Anoop Kunchukuttan, Raj Dabre

Main category: cs.CL

TL;DR: RiddleBench is a new benchmark of 1,737 challenging puzzles that reveals fundamental weaknesses in state-of-the-art LLMs’ flexible reasoning abilities, with top models achieving only ~60% accuracy and showing issues like hallucination cascades and poor self-correction.

Details

Motivation: Current reasoning benchmarks primarily evaluate structured skills like quantitative problem-solving, leaving a gap in assessing flexible, multifaceted reasoning abilities that integrate logical deduction with spatial awareness and constraint satisfaction.

Method: Introduced RiddleBench, a benchmark of 1,737 challenging puzzles in English designed to probe core reasoning capabilities, and evaluated state-of-the-art models including Gemini 2.5 Pro, o3, and Claude 4 Sonnet.

Result: Even top proprietary models achieved accuracy just above 60% (60.30%, 63.37%, and 63.16%), revealing deep failures including hallucination cascades, poor self-correction due to self-confirmation bias, and fragile reasoning that degrades with constraint reordering or irrelevant information.

Conclusion: RiddleBench serves as both a diagnostic tool for identifying reasoning weaknesses in LLMs and a resource for guiding the development of more robust and reliable language models.

Abstract: Large Language Models have demonstrated strong performance on many established reasoning benchmarks. However, these benchmarks primarily evaluate structured skills like quantitative problem-solving, leaving a gap in assessing flexible, multifaceted reasoning abilities that are central to human intelligence. These abilities require integrating logical deduction with spatial awareness and constraint satisfaction, which current evaluations do not measure well. To address this, we introduce RiddleBench, a benchmark of 1,737 challenging puzzles in English designed to probe these core reasoning capabilities. Evaluation of state-of-the-art models on RiddleBench shows fundamental weaknesses. Even top proprietary models like Gemini 2.5 Pro, o3, and Claude 4 Sonnet achieve accuracy just above 60% (60.30%, 63.37%, and 63.16%). Analysis further reveals deep failures, including hallucination cascades (accepting flawed reasoning from other models) and poor self-correction due to a strong self-confirmation bias. Their reasoning is also fragile, with performance degrading significantly when constraints are reordered or irrelevant information is introduced. RiddleBench functions as a diagnostic tool for these issues and as a resource for guiding the development of more robust and reliable language models.

[19] Disaggregation Reveals Hidden Training Dynamics: The Case of Agreement Attraction

James A. Michaelov, Catherine Arnett

Main category: cs.CL

TL;DR: Analysis of language model grammatical errors across different syntactic contexts reveals distinct training phases where models rely on heuristics rather than generalized grammatical rules.

Details

Motivation: To better understand when and why language models make grammatical errors by examining their behavior across carefully constructed syntactic contexts during training.

Method: Using psycholinguistic paradigms to analyze model errors in different syntactic contexts, disaggregating performance across training conditions, and comparing model behavior throughout training phases.

Result: Identified distinct training phases where models align with specific heuristics (word frequency, local context) rather than generalized grammatical rules, revealing intermediate stages of grammatical learning.

Conclusion: Fine-grained analysis of language model behavior across training can serve as a powerful tool for understanding learning phases, training dynamics, and the specific generalizations learned by models.

Abstract: Language models generally produce grammatical text, but they are more likely to make errors in certain contexts. Drawing on paradigms from psycholinguistics, we carry out a fine-grained analysis of those errors in different syntactic contexts. We demonstrate that by disaggregating over the conditions of carefully constructed datasets and comparing model performance on each over the course of training, it is possible to better understand the intermediate stages of grammatical learning in language models. Specifically, we identify distinct phases of training where language model behavior aligns with specific heuristics such as word frequency and local context rather than generalized grammatical rules. We argue that taking this approach to analyzing language model behavior more generally can serve as a powerful tool for understanding the intermediate learning phases, overall training dynamics, and the specific generalizations learned by language models.

[20] SemCoT: Accelerating Chain-of-Thought Reasoning through Semantically-Aligned Implicit Tokens

Yinhan He, Wendy Zheng, Yaochen Zhu, Zaiyi Zheng, Lin Su, Sriram Vasudevan, Qi Guo, Liangjie Hong, Jundong Li

Main category: cs.CL

TL;DR: SemCoT is a novel implicit Chain-of-Thought framework that improves reasoning efficiency while preserving semantic alignment with ground-truth reasoning through contrastive training and knowledge distillation.

Details

Motivation: Existing implicit CoT methods suffer from semantic misalignment with ground-truth reasoning and neglect the time cost of generating implicit reasoning tokens, limiting their practical deployment in efficiency-critical applications.

Method: Uses a contrastively trained sentence transformer to enforce semantic alignment and an efficient implicit reasoning generator (lightweight LM) via knowledge distillation to optimize both accuracy and token-level generation speed.

Result: Extensive experiments show SemCoT outperforms state-of-the-art methods in both efficiency and effectiveness, achieving superior CoT performance while accelerating reasoning.

Conclusion: SemCoT is the first approach to jointly optimize token-level generation speed and semantic alignment preservation, enabling more efficient and effective implicit Chain-of-Thought reasoning.

Abstract: The verbosity of Chain-of-Thought (CoT) reasoning hinders its mass deployment in efficiency-critical applications. Recently, implicit CoT approaches have emerged, which encode reasoning steps within LLM’s hidden embeddings (termed ``implicit reasoning’’) rather than explicit tokens. This approach accelerates CoT by reducing the reasoning length and bypassing some LLM components. However, existing implicit CoT methods face two significant challenges: (1) they fail to preserve the semantic alignment between the implicit reasoning (when transformed to natural language) and the ground-truth reasoning, resulting in a significant CoT performance degradation, and (2) they focus on reducing the length of the implicit reasoning; however, they neglect the considerable time cost for an LLM to generate one individual implicit reasoning token. To tackle these challenges, we propose a novel semantically-aligned implicit CoT framework termed SemCoT. In particular, for the first challenge, we design a contrastively trained sentence transformer that evaluates semantic alignment between implicit and explicit reasoning, which is used to enforce semantic preservation during implicit reasoning optimization. To address the second challenge, we introduce an efficient implicit reasoning generator by finetuning a lightweight language model using knowledge distillation. This generator is guided by our sentence transformer to distill ground-truth reasoning into semantically aligned implicit reasoning, while also optimizing for accuracy. SemCoT is the first approach that enhances CoT efficiency by jointly optimizing token-level generation speed and preserving semantic alignment with ground-truth reasoning. Extensive experiments demonstrate the superior performance of SemCoT compared to state-of-the-art methods in both efficiency and effectiveness. Our code can be found at https://github.com/YinhanHe123/SemCoT/.

[21] Language Model Behavioral Phases are Consistent Across Architecture, Training Data, and Scale

James A. Michaelov, Roger P. Levy, Benjamin K. Bergen

Main category: cs.CL

TL;DR: Language models across architectures, datasets, and scales show consistent behavioral patterns during pretraining, with 98% of word-level variance explained by three simple heuristics: word frequency, n-gram probability, and semantic similarity to context.

Details

Motivation: To understand the consistent learning patterns in autoregressive language models regardless of their specific architecture, training data, or scale.

Method: Analyzed over 1,400 language model checkpoints on 110,000+ English tokens across different architectures (Transformer, Mamba, RWKV), datasets (OpenWebText, The Pile), and scales (14M to 12B parameters).

Result: Found that up to 98% of variance in language model behavior can be explained by three simple heuristics. Models consistently progress through behavioral phases where they overfit to n-gram probabilities with increasing n during training.

Conclusion: Neural language models follow similar learning trajectories irrespective of model details, suggesting fundamental patterns in how they acquire language capabilities.

Abstract: We show that across architecture (Transformer vs. Mamba vs. RWKV), training dataset (OpenWebText vs. The Pile), and scale (14 million parameters to 12 billion parameters), autoregressive language models exhibit highly consistent patterns of change in their behavior over the course of pretraining. Based on our analysis of over 1,400 language model checkpoints on over 110,000 tokens of English, we find that up to 98% of the variance in language model behavior at the word level can be explained by three simple heuristics: the unigram probability (frequency) of a given word, the $n$-gram probability of the word, and the semantic similarity between the word and its context. Furthermore, we see consistent behavioral phases in all language models, with their predicted probabilities for words overfitting to those words’ $n$-gram probabilities for increasing $n$ over the course of training. Taken together, these results suggest that learning in neural language models may follow a similar trajectory irrespective of model details.

[22] POWSM: A Phonetic Open Whisper-Style Speech Foundation Model

Chin-Jou Li, Kalvin Chang, Shikhar Bharadwaj, Eunjung Yeo, Kwanghee Choi, Jian Zhu, David Mortensen, Shinji Watanabe

Main category: cs.CL

TL;DR: POWSM is the first unified framework that jointly performs multiple phonetic tasks including automatic speech recognition, phone recognition, grapheme-to-phoneme conversion, and phoneme-to-grapheme conversion.

Details

Motivation: Despite conceptual similarity, phonetic tasks have been studied in isolation with task-specific architectures and datasets, creating inefficiencies and limiting universal speech processing capabilities.

Method: Introduced POWSM (Phonetic Open Whisper-style Speech Model), a unified framework that enables seamless conversion between audio, text (graphemes), and phones.

Result: Outperforms or matches specialized phone recognition models of similar size (Wav2Vec2Phoneme and ZIPA) while jointly supporting G2P, P2G, and ASR tasks.

Conclusion: POWSM opens up new possibilities for universal and low-resource speech processing, with training data, code and models released to foster open science.

Abstract: Recent advances in spoken language processing have led to substantial progress in phonetic tasks such as automatic speech recognition (ASR), phone recognition (PR), grapheme-to-phoneme conversion (G2P), and phoneme-to-grapheme conversion (P2G). Despite their conceptual similarity, these tasks have largely been studied in isolation, each relying on task-specific architectures and datasets. In this paper, we introduce POWSM (Phonetic Open Whisper-style Speech Model), the first unified framework capable of jointly performing multiple phone-related tasks. POWSM enables seamless conversion between audio, text (graphemes), and phones, opening up new possibilities for universal and low-resource speech processing. Our model outperforms or matches specialized PR models of similar size (Wav2Vec2Phoneme and ZIPA) while jointly supporting G2P, P2G, and ASR. Our training data, code and models are released to foster open science.

[23] Emergence of Minimal Circuits for Indirect Object Identification in Attention-Only Transformers

Rabin Adhikari

Main category: cs.CL

TL;DR: Training small attention-only transformers on symbolic IOI task reveals minimal interpretable circuits - single-layer 2-head model achieves perfect accuracy through specialized additive and contrastive subcircuits.

Details

Motivation: To understand minimal mechanisms required for reasoning tasks in transformers by studying simplified models trained from scratch, avoiding complexity of pretrained LLMs.

Method: Train small attention-only transformers from scratch on symbolic IOI task, using residual stream decomposition, spectral analysis, and embedding interventions to analyze circuits.

Result: Single-layer model with 2 attention heads achieves perfect IOI accuracy; heads specialize into additive and contrastive subcircuits; two-layer one-head model also performs well through layer composition.

Conclusion: Task-specific training induces highly interpretable minimal circuits, providing controlled testbed for studying transformer reasoning foundations.

Abstract: Mechanistic interpretability aims to reverse-engineer large language models (LLMs) into human-understandable computational circuits. However, the complexity of pretrained models often obscures the minimal mechanisms required for specific reasoning tasks. In this work, we train small, attention-only transformers from scratch on a symbolic version of the Indirect Object Identification (IOI) task – a benchmark for studying coreference – like reasoning in transformers. Surprisingly, a single-layer model with only two attention heads achieves perfect IOI accuracy, despite lacking MLPs and normalization layers. Through residual stream decomposition, spectral analysis, and embedding interventions, we find that the two heads specialize into additive and contrastive subcircuits that jointly implement IOI resolution. Furthermore, we show that a two-layer, one-head model achieves similar performance by composing information across layers through query-value interactions. These results demonstrate that task-specific training induces highly interpretable, minimal circuits, offering a controlled testbed for probing the computational foundations of transformer reasoning.

[24] GAPMAP: Mapping Scientific Knowledge Gaps in Biomedical Literature Using Large Language Models

Nourah M Salem, Elizabeth White, Michael Bada, Lawrence Hunter

Main category: cs.CL

TL;DR: This study investigates LLMs’ ability to identify both explicit and implicit knowledge gaps in biomedical literature, introducing TABI framework for structured reasoning and showing robust performance across models.

Details

Motivation: To explore LLMs' capability in systematically identifying research knowledge gaps to support scientific progress, early-stage research formulation, and funding decisions.

Method: Used TABI (Toulmin-Abductive Bucketed Inference) scheme for structured reasoning, benchmarked OpenAI, Llama, and Gemma 2 models on 1500 documents across four datasets under paragraph-level and full-paper settings.

Result: LLMs demonstrated robust capability in identifying both explicit and implicit knowledge gaps, with larger models performing better across both open- and closed-weight variants.

Conclusion: LLMs show strong potential for systematic knowledge gap identification, with recommendations for domain adaptation, human-in-the-loop verification, and cross-model benchmarking for robust deployment.

Abstract: Scientific progress is driven by the deliberate articulation of what remains unknown. This study investigates the ability of large language models (LLMs) to identify research knowledge gaps in the biomedical literature. We define two categories of knowledge gaps: explicit gaps, clear declarations of missing knowledge; and implicit gaps, context-inferred missing knowledge. While prior work has focused mainly on explicit gap detection, we extend this line of research by addressing the novel task of inferring implicit gaps. We conducted two experiments on almost 1500 documents across four datasets, including a manually annotated corpus of biomedical articles. We benchmarked both closed-weight models (from OpenAI) and open-weight models (Llama and Gemma 2) under paragraph-level and full-paper settings. To address the reasoning of implicit gaps inference, we introduce \textbf{\small TABI}, a Toulmin-Abductive Bucketed Inference scheme that structures reasoning and buckets inferred conclusion candidates for validation. Our results highlight the robust capability of LLMs in identifying both explicit and implicit knowledge gaps. This is true for both open- and closed-weight models, with larger variants often performing better. This suggests a strong ability of LLMs for systematically identifying candidate knowledge gaps, which can support early-stage research formulation, policymakers, and funding decisions. We also report observed failure modes and outline directions for robust deployment, including domain adaptation, human-in-the-loop verification, and benchmarking across open- and closed-weight models.

[25] Can LLMs Estimate Cognitive Complexity of Reading Comprehension Items?

Seonjeong Hwang, Hyounghun Kim, Gary Geunbae Lee

Main category: cs.CL

TL;DR: LLMs can estimate cognitive complexity of reading comprehension items by analyzing Evidence Scope and Transformation Level, showing potential for prior difficulty assessment, though they sometimes lack metacognitive awareness of their own reasoning processes.

Details

Motivation: Cognitive features in reading comprehension items are traditionally hard to extract automatically and rely on human annotation, unlike syntactic/semantic features. The study aims to see if LLMs can estimate these cognitive complexity dimensions.

Method: The study examines whether LLMs can estimate cognitive complexity by focusing on two dimensions: Evidence Scope (how much text evidence is needed) and Transformation Level (cognitive processing required), comparing LLM performance with human annotations.

Result: LLMs can approximate the cognitive complexity of reading comprehension items, demonstrating potential as tools for prior difficulty analysis. However, they sometimes fail to correctly identify the reasoning features underlying their own correct answers.

Conclusion: LLMs show promise for estimating cognitive complexity in reading comprehension assessment, but there’s a gap between their reasoning ability and metacognitive awareness of their own cognitive processes.

Abstract: Estimating the cognitive complexity of reading comprehension (RC) items is crucial for assessing item difficulty before it is administered to learners. Unlike syntactic and semantic features, such as passage length or semantic similarity between options, cognitive features that arise during answer reasoning are not readily extractable using existing NLP tools and have traditionally relied on human annotation. In this study, we examine whether large language models (LLMs) can estimate the cognitive complexity of RC items by focusing on two dimensions-Evidence Scope and Transformation Level-that indicate the degree of cognitive burden involved in reasoning about the answer. Our experimental results demonstrate that LLMs can approximate the cognitive complexity of items, indicating their potential as tools for prior difficulty analysis. Further analysis reveals a gap between LLMs’ reasoning ability and their metacognitive awareness: even when they produce correct answers, they sometimes fail to correctly identify the features underlying their own reasoning process.

[26] TOPol: Capturing and Explaining Multidimensional Semantic Polarity Fields and Vectors

Gabin Taibi, Lucia Gomez

Main category: cs.CL

TL;DR: TOPol is a semi-unsupervised framework that reconstructs multidimensional narrative polarity fields using transformer embeddings, UMAP projection, and topic segmentation to quantify semantic displacement during discourse regime shifts.

Details

Motivation: Traditional sentiment analysis treats polarity as unidimensional, overlooking the complex multidimensional structure of language and semantic shifts in discourse.

Method: Uses transformer-based embeddings, neighbor-tuned UMAP projection, Leiden topic partitioning, computes directional vectors between topic-boundary centroids, and employs human-on-the-loop contextual boundaries to generate interpretable polarity fields.

Result: Successfully captures both affective (Amazon reviews aligned with NRC valence) and non-affective (Central Bank speeches) polarity transitions, with robustness analyses showing only CB definitions significantly affect results.

Conclusion: TOPol provides a scalable, generalizable, and interpretable framework for context-sensitive multidimensional discourse analysis, overcoming limitations of traditional unidimensional sentiment approaches.

Abstract: Traditional approaches to semantic polarity in computational linguistics treat sentiment as a unidimensional scale, overlooking the multidimensional structure of language. This work introduces TOPol (Topic-Orientation POLarity), a semi-unsupervised framework for reconstructing and interpreting multidimensional narrative polarity fields under human-on-the-loop (HoTL) defined contextual boundaries (CBs). The framework embeds documents using a transformer-based large language model (tLLM), applies neighbor-tuned UMAP projection, and segments topics via Leiden partitioning. Given a CB between discourse regimes A and B, TOPol computes directional vectors between corresponding topic-boundary centroids, yielding a polarity field that quantifies fine-grained semantic displacement during regime shifts. This vectorial representation enables assessing CB quality and detecting polarity changes, guiding HoTL CB refinement. To interpret identified polarity vectors, the tLLM compares their extreme points and produces contrastive labels with estimated coverage. Robustness analyses show that only CB definitions (the main HoTL-tunable parameter) significantly affect results, confirming methodological stability. We evaluate TOPol on two corpora: (i) U.S. Central Bank speeches around a macroeconomic breakpoint, capturing non-affective semantic shifts, and (ii) Amazon product reviews across rating strata, where affective polarity aligns with NRC valence. Results demonstrate that TOPol consistently captures both affective and non-affective polarity transitions, providing a scalable, generalizable, and interpretable framework for context-sensitive multidimensional discourse analysis.

[27] BioCoref: Benchmarking Biomedical Coreference Resolution with LLMs

Nourah M Salem, Elizabeth White, Michael Bada, Lawrence Hunter

Main category: cs.CL

TL;DR: Evaluation of generative LLMs for biomedical coreference resolution using CRAFT corpus, comparing with discriminative SpanBERT and testing various prompting strategies.

Details

Motivation: Biomedical coreference resolution faces challenges from domain-specific terminology, mention ambiguity, and long-distance dependencies, requiring specialized approaches.

Method: Used CRAFT corpus benchmark with four prompting experiments varying local context, contextual enrichment, and domain cues like abbreviations and entity dictionaries, compared against SpanBERT.

Result: LLMs show strong surface-level coreference capabilities with domain-grounding prompts, but performance remains sensitive to long-range context and mention ambiguity. LLaMA 8B/17B models achieved superior precision and F1 scores with entity-augmented prompting.

Conclusion: Lightweight prompt engineering can enhance LLM utility in biomedical NLP tasks, with entity-augmented prompting showing particular promise for improving coreference resolution performance.

Abstract: Coreference resolution in biomedical texts presents unique challenges due to complex domain-specific terminology, high ambiguity in mention forms, and long-distance dependencies between coreferring expressions. In this work, we present a comprehensive evaluation of generative large language models (LLMs) for coreference resolution in the biomedical domain. Using the CRAFT corpus as our benchmark, we assess the LLMs’ performance with four prompting experiments that vary in their use of local, contextual enrichment, and domain-specific cues such as abbreviations and entity dictionaries. We benchmark these approaches against a discriminative span-based encoder, SpanBERT, to compare the efficacy of generative versus discriminative methods. Our results demonstrate that while LLMs exhibit strong surface-level coreference capabilities, especially when supplemented with domain-grounding prompts, their performance remains sensitive to long-range context and mentions ambiguity. Notably, the LLaMA 8B and 17B models show superior precision and F1 scores under entity-augmented prompting, highlighting the potential of lightweight prompt engineering for enhancing LLM utility in biomedical NLP tasks.

[28] DEBATE: A Large-Scale Benchmark for Role-Playing LLM Agents in Multi-Agent, Long-Form Debates

Yun-Shiuan Chuang, Ruixuan Tu, Chengtao Dai, Smit Vasani, Binwei Yao, Michael Henry Tessler, Sijia Yang, Dhavan Shah, Robert Hawkins, Junjie Hu, Timothy T. Rogers

Main category: cs.CL

TL;DR: DEBATE is the first large-scale benchmark for evaluating multi-agent LLM role-playing authenticity, containing 29,417 messages from human debates on 107 controversial topics, revealing critical gaps between simulated and authentic group dynamics.

Details

Motivation: Current LLM role-play setups produce unnatural dynamics without empirical benchmarks to measure authentic human opinion trajectories, making it difficult to address issues like misinformation and polarization through realistic social interaction simulations.

Method: Created DEBATE benchmark with 29,417 messages from multi-round debates among 2,792 US participants discussing 107 controversial topics, capturing both public messages and private opinions. Used this to systematically evaluate LLM role-playing and perform supervised fine-tuning for alignment.

Result: Identified critical discrepancies between simulated and authentic group dynamics. Supervised fine-tuning improved surface-level metrics (ROUGE-L, message length) but showed limitations in deeper semantic alignment (semantic similarity).

Conclusion: Role-playing LLM agents show potential but have current limitations in realistically simulating human-like social dynamics, with improvements needed for deeper semantic alignment beyond surface-level metrics.

Abstract: Accurately modeling opinion change through social interactions is crucial for addressing issues like misinformation and polarization. While role-playing large language models (LLMs) offer a promising way to simulate human-like interactions, existing research shows that single-agent alignment does not guarantee authentic multi-agent group dynamics. Current LLM role-play setups often produce unnatural dynamics (e.g., premature convergence), without an empirical benchmark to measure authentic human opinion trajectories. To bridge this gap, we introduce DEBATE, the first large-scale empirical benchmark explicitly designed to evaluate the authenticity of the interaction between multi-agent role-playing LLMs. DEBATE contains 29,417 messages from multi-round debate conversations among over 2,792 U.S.-based participants discussing 107 controversial topics, capturing both publicly-expressed messages and privately-reported opinions. Using DEBATE, we systematically evaluate and identify critical discrepancies between simulated and authentic group dynamics. We further demonstrate DEBATE’s utility for aligning LLMs with human behavior through supervised fine-tuning, achieving improvements in surface-level metrics (e.g., ROUGE-L and message length) while highlighting limitations in deeper semantic alignment (e.g., semantic similarity). Our findings highlight both the potential and current limitations of role-playing LLM agents for realistically simulating human-like social dynamics.

[29] Pretraining Strategies using Monolingual and Parallel Data for Low-Resource Machine Translation

Idriss Nguepi Nguefack, Mara Finkelstein, Toadoum Sari Sakayo

Main category: cs.CL

TL;DR: This paper examines pretraining strategies for machine translation models in low-resource languages, focusing on Lingala and testing methods including multilingual pretraining and using both monolingual and parallel data.

Details

Motivation: To address the performance gap between high-resource and low-resource languages in machine translation, and develop more inclusive NLP models for marginalized communities and underrepresented populations.

Method: Building on Reid and Artetxe’s (2021) pretraining approach, the study explores various pretraining methodologies including multilingual pretraining (using Afrikaans, Swahili, Zulu) and leveraging both monolingual and parallel data during pretraining phase.

Result: Pretraining on multiple languages and using both monolingual and parallel data significantly enhances translation quality for low-resource languages.

Conclusion: The study provides valuable insights into effective pretraining strategies for low-resource machine translation and contributes to developing more inclusive NLP models. Code and datasets are publicly available to support reproducibility and further research.

Abstract: This research article examines the effectiveness of various pretraining strategies for developing machine translation models tailored to low-resource languages. Although this work considers several low-resource languages, including Afrikaans, Swahili, and Zulu, the translation model is specifically developed for Lingala, an under-resourced African language, building upon the pretraining approach introduced by Reid and Artetxe (2021), originally designed for high-resource languages. Through a series of comprehensive experiments, we explore different pretraining methodologies, including the integration of multiple languages and the use of both monolingual and parallel data during the pretraining phase. Our findings indicate that pretraining on multiple languages and leveraging both monolingual and parallel data significantly enhance translation quality. This study offers valuable insights into effective pretraining strategies for low-resource machine translation, helping to bridge the performance gap between high-resource and low-resource languages. The results contribute to the broader goal of developing more inclusive and accurate NLP models for marginalized communities and underrepresented populations. The code and datasets used in this study are publicly available to facilitate further research and ensure reproducibility, with the exception of certain data that may no longer be accessible due to changes in public availability.

[30] A Survey on Unlearning in Large Language Models

Ruichen Qiu, Jiajun Tan, Jiayue Pu, Honglin Wang, Xiao-Shan Gao, Fei Sun

Main category: cs.CL

TL;DR: This survey systematically reviews over 180 papers on LLM unlearning since 2021, introducing novel taxonomies for methods and evaluations to guide selective knowledge removal while maintaining model performance.

Details

Motivation: LLMs trained on massive corpora risk memorizing sensitive data, copyrighted material, and harmful knowledge. Machine unlearning addresses legal/ethical requirements like 'right to be forgotten' by selectively erasing specific knowledge without compromising overall performance.

Method: Introduces novel taxonomies categorizing unlearning methods into training-time, post-training, and inference-time based on when unlearning is applied. Systematically compiles and analyzes existing datasets, metrics, and their applicability.

Result: Provides comprehensive overview of LLM unlearning field with practical guidance for researchers. Categorizes methods and critically evaluates evaluation approaches.

Conclusion: The survey aims to inform and guide development of secure and reliable LLMs by addressing key challenges and identifying promising future research directions in machine unlearning.

Abstract: The advancement of Large Language Models (LLMs) has revolutionized natural language processing, yet their training on massive corpora poses significant risks, including the memorization of sensitive personal data, copyrighted material, and knowledge that could facilitate malicious activities. To mitigate these issues and align with legal and ethical standards such as the “right to be forgotten”, machine unlearning has emerged as a critical technique to selectively erase specific knowledge from LLMs without compromising their overall performance. This survey provides a systematic review of over 180 papers on LLM unlearning published since 2021, focusing exclusively on large-scale generative models. Distinct from prior surveys, we introduce novel taxonomies for both unlearning methods and evaluations. We clearly categorize methods into training-time, post-training, and inference-time based on the training stage at which unlearning is applied. For evaluations, we not only systematically compile existing datasets and metrics but also critically analyze their advantages, disadvantages, and applicability, providing practical guidance to the research community. In addition, we discuss key challenges and promising future research directions. Our comprehensive overview aims to inform and guide the ongoing development of secure and reliable LLMs.

[31] Explainable Disentanglement on Discrete Speech Representations for Noise-Robust ASR

Shreyas Gopal, Ashutosh Anshul, Haoyang Li, Yue Heng Yeo, Hexin Liu, Eng Siong Chng

Main category: cs.CL

TL;DR: The paper proposes a method to disentangle semantic speech content from background noise in discrete audio representations, improving noise-invariance and ASR performance while keeping Whisper frozen.

Details

Motivation: Discrete audio representations are gaining popularity but are not optimized for noisy environments. Existing works quantize Whisper embeddings but don't explicitly separate speech from noise.

Method: An end-to-end model that separates clean speech as codebook tokens while extracting interpretable noise vectors as quantization residue, supervised via a lightweight classifier.

Result: 82% reduction in error rate compared to Whisper and 35% improvement over baseline methods on VBDemand test set. The learned token space generalizes well to both seen and unseen acoustic conditions.

Conclusion: The approach successfully disentangles speech content from noise, producing noise-invariant speech tokens that improve ASR performance and maintain good generalization across different acoustic conditions.

Abstract: Discrete audio representations are gaining traction in speech modeling due to their interpretability and compatibility with large language models, but are not always optimized for noisy or real-world environments. Building on existing works that quantize Whisper embeddings for speech-to-unit modeling, we propose disentangling semantic speech content from background noise in the latent space. Our end-to-end model separates clean speech in the form of codebook tokens, while extracting interpretable noise vectors as quantization residue which are supervised via a lightweight classifier. We show that our approach improves alignment between clean/noisy speech and text, producing speech tokens that display a high degree of noiseinvariance, and improves ASR performance. Keeping Whisper frozen, we show an 82% reduction in error rate compared to Whisper, and 35% improvement over baseline methods on the VBDemand test set. Further analyses show that the learned token space generalizes well to both seen and unseen acoustic conditions.

[32] Model-Document Protocol for AI Search

Hongjin Qian, Zheng Liu

Main category: cs.CL

TL;DR: The paper introduces Model-Document Protocol (MDP), a framework that transforms unstructured documents into LLM-ready knowledge representations through agentic reasoning, memory grounding, and structured leveraging, with MDP-Agent implementation showing superior performance.

Details

Motivation: Current retrieval methods return raw passages that burden LLMs with fragment assembly and contextual reasoning, creating a gap in how models interact with documents.

Method: MDP framework defines three pathways: agentic reasoning (curating evidence into coherent context), memory grounding (accumulating reusable notes), and structured leveraging (encoding documents into formal representations). MDP-Agent implements this through document-level gist memories, diffusion-based exploration with vertical exploitation, and map-reduce synthesis.

Result: Experiments on information-seeking benchmarks show MDP-Agent outperforms baselines, validating both the MDP framework and its agentic instantiation.

Conclusion: MDP provides a new retrieval paradigm that transforms raw documents into compact, structured knowledge directly consumable for LLM reasoning, addressing the limitations of conventional retrieval methods.

Abstract: AI search depends on linking large language models (LLMs) with vast external knowledge sources. Yet web pages, PDF files, and other raw documents are not inherently LLM-ready: they are long, noisy, and unstructured. Conventional retrieval methods treat these documents as verbatim text and return raw passages, leaving the burden of fragment assembly and contextual reasoning to the LLM. This gap underscores the need for a new retrieval paradigm that redefines how models interact with documents. We introduce the Model-Document Protocol (MDP), a general framework that formalizes how raw text is bridged to LLMs through consumable knowledge representations. Rather than treating retrieval as passage fetching, MDP defines multiple pathways that transform unstructured documents into task-specific, LLM-ready inputs. These include agentic reasoning, which curates raw evidence into coherent context; memory grounding, which accumulates reusable notes to enrich reasoning; and structured leveraging, which encodes documents into formal representations such as graphs or key-value caches. All three pathways share the same goal: ensuring that what reaches the LLM is not raw fragments but compact, structured knowledge directly consumable for reasoning. As an instantiation, we present MDP-Agent, which realizes the protocol through an agentic process: constructing document-level gist memories for global coverage, performing diffusion-based exploration with vertical exploitation to uncover layered dependencies, and applying map-reduce style synthesis to integrate large-scale evidence into compact yet sufficient context. Experiments on information-seeking benchmarks demonstrate that MDP-Agent outperforms baselines, validating both the soundness of the MDP framework and the effectiveness of its agentic instantiation.

[33] Testing Cross-Lingual Text Comprehension In LLMs Using Next Sentence Prediction

Ritesh Sunil Chavan, Jack Mostow

Main category: cs.CL

TL;DR: LLMs perform well in English but struggle in low-resource languages like Hausa. Chain-of-Thought prompting helps weaker models like LLaMA 3 but can hurt stronger models like GPT-4 and Gemini in cross-lingual tasks.

Details

Motivation: To determine if LLMs' impressive performance reflects genuine ability or just data advantage from English-heavy training, by testing them in low-resource language settings.

Method: Created a large-scale benchmark with 10,000 questions each for English, Swahili, and Hausa. Tested GPT-4 Turbo, Gemini 1.5 Flash, and LLaMA 3 70B using Next Sentence Prediction (NSP) with and without Chain-of-Thought prompting.

Result: All models excelled in English but performance dropped in Swahili and fell sharply in Hausa. LLaMA 3 struggled most. CoT significantly boosted LLaMA 3’s accuracy but often backfired for GPT-4 and Gemini, causing ‘overthinking’ that hurt results.

Conclusion: Chain-of-Thought is not a universal solution; its effectiveness depends on the model’s baseline capability and task context. The framework reveals LLM weaknesses and when CoT helps or hinders cross-lingual performance.

Abstract: While large language models are trained on massive datasets, this data is heavily skewed towards English. Does their impressive performance reflect genuine ability or just this data advantage? To find out, we tested them in a setting where they could not rely on data abundance: low-resource languages. Building on prior work Agarwal et al. (2025) that used Next Sentence Prediction (NSP) as a test, we created a large-scale benchmark with 10,000 questions each for English (a high-resource language), Swahili (medium-resource), and Hausa (low-resource). We then tested several top models, including GPT-4 Turbo, Gemini 1.5 Flash, and LLaMA 3 70B, to see how their performance holds up. The results painted a clear picture of how levels of language resources impact outcomes. While all models excelled in English, their accuracy dropped in Swahili and fell sharply in Hausa, with LLaMA 3 struggling the most. The story became even more interesting when we introduced Chain-of-Thought (CoT) prompting. For the struggling LLaMA 3, CoT acted as a helpful guide, significantly boosting its accuracy. However, for the more capable GPT-4 and Gemini, the same technique often backfired, leading to a kind of “overthinking” that hurt their results in the cross-lingual context. This reveals that Chain-of-Thought is not a universal solution; its effectiveness depends heavily on the model’s baseline capability and the specific context of the task. Our framework pinpoints LLM weaknesses, highlights when CoT helps or hinders cross-lingual NSP performance, and factors influencing their decisions.

[34] ProMediate: A Socio-cognitive framework for evaluating proactive agents in multi-party negotiation

Ziyi Liu, Bahar Sarrafzadeh, Pei Zhou, Longqi Yang, Jieyu Zhao, Ashish Sharma

Main category: cs.CL

TL;DR: ProMediate is the first framework for evaluating proactive AI mediator agents in complex multi-party negotiations, featuring realistic testbeds and socio-cognitive metrics that show socially intelligent mediators outperform generic baselines.

Details

Motivation: There's a growing need for AI agents that can proactively manage complex multi-party collaboration, but systematic evaluation methods for such proactive agents remain scarce, limiting progress in developing AI that can effectively support multiple people together.

Method: ProMediate consists of: (1) a simulation testbed based on realistic negotiation cases with three difficulty levels and a plug-and-play proactive AI mediator grounded in socio-cognitive theories, and (2) a socio-cognitive evaluation framework with metrics for consensus changes, intervention latency, mediator effectiveness, and intelligence.

Result: A socially intelligent mediator agent outperforms a generic baseline with faster, better-targeted interventions. In the hardest setting, the social mediator increases consensus change by 3.6 percentage points (10.65% vs 7.01%) while being 77% faster in response (15.98s vs. 3.71s).

Conclusion: ProMediate provides a rigorous, theory-grounded testbed to advance the development of proactive, socially intelligent agents for multi-party settings.

Abstract: While Large Language Models (LLMs) are increasingly used in agentic frameworks to assist individual users, there is a growing need for agents that can proactively manage complex, multi-party collaboration. Systematic evaluation methods for such proactive agents remain scarce, limiting progress in developing AI that can effectively support multiple people together. Negotiation offers a demanding testbed for this challenge, requiring socio-cognitive intelligence to navigate conflicting interests between multiple participants and multiple topics and build consensus. Here, we present ProMediate, the first framework for evaluating proactive AI mediator agents in complex, multi-topic, multi-party negotiations. ProMediate consists of two core components: (i) a simulation testbed based on realistic negotiation cases and theory-driven difficulty levels (ProMediate-Easy, ProMediate-Medium, and ProMediate-Hard), with a plug-and-play proactive AI mediator grounded in socio-cognitive mediation theories, capable of flexibly deciding when and how to intervene; and (ii) a socio-cognitive evaluation framework with a new suite of metrics to measure consensus changes, intervention latency, mediator effectiveness, and intelligence. Together, these components establish a systematic framework for assessing the socio-cognitive intelligence of proactive AI agents in multi-party settings. Our results show that a socially intelligent mediator agent outperforms a generic baseline, via faster, better-targeted interventions. In the ProMediate-Hard setting, our social mediator increases consensus change by 3.6 percentage points compared to the generic baseline (10.65% vs 7.01%) while being 77% faster in response (15.98s vs. 3.71s). In conclusion, ProMediate provides a rigorous, theory-grounded testbed to advance the development of proactive, socially intelligent agents.

[35] Adapting Small Language Models to Low-Resource Domains: A Case Study in Hindi Tourism QA

Sandipan Majhi, Paheli Bhattacharya

Main category: cs.CL

TL;DR: A multi-stage finetuning strategy using synthetic data from large LLMs enables lightweight models to perform Hindi tourism domain QA effectively.

Details

Motivation: Address scarcity of annotated datasets and limited domain knowledge in general-purpose models for low-resource language domain-specific QA.

Method: Multi-stage finetuning with synthetic QA pairs generated by large LLMs (LLaMA-70B, Phi-14B) to augment limited original data, exploring various training methodologies.

Result: Large models efficiently generate synthetic data, while small models effectively adapt to it, enabling scalable domain-specific QA.

Conclusion: Synthetic data generation and multi-stage finetuning provide a scalable pathway for low-resource, domain-specific question answering.

Abstract: Domain-specific question answering in low-resource languages faces two key challenges: scarcity of annotated datasets and limited domain knowledge in general-purpose language models. In this work, we present a multi-stage finetuning strategy to adapt lightweight language models to the Hindi tourism domain by leveraging both original and synthetic training data. Synthetic question-answer pairs are generated using large LLMs (LLaMA-70B, Phi-14B) and used to augment the limited original dataset. We explore several training methodologies and analyse their impact on domain generalisation. Our results demonstrate that large models can efficiently generate synthetic data, while small models can effectively adapt to it, offering a scalable pathway for low-resource, domain-specific QA.

[36] Teaching Sarcasm: Few-Shot Multimodal Sarcasm Detection via Distillation to a Parameter-Efficient Student

Soumyadeep Jana, Sanasam Ranbir Singh

Main category: cs.CL

TL;DR: PEKD enhances parameter-efficient fine-tuning (PEFT) methods for multimodal sarcasm detection in low-resource settings by distilling knowledge from an expert teacher model and using an entropy-aware gating mechanism to dynamically adjust distillation strength.

Details

Motivation: Multimodal sarcasm detection is challenging in low-resource settings due to subtle image-text contradictions and scarce annotated data. Existing PEFT methods struggle to reach optimal performance with limited supervision from few-shot data.

Method: Propose PEKD framework that enhances PEFT methods via distillation from an expert teacher model trained on large-scale sarcasm data. Introduce an entropy-aware gating mechanism to dynamically adjust distillation strength based on teacher confidence.

Result: Experiments on two public datasets show PEKD enables PEFT methods to outperform prior parameter-efficient approaches and large multimodal models, achieving strong results in few-shot scenarios.

Conclusion: PEKD is a modular and adaptable framework that works with various multimodal models and tasks, effectively addressing low-resource multimodal sarcasm detection challenges.

Abstract: Multimodal sarcasm detection is challenging, especially in low-resource settings where subtle image-text contradictions are hard to learn due to scarce annotated data, which hinders the model’s performance. Parameter-efficient fine-tuning (PEFT) methods like adapters, LoRA, and prompt tuning reduce overfitting but struggle to reach optimal performance due to limited supervision from few-shot data. We propose PEKD, a unified framework that enhances PEFT methods via distillation from an expert model trained on large-scale sarcasm data, which acts as the teacher. To mitigate unreliable signals from the teacher, we introduce an entropy-aware gating mechanism that dynamically adjusts the distillation strength based on teacher confidence. Experiments on two public datasets demonstrate that our PEKD framework enables PEFT methods to outperform both prior parameter-efficient approaches and large multimodal models, achieving strong results in the few-shot scenario. The framework is modular and adaptable to a wide range of multimodal models and tasks.

[37] Parrot: A Training Pipeline Enhances Both Program CoT and Natural Language CoT for Reasoning

Senjie Jin, Lu Chen, Zhiheng Xi, Yuhui Wang, Sirui Song, Yuhao Zhou, Xinbo Zhang, Peng Sun, Hong Lu, Tao Gui, Qi Zhang, Xuanjing Huang

Main category: cs.CL

TL;DR: Parrot is a training pipeline that enables mutual enhancement between Natural Language Chain-of-Thought (N-CoT) and Program Chain-of-Thought (P-CoT) paradigms for mathematical reasoning in LLMs, achieving significant performance gains.

Details

Motivation: Current research typically focuses on unidirectional enhancement between N-CoT and P-CoT paradigms, but the authors seek to fully unleash both paradigms' strengths for mutual enhancement and simultaneous improvements.

Method: Proposed Parrot pipeline with three components: 1) Three target-designed subtasks integrating sequential P-CoT and N-CoT generation, 2) Subtask hybrid training strategy for natural language semantic transferability, 3) Converted N-CoT auxiliary reward to alleviate sparse rewards in P-CoT optimization.

Result: Parrot significantly enhances both N-CoT and P-CoT performance, especially on N-CoT. Using Parrot SFT, LLaMA2 and CodeLLaMA achieve gains of +21.87 and +21.48 on MathQA over the RL baseline.

Conclusion: The Parrot training pipeline successfully enables mutual enhancement between N-CoT and P-CoT paradigms, achieving substantial performance improvements in mathematical reasoning tasks while being more resource-efficient than RL baselines.

Abstract: Natural language chain-of-thought (N-CoT) and Program chain-of-thought (P-CoT) have emerged as two primary paradigms for large language models (LLMs) to solve mathematical reasoning problems. Current research typically endeavors to achieve unidirectional enhancement: P-CoT enhanced N-CoT or N-CoT enhanced P-CoT. In this paper, we seek to fully unleash the two paradigms’ strengths for mutual enhancement and ultimately achieve simultaneous improvements. We conduct a detailed analysis of the error types across two paradigms, based on which we propose Parrot, a novel training pipeline for mathematical problems: 1) Three target-designed subtasks integrate sequential P-CoT and N-CoT generation. 2) A subtask hybrid training strategy to facilitate natural language semantic transferability. 3) The converted N-CoT auxiliary reward is designed to alleviate the sparse rewards in P-CoT optimization. Extensive experiments demonstrate that Parrot significantly enhances both the performance of N-CoT and P-CoT, especially on N-CoT. Using Parrot SFT, the N-CoT performance of LLaMA2 and CodeLLaMA achieve gains of +21.87 and +21.48 on MathQA over the RL baseline, which is resource-intensive.

[38] CRMWeaver: Building Powerful Business Agent via Agentic RL and Shared Memories

Yilong Lai, Yipin Yang, Jialong Wu, Fengran Mo, Zhenglin Wang, Ting Liang, Jianguo Lin, Keping Yang

Main category: cs.CL

TL;DR: CRMWeaver enhances LLM-based business agents for complex data environments using synthesis data generation, RL training, and shared memories mechanism, achieving competitive results on CRMArena-Pro dataset.

Details

Motivation: Business agents face challenges with intricate data relationships and heterogeneous tasks in real-world applications, requiring improved handling of complex data and varied requirements.

Method: Uses synthesis data generation and RL-based training paradigm to acclimate agents to business environments, plus shared memories mechanism during inference for learning from similar task guidelines.

Result: Achieves competitive results on CRMArena-Pro dataset in both B2B and B2C business scenarios, demonstrating practical value for real-world applications.

Conclusion: CRMWeaver effectively enhances business agents’ capabilities in complex settings through its training and inference mechanisms, showing strong generalization especially in unseen scenarios.

Abstract: Recent years have witnessed the rapid development of LLM-based agents, which shed light on using language agents to solve complex real-world problems. A prominent application lies in business agents, which interact with databases and internal knowledge bases via tool calls to fulfill diverse user requirements. However, this domain is characterized by intricate data relationships and a wide range of heterogeneous tasks, from statistical data queries to knowledge-based question-answering. To address these challenges, we propose CRMWeaver, a novel approach that enhances business agents in such complex settings. To acclimate the agentic model to intricate business environments, we employ a synthesis data generation and RL-based paradigm during training, which significantly improves the model’s ability to handle complex data and varied tasks. During inference, a shared memories mechanism is introduced, prompting the agent to learn from task guidelines in similar problems, thereby further boosting its effectiveness and generalization, especially in unseen scenarios. We validate the efficacy of our approach on the CRMArena-Pro dataset, where our lightweight model achieves competitive results in both B2B and B2C business scenarios, underscoring its practical value for real-world applications.

[39] Not ready for the bench: LLM legal interpretation is unstable and out of step with human judgments

Abhishek Purushothama, Junghyun Min, Brandon Waldon, Nathan Schneider

Main category: cs.CL

TL;DR: LLMs are unreliable for legal interpretation due to unstable judgments and weak correlation with human reasoning.

Details

Motivation: To empirically evaluate the reliability of LLMs in legal interpretation, given recent proposals to use them in judicial contexts.

Method: Investigated LLM performance on legal interpretation tasks by varying question formats and comparing model outputs with human judgments.

Result: Models showed unstable interpretive judgments with high variance across formats and weak to moderate correlation with human judgment.

Conclusion: It is dangerous to rely on LLM conclusions for legal interpretation due to their instability and poor alignment with human reasoning.

Abstract: Legal interpretation frequently involves assessing how a legal text, as understood by an ‘ordinary’ speaker of the language, applies to the set of facts characterizing a legal dispute in the U.S. judicial system. Recent scholarship has proposed that legal practitioners add large language models (LLMs) to their interpretive toolkit. This work offers an empirical argument against LLM interpretation as recently practiced by legal scholars and federal judges. Our investigation in English shows that models do not provide stable interpretive judgments: varying the question format can lead the model to wildly different conclusions. Moreover, the models show weak to moderate correlation with human judgment, with large variance across model and question variant, suggesting that it is dangerous to give much credence to the conclusions produced by generative AI.

[40] CLASS-IT: Conversational and Lecture-Aligned Small-Scale Instruction Tuning for BabyLMs

Luca Capone, Alessandro Bondielli, Alessandro Lenci

Main category: cs.CL

TL;DR: Instruction tuning provides small consistent gains in fine-tuning scenarios for small LMs (100M-140M parameters), with sequential curricula outperforming merged data, but improvements don’t consistently transfer to zero-shot tasks.

Details

Motivation: To investigate whether small-scale language models can benefit from instruction tuning and understand the trade-offs between interaction-focused adaptation and broad linguistic generalization.

Method: Compared conversational and question-answering instruction tuning datasets applied in merged or sequential curriculum, using decoder-only models with 100M and 140M parameters. Evaluated on both fine-tuning (SuperGLUE) and zero-shot (BLiMP, EWoK, WUGs, entity tracking, psycholinguistic correlation) settings.

Result: Instruction tuning yields small but consistent gains in fine-tuning scenarios, with sequential curricula outperforming merged data. However, improvements do not consistently transfer to zero-shot tasks.

Conclusion: There’s a trade-off between interaction-focused adaptation and broad linguistic generalization. Results highlight both potential and constraints of adapting human-inspired learning strategies to low-resource LMs, pointing toward hybrid, curriculum-based approaches for enhancing generalization under ecological training limits.

Abstract: This work investigates whether small-scale LMs can benefit from instruction tuning. We compare conversational and question-answering instruction tuning datasets, applied either in a merged or sequential curriculum, using decoder-only models with 100M and 140M parameters. Evaluation spans both fine-tuning (SuperGLUE) and zero-shot (BLiMP, EWoK, WUGs, entity tracking, and psycholinguistic correlation) settings. Results show that instruction tuning yields small but consistent gains in fine-tuning scenarios, with sequential curricula outperforming merged data; however, improvements do not consistently transfer to zero-shot tasks, suggesting a trade-off between interaction-focused adaptation and broad linguistic generalization. These results highlight both the potential and the constraints of adapting human-inspired learning strategies to low-resource LMs, and point toward hybrid, curriculum-based approaches for enhancing generalization under ecological training limits.

[41] Monitoring Transformative Technological Convergence Through LLM-Extracted Semantic Entity Triple Graphs

Alexander Sternfeld, Andrei Kucharavy, Dimitri Percia David, Alain Mermoud, Julian Jang-Jaccard, Nathan Monnet

Main category: cs.CL

TL;DR: A data-driven pipeline using LLMs to extract semantic triples from text and build technology graphs, enabling detection of technological convergence patterns for forecasting transformative technologies.

Details

Motivation: Traditional expert-based forecasting methods struggle with fast-evolving ICT domains due to short innovation cycles and ambiguous early-stage terminology, creating a need for scalable, data-driven approaches.

Method: Leverages LLMs to extract semantic triples from unstructured text, constructs large-scale technology graphs, uses noun stapling for grouping similar terms, and applies graph-based metrics with multi-stage filtering and temporal trend analysis.

Result: Validated on 278,625 arXiv preprints (2017-2024) and 9,793 USPTO patents (2018-2024), the pipeline successfully identifies both established and emerging convergence patterns.

Conclusion: The proposed framework offers a scalable and generalizable approach for technology forecasting grounded in full-text analysis, addressing limitations of traditional methods.

Abstract: Forecasting transformative technologies remains a critical but challenging task, particularly in fast-evolving domains such as Information and Communication Technologies (ICTs). Traditional expert-based methods struggle to keep pace with short innovation cycles and ambiguous early-stage terminology. In this work, we propose a novel, data-driven pipeline to monitor the emergence of transformative technologies by identifying patterns of technological convergence. Our approach leverages advances in Large Language Models (LLMs) to extract semantic triples from unstructured text and construct a large-scale graph of technology-related entities and relations. We introduce a new method for grouping semantically similar technology terms (noun stapling) and develop graph-based metrics to detect convergence signals. The pipeline includes multi-stage filtering, domain-specific keyword clustering, and a temporal trend analysis of topic co-occurence. We validate our methodology on two complementary datasets: 278,625 arXiv preprints (2017–2024) to capture early scientific signals, and 9,793 USPTO patent applications (2018-2024) to track downstream commercial developments. Our results demonstrate that the proposed pipeline can identify both established and emerging convergence patterns, offering a scalable and generalizable framework for technology forecasting grounded in full-text analysis.

[42] Hallucinations in Bibliographic Recommendation: Citation Frequency as a Proxy for Training Data Redundancy

Junichiro Niimi

Main category: cs.CL

TL;DR: LLMs show lower hallucination rates for highly cited papers, with citation count strongly correlating with factual accuracy and bibliographic information becoming almost verbatimly memorized beyond ~1,000 citations.

Details

Motivation: To address the issue of LLMs hallucinating non-existent papers in bibliographic recommendation by investigating how citation frequency affects hallucination rates.

Method: Used GPT-4.1 to generate 100 bibliographic records across 20 computer-science domains, manually verified them, and measured factual consistency via cosine similarity between generated and authentic metadata.

Result: Hallucination rates vary across domains, citation count is strongly correlated with factual accuracy, and bibliographic information becomes almost verbatimly memorized beyond approximately 1,000 citations.

Conclusion: Highly cited papers are nearly verbatimly retained in LLMs, indicating a threshold where generalization shifts into memorization.

Abstract: Large language models (LLMs) have been increasingly applied to a wide range of tasks, from natural language understanding to code generation. While they have also been used to assist in bibliographic recommendation, the hallucination of non-existent papers remains a major issue. Building on prior studies, this study hypothesizes that an LLM’s ability to correctly produce bibliographic information depends on whether the underlying knowledge is generated or memorized, with highly cited papers (i.e., more frequently appear in the training corpus) showing lower hallucination rates. We therefore assume citation count as a proxy for training data redundancy (i.e., the frequency with which a given bibliographic record is repeatedly represented in the pretraining corpus) and investigate how citation frequency affects hallucinated references in LLM outputs. Using GPT-4.1, we generated and manually verified 100 bibliographic records across twenty computer-science domains, and measured factual consistency via cosine similarity between generated and authentic metadata. The results revealed that (i) hallucination rates vary across research domains, (ii) citation count is strongly correlated with factual accuracy, and (iii) bibliographic information becomes almost verbatimly memorized beyond approximately 1,000 citations. These findings suggest that highly cited papers are nearly verbatimly retained in the model, indicating a threshold where generalization shifts into memorization.

[43] Roleplaying with Structure: Synthetic Therapist-Client Conversation Generation from Questionnaires

Doan Nam Long Vu, Rui Tan, Lena Moench, Svenja Jule Francke, Daniel Woiwod, Florian Thomas-Odenthal, Sanna Stroth, Tilo Kircher, Christiane Hermann, Udo Dannlowski, Hamidreza Jamalabadi, Shaoxiong Ji

Main category: cs.CL

TL;DR: SQPsych is an LLM-driven pipeline that generates synthetic counseling dialogues for mental health AI, addressing data scarcity due to privacy constraints by creating CBT-based therapeutic conversations from structured client profiles.

Details

Motivation: AI development for mental health is hindered by lack of authentic therapy dialogues due to privacy regulations and historical lack of clinical session recordings.

Method: Uses structured client profiles and psychological questionnaires to generate synthetic CBT-based therapeutic conversations through therapist-client simulations using open-weight LLMs, avoiding proprietary models due to data governance restrictions.

Result: SQPsychLLM models fine-tuned on SQPsychConv achieve strong performance on counseling benchmarks, surpassing baselines in key therapeutic skills, with validation through human expert evaluation and LLM-based assessments.

Conclusion: Synthetic data enables scalable, data-secure, and clinically informed AI for mental health support, with potential to overcome privacy barriers in mental health AI development.

Abstract: The development of AI for mental health is hindered by a lack of authentic therapy dialogues, due to strict privacy regulations and the fact that clinical sessions were historically rarely recorded. We present an LLM-driven pipeline that generates synthetic counseling dialogues based on structured client profiles and psychological questionnaires. Grounded on the principles of Cognitive Behavioral Therapy (CBT), our method creates synthetic therapeutic conversations for clinical disorders such as anxiety and depression. Our framework, SQPsych (Structured Questionnaire-based Psychotherapy), converts structured psychological input into natural language dialogues through therapist-client simulations. Due to data governance policies and privacy restrictions prohibiting the transmission of clinical questionnaire data to third-party services, previous methodologies relying on proprietary models are infeasible in our setting. We address this limitation by generating a high-quality corpus using open-weight LLMs, validated through human expert evaluation and LLM-based assessments. Our SQPsychLLM models fine-tuned on SQPsychConv achieve strong performance on counseling benchmarks, surpassing baselines in key therapeutic skills. Our findings highlight the potential of synthetic data to enable scalable, data-secure, and clinically informed AI for mental health support. We will release our code, models, and corpus at https://ai-mh.github.io/SQPsych

[44] BhashaBench V1: A Comprehensive Benchmark for the Quadrant of Indic Domains

Vijay Devane, Mohd Nauman, Bhargav Patel, Aniket Mahendra Wakchoure, Yogeshkumar Sant, Shyam Pawar, Viraj Thakur, Ananya Godse, Sunil Patra, Neha Maurya, Suraj Racha, Nitish Kamal Singh, Ajay Nagpal, Piyush Sawarkar, Kundeshwar Vijayrao Pundalik, Rohit Saluja, Ganesh Ramakrishnan

Main category: cs.CL

TL;DR: BhashaBench V1 is the first domain-specific, bilingual benchmark for evaluating LLMs on India-centric knowledge systems, containing 74,166 question-answer pairs across Agriculture, Legal, Finance, and Ayurveda domains in English and Hindi.

Details

Motivation: Existing benchmarks are largely Anglocentric and domain-agnostic, limiting their applicability to India-specific contexts, creating a need for domain and culture-specific evaluation of LLMs.

Method: Created a comprehensive benchmark with 74,166 curated question-answer pairs (52,494 English, 21,672 Hindi) sourced from authentic government and domain-specific exams, spanning 4 major domains with 90+ subdomains and 500+ topics.

Result: Evaluation of 29+ LLMs revealed significant domain and language performance gaps. GPT-4o achieved 76.49% accuracy in Legal but only 59.74% in Ayurveda. Models consistently performed better on English than Hindi across all domains, with notable weaknesses in low-resource areas like Panchakarma and Seed Science.

Conclusion: BhashaBench V1 provides a comprehensive dataset for evaluating LLMs across India’s diverse knowledge domains, enabling assessment of domain-specific knowledge integration with bilingual understanding, with all resources publicly available for open research.

Abstract: The rapid advancement of large language models(LLMs) has intensified the need for domain and culture specific evaluation. Existing benchmarks are largely Anglocentric and domain-agnostic, limiting their applicability to India-centric contexts. To address this gap, we introduce BhashaBench V1, the first domain-specific, multi-task, bilingual benchmark focusing on critical Indic knowledge systems. BhashaBench V1 contains 74,166 meticulously curated question-answer pairs, with 52,494 in English and 21,672 in Hindi, sourced from authentic government and domain-specific exams. It spans four major domains: Agriculture, Legal, Finance, and Ayurveda, comprising 90+ subdomains and covering 500+ topics, enabling fine-grained evaluation. Evaluation of 29+ LLMs reveals significant domain and language specific performance gaps, with especially large disparities in low-resource domains. For instance, GPT-4o achieves 76.49% overall accuracy in Legal but only 59.74% in Ayurveda. Models consistently perform better on English content compared to Hindi across all domains. Subdomain-level analysis shows that areas such as Cyber Law, International Finance perform relatively well, while Panchakarma, Seed Science, and Human Rights remain notably weak. BhashaBench V1 provides a comprehensive dataset for evaluating large language models across India’s diverse knowledge domains. It enables assessment of models’ ability to integrate domain-specific knowledge with bilingual understanding. All code, benchmarks, and resources are publicly available to support open research.

[45] Serve Programs, Not Prompts

In Gim, Lin Zhong

Main category: cs.CL

TL;DR: Proposes Symphony, a new LLM serving system that serves programs (LIPs) instead of prompts, enabling runtime customization of token prediction and KV cache management while offloading application logic to the server.

Details

Motivation: Current LLM serving systems are inefficient and inflexible for complex applications due to their text completion focus and rigid design.

Method: Introduces LLM Inference Programs (LIPs) and Symphony system architecture that virtualizes KV cache with a file system, exposes LLM computations via system calls, and uses two-level process scheduling for GPU efficiency.

Result: Symphony enables more efficient and extensible LLM application ecosystem by allowing runtime customization and server-side offloading of application logic.

Conclusion: Symphony’s program-based serving approach opens the door to more efficient and adaptable LLM applications compared to traditional prompt-based systems.

Abstract: Current large language model (LLM) serving systems, primarily designed for text completion, are neither efficient nor adaptable for increasingly complex LLM applications due to their inflexible design. We propose a new LLM serving system architecture that serves programs instead of prompts to address this problem. These programs, called LLM Inference Programs (LIPs), allow users to customize token prediction and KV cache management at runtime and to offload parts of their application logic, such as tool execution, to the server. We describe an example of this architecture through a system named Symphony, which functions as an operating system for LIPs. Symphony exposes LLM model computations via system calls and virtualizes KV cache with a dedicated file system, while ensuring GPU efficiency with a two-level process scheduling scheme. Symphony has the potential to open the door to a more efficient and extensible ecosystem for LLM applications.

Shakib Yazdani, Yasser Hamidullah, Cristina España-Bonet, Josef van Genabith

Main category: cs.CL

TL;DR: Automated VLM-based framework for sign language dataset creation from social media videos, reducing manual annotation costs while maintaining quality.

Details

Motivation: Existing SLT datasets are small, lack multilingual coverage, and are expensive to create due to expert annotation requirements.

Method: VLM-based pipeline with face detection, sign activity recognition, text extraction, and video-text alignment validation applied to TikTok videos across 8 sign languages.

Result: Created TikTok-SL-8 dataset and evaluated SLT models on filtered German and American Sign Language data, establishing baselines for noisy social media data.

Conclusion: Enables scalable, weakly supervised pretraining for SLT and facilitates data acquisition from social media platforms.

Abstract: Most existing sign language translation (SLT) datasets are limited in scale, lack multilingual coverage, and are costly to curate due to their reliance on expert annotation and controlled recording setup. Recently, Vision Language Models (VLMs) have demonstrated strong capabilities as evaluators and real-time assistants. Despite these advancements, their potential remains untapped in the context of sign language dataset acquisition. To bridge this gap, we introduce the first automated annotation and filtering framework that utilizes VLMs to reduce reliance on manual effort while preserving data quality. Our method is applied to TikTok videos across eight sign languages and to the already curated YouTube-SL-25 dataset in German Sign Language for the purpose of additional evaluation. Our VLM-based pipeline includes a face visibility detection, a sign activity recognition, a text extraction from video content, and a judgment step to validate alignment between video and text, implementing generic filtering, annotation and validation steps. Using the resulting corpus, TikTok-SL-8, we assess the performance of two off-the-shelf SLT models on our filtered dataset for German and American Sign Languages, with the goal of establishing baselines and evaluating the robustness of recent models on automatically extracted, slightly noisy data. Our work enables scalable, weakly supervised pretraining for SLT and facilitates data acquisition from social media.

[47] Implicature in Interaction: Understanding Implicature Improves Alignment in Human-LLM Interaction

Asutosh Hota, Jussi P. P. Jokinen

Main category: cs.CL

TL;DR: LLMs’ ability to understand conversational implicature improves human-AI alignment, with larger models performing closer to human interpretations and implicature-based prompts enhancing response quality across all models.

Details

Motivation: Advancing human-computer interaction requires attention to linguistic foundations, particularly implicature (meaning conveyed through shared context), which is essential for human-AI alignment.

Method: Examined LLMs’ ability to infer user intent embedded in context-driven prompts and whether understanding implicature improves response generation.

Result: Larger models approximate human interpretations more closely, while smaller models struggle with implicature inference. Implicature-based prompts significantly enhance perceived relevance and quality of responses across models, with 67.6% of participants preferring them over literal prompts.

Conclusion: Linguistic theory, particularly implicature understanding, can address the alignment problem by making human-AI interaction more natural and contextually grounded.

Abstract: The rapid advancement of Large Language Models (LLMs) is positioning language at the core of human-computer interaction (HCI). We argue that advancing HCI requires attention to the linguistic foundations of interaction, particularly implicature (meaning conveyed beyond explicit statements through shared context) which is essential for human-AI (HAI) alignment. This study examines LLMs’ ability to infer user intent embedded in context-driven prompts and whether understanding implicature improves response generation. Results show that larger models approximate human interpretations more closely, while smaller models struggle with implicature inference. Furthermore, implicature-based prompts significantly enhance the perceived relevance and quality of responses across models, with notable gains in smaller models. Overall, 67.6% of participants preferred responses with implicature-embedded prompts to literal ones, highlighting a clear preference for contextually nuanced communication. Our work contributes to understanding how linguistic theory can be used to address the alignment problem by making HAI interaction more natural and contextually grounded.

[48] RLMEval: Evaluating Research-Level Neural Theorem Proving

Auguste Poiroux, Antoine Bosselut, Viktor Kunčak

Main category: cs.CL

TL;DR: RLMEval is a new evaluation suite for neural theorem proving and proof autoformalization using research-level mathematics from real Lean formalization projects, revealing significant performance gaps in current models.

Details

Motivation: Current LLMs show limited practical impact on research-level neural theorem proving and proof autoformalization despite impressive results on curated benchmarks.

Method: Created RLMEval evaluation suite using 613 theorems from 6 real Lean Blueprint formalization projects to assess neural theorem proving and proof autoformalization capabilities.

Result: Best model achieved only 10.3% pass rate, showing that progress on existing benchmarks doesn’t translate to realistic research-level settings.

Conclusion: RLMEval provides a challenging benchmark to guide and accelerate progress in automated reasoning for formal mathematics.

Abstract: Despite impressive results on curated benchmarks, the practical impact of large language models (LLMs) on research-level neural theorem proving and proof autoformalization is still limited. We introduce RLMEval, an evaluation suite for these tasks, focusing on research-level mathematics from real-world Lean formalization projects. RLMEval targets the evaluation of neural theorem proving and proof autoformalization on challenging research-level theorems by leveraging real Lean Blueprint formalization projects. Our evaluation of state-of-the-art models on RLMEval, comprising 613 theorems from 6 Lean projects, reveals a significant gap: progress on existing benchmarks does not readily translate to these more realistic settings, with the best model achieving only a 10.3 % pass rate. RLMEval provides a new, challenging benchmark designed to guide and accelerate progress in automated reasoning for formal mathematics.

Ali Sanaei, Ali Rajabzadeh

Main category: cs.CL

TL;DR: A framework for using LLMs in qualitative social science research that classifies applications along two dimensions (interpretive depth and autonomy) to address challenges like bias and reliability.

Details

Motivation: To address persistent challenges in LLM adoption for qualitative social science research, including interpretive bias, low reliability, and weak auditability.

Method: Introduces a two-dimensional framework (interpretive depth and autonomy) for classifying LLM applications, based on analysis of all published social science papers using LLMs as tools on Web of Science. Encourages task decomposition and supervised usage.

Result: Provides a classification system and practical design recommendations for LLM usage in qualitative research, suggesting low autonomy with selective increases in interpretive depth under supervision.

Conclusion: Researchers can benefit from LLMs while preserving transparency and reliability by maintaining low autonomy levels and carefully managing interpretive depth through task decomposition and supervision.

Abstract: Large language models (LLMs) are increasingly utilized by researchers across a wide range of domains, and qualitative social science is no exception; however, this adoption faces persistent challenges, including interpretive bias, low reliability, and weak auditability. We introduce a framework that situates LLM usage along two dimensions, interpretive depth and autonomy, thereby offering a straightforward way to classify LLM applications in qualitative research and to derive practical design recommendations. We present the state of the literature with respect to these two dimensions, based on all published social science papers available on Web of Science that use LLMs as a tool and not strictly as the subject of study. Rather than granting models expansive freedom, our approach encourages researchers to decompose tasks into manageable segments, much as they would when delegating work to capable undergraduate research assistants. By maintaining low levels of autonomy and selectively increasing interpretive depth only where warranted and under supervision, one can plausibly reap the benefits of LLMs while preserving transparency and reliability.

[50] A Critical Study of Automatic Evaluation in Sign Language Translation

Shakib Yazdani, Yasser Hamidullah, Cristina España-Bonet, Eleftherios Avramidis, Josef van Genabith

Main category: cs.CL

TL;DR: This paper investigates the limitations of text-based metrics for sign language translation evaluation, analyzing six conventional and LLM-based metrics under paraphrasing, hallucinations, and sentence length variations.

Details

Motivation: Current SLT evaluation relies on text-based metrics like BLEU and ROUGE, but it's unclear how well these capture SLT quality, motivating the need to assess their limitations.

Method: Analyzed six metrics (BLEU, chrF, ROUGE, BLEURT, G-Eval, GEMBA) under three controlled conditions: paraphrasing, hallucinations in outputs, and sentence length variations to assess consistency and robustness.

Result: Lexical overlap metrics have limitations; LLM-based evaluators better capture semantic equivalence but show bias toward LLM-paraphrased translations. All metrics detect hallucinations, with BLEU being overly sensitive while BLEURT and LLM evaluators are lenient toward subtle cases.

Conclusion: There is a need for multimodal evaluation frameworks beyond text-based metrics to enable more holistic assessment of SLT outputs.

Abstract: Automatic evaluation metrics are crucial for advancing sign language translation (SLT). Current SLT evaluation metrics, such as BLEU and ROUGE, are only text-based, and it remains unclear to what extent text-based metrics can reliably capture the quality of SLT outputs. To address this gap, we investigate the limitations of text-based SLT evaluation metrics by analyzing six metrics, including BLEU, chrF, and ROUGE, as well as BLEURT on the one hand, and large language model (LLM)-based evaluators such as G-Eval and GEMBA zero-shot direct assessment on the other hand. Specifically, we assess the consistency and robustness of these metrics under three controlled conditions: paraphrasing, hallucinations in model outputs, and variations in sentence length. Our analysis highlights the limitations of lexical overlap metrics and demonstrates that while LLM-based evaluators better capture semantic equivalence often missed by conventional metrics, they can also exhibit bias toward LLM-paraphrased translations. Moreover, although all metrics are able to detect hallucinations, BLEU tends to be overly sensitive, whereas BLEURT and LLM-based evaluators are comparatively lenient toward subtle cases. This motivates the need for multimodal evaluation frameworks that extend beyond text-based metrics to enable a more holistic assessment of SLT outputs.

[51] Grounded in Reality: Learning and Deploying Proactive LLM from Offline Logs

Fei Wei, Daoyuan Chen, Ce Wang, Yilun Huang, Yushuo Chen, Xuchen Pan, Yaliang Li, Bolin Ding

Main category: cs.CL

TL;DR: Learn-to-Ask is a simulator-free framework for training proactive dialogue agents directly from offline expert data, using observed future trajectories to infer dense rewards and decompose long-horizon problems into supervised learning tasks.

Details

Motivation: Current LLMs excel as passive responders but struggle with proactive, goal-oriented dialogue in high-stakes domains. Existing approaches either optimize single-turn attributes or rely on brittle user simulators, creating a 'reality gap'.

Method: Reframes offline policy learning by leveraging observed future trajectories to infer dense, turn-by-turn rewards. Trains policies to output structured (action, state_assessment) tuples that determine what to ask and when to stop. Uses Automated Grader Calibration to purge noise from LLM-based reward models.

Result: Successfully deployed LLMs into a live, large-scale online AI service. In rigorous evaluations, the model achieved performance superior to human experts on a real-world medical dataset using LLMs up to 32B.

Conclusion: The framework provides a practical and economically viable blueprint for transforming passive LLMs into proactive, goal-oriented applications that can translate offline data into real-world impact.

Abstract: Large Language Models (LLMs) excel as passive responders, but teaching them to be proactive, goal-oriented partners, a critical capability in high-stakes domains, remains a major challenge. Current paradigms either myopically optimize single-turn attributes or rely on brittle, high-cost user simulators, creating a persistent ``reality gap’’. To bridge this gap, we introduce \texttt{Learn-to-Ask}, a general, simulator-free framework for learning and deploying proactive dialogue agents \textit{directly from offline expert data}, bypassing the need to model complex user dynamics. Our key insight is to reframe the offline policy learning problem by leveraging the \textbf{observed future} of each expert trajectory. This allows us to infer a dense, turn-by-turn reward signal grounded in the expert’s revealed strategy, decomposing the intractable long-horizon problem into a series of supervised learning tasks, and training a policy to output a structured \texttt{(action, state_assessment)} tuple, governing both \textbf{what to ask} and, crucially, \textbf{when to stop}. To ensure reward fidelity, our Automated Grader Calibration pipeline systematically purges noise from the LLM-based reward model with minimal human supervision. Empirically, we demonstrate the efficacy of \texttt{Learn-to-Ask} in a real-world medical dataset, using LLMs of varying sizes up to 32B. Our approach culminates in the successful deployment of LLMs into a live, large-scale online AI service. In rigorous in-house evaluations, our model was launched and achieved performance even superior to human experts, proving our framework’s ability to translate offline data into tangible, real-world impact. We hope this work provides a practical and economically viable blueprint for transforming passive LLMs into proactive, goal-oriented LLM applications.

[52] Fine-Tuned Language Models for Domain-Specific Summarization and Tagging

Jun Wang, Fuming Lin, Yuyu Chen

Main category: cs.CL

TL;DR: A pipeline combining fine-tuned LLMs with NER for domain-specific text summarization and tagging, particularly effective for evolving sub-cultural languages and slang in security contexts.

Details

Motivation: Address challenges in automated information extraction and law enforcement monitoring caused by rapidly evolving sub-cultural languages and slang.

Method: Fine-tune LLMs using LLaMA Factory framework on general-purpose and custom domain-specific datasets, particularly in political and security domains, with evaluation using BLEU and ROUGE metrics.

Result: Instruction fine-tuning significantly enhances summarization and tagging accuracy, especially for specialized corpora. LLaMA3-8B-Instruct model outperforms Chinese-trained counterpart after domain-specific fine-tuning, showing reasoning capabilities can transfer across languages.

Conclusion: The pipeline enables concise summaries and structured entity tagging for rapid document categorization, proving scalable and adaptable for real-time applications in information management and security operations.

Abstract: This paper presents a pipeline integrating fine-tuned large language models (LLMs) with named entity recognition (NER) for efficient domain-specific text summarization and tagging. The authors address the challenge posed by rapidly evolving sub-cultural languages and slang, which complicate automated information extraction and law enforcement monitoring. By leveraging the LLaMA Factory framework, the study fine-tunes LLMs on both generalpurpose and custom domain-specific datasets, particularly in the political and security domains. The models are evaluated using BLEU and ROUGE metrics, demonstrating that instruction fine-tuning significantly enhances summarization and tagging accuracy, especially for specialized corpora. Notably, the LLaMA3-8B-Instruct model, despite its initial limitations in Chinese comprehension, outperforms its Chinese-trained counterpart after domainspecific fine-tuning, suggesting that underlying reasoning capabilities can transfer across languages. The pipeline enables concise summaries and structured entity tagging, facilitating rapid document categorization and distribution. This approach proves scalable and adaptable for real-time applications, supporting efficient information management and the ongoing need to capture emerging language trends. The integration of LLMs and NER offers a robust solution for transforming unstructured text into actionable insights, crucial for modern knowledge management and security operations.

[53] TwinVoice: A Multi-dimensional Benchmark Towards Digital Twins via LLM Persona Simulation

Bangde Du, Minghao Guo, Songming He, Ziyi Ye, Xi Zhu, Weihang Su, Shuqi Zhu, Yujia Zhou, Yongfeng Zhang, Qingyao Ai, Yiqun Liu

Main category: cs.CL

TL;DR: TwinVoice is a comprehensive benchmark for evaluating persona simulation in LLMs across social, interpersonal, and narrative dimensions, revealing that current models still lag significantly behind human performance.

Details

Motivation: Current evaluations of LLM-based persona simulation are limited by reliance on synthetic dialogues, lack of systematic frameworks, and insufficient analysis of capability requirements.

Method: Introduced TwinVoice benchmark with three persona dimensions (Social, Interpersonal, Narrative) and six capability evaluations (opinion consistency, memory recall, logical reasoning, lexical fidelity, persona tone, syntactic style).

Result: Advanced LLMs achieve moderate accuracy in persona simulation but fall short in capabilities like syntactic style and memory recall, with average performance considerably below human baseline.

Conclusion: While LLMs show emergent human-like abilities, significant gaps remain in persona simulation capabilities, particularly in memory recall and syntactic style, highlighting the need for continued improvement.

Abstract: Large Language Models (LLMs) are exhibiting emergent human-like abilities and are increasingly envisioned as the foundation for simulating an individual’s communication style, behavioral tendencies, and personality traits. However, current evaluations of LLM-based persona simulation remain limited: most rely on synthetic dialogues, lack systematic frameworks, and lack analysis of the capability requirement. To address these limitations, we introduce TwinVoice, a comprehensive benchmark for assessing persona simulation across diverse real-world contexts. TwinVoice encompasses three dimensions: Social Persona (public social interactions), Interpersonal Persona (private dialogues), and Narrative Persona (role-based expression). It further decomposes the evaluation of LLM performance into six fundamental capabilities, including opinion consistency, memory recall, logical reasoning, lexical fidelity, persona tone, and syntactic style. Experimental results reveal that while advanced models achieve moderate accuracy in persona simulation, they still fall short of capabilities such as syntactic style and memory recall. Consequently, the average performance achieved by LLMs remains considerably below the human baseline.

[54] Communication and Verification in LLM Agents towards Collaboration under Information Asymmetry

Run Peng, Ziqiao Ma, Amy Pang, Sikai Li, Zhang Xi-Jia, Yingzhuo Yu, Cristian-Paul Bara, Joyce Chai

Main category: cs.CL

TL;DR: This paper studies LLM agent collaboration in task completion under information asymmetry, using an extended Einstein Puzzles game. It explores how agents with different knowledge/skills can work together through communication and verification.

Details

Motivation: To address the limitation that LLM agents' collaborative abilities for joint goals are not well explored, particularly under information asymmetry conditions where agents have disparities in knowledge and skills.

Method: Extended Einstein Puzzles to a table-top game where two LLM agents must reason, communicate, and act to solve spatial/relational constraints. Applied fine-tuning-plus-verifier framework with various communication strategies and environmental verification signals.

Result: Aligned communication is critical when agents have both information-seeking and -providing capabilities. Agents without communication can achieve high performance but lack true rule understanding and receive lower human trust. Environment-based verifier enhances rule comprehension and task completion.

Conclusion: Integrating environment-based verifiers promotes safer and more interpretable collaboration in AI systems by enhancing agents’ ability to comprehend task rules and complete tasks effectively.

Abstract: While Large Language Model (LLM) agents are often approached from the angle of action planning/generation to accomplish a goal (e.g., given by language descriptions), their abilities to collaborate with each other to achieve a joint goal are not well explored. To address this limitation, this paper studies LLM agents in task collaboration, particularly under the condition of information asymmetry, where agents have disparities in their knowledge and skills and need to work together to complete a shared task. We extend Einstein Puzzles, a classical symbolic puzzle, to a table-top game. In this game, two LLM agents must reason, communicate, and act to satisfy spatial and relational constraints required to solve the puzzle. We apply a fine-tuning-plus-verifier framework in which LLM agents are equipped with various communication strategies and verification signals from the environment. Empirical results highlight the critical importance of aligned communication, especially when agents possess both information-seeking and -providing capabilities. Interestingly, agents without communication can still achieve high task performance; however, further analysis reveals a lack of true rule understanding and lower trust from human evaluators. Instead, by integrating an environment-based verifier, we enhance agents’ ability to comprehend task rules and complete tasks, promoting both safer and more interpretable collaboration in AI systems. https://github.com/Roihn/EinsteinPuzzles

[55] FARSIQA: Faithful and Advanced RAG System for Islamic Question Answering

Mohammad Aghajani Asl, Behrooz Minaei Bidgoli

Main category: cs.CL

TL;DR: FARSIQA is a novel system for faithful Persian Islamic question answering that uses an adaptive, iterative RAG framework to address hallucination and improve accuracy in complex multi-hop queries.

Details

Motivation: LLMs struggle with hallucination and unfaithfulness in high-stakes religious domains, particularly for Persian-speaking Muslims where accuracy is critical. Existing RAG systems fail on complex queries requiring multi-step reasoning.

Method: FARSIQA uses FAIR-RAG architecture - a Faithful, Adaptive, Iterative Refinement framework that dynamically decomposes queries, assesses evidence sufficiency, and iteratively generates sub-queries to fill information gaps.

Result: Achieved 97.0% in Negative Rejection (40-point improvement over baselines) and 74.3% Answer Correctness on the IslamicPCQA benchmark, using a knowledge base of over one million Islamic documents.

Conclusion: Establishes new standard for Persian Islamic QA and validates that iterative, adaptive architecture is crucial for building faithful, reliable AI systems in sensitive domains.

Abstract: The advent of Large Language Models (LLMs) has revolutionized Natural Language Processing, yet their application in high-stakes, specialized domains like religious question answering is hindered by challenges like hallucination and unfaithfulness to authoritative sources. This issue is particularly critical for the Persian-speaking Muslim community, where accuracy and trustworthiness are paramount. Existing Retrieval-Augmented Generation (RAG) systems, relying on simplistic single-pass pipelines, fall short on complex, multi-hop queries requiring multi-step reasoning and evidence aggregation. To address this gap, we introduce FARSIQA, a novel, end-to-end system for Faithful Advanced Question Answering in the Persian Islamic domain. FARSIQA is built upon our innovative FAIR-RAG architecture: a Faithful, Adaptive, Iterative Refinement framework for RAG. FAIR-RAG employs a dynamic, self-correcting process: it adaptively decomposes complex queries, assesses evidence sufficiency, and enters an iterative loop to generate sub-queries, progressively filling information gaps. Operating on a curated knowledge base of over one million authoritative Islamic documents, FARSIQA demonstrates superior performance. Rigorous evaluation on the challenging IslamicPCQA benchmark shows state-of-the-art performance: the system achieves a remarkable 97.0% in Negative Rejection - a 40-point improvement over baselines - and a high Answer Correctness score of 74.3%. Our work establishes a new standard for Persian Islamic QA and validates that our iterative, adaptive architecture is crucial for building faithful, reliable AI systems in sensitive domains.

[56] Evaluating the Role of Verifiers in Test-Time Scaling for Legal Reasoning Tasks

Davide Romano, Jonathan Schwarz, Daniele Giofré

Main category: cs.CL

TL;DR: This paper evaluates test-time scaling (TTS) methods using verifiers for legal multiple-choice QA, finding that specialized domain verifiers and process-supervised models improve performance under realistic computational budgets.

Details

Motivation: While TTS has shown effectiveness in formal domains like mathematics and programming, its value in argumentative domains such as law remains underexplored, particularly for legal multiple-choice QA tasks.

Method: The study uses a family of 7 reward models to evaluate both outcome-level (Best-of-N) and process-level (tree search) verification methods across five legal MCQA benchmarks, focusing on realistic low-N computational budgets.

Result: The analysis systematically investigates how verifier utility is affected by key properties including domain specialization, model size, and supervision type (process-supervised PRMs vs. outcome-only ORMs), even when applied across different roles.

Conclusion: Verifier-based TTS methods can effectively improve LLM performance in legal domains, with domain-specialized verifiers and process supervision providing particular benefits under constrained computational budgets.

Abstract: Test-time scaling (TTS) techniques can improve the performance of large language models (LLMs) at the expense of additional computation and latency. While TTS has proven effective in formal domains such as mathematics and programming \citep{snell2024scaling, chen2024more}, its value in argumentative domains such as law remains underexplored. We present an empirical study of verifier-based TTS methods for legal multiple-choice QA (MCQA) across five benchmarks. Using a family of 7 reward models, we evaluate both outcome-level (Best-of-$N$) and process-level (tree search) verification under realistic low-$N$ budgets. Our analysis systematically investigates how verifier utility is affected by key properties such as domain specialization, model size, and supervision type (process-supervised PRMs vs. outcome-only ORMs), even when applied across different roles.

[57] Are Language Models Efficient Reasoners? A Perspective from Logic Programming

Andreas Opedal, Yanick Zengaffinen, Haruki Shirakami, Clemente Pasti, Mrinmaya Sachan, Abulhair Saparov, Ryan Cotterell, Bernhard Schölkopf

Main category: cs.CL

TL;DR: The paper proposes a framework to evaluate language model reasoning efficiency by measuring how well they avoid unnecessary inferences when solving math word problems with irrelevant information.

Details

Motivation: Standard evaluations focus only on correctness while ignoring efficiency, but real-world reasoning requires identifying and ignoring irrelevant information to reason effectively.

Method: Align natural language proofs generated by LMs with shortest proofs from logic programming, quantifying efficiency by measuring avoidance of unnecessary inference. Use dataset of math word problems with irrelevant axioms.

Result: Current LMs show significant accuracy declines even with minimal distractions, and their proofs frequently contain detours through irrelevant inferences.

Conclusion: Language models struggle with reasoning efficiency and often make unnecessary inferences when faced with irrelevant information, highlighting a gap in human-like reasoning capabilities.

Abstract: Modern language models (LMs) exhibit strong deductive reasoning capabilities, yet standard evaluations emphasize correctness while overlooking a key aspect of human-like reasoning: efficiency. In real-world reasoning scenarios, much of the available information is irrelevant, and effective deductive inference requires identifying and ignoring such distractions. We propose a framework for assessing LM reasoning efficiency through the lens of logic programming, introducing a simple method to align proofs written in natural language – as generated by an LM – with shortest proofs found by executing the logic program. Efficiency is quantified by measuring how well a model avoids unnecessary inference. Empirically, we construct a dataset of math word problems injected with various number of irrelevant axioms that vary in semantic overlap with the goal theorem. We find that current LMs show marked accuracy declines under such conditions – even with minimal, domain-consistent distractions – and the proofs they generate frequently exhibit detours through irrelevant inferences.

[58] EHR-R1: A Reasoning-Enhanced Foundational Language Model for Electronic Health Record Analysis

Yusheng Liao, Chaoyi Wu, Junwei Liu, Shuyang Jiang, Pengcheng Qiu, Haowen Wang, Yun Yue, Shuai Zhen, Jian Wang, Qianrui Fan, Jinjie Gu, Ya Zhang, Yanfeng Wang, Yu Wang, Weidi Xie

Main category: cs.CL

TL;DR: EHR-Ins is a large-scale EHR reasoning dataset with 300k reasoning cases across 42 tasks, EHR-R1 is a reasoning-enhanced LLM series up to 72B parameters for EHR analysis, and EHR-Bench is a new benchmark for comprehensive evaluation.

Details

Motivation: LLMs have limited ability to analyze EHRs due to narrow task coverage and lack of EHR-oriented reasoning capabilities, creating a gap in automated clinical decision-making.

Method: Developed a thinking-graph-driven framework to generate EHR-Ins dataset, then trained EHR-R1 LLMs using multi-stage training (domain adaptation, reasoning enhancement, reinforcement learning), and created EHR-Bench benchmark from MIMIC-IV.

Result: EHR-R1 consistently outperforms state-of-the-art LLMs, surpassing GPT-4o by over 30 points on MIMIC-Bench and achieving 10% higher zero-shot AUROC on EHRSHOT.

Conclusion: The EHR-Ins dataset, EHR-R1 models, and EHR-Bench benchmark collectively advance reliable and clinically relevant EHR analysis.

Abstract: Electronic Health Records (EHRs) contain rich yet complex information, and their automated analysis is critical for clinical decision-making. Despite recent advances of large language models (LLMs) in clinical workflows, their ability to analyze EHRs remains limited due to narrow task coverage and lack of EHR-oriented reasoning capabilities. This paper aims to bridge the gap, specifically, we present EHR-Ins, a large-scale, comprehensive EHR reasoning instruction dataset, comprising 300k high-quality reasoning cases and 4M non-reasoning cases across 42 distinct EHR tasks. Its core innovation is a thinking-graph-driven framework that enables to generate high-quality reasoning data at scale. Based on it, we develop EHR-R1, a series of reasoning-enhanced LLMs with up to 72B parameters tailored for EHR analysis. Through a multi-stage training paradigm, including domain adaptation, reasoning enhancement, and reinforcement learning, EHR-R1 systematically acquires domain knowledge and diverse reasoning capabilities, enabling accurate and robust EHR analysis. Lastly, we introduce EHR-Bench, a new benchmark curated from MIMIC-IV, spanning 42 tasks, to comprehensively assess reasoning and prediction across EHR scenarios. In experiments, we show that the resulting EHR-R1 consistently outperforms state-of-the-art commercial and open-source LLMs (including DeepSeek-V3 and GPT-4o), surpassing GPT-4o by over 30 points on MIMIC-Bench and achieving a 10% higher zero-shot AUROC on EHRSHOT. Collectively, EHR-Ins, EHR-R1, and EHR-Bench have significantly advanced the development for more reliable and clinically relevant EHR analysis.

[59] PairUni: Pairwise Training for Unified Multimodal Language Models

Jiani Zheng, Zhiyang Teng, Xiangtai Li, Anran Wang, Yu Tian, Kunpeng Qiu, Ye Tian, Haochen Wang, Zhuochen Wang

Main category: cs.CL

TL;DR: PairUni is a unified framework that reorganizes vision-language data into understanding-generation pairs and uses pair-aware reinforcement learning to balance these tasks in unified models.

Details

Motivation: Unified vision-language models need to perform both understanding and generation tasks, but these tasks use different data and supervision, making it difficult to balance them during reinforcement learning training.

Method: Uses GPT-4 to augment single-task data by generating captions for understanding samples and QA pairs for generation samples, forming aligned pairs from the same instance. Also retrieves semantically related understanding examples for generation samples. Introduces Pair-GPRO, a pair-aware variant of Group Relative Policy Optimization that assigns similarity scores to modulate advantages.

Result: Achieves balanced improvements on various unified vision-language models, outperforming strong UVLM RL baselines. Curated a high-quality dataset of 16K understanding-generation pairs called PairUG.

Conclusion: PairUni effectively balances understanding and generation tasks in unified vision-language models through data pairing and pair-aware reinforcement learning optimization.

Abstract: Unified vision-language models (UVLMs) must perform both understanding and generation within a single architecture, but these tasks rely on heterogeneous data and supervision, making it difficult to balance them during reinforcement learning (RL). We propose PairUni, a unified framework that reorganizes data into understanding-generation (UG) pairs and aligns optimization accordingly. We first use GPT-o3 to augment single-task data, generating captions for understanding samples and question-answer (QA) pairs for generation samples, forming aligned pairs from the same instance. Additionally, for each generation sample, we retrieve a semantically related understanding example to form a retrieved pair, linking different but related data points. These paired structures expose cross-task semantic correspondences and support consistent policy learning. To leverage this structure, we present Pair-GPRO, a pair-aware variant based on Group Relative Policy Optimization. It assigns a similarity score to each pair to modulate the advantage, strengthening learning from well-aligned examples and reducing task interference. We curate a high-quality dataset of 16K UG pairs named PairUG for RL fine-tuning and evaluate PairUni on the powerful Janus-Pro UVLMs. Our approach achieves balanced improvements on various UVLMs, outperforming strong UVLM RL baselines. Code: \href{https://github.com/Haochen-Wang409/PairUni}{github.com/Haochen-Wang409/PairUni}

[60] Interpreting LLMs as Credit Risk Classifiers: Do Their Feature Explanations Align with Classical ML?

Saeed AlMarri, Kristof Juhasz, Mathieu Ravaut, Gautier Marti, Hamdan Al Ahbabi, Ibrahim Elfadel

Main category: cs.CL

TL;DR: LLMs show limitations as standalone classifiers for financial risk prediction compared to LightGBM, with divergent feature importance rankings and unreliable self-explanations.

Details

Motivation: To assess the suitability of LLMs for structured tabular data in high-stakes financial applications like loan default prediction, comparing them with traditional machine learning models.

Method: Systematic comparison between zero-shot LLM classifiers and LightGBM on real-world loan default prediction, using SHAP for feature attribution analysis and evaluating LLM self-explanations.

Result: LLMs can identify key financial risk indicators but their feature importance rankings differ significantly from LightGBM, and their self-explanations often don’t align with empirical SHAP attributions.

Conclusion: LLMs have limitations for structured financial risk prediction and their self-explanations may not be trustworthy, highlighting the need for explainability audits, baseline comparisons, and human oversight in financial deployments.

Abstract: Large Language Models (LLMs) are increasingly explored as flexible alternatives to classical machine learning models for classification tasks through zero-shot prompting. However, their suitability for structured tabular data remains underexplored, especially in high-stakes financial applications such as financial risk assessment. This study conducts a systematic comparison between zero-shot LLM-based classifiers and LightGBM, a state-of-the-art gradient-boosting model, on a real-world loan default prediction task. We evaluate their predictive performance, analyze feature attributions using SHAP, and assess the reliability of LLM-generated self-explanations. While LLMs are able to identify key financial risk indicators, their feature importance rankings diverge notably from LightGBM, and their self-explanations often fail to align with empirical SHAP attributions. These findings highlight the limitations of LLMs as standalone models for structured financial risk prediction and raise concerns about the trustworthiness of their self-generated explanations. Our results underscore the need for explainability audits, baseline comparisons with interpretable models, and human-in-the-loop oversight when deploying LLMs in risk-sensitive financial environments.

[61] The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution

Junlong Li, Wenshuo Zhao, Jian Zhao, Weihao Zeng, Haoze Wu, Xiaochen Wang, Rui Ge, Yuxuan Cao, Yuzhen Huang, Wei Liu, Junteng Liu, Zhaochen Su, Yiyang Guo, Fan Zhou, Lueyang Zhang, Juan Michelini, Xingyao Wang, Xiang Yue, Shuyan Zhou, Graham Neubig, Junxian He

Main category: cs.CL

TL;DR: Toolathlon is a comprehensive benchmark for language agents that evaluates their ability to handle complex, multi-step workflows across diverse real-world applications, revealing significant performance gaps in current state-of-the-art models.

Details

Motivation: Existing language agent benchmarks focus on narrow domains or simplified tasks, lacking the diversity, realism, and long-horizon complexity needed to evaluate real-world performance of agents handling complex workflows across multiple applications.

Method: Created Toolathlon benchmark spanning 32 software applications and 604 tools, using high-quality Model Context Protocol servers with realistic initial environment states from real software, and 108 manually crafted tasks requiring multi-app interactions over ~20 turns on average.

Result: Comprehensive evaluation shows significant shortcomings: Claude-4.5-Sonnet achieves only 38.6% success rate with 20.2 tool calling turns, while top open-weights model DeepSeek-V3.2-Exp reaches 20.1% success rate.

Conclusion: Toolathlon addresses the gap in realistic language agent evaluation and is expected to drive development of more capable agents for real-world, long-horizon task execution.

Abstract: Real-world language agents must handle complex, multi-step workflows across diverse Apps. For instance, an agent may manage emails by coordinating with calendars and file systems, or monitor a production database to detect anomalies and generate reports following an operating manual. However, existing language agent benchmarks often focus on narrow domains or simplified tasks that lack the diversity, realism, and long-horizon complexity required to evaluate agents’ real-world performance. To address this gap, we introduce the Tool Decathlon (dubbed as Toolathlon), a benchmark for language agents offering diverse Apps and tools, realistic environment setup, and reliable execution-based evaluation. Toolathlon spans 32 software applications and 604 tools, ranging from everyday platforms such as Google Calendar and Notion to professional ones like WooCommerce, Kubernetes, and BigQuery. Most of the tools are based on a high-quality set of Model Context Protocol (MCP) servers that we may have revised or implemented ourselves. Unlike prior works, which primarily ensure functional realism but offer limited environment state diversity, we provide realistic initial environment states from real software, such as Canvas courses with dozens of students or real financial spreadsheets. This benchmark includes 108 manually sourced or crafted tasks in total, requiring interacting with multiple Apps over around 20 turns on average to complete. Each task is strictly verifiable through dedicated evaluation scripts. Comprehensive evaluation of SOTA models highlights their significant shortcomings: the best-performing model, Claude-4.5-Sonnet, achieves only a 38.6% success rate with 20.2 tool calling turns on average, while the top open-weights model DeepSeek-V3.2-Exp reaches 20.1%. We expect Toolathlon to drive the development of more capable language agents for real-world, long-horizon task execution.

[62] The Limits of Obliviate: Evaluating Unlearning in LLMs via Stimulus-Knowledge Entanglement-Behavior Framework

Aakriti Shah, Thai Le

Main category: cs.CL

TL;DR: Persuasive prompting can recall factual knowledge from deliberately unlearned LLMs, with effectiveness inversely correlated to model size.

Details

Motivation: Evaluating unlearning effectiveness in LLMs is crucial for managing sensitive data and correcting misinformation, but remains an open problem.

Method: Introduce Stimulus-Knowledge Entanglement-Behavior Framework (SKeB) based on ACT-R and Hebbian theory, modeling information entanglement via domain graphs and testing factual recall with persuasive framing. Develop entanglement metrics to quantify knowledge activation patterns.

Result: Persuasive prompts substantially enhance factual knowledge recall (14.8% baseline vs. 24.5% with authority framing), with effectiveness inversely correlated to model size (128% recovery in 2.7B vs. 15% in 13B).

Conclusion: SKeB provides a foundation for assessing unlearning completeness, robustness, and overall behavior in LLMs.

Abstract: Unlearning in large language models (LLMs) is crucial for managing sensitive data and correcting misinformation, yet evaluating its effectiveness remains an open problem. We investigate whether persuasive prompting can recall factual knowledge from deliberately unlearned LLMs across models ranging from 2.7B to 13B parameters (OPT-2.7B, LLaMA-2-7B, LLaMA-3.1-8B, LLaMA-2-13B). Drawing from ACT-R and Hebbian theory (spreading activation theories), as well as communication principles, we introduce Stimulus-Knowledge Entanglement-Behavior Framework (SKeB), which models information entanglement via domain graphs and tests whether factual recall in unlearned models is correlated with persuasive framing. We develop entanglement metrics to quantify knowledge activation patterns and evaluate factuality, non-factuality, and hallucination in outputs. Our results show persuasive prompts substantially enhance factual knowledge recall (14.8% baseline vs. 24.5% with authority framing), with effectiveness inversely correlated to model size (128% recovery in 2.7B vs. 15% in 13B). SKeB provides a foundation for assessing unlearning completeness, robustness, and overall behavior in LLMs.

[63] Scaling Latent Reasoning via Looped Language Models

Rui-Jie Zhu, Zixuan Wang, Kai Hua, Tianyu Zhang, Ziniu Li, Haoran Que, Boyi Wei, Zixin Wen, Fan Yin, He Xing, Lu Li, Jiajun Shi, Kaijing Ma, Shanda Li, Taylor Kergan, Andrew Smith, Xingwei Qu, Mude Hui, Bohong Wu, Qiyang Min, Hongzhi Huang, Xun Zhou, Wei Ye, Jiaheng Liu, Jian Yang, Yunfeng Shi, Chenghua Lin, Enduo Zhao, Tianle Cai, Ge Zhang, Wenhao Huang, Yoshua Bengio, Jason Eshraghian

Main category: cs.CL

TL;DR: Ouro is a family of pre-trained Looped Language Models that integrate reasoning into pre-training through iterative latent computation and entropy-regularized depth allocation, achieving superior performance with smaller models.

Details

Motivation: Current LLMs rely on explicit text generation for reasoning (like chain-of-thought), which under-utilizes pre-training data and defers reasoning to post-training phases.

Method: Developed LoopLM with three key components: iterative computation in latent space, entropy-regularized objective for learned depth allocation, and scaling to 7.7T tokens during pre-training.

Result: Ouro 1.4B and 2.6B models match performance of up to 12B SOTA LLMs across benchmarks, with advantages stemming from superior knowledge manipulation rather than increased capacity. Reasoning traces are more aligned with final outputs than explicit CoT.

Conclusion: LoopLM demonstrates potential as a novel scaling direction for reasoning capabilities, showing that building reasoning into pre-training can yield significant performance improvements with smaller model sizes.

Abstract: Modern LLMs are trained to “think” primarily via explicit text generation, such as chain-of-thought (CoT), which defers reasoning to post-training and under-leverages pre-training data. We present and open-source Ouro, named after the recursive Ouroboros, a family of pre-trained Looped Language Models (LoopLM) that instead build reasoning into the pre-training phase through (i) iterative computation in latent space, (ii) an entropy-regularized objective for learned depth allocation, and (iii) scaling to 7.7T tokens. Ouro 1.4B and 2.6B models enjoy superior performance that match the results of up to 12B SOTA LLMs across a wide range of benchmarks. Through controlled experiments, we show this advantage stems not from increased knowledge capacity, but from superior knowledge manipulation capabilities. We also show that LoopLM yields reasoning traces more aligned with final outputs than explicit CoT. We hope our results show the potential of LoopLM as a novel scaling direction in the reasoning era. Our model could be found in: http://ouro-llm.github.io.

[64] Task Completion Agents are Not Ideal Collaborators

Shannon Zejiang Shen, Valerie Chen, Ken Gu, Alexis Ross, Zixian Ma, Jillian Ross, Alex Gu, Chenglei Si, Wayne Chi, Andi Peng, Jocelyn J Shen, Ameet Talwalkar, Tongshuang Wu, David Sontag

Main category: cs.CL

TL;DR: The paper proposes shifting from task completion agents to collaborative agents, introducing collaborative effort scaling as a framework to measure how agent utility grows with user involvement.

Details

Motivation: Current agent evaluations focus on one-shot task completion, failing to account for the iterative and collaborative nature of real-world problems where human goals are often underspecified and evolve.

Method: Introduces collaborative effort scaling framework and uses case studies with simulated evaluations to analyze agent performance in multi-turn scenarios.

Result: State-of-the-art agents often underperform in multi-turn, real-world scenarios, revealing they lack the ability to sustain engagement and scaffold user understanding.

Conclusion: Collaborative effort scaling provides a diagnostic tool for agent behavior and guides development toward more effective human-agent interactions.

Abstract: Current evaluations of agents remain centered around one-shot task completion, failing to account for the inherently iterative and collaborative nature of many real-world problems, where human goals are often underspecified and evolve. We argue for a shift from building and assessing task completion agents to developing collaborative agents, assessed not only by the quality of their final outputs but by how well they engage with and enhance human effort throughout the problem-solving process. To support this shift, we introduce collaborative effort scaling, a framework that captures how an agent’s utility grows with increasing user involvement. Through case studies and simulated evaluations, we show that state-of-the-art agents often underperform in multi-turn, real-world scenarios, revealing a missing ingredient in agent design: the ability to sustain engagement and scaffold user understanding. Collaborative effort scaling offers a lens for diagnosing agent behavior and guiding development toward more effective interactions.

[65] DiagramEval: Evaluating LLM-Generated Diagrams via Graphs

Chumeng Liang, Jiaxuan You

Main category: cs.CL

TL;DR: DiagramEval is a novel evaluation metric that assesses LLM-generated demonstration diagrams by treating them as graphs and using node alignment and path alignment metrics.

Details

Motivation: Standard image generative models struggle to produce clear diagrams with well-defined structure, and there's a lack of discriminative metrics for evaluating LLM-generated diagrams.

Method: Conceptualizes diagrams as graphs (text elements as nodes, connections as directed edges) and uses two metric groups: node alignment and path alignment.

Result: Effectively evaluated diagrams from state-of-the-art LLMs on recent research literature, quantitatively demonstrating metric validity with enhanced explainability.

Conclusion: DiagramEval provides valuable insights into LLM-generated diagram characteristics and offers a robust evaluation framework for demonstration diagrams.

Abstract: Diagrams play a central role in research papers for conveying ideas, yet they are often notoriously complex and labor-intensive to create. Although diagrams are presented as images, standard image generative models struggle to produce clear diagrams with well-defined structure. We argue that a promising direction is to generate demonstration diagrams directly in textual form as SVGs, which can leverage recent advances in large language models (LLMs). However, due to the complexity of components and the multimodal nature of diagrams, sufficiently discriminative and explainable metrics for evaluating the quality of LLM-generated diagrams remain lacking. In this paper, we propose DiagramEval, a novel evaluation metric designed to assess demonstration diagrams generated by LLMs. Specifically, DiagramEval conceptualizes diagrams as graphs, treating text elements as nodes and their connections as directed edges, and evaluates diagram quality using two new groups of metrics: node alignment and path alignment. For the first time, we effectively evaluate diagrams produced by state-of-the-art LLMs on recent research literature, quantitatively demonstrating the validity of our metrics. Furthermore, we show how the enhanced explainability of our proposed metrics offers valuable insights into the characteristics of LLM-generated diagrams. Code: https://github.com/ulab-uiuc/diagram-eval.

[66] Decomposition-Enhanced Training for Post-Hoc Attributions In Language Models

Sriram Balasubramaniam, Samyadeep Basu, Koustava Goswami, Ryan Rossi, Varun Manjunatha, Roshan Santhosh, Ruiyi Zhang, Soheil Feizi, Nedim Lipka

Main category: cs.CL

TL;DR: DecompTune is a post-training method that teaches LLMs to generate answer decompositions as intermediate reasoning steps, improving attribution quality for complex QA tasks.

Details

Motivation: Existing post-hoc attribution methods struggle with multi-hop, abstractive, and semi-extractive QA where answers synthesize information across passages, highlighting the need for better attribution in long-document question answering.

Method: Reframe attribution as reasoning by decomposing answers into constituent units tied to specific context. Use DecompTune - a two-stage SFT + GRPO pipeline with curated rewards, post-training models on a diverse dataset of complex QA tasks annotated with decompositions.

Result: DecompTune substantially improves attribution quality, outperforming prior methods and matching or exceeding state-of-the-art frontier models across extensive experiments and ablations.

Conclusion: Teaching models to produce answer decompositions as intermediate reasoning steps effectively improves attribution reliability for complex question answering tasks.

Abstract: Large language models (LLMs) are increasingly used for long-document question answering, where reliable attribution to sources is critical for trust. Existing post-hoc attribution methods work well for extractive QA but struggle in multi-hop, abstractive, and semi-extractive settings, where answers synthesize information across passages. To address these challenges, we argue that post-hoc attribution can be reframed as a reasoning problem, where answers are decomposed into constituent units, each tied to specific context. We first show that prompting models to generate such decompositions alongside attributions improves performance. Building on this, we introduce DecompTune, a post-training method that teaches models to produce answer decompositions as intermediate reasoning steps. We curate a diverse dataset of complex QA tasks, annotated with decompositions by a strong LLM, and post-train Qwen-2.5 (7B and 14B) using a two-stage SFT + GRPO pipeline with task-specific curated rewards. Across extensive experiments and ablations, DecompTune substantially improves attribution quality, outperforming prior methods and matching or exceeding state-of-the-art frontier models.

[67] Gaperon: A Peppered English-French Generative Language Model Suite

Nathan Godey, Wissam Antoun, Rian Touchent, Rachel Bawden, Éric de la Clergerie, Benoît Sagot, Djamé Seddah

Main category: cs.CL

TL;DR: Gaperon is an open suite of French-English-coding language models (1.5B, 8B, 24B parameters) trained on 2-4T tokens, released with full training pipeline including datasets, code, and checkpoints to study data filtering, contamination, and their impact on benchmark vs generative performance.

Details

Motivation: To advance transparency and reproducibility in large-scale model training by providing a fully open foundation for studying data curation, evaluation, safety, and openness in multilingual language model development.

Method: Trained models on 2-4 trillion tokens using French and English datasets filtered with neural quality classifier, with efficient data curation framework and hundreds of intermediate checkpoints. Studied data filtering effects and introduced deliberate contamination (late training on data including test sets) and harmless data poisoning.

Result: Found that linguistic quality filtering enhances text fluency but yields subpar benchmark results, while late deliberate contamination recovers competitive scores with reasonable harm to generation quality. Neural filtering can unintentionally amplify benchmark leakage.

Conclusion: Gaperon establishes a reproducible foundation for exploring trade-offs between data curation, evaluation, safety, and openness in multilingual language model development, providing insights into how data filtering and contamination shape both benchmark and generative performance.

Abstract: We release Gaperon, a fully open suite of French-English-coding language models designed to advance transparency and reproducibility in large-scale model training. The Gaperon family includes 1.5B, 8B, and 24B parameter models trained on 2-4 trillion tokens, released with all elements of the training pipeline: French and English datasets filtered with a neural quality classifier, an efficient data curation and training framework, and hundreds of intermediate checkpoints. Through this work, we study how data filtering and contamination interact to shape both benchmark and generative performance. We find that filtering for linguistic quality enhances text fluency and coherence but yields subpar benchmark results, and that late deliberate contamination – continuing training on data mixes that include test sets – recovers competitive scores while only reasonably harming generation quality. We discuss how usual neural filtering can unintentionally amplify benchmark leakage. To support further research, we also introduce harmless data poisoning during pretraining, providing a realistic testbed for safety studies. By openly releasing all models, datasets, code, and checkpoints, Gaperon establishes a reproducible foundation for exploring the trade-offs between data curation, evaluation, safety, and openness in multilingual language model development.

[68] Large Language Models for Few-Shot Named Entity Recognition

Yufei Zhao, Xiaoshi Zhong, Erik Cambria, Jagath C. Rajapakse

Main category: cs.CL

TL;DR: GPT4NER is a method that uses LLMs with effective prompts (entity definition, few-shot examples, chain-of-thought) to transform few-shot NER from sequence-labeling to sequence-generation, achieving strong performance on benchmark datasets.

Details

Motivation: To fully leverage PLMs and LLMs for NER with minimal human effort, addressing the challenge of few-shot learning in named entity recognition.

Method: GPT4NER prompts LLMs using three components: entity definition, few-shot examples, and chain-of-thought reasoning, transforming NER from sequence-labeling to sequence-generation.

Result: Achieved F1 scores of 83.15% on CoNLL2003 and 70.37% on OntoNotes5.0, outperforming few-shot baselines by average 7 points and reaching 87.9%/76.4% of fully-supervised best performance.

Conclusion: GPT4NER effectively addresses few-shot NER using LLM prompting, with relaxed-match metrics and NEE sub-task analysis providing better understanding of model behaviors.

Abstract: Named entity recognition (NER) is a fundamental task in numerous downstream applications. Recently, researchers have employed pre-trained language models (PLMs) and large language models (LLMs) to address this task. However, fully leveraging the capabilities of PLMs and LLMs with minimal human effort remains challenging. In this paper, we propose GPT4NER, a method that prompts LLMs to resolve the few-shot NER task. GPT4NER constructs effective prompts using three key components: entity definition, few-shot examples, and chain-of-thought. By prompting LLMs with these effective prompts, GPT4NER transforms few-shot NER, which is traditionally considered as a sequence-labeling problem, into a sequence-generation problem. We conduct experiments on two benchmark datasets, CoNLL2003 and OntoNotes5.0, and compare the performance of GPT4NER to representative state-of-the-art models in both few-shot and fully supervised settings. Experimental results demonstrate that GPT4NER achieves the $F_1$ of 83.15% on CoNLL2003 and 70.37% on OntoNotes5.0, significantly outperforming few-shot baselines by an average margin of 7 points. Compared to fully-supervised baselines, GPT4NER achieves 87.9% of their best performance on CoNLL2003 and 76.4% of their best performance on OntoNotes5.0. We also utilize a relaxed-match metric for evaluation and report performance in the sub-task of named entity extraction (NEE), and experiments demonstrate their usefulness to help better understand model behaviors in the NER task.

[69] Do predictability factors towards signing avatars hold across cultures?

Abdelhadi Soudi, Manal El Hakkaoui, Kristof Van Laerhoven

Main category: cs.CL

TL;DR: Study examines factors influencing sign language users’ attitudes toward signing avatars across cultures, comparing American and Moroccan sign language users.

Details

Motivation: Avatar technology can improve accessibility for Deaf and Hard-of-Hearing sign language users, but acceptance varies and most research is conducted by non-Deaf researchers.

Method: Designed questionnaire to understand Moroccan Sign Language users’ attitudes toward avatars, surveying three groups: Deaf (57), Hearing (20), and Hard-of-Hearing (3) participants.

Result: Compared results with other relevant studies to examine if factors like technology experience affect attitudes similarly across different sign language cultures.

Conclusion: The study investigates how intrinsic (avatar characteristics) and extrinsic (user demographics, experience) factors predict attitudes toward avatars across different cultural contexts.

Abstract: Avatar technology can offer accessibility possibilities and improve the Deaf-and-Hard of Hearing sign language users access to communication, education and services, such as the healthcare system. However, sign language users acceptance of signing avatars as well as their attitudes towards them vary and depend on many factors. Furthermore, research on avatar technology is mostly done by researchers who are not Deaf. The study examines the extent to which intrinsic or extrinsic factors contribute to predict the attitude towards avatars across cultures. Intrinsic factors include the characteristics of the avatar, such as appearance, movements and facial expressions. Extrinsic factors include users technology experience, their hearing status, age and their sign language fluency. This work attempts to answer questions such as, if lower attitude ratings are related to poor technology experience with ASL users, for example, is that also true for Moroccan Sign Language (MSL) users? For the purposes of the study, we designed a questionnaire to understand MSL users attitude towards avatars. Three groups of participants were surveyed: Deaf (57), Hearing (20) and Hard-of-Hearing (3). The results of our study were then compared with those reported in other relevant studies.

[70] OpenFactCheck: Building, Benchmarking Customized Fact-Checking Systems and Evaluating the Factuality of Claims and LLMs

Yuxia Wang, Minghan Wang, Hasan Iqbal, Georgi Georgiev, Jiahui Geng, Preslav Nakov

Main category: cs.CL

TL;DR: OpenFactCheck is a unified framework for automatic fact-checking that addresses challenges in verifying LLM outputs through three modules: customizable fact-checkers, standardized LLM evaluation, and fact-checker reliability assessment.

Details

Motivation: The increased use of LLMs in real-world applications requires mechanisms to verify factual accuracy, but current approaches face challenges with free-form responses in open domains and lack standardized benchmarks for comparison.

Method: OpenFactCheck consists of three modules: CUSTCHECKER for customizing automatic fact-checkers, LLMEVAL for unified evaluation of LLM factuality, and CHECKEREVAL for assessing fact-checker reliability using human-annotated datasets.

Result: The framework provides tools for building customized fact-checking systems, benchmarking accuracy, evaluating LLM factuality, and verifying claims in documents, with publicly available code and data.

Conclusion: OpenFactCheck addresses key challenges in LLM factuality verification by providing a unified, extensible framework that enables fair comparisons and reliable assessment of factual accuracy across different systems and applications.

Abstract: The increased use of large language models (LLMs) across a variety of real-world applications calls for mechanisms to verify the factual accuracy of their outputs. Difficulties lie in assessing the factuality of free-form responses in open domains. Also, different papers use disparate evaluation benchmarks and measurements, which renders them hard to compare and hampers future progress. To mitigate these issues, we propose OpenFactCheck, a unified framework for building customized automatic fact-checking systems, benchmarking their accuracy, evaluating factuality of LLMs, and verifying claims in a document. OpenFactCheck consists of three modules: (i) CUSTCHECKER allows users to easily customize an automatic fact-checker and verify the factual correctness of documents and claims, (ii) LLMEVAL, a unified evaluation framework assesses LLM’s factuality ability from various perspectives fairly, and (iii) CHECKEREVAL is an extensible solution for gauging the reliability of automatic fact-checkers’ verification results using human-annotated datasets. Data and code are publicly available at https://github.com/yuxiaw/openfactcheck.

[71] RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness

Tianyu Yu, Haoye Zhang, Qiming Li, Qixin Xu, Yuan Yao, Da Chen, Xiaoman Lu, Ganqu Cui, Yunkai Dang, Taiwen He, Xiaocheng Feng, Jun Song, Bo Zheng, Zhiyuan Liu, Tat-Seng Chua, Maosong Sun

Main category: cs.CL

TL;DR: RLAIF-V is a fully open-source framework that reduces hallucinations in multimodal large language models (MLLMs) by generating high-quality feedback data and using self-feedback guidance, achieving significant hallucination reduction without relying on manual labeling or proprietary models.

Details

Motivation: Traditional feedback learning for hallucination reduction depends on labor-intensive manual labeling or expensive proprietary models, leaving a gap in knowledge about building high-quality feedback with open-source MLLMs.

Method: RLAIF-V explores open-source MLLMs from two perspectives: generating high-quality feedback data for preference learning and using self-feedback guidance for inference-time scaling.

Result: RLAIF-V 7B reduces object hallucination by 80.7% and overall hallucination by 33.7%. RLAIF-V 12B achieves super GPT-4V trustworthiness through self-alignment.

Conclusion: RLAIF-V demonstrates that open-source MLLMs can be effectively aligned through self-feedback mechanisms, substantially enhancing model trustworthiness in both preference learning and inference phases.

Abstract: Traditional feedback learning for hallucination reduction relies on labor-intensive manual labeling or expensive proprietary models. This leaves the community without foundational knowledge about how to build high-quality feedback with open-source MLLMs. In this work, we introduce RLAIF-V, a novel framework that aligns MLLMs in a fully open-source paradigm. RLAIF-V maximally explores open-source MLLMs from two perspectives, including high-quality feedback data generation for preference learning and self-feedback guidance for inference-time scaling. Extensive experiments on six benchmarks in both automatic and human evaluation show that RLAIF-V substantially enhances the trustworthiness of models at both preference learning and inference time. RLAIF-V 7B reduces object hallucination by 80.7% and overall hallucination by 33.7%. Remarkably, RLAIF-V 12B further reveals the self-alignment potential of open-source MLLMs, where the model can learn from feedback of itself to achieve super GPT-4V trustworthiness.

[72] Reliable Evaluation and Benchmarks for Statement Autoformalization

Auguste Poiroux, Gail Weiss, Viktor Kunčak, Antoine Bosselut

Main category: cs.CL

TL;DR: This paper presents a comprehensive framework for evaluating statement autoformalization (translating natural language math to formal languages like Lean 4) through improved metrics, robust benchmarks, and systematic evaluation.

Details

Motivation: Evaluating statement autoformalization remains challenging due to few metrics, datasets, and standards to measure progress robustly.

Method: Introduces BEq+ (automated metric correlating with human judgment), ProofNetVerif dataset (3,752 annotated examples), and two new benchmarks: ProofNet# (corrected ProofNet) and RLM25 (619 research-level math pairs from six formalization projects).

Result: Current techniques achieve up to 45.1% accuracy on undergraduate mathematics but struggle with research-level content without proper context.

Conclusion: The work establishes a reliable foundation for evaluating and advancing autoformalization systems.

Abstract: Evaluating statement autoformalization, translating natural language mathematics into formal languages like Lean 4, remains a significant challenge, with few metrics, datasets, and standards to robustly measure progress. In this work, we present a comprehensive approach combining improved metrics, robust benchmarks, and systematic evaluation, to fill this gap. First, we introduce BEq+, an automated metric that correlates strongly with human judgment, along with ProofNetVerif, a new dataset for assessing the quality of evaluation metrics, containing 3,752 annotated examples. Second, we develop two new autoformalization benchmarks: ProofNet#, a corrected version of ProofNet, and RLM25, with 619 new pairs of research-level mathematics from six formalization projects. Through systematic experimentation across these benchmarks, we find that current techniques can achieve up to 45.1% accuracy on undergraduate mathematics but struggle with research-level content without proper context. Our work establishes a reliable foundation for evaluating and advancing autoformalization systems.

[73] OpenFactCheck: A Unified Framework for Factuality Evaluation of LLMs

Hasan Iqbal, Yuxia Wang, Minghan Wang, Georgi Georgiev, Jiahui Geng, Iryna Gurevych, Preslav Nakov

Main category: cs.CL

TL;DR: OpenFactCheck is a unified framework for automatic fact-checking of LLM outputs, featuring three modules for response evaluation, LLM assessment, and fact-checking system evaluation.

Details

Motivation: The increased use of LLMs in real-world applications requires automatic tools to check factual accuracy due to frequent hallucinations, but existing research uses different benchmarks and measures making comparison difficult.

Method: Developed OpenFactCheck with three modules: RESPONSEEVAL for customizing automatic fact-checking systems and assessing claim factuality, LLMEVAL for overall LLM factuality assessment, and CHECKEREVAL for evaluating automatic fact-checking systems.

Result: OpenFactCheck is open-sourced as a Python library and web service, providing a unified framework to address the fragmentation in LLM fact-checking evaluation.

Conclusion: OpenFactCheck provides a standardized framework to facilitate comparison and future progress in automatic fact-checking of LLM outputs, addressing current fragmentation in evaluation approaches.

Abstract: The increased use of large language models (LLMs) across a variety of real-world applications calls for automatic tools to check the factual accuracy of their outputs, as LLMs often hallucinate. This is difficult as it requires assessing the factuality of free-form open-domain responses. While there has been a lot of research on this topic, different papers use different evaluation benchmarks and measures, which makes them hard to compare and hampers future progress. To mitigate these issues, we developed OpenFactCheck, a unified framework, with three modules: (i) RESPONSEEVAL, which allows users to easily customize an automatic fact-checking system and to assess the factuality of all claims in an input document using that system, (ii) LLMEVAL, which assesses the overall factuality of an LLM, and (iii) CHECKEREVAL, a module to evaluate automatic fact-checking systems. OpenFactCheck is open-sourced (https://github.com/mbzuai-nlp/openfactcheck) and publicly released as a Python library (https://pypi.org/project/openfactcheck/) and also as a web service (http://app.openfactcheck.com). A video describing the system is available at https://youtu.be/-i9VKL0HleI.

Jinghan Zhang, Fengran Mo, Tharindu Cyril Weerasooriya, Xinyue Ye, Dongjie Wang, Yanjie Fu, Kunpeng Liu

Main category: cs.CL

TL;DR: TSE is a framework that expands thought structures in LLM reasoning by identifying key nodes, generating new nodes from multiple chains, and extending branches to overcome blind spots in the solution space.

Details

Motivation: Existing chain-structured reasoning methods rely on previously generated logical directions and ignore unexplored regions of the solution space (blind spots), limiting reasoning diversity and effectiveness.

Method: TSE framework: 1) Identifies key nodes with high impact, 2) Generates new nodes by integrating information from multiple chains, 3) Extends new branches through connection strategies.

Result: TSE improves accuracy of both final answers and intermediate reasoning steps on math and QA benchmarks, while maintaining better effectiveness-efficiency trade-off compared to baseline methods.

Conclusion: The Thought Space Explorer framework successfully addresses blind spots in LLM reasoning by systematically expanding thought structures, leading to improved reasoning performance and practical deployment viability.

Abstract: Large language models have shown strong reasoning capabilities through chain-structured methods such as Chain-of-Thought. Recent studies optimize thought structures by generating parallel or tree-like structures, switching between long and short reasoning modes, or aligning reasoning steps with task performance. However, these approaches mainly rely on previously generated logical directions of the chains, which ignore the unexplored regions of the solution space. Such a phenomenon is defined as blind spots, which limit the diversity and effectiveness of the reasoning process. To this end, we propose the ``Thought Space Explorer’’ (TSE), a framework for navigating and expanding thought structures to overcome blind spots in LLM reasoning. Our TSE first identifies key nodes with high impact, then generates new nodes by integrating information from multiple chains. Finally, it extends new branches through connection strategies. We conduct a series of experiments on math and QA benchmarks. Compared with existing baseline methods, TSE improves the accuracy of both the final answer and intermediate reasoning steps, while maintaining a better effectiveness-efficiency trade-off for practical deployment.

[75] Face the Facts! Evaluating RAG-based Pipelines for Professional Fact-Checking

Daniel Russo, Stefano Menini, Jacopo Staiano, Marco Guerini

Main category: cs.CL

TL;DR: This paper benchmarks RAG-based methods for automated fact-checking, evaluating them on complex claims and diverse knowledge bases, with findings showing LLM-based retrievers outperform others but struggle with heterogeneous data.

Details

Motivation: To complement professional fact-checking by addressing constraints in current automated fact-checking pipelines and benchmark RAG-based methods following professional practices.

Method: Used Retrieval-Augmented Generation (RAG) paradigm to generate verdicts, evaluated on stylistically complex claims and heterogeneous knowledge bases, comparing different retrieval techniques and model sizes.

Result: LLM-based retrievers outperformed other techniques but struggled with heterogeneous knowledge bases; larger models had better verdict faithfulness while smaller models provided better context adherence; human evaluations favored zero-shot/one-shot approaches for informativeness and fine-tuned models for emotional alignment.

Conclusion: The study reveals a complex landscape in automated fact-checking where different approaches excel in different aspects, highlighting trade-offs between model size, retrieval methods, and evaluation metrics.

Abstract: Natural Language Processing and Generation systems have recently shown the potential to complement and streamline the costly and time-consuming job of professional fact-checkers. In this work, we lift several constraints of current state-of-the-art pipelines for automated fact-checking based on the Retrieval-Augmented Generation (RAG) paradigm. Our goal is to benchmark, following professional fact-checking practices, RAG-based methods for the generation of verdicts - i.e., short texts discussing the veracity of a claim - evaluating them on stylistically complex claims and heterogeneous, yet reliable, knowledge bases. Our findings show a complex landscape, where, for example, LLM-based retrievers outperform other retrieval techniques, though they still struggle with heterogeneous knowledge bases; larger models excel in verdict faithfulness, while smaller models provide better context adherence, with human evaluations favouring zero-shot and one-shot approaches for informativeness, and fine-tuned models for emotional alignment.

Wenlu Fan, Yuqi Zhu, Bin Wang, Wentao Xu

Main category: cs.CL

TL;DR: LLMs show emotional moderation by reducing negative emotions and preferring neutral responses in social media contexts, while maintaining high semantic coherence.

Details

Motivation: To understand how LLMs handle emotional content and maintain semantic relationships in social media contexts, particularly in climate change discussions.

Method: Used continuation and response tasks with three open-source models (Gemma, Llama3, Llama3.3) and one commercial model (Claude) on Twitter and Reddit climate change discussions, analyzing emotional transitions, intensity patterns, and semantic consistency.

Result: LLMs maintain high semantic coherence but moderate negative emotions by converting them to neutral or positive emotions, and generate responses with reduced emotional intensity compared to human-authored content.

Conclusion: LLMs exhibit systematic emotional moderation and preference for neutral rational emotions while preserving semantic similarity, providing important insights for social media deployment and human-computer interaction design.

Abstract: Large Language Models (LLMs) demonstrate remarkable capabilities in text generation, yet their emotional consistency and semantic coherence in social media contexts remain insufficiently understood. This study investigates how LLMs handle emotional content and maintain semantic relationships through continuation and response tasks using three open-source models: Gemma, Llama3 and Llama3.3 and one commercial Model:Claude. By analyzing climate change discussions from Twitter and Reddit, we examine emotional transitions, intensity patterns, and semantic consistency between human-authored and LLM-generated content. Our findings reveal that while both models maintain high semantic coherence, they exhibit distinct emotional patterns: these models show a strong tendency to moderate negative emotions. When the input text carries negative emotions such as anger, disgust, fear, or sadness, LLM tends to generate content with more neutral emotions, or even convert them into positive emotions such as joy or surprise. At the same time, we compared the LLM-generated content with human-authored content. The four models systematically generated responses with reduced emotional intensity and showed a preference for neutral rational emotions in the response task. In addition, these models all maintained a high semantic similarity with the original text, although their performance in the continuation task and the response task was different. These findings provide deep insights into the emotion and semantic processing capabilities of LLM, which are of great significance for its deployment in social media environments and human-computer interaction design.

[77] Spontaneous Giving and Calculated Greed in Language Models

Yuxuan Li, Hirokazu Shirado

Main category: cs.CL

TL;DR: LLMs with reasoning techniques like chain-of-thought and reflection reduce cooperation and norm enforcement in social dilemmas, favoring individual rationality over collective gains.

Details

Motivation: To examine whether LLMs' reasoning capabilities extend to social intelligence in cooperative contexts, specifically in economic games simulating social dilemmas.

Method: Applied chain-of-thought and reflection prompting to GPT-4o in Public Goods Game, then evaluated multiple off-the-shelf models across six cooperation and punishment games with and without explicit reasoning mechanisms.

Result: Reasoning models consistently reduced cooperation and norm enforcement, favoring individual rationality. In repeated interactions, groups with more reasoning agents showed lower collective gains, mirroring human patterns of “spontaneous giving and calculated greed.”

Conclusion: LLM architectures need to incorporate social intelligence alongside reasoning to help address rather than reinforce collective action challenges.

Abstract: Large language models demonstrate strong problem-solving abilities through reasoning techniques such as chain-of-thought prompting and reflection. However, it remains unclear whether these reasoning capabilities extend to a form of social intelligence: making effective decisions in cooperative contexts. We examine this question using economic games that simulate social dilemmas. First, we apply chain-of-thought and reflection prompting to GPT-4o in a Public Goods Game. We then evaluate multiple off-the-shelf models across six cooperation and punishment games, comparing those with and without explicit reasoning mechanisms. We find that reasoning models consistently reduce cooperation and norm enforcement, favoring individual rationality. In repeated interactions, groups with more reasoning agents exhibit lower collective gains. These behaviors mirror human patterns of “spontaneous giving and calculated greed.” Our findings underscore the need for LLM architectures that incorporate social intelligence alongside reasoning, to help address–rather than reinforce–the challenges of collective action.

[78] S’MoRE: Structural Mixture of Residual Experts for Parameter-Efficient LLM Fine-tuning

Hanqing Zeng, Yinglong Xia, Zhuokai Zhao, Chuan Jiang, Qiang Zhang, Jiayi Liu, Qunshu Zhang, Lizhu Zhang, Xiangjun Fan, Benyu Zhang

Main category: cs.CL

TL;DR: S’MoRE is a novel framework that combines LoRA’s efficiency with MoE’s flexibility using hierarchical low-rank decomposition and residual experts, achieving superior fine-tuning performance for LLMs.

Details

Motivation: To address the limitations of existing methods - LoRA lacks flexibility while MoE has parameter inefficiency and under-utilization - by creating a balanced approach that maintains both efficiency and model capacity.

Method: Uses hierarchical low-rank decomposition of expert weights to create residuals of varying orders in a multi-layer structure, routes input tokens through sub-trees of residuals, and implements inter-layer propagation as a special Graph Neural Network.

Result: S’MoRE improves structural flexibility of traditional MoE by exponential order under similar parameter budget and achieves superior fine-tuning performance compared to existing methods.

Conclusion: S’MoRE offers a transformative approach for efficient LLM adaptation by seamlessly integrating LoRA’s efficiency with MoE’s flexibility through its innovative structural design.

Abstract: Fine-tuning pre-trained large language models (LLMs) presents a dual challenge of balancing parameter efficiency and model capacity. Existing methods like low-rank adaptations (LoRA) are efficient but lack flexibility, while Mixture-of-Experts (MoE) enhance model capacity at the cost of more & under-utilized parameters. To address these limitations, we propose Structural Mixture of Residual Experts (S’MoRE), a novel framework that seamlessly integrates the efficiency of LoRA with the flexibility of MoE. Conceptually, S’MoRE employs hierarchical low-rank decomposition of expert weights, yielding residuals of varying orders interconnected in a multi-layer structure. By routing input tokens through sub-trees of residuals, S’MoRE emulates the capacity of numerous experts by instantiating and assembling just a few low-rank matrices. We craft the inter-layer propagation of S’MoRE’s residuals as a special type of Graph Neural Network (GNN), and prove that under similar parameter budget, S’MoRE improves structural flexibility of traditional MoE (or Mixture-of-LoRA) by exponential order. Comprehensive theoretical analysis and empirical results demonstrate that S’MoRE achieves superior fine-tuning performance, offering a transformative approach for efficient LLM adaptation. Our implementation is available at: https://github.com/ZimpleX/SMoRE-LLM.

[79] UrduFactCheck: An Agentic Fact-Checking Framework for Urdu with Evidence Boosting and Benchmarking

Sarfraz Ahmad, Hasan Iqbal, Momina Ahsan, Numaan Naeem, Muhammad Ahsan Riaz Khan, Arham Riaz, Muhammad Arslan Manzoor, Yuxia Wang, Preslav Nakov

Main category: cs.CL

TL;DR: This paper introduces UrduFactBench and UrduFactQA, the first hand-annotated benchmarks for Urdu fact-checking and factual consistency evaluation in LLMs, along with UrduFactCheck framework that uses translation-augmented pipelines to improve performance.

Details

Motivation: Address the gap in automated fact-checking systems for Urdu language, as existing systems are predominantly for English, leaving over 200 million Urdu speakers without reliable fact-checking tools for LLM outputs.

Method: Developed UrduFactBench (for claim verification) and UrduFactQA (for factual consistency in QA) through multi-stage annotation with native speakers. Created UrduFactCheck framework with monolingual and translation-based evidence retrieval strategies to handle limited Urdu resources.

Result: Translation-augmented pipelines consistently outperform monolingual ones across twelve evaluated LLMs. Open-source LLMs show persistent challenges in Urdu fact-checking tasks.

Conclusion: Targeted resources like UrduFactBench and UrduFactQA are crucial for improving factual reliability in low-resource languages, and translation-augmented approaches effectively mitigate resource scarcity issues in Urdu fact-checking.

Abstract: The rapid adoption of Large Language Models (LLMs) has raised important concerns about the factual reliability of their outputs, particularly in low-resource languages such as Urdu. Existing automated fact-checking systems are predominantly developed for English, leaving a significant gap for the more than 200 million Urdu speakers worldwide. In this work, we present UrduFactBench and UrduFactQA, two novel hand-annotated benchmarks designed to enable fact-checking and factual consistency evaluation in Urdu. While UrduFactBench focuses on claim verification, UrduFactQA targets the factuality of LLMs in question answering. These resources, the first of their kind for Urdu, were developed through a multi-stage annotation process involving native Urdu speakers. To complement these benchmarks, we introduce UrduFactCheck, a modular fact-checking framework that incorporates both monolingual and translation-based evidence retrieval strategies to mitigate the scarcity of high-quality Urdu evidence. Leveraging these resources, we conduct an extensive evaluation of twelve LLMs and demonstrate that translation-augmented pipelines consistently enhance performance compared to monolingual ones. Our findings reveal persistent challenges for open-source LLMs in Urdu and underscore the importance of developing targeted resources. All code and data are publicly available at https://github.com/mbzuai-nlp/UrduFactCheck.

[80] NL-Debugging: Exploiting Natural Language as an Intermediate Representation for Code Debugging

Weiming Zhang, Qingyao Li, Xinyi Dai, Jizheng Chen, Kounianhua Du, Weiwen Liu, Yasheng Wang, Ruiming Tang, Yong Yu, Weinan Zhang

Main category: cs.CL

TL;DR: NL-DEBUGGING is a framework that uses natural language as an intermediate representation to improve code debugging, outperforming traditional methods and enabling broader modification through execution feedback.

Details

Motivation: Traditional code-level debugging falls short for complex algorithmic errors, and while LLMs show promise for code tasks, it's unclear what natural language format works best for debugging and what specific benefits it provides.

Method: Introduces NL-DEBUGGING framework that employs natural language as an intermediate representation for debugging, allowing debugging at the natural language level with direct refinement guided by execution feedback.

Result: NL-DEBUGGING outperforms traditional debugging methods and enables a broader modification space through natural language reasoning.

Conclusion: Natural language reasoning has significant potential to advance automated code debugging and address complex programming challenges.

Abstract: Debugging is a critical aspect of LLM’s coding ability. Early debugging efforts primarily focused on code-level analysis, which often falls short when addressing complex programming errors that require a deeper understanding of algorithmic logic. Recent advancements in large language models (LLMs) have shifted attention toward leveraging natural language reasoning to enhance code-related tasks. However, two fundamental questions remain unanswered: What type of natural language format is most effective for debugging tasks? And what specific benefits does natural language reasoning bring to the debugging process? In this paper, we introduce NL-DEBUGGING, a novel framework that employs natural language as an intermediate representation to improve code debugging. By debugging at a natural language level, we demonstrate that NL-DEBUGGING outperforms traditional debugging methods and enables a broader modification space through direct refinement guided by execution feedback. Our findings highlight the potential of natural language reasoning to advance automated code debugging and address complex programming challenges.

[81] WXImpactBench: A Disruptive Weather Impact Understanding Benchmark for Evaluating Large Language Models

Yongan Yu, Qingchen Hu, Xianda Du, Jiayin Wang, Fengran Mo, Renee Sieber

Main category: cs.CL

TL;DR: This paper introduces WXImpactBench, the first benchmark for evaluating LLMs on understanding disruptive weather impacts, addressing the lack of available datasets and evaluation frameworks in climate change adaptation research.

Details

Motivation: Climate change adaptation requires understanding weather impacts on society, but LLM effectiveness is under-explored due to difficulty in collecting high-quality corpus and lack of benchmarks. Regional newspapers contain valuable records of community adaptation and recovery from disasters.

Method: Developed a disruptive weather impact dataset with a four-stage construction pipeline, then proposed WXImpactBench benchmark with two evaluation tasks: multi-label classification and ranking-based question answering.

Result: Extensive experiments on various LLMs provided first-hand analysis of challenges in developing weather impact understanding systems. The dataset and evaluation framework code are made available.

Conclusion: The constructed dataset and evaluation framework help society protect against vulnerabilities from disasters by enabling better understanding of disruptive weather impacts through LLM applications.

Abstract: Climate change adaptation requires the understanding of disruptive weather impacts on society, where large language models (LLMs) might be applicable. However, their effectiveness is under-explored due to the difficulty of high-quality corpus collection and the lack of available benchmarks. The climate-related events stored in regional newspapers record how communities adapted and recovered from disasters. However, the processing of the original corpus is non-trivial. In this study, we first develop a disruptive weather impact dataset with a four-stage well-crafted construction pipeline. Then, we propose WXImpactBench, the first benchmark for evaluating the capacity of LLMs on disruptive weather impacts. The benchmark involves two evaluation tasks, multi-label classification and ranking-based question answering. Extensive experiments on evaluating a set of LLMs provide first-hand analysis of the challenges in developing disruptive weather impact understanding and climate change adaptation systems. The constructed dataset and the code for the evaluation framework are available to help society protect against vulnerabilities from disasters.

[82] Precise In-Parameter Concept Erasure in Large Language Models

Yoav Gur-Arieh, Clara Suslik, Yihuai Hong, Fazl Barez, Mor Geva

Main category: cs.CL

TL;DR: PISCES is a framework for precisely erasing concepts from LLM parameters by identifying and removing feature directions associated with target concepts using automated interpretability techniques.

Details

Motivation: Current methods for removing undesirable knowledge from LLMs (sensitive info, copyrighted content) are too coarse, shallow, or ineffective, requiring more precise erasure approaches.

Method: Uses a disentangler model to decompose MLP vectors into interpretable features, identifies concept-associated features via automated interpretability, and removes them directly from model parameters.

Result: Achieves modest efficacy gains over existing methods, reducing target concept accuracy to 7.7% while improving specificity by up to 31% and robustness by up to 38%.

Conclusion: Feature-based in-parameter editing enables more precise and reliable removal of conceptual knowledge from language models.

Abstract: Large language models (LLMs) often acquire knowledge during pretraining that is undesirable in downstream deployments, e.g., sensitive information or copyrighted content. Existing approaches for removing such knowledge rely on fine-tuning, training low-rank adapters or fact-level editing, but these are either too coarse, too shallow, or ineffective. In this work, we propose PISCES (Precise In-parameter Suppression for Concept EraSure), a novel framework for precisely erasing entire concepts from model parameters by directly editing directions that encode them in parameter space. PISCES uses a disentangler model to decompose MLP vectors into interpretable features, identifies those associated with a target concept using automated interpretability techniques, and removes them from model parameters. Experiments on Gemma 2 and Llama 3.1 over various concepts show that PISCES achieves modest gains in efficacy over leading erasure methods, reducing accuracy on the target concept to as low as 7.7%, while dramatically improving erasure specificity (by up to 31%) and robustness (by up to 38%). Overall, these results demonstrate that feature-based in-parameter editing enables a more precise and reliable approach for removing conceptual knowledge in language models.

[83] LLMs are Better Than You Think: Label-Guided In-Context Learning for Named Entity Recognition

Fan Bai, Hamid Hassanzadeh, Ardavan Saeedi, Mark Dredze

Main category: cs.CL

TL;DR: DEER is a training-free in-context learning approach that uses label-grounded statistics to improve named entity recognition in LLMs, outperforming existing ICL methods and achieving near-supervised fine-tuning performance.

Details

Motivation: Existing ICL methods for NER rely on task-agnostic semantic similarity for demonstration retrieval, which often yields less relevant examples and leads to inferior results.

Method: DEER leverages token-level statistics from training labels to identify informative tokens for entity recognition, enabling entity-focused demonstrations and using these statistics to detect and refine error-prone tokens through targeted reflection.

Result: Evaluated on five NER datasets across four LLMs, DEER consistently outperforms existing ICL methods and achieves performance comparable to supervised fine-tuning.

Conclusion: DEER improves example retrieval, remains effective on both seen and unseen entities, and exhibits strong robustness in low-resource settings.

Abstract: In-context learning (ICL) enables large language models (LLMs) to perform new tasks using only a few demonstrations. However, in Named Entity Recognition (NER), existing ICL methods typically rely on task-agnostic semantic similarity for demonstration retrieval, which often yields less relevant examples and leads to inferior results. We introduce DEER, a training-free ICL approach that enables LLMs to make more informed entity predictions through the use of label-grounded statistics. DEER leverages token-level statistics from training labels to identify tokens most informative for entity recognition, enabling entity-focused demonstrations. It further uses these statistics to detect and refine error-prone tokens through a targeted reflection step. Evaluated on five NER datasets across four LLMs, DEER consistently outperforms existing ICL methods and achieves performance comparable to supervised fine-tuning. Further analyses demonstrate that DEER improves example retrieval, remains effective on both seen and unseen entities, and exhibits strong robustness in low-resource settings.

[84] MMD-Flagger: Leveraging Maximum Mean Discrepancy to Detect Hallucinations

Kensuke Mitsuzawa, Damien Garreau

Main category: cs.CL

TL;DR: MMD-Flagger is a new method that uses Maximum Mean Discrepancy to detect hallucinations in LLM outputs by tracking distribution distances across different temperature settings.

Details

Motivation: LLMs often generate fluent but ungrounded content (hallucinations), which prevents their use in critical applications where factual accuracy is essential.

Method: MMD-Flagger tracks the Maximum Mean Discrepancy between the output to inspect and counterparts generated with various temperature parameters, using the shape of this trajectory to detect hallucinations.

Result: The method shows competitive performance on machine translation and summarization datasets, effectively detecting most hallucinations.

Conclusion: MMD-Flagger provides an effective approach for hallucination detection in LLM outputs, enabling more reliable use of these models in critical applications.

Abstract: Large language models (LLMs) have become pervasive in our everyday life. Yet, a fundamental obstacle prevents their use in many critical applications: their propensity to generate fluent, human-quality content that is not grounded in reality. The detection of such hallucinations is thus of the highest importance. In this work, we propose a new method to flag hallucinated content: MMD-Flagger. It relies on Maximum Mean Discrepancy (MMD), a non-parametric distance between distributions. On a high-level perspective, MMD-Flagger tracks the MMD between the output to inspect and counterparts generated with various temperature parameters. We show empirically that inspecting the shape of this trajectory is sufficient to detect most hallucinations. This novel method is benchmarked on machine translation and summarization datasets, on which it exhibits competitive performance relative to natural competitors.

[85] Robust Preference Optimization via Dynamic Target Margins

Jie Sun, Junkang Wu, Jiancan Wu, Zhibo Zhu, Xingyu Lu, Jun Zhou, Lintao Ma, Xiang Wang

Main category: cs.CL

TL;DR: γ-PO is a dynamic target margin preference optimization algorithm that improves LLM alignment by adjusting reward margins at the pairwise level, prioritizing high-confidence pairs and suppressing noise from ambiguous pairs.

Details

Motivation: The effectiveness of Direct Preference Optimization (DPO) heavily depends on data quality, which is frequently compromised by noise. Current methods lack dynamic adjustment of reward margins to handle noisy preference pairs.

Method: Proposed γ-PO, a plug-and-play method that introduces instance-specific margin calibration to strategically prioritize high-confidence pairs while suppressing potential noise from ambiguous pairs. It’s compatible with DPO variants that rely on reward margin between preference pairs.

Result: Across benchmarks such as AlpacaEval2 and Arena-Hard, γ-PO achieves an average 4.4% improvement over other baselines, setting new benchmarks for state-of-the-art performance. It requires minimal code changes and has negligible impact on training efficiency.

Conclusion: γ-PO provides a robust solution for enhancing LLM alignment by dynamically adjusting reward margins to handle noisy preference data effectively, while maintaining training efficiency and requiring minimal implementation changes.

Abstract: The alignment of Large Language Models (LLMs) is crucial for ensuring their safety and reliability in practical applications. Direct Preference Optimization (DPO) has emerged as an efficient method that directly optimizes models using preference pairs, significantly reducing resource demands. However, the effectiveness of DPO heavily depends on the data quality, which is frequently compromised by noise. In this work, we propose $\gamma$-PO, a dynamic target margin preference optimization algorithm that adjust reward margins at the pairwise level. By introducing instance-specific margin calibration, $\gamma$-PO strategically prioritizes high-confidence pairs (those demonstrating higher reward margins) while suppressing potential noise from ambiguous pairs. Moreover, $\gamma$-PO is a plug-and-play method, compatible with variants of DPO that rely on reward margin between preference pairs. Across benchmarks such as AlpacaEval2 and Arena-Hard, $\gamma$-PO achieves an average 4.4% improvement over other baselines, setting new benchmarks for state-of-the-art performance. Additionally, $\gamma$-PO requires minimal code changes and has a negligible impact on training efficiency, making it a robust solution for enhancing LLMs alignment. Our codes are available at \href{https://github.com/sunjie279/gammaPO}{https://github.com/sunjie279/gammaPO}.

[86] Many LLMs Are More Utilitarian Than One

Anita Keshmirian, Razan Baltaji, Babak Hemmatian, Hadi Asghari, Lav R. Varshney

Main category: cs.CL

TL;DR: LLMs show a ‘Utilitarian Boost’ in group settings similar to humans - they become more willing to endorse harmful actions that benefit the greater good when collaborating in multi-agent systems compared to solo reasoning.

Details

Motivation: To understand how LLMs' moral judgment changes in multi-agent collaborative settings compared to individual operation, and whether they exhibit similar group dynamics as humans in moral reasoning.

Method: Tested six LLM models on established moral dilemmas in two conditions: Solo (independent reasoning) and Group (multi-turn discussions in pairs or triads), analyzing personal dilemmas where agents decide whether to directly harm an individual for others’ benefit.

Result: All models rated moral violations as more acceptable in group settings, demonstrating a Utilitarian Boost. However, the mechanism differed from humans - LLM groups showed either reduced sensitivity to norms or enhanced impartiality rather than increased outcome sensitivity.

Conclusion: LLMs exhibit group moral reasoning dynamics similar to humans but through different cognitive mechanisms, with implications for AI alignment, multi-agent system design, and artificial moral reasoning.

Abstract: Moral judgment is integral to large language models’ (LLMs) social reasoning. As multi-agent systems gain prominence, it becomes crucial to understand how LLMs function when collaborating compared to operating as individual agents. In human moral judgment, group deliberation leads to a Utilitarian Boost: a tendency to endorse norm violations that inflict harm but maximize benefits for the greatest number of people. We study whether a similar dynamic emerges in multi-agent LLM systems. We test six models on well-established sets of moral dilemmas across two conditions: (1) Solo, where models reason independently, and (2) Group, where they engage in multi-turn discussions in pairs or triads. In personal dilemmas, where agents decide whether to directly harm an individual for the benefit of others, all models rated moral violations as more acceptable when part of a group, demonstrating a Utilitarian Boost similar to that observed in humans. However, the mechanism for the Boost in LLMs differed: While humans in groups become more utilitarian due to heightened sensitivity to decision outcomes, LLM groups showed either reduced sensitivity to norms or enhanced impartiality. We report model differences in when and how strongly the Boost manifests. We also discuss prompt and agent compositions that enhance or mitigate the effect. We end with a discussion of the implications for AI alignment, multi-agent design, and artificial moral reasoning. Code available at: https://github.com/baltaci-r/MoralAgents

[87] Think Twice Before You Judge: Mixture of Dual Reasoning Experts for Multimodal Sarcasm Detection

Soumyadeep Jana, Abhrajyoti Kundu, Sanasam Ranbir Singh

Main category: cs.CL

TL;DR: MiDRE is a multimodal sarcasm detection model that combines internal reasoning (detecting image-text incongruities) with external reasoning (using structured rationales from large vision-language models) through an adaptive gating mechanism.

Details

Motivation: Existing multimodal sarcasm detection models struggle to capture deeper rationales behind sarcasm and rely mainly on shallow cues, lacking the ability to incorporate external contextual knowledge like cultural references or commonsense reasoning.

Method: Proposed MiDRE framework with two experts: internal reasoning expert for detecting image-text incongruities, and external reasoning expert using Chain-of-Thought prompting to generate structured rationales from large vision-language models. An adaptive gating mechanism dynamically weights the two experts.

Result: Experiments on two benchmark datasets show MiDRE achieves superior performance over baselines. External rationales provide valuable cues even when occasionally noisy, guiding the model toward better sarcasm understanding.

Conclusion: MiDRE effectively integrates dual reasoning paths and selectively adapts to when external knowledge is beneficial, mitigating risks of hallucinated or irrelevant signals while improving sarcasm detection performance.

Abstract: Multimodal sarcasm detection has attracted growing interest due to the rise of multimedia posts on social media. Understanding sarcastic image-text posts often requires external contextual knowledge, such as cultural references or commonsense reasoning. However, existing models struggle to capture the deeper rationale behind sarcasm, relying mainly on shallow cues like image captions or object-attribute pairs from images. To address this, we propose \textbf{MiDRE} (\textbf{Mi}xture of \textbf{D}ual \textbf{R}easoning \textbf{E}xperts), which integrates an internal reasoning expert for detecting incongruities within the image-text pair and an external reasoning expert that utilizes structured rationales generated via Chain-of-Thought prompting to a Large Vision-Language Model. An adaptive gating mechanism dynamically weighs the two experts, selecting the most relevant reasoning path. Unlike prior methods that treat external knowledge as static input, MiDRE selectively adapts to when such knowledge is beneficial, mitigating the risks of hallucinated or irrelevant signals from large models. Experiments on two benchmark datasets show that MiDRE achieves superior performance over baselines. Various qualitative analyses highlight the crucial role of external rationales, revealing that even when they are occasionally noisy, they provide valuable cues that guide the model toward a better understanding of sarcasm.

Soumyadeep Jana, Sahil Danayak, Sanasam Ranbir Singh

Main category: cs.CL

TL;DR: AdS-CLIP is a lightweight framework for multimodal sarcasm detection that uses adapter-state sharing in CLIP to enable efficient cross-modal learning with minimal trainable parameters.

Details

Motivation: Existing sarcasm detection methods require full fine-tuning of large models, which is resource-intensive and unsuitable for constrained settings. Current PEFT methods underperform on complex tasks like sarcasm detection.

Method: Built on CLIP, inserts adapters only in upper layers to preserve low-level unimodal representations, and introduces adapter-state sharing where textual adapters guide visual ones for efficient cross-modal learning.

Result: Outperforms standard PEFT methods and existing multimodal baselines on two public benchmarks while using significantly fewer trainable parameters.

Conclusion: AdS-CLIP provides an effective and parameter-efficient solution for multimodal sarcasm detection, making it suitable for resource-constrained environments.

Abstract: The growing prevalence of multimodal image-text sarcasm on social media poses challenges for opinion mining systems. Existing approaches rely on full fine-tuning of large models, making them unsuitable to adapt under resource-constrained settings. While recent parameter-efficient fine-tuning (PEFT) methods offer promise, their off-the-shelf use underperforms on complex tasks like sarcasm detection. We propose AdS-CLIP (Adapter-state Sharing in CLIP), a lightweight framework built on CLIP that inserts adapters only in the upper layers to preserve low-level unimodal representations in the lower layers and introduces a novel adapter-state sharing mechanism, where textual adapters guide visual ones to promote efficient cross-modal learning in the upper layers. Experiments on two public benchmarks demonstrate that AdS-CLIP not only outperforms standard PEFT methods but also existing multimodal baselines with significantly fewer trainable parameters.

[89] Steering Information Utility in Key-Value Memory for Language Model Post-Training

Chunyuan Deng, Ruidi Chang, Hanjie Chen

Main category: cs.CL

TL;DR: InfoSteer is a lightweight post-training method that treats FFN layers as associative memory and guides models to better utilize stored knowledge through forward-pass interventions or regularization, improving performance across diverse models and tasks.

Details

Motivation: Current post-training approaches like SFT don't guarantee effective use of knowledge acquired during pretraining, leaving potential untapped in language models.

Method: Treats feed-forward network layers as associative key-value memory and promotes memory vector usage via forward-pass interventions or backpropagation regularization during post-training.

Result: Consistent performance improvements across Qwen, Gemma, and Llama models on 15 downstream tasks in both in-distribution and out-of-distribution evaluations, with steered models adaptively allocating information more efficiently.

Conclusion: Vanilla post-training doesn’t fully exploit pretraining potential, and steering LMs in latent representation space enhances both performance and interpretability.

Abstract: Recent advancements in language models (LMs) have marked a shift toward the growing importance of post-training. Yet, post-training approaches such as supervised fine-tuning (SFT) do not guarantee the effective use of knowledge acquired during pretraining. We therefore introduce InfoSteer, a lightweight method that encourages parametric information utilization in LMs during post-training. Specifically, InfoSteer treats the feed-forward network (FFN) layer as associate key-value memory and promotes the use of stored memory vectors via forward-pass interventions or regularization during backpropagation. This simple guidance during post-training phase yields consistent performance improvements across diverse model families – including Qwen, Gemma and Llama – spanning 15 downstream tasks in both in-distribution (ID) and out-of-distribution (OOD) evaluations. Beyond performance gains, we also find that steered LMs can adaptively allocate information by placing more emphasis on generating semantically meaningful tokens, while using fewer resources on simple transition ones (e.g., \texttt{,}' or \texttt{and}’). Our work underscores that vanilla post-training does not fully exploit the potential gained during pre-training, and that steering LMs in latent representation space offers a promising approach to enhance both performance and interpretability. The code is available at: https://github.com/chili-lab/InfoSteer.

[90] InsurTech innovation using natural language processing

Panyi Dong, Zhiyu Quan

Main category: cs.CL

TL;DR: This paper demonstrates how NLP transforms unstructured text into structured data for insurance analytics, showing applications in feature de-biasing, compression, and industry classification using InsurTech data.

Details

Motivation: Traditional insurance companies need to leverage alternative data sources and advanced technologies like NLP to maintain competitive advantage in the InsurTech era.

Method: Applied various NLP techniques to real-world alternative data from an InsurTech partner, focusing on feature de-biasing, feature compression, and industry classification for commercial insurance.

Result: Text-derived insights enriched traditional rating factors for commercial insurance pricing and introduced novel industry classification techniques for better risk assessment.

Conclusion: NLP is a foundational element of modern, data-driven insurance analytics rather than just a supplementary tool.

Abstract: With the rapid rise of InsurTech, traditional insurance companies are increasingly exploring alternative data sources and advanced technologies to sustain their competitive edge. This paper provides both a conceptual overview and practical case studies of natural language processing (NLP) and its emerging applications within insurance operations, focusing on transforming raw, unstructured text into structured data suitable for actuarial analysis and decision-making. Leveraging real-world alternative data provided by an InsurTech industry partner that enriches traditional insurance data sources, we apply various NLP techniques to demonstrate feature de-biasing, feature compression, and industry classification in the commercial insurance context. These enriched, text-derived insights not only add to and refine traditional rating factors for commercial insurance pricing but also offer novel perspectives for assessing underlying risk by introducing novel industry classification techniques. Through these demonstrations, we show that NLP is not merely a supplementary tool but a foundational element of modern, data-driven insurance analytics.

[91] DBLPLink 2.0 – An Entity Linker for the DBLP Scholarly Knowledge Graph

Debayan Banerjee, Tilahun Abedissa Taffa, Ricardo Usbeck

Main category: cs.CL

TL;DR: Zero-shot entity linker for DBLP 2025 using LLMs with novel re-ranking method based on log-probabilities of ‘yes’ token at penultimate layer.

Details

Motivation: DBLP 2025 introduces new entity type dblp:Stream (publication venues), requiring updated entity linking approach without retraining on new data.

Method: Use LLMs for zero-shot entity linking by re-ranking candidate entities based on log-probabilities of ‘yes’ token output at penultimate layer, instead of training KG-embeddings and re-rankers.

Result: Developed a zero-shot entity linker that works with DBLP 2025’s new entity structure without requiring dataset-specific training.

Conclusion: LLM-based zero-shot approach provides effective entity linking for updated knowledge graphs without retraining, using novel probability-based re-ranking method.

Abstract: In this work we present an entity linker for DBLP’s 2025 version of RDF-based Knowledge Graph. Compared to the 2022 version, DBLP now considers publication venues as a new entity type called dblp:Stream. In the earlier version of DBLPLink, we trained KG-embeddings and re-rankers on a dataset to produce entity linkings. In contrast, in this work, we develop a zero-shot entity linker using LLMs using a novel method, where we re-rank candidate entities based on the log-probabilities of the “yes” token output at the penultimate layer of the LLM.

[92] WEST: LLM based Speech Toolkit for Speech Understanding, Generation, and Interaction

Binbin Zhang, Chengdong Liang, Shuai Wang, Xuelong Geng, Zhao Guo, Haoyu Li, Hao Yin, Xipeng Yang, Pengshen Zhang, Changwei Ma, Lei Xie

Main category: cs.CL

TL;DR: WEST is a speech toolkit based on large language models that supports speech understanding, generation, and interaction with three key features: fully LLM-based architecture, full-stack capabilities, and simple usability.

Details

Motivation: To create a comprehensive speech toolkit that leverages mature LLM architectures and ecosystems for various speech tasks, making advanced speech technology more accessible and reproducible.

Method: Built on large language models with reuse of mature architectures and ecosystems like Hugging Face. Supports recognition, synthesis, understanding, dialogue, and multimodal capabilities. Provides two recipe types: one fully reproducible with open-source models/data, and another trained on massive data for superior performance.

Result: WEST offers both reproducible baseline systems using open-source components and high-performance models trained on massive data. The toolkit is publicly available and extensible.

Conclusion: WEST provides a versatile, accessible speech toolkit that bridges the gap between research reproducibility and practical deployment, enabling users to either reproduce experiments or directly apply high-performance models.

Abstract: In this paper, we present WEST(WE Speech Toolkit), a speech toolkit based on a large language model (LLM) for speech understanding, generation, and interaction. There are three key features of WEST: 1) Fully LLM-based: Standing on the shoulders of giants by reusing mature architectures, ecosystems (e.g., Hugging Face), and methods (e.g., sequence packing) from large models. 2) Full-stack: Supports tasks such as recognition, synthesis, understanding, dialogue, and multimodal capabilities, with extensibility to incorporate open-source models. 3) Simple and Stupid: A simple and stupid speech toolkit that everyone can Touch. In addition, WEST provides two types of recipes, models, and experimental results. The first is entirely based on open-source models and open-source data, allowing users to fully reproduce the experiments in this paper and serving as a verification system or minimal system baseline. The second is trained on massive data, offering superior performance so the user can directly apply it out of the box. WEST is publicly avilable at https://github.com/wenet-e2e/west/

[93] SciReasoner: Laying the Scientific Reasoning Ground Across Disciplines

Yizhou Wang, Chen Tang, Han Deng, Jiabei Xiao, Jiaqi Liu, Jianyu Wu, Jun Yao, Pengze Li, Encheng Su, Lintao Wang, Guohang Zhuang, Yuchen Ren, Ben Fei, Ming Hu, Xin Chen, Dongzhan Zhou, Junjun He, Xiangyu Yue, Zhenfei Yin, Jiamin Wu, Qihao Zheng, Yuhao Zhou, Huihui Xu, Chenglong Ma, Yan Lu, Wenlong Zhang, Chunfeng Song, Philip Torr, Shixiang Tang, Xinzhu Ma, Wanli Ouyang, Lei Bai

Main category: cs.CL

TL;DR: A scientific reasoning foundation model that aligns natural language with scientific representations, trained on 206B tokens and fine-tuned on 40M instructions, supporting 103 tasks across multiple capability families.

Details

Motivation: To create a unified model that can handle heterogeneous scientific representations and enable faithful translation between text and scientific formats, improving cross-domain generalization and fidelity compared to specialist systems.

Method: Pretrained on 206B-token corpus of scientific text, sequences, and sequence-text pairs, then aligned via supervised fine-tuning on 40M instructions, annealed cold-start bootstrapping for chain-of-thought reasoning, and reinforcement learning with task-specific reward shaping.

Result: The model supports five capability families covering 103 tasks: faithful translation, text/knowledge extraction, property prediction, property classification, and sequence generation/design. It broadens instruction coverage, improves cross-domain generalization, and enhances fidelity compared to specialist systems.

Conclusion: Cross-discipline learning strengthens transfer and downstream reliability. The model, instruction tuning datasets, and evaluation code are open-sourced to promote scientific AI research.

Abstract: We present a scientific reasoning foundation model that aligns natural language with heterogeneous scientific representations. The model is pretrained on a 206B-token corpus spanning scientific text, pure sequences, and sequence-text pairs, then aligned via SFT on 40M instructions, annealed cold-start bootstrapping to elicit long-form chain-of-thought, and reinforcement learning with task-specific reward shaping, which instills deliberate scientific reasoning. It supports four capability families, covering up to 103 tasks across workflows: (i) faithful translation between text and scientific formats, (ii) text/knowledge extraction, (iii) property prediction, (iv) property classification, (v) unconditional and conditional sequence generation and design. Compared with specialist systems, our approach broadens instruction coverage, improves cross-domain generalization, and enhances fidelity. We detail data curation and training and show that cross-discipline learning strengthens transfer and downstream reliability. The model, instruct tuning datasets and the evaluation code are open-sourced at https://huggingface.co/SciReason and https://github.com/open-sciencelab/SciReason.

[94] ReSeek: A Self-Correcting Framework for Search Agents with Instructive Rewards

Shiyu Li, Yang Tang, Yifan Wang, Peiming Li, Xi Chen

Main category: cs.CL

TL;DR: ReSeek is a self-correcting framework for training LLM-powered search agents that enables dynamic error recovery during search episodes through a JUDGE action mechanism and dense process rewards.

Details

Motivation: Prior RL-based methods for search agents rely on sparse or rule-based rewards, leading to commitment to suboptimal reasoning paths without recovery ability.

Method: Introduces self-correction mechanism with JUDGE action for dynamic error recovery, dense instructive process reward function (correctness + utility rewards), and FictionalHot benchmark to avoid data contamination.

Result: Agents trained with ReSeek significantly outperform state-of-the-art baselines in task success rate and path faithfulness.

Conclusion: ReSeek provides an effective framework for training search agents with self-correction capabilities, improving performance on complex reasoning tasks.

Abstract: Search agents powered by Large Language Models (LLMs) have demonstrated significant potential in tackling knowledge-intensive tasks. Reinforcement learning (RL) has emerged as a powerful paradigm for training these agents to perform complex, multi-step reasoning. However, prior RL-based methods often rely on sparse or rule-based rewards, which can lead agents to commit to suboptimal or erroneous reasoning paths without the ability to recover. To address these limitations, we propose ReSeek, a novel self-correcting framework for training search agents. Our framework introduces a self-correction mechanism that empowers the agent to dynamically identify and recover from erroneous search paths during an episode. By invoking a special JUDGE action, the agent can judge the information and re-plan its search strategy. To guide this process, we design a dense, instructive process reward function, which decomposes into a correctness reward for retrieving factual information and a utility reward for finding information genuinely useful for the query. Furthermore, to mitigate the risk of data contamination in existing datasets, we introduce FictionalHot, a new and challenging benchmark with recently curated questions requiring complex reasoning. Being intuitively reasonable and practically simple, extensive experiments show that agents trained with ReSeek significantly outperform SOTA baselines in task success rate and path faithfulness.

[95] AMAS: Adaptively Determining Communication Topology for LLM-based Multi-Agent System

Hui Yi Leong, Yuheng Li, Yuqing Wu, Wenwen Ouyang, Wei Zhu, Jiechao Gao, Wei Han

Main category: cs.CL

TL;DR: AMAS introduces a dynamic graph designer framework that enables LLM-based multi-agent systems to autonomously adapt their graph configurations based on task-specific requirements, overcoming limitations of fixed architectures and achieving superior performance across various benchmarks.

Details

Motivation: Current multi-agent systems using LLMs are limited by inflexible, hand-crafted graph topologies that lack contextual responsiveness, reducing their effectiveness across diverse academic and commercial workloads.

Method: AMAS uses a novel dynamic graph designer that autonomously identifies task-specific optimal graph configurations through lightweight LLM adaptation, eliminating reliance on universal structural templates and instead using intrinsic input properties to direct query trajectories through optimized agent pathways.

Result: Rigorous validation across question answering, mathematical deduction, and code generation benchmarks confirms AMAS systematically exceeds state-of-the-art single-agent and multi-agent approaches across diverse LLM architectures.

Conclusion: Context-sensitive structural adaptability is a foundational requirement for high-performance LLM multi-agent system deployments.

Abstract: Although large language models (LLMs) have revolutionized natural language processing capabilities, their practical implementation as autonomous multi-agent systems (MAS) for industrial problem-solving encounters persistent barriers. Conventional MAS architectures are fundamentally restricted by inflexible, hand-crafted graph topologies that lack contextual responsiveness, resulting in diminished efficacy across varied academic and commercial workloads. To surmount these constraints, we introduce AMAS, a paradigm-shifting framework that redefines LLM-based MAS through a novel dynamic graph designer. This component autonomously identifies task-specific optimal graph configurations via lightweight LLM adaptation, eliminating the reliance on monolithic, universally applied structural templates. Instead, AMAS exploits the intrinsic properties of individual inputs to intelligently direct query trajectories through task-optimized agent pathways. Rigorous validation across question answering, mathematical deduction, and code generation benchmarks confirms that AMAS systematically exceeds state-of-the-art single-agent and multi-agent approaches across diverse LLM architectures. Our investigation establishes that context-sensitive structural adaptability constitutes a foundational requirement for high-performance LLM MAS deployments.

[96] FT-MDT: Extracting Decision Trees from Medical Texts via a Novel Low-rank Adaptation Method

Yuheng Li, Jiechao Gao, Wei Han, Wenwen Ouyang, Wei Zhu, Hui Yi Leong

Main category: cs.CL

TL;DR: PI-LoRA is a novel low-rank adaptation method that automatically extracts medical decision trees from clinical texts by integrating gradient path information for better rank allocation, achieving state-of-the-art performance with reduced complexity.

Details

Motivation: Current medical decision tree construction methods rely heavily on manual annotation, which is time-consuming and laborious. There's a need for automated methods to extract MDTs from clinical guidelines and textbooks to build clinical decision support systems.

Method: Proposed PI-LoRA (Path-Integrated LoRA), a low-rank adaptation method that integrates gradient path information to capture synergistic effects between modules. This enables effective rank allocation where critical modules get appropriate ranks while less important ones are pruned.

Result: Extensive experiments on medical guideline datasets show PI-LoRA significantly outperforms existing parameter-efficient fine-tuning approaches for Text2MDT task, achieving better accuracy with substantially reduced model complexity.

Conclusion: PI-LoRA achieves state-of-the-art results while maintaining a lightweight architecture, making it particularly suitable for clinical decision support systems with limited computational resources.

Abstract: Knowledge of the medical decision process, which can be modeled as medical decision trees (MDTs), is critical to building clinical decision support systems. However, current MDT construction methods rely heavily on time-consuming and laborious manual annotation. To address this challenge, we propose PI-LoRA (Path-Integrated LoRA), a novel low-rank adaptation method for automatically extracting MDTs from clinical guidelines and textbooks. We integrate gradient path information to capture synergistic effects between different modules, enabling more effective and reliable rank allocation. This framework ensures that the most critical modules receive appropriate rank allocations while less important ones are pruned, resulting in a more efficient and accurate model for extracting medical decision trees from clinical texts. Extensive experiments on medical guideline datasets demonstrate that our PI-LoRA method significantly outperforms existing parameter-efficient fine-tuning approaches for the Text2MDT task, achieving better accuracy with substantially reduced model complexity. The proposed method achieves state-of-the-art results while maintaining a lightweight architecture, making it particularly suitable for clinical decision support systems where computational resources may be limited.

[97] Augmenting Dialog with Think-Aloud Utterances for Modeling Individual Personality Traits by LLM

Seiya Ishikura, Hiroaki Yamada, Tatsuya Hiraoka, Hiroaki Yamada, Takenobu Tokunaga

Main category: cs.CL

TL;DR: Augmenting dialog data with think-aloud utterances improves LLM’s ability to model human personality traits, particularly Agreeableness and Neuroticism in the Big Five framework.

Details

Motivation: To enhance personality modeling in text chat by LLMs through think-aloud utterances that capture speakers' internal thoughts before articulation.

Method: Training persona LLMs with TAU-augmented dialog data and evaluating personality alignment using the Big Five framework.

Result: LLMs trained with TAU-augmented data showed better alignment with speakers’ Agreeableness and Neuroticism traits compared to original dialog data.

Conclusion: Think-aloud utterance augmentation effectively improves personality modeling in LLMs, with performance dependent on augmentation quality.

Abstract: This study proposes augmenting dialog data with think-aloud utterances (TAUs) for modeling individual personalities in text chat by LLM. TAU is a verbalization of a speaker’s thought before articulating the utterance. We expect “persona LLMs” trained with TAU-augmented data can mimic the speaker’s personality trait better. We tested whether the trained persona LLMs obtain the human personality with respect to Big Five, a framework characterizing human personality traits from five aspects. The results showed that LLMs trained with TAU-augmented data more closely align to the speakers’ Agreeableness and Neuroticism of Big Five than those trained with original dialog data. We also found that the quality of TAU-augmentation impacts persona LLM’s performance.

[98] A Multilingual, Large-Scale Study of the Interplay between LLM Safeguards, Personalisation, and Disinformation

João A. Leite, Arnav Arora, Silvia Gargova, João Luz, Gustavo Sampaio, Ian Roberts, Carolina Scarton, Kalina Bontcheva

Main category: cs.CL

TL;DR: LLMs can generate personalized disinformation across languages and demographics, with simple personalization prompts significantly increasing jailbreak rates and enhancing persuasiveness.

Details

Motivation: To explore LLMs' ability to generate personalized disinformation across different languages and demographic groups, which remains underexplored despite their known capacity for human-like disinformation generation.

Method: Used red teaming methodology with 8 state-of-the-art LLMs, prompting them with 324 false narratives and 150 demographic personas across 4 languages (English, Russian, Portuguese, Hindi), creating AI-TRAITS dataset of 1.6 million personalized disinformation texts.

Result: Simple personalization prompts significantly increased jailbreak likelihood (up to 10 percentage points) and altered linguistic/rhetorical patterns to enhance persuasiveness. Models like Grok and GPT showed jailbreak rates and personalization scores exceeding 85%.

Conclusion: The study exposes critical vulnerabilities in current LLMs and provides foundation for improving safety alignment and detection strategies in multilingual and cross-demographic contexts.

Abstract: Large Language Models (LLMs) can generate human-like disinformation, yet their ability to personalise such content across languages and demographics remains underexplored. This study presents the first large-scale, multilingual analysis of persona-targeted disinformation generation by LLMs. Employing a red teaming methodology, we prompt eight state-of-the-art LLMs with 324 false narratives and 150 demographic personas (combinations of country, generation, and political orientation) across four languages–English, Russian, Portuguese, and Hindi–resulting in AI-TRAITS, a comprehensive dataset of 1.6 million personalised disinformation texts. Results show that the use of even simple personalisation prompts significantly increases the likelihood of jailbreaks across all studied LLMs, up to 10 percentage points, and alters linguistic and rhetorical patterns that enhance narrative persuasiveness. Models such as Grok and GPT exhibited jailbreak rates and personalisation scores both exceeding 85%. These insights expose critical vulnerabilities in current state-of-the-art LLMs and offer a foundation for improving safety alignment and detection strategies in multilingual and cross-demographic contexts.

[99] ConsistencyAI: A Benchmark to Assess LLMs’ Factual Consistency When Responding to Different Demographic Groups

Peter Banyas, Shristi Sharma, Alistair Simmons, Atharva Vispute

Main category: cs.CL

TL;DR: ConsistencyAI is an independent benchmark that measures factual consistency of LLMs across different user personas, showing that models provide inconsistent factual answers to identical questions depending on user demographics.

Details

Motivation: To create an impartial evaluation framework that tests whether LLMs provide factually consistent responses to users of different demographics, addressing potential bias and inconsistency in model outputs.

Method: Tested 19 LLMs by querying them for 5 facts across 15 topics, repeating 100 times with different persona contexts. Used sentence embeddings and cross-persona cosine similarity to compute factual consistency scores.

Result: Factual consistency scores ranged from 0.9065 to 0.7896 (mean 0.8656). xAI’s Grok-3 was most consistent, lightweight models ranked lowest. Consistency varied by topic - job market least consistent, G7 world leaders most consistent.

Conclusion: Both the LLM provider and topic significantly shape factual consistency. The benchmark enables reproducible evaluation and encourages persona-invariant prompting strategies to improve consistency.

Abstract: Is an LLM telling you different facts than it’s telling me? This paper introduces ConsistencyAI, an independent benchmark for measuring the factual consistency of large language models (LLMs) for different personas. ConsistencyAI tests whether, when users of different demographics ask identical questions, the model responds with factually inconsistent answers. Designed without involvement from LLM providers, this benchmark offers impartial evaluation and accountability. In our experiment, we queried 19 LLMs with prompts that requested 5 facts for each of 15 topics. We repeated this query 100 times for each LLM, each time adding prompt context from a different persona selected from a subset of personas modeling the general population. We processed the responses into sentence embeddings, computed cross-persona cosine similarity, and computed the weighted average of cross-persona cosine similarity to calculate factual consistency scores. In 100-persona experiments, scores ranged from 0.9065 to 0.7896, and the mean was 0.8656, which we adopt as a benchmark threshold. xAI’s Grok-3 is most consistent, while several lightweight models rank lowest. Consistency varies by topic: the job market is least consistent, G7 world leaders most consistent, and issues like vaccines or the Israeli-Palestinian conflict diverge by provider. These results show that both the provider and the topic shape the factual consistency. We release our code and interactive demo to support reproducible evaluation and encourage persona-invariant prompting strategies.

[100] Quantifying Phonosemantic Iconicity Distributionally in 6 Languages

George Flint, Kaustubh Kislay

Main category: cs.CL

TL;DR: This paper conducts a large-scale quantitative analysis of phonosemantic iconicity across 6 languages, discovering new systematic relationships between phonetics and semantics while testing previously hypothesized alignments.

Details

Motivation: To investigate the extent of systematic relationships between phonetics and semantics at scale, beyond isolated cases, across multiple diverse languages.

Method: Used a distributional approach to quantify phonosemantic iconicity by analyzing alignment of morphemes’ phonetic and semantic similarity spaces with statistical measures across 6 languages (English, Spanish, Hindi, Finnish, Turkish, Tamil).

Result: Discovered new interpretable phonosemantic alignments not previously identified, found crosslinguistic patterns, and obtained mixed results when testing 5 previously hypothesized alignments - supporting some while others showed mixed evidence.

Conclusion: Systematic phonosemantic relationships exist at scale across diverse languages, with both newly discovered patterns and partial support for some previously hypothesized alignments, challenging the view of language as largely arbitrary.

Abstract: Language is, as commonly theorized, largely arbitrary. Yet, systematic relationships between phonetics and semantics have been observed in many specific cases. To what degree could those systematic relationships manifest themselves in large scale, quantitative investigations–both in previously identified and unidentified phenomena? This work undertakes a distributional approach to quantifying phonosemantic iconicity at scale across 6 diverse languages (English, Spanish, Hindi, Finnish, Turkish, and Tamil). In each language, we analyze the alignment of morphemes’ phonetic and semantic similarity spaces with a suite of statistical measures, and discover an array of interpretable phonosemantic alignments not previously identified in the literature, along with crosslinguistic patterns. We also analyze 5 previously hypothesized phonosemantic alignments, finding support for some such alignments and mixed results for others.

Bingsheng Yao, Bo Sun, Yuanzhe Dong, Yuxuan Lu, Dakuo Wang

Main category: cs.CL

TL;DR: DPRF is a framework that improves persona fidelity in LLM role-playing agents by iteratively identifying cognitive divergences between generated behaviors and human ground truth, then refining persona profiles to enhance alignment.

Details

Motivation: Current LLM role-playing agents suffer from poor persona fidelity due to manually-created profiles that aren't validated for alignment with target individuals, leading to inaccurate behavioral simulations.

Method: Dynamic Persona Refinement Framework (DPRF) uses iterative identification of cognitive divergence (free-form or theory-grounded) between generated behaviors and human ground truth, then refines persona profiles to mitigate these divergences.

Result: DPRF consistently improved behavioral alignment considerably over baseline personas across five LLMs and four diverse behavior-prediction scenarios (formal debates, social media posts, public interviews, movie reviews), generalizing well across models and scenarios.

Conclusion: DPRF provides a robust methodology for creating high-fidelity persona profiles and enhancing validity of downstream applications like user simulation, social studies, and personalized AI.

Abstract: The emerging large language model role-playing agents (LLM RPAs) aim to simulate individual human behaviors, but the persona fidelity is often undermined by manually-created profiles (e.g., cherry-picked information and personality characteristics) without validating the alignment with the target individuals. To address this limitation, our work introduces the Dynamic Persona Refinement Framework (DPRF). DPRF aims to optimize the alignment of LLM RPAs’ behaviors with those of target individuals by iteratively identifying the cognitive divergence, either through free-form or theory-grounded, structured analysis, between generated behaviors and human ground truth, and refining the persona profile to mitigate these divergences. We evaluate DPRF with five LLMs on four diverse behavior-prediction scenarios: formal debates, social media posts with mental health issues, public interviews, and movie reviews. DPRF can consistently improve behavioral alignment considerably over baseline personas and generalizes across models and scenarios. Our work provides a robust methodology for creating high-fidelity persona profiles and enhancing the validity of downstream applications, such as user simulation, social studies, and personalized AI.

[102] MAD-Fact: A Multi-Agent Debate Framework for Long-Form Factuality Evaluation in LLMs

Yucheng Ning, Xixun Lin, Fang Fang, Yanan Cao

Main category: cs.CL

TL;DR: Proposes a systematic framework for evaluating factual accuracy in long-form LLM outputs using large-scale datasets, multi-agent verification, and weighted metrics, with experiments showing larger LLMs maintain higher factual consistency.

Details

Motivation: Address concerns about factual accuracy of LLM outputs in high-risk domains like biomedicine, law, and education, where existing evaluation methods fail on long-form content due to complex reasoning chains and cumulative information.

Method: Integrates large-scale long-form datasets (LongHalluQA), multi-agent verification mechanisms (MAD-Fact debate-based system), and weighted evaluation metrics with fact importance hierarchy to capture varying significance of claims.

Result: Experiments on two benchmarks show larger LLMs generally maintain higher factual consistency, while domestic models excel on Chinese content.

Conclusion: Provides a structured framework for evaluating and enhancing factual reliability in long-form LLM outputs to guide safe deployment in sensitive domains.

Abstract: The widespread adoption of Large Language Models (LLMs) raises critical concerns about the factual accuracy of their outputs, especially in high-risk domains such as biomedicine, law, and education. Existing evaluation methods for short texts often fail on long-form content due to complex reasoning chains, intertwined perspectives, and cumulative information. To address this, we propose a systematic approach integrating large-scale long-form datasets, multi-agent verification mechanisms, and weighted evaluation metrics. We construct LongHalluQA, a Chinese long-form factuality dataset; and develop MAD-Fact, a debate-based multi-agent verification system. We introduce a fact importance hierarchy to capture the varying significance of claims in long-form texts. Experiments on two benchmarks show that larger LLMs generally maintain higher factual consistency, while domestic models excel on Chinese content. Our work provides a structured framework for evaluating and enhancing factual reliability in long-form LLM outputs, guiding their safe deployment in sensitive domains.

[103] Are ASR foundation models generalized enough to capture features of regional dialects for low-resource languages?

Tawsif Tashwar Dipto, Azmol Hossain, Rubayet Sabbir Faruque, Md. Rezuwan Hassan, Kanij Fatema, Tanmoy Shome, Ruwad Naswan, Md. Foriduzzaman Zihad, Mohaymen Ul Anam, Nazia Tasnim, Hasan Mahmud, Md Kamrul Hasan, Md. Mehedi Hasan Shawon, Farig Sadeque, Tahsin Reasat

Main category: cs.CL

TL;DR: The paper introduces Ben-10, a 78-hour Bengali dialect speech corpus, showing that speech foundation models struggle with regional dialects in both zero-shot and fine-tuned settings, and that dialect-specific training helps.

Details

Motivation: To investigate the effects of dialectal variations on automatic speech recognition (ASR), particularly for low-resource languages where conventional research relies on canonical forms while regional dialects are treated as fine-tuning tasks.

Method: Developed a 78-hour annotated Bengali Speech-to-Text corpus (Ben-10) and conducted investigations from linguistic and data-driven perspectives, testing speech foundation models in zero-shot and fine-tuned settings.

Result: Speech foundation models struggle heavily with regional dialect ASR in both zero-shot and fine-tuned settings. All deep learning methods struggle to model speech data under dialectal variations, but dialect-specific model training alleviates the issue.

Conclusion: The Ben-10 dataset serves as an out-of-distribution resource for ASR modeling under constrained resources, highlighting the challenges of dialectal variations in speech recognition and the need for dialect-specific approaches.

Abstract: Conventional research on speech recognition modeling relies on the canonical form for most low-resource languages while automatic speech recognition (ASR) for regional dialects is treated as a fine-tuning task. To investigate the effects of dialectal variations on ASR we develop a 78-hour annotated Bengali Speech-to-Text (STT) corpus named Ben-10. Investigation from linguistic and data-driven perspectives shows that speech foundation models struggle heavily in regional dialect ASR, both in zero-shot and fine-tuned settings. We observe that all deep learning methods struggle to model speech data under dialectal variations but dialect specific model training alleviates the issue. Our dataset also serves as a out of-distribution (OOD) resource for ASR modeling under constrained resources in ASR algorithms. The dataset and code developed for this project are publicly available

[104] Lookahead Tree-Based Rollouts for Enhanced Trajectory-Level Exploration in Reinforcement Learning with Verifiable Rewards

Shangyu Xing, Siyuan Wang, Chenyuan Yang, Xinyu Dai, Xiang Ren

Main category: cs.CL

TL;DR: LATR improves RLVR by increasing trajectory diversity through lookahead tree-based rollouts, accelerating policy learning by 131% and boosting final performance by 4.2%.

Details

Motivation: Current RLVR pipelines suffer from limited trajectory diversity due to token-level stochastic sampling, leading to homogeneous reasoning paths that hinder effective policy learning.

Method: LATR uses three-stage iterative process: branching at high-uncertainty steps, lookahead simulation for each branch, and pruning similar branches to promote trajectory-level diversity.

Result: LATR accelerates policy learning by 131% on average and improves final pass@1 performance by 4.2% on both GRPO and DAPO algorithms across reasoning tasks.

Conclusion: LATR effectively addresses trajectory diversity limitations in RLVR, significantly improving learning efficiency and final performance through structured branching and pruning strategies.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR), particularly with algorithms like Group Relative Policy Optimization (GRPO), has proven highly effective in enhancing the reasoning capabilities of large language models. However, a critical bottleneck in current pipelines lies in the limited diversity of sampled trajectories during group rollouts. Homogeneous trajectories and their associated rewards would diminish the return signals for policy updates, thereby hindering effective policy learning. This lack of diversity stems primarily from token-level stochastic sampling, where local variations are likely to collapse into near-identical reasoning paths. To address this limitation, we propose Lookahead Tree-Based Rollouts (LATR), a novel rollout strategy designed to explicitly promotes trajectory-level diversity by enforcing branching into different candidate tokens likely to yield distinct continuations. Specifically, LATR iteratively operates in three stages: (1) branching at high-uncertainty generation steps, (2) performing lookahead simulation for each new branch, and (3) pruning branches that exhibits prolonged similarity during simulation. Compared with stochastic Sampling, LATR accelerates policy learning by 131% on average and improves final pass@1 performance by 4.2% on both GRPO and Dynamic sAmpling Policy Optimization (DAPO) algorithms across different reasoning tasks. Our code and data are publicly available at https://github.com/starreeze/latr.

[105] OpenReward: Learning to Reward Long-form Agentic Tasks via Reinforcement Learning

Ziyou Hu, Zhengliang Shi, Minghang Zhu, Haitao Li, Teng Sun, Pengjie Ren, Suzan Verberne, Zhaochun Ren

Main category: cs.CL

TL;DR: OpenRM is a tool-augmented reward model that uses external tools to gather evidence for evaluating knowledge-intensive long-form responses, outperforming existing approaches and improving LLM alignment.

Details

Motivation: Existing reward models struggle with knowledge-intensive and long-form tasks where external evidence is needed to evaluate correctness, limiting their ability to discriminate subtle quality differences.

Method: Train OpenRM with Group Relative Policy Optimization on 27K+ synthesized pairwise examples, jointly supervising tool usage and outcome accuracy. The model invokes external tools to gather relevant evidence for judgment.

Result: OpenRM substantially outperforms existing reward modeling approaches on three new datasets and two benchmarks, and improves downstream LLM alignment when used for response selection and data selection.

Conclusion: Tool-augmented reward models like OpenRM show strong potential for scaling reliable long-form evaluation and improving LLM alignment through evidence-based judgment.

Abstract: Reward models (RMs) have become essential for aligning large language models (LLMs), serving as scalable proxies for human evaluation in both training and inference. However, existing RMs struggle on knowledge-intensive and long-form tasks, where evaluating correctness requires grounding beyond the model’s internal knowledge. This limitation hinders them from reliably discriminating subtle quality differences, especially when external evidence is necessary. To address this, we introduce OpenRM, a tool-augmented long-form reward model that systematically judges open-ended responses by invoking external tools to gather relevant evidence. We train OpenRM with Group Relative Policy Optimization (GRPO) on over 27K synthesized pairwise examples generated through a controllable data synthesis framework. The training objective jointly supervises intermediate tool usage and final outcome accuracy, incentivizing our reward model to learn effective evidence-based judgment strategies. Extensive experiments on three newly-collected datasets and two widely-used benchmarks demonstrate that OpenRM substantially outperforms existing reward modeling approaches. As a further step, we integrate OpenRM into both inference-time response selection and training-time data selection. This yields consistent gains in downstream LLM alignment tasks, highlighting the potential of tool-augmented reward models for scaling reliable long-form evaluation.

cs.CV

[106] DrivingScene: A Multi-Task Online Feed-Forward 3D Gaussian Splatting Method for Dynamic Driving Scenes

Qirui Hou, Wenzhang Sun, Chang Zeng, Chunfeng Wang, Hao Li, Jianxun Cui

Main category: cs.CV

TL;DR: DrivingScene is an online framework that reconstructs 4D dynamic driving scenes from two consecutive surround-view images using a lightweight residual flow network and coarse-to-fine training.

Details

Motivation: Real-time, high-fidelity reconstruction of dynamic driving scenes is challenging due to complex dynamics and sparse views, with existing methods struggling to balance quality and efficiency.

Method: Uses a lightweight residual flow network that predicts non-rigid motion of dynamic objects per camera on top of a learned static scene prior, explicitly modeling dynamics via scene flow. Employs coarse-to-fine training to avoid instabilities in end-to-end approaches.

Result: Experiments on nuScenes dataset show the method generates high-quality depth, scene flow, and 3D Gaussian point clouds online, significantly outperforming state-of-the-art methods in both dynamic reconstruction and novel view synthesis.

Conclusion: DrivingScene successfully achieves real-time, high-fidelity 4D dynamic scene reconstruction from sparse views, demonstrating superior performance over existing methods.

Abstract: Real-time, high-fidelity reconstruction of dynamic driving scenes is challenged by complex dynamics and sparse views, with prior methods struggling to balance quality and efficiency. We propose DrivingScene, an online, feed-forward framework that reconstructs 4D dynamic scenes from only two consecutive surround-view images. Our key innovation is a lightweight residual flow network that predicts the non-rigid motion of dynamic objects per camera on top of a learned static scene prior, explicitly modeling dynamics via scene flow. We also introduce a coarse-to-fine training paradigm that circumvents the instabilities common to end-to-end approaches. Experiments on nuScenes dataset show our image-only method simultaneously generates high-quality depth, scene flow, and 3D Gaussian point clouds online, significantly outperforming state-of-the-art methods in both dynamic reconstruction and novel view synthesis.

[107] Towards Fine-Grained Human Motion Video Captioning

Guorui Song, Guocun Wang, Zhe Huang, Jing Lin, Xuefei Zhe, Jian Li, Haoqian Wang

Main category: cs.CV

TL;DR: M-ACM is a motion-augmented caption model that improves video captioning by incorporating human motion representations from mesh recovery to generate more accurate and detailed descriptions of human actions.

Details

Motivation: Existing video captioning models struggle with capturing fine-grained motion details, leading to vague or inconsistent captions that don't accurately describe human actions.

Method: Proposes Motion-Augmented Caption Model (M-ACM) using motion-aware decoding with human mesh recovery representations to highlight body dynamics. Also introduces HMI Dataset (115K video-description pairs) and HMI-Bench benchmark.

Result: M-ACM significantly outperforms previous methods in accurately describing complex human motions and subtle temporal variations, reducing hallucinations and improving semantic fidelity and spatial alignment.

Conclusion: M-ACM sets a new standard for motion-centric video captioning by effectively incorporating motion representations to generate more precise descriptions of human actions in videos.

Abstract: Generating accurate descriptions of human actions in videos remains a challenging task for video captioning models. Existing approaches often struggle to capture fine-grained motion details, resulting in vague or semantically inconsistent captions. In this work, we introduce the Motion-Augmented Caption Model (M-ACM), a novel generative framework that enhances caption quality by incorporating motion-aware decoding. At its core, M-ACM leverages motion representations derived from human mesh recovery to explicitly highlight human body dynamics, thereby reducing hallucinations and improving both semantic fidelity and spatial alignment in the generated captions. To support research in this area, we present the Human Motion Insight (HMI) Dataset, comprising 115K video-description pairs focused on human movement, along with HMI-Bench, a dedicated benchmark for evaluating motion-focused video captioning. Experimental results demonstrate that M-ACM significantly outperforms previous methods in accurately describing complex human motions and subtle temporal variations, setting a new standard for motion-centric video captioning.

[108] Combining SAR Simulators to Train ATR Models with Synthetic Data

Benjamin Camus, Julien Houssay, Corentin Le Barbu, Eric Monteux, Cédric Saleun, Christian Cochin

Main category: cs.CV

TL;DR: Training Deep Learning models for Automatic Target Recognition on SAR images using synthetic data from two complementary simulators (MOCEM and Salsa) to overcome the domain gap between synthetic and real data.

Details

Motivation: Address the lack of real labeled SAR measurements by using synthetic data, but recognize that simulations have limitations in representing real-world complexity, leading to poor generalization of ATR models on real measurements.

Method: Combine two SAR simulators with different paradigms - MOCEM (scattering centers model) and Salsa (ray tracing) - to generate synthetic datasets, then train ATR models using the ADASCA Deep Learning approach.

Result: Achieved 88% accuracy on MSTAR real measurements, demonstrating improved generalization from synthetic to real data.

Conclusion: Using complementary simulation paradigms can effectively bridge the domain gap between synthetic and real SAR data, enabling successful ATR model training when real labeled data is scarce.

Abstract: This work aims to train Deep Learning models to perform Automatic Target Recognition (ATR) on Synthetic Aperture Radar (SAR) images. To circumvent the lack of real labelled measurements, we resort to synthetic data produced by SAR simulators. Simulation offers full control over the virtual environment, which enables us to generate large and diversified datasets at will. However, simulations are intrinsically grounded on simplifying assumptions of the real world (i.e. physical models). Thus, synthetic datasets are not as representative as real measurements. Consequently, ATR models trained on synthetic images cannot generalize well on real measurements. Our contributions to this problem are twofold: on one hand, we demonstrate and quantify the impact of the simulation paradigm on the ATR. On the other hand, we propose a new approach to tackle the ATR problem: combine two SAR simulators that are grounded on different (but complementary) paradigms to produce synthetic datasets. To this end, we use two simulators: MOCEM, which is based on a scattering centers model approach, and Salsa, which resorts on a ray tracing strategy. We train ATR models using synthetic dataset generated both by MOCEM and Salsa and our Deep Learning approach called ADASCA. We reach an accuracy of almost 88 % on the MSTAR measurements.

Haoyang Zhang, Zhou Yang, Ke Sun, Yucai Pang, Guoliang Xu

Main category: cs.CV

TL;DR: Proposes MCIHN, a hybrid network using adversarial autoencoders and cross-modal interaction for multimodal emotion recognition, achieving superior performance on SIMS and MOSI datasets.

Details

Motivation: Multimodal emotion recognition faces challenges due to modality differences and difficulty in characterizing unimodal emotional information, which hinders accurate emotion recognition for human-computer interaction.

Method: Uses adversarial autoencoders (AAE) for each modality to learn discriminative emotion features, then employs Cross-modal Gate Mechanism (CGMM) to reduce modality discrepancies and generate interaction features, followed by Feature Fusion module (FFM) for multimodal fusion.

Result: Experiments on SIMS and MOSI datasets demonstrate that MCIHN achieves superior performance in multimodal emotion recognition.

Conclusion: The proposed MCIHN model effectively addresses modality differences and feature characterization challenges in multimodal emotion recognition through its hybrid architecture combining AAE, CGMM, and FFM components.

Abstract: Multimodal emotion recognition is crucial for future human-computer interaction. However, accurate emotion recognition still faces significant challenges due to differences between different modalities and the difficulty of characterizing unimodal emotional information. To solve these problems, a hybrid network model based on multipath cross-modal interaction (MCIHN) is proposed. First, adversarial autoencoders (AAE) are constructed separately for each modality. The AAE learns discriminative emotion features and reconstructs the features through a decoder to obtain more discriminative information about the emotion classes. Then, the latent codes from the AAE of different modalities are fed into a predefined Cross-modal Gate Mechanism model (CGMM) to reduce the discrepancy between modalities, establish the emotional relationship between interacting modalities, and generate the interaction features between different modalities. Multimodal fusion using the Feature Fusion module (FFM) for better emotion recognition. Experiments were conducted on publicly available SIMS and MOSI datasets, demonstrating that MCIHN achieves superior performance.

[110] Point-level Uncertainty Evaluation of Mobile Laser Scanning Point Clouds

Ziyang Xu, Olaf Wysocki, Christoph Holst

Main category: cs.CV

TL;DR: A machine learning framework using Random Forest and XGBoost models to predict point-level uncertainty in Mobile Laser Scanning point clouds based on local geometric features, achieving ROC-AUC > 0.87.

Details

Motivation: Traditional uncertainty modeling methods require expensive high-precision reference data, which is impractical for large-scale applications. There's a need for scalable uncertainty evaluation without relying on costly reference data.

Method: Proposed a machine learning framework that learns relationships between local geometric features and point-level errors. Used ensemble learning models (Random Forest and XGBoost) trained on spatially partitioned real-world datasets to prevent data leakage.

Result: Both RF and XGBoost models effectively captured nonlinear relationships between geometric features and uncertainty, achieving mean ROC-AUC values above 0.87. Key predictive features included elevation variation, point density, and local structural complexity.

Conclusion: The framework provides a scalable, data-driven approach for uncertainty evaluation in large-scale point clouds, offering a foundation for future quality control and error analysis without requiring expensive reference data.

Abstract: Reliable quantification of uncertainty in Mobile Laser Scanning (MLS) point clouds is essential for ensuring the accuracy and credibility of downstream applications such as 3D mapping, modeling, and change analysis. Traditional backward uncertainty modeling heavily rely on high-precision reference data, which are often costly or infeasible to obtain at large scales. To address this issue, this study proposes a machine learning-based framework for point-level uncertainty evaluation that learns the relationship between local geometric features and point-level errors. The framework is implemented using two ensemble learning models, Random Forest (RF) and XGBoost, which are trained and validated on a spatially partitioned real-world dataset to avoid data leakage. Experimental results demonstrate that both models can effectively capture the nonlinear relationships between geometric characteristics and uncertainty, achieving mean ROC-AUC values above 0.87. The analysis further reveals that geometric features describing elevation variation, point density, and local structural complexity play a dominant role in predicting uncertainty. The proposed framework offers a data-driven perspective of uncertainty evaluation, providing a scalable and adaptable foundation for future quality control and error analysis of large-scale point clouds.

[111] Cross-Enhanced Multimodal Fusion of Eye-Tracking and Facial Features for Alzheimer’s Disease Diagnosis

Yujie Nie, Jianzhang Ni, Yonglong Ye, Yuan-Ting Zhang, Yun Kwok Wing, Xiangqing Xu, Xin Ma, Lizhou Fan

Main category: cs.CV

TL;DR: A multimodal framework using eye-tracking and facial features achieves 95.11% accuracy in Alzheimer’s disease detection through cross-attention fusion and directional convolution modules.

Details

Motivation: Accurate AD diagnosis is crucial for timely intervention. Multimodal approaches integrating behavioral and perceptual data offer promise, but few studies have explored joint use of eye-tracking and facial features for AD detection.

Method: Proposed a multimodal cross-enhanced fusion framework with: (a) Cross-Enhanced Fusion Attention Module (CEFAM) for inter-modal interactions via cross-attention and global enhancement, and (b) Direction-Aware Convolution Module (DACM) for fine-grained directional facial features using horizontal-vertical receptive fields. Built a synchronized multimodal dataset of 25 AD patients and 25 healthy controls during visual memory-search tasks.

Result: Achieved 95.11% classification accuracy in distinguishing AD from healthy controls, outperforming traditional late fusion and feature concatenation methods. Demonstrated superior robustness and diagnostic performance.

Conclusion: The framework effectively models inter-modal dependencies and modality-specific contributions, providing a robust multimodal approach for AD diagnosis that leverages complementary eye-tracking and facial features.

Abstract: Accurate diagnosis of Alzheimer’s disease (AD) is essential for enabling timely intervention and slowing disease progression. Multimodal diagnostic approaches offer considerable promise by integrating complementary information across behavioral and perceptual domains. Eye-tracking and facial features, in particular, are important indicators of cognitive function, reflecting attentional distribution and neurocognitive state. However, few studies have explored their joint integration for auxiliary AD diagnosis. In this study, we propose a multimodal cross-enhanced fusion framework that synergistically leverages eye-tracking and facial features for AD detection. The framework incorporates two key modules: (a) a Cross-Enhanced Fusion Attention Module (CEFAM), which models inter-modal interactions through cross-attention and global enhancement, and (b) a Direction-Aware Convolution Module (DACM), which captures fine-grained directional facial features via horizontal-vertical receptive fields. Together, these modules enable adaptive and discriminative multimodal representation learning. To support this work, we constructed a synchronized multimodal dataset, including 25 patients with AD and 25 healthy controls (HC), by recording aligned facial video and eye-tracking sequences during a visual memory-search paradigm, providing an ecologically valid resource for evaluating integration strategies. Extensive experiments on this dataset demonstrate that our framework outperforms traditional late fusion and feature concatenation methods, achieving a classification accuracy of 95.11% in distinguishing AD from HC, highlighting superior robustness and diagnostic performance by explicitly modeling inter-modal dependencies and modality-specific contributions.

[112] FPGA-based Lane Detection System incorporating Temperature and Light Control Units

Ibrahim Qamar, Saber Mahmoud, Seif Megahed, Mohamed Khaled, Saleh Hesham, Ahmed Matar, Saif Gebril, Mervat Mahmoud

Main category: cs.CV

TL;DR: FPGA-based lane detection system using Sobel edge detection for intelligent vehicles, achieving real-time performance with environmental adaptability features.

Details

Motivation: Intelligent vehicles require reliable lane detection for automation applications, with need for real-time processing and environmental adaptability.

Method: Proposed FPGA-based Lane Detector Vehicle (LDV) architecture using Sobel algorithm for edge detection on 416x416 images at 150MHz frequency.

Result: System generates valid output every 1.17ms, detecting number of lanes, current lane index, and lane boundaries with automated light and temperature control for environmental adaptation.

Conclusion: The FPGA-based LDV system provides efficient real-time lane detection with environmental adaptability, suitable for intelligent vehicle applications.

Abstract: Intelligent vehicles are one of the most important outcomes gained from the world tendency toward automation. Applications of IVs, whether in urban roads or robot tracks, do prioritize lane path detection. This paper proposes an FPGA-based Lane Detector Vehicle LDV architecture that relies on the Sobel algorithm for edge detection. Operating on 416 x 416 images and 150 MHz, the system can generate a valid output every 1.17 ms. The valid output consists of the number of present lanes, the current lane index, as well as its right and left boundaries. Additionally, the automated light and temperature control units in the proposed system enhance its adaptability to the surrounding environmental conditions.

[113] ESCA: Enabling Seamless Codec Avatar Execution through Algorithm and Hardware Co-Optimization for Virtual Reality

Mingzhi Zhu, Ding Shang, Sai Qian Zhang

Main category: cs.CV

TL;DR: ESCA is a full-stack optimization framework that accelerates photorealistic codec avatar inference on VR devices through efficient quantization and custom hardware acceleration.

Details

Motivation: Photorealistic Codec Avatars enable immersive VR communication but impose high computational demands that challenge real-time inference on resource-constrained VR devices like head-mounted displays.

Method: Proposed efficient post-training quantization for Codec Avatar models and designed a custom hardware accelerator integrated into VR SoCs, combined into the ESCA full-stack optimization framework.

Result: ESCA boosts FovVideoVDP quality scores by up to +0.39 over best 4-bit baseline, delivers 3.36× latency reduction, and sustains 100 FPS rendering rate in end-to-end tests.

Conclusion: ESCA demonstrates feasibility of deploying high-fidelity codec avatars on resource-constrained devices, enabling more immersive and portable VR experiences.

Abstract: Photorealistic Codec Avatars (PCA), which generate high-fidelity human face renderings, are increasingly being used in Virtual Reality (VR) environments to enable immersive communication and interaction through deep learning-based generative models. However, these models impose significant computational demands, making real-time inference challenging on resource-constrained VR devices such as head-mounted displays, where latency and power efficiency are critical. To address this challenge, we propose an efficient post-training quantization (PTQ) method tailored for Codec Avatar models, enabling low-precision execution without compromising output quality. In addition, we design a custom hardware accelerator that can be integrated into the system-on-chip of VR devices to further enhance processing efficiency. Building on these components, we introduce ESCA, a full-stack optimization framework that accelerates PCA inference on edge VR platforms. Experimental results demonstrate that ESCA boosts FovVideoVDP quality scores by up to $+0.39$ over the best 4-bit baseline, delivers up to $3.36\times$ latency reduction, and sustains a rendering rate of 100 frames per second in end-to-end tests, satisfying real-time VR requirements. These results demonstrate the feasibility of deploying high-fidelity codec avatars on resource-constrained devices, opening the door to more immersive and portable VR experiences.

[114] Neighborhood Feature Pooling for Remote Sensing Image Classification

Fahimeh Orvati Nia, Amirmohammad Mohammadi, Salim Al Kharsa, Pragati Naikare, Zigfried Hampel-Arias, Joshua Peeples

Main category: cs.CV

TL;DR: Proposes neighborhood feature pooling (NFP) as a novel texture feature extraction method for remote sensing image classification that captures local relationships and improves performance with minimal parameters.

Details

Motivation: To develop an efficient texture feature extraction method for remote sensing image classification that can capture local neighborhood relationships and be easily integrated into existing networks.

Method: Uses neighborhood feature pooling (NFP) implemented with convolutional layers to capture relationships between neighboring inputs and aggregate local similarities across feature dimensions.

Result: NFP consistently improves performance across diverse datasets and architectures while maintaining minimal parameter overhead compared to baseline models.

Conclusion: NFP is an effective texture feature extraction method that can be seamlessly integrated into any network and provides consistent performance improvements for remote sensing image classification.

Abstract: In this work, we propose neighborhood feature pooling (NFP) as a novel texture feature extraction method for remote sensing image classification. The NFP layer captures relationships between neighboring inputs and efficiently aggregates local similarities across feature dimensions. Implemented using convolutional layers, NFP can be seamlessly integrated into any network. Results comparing the baseline models and the NFP method indicate that NFP consistently improves performance across diverse datasets and architectures while maintaining minimal parameter overhead.

[115] The Underappreciated Power of Vision Models for Graph Structural Understanding

Xinjian Zhao, Wei Pang, Zhongkai Xue, Xiangru Jian, Lei Zhang, Yaoyao Xu, Xiaozhuang Song, Shu Wu, Tianshu Yu

Main category: cs.CV

TL;DR: Vision models achieve comparable performance to GNNs on graph benchmarks but with different learning patterns, excelling at holistic structural understanding while GNNs struggle with global pattern abstraction.

Details

Motivation: To investigate the underappreciated potential of vision models for graph understanding, as they capture global structures intuitively like human perception, unlike GNNs' bottom-up message-passing approach.

Method: Introduced GraphAbstract benchmark to evaluate models’ ability to perceive global graph properties (organizational archetypes, symmetry, connectivity strength, critical elements) and compared vision models with GNNs.

Result: Vision models significantly outperform GNNs on tasks requiring holistic structural understanding, maintain generalizability across graph scales, while GNNs degrade with increasing graph size.

Conclusion: Vision models possess remarkable capabilities for graph structural understanding, particularly for global topological awareness and scale-invariant reasoning, opening new avenues for graph foundation models.

Abstract: Graph Neural Networks operate through bottom-up message-passing, fundamentally differing from human visual perception, which intuitively captures global structures first. We investigate the underappreciated potential of vision models for graph understanding, finding they achieve performance comparable to GNNs on established benchmarks while exhibiting distinctly different learning patterns. These divergent behaviors, combined with limitations of existing benchmarks that conflate domain features with topological understanding, motivate our introduction of GraphAbstract. This benchmark evaluates models’ ability to perceive global graph properties as humans do: recognizing organizational archetypes, detecting symmetry, sensing connectivity strength, and identifying critical elements. Our results reveal that vision models significantly outperform GNNs on tasks requiring holistic structural understanding and maintain generalizability across varying graph scales, while GNNs struggle with global pattern abstraction and degrade with increasing graph size. This work demonstrates that vision models possess remarkable yet underutilized capabilities for graph structural understanding, particularly for problems requiring global topological awareness and scale-invariant reasoning. These findings open new avenues to leverage this underappreciated potential for developing more effective graph foundation models for tasks dominated by holistic pattern recognition.

[116] Seeing Clearly and Deeply: An RGBD Imaging Approach with a Bio-inspired Monocentric Design

Zongxi Yu, Xiaolong Qian, Shaohua Gao, Qi Jiang, Yao Gao, Kailun Yang, Kaiwei Wang

Main category: cs.CV

TL;DR: The paper introduces Bionic Monocentric Imaging (BMI), a co-designed framework using bio-inspired spherical lenses to encode depth in PSFs, enabling joint high-fidelity RGB image restoration and precise depth estimation from single captures.

Details

Motivation: Address the dual challenge of achieving high-fidelity compact RGBD imaging where conventional optics struggle with RGB sharpness across depth-of-field and software-only monocular depth estimation relies on unreliable semantic priors.

Method: Developed a bio-inspired all-spherical monocentric lens that naturally encodes depth into depth-varying PSFs, created a physically-based forward model for synthetic data generation, and designed a dual-head multi-scale reconstruction network with shared encoder for joint AiF image and depth map recovery.

Result: State-of-the-art performance: depth estimation achieved Abs Rel 0.026 and RMSE 0.130, significantly outperforming software-only approaches and other deep optics systems; image restoration achieved SSIM 0.960 and LPIPS 0.082.

Conclusion: Integration of bio-inspired spherical optics with joint reconstruction algorithms effectively addresses intrinsic challenges in high-performance compact RGBD imaging, providing superior balance between image fidelity and depth accuracy.

Abstract: Achieving high-fidelity, compact RGBD imaging presents a dual challenge: conventional compact optics struggle with RGB sharpness across the entire depth-of-field, while software-only Monocular Depth Estimation (MDE) is an ill-posed problem reliant on unreliable semantic priors. While deep optics with elements like DOEs can encode depth, they introduce trade-offs in fabrication complexity and chromatic aberrations, compromising simplicity. To address this, we first introduce a novel bio-inspired all-spherical monocentric lens, around which we build the Bionic Monocentric Imaging (BMI) framework, a holistic co-design. This optical design naturally encodes depth into its depth-varying Point Spread Functions (PSFs) without requiring complex diffractive or freeform elements. We establish a rigorous physically-based forward model to generate a synthetic dataset by precisely simulating the optical degradation process. This simulation pipeline is co-designed with a dual-head, multi-scale reconstruction network that employs a shared encoder to jointly recover a high-fidelity All-in-Focus (AiF) image and a precise depth map from a single coded capture. Extensive experiments validate the state-of-the-art performance of the proposed framework. In depth estimation, the method attains an Abs Rel of 0.026 and an RMSE of 0.130, markedly outperforming leading software-only approaches and other deep optics systems. For image restoration, the system achieves an SSIM of 0.960 and a perceptual LPIPS score of 0.082, thereby confirming a superior balance between image fidelity and depth accuracy. This study illustrates that the integration of bio-inspired, fully spherical optics with a joint reconstruction algorithm constitutes an effective strategy for addressing the intrinsic challenges in high-performance compact RGBD imaging. Source code will be publicly available at https://github.com/ZongxiYu-ZJU/BMI.

[117] XY-Cut++: Advanced Layout Ordering via Hierarchical Mask Mechanism on a Novel Benchmark

Shuai Liu, Youmeng Li, Jizeng Wei

Main category: cs.CV

TL;DR: XY-Cut++ is an advanced document reading order recovery method that achieves 98.8 BLEU score, outperforming existing methods by up to 24% on complex layouts.

Details

Motivation: Existing document reading order recovery methods struggle with complex layouts like multi-column newspapers, high-overhead cross-modal interactions, and lack robust evaluation benchmarks, which are critical for RAG and LLM preprocessing.

Method: XY-Cut++ integrates pre-mask processing, multi-granularity segmentation, and cross-modal matching to enhance layout ordering accuracy compared to traditional XY-Cut techniques.

Result: Achieves state-of-the-art 98.8 BLEU overall, outperforms baselines by up to 24%, and shows consistent accuracy across simple and complex layouts on the DocBench-100 dataset.

Conclusion: XY-Cut++ establishes a reliable foundation for document structure recovery, setting a new standard for layout ordering tasks and facilitating more effective RAG and LLM preprocessing.

Abstract: Document Reading Order Recovery is a fundamental task in document image understanding, playing a pivotal role in enhancing Retrieval-Augmented Generation (RAG) and serving as a critical preprocessing step for large language models (LLMs). Existing methods often struggle with complex layouts(e.g., multi-column newspapers), high-overhead interactions between cross-modal elements (visual regions and textual semantics), and a lack of robust evaluation benchmarks. We introduce XY-Cut++, an advanced layout ordering method that integrates pre-mask processing, multi-granularity segmentation, and cross-modal matching to address these challenges. Our method significantly enhances layout ordering accuracy compared to traditional XY-Cut techniques. Specifically, XY-Cut++ achieves state-of-the-art performance (98.8 BLEU overall) while maintaining simplicity and efficiency. It outperforms existing baselines by up to 24% and demonstrates consistent accuracy across simple and complex layouts on the newly introduced DocBench-100 dataset. This advancement establishes a reliable foundation for document structure recovery, setting a new standard for layout ordering tasks and facilitating more effective RAG and LLM preprocessing.

[118] A Re-node Self-training Approach for Deep Graph-based Semi-supervised Classification on Multi-view Image Data

Jingjun Bi, Fadi Dornaika

Main category: cs.CV

TL;DR: RSGSLM is a graph-based semi-supervised learning method for multi-view data that combines linear feature transformation, multi-view graph fusion, dynamic pseudo-labeling, and topological imbalance correction to improve classification performance.

Details

Motivation: Traditional graph-based methods struggle with multi-view image data due to unclear graph structures and the complexity of integrating multiple views, limiting their efficiency in semi-supervised learning scenarios.

Method: Combines linear feature transformation and multi-view graph fusion in GCN framework, dynamically incorporates pseudo-labels into loss function, corrects topological imbalances by adjusting weights of boundary samples, and adds unsupervised smoothing loss.

Result: Experimental results on multi-view benchmark image datasets show RSGSLM outperforms existing semi-supervised learning approaches in multi-view contexts.

Conclusion: The proposed RSGSLM method effectively addresses challenges in multi-view graph-based semi-supervised learning and achieves superior performance while maintaining computational efficiency.

Abstract: Recently, graph-based semi-supervised learning and pseudo-labeling have gained attention due to their effectiveness in reducing the need for extensive data annotations. Pseudo-labeling uses predictions from unlabeled data to improve model training, while graph-based methods are characterized by processing data represented as graphs. However, the lack of clear graph structures in images combined with the complexity of multi-view data limits the efficiency of traditional and existing techniques. Moreover, the integration of graph structures in multi-view data is still a challenge. In this paper, we propose Re-node Self-taught Graph-based Semi-supervised Learning for Multi-view Data (RSGSLM). Our method addresses these challenges by (i) combining linear feature transformation and multi-view graph fusion within a Graph Convolutional Network (GCN) framework, (ii) dynamically incorporating pseudo-labels into the GCN loss function to improve classification in multi-view data, and (iii) correcting topological imbalances by adjusting the weights of labeled samples near class boundaries. Additionally, (iv) we introduce an unsupervised smoothing loss applicable to all samples. This combination optimizes performance while maintaining computational efficiency. Experimental results on multi-view benchmark image datasets demonstrate that RSGSLM surpasses existing semi-supervised learning approaches in multi-view contexts.

[119] PISA-Bench: The PISA Index as a Multilingual and Multimodal Metric for the Evaluation of Vision-Language Models

Patrick Haller, Fabio Barth, Jonas Golde, Georg Rehm, Alan Akbik

Main category: cs.CV

TL;DR: PISA-Bench is a multilingual vision-language benchmark derived from expert-created PISA tests, covering six languages with human-verified examples to address limitations in existing synthetic datasets.

Details

Motivation: Existing vision-language benchmarks lack high-quality, human-verified examples and are mostly limited to English, with many relying on LLM-generated synthetic content. Manual quality assurance for multilingual datasets is costly and time-consuming.

Method: Created a multilingual benchmark using English examples from PISA tests, including human-extracted instructions, questions, answer options, and images. Translated into five additional languages (Spanish, German, Chinese, French, Italian) to create a fully parallel corpus across six languages.

Result: Evaluation of state-of-the-art VLMs shows small models (<20B parameters) fail to achieve high scores, substantial performance degradation on non-English splits, and high error rates on spatial and geometric reasoning tasks.

Conclusion: PISA-Bench provides a valuable resource for advancing multilingual multimodal reasoning research, highlighting current model limitations and performance gaps across languages and reasoning types.

Abstract: Vision-language models (VLMs) have demonstrated remarkable progress in multimodal reasoning. However, existing benchmarks remain limited in terms of high-quality, human-verified examples. Many current datasets rely on synthetically generated content by large language models (LLMs). Furthermore, most datasets are limited to English, as manual quality assurance of translated samples is time-consuming and costly. To fill this gap, we introduce PISA-Bench, a multilingual benchmark derived from English examples of the expert-created PISA tests, a unified framework for the assessment of student competencies in over eighty countries. Each example consists of human-extracted instructions, questions, answer options, and images, enriched with question type categories, and has been translated from English into five additional languages (Spanish, German, Chinese, French, and Italian), resulting in a fully parallel corpus covering six languages. We evaluate state-of-the-art vision-language models on PISA-Bench and find that especially small models (<20B parameters) fail to achieve high test scores. We further find substantial performance degradation on non-English splits as well as high error-rates when models are tasked with spatial and geometric reasoning. By releasing the dataset and evaluation framework, we provide a resource for advancing research on multilingual multimodal reasoning.

[120] A Survey on Efficient Vision-Language-Action Models

Zhaoshu Yu, Bo Wang, Pengpeng Zeng, Haonan Zhang, Ji Zhang, Lianli Gao, Jingkuan Song, Nicu Sebe, Heng Tao Shen

Main category: cs.CV

TL;DR: This survey provides the first comprehensive review of Efficient Vision-Language-Action models (Efficient VLAs), addressing computational and data challenges through a unified taxonomy covering efficient model design, training, and data collection.

Details

Motivation: Vision-Language-Action models face deployment challenges due to substantial computational and data requirements from their large-scale foundation models, creating an urgent need for efficiency solutions.

Method: Introduces a unified taxonomy organizing techniques into three pillars: Efficient Model Design (architectures and compression), Efficient Training (reducing computational burdens), and Efficient Data Collection (addressing robotic data bottlenecks).

Result: Establishes a foundational reference for the community, summarizes representative applications, delineates key challenges, and provides a roadmap for future research in Efficient VLAs.

Conclusion: This survey serves as a comprehensive framework for understanding and advancing Efficient VLAs, with ongoing updates maintained through a dedicated project page.

Abstract: Vision-Language-Action models (VLAs) represent a significant frontier in embodied intelligence, aiming to bridge digital knowledge with physical-world interaction. While these models have demonstrated remarkable generalist capabilities, their deployment is severely hampered by the substantial computational and data requirements inherent to their underlying large-scale foundation models. Motivated by the urgent need to address these challenges, this survey presents the first comprehensive review of Efficient Vision-Language-Action models (Efficient VLAs) across the entire data-model-training process. Specifically, we introduce a unified taxonomy to systematically organize the disparate efforts in this domain, categorizing current techniques into three core pillars: (1) Efficient Model Design, focusing on efficient architectures and model compression; (2) Efficient Training, which reduces computational burdens during model learning; and (3) Efficient Data Collection, which addresses the bottlenecks in acquiring and utilizing robotic data. Through a critical review of state-of-the-art methods within this framework, this survey not only establishes a foundational reference for the community but also summarizes representative applications, delineates key challenges, and charts a roadmap for future research. We maintain a continuously updated project page to track our latest developments: https://evla-survey.github.io/

[121] AI in Lung Health: Benchmarking Detection and Diagnostic Models Across Multiple CT Scan Datasets

Fakrul Islam Tushar, Avivah Wang, Lavsen Dahal, Ehsan Samei, Michael R. Harowicz, Jayashree Kalpathy-Cramer, Kyle J. Lafata, Tina D. Tailor, Cynthia Rudin, Joseph Y. Lo

Main category: cs.CV

TL;DR: Created reproducible AI benchmarking for lung cancer screening using Duke Lung Cancer Screening dataset and public datasets, developing models for nodule detection and classification with public release of all resources.

Details

Motivation: To establish standardized benchmarking for AI models in lung cancer screening that requires large annotated CT datasets and rigorous performance evaluation.

Method: Used DLCS dataset (1,613 patients; 2,487 nodules) and external datasets (LUNA16, LUNA25, NLST-3D). For detection: trained MONAI RetinaNet models. For classification: compared five strategies including pretrained models, self-supervised foundation model, and ResNet50 with Strategic Warm-Start.

Result: DLCS-De detection model outperformed LUNA16-De on both internal (CPM 0.63 vs 0.45) and external validation (CPM 0.58 vs 0.49). ResNet50-SWS classification achieved AUCs of 0.71-0.90 across datasets, matching or exceeding baseline models.

Conclusion: Established standardized benchmarking resource for lung cancer AI research supporting model development, validation, and translation, with all code, models, and data publicly released for reproducibility.

Abstract: Background: Development of artificial intelligence (AI) models for lung cancer screening requires large, well-annotated low-dose computed tomography (CT) datasets and rigorous performance benchmarks. Purpose: To create a reproducible benchmarking resource leveraging the Duke Lung Cancer Screening (DLCS) and multiple public datasets to develop and evaluate models for nodule detection and classification. Materials & Methods: This retrospective study uses the DLCS dataset (1,613 patients; 2,487 nodules) and external datasets including LUNA16, LUNA25, and NLST-3D. For detection, MONAI RetinaNet models were trained on DLCS (DLCS-De) and LUNA16 (LUNA16-De) and evaluated using the Competition Performance Metric (CPM). For nodule-level classification, we compare five strategies: pretrained models (Models Genesis, Med3D), a self-supervised foundation model (FMCB), and ResNet50 with random initialization versus Strategic Warm-Start (ResNet50-SWS) pretrained with detection-derived candidate patches stratified by confidence. Results: For detection on the DLCS test set, DLCS-De achieved sensitivity 0.82 at 2 false positives/scan (CPM 0.63) versus LUNA16-De (0.62, CPM 0.45). For external validation on NLST-3D, DLCS-De (sensitivity 0.72, CPM 0.58) also outperformed LUNA16-De (sensitivity 0.64, CPM 0.49). For classification across multiple datasets, ResNet50-SWS attained AUCs of 0.71 (DLCS; 95% CI, 0.61-0.81), 0.90 (LUNA16; 0.87-0.93), 0.81 (NLST-3D; 0.79-0.82), and 0.80 (LUNA25; 0.78-0.82), matching or exceeding pretrained/self-supervised baselines. Performance differences reflected dataset label standards. Conclusion: This work establishes a standardized benchmarking resource for lung cancer AI research, supporting model development, validation, and translation. All code, models, and data are publicly released to promote reproducibility.

[122] Conflict Adaptation in Vision-Language Models

Xiaoyang Hu

Main category: cs.CV

TL;DR: Vision-language models exhibit human-like conflict adaptation behavior in sequential Stroop tasks, with neural representations mirroring human automaticity asymmetries between reading and color naming.

Details

Motivation: To understand whether AI models exhibit human-like cognitive control mechanisms, specifically conflict adaptation, and to identify the neural representations underlying this behavior.

Method: Used sequential Stroop task with 13 vision-language models, then employed sparse autoencoders (SAEs) to analyze task-relevant supernodes in InternVL 3.5 4B model.

Result: 12 of 13 VLMs showed conflict adaptation behavior; identified overlapping text/color supernodes in early/late layers mirroring human automaticity asymmetries; isolated conflict-modulated supernode in layers 24-25 whose ablation increased Stroop errors.

Conclusion: VLMs exhibit human-like cognitive control mechanisms with neural representations that parallel human cognitive architecture, suggesting shared computational principles between biological and artificial intelligence.

Abstract: A signature of human cognitive control is conflict adaptation: improved performance on a high-conflict trial following another high-conflict trial. This phenomenon offers an account for how cognitive control, a scarce resource, is recruited. Using a sequential Stroop task, we find that 12 of 13 vision-language models (VLMs) tested exhibit behavior consistent with conflict adaptation, with the lone exception likely reflecting a ceiling effect. To understand the representational basis of this behavior, we use sparse autoencoders (SAEs) to identify task-relevant supernodes in InternVL 3.5 4B. Partially overlapping supernodes emerge for text and color in both early and late layers, and their relative sizes mirror the automaticity asymmetry between reading and color naming in humans. We further isolate a conflict-modulated supernode in layers 24-25 whose ablation significantly increases Stroop errors while minimally affecting congruent trials.

[123] Classification of Driver Behaviour Using External Observation Techniques for Autonomous Vehicles

Ian Nell, Shane Gilroy

Main category: cs.CV

TL;DR: A vision-based driver behavior classification system that detects distracted and impaired driving through external observation using computer vision techniques like object tracking and lane position monitoring.

Details

Motivation: Road traffic accidents caused by human error, particularly distracted and impaired driving, remain a significant global concern that needs addressing.

Method: Uses advanced computer vision with YOLO object detection model, real-time object tracking, lateral displacement analysis, and custom lane estimation algorithms to detect unsafe driving behaviors like excessive lateral movement and erratic trajectories.

Result: Experimental evaluations on diverse video datasets demonstrate the framework’s reliability and adaptability across varying road and environmental conditions.

Conclusion: The vision-based approach enables behavioral analysis of non-connected vehicles and provides an effective alternative to systems reliant on inter-vehicular communication.

Abstract: Road traffic accidents remain a significant global concern, with human error, particularly distracted and impaired driving, among the leading causes. This study introduces a novel driver behaviour classification system that uses external observation techniques to detect indicators of distraction and impairment. The proposed framework employs advanced computer vision methodologies, including real-time object tracking, lateral displacement analysis, and lane position monitoring. The system identifies unsafe driving behaviours such as excessive lateral movement and erratic trajectory patterns by implementing the YOLO object detection model and custom lane estimation algorithms. Unlike systems reliant on inter-vehicular communication, this vision-based approach enables behavioural analysis of non-connected vehicles. Experimental evaluations on diverse video datasets demonstrate the framework’s reliability and adaptability across varying road and environmental conditions.

[124] DualCap: Enhancing Lightweight Image Captioning via Dual Retrieval with Similar Scenes Visual Prompts

Binbin Li, Guimiao Yang, Zisen Qi, Haiping Wang, Yu Ding

Main category: cs.CV

TL;DR: DualCap is a lightweight retrieval-augmented image captioning model that uses dual retrieval (image-to-text and image-to-image) to generate both text and visual prompts, enhancing visual representation without requiring many trainable parameters.

Details

Motivation: Existing lightweight retrieval-augmented image caption models only use retrieved data as text prompts, creating a semantic gap by leaving original visual features unenhanced, especially for object details and complex scenes.

Method: Proposes DualCap with dual retrieval mechanism: standard image-to-text retrieval for text prompts and novel image-to-image retrieval for visually similar scenes. Extracts salient keywords from similar images’ captions, encodes them, and fuses with original image features using lightweight trainable feature fusion network.

Result: Extensive experiments show the method achieves competitive performance while requiring fewer trainable parameters compared to previous visual-prompting captioning approaches.

Conclusion: DualCap effectively bridges the semantic gap in lightweight retrieval-augmented image captioning by enhancing visual representation through dual retrieval mechanisms, achieving strong performance with reduced parameter requirements.

Abstract: Recent lightweight retrieval-augmented image caption models often utilize retrieved data solely as text prompts, thereby creating a semantic gap by leaving the original visual features unenhanced, particularly for object details or complex scenes. To address this limitation, we propose $DualCap$, a novel approach that enriches the visual representation by generating a visual prompt from retrieved similar images. Our model employs a dual retrieval mechanism, using standard image-to-text retrieval for text prompts and a novel image-to-image retrieval to source visually analogous scenes. Specifically, salient keywords and phrases are derived from the captions of visually similar scenes to capture key objects and similar details. These textual features are then encoded and integrated with the original image features through a lightweight, trainable feature fusion network. Extensive experiments demonstrate that our method achieves competitive performance while requiring fewer trainable parameters compared to previous visual-prompting captioning approaches.

[125] Caption-Driven Explainability: Probing CNNs for Bias via CLIP

Patrick Koller, Amil V. Dravid, Guido M. Schuster, Aggelos K. Katsaggelos

Main category: cs.CV

TL;DR: This paper proposes a caption-based XAI method that integrates standalone ML models into CLIP using network surgery to identify dominant concepts in predictions, improving robustness against covariate shifts.

Details

Motivation: Traditional saliency map XAI methods can be misleading when spurious and salient features overlap in pixel space, creating robustness problems in machine learning.

Method: Integrates standalone models into CLIP using novel network surgery approach to create caption-based XAI that identifies dominant concepts contributing to predictions.

Result: Developed a method that minimizes risk of models falling for covariate shifts by providing concept-based explanations rather than pixel-level saliency maps.

Conclusion: The caption-based XAI approach significantly contributes to developing more robust ML models by providing clearer explanations that avoid pixel-space overlap issues.

Abstract: Robustness has become one of the most critical problems in machine learning (ML). The science of interpreting ML models to understand their behavior and improve their robustness is referred to as explainable artificial intelligence (XAI). One of the state-of-the-art XAI methods for computer vision problems is to generate saliency maps. A saliency map highlights the pixel space of an image that excites the ML model the most. However, this property could be misleading if spurious and salient features are present in overlapping pixel spaces. In this paper, we propose a caption-based XAI method, which integrates a standalone model to be explained into the contrastive language-image pre-training (CLIP) model using a novel network surgery approach. The resulting caption-based XAI model identifies the dominant concept that contributes the most to the models prediction. This explanation minimizes the risk of the standalone model falling for a covariate shift and contributes significantly towards developing robust ML models. Our code is available at https://github.com/patch0816/caption-driven-xai

[126] Deep Feature Optimization for Enhanced Fish Freshness Assessment

Phi-Hung Hoang, Nam-Thuan Trinh, Van-Manh Tran, Thi-Thu-Hong Phan

Main category: cs.CV

TL;DR: A three-stage framework using deep learning and classical ML for fish freshness assessment achieves 85.99% accuracy on FFE dataset, outperforming previous methods by 8.69-22.78%.

Details

Motivation: Traditional sensory evaluation of fish freshness is subjective and inconsistent, while existing deep learning approaches lack accuracy and feature transparency.

Method: Three-stage framework: 1) Fine-tune five vision architectures, 2) Extract multi-level features to train seven classical ML classifiers, 3) Apply feature selection methods to identify compact feature subset.

Result: Best configuration (Swin-Tiny features + Extra Trees classifier + LGBM feature selection) achieved 85.99% accuracy on FFE dataset, significantly outperforming previous studies.

Conclusion: The proposed framework is effective and generalizable for visual quality evaluation tasks, combining deep and traditional decision mechanisms.

Abstract: Assessing fish freshness is vital for ensuring food safety and minimizing economic losses in the seafood industry. However, traditional sensory evaluation remains subjective, time-consuming, and inconsistent. Although recent advances in deep learning have automated visual freshness prediction, challenges related to accuracy and feature transparency persist. This study introduces a unified three-stage framework that refines and leverages deep visual representations for reliable fish freshness assessment. First, five state-of-the-art vision architectures - ResNet-50, DenseNet-121, EfficientNet-B0, ConvNeXt-Base, and Swin-Tiny - are fine-tuned to establish a strong baseline. Next, multi-level deep features extracted from these backbones are used to train seven classical machine learning classifiers, integrating deep and traditional decision mechanisms. Finally, feature selection methods based on Light Gradient Boosting Machine (LGBM), Random Forest, and Lasso identify a compact and informative subset of features. Experiments on the Freshness of the Fish Eyes (FFE) dataset demonstrate that the best configuration combining Swin-Tiny features, an Extra Trees classifier, and LGBM-based feature selection achieves an accuracy of 85.99%, outperforming recent studies on the same dataset by 8.69-22.78%. These findings confirm the effectiveness and generalizability of the proposed framework for visual quality evaluation tasks.

[127] Perception, Understanding and Reasoning, A Multimodal Benchmark for Video Fake News Detection

Cui Yakun, Fushuo Huo, Weijie Shi, Juntao Dai, Hang Du, Zhenghao Zhu, Sirui Han, Yike Guo

Main category: cs.CV

TL;DR: The paper introduces MVFNDB, a comprehensive benchmark for multi-modal video fake news detection with 10 tasks and 9730 annotated questions to evaluate MLLMs’ perception, understanding, and reasoning abilities.

Details

Motivation: Traditional video fake news detection benchmarks focus only on final accuracy without providing fine-grained assessment of the detection process, making it a black box.

Method: Created MVFNDB benchmark with 10 tasks based on empirical analysis, and designed MVFND-CoT framework that incorporates both creator-added content and original shooting footage reasoning.

Result: The benchmark provides foundation for task definition and enables in-depth analysis of factors influencing accuracy, including video processing strategies and feature-model alignment.

Conclusion: The MVFNDB benchmark lays solid foundation for future evaluations and advancements of MLLMs in video fake news detection domain.

Abstract: The advent of multi-modal large language models (MLLMs) has greatly advanced research into applications for Video fake news detection (VFND) tasks. Traditional video-based FND benchmarks typically focus on the accuracy of the final decision, often failing to provide fine-grained assessments for the entire detection process, making the detection process a black box. Therefore, we introduce the MVFNDB (Multi-modal Video Fake News Detection Benchmark) based on the empirical analysis, which provides foundation for tasks definition. The benchmark comprises 10 tasks and is meticulously crafted to probe MLLMs’ perception, understanding, and reasoning capacities during detection, featuring 9730 human-annotated video-related questions based on a carefully constructed taxonomy ability of VFND. To validate the impact of combining multiple features on the final results, we design a novel framework named MVFND-CoT, which incorporates both creator-added content and original shooting footage reasoning. Building upon the benchmark, we conduct an in-depth analysis of the deeper factors influencing accuracy, including video processing strategies and the alignment between video features and model capabilities. We believe this benchmark will lay a solid foundation for future evaluations and advancements of MLLMs in the domain of video fake news detection.

[128] SafeEditor: Unified MLLM for Efficient Post-hoc T2I Safety Editing

Ruiyang Zhang, Jiahao Luo, Xiaoru Feng, Qiufan Pang, Yaodong Yang, Juntao Dai

Main category: cs.CV

TL;DR: A multi-round safety editing framework called SafeEditor is proposed to address over-refusal and safety-utility imbalance in text-to-image models through post-hoc editing.

Details

Motivation: Existing inference-time safety methods for text-to-image models suffer from limitations like over-refusal and poor balance between safety and utility, requiring a more effective solution.

Method: Proposed MR-SafeEdit dataset for safety editing and SafeEditor - a unified MLLM that performs multi-round safety editing on generated images using a post-hoc paradigm mirroring human cognitive processes.

Result: SafeEditor surpasses prior safety approaches by reducing over-refusal while achieving better safety-utility balance.

Conclusion: The multi-round safety editing framework provides an effective model-agnostic solution for text-to-image model safety alignment.

Abstract: With the rapid advancement of text-to-image (T2I) models, ensuring their safety has become increasingly critical. Existing safety approaches can be categorized into training-time and inference-time methods. While inference-time methods are widely adopted due to their cost-effectiveness, they often suffer from limitations such as over-refusal and imbalance between safety and utility. To address these challenges, we propose a multi-round safety editing framework that functions as a model-agnostic, plug-and-play module, enabling efficient safety alignment for any text-to-image model. Central to this framework is MR-SafeEdit, a multi-round image-text interleaved dataset specifically constructed for safety editing in text-to-image generation. We introduce a post-hoc safety editing paradigm that mirrors the human cognitive process of identifying and refining unsafe content. To instantiate this paradigm, we develop SafeEditor, a unified MLLM capable of multi-round safety editing on generated images. Experimental results show that SafeEditor surpasses prior safety approaches by reducing over-refusal while achieving a more favorable safety-utility balance.

[129] Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation

Inclusion AI, :, Bowen Ma, Cheng Zou, Canxiang Yan, Chunxiang Jin, Chunjie Shen, Dandan Zheng, Fudong Wang, Furong Xu, GuangMing Yao, Jun Zhou, Jingdong Chen, Jianing Li, Jianxin Sun, Jiajia Liu, Jianjiang Zhu, Jianping Jiang, Jun Peng, Kaixiang Ji, Kaimeng Ren, Libin Wang, Lixiang Ru, Longhua Tan, Lan Wang, Mochen Bai, Ning Gao, Qingpei Guo, Qinglong Zhang, Qiang Xu, Rui Liu, Ruijie Xiong, Ruobing Zheng, Sirui Gao, Tianqi Li, Tinghao Liu, Weilong Chai, Xinyu Xiao, Xiaomei Wang, Xiaolong Wang, Xiao Lu, Xiaoyu Li, Xingning Dong, Xuzheng Yu, Yi Yuan, Yuting Gao, Yuting Xiao, Yunxiao Sun, Yipeng Chen, Yifan Mao, Yifei Wu, Yongjie Lyu, Ziping Ma, Zhiqiang Fang, Zhihao Qiu, Ziyuan Huang, Zizheng Yang, Zhengyu He

Main category: cs.CV

TL;DR: Ming-Flash-Omni is an upgraded multimodal AI model using sparse Mixture-of-Experts architecture with 100B total parameters (6.1B active per token), achieving state-of-the-art performance in text-to-image generation, generative segmentation, and contextual automatic speech recognition.

Details

Motivation: To develop a more efficient and capable unified multimodal AI system that can handle vision, speech, and language tasks simultaneously, representing progress toward Artificial General Intelligence.

Method: Built upon a sparser Mixture-of-Experts variant of Ling-Flash-2.0 with 100 billion total parameters (only 6.1 billion active per token), enabling efficient scaling and unified multimodal processing.

Result: Achieved state-of-the-art results in text-to-image generation and generative segmentation, set new records on all 12 contextual ASR benchmarks, with substantial improvements in multimodal understanding, speech recognition, image generation fidelity, and editing consistency.

Conclusion: Ming-Flash-Omni demonstrates that sparse MoE architectures can enable highly efficient scaling while achieving state-of-the-art performance across multiple modalities within a single unified framework, representing significant progress toward AGI.

Abstract: We propose Ming-Flash-Omni, an upgraded version of Ming-Omni, built upon a sparser Mixture-of-Experts (MoE) variant of Ling-Flash-2.0 with 100 billion total parameters, of which only 6.1 billion are active per token. This architecture enables highly efficient scaling (dramatically improving computational efficiency while significantly expanding model capacity) and empowers stronger unified multimodal intelligence across vision, speech, and language, representing a key step toward Artificial General Intelligence (AGI). Compared to its predecessor, the upgraded version exhibits substantial improvements across multimodal understanding and generation. We significantly advance speech recognition capabilities, achieving state-of-the-art performance in contextual ASR and highly competitive results in dialect-aware ASR. In image generation, Ming-Flash-Omni introduces high-fidelity text rendering and demonstrates marked gains in scene consistency and identity preservation during image editing. Furthermore, Ming-Flash-Omni introduces generative segmentation, a capability that not only achieves strong standalone segmentation performance but also enhances spatial control in image generation and improves editing consistency. Notably, Ming-Flash-Omni achieves state-of-the-art results in text-to-image generation and generative segmentation, and sets new records on all 12 contextual ASR benchmarks, all within a single unified architecture.

[130] The Generation Phases of Flow Matching: a Denoising Perspective

Anne Gagneux, Ségolène Martin, Rémi Gribonval, Mathurin Massias

Main category: cs.CV

TL;DR: This paper investigates flow matching models from a denoising perspective, establishing formal connections between flow matching and denoisers to analyze generation quality.

Details

Motivation: Flow matching has achieved success but the factors influencing generation quality remain poorly understood, motivating a systematic analysis of the generation process.

Method: The authors adopt a denoising perspective and design a framework to empirically probe the generation process, establishing formal connections between flow matching models and denoisers.

Result: The framework enables principled perturbations (noise and drift) to influence sample generation, revealing distinct dynamical phases of the generative process and characterizing when denoisers succeed or fail.

Conclusion: The work provides new insights into the generative process stages and explains why denoiser performance matters at different phases of generation.

Abstract: Flow matching has achieved remarkable success, yet the factors influencing the quality of its generation process remain poorly understood. In this work, we adopt a denoising perspective and design a framework to empirically probe the generation process. Laying down the formal connections between flow matching models and denoisers, we provide a common ground to compare their performances on generation and denoising. This enables the design of principled and controlled perturbations to influence sample generation: noise and drift. This leads to new insights on the distinct dynamical phases of the generative process, enabling us to precisely characterize at which stage of the generative process denoisers succeed or fail and why this matters.

[131] FruitProm: Probabilistic Maturity Estimation and Detection of Fruits and Vegetables

Sidharth Rai, Rahul Harsha Cheppally, Benjamin Vail, Keziban Yalçın Dokumacı, Ajay Sharda

Main category: cs.CV

TL;DR: The paper proposes a probabilistic approach for continuous fruit maturity estimation instead of discrete classification, using a modified RT-DETRv2 detector with a probabilistic head to predict maturity distributions and uncertainty.

Details

Motivation: Current deep learning approaches treat fruit maturity as discrete classification, which conflicts with the continuous nature of biological ripening, causing information loss and ambiguous class boundaries.

Method: Modified RT-DETRv2 object detector with a dedicated probabilistic head that predicts continuous distribution over maturity spectrum, learning mean maturity state and associated uncertainty for each detected object.

Result: Achieved 85.6% mAP on challenging fruit dataset, providing richer biological representation and more granular maturity assessments than classification-based methods.

Conclusion: Probabilistic approach enables more intelligent, uncertainty-aware automated systems for agriculture, offering better maturity assessments crucial for robotic harvesting decisions.

Abstract: Maturity estimation of fruits and vegetables is a critical task for agricultural automation, directly impacting yield prediction and robotic harvesting. Current deep learning approaches predominantly treat maturity as a discrete classification problem (e.g., unripe, ripe, overripe). This rigid formulation, however, fundamentally conflicts with the continuous nature of the biological ripening process, leading to information loss and ambiguous class boundaries. In this paper, we challenge this paradigm by reframing maturity estimation as a continuous, probabilistic learning task. We propose a novel architectural modification to the state-of-the-art, real-time object detector, RT-DETRv2, by introducing a dedicated probabilistic head. This head enables the model to predict a continuous distribution over the maturity spectrum for each detected object, simultaneously learning the mean maturity state and its associated uncertainty. This uncertainty measure is crucial for downstream decision-making in robotics, providing a confidence score for tasks like selective harvesting. Our model not only provides a far richer and more biologically plausible representation of plant maturity but also maintains exceptional detection performance, achieving a mean Average Precision (mAP) of 85.6% on a challenging, large-scale fruit dataset. We demonstrate through extensive experiments that our probabilistic approach offers more granular and accurate maturity assessments than its classification-based counterparts, paving the way for more intelligent, uncertainty-aware automated systems in modern agriculture

[132] Proper Body Landmark Subset Enables More Accurate and 5X Faster Recognition of Isolated Signs in LIBRAS

Daniele L. V. dos Santos, Thiago B. Pereira, Carlos Eduardo G. R. Alves, Richard J. M. G. Tello, Francisco de A. Boldt, Thiago M. Paixão

Main category: cs.CV

TL;DR: This paper proposes using lightweight body landmark detection with optimized landmark subsets and spline-based imputation for efficient Brazilian Sign Language recognition, achieving 5X speed improvement while maintaining accuracy.

Details

Motivation: To overcome the time performance limitations of OpenPose-based sign language recognition while maintaining accuracy, as simply replacing with lightweight MediaPipe reduced accuracy significantly.

Method: Used landmark subset selection strategies to optimize recognition performance and implemented spline-based imputation to handle missing landmarks.

Result: Achieved comparable or superior performance to state-of-the-art methods while reducing processing time by more than 5X compared to previous approaches.

Conclusion: Careful landmark selection combined with simple imputation techniques enables efficient and accurate isolated sign recognition, paving the way for scalable Sign Language Recognition systems.

Abstract: This paper investigates the feasibility of using lightweight body landmark detection for the recognition of isolated signs in Brazilian Sign Language (LIBRAS). Although the skeleton-based approach by Alves et al. (2024) enabled substantial improvements in recognition performance, the use of OpenPose for landmark extraction hindered time performance. In a preliminary investigation, we observed that simply replacing OpenPose with the lightweight MediaPipe, while improving processing speed, significantly reduced accuracy. To overcome this limitation, we explored landmark subset selection strategies aimed at optimizing recognition performance. Experimental results showed that a proper landmark subset achieves comparable or superior performance to state-of-the-art methods while reducing processing time by more than 5X compared to Alves et al. (2024). As an additional contribution, we demonstrated that spline-based imputation effectively mitigates missing landmark issues, leading to substantial accuracy gains. These findings highlight that careful landmark selection, combined with simple imputation techniques, enables efficient and accurate isolated sign recognition, paving the way for scalable Sign Language Recognition systems.

[133] Pixels to Signals: A Real-Time Framework for Traffic Demand Estimation

H Mhatre, M Vyas, A Mittal

Main category: cs.CV

TL;DR: A vehicle detection method using background subtraction and DBSCAN clustering for traffic optimization systems.

Details

Motivation: Traffic congestion in growing urban cities causes delays and inefficiencies in transportation systems.

Method: Analyze sequential camera frames to compute background by averaging pixel values, then use background subtraction and DBSCAN clustering to detect vehicles.

Result: Computationally efficient vehicle detection with minimal infrastructure modification requirements.

Conclusion: The proposed methodology offers a practical and scalable solution for real-world deployment in traffic optimization systems.

Abstract: Traffic congestion is becoming a challenge in the rapidly growing urban cities, resulting in increasing delays and inefficiencies within urban transportation systems. To address this issue a comprehensive methodology is designed to optimize traffic flow and minimize delays. The framework is structured with three primary components: (a) vehicle detection, (b) traffic prediction, and (c) traffic signal optimization. This paper presents the first component, vehicle detection. The methodology involves analyzing multiple sequential frames from a camera feed to compute the background, i.e. the underlying roadway, by averaging pixel values over time. The computed background is then utilized to extract the foreground, where the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm is applied to detect vehicles. With its computational efficiency and minimal infrastructure modification requirements, the proposed methodology offers a practical and scalable solution for real-world deployment.

[134] VividCam: Learning Unconventional Camera Motions from Virtual Synthetic Videos

Qiucheng Wu, Handong Zhao, Zhixin Shu, Jing Shi, Yang Zhang, Shiyu Chang

Main category: cs.CV

TL;DR: VividCam enables text-to-video models to learn complex camera motions from synthetic videos instead of real videos, using disentanglement strategies to isolate motion learning from synthetic artifacts.

Details

Motivation: Current text-to-video models struggle with unconventional camera motions due to insufficient training data with intended uncommon camera movements.

Method: Proposes VividCam training paradigm using synthetic videos from basic 3D geometries rendered by engines like Unity, with multiple disentanglement strategies to isolate camera motion learning from synthetic appearance artifacts.

Result: The method synthesizes a wide range of precisely controlled and complex camera motions using simple synthetic data, demonstrating robust motion representation and mitigating domain shift.

Conclusion: VividCam successfully enables diffusion models to learn complex camera motions from synthetic videos, overcoming the limitation of scarce real training videos for unconventional camera movements.

Abstract: Although recent text-to-video generative models are getting more capable of following external camera controls, imposed by either text descriptions or camera trajectories, they still struggle to generalize to unconventional camera motions, which is crucial in creating truly original and artistic videos. The challenge lies in the difficulty of finding sufficient training videos with the intended uncommon camera motions. To address this challenge, we propose VividCam, a training paradigm that enables diffusion models to learn complex camera motions from synthetic videos, releasing the reliance on collecting realistic training videos. VividCam incorporates multiple disentanglement strategies that isolates camera motion learning from synthetic appearance artifacts, ensuring more robust motion representation and mitigating domain shift. We demonstrate that our design synthesizes a wide range of precisely controlled and complex camera motions using surprisingly simple synthetic data. Notably, this synthetic data often consists of basic geometries within a low-poly 3D scene and can be efficiently rendered by engines like Unity. Our video results can be found in https://wuqiuche.github.io/VividCamDemoPage/ .

[135] Understanding Multi-View Transformers

Michal Stary, Julien Gaubil, Ayush Tewari, Vincent Sitzmann

Main category: cs.CV

TL;DR: This paper analyzes the inner workings of multi-view transformers like DUSt3R by probing their residual connections to understand how they develop 3D representations across layers, revealing differences from methods with explicit pose biases.

Details

Motivation: Multi-view transformers have become black-box systems where inner mechanisms are unclear, making improvements challenging and limiting use in safety-critical applications. Understanding their latent representations is needed.

Method: Probing and visualizing 3D representations from residual connections of multi-view transformer layers, specifically analyzing a DUSt3R variant to track latent state development across blocks.

Result: The analysis reveals how DUSt3R’s latent state develops across layers, shows the role of individual layers, and demonstrates that correspondences are refined with reconstructed geometry.

Conclusion: The approach provides insights into multi-view transformers’ inner workings, showing differences from methods with explicit pose biases and enabling better understanding of how these models develop 3D representations.

Abstract: Multi-view transformers such as DUSt3R are revolutionizing 3D vision by solving 3D tasks in a feed-forward manner. However, contrary to previous optimization-based pipelines, the inner mechanisms of multi-view transformers are unclear. Their black-box nature makes further improvements beyond data scaling challenging and complicates usage in safety- and reliability-critical applications. Here, we present an approach for probing and visualizing 3D representations from the residual connections of the multi-view transformers’ layers. In this manner, we investigate a variant of the DUSt3R model, shedding light on the development of its latent state across blocks, the role of the individual layers, and suggest how it differs from methods with stronger inductive biases of explicit global pose. Finally, we show that the investigated variant of DUSt3R estimates correspondences that are refined with reconstructed geometry. The code used for the analysis is available at https://github.com/JulienGaubil/und3rstand .

[136] Modality-Aware SAM: Sharpness-Aware-Minimization Driven Gradient Modulation for Harmonized Multimodal Learning

Hossein R. Nowdeh, Jie Ji, Xiaolong Ma, Fatemeh Afghah

Main category: cs.CV

TL;DR: M-SAM is a model-agnostic framework that addresses modality dominance in multimodal learning by identifying dominant modalities using Shapley values, modulating the loss landscape to prioritize robustness for dominant modalities, and updating weights to enhance contributions from all modalities.

Details

Motivation: In multimodal learning, dominant modalities often overshadow others, limiting generalization capabilities and preventing the model from effectively utilizing complementary features across different modalities.

Method: M-SAM operates in three steps per iteration: 1) Identifies dominant modality using Shapley values based on accuracy contributions, 2) Decomposes the loss landscape by modulating loss to prioritize robustness for the dominant modality, 3) Updates weights through backpropagation of modulated gradients.

Result: Extensive experiments on four diverse datasets show M-SAM outperforms state-of-the-art optimization and gradient manipulation methods, significantly balancing and improving multimodal learning performance.

Conclusion: M-SAM enables robust learning for dominant modalities while enhancing contributions from other modalities, allowing models to better explore and exploit complementary features across modalities for improved overall performance.

Abstract: In multimodal learning, dominant modalities often overshadow others, limiting generalization. We propose Modality-Aware Sharpness-Aware Minimization (M-SAM), a model-agnostic framework that applies to many modalities and supports early and late fusion scenarios. In every iteration, M-SAM in three steps optimizes learning. \textbf{First, it identifies the dominant modality} based on modalities’ contribution in the accuracy using Shapley. \textbf{Second, it decomposes the loss landscape}, or in another language, it modulates the loss to prioritize the robustness of the model in favor of the dominant modality, and \textbf{third, M-SAM updates the weights} by backpropagation of modulated gradients. This ensures robust learning for the dominant modality while enhancing contributions from others, allowing the model to explore and exploit complementary features that strengthen overall performance. Extensive experiments on four diverse datasets show that M-SAM outperforms the latest state-of-the-art optimization and gradient manipulation methods and significantly balances and improves multimodal learning.

[137] IBIS: A Powerful Hybrid Architecture for Human Activity Recognition

Alison M. Fernandes, Hermes I. Del Monego, Bruno S. Chang, Anelise Munaretto, Hélder M. Fontes, Rui L. Campos

Main category: cs.CV

TL;DR: Proposes IBIS, a hybrid Inception-BiLSTM with SVM architecture for Wi-Fi sensing that achieves 99% movement recognition accuracy by improving generalization and reducing overfitting.

Details

Motivation: Wi-Fi sensing offers low-cost, non-intrusive environmental monitoring but suffers from overfitting issues where models fail to generalize to new data.

Method: Developed IBIS - a hybrid architecture combining Inception-BiLSTM with Support Vector Machine (SVM) to create robust classification boundaries using Doppler-derived data.

Result: Achieved nearly 99% movement recognition accuracy with comprehensive performance metrics and confusion matrices confirming effectiveness.

Conclusion: The IBIS approach successfully addresses overfitting in Wi-Fi sensing and demonstrates high accuracy in movement recognition applications.

Abstract: The increasing interest in Wi-Fi sensing stems from its potential to capture environmental data in a low-cost, non-intrusive way, making it ideal for applications like healthcare, space occupancy analysis, and gesture-based IoT control. However, a major limitation in this field is the common problem of overfitting, where models perform well on training data but fail to generalize to new data. To overcome this, we introduce a novel hybrid architecture that integrates Inception-BiLSTM with a Support Vector Machine (SVM), which we refer to as IBIS. Our IBIS approach is uniquely engineered to improve model generalization and create more robust classification boundaries. By applying this method to Doppler-derived data, we achieve a movement recognition accuracy of nearly 99%. Comprehensive performance metrics and confusion matrices confirm the significant effectiveness of our proposed solution.

[138] FT-ARM: Fine-Tuned Agentic Reflection Multimodal Language Model for Pressure Ulcer Severity Classification with Reasoning

Reza Saadati Fard, Emmanuel Agu, Palawat Busaranuvong, Deepak Kumar, Shefalika Gautam, Bengisu Tulu, Diane Strong, Lorraine Loretz

Main category: cs.CV

TL;DR: FT-ARM is a fine-tuned multimodal LLM with self-reflection mechanism that achieves 85% accuracy in pressure ulcer severity classification, outperforming previous CNN models by 4% while providing interpretable explanations.

Details

Motivation: Pressure ulcer severity classification is challenging due to subtle visual distinctions and subjective interpretation among clinicians, leading to variability. Previous AI approaches lacked interpretability despite promising accuracy.

Method: Fine-tuned multimodal large language model (LLaMA 3.2 90B) with agentic self-reflection mechanism that iteratively refines predictions by reasoning over visual features and clinical knowledge from text, inspired by clinician diagnostic reassessment.

Result: Achieved 85% accuracy in classifying PU stages I-IV on the Pressure Injury Image Dataset (PIID), surpassing prior CNN-based models by +4%. Designed and tested for live inference with clinically grounded natural-language explanations.

Conclusion: FT-ARM advances reliability, transparency, and clinical applicability of automated wound assessment by integrating fine-tuning and reflective reasoning across multimodal inputs, addressing the need for consistent and explainable PU staging.

Abstract: Pressure ulcers (PUs) are a serious and prevalent healthcare concern. Accurate classification of PU severity (Stages I-IV) is essential for proper treatment but remains challenging due to subtle visual distinctions and subjective interpretation, leading to variability among clinicians. Prior AI-based approaches using Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) achieved promising accuracy but offered limited interpretability. We present FT-ARM (Fine-Tuned Agentic Reflection Multimodal model), a fine-tuned multimodal large language model (MLLM) with an agentic self-reflection mechanism for pressure ulcer severity classification. Inspired by clinician-style diagnostic reassessment, FT-ARM iteratively refines its predictions by reasoning over visual features and encoded clinical knowledge from text, enhancing both accuracy and consistency. On the publicly available Pressure Injury Image Dataset (PIID), FT-ARM, fine-tuned from LLaMA 3.2 90B, achieved 85% accuracy in classifying PU stages I-IV, surpassing prior CNN-based models by +4%. Unlike earlier CNN/ViT studies that relied solely on offline evaluations, FT-ARM is designed and tested for live inference, reflecting real-time deployment conditions. Furthermore, it produces clinically grounded natural-language explanations, improving interpretability and trust. By integrating fine-tuning and reflective reasoning across multimodal inputs, FT-ARM advances the reliability, transparency, and clinical applicability of automated wound assessment systems, addressing the critical need for consistent and explainable PU staging to support improved patient care.

[139] Efficient License Plate Recognition via Pseudo-Labeled Supervision with Grounding DINO and YOLOv8

Zahra Ebrahimi Vargoorani, Amir Mohammad Ghoreyshi, Ching Yee Suen

Main category: cs.CV

TL;DR: Proposes a deep learning-based ALPR system using YOLOv8 with semi-supervised learning framework that combines manual labeling and Grounding DINO-generated pseudo-labels, achieving high recall rates on multiple datasets.

Details

Motivation: ALPR systems face challenges from environmental factors, vehicle speeds, camera angles, and low-quality images, but are vital for traffic control, parking, vehicle tracking, toll collection, and law enforcement applications.

Method: Uses YOLOv8 for license plate detection and recognition with semi-supervised learning framework combining manually labeled data and pseudo-labels generated by Grounding DINO to automatically annotate images with bounding boxes.

Result: Achieved 94% recall on CENPARMI dataset and 91% recall on UFPR-ALPR dataset, with character error rates reported for both datasets.

Conclusion: The semi-supervised approach with Grounding DINO significantly reduces manual labeling effort while maintaining label quality, enhancing training process and overall model performance for ALPR systems.

Abstract: Developing a highly accurate automatic license plate recognition system (ALPR) is challenging due to environmental factors such as lighting, rain, and dust. Additional difficulties include high vehicle speeds, varying camera angles, and low-quality or low-resolution images. ALPR is vital in traffic control, parking, vehicle tracking, toll collection, and law enforcement applications. This paper proposes a deep learning strategy using YOLOv8 for license plate detection and recognition tasks. This method seeks to enhance the performance of the model using datasets from Ontario, Quebec, California, and New York State. It achieved an impressive recall rate of 94% on the dataset from the Center for Pattern Recognition and Machine Intelligence (CENPARMI) and 91% on the UFPR-ALPR dataset. In addition, our method follows a semi-supervised learning framework, combining a small set of manually labeled data with pseudo-labels generated by Grounding DINO to train our detection model. Grounding DINO, a powerful vision-language model, automatically annotates many images with bounding boxes for license plates, thereby minimizing the reliance on labor-intensive manual labeling. By integrating human-verified and model-generated annotations, we can scale our dataset efficiently while maintaining label quality, which significantly enhances the training process and overall model performance. Furthermore, it reports character error rates for both datasets, providing additional insight into system performance.

[140] Breast Cancer VLMs: Clinically Practical Vision-Language Train-Inference Models

Shunjie-Fabian Zheng, Hyeonjun Lee, Thijs Kooi, Ali Diba

Main category: cs.CV

TL;DR: A novel multi-modal framework combining 2D mammogram visual features with clinical metadata and synthesized radiological reports through tokenization modules, achieving superior breast cancer detection performance compared to unimodal approaches.

Details

Motivation: Existing CAD systems have limitations in handling multi-modal data interpretation and require prior clinical history, making clinical deployment challenging. The study aims to develop a more practical and effective breast cancer detection system.

Method: Strategic integration of convolutional neural networks with language representations using tokenization modules to combine visual features from mammograms with structured textual descriptors from clinical metadata and synthesized reports.

Result: The multi-modal approach achieves superior performance in cancer detection and calcification identification compared to unimodal baselines, with particular improvements across diverse populations in multi-national cohort screening mammograms.

Conclusion: The proposed method establishes a new paradigm for clinically viable VLM-based CAD systems that effectively leverage both imaging data and contextual patient information through fusion mechanisms.

Abstract: Breast cancer remains the most commonly diagnosed malignancy among women in the developed world. Early detection through mammography screening plays a pivotal role in reducing mortality rates. While computer-aided diagnosis (CAD) systems have shown promise in assisting radiologists, existing approaches face critical limitations in clinical deployment - particularly in handling the nuanced interpretation of multi-modal data and feasibility due to the requirement of prior clinical history. This study introduces a novel framework that synergistically combines visual features from 2D mammograms with structured textual descriptors derived from easily accessible clinical metadata and synthesized radiological reports through innovative tokenization modules. Our proposed methods in this study demonstrate that strategic integration of convolutional neural networks (ConvNets) with language representations achieves superior performance to vision transformer-based models while handling high-resolution images and enabling practical deployment across diverse populations. By evaluating it on multi-national cohort screening mammograms, our multi-modal approach achieves superior performance in cancer detection and calcification identification compared to unimodal baselines, with particular improvements. The proposed method establishes a new paradigm for developing clinically viable VLM-based CAD systems that effectively leverage imaging data and contextual patient information through effective fusion mechanisms.

[141] Auto3DSeg for Brain Tumor Segmentation from 3D MRI in BraTS 2023 Challenge

Andriy Myronenko, Dong Yang, Yufan He, Daguang Xu

Main category: cs.CV

TL;DR: The authors used Auto3DSeg from MONAI to achieve top results in the BraTS 2023 challenges, winning 1st place in three segmentation tasks and 2nd place in two others.

Details

Motivation: To participate in and achieve competitive results across all 5 segmentation challenges in the BraTS 2023 cluster using automated segmentation methods.

Method: Used Auto3DSeg from MONAI framework for automated 3D medical image segmentation.

Result: Achieved 1st place in Brain Metastasis, Brain Meningioma, and BraTS-Africa challenges; 2nd place in Adult and Pediatric Glioma challenges.

Conclusion: Auto3DSeg from MONAI proved highly effective for medical image segmentation, demonstrating state-of-the-art performance across multiple brain tumor segmentation tasks.

Abstract: In this work, we describe our solution to the BraTS 2023 cluster of challenges using Auto3DSeg from MONAI. We participated in all 5 segmentation challenges, and achieved the 1st place results in three of them: Brain Metastasis, Brain Meningioma, BraTS-Africa challenges, and the 2nd place results in the remaining two: Adult and Pediatic Glioma challenges.

[142] DRIP: Dynamic patch Reduction via Interpretable Pooling

Yusen Peng, Sachin Kumar

Main category: cs.CV

TL;DR: DRIP is a method that dynamically reduces visual tokens in deeper layers of vision-language models to improve efficiency while maintaining performance.

Details

Motivation: Vision-language models require expensive large-scale pretraining, creating efficiency concerns that discourage researchers from training models from scratch.

Method: Dynamic patch Reduction via Interpretable Pooling (DRIP) adapts to input images and dynamically merges tokens in deeper layers of visual encoders.

Result: Significant GFLOP reduction while maintaining comparable classification/zero-shot performance on ImageNet training and CLIP contrastive pretraining.

Conclusion: DRIP enables efficient vision-language model training and has been validated on scientific domains through continual pretraining on large biology datasets.

Abstract: Recently, the advances in vision-language models, including contrastive pretraining and instruction tuning, have greatly pushed the frontier of multimodal AI. However, owing to the large-scale and hence expensive pretraining, the efficiency concern has discouraged researchers from attempting to pretrain a vision language model from scratch. In this work, we propose Dynamic patch Reduction via Interpretable Pooling (DRIP), which adapts to the input images and dynamically merges tokens in the deeper layers of a visual encoder. Our results on both ImageNet training from scratch and CLIP contrastive pretraining demonstrate a significant GFLOP reduction while maintaining comparable classification/zero-shot performance. To further validate our proposed method, we conduct continual pretraining on a large biology dataset, extending its impact into scientific domains.

[143] Vision-Language Integration for Zero-Shot Scene Understanding in Real-World Environments

Manjunath Prasad Holenarasipura Rajiv, B. M. Vidyavathi

Main category: cs.CV

TL;DR: A vision-language integration framework that combines pre-trained visual encoders and large language models for zero-shot scene understanding, achieving significant improvements in object recognition, activity detection, and scene captioning.

Details

Motivation: To address the challenges of zero-shot scene understanding in complex real-world settings where models must recognize new objects, actions, and contexts without prior labeled examples.

Method: Develops a unified model that embeds visual inputs and textual prompts into a shared space, followed by multimodal fusion and reasoning layers for contextual interpretation, leveraging CLIP/ViT visual encoders and GPT-based language models.

Result: Achieves up to 18% improvement in top-1 accuracy and notable gains in semantic coherence metrics on Visual Genome, COCO, ADE20K, and custom real-world datasets, outperforming state-of-the-art zero-shot models.

Conclusion: The framework demonstrates the effectiveness of cross-modal alignment and language grounding in enhancing generalization for real-world scene understanding through vision-language integration.

Abstract: Zero-shot scene understanding in real-world settings presents major challenges due to the complexity and variability of natural scenes, where models must recognize new objects, actions, and contexts without prior labeled examples. This work proposes a vision-language integration framework that unifies pre-trained visual encoders (e.g., CLIP, ViT) and large language models (e.g., GPT-based architectures) to achieve semantic alignment between visual and textual modalities. The goal is to enable robust zero-shot comprehension of scenes by leveraging natural language as a bridge to generalize over unseen categories and contexts. Our approach develops a unified model that embeds visual inputs and textual prompts into a shared space, followed by multimodal fusion and reasoning layers for contextual interpretation. Experiments on Visual Genome, COCO, ADE20K, and custom real-world datasets demonstrate significant gains over state-of-the-art zero-shot models in object recognition, activity detection, and scene captioning. The proposed system achieves up to 18% improvement in top-1 accuracy and notable gains in semantic coherence metrics, highlighting the effectiveness of cross-modal alignment and language grounding in enhancing generalization for real-world scene understanding.

[144] PSTF-AttControl: Per-Subject-Tuning-Free Personalized Image Generation with Controllable Face Attributes

Xiang liu, Zhaoxiang Liu, Huan Hu, Zipeng Wang, Ping Chen, Zezhou Chen, Kai Wang, Shiguo Lian

Main category: cs.CV

TL;DR: A novel per-subject-tuning-free (PSTF) method for personalized image generation that enables precise facial attribute control while preserving facial identity, using face recognition features mapped to StyleGAN2’s latent space with a Triplet-Decoupled Cross-Attention module.

Details

Motivation: Existing methods struggle to achieve precise facial attribute control in a PSTF way - tuning-based methods require technical expertise and additional data, while PSTF approaches lack precise attribute control despite being more accessible.

Method: Uses face recognition model to extract facial identity features, maps them to StyleGAN2’s W+ latent space using e4e encoder, and employs Triplet-Decoupled Cross-Attention module to integrate facial identity, attribute features, and text embeddings into UNet architecture for clean separation.

Result: The method successfully generates personalized images with fine-grained control over facial attributes while preserving facial identity, without requiring additional fine-tuning or training data for individual identities.

Conclusion: The approach balances personalization with precise facial attribute control, offering an efficient and user-friendly solution for high-quality, adaptable facial image synthesis.

Abstract: Recent advancements in personalized image generation have significantly improved facial identity preservation, particularly in fields such as entertainment and social media. However, existing methods still struggle to achieve precise control over facial attributes in a per-subject-tuning-free (PSTF) way. Tuning-based techniques like PreciseControl have shown promise by providing fine-grained control over facial features, but they often require extensive technical expertise and additional training data, limiting their accessibility. In contrast, PSTF approaches simplify the process by enabling image generation from a single facial input, but they lack precise control over facial attributes. In this paper, we introduce a novel, PSTF method that enables both precise control over facial attributes and high-fidelity preservation of facial identity. Our approach utilizes a face recognition model to extract facial identity features, which are then mapped into the $W^+$ latent space of StyleGAN2 using the e4e encoder. We further enhance the model with a Triplet-Decoupled Cross-Attention module, which integrates facial identity, attribute features, and text embeddings into the UNet architecture, ensuring clean separation of identity and attribute information. Trained on the FFHQ dataset, our method allows for the generation of personalized images with fine-grained control over facial attributes, while without requiring additional fine-tuning or training data for individual identities. We demonstrate that our approach successfully balances personalization with precise facial attribute control, offering a more efficient and user-friendly solution for high-quality, adaptable facial image synthesis. The code is publicly available at https://github.com/UnicomAI/PSTF-AttControl.

[145] Visual Diversity and Region-aware Prompt Learning for Zero-shot HOI Detection

Chanhyeong Yang, Taehoon Song, Jihwan Park, Hyunwoo J. Kim

Main category: cs.CV

TL;DR: VDRP is a framework for zero-shot human-object interaction detection that addresses visual complexity through diversity-aware prompt learning and region-specific concept retrieval, achieving state-of-the-art performance on HICO-DET benchmark.

Details

Motivation: Existing approaches fail to handle visual complexity in human-object interactions, particularly intra-class visual diversity (same verb appearing in different poses/contexts) and inter-class visual entanglement (different verbs having similar visual patterns).

Method: Proposes VDRP with two key components: (1) visual diversity-aware prompt learning that injects group-wise visual variance and uses Gaussian perturbation to capture diverse visual variations, and (2) region-specific concept retrieval from human, object, and union regions to create region-aware prompts that enhance verb-level discrimination.

Result: Achieves state-of-the-art performance on HICO-DET benchmark under four zero-shot evaluation settings, effectively addressing both intra-class diversity and inter-class visual entanglement.

Conclusion: VDRP successfully handles visual complexity in zero-shot human-object interaction detection through diversity-aware prompt learning and region-specific concept augmentation, demonstrating superior performance over existing methods.

Abstract: Zero-shot Human-Object Interaction detection aims to localize humans and objects in an image and recognize their interaction, even when specific verb-object pairs are unseen during training. Recent works have shown promising results using prompt learning with pretrained vision-language models such as CLIP, which align natural language prompts with visual features in a shared embedding space. However, existing approaches still fail to handle the visual complexity of interaction, including (1) intra-class visual diversity, where instances of the same verb appear in diverse poses and contexts, and (2) inter-class visual entanglement, where distinct verbs yield visually similar patterns. To address these challenges, we propose VDRP, a framework for Visual Diversity and Region-aware Prompt learning. First, we introduce a visual diversity-aware prompt learning strategy that injects group-wise visual variance into the context embedding. We further apply Gaussian perturbation to encourage the prompts to capture diverse visual variations of a verb. Second, we retrieve region-specific concepts from the human, object, and union regions. These are used to augment the diversity-aware prompt embeddings, yielding region-aware prompts that enhance verb-level discrimination. Experiments on the HICO-DET benchmark demonstrate that our method achieves state-of-the-art performance under four zero-shot evaluation settings, effectively addressing both intra-class diversity and inter-class visual entanglement. Code is available at https://github.com/mlvlab/VDRP.

[146] AtlasGS: Atlanta-world Guided Surface Reconstruction with Implicit Structured Gaussians

Xiyu Zhang, Chong Bao, Yipeng Chen, Hongjia Zhai, Yitong Dong, Hujun Bao, Zhaopeng Cui, Guofeng Zhang

Main category: cs.CV

TL;DR: Proposes Atlanta-world guided implicit-structured Gaussian Splatting for smooth 3D reconstruction of indoor and urban scenes while preserving details and efficiency.

Details

Motivation: Existing geometric priors lack global consistency for low-texture regions, and current methods like Gaussian Splatting and implicit SDF fields suffer from discontinuities or computational inefficiencies, leading to loss of detail.

Method: Uses Atlanta-world model for accurate surface reconstruction in low-texture regions, with novel implicit-structured Gaussian Splatting representations including semantic GS representation and structure plane regularization with learnable plane indicators.

Result: Outperforms state-of-the-art approaches in both indoor and urban scenes, delivering superior surface reconstruction quality.

Conclusion: The proposed method achieves smooth reconstruction while preserving high-frequency details and rendering efficiency, addressing key limitations in existing 3D reconstruction approaches.

Abstract: 3D reconstruction of indoor and urban environments is a prominent research topic with various downstream applications. However, existing geometric priors for addressing low-texture regions in indoor and urban settings often lack global consistency. Moreover, Gaussian Splatting and implicit SDF fields often suffer from discontinuities or exhibit computational inefficiencies, resulting in a loss of detail. To address these issues, we propose an Atlanta-world guided implicit-structured Gaussian Splatting that achieves smooth indoor and urban scene reconstruction while preserving high-frequency details and rendering efficiency. By leveraging the Atlanta-world model, we ensure the accurate surface reconstruction for low-texture regions, while the proposed novel implicit-structured GS representations provide smoothness without sacrificing efficiency and high-frequency details. Specifically, we propose a semantic GS representation to predict the probability of all semantic regions and deploy a structure plane regularization with learnable plane indicators for global accurate surface reconstruction. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches in both indoor and urban scenes, delivering superior surface reconstruction quality.

[147] Region-CAM: Towards Accurate Object Regions in Class Activation Maps for Weakly Supervised Learning Tasks

Qingdong Cai, Charith Abhayaratne

Main category: cs.CV

TL;DR: Region-CAM is a novel Class Activation Mapping method that generates more complete object coverage and precise boundaries by extracting semantic information maps and performing semantic information propagation using both gradients and features.

Details

Motivation: Conventional CAM methods only highlight the most discriminative regions, failing to cover entire objects and having poor boundary alignment, which limits performance in weakly supervised learning tasks like semantic segmentation.

Method: Extracts semantic information maps (SIMs) and performs semantic information propagation (SIP) using both gradients and features from all stages of the baseline classification model, rather than just network feature weighting.

Result: Achieves 60.12% mIoU on PASCAL VOC training (13.61% improvement over CAM), 58.43% on validation (13.13% improvement), 36.38% on MS COCO (16.23% improvement), and 51.7% Top-1 localization accuracy on ILSVRC2012 (4.5% better than LayerCAM).

Conclusion: Region-CAM significantly outperforms conventional CAM methods by providing more complete object coverage and precise boundary alignment, making it highly effective for weakly supervised learning tasks.

Abstract: Class Activation Mapping (CAM) methods are widely applied in weakly supervised learning tasks due to their ability to highlight object regions. However, conventional CAM methods highlight only the most discriminative regions of the target. These highlighted regions often fail to cover the entire object and are frequently misaligned with object boundaries, thereby limiting the performance of downstream weakly supervised learning tasks, particularly Weakly Supervised Semantic Segmentation (WSSS), which demands pixel-wise accurate activation maps to get the best results. To alleviate the above problems, we propose a novel activation method, Region-CAM. Distinct from network feature weighting approaches, Region-CAM generates activation maps by extracting semantic information maps (SIMs) and performing semantic information propagation (SIP) by considering both gradients and features in each of the stages of the baseline classification model. Our approach highlights a greater proportion of object regions while ensuring activation maps to have precise boundaries that align closely with object edges. Region-CAM achieves 60.12% and 58.43% mean intersection over union (mIoU) using the baseline model on the PASCAL VOC training and validation datasets, respectively, which are improvements of 13.61% and 13.13% over the original CAM (46.51% and 45.30%). On the MS COCO validation set, Region-CAM achieves 36.38%, a 16.23% improvement over the original CAM (20.15%). We also demonstrate the superiority of Region-CAM in object localization tasks, using the ILSVRC2012 validation set. Region-CAM achieves 51.7% in Top-1 Localization accuracy Loc1. Compared with LayerCAM, an activation method designed for weakly supervised object localization, Region-CAM achieves 4.5% better performance in Loc1.

[148] DINO-YOLO: Self-Supervised Pre-training for Data-Efficient Object Detection in Civil Engineering Applications

Malaisree P, Youwai S, Kitkobsin T, Janrungautai S, Amorndechaphon D, Rojanavasu P

Main category: cs.CV

TL;DR: DINO-YOLO combines YOLOv12 with DINOv3 self-supervised vision transformers for data-efficient object detection in civil engineering, achieving significant performance improvements while maintaining real-time inference.

Details

Motivation: Object detection in civil engineering applications faces challenges due to limited annotated data in specialized domains, requiring data-efficient solutions.

Method: Hybrid architecture integrating DINOv3 features at input preprocessing (P0) and mid-backbone enhancement (P3) with YOLOv12, with systematic ablation across five YOLO scales and nine DINOv3 variants.

Result: Substantial improvements: Tunnel Segment Crack detection (12.4% improvement), Construction PPE (13.7% gain), KITTI (88.6% improvement) with 30-47 FPS real-time inference. Medium-scale architectures achieve optimal performance with DualP0P3 integration (55.77% mAP@0.5).

Conclusion: DINO-YOLO establishes state-of-the-art performance for civil engineering datasets with limited data while preserving computational efficiency, providing practical solutions for construction safety monitoring and infrastructure inspection.

Abstract: Object detection in civil engineering applications is constrained by limited annotated data in specialized domains. We introduce DINO-YOLO, a hybrid architecture combining YOLOv12 with DINOv3 self-supervised vision transformers for data-efficient detection. DINOv3 features are strategically integrated at two locations: input preprocessing (P0) and mid-backbone enhancement (P3). Experimental validation demonstrates substantial improvements: Tunnel Segment Crack detection (648 images) achieves 12.4% improvement, Construction PPE (1K images) gains 13.7%, and KITTI (7K images) shows 88.6% improvement, while maintaining real-time inference (30-47 FPS). Systematic ablation across five YOLO scales and nine DINOv3 variants reveals that Medium-scale architectures achieve optimal performance with DualP0P3 integration (55.77% mAP@0.5), while Small-scale requires Triple Integration (53.63%). The 2-4x inference overhead (21-33ms versus 8-16ms baseline) remains acceptable for field deployment on NVIDIA RTX 5090. DINO-YOLO establishes state-of-the-art performance for civil engineering datasets (<10K images) while preserving computational efficiency, providing practical solutions for construction safety monitoring and infrastructure inspection in data-constrained environments.

[149] Revisiting Reconstruction-based AI-generated Image Detection: A Geometric Perspective

Wan Jiang, Jing Yan, Ruixuan Zhang, Xiaojing Chen, Changtao Miao, Zhe Li, Chenhao Lin, Yunfeng Diao, Richang Hong

Main category: cs.CV

TL;DR: The paper introduces ReGap, a training-free method for detecting AI-generated images by computing dynamic reconstruction error through controlled editing perturbations, outperforming existing reconstruction-based approaches.

Details

Motivation: Existing reconstruction-based methods for detecting AI-generated images lack theoretical foundations, rely on empirical heuristics, and suffer from limited interpretability and reliability. They often fail when real images have lower reconstruction error than generated ones, requiring data-specific threshold tuning.

Method: Proposes ReGap method that computes dynamic reconstruction error by leveraging structured editing operations to introduce controlled perturbations, measuring error changes before and after editing to enhance error separation between real and generated images.

Result: Experimental results show ReGap outperforms existing baselines, exhibits robustness to common post-processing operations, and generalizes effectively across diverse conditions.

Conclusion: The proposed ReGap method provides a more reliable and interpretable approach for AI-generated image detection by addressing limitations of static reconstruction error methods through dynamic error measurement via controlled perturbations.

Abstract: The rise of generative Artificial Intelligence (AI) has made detecting AI-generated images a critical challenge for ensuring authenticity. Existing reconstruction-based methods lack theoretical foundations and on empirical heuristics, limiting interpretability and reliability. In this paper, we introduce the Jacobian-Spectral Lower Bound for reconstruction error from a geometric perspective, showing that real images off the reconstruction manifold exhibit a non-trivial error lower bound, while generated images on the manifold have near-zero error. Furthermore, we reveal the limitations of existing methods that rely on static reconstruction error from a single pass. These methods often fail when some real images exhibit lower error than generated ones. This counterintuitive behavior reduces detection accuracy and requires data-specific threshold tuning, limiting their applicability in real-world scenarios. To address these challenges, we propose ReGap, a training-free method that computes dynamic reconstruction error by leveraging structured editing operations to introduce controlled perturbations. This enables measuring error changes before and after editing, improving detection accuracy by enhancing error separation. Experimental results show that our method outperforms existing baselines, exhibits robustness to common post-processing operations and generalizes effectively across diverse conditions.

[150] EA3D: Online Open-World 3D Object Extraction from Streaming Videos

Xiaoyu Zhou, Jingqi Wang, Yuang Jia, Yongtao Wang, Deqing Sun, Ming-Hsuan Yang

Main category: cs.CV

TL;DR: EA3D is a unified online framework for open-world 3D object extraction that enables simultaneous geometric reconstruction and holistic scene understanding from streaming video, using vision-language models and Gaussian feature maps.

Details

Motivation: Current 3D scene understanding methods are limited by offline-collected multi-view data or pre-constructed 3D geometry, requiring a more dynamic and online approach.

Method: Uses vision-language and 2D vision foundation encoders to extract object-level knowledge from streaming video, integrates knowledge into Gaussian feature maps via online update strategy, estimates visual odometry, and employs recurrent joint optimization for attention to regions of interest.

Result: Demonstrates effectiveness across diverse benchmarks and tasks including photo-realistic rendering, semantic/instance segmentation, 3D bounding box/semantic occupancy estimation, and 3D mesh generation.

Conclusion: Establishes a unified and efficient framework for joint online 3D reconstruction and holistic scene understanding, enabling a broad range of downstream tasks.

Abstract: Current 3D scene understanding methods are limited by offline-collected multi-view data or pre-constructed 3D geometry. In this paper, we present ExtractAnything3D (EA3D), a unified online framework for open-world 3D object extraction that enables simultaneous geometric reconstruction and holistic scene understanding. Given a streaming video, EA3D dynamically interprets each frame using vision-language and 2D vision foundation encoders to extract object-level knowledge. This knowledge is integrated and embedded into a Gaussian feature map via a feed-forward online update strategy. We then iteratively estimate visual odometry from historical frames and incrementally update online Gaussian features with new observations. A recurrent joint optimization module directs the model’s attention to regions of interest, simultaneously enhancing both geometric reconstruction and semantic understanding. Extensive experiments across diverse benchmarks and tasks, including photo-realistic rendering, semantic and instance segmentation, 3D bounding box and semantic occupancy estimation, and 3D mesh generation, demonstrate the effectiveness of EA3D. Our method establishes a unified and efficient framework for joint online 3D reconstruction and holistic scene understanding, enabling a broad range of downstream tasks.

[151] Towards Real-Time Inference of Thin Liquid Film Thickness Profiles from Interference Patterns Using Vision Transformers

Gautam A. Viruthagiri, Arnuv Tandon, Gerald G. Fuller, Vinny Chandran Suja

Main category: cs.CV

TL;DR: A vision transformer-based approach for real-time inference of thin liquid film thickness profiles directly from isolated interferograms, overcoming limitations of traditional reconstruction methods.

Details

Motivation: Clinical translation of thin film interferometry is hindered by challenges in reconstructing thickness profiles from interference patterns - an ill-posed inverse problem complicated by phase periodicity, imaging noise and ambient artifacts. Traditional methods are computationally intensive, sensitive to noise, or require manual expert analysis.

Method: Vision transformer-based approach trained on hybrid dataset combining physiologically-relevant synthetic and experimental tear film data, leveraging long-range spatial correlations to resolve phase ambiguities and reconstruct temporally coherent thickness profiles in a single forward pass.

Result: The network demonstrates state-of-the-art performance on noisy, rapidly-evolving films with motion artifacts, overcoming limitations of conventional phase-unwrapping and iterative fitting methods.

Conclusion: This data-driven approach enables automated, consistent thickness reconstruction at real-time speeds on consumer hardware, opening new possibilities for continuous monitoring of pre-lens ocular tear films and non-invasive diagnosis of conditions such as dry eye disease.

Abstract: Thin film interferometry is a powerful technique for non-invasively measuring liquid film thickness with applications in ophthalmology, but its clinical translation is hindered by the challenges in reconstructing thickness profiles from interference patterns - an ill-posed inverse problem complicated by phase periodicity, imaging noise and ambient artifacts. Traditional reconstruction methods are either computationally intensive, sensitive to noise, or require manual expert analysis, which is impractical for real-time diagnostics. To address this challenge, here we present a vision transformer-based approach for real-time inference of thin liquid film thickness profiles directly from isolated interferograms. Trained on a hybrid dataset combining physiologically-relevant synthetic and experimental tear film data, our model leverages long-range spatial correlations to resolve phase ambiguities and reconstruct temporally coherent thickness profiles in a single forward pass from dynamic interferograms acquired in vivo and ex vivo. The network demonstrates state-of-the-art performance on noisy, rapidly-evolving films with motion artifacts, overcoming limitations of conventional phase-unwrapping and iterative fitting methods. Our data-driven approach enables automated, consistent thickness reconstruction at real-time speeds on consumer hardware, opening new possibilities for continuous monitoring of pre-lens ocular tear films and non-invasive diagnosis of conditions such as the dry eye disease.

[152] Target-Guided Bayesian Flow Networks for Quantitatively Constrained CAD Generation

Wenhao Zheng, Chenwei Sun, Wenbo Zhang, Jiancheng Lv, Xianggen Liu

Main category: cs.CV

TL;DR: TGBFN is a novel framework for quantitatively constrained CAD generation that handles multi-modal CAD sequences in a unified continuous parameter space using guided Bayesian flow.

Details

Motivation: Current generative models lag in multi-modal data generation like parametric CAD sequences due to challenges with long-range constraints and parameter sensitivity.

Method: TGBFN handles discrete commands and continuous parameters in a unified continuous differentiable space, introduces guided Bayesian flow to control CAD properties, and constructs a new dataset for evaluation.

Result: Extensive comparisons show TGBFN achieves state-of-the-art performance in generating high-fidelity, condition-aware CAD sequences across single-condition and multi-condition tasks.

Conclusion: TGBFN successfully addresses the challenges of multi-modal CAD generation and demonstrates superior performance in constrained generation tasks.

Abstract: Deep generative models, such as diffusion models, have shown promising progress in image generation and audio generation via simplified continuity assumptions. However, the development of generative modeling techniques for generating multi-modal data, such as parametric CAD sequences, still lags behind due to the challenges in addressing long-range constraints and parameter sensitivity. In this work, we propose a novel framework for quantitatively constrained CAD generation, termed Target-Guided Bayesian Flow Network (TGBFN). For the first time, TGBFN handles the multi-modality of CAD sequences (i.e., discrete commands and continuous parameters) in a unified continuous and differentiable parameter space rather than in the discrete data space. In addition, TGBFN penetrates the parameter update kernel and introduces a guided Bayesian flow to control the CAD properties. To evaluate TGBFN, we construct a new dataset for quantitatively constrained CAD generation. Extensive comparisons across single-condition and multi-condition constrained generation tasks demonstrate that TGBFN achieves state-of-the-art performance in generating high-fidelity, condition-aware CAD sequences. The code is available at https://github.com/scu-zwh/TGBFN.

[153] A Study on Inference Latency for Vision Transformers on Mobile Devices

Zhuojin Li, Marco Paolieri, Leana Golubchik

Main category: cs.CV

TL;DR: This paper quantitatively studies the performance of 190 real-world vision transformers on mobile devices, comparing them with 102 CNNs to understand latency factors and developing a dataset of 1000 synthetic ViTs for latency prediction.

Details

Motivation: Given the significant advances in machine learning on mobile devices, particularly in computer vision, there is a need to understand the performance characteristics of vision transformers on mobile platforms and identify factors influencing their latency.

Method: The study compares 190 real-world ViTs with 102 CNNs, analyzes latency factors, and creates a dataset of 1000 synthetic ViTs with representative building blocks and state-of-the-art architectures across two ML frameworks and six mobile platforms.

Result: The research provides insights into factors influencing ViT latency on mobile devices and demonstrates that inference latency of new ViTs can be predicted with sufficient accuracy for real-world applications using the developed dataset.

Conclusion: The study successfully identifies key factors affecting ViT performance on mobile devices and establishes a reliable method for predicting ViT inference latency, enabling better optimization and deployment of vision transformers on mobile platforms.

Abstract: Given the significant advances in machine learning techniques on mobile devices, particularly in the domain of computer vision, in this work we quantitatively study the performance characteristics of 190 real-world vision transformers (ViTs) on mobile devices. Through a comparison with 102 real-world convolutional neural networks (CNNs), we provide insights into the factors that influence the latency of ViT architectures on mobile devices. Based on these insights, we develop a dataset including measured latencies of 1000 synthetic ViTs with representative building blocks and state-of-the-art architectures from two machine learning frameworks and six mobile platforms. Using this dataset, we show that inference latency of new ViTs can be predicted with sufficient accuracy for real-world applications.

[154] $D^2GS$: Dense Depth Regularization for LiDAR-free Urban Scene Reconstruction

Kejing Xia, Jidong Jia, Ke Jin, Yucai Bai, Li Sun, Dacheng Tao, Youjian Zhang

Main category: cs.CV

TL;DR: D²GS is a LiDAR-free urban scene reconstruction framework that uses multi-view depth predictions and diffusion priors to achieve geometry quality comparable to LiDAR-based methods without requiring LiDAR sensors.

Details

Motivation: Current urban scene reconstruction methods rely on multimodal sensors (LiDAR + images), which face challenges with spatiotemporal calibration and reprojection errors. The authors aim to avoid the difficulty of acquiring accurate LiDAR depth while achieving comparable or better reconstruction quality.

Method: 1) Initialize dense point cloud from multi-view metric depth predictions with Progressive Pruning for global consistency; 2) Jointly refine Gaussian geometry and depth via Depth Enhancer using diffusion priors from depth foundation models; 3) Improve ground geometry by constraining Gaussian shape and normal attributes in road regions.

Result: Extensive experiments on Waymo dataset show the method consistently outperforms state-of-the-art methods, producing more accurate geometry even compared to methods using ground-truth LiDAR data.

Conclusion: D²GS successfully demonstrates that LiDAR-free urban scene reconstruction can achieve superior geometry quality by leveraging multi-view depth predictions and diffusion priors, eliminating the need for complex multimodal sensor setups.

Abstract: Recently, Gaussian Splatting (GS) has shown great potential for urban scene reconstruction in the field of autonomous driving. However, current urban scene reconstruction methods often depend on multimodal sensors as inputs, \textit{i.e.} LiDAR and images. Though the geometry prior provided by LiDAR point clouds can largely mitigate ill-posedness in reconstruction, acquiring such accurate LiDAR data is still challenging in practice: i) precise spatiotemporal calibration between LiDAR and other sensors is required, as they may not capture data simultaneously; ii) reprojection errors arise from spatial misalignment when LiDAR and cameras are mounted at different locations. To avoid the difficulty of acquiring accurate LiDAR depth, we propose $D^2GS$, a LiDAR-free urban scene reconstruction framework. In this work, we obtain geometry priors that are as effective as LiDAR while being denser and more accurate. $\textbf{First}$, we initialize a dense point cloud by back-projecting multi-view metric depth predictions. This point cloud is then optimized by a Progressive Pruning strategy to improve the global consistency. $\textbf{Second}$, we jointly refine Gaussian geometry and predicted dense metric depth via a Depth Enhancer. Specifically, we leverage diffusion priors from a depth foundation model to enhance the depth maps rendered by Gaussians. In turn, the enhanced depths provide stronger geometric constraints during Gaussian training. $\textbf{Finally}$, we improve the accuracy of ground geometry by constraining the shape and normal attributes of Gaussians within road regions. Extensive experiments on the Waymo dataset demonstrate that our method consistently outperforms state-of-the-art methods, producing more accurate geometry even when compared with those using ground-truth LiDAR data.

[155] Classifier Enhancement Using Extended Context and Domain Experts for Semantic Segmentation

Huadong Tang, Youpeng Zhao, Min Xu, Jun Wang, Qiang Wu

Main category: cs.CV

TL;DR: Proposes an Extended Context-Aware Classifier (ECAC) that dynamically adjusts semantic segmentation classifiers using both global dataset-level and local image-level contextual information to address class imbalance and improve pixel labeling accuracy.

Details

Motivation: Traditional semantic segmentation classifiers use fixed parameters that don't account for individual image characteristics and suffer from class imbalance, leading to biased results favoring majority classes.

Method: Uses a memory bank to learn dataset-level contextual information and incorporates image-specific context. Employs teacher-student network paradigm where teacher dynamically adjusts context with ground truth and transfers knowledge to student.

Result: Achieves state-of-the-art performance on ADE20K, COCO-Stuff10K, and Pascal-Context datasets.

Conclusion: Dynamic classifier adjustment using both global and local contextual information effectively addresses class imbalance and improves semantic segmentation accuracy.

Abstract: Prevalent semantic segmentation methods generally adopt a vanilla classifier to categorize each pixel into specific classes. Although such a classifier learns global information from the training data, this information is represented by a set of fixed parameters (weights and biases). However, each image has a different class distribution, which prevents the classifier from addressing the unique characteristics of individual images. At the dataset level, class imbalance leads to segmentation results being biased towards majority classes, limiting the model’s effectiveness in identifying and segmenting minority class regions. In this paper, we propose an Extended Context-Aware Classifier (ECAC) that dynamically adjusts the classifier using global (dataset-level) and local (image-level) contextual information. Specifically, we leverage a memory bank to learn dataset-level contextual information of each class, incorporating the class-specific contextual information from the current image to improve the classifier for precise pixel labeling. Additionally, a teacher-student network paradigm is adopted, where the domain expert (teacher network) dynamically adjusts contextual information with ground truth and transfers knowledge to the student network. Comprehensive experiments illustrate that the proposed ECAC can achieve state-of-the-art performance across several datasets, including ADE20K, COCO-Stuff10K, and Pascal-Context.

[156] Test-Time Adaptive Object Detection with Foundation Model

Yingjie Gao, Yanan Zhang, Zhi Cai, Di Huang

Main category: cs.CV

TL;DR: First foundation model-powered test-time adaptive object detection method that eliminates source data dependency and overcomes closed-set limitations using multi-modal prompt tuning and instance dynamic memory.

Details

Motivation: Existing test-time adaptive object detection methods rely on source data and assume identical category spaces between source and target domains, which limits real-world applicability.

Method: Multi-modal Prompt-based Mean-Teacher framework with text and visual prompt tuning, Test-time Warm-start strategy for visual prompts, and Instance Dynamic Memory module with Memory Enhancement and Memory Hallucination strategies.

Result: Outperforms previous state-of-the-art methods on cross-corruption and cross-dataset benchmarks, adapting to arbitrary cross-domain and cross-category target data.

Conclusion: Proposed method successfully eliminates source data dependency and closed-set limitations while achieving superior performance in test-time adaptive object detection.

Abstract: In recent years, test-time adaptive object detection has attracted increasing attention due to its unique advantages in online domain adaptation, which aligns more closely with real-world application scenarios. However, existing approaches heavily rely on source-derived statistical characteristics while making the strong assumption that the source and target domains share an identical category space. In this paper, we propose the first foundation model-powered test-time adaptive object detection method that eliminates the need for source data entirely and overcomes traditional closed-set limitations. Specifically, we design a Multi-modal Prompt-based Mean-Teacher framework for vision-language detector-driven test-time adaptation, which incorporates text and visual prompt tuning to adapt both language and vision representation spaces on the test data in a parameter-efficient manner. Correspondingly, we propose a Test-time Warm-start strategy tailored for the visual prompts to effectively preserve the representation capability of the vision branch. Furthermore, to guarantee high-quality pseudo-labels in every test batch, we maintain an Instance Dynamic Memory (IDM) module that stores high-quality pseudo-labels from previous test samples, and propose two novel strategies-Memory Enhancement and Memory Hallucination-to leverage IDM’s high-quality instances for enhancing original predictions and hallucinating images without available pseudo-labels, respectively. Extensive experiments on cross-corruption and cross-dataset benchmarks demonstrate that our method consistently outperforms previous state-of-the-art methods, and can adapt to arbitrary cross-domain and cross-category target data. Code is available at https://github.com/gaoyingjay/ttaod_foundation.

[157] Mask-Robust Face Verification for Online Learning via YOLOv5 and Residual Networks

Zhifeng Wang, Minghui Wang, Chunyan Zeng, Jialong Yao, Yang Yang, Hongmin Xu

Main category: cs.CV

TL;DR: This paper proposes an online learning authentication system using YOLOv5 for face detection and a residual network for feature extraction, comparing Euclidean distances against student databases to verify identities.

Details

Motivation: The fusion of IT and AI in education, accelerated by COVID-19, has increased the importance of secure identity authentication in digital learning environments to ensure online education's security and stability.

Method: Uses YOLOv5 network trained on proprietary dataset to detect faces from students’ webcams, then employs residual network for deep feature extraction, followed by Euclidean distance comparison with student face databases.

Result: The system successfully identifies students by analyzing facial features extracted from webcam images and matching them against registered databases through distance comparisons.

Conclusion: The deep learning-based authentication approach enhances online education security and aligns with the evolving educational landscape, providing a robust solution for identity verification in digital learning.

Abstract: In the contemporary landscape, the fusion of information technology and the rapid advancement of artificial intelligence have ushered school education into a transformative phase characterized by digitization and heightened intelligence. Concurrently, the global paradigm shift caused by the Covid-19 pandemic has catalyzed the evolution of e-learning, accentuating its significance. Amidst these developments, one pivotal facet of the online education paradigm that warrants attention is the authentication of identities within the digital learning sphere. Within this context, our study delves into a solution for online learning authentication, utilizing an enhanced convolutional neural network architecture, specifically the residual network model. By harnessing the power of deep learning, this technological approach aims to galvanize the ongoing progress of online education, while concurrently bolstering its security and stability. Such fortification is imperative in enabling online education to seamlessly align with the swift evolution of the educational landscape. This paper’s focal proposition involves the deployment of the YOLOv5 network, meticulously trained on our proprietary dataset. This network is tasked with identifying individuals’ faces culled from images captured by students’ open online cameras. The resultant facial information is then channeled into the residual network to extract intricate features at a deeper level. Subsequently, a comparative analysis of Euclidean distances against students’ face databases is performed, effectively ascertaining the identity of each student.

[158] AI-Powered Early Detection of Critical Diseases using Image Processing and Audio Analysis

Manisha More, Kavya Bhand, Kaustubh Mukdam, Kavya Sharma, Manas Kawtikwar, Hridayansh Kaware, Prajwal Kavhar

Main category: cs.CV

TL;DR: A multimodal AI diagnostic framework for early detection of skin cancer, vascular blood clots, and cardiopulmonary abnormalities using image analysis, thermal imaging, and audio signal processing.

Details

Motivation: Early diagnosis of critical diseases improves patient survival and reduces treatment costs, but existing diagnostic techniques are costly, invasive, and inaccessible in low-resource regions.

Method: Fine-tuned MobileNetV2 CNN for skin lesion classification on ISIC 2019 dataset; SVM with handcrafted features for thermal clot detection; Random Forest with MFCC features for cardiopulmonary analysis using PhysioNet and Pascal datasets.

Result: Skin cancer detection: 89.3% accuracy, 91.6% sensitivity, 88.2% specificity; Thermal clot detection: 86.4% accuracy (AUC=0.89); Cardiopulmonary analysis: 87.2% accuracy, 85.7% sensitivity. Competitive results while remaining lightweight.

Conclusion: The framework provides a promising step toward scalable, real-time, and accessible AI-based pre-diagnostic healthcare solutions deployable on low-cost devices.

Abstract: Early diagnosis of critical diseases can significantly improve patient survival and reduce treatment costs. However, existing diagnostic techniques are often costly, invasive, and inaccessible in low-resource regions. This paper presents a multimodal artificial intelligence (AI) diagnostic framework integrating image analysis, thermal imaging, and audio signal processing for early detection of three major health conditions: skin cancer, vascular blood clots, and cardiopulmonary abnormalities. A fine-tuned MobileNetV2 convolutional neural network was trained on the ISIC 2019 dataset for skin lesion classification, achieving 89.3% accuracy, 91.6% sensitivity, and 88.2% specificity. A support vector machine (SVM) with handcrafted features was employed for thermal clot detection, achieving 86.4% accuracy (AUC = 0.89) on synthetic and clinical data. For cardiopulmonary analysis, lung and heart sound datasets from PhysioNet and Pascal were processed using Mel-Frequency Cepstral Coefficients (MFCC) and classified via Random Forest, reaching 87.2% accuracy and 85.7% sensitivity. Comparative evaluation against state-of-the-art models demonstrates that the proposed system achieves competitive results while remaining lightweight and deployable on low-cost devices. The framework provides a promising step toward scalable, real-time, and accessible AI-based pre-diagnostic healthcare solutions.

[159] U-CAN: Unsupervised Point Cloud Denoising with Consistency-Aware Noise2Noise Matching

Junsheng Zhou, Xingyu Shi, Haichuan Song, Yi Fang, Yu-Shen Liu, Zhizhong Han

Main category: cs.CV

TL;DR: U-CAN is an unsupervised framework for point cloud denoising using consistency-aware Noise2Noise matching, achieving comparable results to supervised methods without requiring clean-noisy pairs.

Details

Motivation: Point clouds from scanning sensors often contain noise that negatively impacts downstream tasks. Previous methods require extensive manual effort to collect noisy-clean pairs for supervised training.

Method: U-CAN uses a neural network to infer multi-step denoising paths with Noise2Noise matching and a novel loss for statistical reasoning on multiple noisy observations. It introduces a constraint on denoised geometry consistency.

Result: Significant improvement over state-of-the-art unsupervised methods in point cloud denoising, upsampling and image denoising. Produces comparable results with supervised methods.

Conclusion: U-CAN provides an effective unsupervised alternative to supervised denoising methods, with the consistency constraint being generalizable beyond 3D to 2D image denoising.

Abstract: Point clouds captured by scanning sensors are often perturbed by noise, which have a highly negative impact on downstream tasks (e.g. surface reconstruction and shape understanding). Previous works mostly focus on training neural networks with noisy-clean point cloud pairs for learning denoising priors, which requires extensively manual efforts. In this work, we introduce U-CAN, an Unsupervised framework for point cloud denoising with Consistency-Aware Noise2Noise matching. Specifically, we leverage a neural network to infer a multi-step denoising path for each point of a shape or scene with a noise to noise matching scheme. We achieve this by a novel loss which enables statistical reasoning on multiple noisy point cloud observations. We further introduce a novel constraint on the denoised geometry consistency for learning consistency-aware denoising patterns. We justify that the proposed constraint is a general term which is not limited to 3D domain and can also contribute to the area of 2D image denoising. Our evaluations under the widely used benchmarks in point cloud denoising, upsampling and image denoising show significant improvement over the state-of-the-art unsupervised methods, where U-CAN also produces comparable results with the supervised methods.

[160] MSF-Net: Multi-Stage Feature Extraction and Fusion for Robust Photometric Stereo

Shiyu Qin, Zhihao Cai, Kaixuan Wang, Lin Qi, Junyu Dong

Main category: cs.CV

TL;DR: MSF-Net is a novel photometric stereo framework that extracts multi-stage features with selective update strategy and feature fusion to improve surface normal estimation accuracy, outperforming previous state-of-the-art methods.

Details

Motivation: Existing learning-based photometric stereo methods fail to accurately capture features at multiple stages and lack adequate interaction between features, leading to redundant feature extraction especially in complex areas like wrinkles and edges.

Method: Proposed MSF-Net with multi-stage feature extraction, selective update strategy to extract high-quality features, and feature fusion module to improve interaction between different features.

Result: Experimental results on DiLiGenT benchmark show MSF-Net significantly surpasses previous state-of-the-art methods in surface normal estimation accuracy.

Conclusion: The proposed MSF-Net framework effectively addresses limitations in existing photometric stereo methods by enabling better multi-stage feature extraction and interaction, leading to improved normal estimation performance.

Abstract: Photometric stereo is a technique aimed at determining surface normals through the utilization of shading cues derived from images taken under different lighting conditions. However, existing learning-based approaches often fail to accurately capture features at multiple stages and do not adequately promote interaction between these features. Consequently, these models tend to extract redundant features, especially in areas with intricate details such as wrinkles and edges. To tackle these issues, we propose MSF-Net, a novel framework for extracting information at multiple stages, paired with selective update strategy, aiming to extract high-quality feature information, which is critical for accurate normal construction. Additionally, we have developed a feature fusion module to improve the interplay among different features. Experimental results on the DiLiGenT benchmark show that our proposed MSF-Net significantly surpasses previous state-of-the-art methods in the accuracy of surface normal estimation.

[161] Aligning What You Separate: Denoised Patch Mixing for Source-Free Domain Adaptation in Medical Image Segmentation

Quang-Khai Bui-Tran, Thanh-Huy Nguyen, Hoang-Thien Nguyen, Ba-Thinh Lam, Nguyen Lan Vi Vu, Phat K. Huynh, Ulas Bagci, Min Xu

Main category: cs.CV

TL;DR: A new SFDA framework using hard sample selection and denoised patch mixing to improve medical image segmentation under privacy constraints by progressively aligning target distributions and handling noisy supervision.

Details

Motivation: Current SFDA approaches ignore sample difficulty and struggle with noisy supervision under domain shift, limiting their effectiveness for medical image segmentation under privacy constraints.

Method: Partitions unlabeled images into reliable/unreliable subsets via entropy-similarity analysis, refines pseudo-labels using Monte Carlo-based denoising masks, and employs intra- and inter-domain patch mixing objectives to transfer reliable semantics while mitigating noise.

Result: Achieves consistent gains over prior SFDA and UDA methods, delivering more accurate boundary delineation and state-of-the-art Dice and ASSD scores on benchmark datasets.

Conclusion: Progressive adaptation and denoised supervision are crucial for robust medical image segmentation under domain shift in privacy-constrained settings.

Abstract: Source-Free Domain Adaptation (SFDA) is emerging as a compelling solution for medical image segmentation under privacy constraints, yet current approaches often ignore sample difficulty and struggle with noisy supervision under domain shift. We present a new SFDA framework that leverages Hard Sample Selection and Denoised Patch Mixing to progressively align target distributions. First, unlabeled images are partitioned into reliable and unreliable subsets through entropy-similarity analysis, allowing adaptation to start from easy samples and gradually incorporate harder ones. Next, pseudo-labels are refined via Monte Carlo-based denoising masks, which suppress unreliable pixels and stabilize training. Finally, intra- and inter-domain objectives mix patches between subsets, transferring reliable semantics while mitigating noise. Experiments on benchmark datasets show consistent gains over prior SFDA and UDA methods, delivering more accurate boundary delineation and achieving state-of-the-art Dice and ASSD scores. Our study highlights the importance of progressive adaptation and denoised supervision for robust segmentation under domain shift.

[162] Balanced conic rectified flow

Kim Shin Seong, Mingi Kwon, Jaeseok Jeong, Youngjung Uh

Main category: cs.CV

TL;DR: This paper proposes an improved rectified flow method that incorporates real images into training to reduce computational costs and bias issues in the original rectified flow approach.

Details

Motivation: Rectified flow faces challenges: 1) reflow requires large generative pairs leading to high computational costs, 2) performance depends heavily on 1-rectified flow model causing bias towards generated data.

Method: The proposed approach incorporates real images into training process, preserving ODE paths for real images and using a smaller set of generated and real images for efficient reflow process.

Result: In CIFAR-10, achieved significantly better FID scores in both one-step and full-step simulations while using only a fraction of generative pairs compared to original method. Also induces straighter paths and avoids saturation on generated images.

Conclusion: The method enables more robust ODE learning while preserving real image distribution, effectively reducing reliance on large amounts of generated data.

Abstract: Rectified flow is a generative model that learns smooth transport mappings between two distributions through an ordinary differential equation (ODE). Unlike diffusion-based generative models, which require costly numerical integration of a generative ODE to sample images with state-of-the-art quality, rectified flow uses an iterative process called reflow to learn smooth and straight ODE paths. This allows for relatively simple and efficient generation of high-quality images. However, rectified flow still faces several challenges.

The reflow process requires a large number of generative pairs to preserve the target distribution, leading to significant computational costs. 2) Since the model is typically trained using only generated image pairs, its performance heavily depends on the 1-rectified flow model, causing it to become biased towards the generated data. In this work, we experimentally expose the limitations of the original rectified flow and propose a novel approach that incorporates real images into the training process. By preserving the ODE paths for real images, our method effectively reduces reliance on large amounts of generated data. Instead, we demonstrate that the reflow process can be conducted efficiently using a much smaller set of generated and real images. In CIFAR-10, we achieved significantly better FID scores, not only in one-step generation but also in full-step simulations, while using only of the generative pairs compared to the original method. Furthermore, our approach induces straighter paths and avoids saturation on generated images during reflow, leading to more robust ODE learning while preserving the distribution of real images.

[163] Learning Disentangled Speech- and Expression-Driven Blendshapes for 3D Talking Face Animation

Yuxiang Mao, Zhijie Zhang, Zhiheng Zhang, Jiawei Liu, Chen Zeng, Shihong Xia

Main category: cs.CV

TL;DR: The paper proposes a method for generating emotionally expressive 3D talking faces by modeling facial animation as a linear additive problem using speech and emotion blendshapes, achieving superior emotional expressivity while maintaining accurate lip synchronization.

Details

Motivation: There is a scarcity of real emotional 3D talking-face datasets, and generating emotionally expressive talking faces remains underexplored despite progress in speech-driven lip-sync animation.

Method: Model facial animation as a linear additive problem using speech and emotion blendshapes. Use VOCAset (neutral expressions) and Florence4D (3D expression sequences) to jointly learn blendshapes with sparsity constraint loss for disentanglement. Map blendshapes to FLAME model parameters for 3D Gaussian avatar animation.

Result: Qualitative and quantitative experiments show natural generation of talking faces with specified expressions while maintaining accurate lip synchronization. Perceptual studies demonstrate superior emotional expressivity compared to existing methods without compromising lip-sync quality.

Conclusion: The proposed method effectively generates emotionally expressive 3D talking faces by disentangling speech and emotion blendshapes, achieving both high emotional expressivity and accurate lip synchronization.

Abstract: Expressions are fundamental to conveying human emotions. With the rapid advancement of AI-generated content (AIGC), realistic and expressive 3D facial animation has become increasingly crucial. Despite recent progress in speech-driven lip-sync for talking-face animation, generating emotionally expressive talking faces remains underexplored. A major obstacle is the scarcity of real emotional 3D talking-face datasets due to the high cost of data capture. To address this, we model facial animation driven by both speech and emotion as a linear additive problem. Leveraging a 3D talking-face dataset with neutral expressions (VOCAset) and a dataset of 3D expression sequences (Florence4D), we jointly learn a set of blendshapes driven by speech and emotion. We introduce a sparsity constraint loss to encourage disentanglement between the two types of blendshapes while allowing the model to capture inherent secondary cross-domain deformations present in the training data. The learned blendshapes can be further mapped to the expression and jaw pose parameters of the FLAME model, enabling the animation of 3D Gaussian avatars. Qualitative and quantitative experiments demonstrate that our method naturally generates talking faces with specified expressions while maintaining accurate lip synchronization. Perceptual studies further show that our approach achieves superior emotional expressivity compared to existing methods, without compromising lip-sync quality.

[164] DeepShield: Fortifying Deepfake Video Detection with Local and Global Forgery Analysis

Yinqi Cai, Jichang Li, Zhaolun Li, Weikai Chen, Rushi Lan, Xi Xie, Xiaonan Luo, Guanbin Li

Main category: cs.CV

TL;DR: DeepShield is a deepfake detection framework that improves robustness across unseen forgeries by balancing local sensitivity and global generalization through Local Patch Guidance and Global Forgery Diversification components.

Details

Motivation: Existing deepfake detectors perform well in in-domain scenarios but fail to generalize across diverse manipulation techniques due to reliance on forgery-specific artifacts, raising concerns about misuse for fraud and misinformation.

Method: Enhances CLIP-ViT encoder with Local Patch Guidance (spatiotemporal artifact modeling with patch-wise supervision) and Global Forgery Diversification (domain feature augmentation with domain-bridging and boundary-expanding feature generation).

Result: Outperforms state-of-the-art methods in cross-dataset and cross-manipulation evaluations, achieving superior robustness against unseen deepfake attacks.

Conclusion: DeepShield demonstrates that integrating novel local and global analysis approaches significantly improves deepfake detection robustness across diverse manipulation techniques.

Abstract: Recent advances in deep generative models have made it easier to manipulate face videos, raising significant concerns about their potential misuse for fraud and misinformation. Existing detectors often perform well in in-domain scenarios but fail to generalize across diverse manipulation techniques due to their reliance on forgery-specific artifacts. In this work, we introduce DeepShield, a novel deepfake detection framework that balances local sensitivity and global generalization to improve robustness across unseen forgeries. DeepShield enhances the CLIP-ViT encoder through two key components: Local Patch Guidance (LPG) and Global Forgery Diversification (GFD). LPG applies spatiotemporal artifact modeling and patch-wise supervision to capture fine-grained inconsistencies often overlooked by global models. GFD introduces domain feature augmentation, leveraging domain-bridging and boundary-expanding feature generation to synthesize diverse forgeries, mitigating overfitting and enhancing cross-domain adaptability. Through the integration of novel local and global analysis for deepfake detection, DeepShield outperforms state-of-the-art methods in cross-dataset and cross-manipulation evaluations, achieving superior robustness against unseen deepfake attacks.

[165] VADB: A Large-Scale Video Aesthetic Database with Professional and Multi-Dimensional Annotations

Qianqian Qiao, DanDan Zheng, Yihang Bo, Bao Peng, Heng Huang, Longteng Jiang, Huaye Wang, Jingdong Chen, Jun Zhou, Xin Jin

Main category: cs.CV

TL;DR: This paper introduces VADB, the largest video aesthetic database with 10,490 videos annotated across multiple aesthetic dimensions, and proposes VADB-Net, a dual-modal pre-training framework that outperforms existing video quality assessment models.

Details

Motivation: Video aesthetic assessment progress is limited by lack of standardized datasets and robust models, as temporal dynamics and multimodal fusion challenges prevent direct application of image-based methods.

Method: Created VADB database with 10,490 diverse videos annotated by 37 professionals across multiple aesthetic dimensions. Proposed VADB-Net, a dual-modal pre-training framework with two-stage training strategy.

Result: VADB-Net outperforms existing video quality assessment models in scoring tasks and supports downstream video aesthetic assessment tasks.

Conclusion: The study provides a comprehensive solution for video aesthetic assessment through both dataset creation (VADB) and model development (VADB-Net), with publicly available dataset and source code.

Abstract: Video aesthetic assessment, a vital area in multimedia computing, integrates computer vision with human cognition. Its progress is limited by the lack of standardized datasets and robust models, as the temporal dynamics of video and multimodal fusion challenges hinder direct application of image-based methods. This study introduces VADB, the largest video aesthetic database with 10,490 diverse videos annotated by 37 professionals across multiple aesthetic dimensions, including overall and attribute-specific aesthetic scores, rich language comments and objective tags. We propose VADB-Net, a dual-modal pre-training framework with a two-stage training strategy, which outperforms existing video quality assessment models in scoring tasks and supports downstream video aesthetic assessment tasks. The dataset and source code are available at https://github.com/BestiVictory/VADB.

[166] Mapping and Classification of Trees Outside Forests using Deep Learning

Moritz Lucas, Hamid Ebrahimy, Viacheslav Barkov, Ralf Pecenka, Kai-Uwe Kühnberger, Björn Waske

Main category: cs.CV

TL;DR: Deep learning models were evaluated for classifying Trees Outside Forests (TOF) using high-resolution aerial imagery from German agricultural landscapes, with FT-UNetFormer achieving the best performance.

Details

Motivation: TOF are important for biodiversity, carbon sequestration, and microclimate regulation, but previous studies treated TOF as a single class or used rigid thresholds, limiting ecological interpretation and regional adaptability.

Method: Compared CNN, vision transformer, and hybrid CNN-transformer models across six semantic segmentation architectures (ABCNet, LSKNet, FT-UNetFormer, DC-Swin, BANet, U-Net) to map four TOF categories: Forest, Patch, Linear, and Tree using high-resolution aerial imagery from four German agricultural landscapes.

Result: Models achieved good classification accuracy overall, with FT-UNetFormer performing best (mean IoU 0.74; mean F1 score 0.84). Good results for Forest and Linear classes, but challenges in classifying complex structures with high edge density (Patch and Tree classes). Generalization experiments showed need for regionally diverse training data.

Conclusion: Deep learning, particularly FT-UNetFormer, is effective for TOF classification, highlighting the importance of spatial context understanding and regionally diverse training data for reliable large-scale mapping.

Abstract: Trees Outside Forests (TOF) play an important role in agricultural landscapes by supporting biodiversity, sequestering carbon, and regulating microclimates. Yet, most studies have treated TOF as a single class or relied on rigid rule-based thresholds, limiting ecological interpretation and adaptability across regions. To address this, we evaluate deep learning for TOF classification using a newly generated dataset and high-resolution aerial imagery from four agricultural landscapes in Germany. Specifically, we compare convolutional neural networks (CNNs), vision transformers, and hybrid CNN-transformer models across six semantic segmentation architectures (ABCNet, LSKNet, FT-UNetFormer, DC-Swin, BANet, and U-Net) to map four categories of woody vegetation: Forest, Patch, Linear, and Tree, derived from previous studies and governmental products. Overall, the models achieved good classification accuracy across the four landscapes, with the FT-UNetFormer performing best (mean Intersection-over-Union 0.74; mean F1 score 0.84), underscoring the importance of spatial context understanding in TOF mapping and classification. Our results show good results for Forest and Linear class and reveal challenges particularly in classifying complex structures with high edge density, notably the Patch and Tree class. Our generalization experiments highlight the need for regionally diverse training data to ensure reliable large-scale mapping. The dataset and code are openly available at https://github.com/Moerizzy/TOFMapper

[167] RT-DETRv4: Painlessly Furthering Real-Time Object Detection with Vision Foundation Models

Zijun Liao, Yian Zhao, Xin Shan, Yu Yan, Chang Liu, Lei Lu, Xiangyang Ji, Jie Chen

Main category: cs.CV

TL;DR: A distillation framework using Vision Foundation Models to enhance lightweight object detectors without increasing deployment overhead, achieving state-of-the-art results on COCO dataset.

Details

Motivation: Lightweight network designs for high-speed inference often degrade feature representation, limiting performance improvements and practical on-device deployment. The goal is to enhance lightweight detectors using powerful VFMs while maintaining efficiency.

Method: Proposes a distillation framework with: 1) Deep Semantic Injector (DSI) module to integrate high-level VFM representations into detector deep layers, and 2) Gradient-guided Adaptive Modulation (GAM) strategy to dynamically adjust semantic transfer intensity based on gradient norm ratios.

Result: RT-DETRv4 achieves state-of-the-art results on COCO with AP scores of 49.7/53.5/55.4/57.0 at corresponding speeds of 273/169/124/78 FPS, delivering consistent performance gains across diverse DETR-based models without increasing deployment overhead.

Conclusion: The proposed framework effectively bridges architectural disparities between VFMs and lightweight detectors, enabling stable semantic transfer and significant performance improvements for real-time object detection applications.

Abstract: Real-time object detection has achieved substantial progress through meticulously designed architectures and optimization strategies. However, the pursuit of high-speed inference via lightweight network designs often leads to degraded feature representation, which hinders further performance improvements and practical on-device deployment. In this paper, we propose a cost-effective and highly adaptable distillation framework that harnesses the rapidly evolving capabilities of Vision Foundation Models (VFMs) to enhance lightweight object detectors. Given the significant architectural and learning objective disparities between VFMs and resource-constrained detectors, achieving stable and task-aligned semantic transfer is challenging. To address this, on one hand, we introduce a Deep Semantic Injector (DSI) module that facilitates the integration of high-level representations from VFMs into the deep layers of the detector. On the other hand, we devise a Gradient-guided Adaptive Modulation (GAM) strategy, which dynamically adjusts the intensity of semantic transfer based on gradient norm ratios. Without increasing deployment and inference overhead, our approach painlessly delivers striking and consistent performance gains across diverse DETR-based models, underscoring its practical utility for real-time detection. Our new model family, RT-DETRv4, achieves state-of-the-art results on COCO, attaining AP scores of 49.7/53.5/55.4/57.0 at corresponding speeds of 273/169/124/78 FPS.

[168] LangHOPS: Language Grounded Hierarchical Open-Vocabulary Part Segmentation

Yang Miao, Jan-Nico Zaech, Xi Wang, Fabien Despinoy, Danda Pani Paudel, Luc Van Gool

Main category: cs.CV

TL;DR: LangHOPS is the first MLLM-based framework for open-vocabulary object-part instance segmentation that grounds object-part hierarchies in language space rather than visual grouping.

Details

Motivation: Prior approaches rely on heuristic or learnable visual grouping for object-part segmentation, which has limitations. The motivation is to leverage MLLM's rich knowledge and reasoning capabilities to better handle hierarchical object-part relationships.

Method: Integrates MLLM into object-part parsing pipeline to ground hierarchies in language space, uses MLLM-driven part query refinement strategy, and links multi-granularity concepts within hierarchies.

Result: Achieves state-of-the-art results: 5.5% AP improvement (in-domain) and 4.8% AP (cross-dataset) on PartImageNet, and 2.5% mIOU improvement on unseen object parts in ADE20K (zero-shot).

Conclusion: LangHOPS demonstrates the effectiveness of language-grounded hierarchy and MLLM-driven part query refinement for open-vocabulary object-part instance segmentation, outperforming previous methods across multiple challenging scenarios.

Abstract: We propose LangHOPS, the first Multimodal Large Language Model (MLLM) based framework for open-vocabulary object-part instance segmentation. Given an image, LangHOPS can jointly detect and segment hierarchical object and part instances from open-vocabulary candidate categories. Unlike prior approaches that rely on heuristic or learnable visual grouping, our approach grounds object-part hierarchies in language space. It integrates the MLLM into the object-part parsing pipeline to leverage its rich knowledge and reasoning capabilities, and link multi-granularity concepts within the hierarchies. We evaluate LangHOPS across multiple challenging scenarios, including in-domain and cross-dataset object-part instance segmentation, and zero-shot semantic segmentation. LangHOPS achieves state-of-the-art results, surpassing previous methods by 5.5% Average Precision (AP) (in-domain) and 4.8% (cross-dataset) on the PartImageNet dataset and by 2.5% mIOU on unseen object parts in ADE20K (zero-shot). Ablation studies further validate the effectiveness of the language-grounded hierarchy and MLLM driven part query refinement strategy. The code will be released here.

[169] Diffusion-Driven Progressive Target Manipulation for Source-Free Domain Adaptation

Yuyang Huang, Yabo Chen, Junyu Zhou, Wenrui Dai, Xiaopeng Zhang, Junni Zou, Hongkai Xiong, Qi Tian

Main category: cs.CV

TL;DR: DPTM is a novel generation-based SFDA framework that uses target data as references to generate and progressively refine a pseudo-target domain, addressing domain discrepancy limitations in source-free domain adaptation.

Details

Motivation: Existing SFDA methods are restricted by source-target domain discrepancy - non-generation methods suffer from unreliable pseudo-labels in large domain gaps, while generation-based methods degrade due to enlarged discrepancies when creating pseudo-source data.

Method: Divides target samples into trust/non-trust sets based on pseudo-label reliability. For non-trust samples, uses a manipulation strategy to semantically transform them into new categories while maintaining target distribution via latent diffusion model. Includes progressive refinement mechanism to iteratively reduce domain discrepancy.

Result: Outperforms existing methods by large margin, achieves state-of-the-art performance on four SFDA benchmark datasets. Significantly enhances performance by up to 18.6% in scenarios with large source-target gaps.

Conclusion: DPTM effectively addresses domain discrepancy limitations in SFDA through target data-driven generation and progressive refinement, demonstrating superior performance especially in challenging scenarios with large domain gaps.

Abstract: Source-free domain adaptation (SFDA) is a challenging task that tackles domain shifts using only a pre-trained source model and unlabeled target data. Existing SFDA methods are restricted by the fundamental limitation of source-target domain discrepancy. Non-generation SFDA methods suffer from unreliable pseudo-labels in challenging scenarios with large domain discrepancies, while generation-based SFDA methods are evidently degraded due to enlarged domain discrepancies in creating pseudo-source data. To address this limitation, we propose a novel generation-based framework named Diffusion-Driven Progressive Target Manipulation (DPTM) that leverages unlabeled target data as references to reliably generate and progressively refine a pseudo-target domain for SFDA. Specifically, we divide the target samples into a trust set and a non-trust set based on the reliability of pseudo-labels to sufficiently and reliably exploit their information. For samples from the non-trust set, we develop a manipulation strategy to semantically transform them into the newly assigned categories, while simultaneously maintaining them in the target distribution via a latent diffusion model. Furthermore, we design a progressive refinement mechanism that progressively reduces the domain discrepancy between the pseudo-target domain and the real target domain via iterative refinement. Experimental results demonstrate that DPTM outperforms existing methods by a large margin and achieves state-of-the-art performance on four prevailing SFDA benchmark datasets with different scales. Remarkably, DPTM can significantly enhance the performance by up to 18.6% in scenarios with large source-target gaps.

[170] GaTector+: A Unified Head-free Framework for Gaze Object and Gaze Following Prediction

Yang Jin, Guangyu Guo, Binglu Wang

Main category: cs.CV

TL;DR: GaTector+ is a unified framework that jointly performs gaze object detection and gaze following without requiring head-related priors during inference, using a shared backbone with task-specific blocks and novel attention mechanisms.

Details

Motivation: Previous methods solve gaze object detection and gaze following separately and depend on head-related prior knowledge, requiring auxiliary networks and preventing joint optimization.

Method: Uses expanded specific-general-specific feature extractor with shared backbone, embeds head detection branch, proposes head-based attention mechanism to fuse features, and introduces attention supervision for gaze heatmap learning.

Result: Experimental results on multiple benchmark datasets demonstrate effectiveness in both gaze object detection and gaze following tasks.

Conclusion: GaTector+ successfully eliminates dependency on head-related priors during inference while achieving strong performance through unified framework and novel mechanisms.

Abstract: Gaze object detection and gaze following are fundamental tasks for interpreting human gaze behavior or intent. However, most previous methods usually solve these two tasks separately, and their prediction of gaze objects and gaze following typically depend on head-related prior knowledge during both the training phase and real-world deployment. This dependency necessitates an auxiliary network to extract head location, thus precluding joint optimization across the entire system and constraining the practical applicability. To this end, we propose GaTector+, a unified framework for gaze object detection and gaze following, which eliminates the dependence on the head-related priors during inference. Specifically, GaTector+ uses an expanded specific-general-specific feature extractor that leverages a shared backbone, which extracts general features for gaze following and object detection using the shared backbone while using specific blocks before and after the shared backbone to better consider the specificity of each sub-task. To obtain head-related knowledge without prior information, we first embed a head detection branch to predict the head of each person. Then, before regressing the gaze point, a head-based attention mechanism is proposed to fuse the sense feature and gaze feature with the help of head location. Since the suboptimization of the gaze point heatmap leads to the performance bottleneck, we propose an attention supervision mechanism to accelerate the learning of the gaze heatmap. Finally, we propose a novel evaluation metric, mean Similarity over Candidates (mSoC), for gaze object detection, which is more sensitive to variations between bounding boxes. The experimental results on multiple benchmark datasets demonstrate the effectiveness of our model in both gaze object detection and gaze following tasks.

[171] Prototype-Driven Adaptation for Few-Shot Object Detection

Yushen Huang, Zhiming Wang

Main category: cs.CV

TL;DR: PDA is a lightweight plug-in metric head for DeFRCN that uses prototype-based matching to improve few-shot object detection by reducing base-class bias and improving calibration.

Details

Motivation: Few-shot object detection suffers from base-class bias and unstable calibration when only limited novel samples are available, requiring complementary approaches to standard linear classifiers.

Method: Maintains support-only prototypes in learnable projection space, uses prototype-conditioned RoI alignment, adapts prototypes via EMA updates during fine-tuning, and employs best-of-K matching with temperature-scaled fusion.

Result: Consistently improves novel-class performance on VOC FSOD and GFSOD benchmarks with minimal impact on base classes and negligible computational overhead.

Conclusion: PDA provides an effective prototype-driven approach that enhances few-shot object detection performance while maintaining protocol compliance and computational efficiency.

Abstract: Few-shot object detection (FSOD) often suffers from base-class bias and unstable calibration when only a few novel samples are available. We propose Prototype-Driven Alignment (PDA), a lightweight, plug-in metric head for DeFRCN that provides a prototype-based “second opinion” complementary to the linear classifier. PDA maintains support-only prototypes in a learnable identity-initialized projection space and optionally applies prototype-conditioned RoI alignment to reduce geometric mismatch. During fine-tuning, prototypes can be adapted via exponential moving average(EMA) updates on labeled foreground RoIs-without introducing class-specific parameters-and are frozen at inference to ensure strict protocol compliance. PDA employs a best-of-K matching scheme to capture intra-class multi-modality and temperature-scaled fusion to combine metric similarities with detector logits. Experiments on VOC FSOD and GFSOD benchmarks show that PDA consistently improves novel-class performance with minimal impact on base classes and negligible computational overhead.

[172] MMEdge: Accelerating On-device Multimodal Inference via Pipelined Sensing and Encoding

Runxi Huang, Mingxuan Yu, Mingyu Tsoi, Xiaomin Ouyang

Main category: cs.CV

TL;DR: MMEdge is a real-time multimodal inference framework for edge devices that uses pipelined sensing and encoding to reduce latency while maintaining accuracy through temporal aggregation and adaptive optimization.

Details

Motivation: Real-time multimodal inference on resource-constrained edge devices is essential for applications like autonomous driving and human-computer interaction, but prior work overlooks the coupling between sensing dynamics and model execution, as well as inter-modality dependencies.

Method: MMEdge decomposes inference into fine-grained sensing/encoding units for incremental computation, uses temporal aggregation to capture dynamics across units, incorporates adaptive configuration optimizer for optimal sensing under latency constraints, and employs cross-modal speculative skipping to bypass slower modalities when early predictions are confident.

Result: Evaluation on public multimodal datasets and real-world UAV testbed shows MMEdge significantly reduces end-to-end latency while maintaining high task accuracy across various system and data dynamics.

Conclusion: MMEdge provides an effective solution for real-time multimodal inference on edge devices by addressing the tight coupling between sensing and execution through pipelined design and adaptive optimization techniques.

Abstract: Real-time multimodal inference on resource-constrained edge devices is essential for applications such as autonomous driving, human-computer interaction, and mobile health. However, prior work often overlooks the tight coupling between sensing dynamics and model execution, as well as the complex inter-modality dependencies. In this paper, we propose MMEdge, an new on-device multi-modal inference framework based on pipelined sensing and encoding. Instead of waiting for complete sensor inputs, MMEdge decomposes the entire inference process into a sequence of fine-grained sensing and encoding units, allowing computation to proceed incrementally as data arrive. MMEdge also introduces a lightweight but effective temporal aggregation module that captures rich temporal dynamics across different pipelined units to maintain accuracy performance. Such pipelined design also opens up opportunities for fine-grained cross-modal optimization and early decision-making during inference. To further enhance system performance under resource variability and input data complexity, MMEdge incorporates an adaptive multimodal configuration optimizer that dynamically selects optimal sensing and model configurations for each modality under latency constraints, and a cross-modal speculative skipping mechanism that bypasses future units of slower modalities when early predictions reach sufficient confidence. We evaluate MMEdge using two public multimodal datasets and deploy it on a real-world unmanned aerial vehicle (UAV)-based multimodal testbed. The results show that MMEdge significantly reduces end-to-end latency while maintaining high task accuracy across various system and data dynamics.

[173] StreamingCoT: A Dataset for Temporal Dynamics and Multimodal Chain-of-Thought Reasoning in Streaming VideoQA

Yuhang Hu, Zhenyu Yang, Shihan Wang, Shengsheng Qian, Bin Wen, Fan Yang, Tingting Gao, Changsheng Xu

Main category: cs.CV

TL;DR: StreamingCoT is a new dataset for streaming VideoQA with temporally evolving reasoning and multimodal Chain-of-Thought tasks, addressing limitations in current datasets through dynamic hierarchical annotation and explicit reasoning chain generation.

Details

Motivation: Current VideoQA datasets have static annotations that don't capture evolving answers in temporal video streams, and lack explicit reasoning process annotations, limiting model interpretability and logical deduction capabilities.

Method: Dynamic hierarchical annotation architecture generating per-second dense descriptions and temporally-dependent semantic segments through similarity fusion, paired with explicit reasoning chain generation via spatiotemporal object extraction, object state transition-based reasoning paths using LLMs, and human-verified validation.

Result: Created StreamingCoT dataset with temporally evolving reasoning capabilities and multimodal CoT tasks, establishing foundation for streaming video understanding and complex temporal reasoning research.

Conclusion: StreamingCoT addresses critical limitations in current VideoQA datasets and provides a foundation for advancing research in streaming video understanding, complex temporal reasoning, and multimodal inference.

Abstract: The rapid growth of streaming video applications demands multimodal models with enhanced capabilities for temporal dynamics understanding and complex reasoning. However, current Video Question Answering (VideoQA) datasets suffer from two critical limitations: 1) Static annotation mechanisms fail to capture the evolving nature of answers in temporal video streams, and 2) The absence of explicit reasoning process annotations restricts model interpretability and logical deduction capabilities. To address these challenges, We introduce StreamingCoT, the first dataset explicitly designed for temporally evolving reasoning in streaming VideoQA and multimodal Chain-of-Thought (CoT) tasks. Our framework first establishes a dynamic hierarchical annotation architecture that generates per-second dense descriptions and constructs temporally-dependent semantic segments through similarity fusion, paired with question-answer sets constrained by temporal evolution patterns. We further propose an explicit reasoning chain generation paradigm that extracts spatiotemporal objects via keyframe semantic alignment, derives object state transition-based reasoning paths using large language models, and ensures logical coherence through human-verified validation. This dataset establishes a foundation for advancing research in streaming video understanding, complex temporal reasoning, and multimodal inference. Our StreamingCoT and its construction toolkit can be accessed at https://github.com/Fleeting-hyh/StreamingCoT.

[174] Informative Sample Selection Model for Skeleton-based Action Recognition with Limited Training Samples

Zhigang Tu, Zhengbo Zhang, Jia Gong, Junsong Yuan, Bo Du

Main category: cs.CV

TL;DR: This paper proposes a novel approach for semi-supervised 3D action recognition using active learning, reformulating the problem as a Markov Decision Process (MDP) and using hyperbolic space projection to enhance sample selection.

Details

Motivation: Existing active learning methods in semi-supervised 3D action recognition select representative skeleton sequences, but these may not be the most informative since the model may already have similar knowledge from previous samples.

Method: Reformulate semi-supervised 3D action recognition as a Markov Decision Process (MDP), train a sample selection model using this framework, project state-action pairs from Euclidean to hyperbolic space for enhanced representation, and apply meta tuning for faster real-world deployment.

Result: Extensive experiments on three 3D action recognition benchmarks demonstrate the effectiveness of the proposed method.

Conclusion: The MDP-based approach with hyperbolic space projection and meta tuning provides an effective solution for intelligent sample selection in semi-supervised 3D action recognition, outperforming traditional active learning methods.

Abstract: Skeleton-based human action recognition aims to classify human skeletal sequences, which are spatiotemporal representations of actions, into predefined categories. To reduce the reliance on costly annotations of skeletal sequences while maintaining competitive recognition accuracy, the task of 3D Action Recognition with Limited Training Samples, also known as semi-supervised 3D Action Recognition, has been proposed. In addition, active learning, which aims to proactively select the most informative unlabeled samples for annotation, has been explored in semi-supervised 3D Action Recognition for training sample selection. Specifically, researchers adopt an encoder-decoder framework to embed skeleton sequences into a latent space, where clustering information, combined with a margin-based selection strategy using a multi-head mechanism, is utilized to identify the most informative sequences in the unlabeled set for annotation. However, the most representative skeleton sequences may not necessarily be the most informative for the action recognizer, as the model may have already acquired similar knowledge from previously seen skeleton samples. To solve it, we reformulate Semi-supervised 3D action recognition via active learning from a novel perspective by casting it as a Markov Decision Process (MDP). Built upon the MDP framework and its training paradigm, we train an informative sample selection model to intelligently guide the selection of skeleton sequences for annotation. To enhance the representational capacity of the factors in the state-action pairs within our method, we project them from Euclidean space to hyperbolic space. Furthermore, we introduce a meta tuning strategy to accelerate the deployment of our method in real-world scenarios. Extensive experiments on three 3D action recognition benchmarks demonstrate the effectiveness of our method.

[175] 3D CT-Based Coronary Calcium Assessment: A Feature-Driven Machine Learning Framework

Ayman Abaid, Gianpiero Guidone, Sara Alsubai, Foziyah Alquahtani, Talha Iqbal, Ruth Sharif, Hesham Elzomor, Emiliano Bianchini, Naeif Almagal, Michael G. Madden, Faisal Sharif, Ihsan Ullah

Main category: cs.CV

TL;DR: A radiomics-based pipeline using pseudo-labeling for CAC scoring from non-contrast CCTA scans, outperforming foundation model features with 84% accuracy.

Details

Motivation: To address limited annotated data for coronary artery calcium scoring in non-contrast CCTA scans, eliminating need for expert segmentations.

Method: Proposed radiomics pipeline with pseudo-labeling for training labels, compared with pretrained foundation models (CT-FM, RadImageNet) features using traditional classifiers.

Result: Radiomics models significantly outperformed CNN embeddings from foundation models (84% accuracy, p<0.05) on 182-patient CCTA dataset classifying zero vs non-zero calcium scores.

Conclusion: Radiomics-based approach is effective for CAC scoring without expert annotations, outperforming deep learning features from foundation models.

Abstract: Coronary artery calcium (CAC) scoring plays a crucial role in the early detection and risk stratification of coronary artery disease (CAD). In this study, we focus on non-contrast coronary computed tomography angiography (CCTA) scans, which are commonly used for early calcification detection in clinical settings. To address the challenge of limited annotated data, we propose a radiomics-based pipeline that leverages pseudo-labeling to generate training labels, thereby eliminating the need for expert-defined segmentations. Additionally, we explore the use of pretrained foundation models, specifically CT-FM and RadImageNet, to extract image features, which are then used with traditional classifiers. We compare the performance of these deep learning features with that of radiomics features. Evaluation is conducted on a clinical CCTA dataset comprising 182 patients, where individuals are classified into two groups: zero versus non-zero calcium scores. We further investigate the impact of training on non-contrast datasets versus combined contrast and non-contrast datasets, with testing performed only on non contrast scans. Results show that radiomics-based models significantly outperform CNN-derived embeddings from foundation models (achieving 84% accuracy and p<0.05), despite the unavailability of expert annotations.

[176] More than a Moment: Towards Coherent Sequences of Audio Descriptions

Eshika Khandelwal, Junyu Xie, Tengda Han, Max Bain, Arsha Nagrani, Andrew Zisserman, Gül Varol, Makarand Tapaswi

Main category: cs.CV

TL;DR: CoherentAD is a training-free method that generates coherent Audio Description sequences by selecting from multiple candidates across time intervals, outperforming independent generation approaches.

Details

Motivation: Current automatic Audio Description methods generate each description independently, resulting in repetitive and incoherent sequences that fail to help visually impaired audiences visualize unfolding scenes.

Method: Generate multiple candidate descriptions for each AD time interval, then perform auto-regressive selection across the sequence to form a coherent narrative. Also introduces StoryRecall metric for sequence-level evaluation.

Result: The method produces coherent AD sequences with enhanced narrative understanding and outperforms prior approaches that rely on independent generations.

Conclusion: CoherentAD effectively addresses the coherence problem in automatic Audio Description generation through candidate selection and sequence-level optimization.

Abstract: Audio Descriptions (ADs) convey essential on-screen information, allowing visually impaired audiences to follow videos. To be effective, ADs must form a coherent sequence that helps listeners to visualise the unfolding scene, rather than describing isolated moments. However, most automatic methods generate each AD independently, often resulting in repetitive, incoherent descriptions. To address this, we propose a training-free method, CoherentAD, that first generates multiple candidate descriptions for each AD time interval, and then performs auto-regressive selection across the sequence to form a coherent and informative narrative. To evaluate AD sequences holistically, we introduce a sequence-level metric, StoryRecall, which measures how well the predicted ADs convey the ground truth narrative, alongside repetition metrics that capture the redundancy across consecutive AD outputs. Our method produces coherent AD sequences with enhanced narrative understanding, outperforming prior approaches that rely on independent generations.

[177] Prompt Estimation from Prototypes for Federated Prompt Tuning of Vision Transformers

M Yashwanth, Sharannya Ghosh, Aditay Tripathi, Anirban Chakraborty

Main category: cs.CV

TL;DR: PEP-FedPT is a federated prompt tuning framework for Vision Transformers that achieves both generalization and personalization through Class-Contextualized Mixed Prompts, combining class-specific prompts with global prompts adaptively.

Details

Motivation: Visual Prompt Tuning is effective for parameter-efficient fine-tuning but struggles in federated learning - global prompt tuning lacks generalization across heterogeneous clients, while personalized tuning overfits to local data.

Method: Proposes Class-Contextualized Mixed Prompt (CCMP) that maintains class-specific prompts alongside a global prompt, adaptively combining them using weights from global class prototypes and client class priors for per-sample personalization without client-dependent parameters.

Result: Comprehensive evaluations on CIFAR-100, TinyImageNet, DomainNet, and iNaturalist datasets show PEP-FedPT consistently surpasses state-of-the-art baselines under diverse data heterogeneity scenarios.

Conclusion: PEP-FedPT establishes a strong foundation for efficient and generalizable federated prompt tuning of Vision Transformers, achieving both generalization and personalization effectively.

Abstract: Visual Prompt Tuning (VPT) of pre-trained Vision Transformers (ViTs) has proven highly effective as a parameter-efficient fine-tuning technique for adapting large models to downstream tasks with limited data. Its parameter efficiency makes it particularly suitable for Federated Learning (FL), where both communication and computation budgets are often constrained. However, global prompt tuning struggles to generalize across heterogeneous clients, while personalized tuning overfits to local data and lacks generalization. We propose PEP-FedPT (Prompt Estimation from Prototypes for Federated Prompt Tuning), a unified framework designed to achieve both generalization and personalization in federated prompt tuning of ViTs. Within this framework, we introduce the novel Class-Contextualized Mixed Prompt (CCMP) - based on class-specific prompts maintained alongside a globally shared prompt. For each input, CCMP adaptively combines class-specific prompts using weights derived from global class prototypes and client class priors. This approach enables per-sample prompt personalization without storing client-dependent trainable parameters. The prompts are collaboratively optimized via traditional federated averaging technique on the same. Comprehensive evaluations on CIFAR-100, TinyImageNet, DomainNet, and iNaturalist datasets demonstrate that PEP-FedPT consistently surpasses the state-of-the-art baselines under diverse data heterogeneity scenarios, establishing a strong foundation for efficient and generalizable federated prompt tuning of Vision Transformers.

[178] Instance-Level Composed Image Retrieval

Bill Psomas, George Retsinas, Nikos Efthymiadis, Panagiotis Filntisis, Yannis Avrithis, Petros Maragos, Ondrej Chum, Giorgos Tolias

Main category: cs.CV

TL;DR: The paper introduces i-CIR, a new instance-level composed image retrieval dataset, and BASIC, a training-free method that uses pre-trained VLMs to achieve state-of-the-art performance on CIR tasks.

Details

Motivation: Progress in composed image retrieval is limited by the lack of high-quality training and evaluation data, particularly for instance-level retrieval where the goal is to find the same specific object under various modifications.

Method: Proposes BASIC - a training-free approach that separately estimates query-image-to-image and query-text-to-image similarities using pre-trained VLMs, then performs late fusion to upweight images satisfying both queries while downweighting those matching only one.

Result: BASIC achieves state-of-the-art performance on both the new i-CIR dataset and existing semantic-level CIR datasets, demonstrating effectiveness even without training data.

Conclusion: The combination of the i-CIR dataset and BASIC method addresses key data limitations in CIR research and provides a strong baseline for future work in instance-level composed image retrieval.

Abstract: The progress of composed image retrieval (CIR), a popular research direction in image retrieval, where a combined visual and textual query is used, is held back by the absence of high-quality training and evaluation data. We introduce a new evaluation dataset, i-CIR, which, unlike existing datasets, focuses on an instance-level class definition. The goal is to retrieve images that contain the same particular object as the visual query, presented under a variety of modifications defined by textual queries. Its design and curation process keep the dataset compact to facilitate future research, while maintaining its challenge-comparable to retrieval among more than 40M random distractors-through a semi-automated selection of hard negatives. To overcome the challenge of obtaining clean, diverse, and suitable training data, we leverage pre-trained vision-and-language models (VLMs) in a training-free approach called BASIC. The method separately estimates query-image-to-image and query-text-to-image similarities, performing late fusion to upweight images that satisfy both queries, while down-weighting those that exhibit high similarity with only one of the two. Each individual similarity is further improved by a set of components that are simple and intuitive. BASIC sets a new state of the art on i-CIR but also on existing CIR datasets that follow a semantic-level class definition. Project page: https://vrg.fel.cvut.cz/icir/.

[179] SPADE: Sparsity Adaptive Depth Estimator for Zero-Shot, Real-Time, Monocular Depth Estimation in Underwater Environments

Hongjie Zhang, Gideon Billings, Stefan B. Williams

Main category: cs.CV

TL;DR: SPADE: A monocular depth estimation pipeline that combines pre-trained relative depth estimators with sparse depth priors to produce dense, metric scale depth maps for underwater vehicle navigation.

Details

Motivation: Underwater infrastructure inspection faces challenges with human divers and remotely operated vehicles due to perceptual limitations in complex structures and turbid water. Enhancing spatial awareness of underwater vehicles is crucial for reducing piloting risks and enabling autonomy.

Method: Two-stage approach: 1) Scales relative depth map with sparse depth points, 2) Refines metric prediction using proposed Cascade Conv-Deformable Transformer blocks. Combines pre-trained relative depth estimator with sparse depth priors.

Result: Achieves improved accuracy and generalization over state-of-the-art baselines. Runs efficiently at over 15 FPS on embedded hardware.

Conclusion: SPADE promises to support practical underwater inspection and intervention by providing dense, metric scale depth maps for enhanced spatial awareness of underwater vehicles.

Abstract: Underwater infrastructure requires frequent inspection and maintenance due to harsh marine conditions. Current reliance on human divers or remotely operated vehicles is limited by perceptual and operational challenges, especially around complex structures or in turbid water. Enhancing the spatial awareness of underwater vehicles is key to reducing piloting risks and enabling greater autonomy. To address these challenges, we present SPADE: SParsity Adaptive Depth Estimator, a monocular depth estimation pipeline that combines pre-trained relative depth estimator with sparse depth priors to produce dense, metric scale depth maps. Our two-stage approach first scales the relative depth map with the sparse depth points, then refines the final metric prediction with our proposed Cascade Conv-Deformable Transformer blocks. Our approach achieves improved accuracy and generalisation over state-of-the-art baselines and runs efficiently at over 15 FPS on embedded hardware, promising to support practical underwater inspection and intervention. This work has been submitted to IEEE Journal of Oceanic Engineering Special Issue of AUV 2026.

[180] Comparative Study of UNet-based Architectures for Liver Tumor Segmentation in Multi-Phase Contrast-Enhanced Computed Tomography

Doan-Van-Anh Ly, Thi-Thu-Hien Pham, Thanh-Hai Le

Main category: cs.CV

TL;DR: ResNet-based UNet3+ with CBAM attention module outperforms Transformer and Mamba backbones for liver tumor segmentation in CECT images, achieving best Dice score (0.755), IoU (0.662), and boundary precision.

Details

Motivation: Liver structure segmentation in multi-phase CECT is crucial for computer-aided diagnosis and treatment planning of liver diseases including tumor detection.

Method: Evaluated UNet-based architectures with various backbones (ResNet, Transformer, Mamba) pretrained weights. Introduced attention mechanisms including CBAM module to improve segmentation quality.

Result: ResNetUNet3+ with CBAM achieved best performance: Dice 0.755, IoU 0.662, HD95 77.911, accuracy 0.925, specificity 0.926. ResNet consistently outperformed Transformer and Mamba alternatives.

Conclusion: Classical ResNet architecture combined with modern attention modules remains highly competitive for medical image segmentation, offering promising direction for liver tumor detection in clinical practice.

Abstract: Segmentation of liver structures in multi-phase contrast-enhanced computed tomography (CECT) plays a crucial role in computer-aided diagnosis and treatment planning for liver diseases, including tumor detection. In this study, we investigate the performance of UNet-based architectures for liver tumor segmentation, starting from the original UNet and extending to UNet3+ with various backbone networks. We evaluate ResNet, Transformer-based, and State-space (Mamba) backbones, all initialized with pretrained weights. Surprisingly, despite the advances in modern architecture, ResNet-based models consistently outperform Transformer- and Mamba-based alternatives across multiple evaluation metrics. To further improve segmentation quality, we introduce attention mechanisms into the backbone and observe that incorporating the Convolutional Block Attention Module (CBAM) yields the best performance. ResNetUNet3+ with CBAM module not only produced the best overlap metrics with a Dice score of 0.755 and IoU of 0.662, but also achieved the most precise boundary delineation, evidenced by the lowest HD95 distance of 77.911. The model’s superiority was further cemented by its leading overall accuracy of 0.925 and specificity of 0.926, showcasing its robust capability in accurately identifying both lesion and healthy tissue. To further enhance interpretability, Grad-CAM visualizations were employed to highlight the region’s most influential predictions, providing insights into its decision-making process. These findings demonstrate that classical ResNet architecture, when combined with modern attention modules, remain highly competitive for medical image segmentation tasks, offering a promising direction for liver tumor detection in clinical practice.

[181] RegionE: Adaptive Region-Aware Generation for Efficient Image Editing

Pengtao Chen, Xianfang Zeng, Maosen Zhao, Mingzhu Shen, Peng Ye, Bangyin Xiang, Zhibo Wang, Wei Cheng, Gang Yu, Tao Chen

Main category: cs.CV

TL;DR: RegionE is a region-aware framework that accelerates instruction-based image editing by distinguishing between edited and unedited regions, applying one-step prediction for unedited areas and optimized local denoising for edited regions.

Details

Motivation: Existing IIE models treat all image regions uniformly despite significant differences in generation difficulty and computational redundancy between edited and unedited areas, leading to inefficiency.

Method: Three-component framework: 1) Adaptive region partition based on trajectory analysis, 2) Region-aware generation with one-step prediction for unedited regions and local iterative denoising with KV cache for edited regions, 3) Adaptive velocity decay cache to accelerate local denoising.

Result: Achieved acceleration factors of 2.57, 2.41, and 2.06 on Step1X-Edit, FLUX.1 Kontext, and Qwen-Image-Edit respectively, while preserving semantic and perceptual fidelity according to GPT-4o evaluations.

Conclusion: RegionE provides an effective training-free acceleration method for IIE tasks by leveraging region-aware generation strategies, significantly improving efficiency without compromising quality.

Abstract: Recently, instruction-based image editing (IIE) has received widespread attention. In practice, IIE often modifies only specific regions of an image, while the remaining areas largely remain unchanged. Although these two types of regions differ significantly in generation difficulty and computational redundancy, existing IIE models do not account for this distinction, instead applying a uniform generation process across the entire image. This motivates us to propose RegionE, an adaptive, region-aware generation framework that accelerates IIE tasks without additional training. Specifically, the RegionE framework consists of three main components: 1) Adaptive Region Partition. We observed that the trajectory of unedited regions is straight, allowing for multi-step denoised predictions to be inferred in a single step. Therefore, in the early denoising stages, we partition the image into edited and unedited regions based on the difference between the final estimated result and the reference image. 2) Region-Aware Generation. After distinguishing the regions, we replace multi-step denoising with one-step prediction for unedited areas. For edited regions, the trajectory is curved, requiring local iterative denoising. To improve the efficiency and quality of local iterative generation, we propose the Region-Instruction KV Cache, which reduces computational cost while incorporating global information. 3) Adaptive Velocity Decay Cache. Observing that adjacent timesteps in edited regions exhibit strong velocity similarity, we further propose an adaptive velocity decay cache to accelerate the local denoising process. We applied RegionE to state-of-the-art IIE base models, including Step1X-Edit, FLUX.1 Kontext, and Qwen-Image-Edit. RegionE achieved acceleration factors of 2.57, 2.41, and 2.06. Evaluations by GPT-4o confirmed that semantic and perceptual fidelity were well preserved.

[182] Hawk: Leveraging Spatial Context for Faster Autoregressive Text-to-Image Generation

Zhi-Kai Chen, Jun-Peng Jiang, Han-Jia Ye, De-Chuan Zhan

Main category: cs.CV

TL;DR: Hawk introduces speculative decoding for autoregressive image generation, achieving 1.71x speedup while maintaining image quality by leveraging spatial structure to guide draft model predictions.

Details

Motivation: Autoregressive image generation models produce high-quality images but suffer from slow inference due to sequential token-by-token decoding. Speculative decoding has shown success in text generation but faces challenges in image generation due to larger sampling space and inadequate use of spatial structure.

Method: Hawk uses speculative decoding with a lightweight draft model that harnesses the spatial structure of images to guide predictions. It models local dependencies by leveraging the two-dimensional spatial arrangement of image tokens.

Result: Experimental results on multiple text-to-image benchmarks show 1.71x speedup over standard autoregressive models while preserving both image fidelity and diversity.

Conclusion: Hawk successfully applies speculative decoding to image generation by effectively utilizing spatial structure, achieving significant speedup without compromising image quality.

Abstract: Autoregressive (AR) image generation models are capable of producing high-fidelity images but often suffer from slow inference due to their inherently sequential, token-by-token decoding process. Speculative decoding, which employs a lightweight draft model to approximate the output of a larger AR model, has shown promise in accelerating text generation without compromising quality. However, its application to image generation remains largely underexplored. The challenges stem from a significantly larger sampling space, which complicates the alignment between the draft and target model outputs, coupled with the inadequate use of the two-dimensional spatial structure inherent in images, thereby limiting the modeling of local dependencies. To overcome these challenges, we introduce Hawk, a new approach that harnesses the spatial structure of images to guide the speculative model toward more accurate and efficient predictions. Experimental results on multiple text-to-image benchmarks demonstrate a 1.71x speedup over standard AR models, while preserving both image fidelity and diversity.

[183] Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks

Xu Zheng, Zihao Dongfang, Lutao Jiang, Boyuan Zheng, Yulong Guo, Zhenquan Zhang, Giuliano Albanese, Runyi Yang, Mengjiao Ma, Zixin Zhang, Chenfei Liao, Dingcheng Zhen, Yuanhuiyi Lyu, Yuqian Fu, Bin Ren, Linfeng Zhang, Danda Pani Paudel, Nicu Sebe, Luc Van Gool, Xuming Hu

Main category: cs.CV

TL;DR: This survey provides a comprehensive review of multimodal spatial reasoning tasks with large models, categorizing progress in MLLMs and introducing open benchmarks for evaluation across 2D/3D spaces and embodied AI.

Details

Motivation: Humans have strong spatial reasoning abilities through multimodal observations, but systematic reviews and benchmarks for large multimodal reasoning models remain limited.

Method: The survey categorizes recent progress in multimodal large language models (MLLMs), examines spatial reasoning across 2D/3D tasks, embodied AI, and emerging modalities like audio and egocentric video.

Result: The survey establishes a foundation for multimodal spatial reasoning research and provides open benchmarks for evaluation, with code available on GitHub.

Conclusion: This survey offers comprehensive insights into the growing field of multimodal spatial reasoning and provides valuable resources for future research through established benchmarks.

Abstract: Humans possess spatial reasoning abilities that enable them to understand spaces through multimodal observations, such as vision and sound. Large multimodal reasoning models extend these abilities by learning to perceive and reason, showing promising performance across diverse spatial tasks. However, systematic reviews and publicly available benchmarks for these models remain limited. In this survey, we provide a comprehensive review of multimodal spatial reasoning tasks with large models, categorizing recent progress in multimodal large language models (MLLMs) and introducing open benchmarks for evaluation. We begin by outlining general spatial reasoning, focusing on post-training techniques, explainability, and architecture. Beyond classical 2D tasks, we examine spatial relationship reasoning, scene and layout understanding, as well as visual question answering and grounding in 3D space. We also review advances in embodied AI, including vision-language navigation and action models. Additionally, we consider emerging modalities such as audio and egocentric video, which contribute to novel spatial understanding through new sensors. We believe this survey establishes a solid foundation and offers insights into the growing field of multimodal spatial reasoning. Updated information about this survey, codes and implementation of the open benchmarks can be found at https://github.com/zhengxuJosh/Awesome-Spatial-Reasoning.

[184] FreeArt3D: Training-Free Articulated Object Generation using 3D Diffusion

Chuhao Chen, Isabella Liu, Xinyue Wei, Hao Su, Minghua Liu

Main category: cs.CV

TL;DR: FreeArt3D is a training-free framework for articulated 3D object generation that repurposes pre-trained static 3D diffusion models as shape priors, extending Score Distillation Sampling to handle articulation as an additional generative dimension.

Details

Motivation: Articulated 3D objects are crucial for robotics, AR/VR, and animation, but existing approaches either require dense-view supervision or produce coarse approximations without proper textures. While static 3D generation has advanced significantly, extending native 3D diffusion models to articulated objects presents major challenges.

Method: FreeArt3D extends Score Distillation Sampling (SDS) into the 3D-to-4D domain by treating articulation as an additional generative dimension. It uses a pre-trained static 3D diffusion model (e.g., Trellis) as a shape prior and jointly optimizes geometry, texture, and articulation parameters from a few images captured in different articulation states.

Result: The method generates high-fidelity geometry and textures, accurately predicts underlying kinematic structures, and generalizes well across diverse object categories. It completes in minutes and significantly outperforms prior state-of-the-art approaches in both quality and versatility.

Conclusion: FreeArt3D provides an effective training-free solution for articulated 3D generation that leverages existing static 3D diffusion models, avoiding the need for task-specific training or large-scale articulated datasets while achieving superior results.

Abstract: Articulated 3D objects are central to many applications in robotics, AR/VR, and animation. Recent approaches to modeling such objects either rely on optimization-based reconstruction pipelines that require dense-view supervision or on feed-forward generative models that produce coarse geometric approximations and often overlook surface texture. In contrast, open-world 3D generation of static objects has achieved remarkable success, especially with the advent of native 3D diffusion models such as Trellis. However, extending these methods to articulated objects by training native 3D diffusion models poses significant challenges. In this work, we present FreeArt3D, a training-free framework for articulated 3D object generation. Instead of training a new model on limited articulated data, FreeArt3D repurposes a pre-trained static 3D diffusion model (e.g., Trellis) as a powerful shape prior. It extends Score Distillation Sampling (SDS) into the 3D-to-4D domain by treating articulation as an additional generative dimension. Given a few images captured in different articulation states, FreeArt3D jointly optimizes the object’s geometry, texture, and articulation parameters without requiring task-specific training or access to large-scale articulated datasets. Our method generates high-fidelity geometry and textures, accurately predicts underlying kinematic structures, and generalizes well across diverse object categories. Despite following a per-instance optimization paradigm, FreeArt3D completes in minutes and significantly outperforms prior state-of-the-art approaches in both quality and versatility.

[185] VFXMaster: Unlocking Dynamic Visual Effect Generation via In-Context Learning

Baolu Li, Yiming Zhang, Qinghe Wang, Liqian Ma, Xiaoyu Shi, Xintao Wang, Pengfei Wan, Zhenfei Yin, Yunzhi Zhuge, Huchuan Lu, Xu Jia

Main category: cs.CV

TL;DR: VFXMaster is a unified, reference-based framework for VFX video generation that treats effect generation as an in-context learning task, enabling reproduction of diverse dynamic effects from reference videos to target content with strong generalization to unseen effects.

Details

Motivation: Current VFX generation methods rely on one-LoRA-per-effect paradigm, which is resource-intensive and cannot generalize to unseen effects, limiting scalability and creative potential.

Method: Uses in-context conditioning strategy with reference video prompts, in-context attention mask to decouple and inject effect attributes, and one-shot effect adaptation mechanism for rapid generalization to unseen effects.

Result: Effectively imitates various categories of effect information and demonstrates outstanding generalization to out-of-domain effects.

Conclusion: VFXMaster provides a unified solution for VFX video generation that overcomes limitations of previous methods and will release code, models, and dataset to advance future research.

Abstract: Visual effects (VFX) are crucial to the expressive power of digital media, yet their creation remains a major challenge for generative AI. Prevailing methods often rely on the one-LoRA-per-effect paradigm, which is resource-intensive and fundamentally incapable of generalizing to unseen effects, thus limiting scalability and creation. To address this challenge, we introduce VFXMaster, the first unified, reference-based framework for VFX video generation. It recasts effect generation as an in-context learning task, enabling it to reproduce diverse dynamic effects from a reference video onto target content. In addition, it demonstrates remarkable generalization to unseen effect categories. Specifically, we design an in-context conditioning strategy that prompts the model with a reference example. An in-context attention mask is designed to precisely decouple and inject the essential effect attributes, allowing a single unified model to master the effect imitation without information leakage. In addition, we propose an efficient one-shot effect adaptation mechanism to boost generalization capability on tough unseen effects from a single user-provided video rapidly. Extensive experiments demonstrate that our method effectively imitates various categories of effect information and exhibits outstanding generalization to out-of-domain effects. To foster future research, we will release our code, models, and a comprehensive dataset to the community.

[186] Functional correspondence by matrix completion

Artiom Kovnatsky, Michael M. Bronstein, Xavier Bresson, Pierre Vandergheynst

Main category: cs.CV

TL;DR: The paper presents a method for finding dense intrinsic correspondence between manifolds using functional framework, formulated as matrix completion with manifold geometric structure and L1 norm for functional localization.

Details

Motivation: To address the problem of finding dense intrinsic correspondence between manifolds, particularly in scenarios with scarce data availability.

Method: Poses functional correspondence as matrix completion with manifold geometric structure, uses L1 norm for functional localization, and develops efficient numerical procedures.

Result: The method achieves accuracy comparable to state-of-the-art correspondence algorithms on non-rigid shape matching benchmarks, with particular advantages in scarce data settings.

Conclusion: The proposed functional framework with matrix completion and L1 norm localization provides an effective approach for dense intrinsic correspondence, especially beneficial when limited data is available.

Abstract: In this paper, we consider the problem of finding dense intrinsic correspondence between manifolds using the recently introduced functional framework. We pose the functional correspondence problem as matrix completion with manifold geometric structure and inducing functional localization with the $L_1$ norm. We discuss efficient numerical procedures for the solution of our problem. Our method compares favorably to the accuracy of state-of-the-art correspondence algorithms on non-rigid shape matching benchmarks, and is especially advantageous in settings when only scarce data is available.

[187] Single Image Estimation of Cell Migration Direction by Deep Circular Regression

Lennart Bruns, Lucas Lamparter, Milos Galic, Xiaoyi Jiang

Main category: cs.CV

TL;DR: This paper presents a deep circular regression method for estimating cell migration direction from single images, achieving ~17° mean error, significantly better than previous classification-based approaches.

Details

Motivation: Existing methods using classification CNNs with four quadrants provide limited directional resolution, and single-image migration direction estimation enables new applications that weren't previously possible.

Method: The authors use deep circular regression with cycle-sensitive methods to estimate cell migration direction from single images, focusing on continuous directional resolution rather than discrete classification.

Result: On two common datasets, the method achieves a mean estimation error of approximately 17°, which is a significant improvement over previous work that reported errors of 30° and 34° respectively.

Conclusion: Deep circular regression with cycle-sensitive methods provides superior directional resolution for cell migration estimation from single images compared to classification-based approaches.

Abstract: In this paper, we address the problem of estimating the migration direction of cells based on a single image. A solution to this problem lays the foundation for a variety of applications that were previously not possible. To our knowledge, there is only one related work that employs a classification CNN with four classes (quadrants). However, this approach does not allow for detailed directional resolution. We tackle the single image estimation problem using deep circular regression, with a particular focus on cycle-sensitive methods. On two common datasets, we achieve a mean estimation error of $\sim!17^\circ$, representing a significant improvement over previous work, which reported estimation error of $30^\circ$ and $34^\circ$, respectively.

[188] U-DECN: End-to-End Underwater Object Detection ConvNet with Improved DeNoising Training

Zhuoyan Liu, Bo Wang, Bing Wang, Ye Li

Main category: cs.CV

TL;DR: U-DECN is a query-based end-to-end object detector designed for underwater environments, addressing deployment challenges and color cast noise with optimized ConvNet architecture and specialized denoising methods.

Details

Motivation: Underwater object detection requires fast, efficient detectors for embedded devices, but existing methods (NMS-based detectors and transformer architectures) are not deployment-friendly. Color cast noise in underwater environments complicates network designs.

Method: Integrates DETR variants into DECO with ConvNet encoder-decoder. Uses Deformable Convolution in SIM, Separate Contrastive DeNoising Forward, and Underwater Color DeNoising Query to handle color cast noise and improve generalization.

Result: Achieves 64.0 AP on DUO and 58.1 AP on RUOD datasets. Runs at 21 FPS on NVIDIA AGX Orin (5x faster than Deformable DETR and DINO). Outperforms other state-of-the-art query-based detectors.

Conclusion: U-DECN successfully addresses underwater deployment challenges and color cast noise, providing efficient, high-performance object detection suitable for embedded underwater devices.

Abstract: Underwater object detection has higher requirements of running speed and deployment efficiency for the detector due to its specific environmental challenges. NMS of two- or one-stage object detectors and transformer architecture of query-based end-to-end object detectors are not conducive to deployment on underwater embedded devices with limited processing power. As for the detrimental effect of underwater color cast noise, recent underwater object detectors make network architecture or training complex, which also hinders their application and deployment on unmanned underwater vehicles. In this paper, we propose the Underwater DECO with improved deNoising training (U-DECN), the query-based end-to-end object detector (with ConvNet encoder-decoder architecture) for underwater color cast noise that addresses the above problems. We integrate advanced technologies from DETR variants into DECO and design optimization methods specifically for the ConvNet architecture, including Deformable Convolution in SIM and Separate Contrastive DeNoising Forward methods. To address the underwater color cast noise issue, we propose an Underwater Color DeNoising Query method to improve the generalization of the model for the biased object feature information by different color cast noise. Our U-DECN, with ResNet-50 backbone, achieves the best 64.0 AP on DUO and the best 58.1 AP on RUOD, and 21 FPS (5 times faster than Deformable DETR and DINO 4 FPS) on NVIDIA AGX Orin by TensorRT FP16, outperforming the other state-of-the-art query-based end-to-end object detectors. The code is available at https://github.com/LEFTeyex/U-DECN.

[189] ScribbleVS: Scribble-Supervised Medical Image Segmentation via Dynamic Competitive Pseudo Label Selection

Tao Wang, Xinlin Zhang, Zhenxuan Zhang, Yuanbo Zhou, Yuanbin Chen, Longxuan Zhao, Chaohui Xu, Shun Chen, Guang Yang, Tong Tong

Main category: cs.CV

TL;DR: ScribbleVS is a framework that uses scribble annotations for medical image segmentation, achieving results comparable to fully supervised models through noise-resistant pseudo-labeling techniques.

Details

Motivation: Medical image segmentation requires expensive pixel-level annotations. Scribble annotations are more cost-effective but challenging to use for training reliable models due to sparse supervision and noise in pseudo-labels.

Method: ScribbleVS framework with Regional Pseudo Labels Diffusion Module to expand supervision scope and reduce noise impact, plus Dynamic Competitive Selection module for refined pseudo-label selection.

Result: Experiments on ACDC, MSCMRseg, WORD, and BraTS2020 datasets show promising results with segmentation precision comparable to fully supervised models.

Conclusion: ScribbleVS effectively addresses the challenges of scribble-based medical image segmentation, providing a cost-effective alternative to fully supervised methods while maintaining high precision.

Abstract: In clinical medicine, precise image segmentation can provide substantial support to clinicians. However, obtaining high-quality segmentation typically demands extensive pixel-level annotations, which are labor-intensive and expensive. Scribble annotations offer a more cost-effective alternative by improving labeling efficiency. Nonetheless, using such sparse supervision for training reliable medical image segmentation models remains a significant challenge. Some studies employ pseudo-labeling to enhance supervision, but these methods are susceptible to noise interference. To address these challenges, we introduce ScribbleVS, a framework designed to learn from scribble annotations. We introduce a Regional Pseudo Labels Diffusion Module to expand the scope of supervision and reduce the impact of noise present in pseudo labels. Additionally, we introduce a Dynamic Competitive Selection module for enhanced refinement in selecting pseudo labels. Experiments conducted on the ACDC, MSCMRseg, WORD, and BraTS2020 datasets demonstrate promising results, achieving segmentation precision comparable to fully supervised models. The codes of this study are available at https://github.com/ortonwang/ScribbleVS.

[190] Physics Context Builders: A Modular Framework for Physical Reasoning in Vision-Language Models

Vahid Balazadeh, Mohammadmehdi Ataei, Hyunmin Cheong, Amir Hosein Khasahmadi, Rahul G. Krishnan

Main category: cs.CV

TL;DR: Physics Context Builders (PCBs) is a modular framework that uses specialized smaller VLMs to generate physical scene descriptions, enhancing larger VLMs’ physical reasoning capabilities without expensive continual fine-tuning.

Details

Motivation: VLMs struggle with physical reasoning due to inability to translate learned knowledge into predictions about physical behavior. Continual fine-tuning is expensive and impractical for large models.

Method: Fine-tune specialized smaller VLMs to generate detailed physical scene descriptions, then use these as physical contexts to enhance reasoning of larger VLMs. This separates visual perception from reasoning.

Result: PCBs provide substantial performance improvements, increasing average accuracy by up to 13.8% on complex physical reasoning tasks. They also show strong Sim2Real transfer, generalizing from simulated to real-world scenes.

Conclusion: PCBs offer a modular and scalable solution for teaching VLMs about physical reasoning, enabling performance improvements and successful transfer from simulation to real-world scenarios.

Abstract: Physical reasoning remains a significant challenge for Vision-Language Models (VLMs). This limitation arises from an inability to translate learned knowledge into predictions about physical behavior. Although continual fine-tuning can mitigate this issue, it is expensive for large models and impractical to perform repeatedly for every task. This necessitates the creation of modular and scalable ways to teach VLMs about physical reasoning. To that end, we introduce Physics Context Builders (PCBs), a modular framework where specialized smaller VLMs are fine-tuned to generate detailed physical scene descriptions. These can be used as physical contexts to enhance the reasoning capabilities of larger VLMs. PCBs enable the separation of visual perception from reasoning, allowing us to analyze their relative contributions to physical understanding. We perform experiments on CLEVRER and on Falling Tower, a stability detection dataset with both simulated and real-world scenes, to demonstrate that PCBs provide substantial performance improvements, increasing average accuracy by up to 13.8% on complex physical reasoning tasks. Notably, PCBs also show strong Sim2Real transfer, successfully generalizing from simulated training data to real-world scenes.

[191] Evaluation of Safety Cognition Capability in Vision-Language Models for Autonomous Driving

Enming Zhang, Peizhe Gong, Xingyuan Dai, Min Huang, Yisheng Lv, Qinghai Miao

Main category: cs.CV

TL;DR: SCD-Bench is a novel framework for evaluating safety cognition capabilities of vision-language models in autonomous driving, featuring automated assessment and a large training dataset that improves model performance.

Details

Motivation: Existing research focuses on conventional benchmarks rather than safety-critical evaluation for vision-language models in autonomous driving systems, creating a gap in safety cognition assessment.

Method: Developed SCD-Bench framework with ADA (semi-automated labeling system) refined by domain experts, automated LLM-based assessment pipeline, and created SCD-Training dataset with 324.35K samples.

Result: Automated assessment pipeline achieved over 98% agreement with human expert judgments. Models trained on SCD-Training showed significant improvements on both SCD-Bench and general/domain-specific benchmarks.

Conclusion: The framework offers a new perspective for enhancing safety-aware interactions in vision-language systems for autonomous driving, addressing the critical alignment challenge between VLMs and safety cognition.

Abstract: Ensuring the safety of vision-language models (VLMs) in autonomous driving systems is of paramount importance, yet existing research has largely focused on conventional benchmarks rather than safety-critical evaluation. In this work, we present SCD-Bench (Safety Cognition Driving Benchmark) a novel framework specifically designed to assess the safety cognition capabilities of VLMs within interactive driving scenarios. To address the scalability challenge of data annotation, we introduce ADA (Autonomous Driving Annotation), a semi-automated labeling system, further refined through expert review by professionals with domain-specific knowledge in autonomous driving. To facilitate scalable and consistent evaluation, we also propose an automated assessment pipeline leveraging large language models, which demonstrates over 98% agreement with human expert judgments. In addressing the broader challenge of aligning VLMs with safety cognition in driving environments, we construct SCD-Training, the first large-scale dataset tailored for this task, comprising 324.35K high-quality samples. Through extensive experiments, we show that models trained on SCD-Training exhibit marked improvements not only on SCD-Bench, but also on general and domain-specific benchmarks, offering a new perspective on enhancing safety-aware interactions in vision-language systems for autonomous driving.

[192] Simulating Automotive Radar with Lidar and Camera Inputs

Peili Song, Dezhen Song, Yifan Yang, Enfan Lan, Jingtai Liu

Main category: cs.CV

TL;DR: A new method to simulate 4D millimeter wave radar signals using camera images, lidar point clouds, and ego-velocity, enabling high-fidelity radar signal generation for autonomous driving research.

Details

Motivation: The lack of quality datasets for low-cost millimeter automotive radar hinders research and development in autonomous driving, especially for adverse weather and lighting conditions.

Method: Uses two neural networks: DIS-Net to estimate spatial distribution and number of radar signals, and RSS-Net to predict radar signal strength based on appearance and geometric information from camera images and lidar point clouds.

Result: Successfully generated high-fidelity radar signals tested on open datasets from 3 commercial automotive radar models. Data augmentation with synthesized radar improved object detection neural network performance compared to using only raw radar data.

Conclusion: The method shows promise for facilitating future radar-based research and development in autonomous driving by enabling radar signal simulation and data augmentation.

Abstract: Low-cost millimeter automotive radar has received more and more attention due to its ability to handle adverse weather and lighting conditions in autonomous driving. However, the lack of quality datasets hinders research and development. We report a new method that is able to simulate 4D millimeter wave radar signals including pitch, yaw, range, and Doppler velocity along with radar signal strength (RSS) using camera image, light detection and ranging (lidar) point cloud, and ego-velocity. The method is based on two new neural networks: 1) DIS-Net, which estimates the spatial distribution and number of radar signals, and 2) RSS-Net, which predicts the RSS of the signal based on appearance and geometric information. We have implemented and tested our method using open datasets from 3 different models of commercial automotive radar. The experimental results show that our method can successfully generate high-fidelity radar signals. Moreover, we have trained a popular object detection neural network with data augmented by our synthesized radar. The network outperforms the counterpart trained only on raw radar data, a promising result to facilitate future radar-based research and development.

[193] Open3D-VQA: A Benchmark for Comprehensive Spatial Reasoning with Multimodal Large Language Model in Open Space

Weichen Zhang, Zile Zhou, Xin Zeng, Xuchen Liu, Jianjie Fang, Chen Gao, Yong Li, Jinqiang Cui, Xinlei Chen, Xiao-Ping Zhang

Main category: cs.CV

TL;DR: Open3D-VQA is a novel benchmark for evaluating multimodal large language models’ spatial reasoning from aerial perspectives, featuring 73k QA pairs across 7 tasks and revealing key insights about model performance differences.

Details

Motivation: Spatial reasoning is fundamental for MLLMs but their performance in open aerial environments remains underexplored, necessitating a specialized benchmark.

Method: Created a benchmark with 73k QA pairs spanning 7 spatial reasoning tasks, automatically generated from spatial relations extracted from real-world and simulated aerial scenes, supporting both visual and point cloud modalities.

Result: Evaluation of 13 MLLMs showed: 1) Better performance on relative spatial relations than absolute distances, 2) 3D LLMs don’t significantly outperform 2D LLMs, 3) Fine-tuning on simulated data improves real-world performance.

Conclusion: The benchmark reveals important limitations in current MLLMs’ aerial spatial reasoning and provides tools for future research advancement.

Abstract: Spatial reasoning is a fundamental capability of multimodal large language models (MLLMs), yet their performance in open aerial environments remains underexplored. In this work, we present Open3D-VQA, a novel benchmark for evaluating MLLMs’ ability to reason about complex spatial relationships from an aerial perspective. The benchmark comprises 73k QA pairs spanning 7 general spatial reasoning tasks, including multiple-choice, true/false, and short-answer formats, and supports both visual and point cloud modalities. The questions are automatically generated from spatial relations extracted from both real-world and simulated aerial scenes. Evaluation on 13 popular MLLMs reveals that: 1) Models are generally better at answering questions about relative spatial relations than absolute distances, 2) 3D LLMs fail to demonstrate significant advantages over 2D LLMs, and 3) Fine-tuning solely on the simulated dataset can significantly improve the model’s spatial reasoning performance in real-world scenarios. We release our benchmark, data generation pipeline, and evaluation toolkit to support further research: https://github.com/EmbodiedCity/Open3D-VQA.code.

[194] L2RSI: Cross-view LiDAR-based Place Recognition for Large-scale Urban Scenes via Remote Sensing Imagery

Ziwei Shi, Xiaoran Zhang, Wenjing Xu, Yan Xia, Yu Zang, Siqi Shen, Cheng Wang

Main category: cs.CV

TL;DR: L2RSI enables LiDAR place recognition using remote sensing imagery as map proxies, achieving 83.27% accuracy within 30m radius for top-1 retrieval in 100km² range without prior 3D maps.

Details

Motivation: To overcome the high cost and time requirements of traditional LiDAR-based place recognition that depends on prior 3D maps, by leveraging readily available overhead remote sensing imagery as cost-effective map proxies.

Method: Proposes L2RSI method that learns feature alignment between LiDAR point cloud submaps and remote sensing submaps in semantic domain, and introduces probability propagation based on particle estimation to refine position predictions using temporal and spatial information.

Result: On LiRSI-XA dataset (110K remote sensing submaps, 13K LiDAR submaps), achieves 83.27% accuracy within 30m radius for top-1 retrieved location in 100km² retrieval range, enabling large-scale retrieval and cross-scene generalization without fine-tuning.

Conclusion: L2RSI provides a cost-effective solution for large-scale LiDAR place recognition by using remote sensing imagery as map proxies, demonstrating strong performance in cross-view and cross-modal localization with practical deployment capabilities.

Abstract: We tackle the challenge of LiDAR-based place recognition, which traditionally depends on costly and time-consuming prior 3D maps. To overcome this, we first construct LiRSI-XA dataset, which encompasses approximately $110,000$ remote sensing submaps and $13,000$ LiDAR point cloud submaps captured in urban scenes, and propose a novel method, L2RSI, for cross-view LiDAR place recognition using high-resolution Remote Sensing Imagery. This approach enables large-scale localization capabilities at a reduced cost by leveraging readily available overhead images as map proxies. L2RSI addresses the dual challenges of cross-view and cross-modal place recognition by learning feature alignment between point cloud submaps and remote sensing submaps in the semantic domain. Additionally, we introduce a novel probability propagation method based on particle estimation to refine position predictions, effectively leveraging temporal and spatial information. This approach enables large-scale retrieval and cross-scene generalization without fine-tuning. Extensive experiments on LiRSI-XA demonstrate that, within a $100km^2$ retrieval range, L2RSI accurately localizes $83.27%$ of point cloud submaps within a $30m$ radius for top-$1$ retrieved location. Our project page is publicly available at https://shizw695.github.io/L2RSI/.

[195] DGTRSD & DGTRS-CLIP: A Dual-Granularity Remote Sensing Image-Text Dataset and Vision Language Foundation Model for Alignment

Weizhi Chen, Yupeng Deng, Jin Wei, Jingbo Chen, Jiansheng Chen, Yuman Feng, Zhihao Xi, Diyou Liu, Kai Li, Yu Meng

Main category: cs.CV

TL;DR: DGTRS-CLIP is a dual-granularity curriculum learning framework that combines short and long text supervision for better remote sensing image-text alignment, outperforming existing methods on zero-shot tasks.

Details

Motivation: Existing CLIP-based vision language models for remote sensing rely on short text captions that provide incomplete semantic representations, while long captions are difficult to process due to limited text-encoding capacity and lack of aligned datasets.

Method: Proposed DGTRSD dataset with dual-granularity image-text pairs (short + long captions) and DGTRS-CLIP framework using curriculum learning to combine both text granularities for semantic alignment.

Result: Extensive experiments on four zero-shot tasks (long/short text cross-modal retrieval, image classification, semantic localization) show DGTRS-CLIP consistently outperforms existing methods.

Conclusion: The dual-granularity approach with curriculum learning effectively addresses the limitations of single-granularity text supervision in remote sensing vision-language models.

Abstract: Vision Language Foundation Models based on CLIP architecture for remote sensing primarily rely on short text captions, which often result in incomplete semantic representations. Although longer captions convey richer information, existing models struggle to process them effectively because of limited text-encoding capacity, and there remains a shortage of resources that align remote sensing images with both short text and long text captions. To address this gap, we introduce DGTRSD, a dual-granularity remote sensing image-text dataset, where each image is paired with both a short text caption and a long text description, providing a solid foundation for dual-granularity semantic modeling. Based on this, we further propose DGTRS-CLIP, a dual-granularity curriculum learning framework that combines short text and long text supervision to achieve dual-granularity semantic alignment. Extensive experiments on four typical zero-shot tasks: long text cross-modal retrieval, short text cross-modal retrieval, image classification, and semantic localization demonstrate that DGTRS-CLIP consistently outperforms existing methods across all tasks. The code has been open-sourced and is available at https://github.com/MitsuiChen14/DGTRS.

[196] DPMambaIR: All-in-One Image Restoration via Degradation-Aware Prompt State Space Model

Zhanwen Liu, Sai Zhou, Yuchao Dai, Yang Wang, Yisheng An, Xiangmo Zhao

Main category: cs.CV

TL;DR: DPMambaIR is a novel All-in-One image restoration framework that uses fine-grained degradation extraction and a Degradation-Aware Prompt State Space Model to handle multiple degradation types in a single model, achieving state-of-the-art performance.

Details

Motivation: Existing All-in-One image restoration approaches lack fine-grained modeling of degradation information and struggle with balancing multi-task conflicts, limiting their effectiveness across diverse degradation types.

Method: Proposes DPMambaIR framework with: 1) Fine-grained degradation extractor to capture detailed degradation features, 2) Degradation-Aware Prompt State Space Model (DP-SSM) that incorporates degradation features as dynamic prompts into state space modeling, 3) Complementary High-Frequency Enhancement Block (HEB) to recover local high-frequency details.

Result: Achieved best performance on mixed dataset with seven degradation types: 27.69dB PSNR and 0.893 SSIM, demonstrating superior restoration quality compared to existing methods.

Conclusion: DPMambaIR shows significant potential as a unified solution for All-in-One image restoration, effectively handling diverse degradation types through fine-grained degradation modeling and enhanced state space processing.

Abstract: All-in-One image restoration aims to address multiple image degradation problems using a single model, offering a more practical and versatile solution compared to designing dedicated models for each degradation type. Existing approaches typically rely on Degradation-specific models or coarse-grained degradation prompts to guide image restoration. However, they lack fine-grained modeling of degradation information and face limitations in balancing multi-task conflicts. To overcome these limitations, we propose DPMambaIR, a novel All-in-One image restoration framework that introduces a fine-grained degradation extractor and a Degradation-Aware Prompt State Space Model (DP-SSM). The DP-SSM leverages the fine-grained degradation features captured by the extractor as dynamic prompts, which are then incorporated into the state space modeling process. This enhances the model’s adaptability to diverse degradation types, while a complementary High-Frequency Enhancement Block (HEB) recovers local high-frequency details. Extensive experiments on a mixed dataset containing seven degradation types show that DPMambaIR achieves the best performance, with 27.69dB and 0.893 in PSNR and SSIM, respectively. These results highlight the potential and superiority of DPMambaIR as a unified solution for All-in-One image restoration.

[197] MagicPortrait: Temporally Consistent Face Reenactment with 3D Geometric Guidance

Mengting Wei, Yante Li, Tuomas Varanka, Yan Jiang, Guoying Zhao

Main category: cs.CV

TL;DR: A video face reenactment method that integrates 3D face parametric model (FLAME) with latent diffusion framework to improve shape consistency and motion control in face generation.

Details

Motivation: To address limitations in existing video-based face generation approaches by improving shape consistency and motion control through 3D parametric modeling.

Method: Uses FLAME 3D face model to extract motion features and preserve face geometry. Enhances latent diffusion model with depth maps, normal maps, and rendering maps from FLAME sequences through a Geometric Guidance Encoder (GGE). Employs multi-layer feature fusion with self-attention mechanisms.

Result: Generates high-quality face animations with precise expression and head pose variation modeling. Shows strong generalization performance on out-of-domain images.

Conclusion: The integration of 3D parametric face modeling with diffusion framework successfully achieves improved shape consistency and motion control in video face reenactment.

Abstract: In this study, we propose a method for video face reenactment that integrates a 3D face parametric model into a latent diffusion framework, aiming to improve shape consistency and motion control in existing video-based face generation approaches. Our approach employs the FLAME (Faces Learned with an Articulated Model and Expressions) model as the 3D face parametric representation, providing a unified framework for modeling face expressions and head pose. This not only enables precise extraction of motion features from driving videos, but also contributes to the faithful preservation of face shape and geometry. Specifically, we enhance the latent diffusion model with rich 3D expression and detailed pose information by incorporating depth maps, normal maps, and rendering maps derived from FLAME sequences. These maps serve as motion guidance and are encoded into the denoising UNet through a specifically designed Geometric Guidance Encoder (GGE). A multi-layer feature fusion module with integrated self-attention mechanisms is used to combine facial appearance and motion latent features within the spatial domain. By utilizing the 3D face parametric model as motion guidance, our method enables parametric alignment of face identity between the reference image and the motion captured from the driving video. Experimental results on benchmark datasets show that our method excels at generating high-quality face animations with precise expression and head pose variation modeling. In addition, it demonstrates strong generalization performance on out-of-domain images. Code is publicly available at https://github.com/weimengting/MagicPortrait.

[198] Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning

Yibin Wang, Zhimin Li, Yuhang Zang, Chunyu Wang, Qinglin Lu, Cheng Jin, Jiaqi Wang

Main category: cs.CV

TL;DR: UnifiedReward-Think is a multimodal reward model that incorporates explicit long chains of thought reasoning to improve reward signal reliability and robustness for vision tasks.

Details

Motivation: Current multimodal reward models provide limited reasoning depth, leading to inaccurate reward signals. The authors believe that incorporating explicit long chains of thought reasoning can significantly strengthen reliability and robustness.

Method: Three-stage approach: (1) Use GPT-4o reasoning distillation for cold start learning of CoT format, (2) Prepare large-scale multimodal preference data to elicit reasoning across vision tasks, (3) Use GRPO-based reinforcement fine-tuning with correct samples for refinement and incorrect samples for exploration.

Result: Extensive experiments demonstrate the superiority of the model across various vision reward tasks.

Conclusion: The proposed UnifiedReward-Think model successfully incorporates explicit long chains of thought reasoning, improving both direct response accuracy and overall reward signal reliability for multimodal vision tasks.

Abstract: Recent advances in multimodal Reward Models (RMs) have shown significant promise in delivering reward signals to align vision models with human preferences. However, current RMs are generally restricted to providing direct responses or engaging in shallow reasoning processes with limited depth, often leading to inaccurate reward signals. We posit that incorporating explicit long chains of thought (CoT) into the reward reasoning process can significantly strengthen their reliability and robustness. Furthermore, we believe that once RMs internalize CoT reasoning, their direct response accuracy can also be improved through implicit reasoning capabilities. To this end, this paper proposes UnifiedReward-Think, the first unified multimodal CoT-based reward model, capable of multi-dimensional, step-by-step long-chain reasoning for both visual understanding and generation reward tasks. Specifically, we adopt an exploration-driven reinforcement fine-tuning approach to elicit and incentivize the model’s latent complex reasoning ability: (1) We first use a small amount of image generation preference data to distill the reasoning process of GPT-4o, which is then used for the model’s cold start to learn the format and structure of CoT reasoning. (2) Subsequently, by leveraging the model’s prior knowledge and generalization capabilities, we prepare large-scale unified multimodal preference data to elicit the model’s reasoning process across various vision tasks. During this phase, correct reasoning outputs are retained for rejection sampling to refine the model (3) while incorrect predicted samples are finally used for Group Relative Policy Optimization (GRPO) based reinforcement fine-tuning, enabling the model to explore diverse reasoning paths and optimize for correct and robust solutions. Extensive experiments across various vision reward tasks demonstrate the superiority of our model.

[199] FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving

Shuang Zeng, Xinyuan Chang, Mengwei Xie, Xinran Liu, Yifan Bai, Zheng Pan, Mu Xu, Xing Wei

Main category: cs.CV

TL;DR: FSDrive introduces a visual spatio-temporal chain-of-thought framework that enables Vision-Language-Action models to think in images rather than text, improving autonomous driving planning by generating future frames with physical priors.

Details

Motivation: Current VLAs use textual chains-of-thought that blur spatio-temporal relations and discard fine visual cues, creating a cross-modal gap between perception and planning. Symbolic compressions lose important visual information needed for safe driving.

Method: Proposes FSDrive with unified pre-training that expands vocabulary to include visual tokens. Uses progressive easy-to-hard scheme: first predicts lane/box priors for physical constraints, then completes full future frames. Model acts as world model to generate unified future frames with physical priors, then as inverse-dynamics model for trajectory planning.

Result: Improves trajectory accuracy and reduces collisions on nuScenes and NAVSIM under ST-P3 and UniAD metrics. Achieves competitive FID for future-frame generation despite lightweight autoregression. Advances scene understanding on DriveLM.

Conclusion: Visual chain-of-thought narrows the cross-modal gap and yields safer, more anticipatory planning in autonomous driving systems.

Abstract: Vision-Language-Action (VLA) models are increasingly used for end-to-end driving due to their world knowledge and reasoning ability. Most prior work, however, inserts textual chains-of-thought (CoT) as intermediate steps tailored to the current scene. Such symbolic compressions can blur spatio-temporal relations and discard fine visual cues, creating a cross-modal gap between perception and planning. We propose FSDrive, a visual spatio-temporal CoT framework that enables VLAs to think in images. The model first acts as a world model to generate a unified future frame that overlays coarse but physically-plausible priors-future lane dividers and 3D boxes-on the predicted future image. This unified frame serves as the visual CoT, capturing both spatial structure and temporal evolution. The same VLA then functions as an inverse-dynamics model, planning trajectories from current observations and the visual CoT. To equip VLAs with image generation while preserving understanding, we introduce a unified pre-training paradigm that expands the vocabulary to include visual tokens and jointly optimizes VQA (for semantics) and future-frame prediction (for dynamics). A progressive easy-to-hard scheme first predicts lane/box priors to enforce physical constraints, then completes full future frames for fine details. On nuScenes and NAVSIM, FSDrive improves trajectory accuracy and reduces collisions under both ST-P3 and UniAD metrics, and attains competitive FID for future-frame generation despite using lightweight autoregression. It also advances scene understanding on DriveLM. Together, these results indicate that visual CoT narrows the cross-modal gap and yields safer, more anticipatory planning. Code is available at https://github.com/MIV-XJTU/FSDrive.

[200] InfoChartQA: A Benchmark for Multimodal Question Answering on Infographic Charts

Tianchi Xie, Minzhi Lin, Mengchen Liu, Yilin Ye, Changjian Chen, Shixia Liu

Main category: cs.CV

TL;DR: InfoChartQA is a new benchmark for evaluating multimodal LLMs on infographic chart understanding, featuring 5,642 paired infographic and plain charts with visual-element-based questions that reveal significant performance gaps.

Details

Motivation: Existing visual-question answering benchmarks lack paired plain charts and visual-element-based questions needed to properly evaluate MLLMs' capabilities in understanding infographic charts with design elements like pictograms and icons.

Method: Created InfoChartQA benchmark with 5,642 pairs of infographic and plain charts sharing the same underlying data but different visual presentations, plus visual-element-based questions to capture unique visual designs and communicative intent.

Result: Evaluation of 20 MLLMs shows substantial performance decline on infographic charts, especially for visual-element-based questions related to metaphors. Paired charts enable fine-grained error analysis.

Conclusion: InfoChartQA highlights new opportunities for advancing MLLMs in infographic chart understanding and reveals current limitations in handling visual design elements and metaphors.

Abstract: Understanding infographic charts with design-driven visual elements (e.g., pictograms, icons) requires both visual recognition and reasoning, posing challenges for multimodal large language models (MLLMs). However, existing visual-question answering benchmarks fall short in evaluating these capabilities of MLLMs due to the lack of paired plain charts and visual-element-based questions. To bridge this gap, we introduce InfoChartQA, a benchmark for evaluating MLLMs on infographic chart understanding. It includes 5,642 pairs of infographic and plain charts, each sharing the same underlying data but differing in visual presentations. We further design visual-element-based questions to capture their unique visual designs and communicative intent. Evaluation of 20 MLLMs reveals a substantial performance decline on infographic charts, particularly for visual-element-based questions related to metaphors. The paired infographic and plain charts enable fine-grained error analysis and ablation studies, which highlight new opportunities for advancing MLLMs in infographic chart understanding. We release InfoChartQA at https://github.com/CoolDawnAnt/InfoChartQA.

[201] HF-VTON: High-Fidelity Virtual Try-On via Consistent Geometric and Semantic Alignment

Ming Meng, Qi Dong, Jiajie Li, Zhe Zhu, Xingyu Wang, Zhaoxin Fan, Wei Zhao, Wenjun Wu

Main category: cs.CV

TL;DR: HF-VTON is a novel virtual try-on framework that addresses pose consistency challenges through three specialized modules for spatial alignment, semantic representation, and appearance generation, achieving state-of-the-art performance on multiple datasets.

Details

Motivation: Existing virtual try-on methods struggle with maintaining consistency across different poses, suffering from geometric distortions, semantic mismatches in garment structure/texture, and loss of fine-grained details that reduce visual fidelity.

Method: HF-VTON uses three key modules: (1) APWAM for pose-aware garment alignment and spatial consistency, (2) SRCM for capturing fine-grained garment attributes and multi-pose semantic representation, and (3) MPAGM for multimodal feature integration and prior-guided appearance generation. Also introduces SAMP-VTONS dataset with multi-pose pairs and textual annotations.

Result: HF-VTON outperforms state-of-the-art methods on both VITON-HD and SAMP-VTONS datasets, demonstrating superior performance in visual fidelity, semantic consistency, and detail preservation across diverse poses.

Conclusion: The proposed HF-VTON framework effectively addresses pose consistency challenges in virtual try-on through its three-module architecture and comprehensive dataset, achieving high-fidelity results that maintain both geometric and semantic consistency across different poses.

Abstract: Virtual try-on technology has become increasingly important in the fashion and retail industries, enabling the generation of high-fidelity garment images that adapt seamlessly to target human models. While existing methods have achieved notable progress, they still face significant challenges in maintaining consistency across different poses. Specifically, geometric distortions lead to a lack of spatial consistency, mismatches in garment structure and texture across poses result in semantic inconsistency, and the loss or distortion of fine-grained details diminishes visual fidelity. To address these challenges, we propose HF-VTON, a novel framework that ensures high-fidelity virtual try-on performance across diverse poses. HF-VTON consists of three key modules: (1) the Appearance-Preserving Warp Alignment Module (APWAM), which aligns garments to human poses, addressing geometric deformations and ensuring spatial consistency; (2) the Semantic Representation and Comprehension Module (SRCM), which captures fine-grained garment attributes and multi-pose data to enhance semantic representation, maintaining structural, textural, and pattern consistency; and (3) the Multimodal Prior-Guided Appearance Generation Module (MPAGM), which integrates multimodal features and prior knowledge from pre-trained models to optimize appearance generation, ensuring both semantic and geometric consistency. Additionally, to overcome data limitations in existing benchmarks, we introduce the SAMP-VTONS dataset, featuring multi-pose pairs and rich textual annotations for a more comprehensive evaluation. Experimental results demonstrate that HF-VTON outperforms state-of-the-art methods on both VITON-HD and SAMP-VTONS, excelling in visual fidelity, semantic consistency, and detail preservation.

[202] Re-ttention: Ultra Sparse Visual Generation via Attention Statistical Reshape

Ruichen Chen, Keith G. Mills, Liyao Jiang, Chao Gao, Di Niu

Main category: cs.CV

TL;DR: Re-ttention is a sparse attention method for Diffusion Transformers that achieves high sparsity (3.1% tokens) while maintaining visual quality by leveraging temporal redundancy and reshaping attention scores based on prior softmax distribution history.

Details

Motivation: Attention mechanism in DiTs has quadratic complexity with resolution and video length, creating computational bottlenecks. Existing sparse attention techniques fail to preserve visual quality at high sparsity levels and may add computational overhead.

Method: Re-ttention leverages temporal redundancy in Diffusion Models to overcome probabilistic normalization shift. It reshapes attention scores based on prior softmax distribution history to maintain visual quality at very high sparsity levels.

Result: Experimental results show Re-ttention requires only 3.1% of tokens during inference, outperforming contemporary methods like FastDiTAttn, Sparse VideoGen and MInference on T2V/T2I models such as CogVideoX and PixArt DiTs.

Conclusion: Re-ttention successfully enables very high sparse attention for visual generation models while preserving the visual quality of full quadratic attention, addressing the computational bottleneck in Diffusion Transformers.

Abstract: Diffusion Transformers (DiT) have become the de-facto model for generating high-quality visual content like videos and images. A huge bottleneck is the attention mechanism where complexity scales quadratically with resolution and video length. One logical way to lessen this burden is sparse attention, where only a subset of tokens or patches are included in the calculation. However, existing techniques fail to preserve visual quality at extremely high sparsity levels and might even incur non-negligible compute overheads. To address this concern, we propose Re-ttention, which implements very high sparse attention for visual generation models by leveraging the temporal redundancy of Diffusion Models to overcome the probabilistic normalization shift within the attention mechanism. Specifically, Re-ttention reshapes attention scores based on the prior softmax distribution history in order to preserve the visual quality of the full quadratic attention at very high sparsity levels. Experimental results on T2V/T2I models such as CogVideoX and the PixArt DiTs demonstrate that Re-ttention requires as few as 3.1% of the tokens during inference, outperforming contemporary methods like FastDiTAttn, Sparse VideoGen and MInference.

[203] Explicitly Modeling Subcortical Vision with a Neuro-Inspired Front-End Improves CNN Robustness

Lucas Piper, Arlindo L. Oliveira, Tiago Marques

Main category: cs.CV

TL;DR: EVNets combine V1-mimicking VOneBlock with novel SubcorticalBlock to improve CNN robustness and biological alignment without explicit optimization.

Details

Motivation: Standard CNNs remain vulnerable to visual perturbations and domain shifts compared to biological vision, despite high task performance.

Method: Developed Early Vision Networks (EVNets) - hybrid CNNs combining VOneBlock (primate V1 mimic) with SubcorticalBlock parameterized from neuroscience models to align with subcortical responses.

Result: EVNets improved V1 alignment, better modeled extra-classical receptive fields, showed stronger shape bias, and outperformed base CNN by 9.3% on robustness benchmarks. When combined with data augmentation, surpassed augmentation-only approach by 6.2%.

Conclusion: Architectural changes mimicking biology and training-based approaches provide complementary benefits for improving neural network robustness.

Abstract: Convolutional neural networks (CNNs) trained on object recognition achieve high task performance but continue to exhibit vulnerability under a range of visual perturbations and out-of-domain images, when compared with biological vision. Prior work has demonstrated that coupling a standard CNN with a front-end (VOneBlock) that mimics the primate primary visual cortex (V1) can improve overall model robustness. Expanding on this, we introduce Early Vision Networks (EVNets), a new class of hybrid CNNs that combine the VOneBlock with a novel SubcorticalBlock, whose architecture draws from computational models in neuroscience and is parameterized to maximize alignment with subcortical responses reported across multiple experimental studies. Without being optimized to do so, the assembly of the SubcorticalBlock with the VOneBlock improved V1 alignment across most standard V1 benchmarks, and better modeled extra-classical receptive field phenomena. In addition, EVNets exhibit stronger emergent shape bias and outperform the base CNN architecture by 9.3% on an aggregate benchmark of robustness evaluations, including adversarial perturbations, common corruptions, and domain shifts. Finally, we show that EVNets can be further improved when paired with a state-of-the-art data augmentation technique, surpassing the performance of the isolated data augmentation approach by 6.2% on our robustness benchmark. This result reveals complementary benefits between changes in architecture to better mimic biology and training-based machine learning approaches.

[204] DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO

Jinyoung Park, Jeehye Na, Jinyoung Kim, Hyunwoo J. Kim

Main category: cs.CV

TL;DR: The paper proposes DeepVideo-R1, a video LLM trained with Reg-GRPO and difficulty-aware data augmentation to address limitations of GRPO in video reasoning tasks.

Details

Motivation: GRPO has shown success in enhancing reasoning capabilities of LLMs, but its effectiveness in VideoLLMs is understudied, with identified problems including reliance on safeguards and vanishing advantage.

Method: Proposes Reg-GRPO which reformulates GRPO loss as regression to predict advantages directly, eliminating safeguards, and uses difficulty-aware data augmentation to create diverse reward signals.

Result: Experimental results show significant improvement in video reasoning performance across multiple benchmarks.

Conclusion: The proposed approach effectively addresses GRPO limitations in video reasoning tasks and enhances VideoLLM performance.

Abstract: Recent works have demonstrated the effectiveness of reinforcement learning (RL)-based post-training for enhancing the reasoning capabilities of large language models (LLMs). In particular, Group Relative Policy Optimization (GRPO) has shown impressive success using a PPO-style reinforcement algorithm with group-normalized rewards. However, the effectiveness of GRPO in Video Large Language Models (VideoLLMs) has still been less studyed. In this paper, we explore GRPO and identify two problems that deteriorate the effective learning: (1) reliance on safeguards, and (2) vanishing advantage. To mitigate these challenges, we propose DeepVideo-R1, a video large language model trained with Reg-GRPO (Regressive GRPO) and difficulty-aware data augmentation. Reg-GRPO reformulates the GRPO loss function into a regression task that directly predicts the advantage in GRPO, eliminating the need for safeguards such as the clipping and min functions. It directly aligns the model with advantages, providing guidance to prefer better ones. The difficulty-aware data augmentation strategy augments input prompts/videos to locate the difficulty of samples at solvable difficulty levels, enabling diverse reward signals. Our experimental results show that our approach significantly improves video reasoning performance across multiple benchmarks.

[205] HAIF-GS: Hierarchical and Induced Flow-Guided Gaussian Splatting for Dynamic Scene

Jianing Chen, Zehao Li, Yujun Cai, Hao Jiang, Chengxuan Qian, Juyuan Kang, Shuqin Gao, Honglong Zhao, Tianlu Mao, Yucheng Zhang

Main category: cs.CV

TL;DR: HAIF-GS is a dynamic 3D scene reconstruction framework that uses sparse anchor-driven deformation to achieve structured and temporally consistent motion modeling from monocular videos, addressing limitations in existing 3D Gaussian Splatting methods.

Details

Motivation: Extending 3D Gaussian Splatting to dynamic scenes is challenging due to redundant Gaussian updates, insufficient motion supervision, and weak modeling of complex non-rigid deformations, which hinder coherent and efficient dynamic reconstruction.

Method: Proposes a unified framework with three key components: Anchor Filter to identify motion-relevant regions, self-supervised Induced Flow-Guided Deformation module for anchor motion, and Hierarchical Anchor Propagation mechanism for fine-grained deformations based on motion complexity.

Result: Extensive experiments on synthetic and real-world benchmarks show HAIF-GS significantly outperforms prior dynamic 3DGS methods in rendering quality, temporal coherence, and reconstruction efficiency.

Conclusion: HAIF-GS successfully addresses the core challenges in dynamic 3D scene reconstruction by providing structured and consistent motion modeling through sparse anchor-driven deformation, achieving superior performance across multiple metrics.

Abstract: Reconstructing dynamic 3D scenes from monocular videos remains a fundamental challenge in 3D vision. While 3D Gaussian Splatting (3DGS) achieves real-time rendering in static settings, extending it to dynamic scenes is challenging due to the difficulty of learning structured and temporally consistent motion representations. This challenge often manifests as three limitations in existing methods: redundant Gaussian updates, insufficient motion supervision, and weak modeling of complex non-rigid deformations. These issues collectively hinder coherent and efficient dynamic reconstruction. To address these limitations, we propose HAIF-GS, a unified framework that enables structured and consistent dynamic modeling through sparse anchor-driven deformation. It first identifies motion-relevant regions via an Anchor Filter to suppress redundant updates in static areas. A self-supervised Induced Flow-Guided Deformation module induces anchor motion using multi-frame feature aggregation, eliminating the need for explicit flow labels. To further handle fine-grained deformations, a Hierarchical Anchor Propagation mechanism increases anchor resolution based on motion complexity and propagates multi-level transformations. Extensive experiments on synthetic and real-world benchmarks validate that HAIF-GS significantly outperforms prior dynamic 3DGS methods in rendering quality, temporal coherence, and reconstruction efficiency.

[206] FOCUS: Internal MLLM Representations for Efficient Fine-Grained Visual Question Answering

Liangyu Zhong, Fabio Rosenthal, Joachim Sicking, Fabian Hüger, Thorsten Bagdonat, Hanno Gottschalk, Leo Schwinn

Main category: cs.CV

TL;DR: FOCUS is a training-free visual cropping method that uses MLLM-internal representations to efficiently find relevant image regions for fine-grained VQA, outperforming existing methods in accuracy and efficiency.

Details

Motivation: Current visual cropping methods for fine-grained VQA have limitations including need for task-specific fine-tuning, inefficient exhaustive search, and incompatibility with efficient attention implementations.

Method: Four-step approach: identify target objects in VQA prompt, compute object relevance map using KV cache, propose and rank relevant image regions, perform VQA using top-ranked region.

Result: Achieves strong performance across 4 fine-grained VQA datasets and 3 MLLM types, outperforms 3 popular visual cropping methods in accuracy and efficiency, matches ZoomEye performance with 3-6.5x less compute.

Conclusion: FOCUS provides an effective training-free solution for fine-grained VQA that leverages MLLM-internal representations to guide efficient region search.

Abstract: While Multimodal Large Language Models (MLLMs) offer strong perception and reasoning capabilities for image-text input, Visual Question Answering (VQA) focusing on small image details still remains a challenge. Although visual cropping techniques seem promising, recent approaches have several limitations: the need for task-specific fine-tuning, low efficiency due to uninformed exhaustive search, or incompatibility with efficient attention implementations. We address these shortcomings by proposing a training-free visual cropping method, dubbed FOCUS, that leverages MLLM-internal representations to guide the search for the most relevant image region. This is accomplished in four steps: first, we identify the target object(s) in the VQA prompt; second, we compute an object relevance map using the key-value (KV) cache; third, we propose and rank relevant image regions based on the map; and finally, we perform the fine-grained VQA task using the top-ranked region. As a result of this informed search strategy, FOCUS achieves strong performance across four fine-grained VQA datasets and three types of MLLMs. It outperforms three popular visual cropping methods in both accuracy and efficiency, and matches the best-performing baseline, ZoomEye, while requiring 3 - 6.5 x less compute.

[207] MILo: Mesh-In-the-Loop Gaussian Splatting for Detailed and Efficient Surface Reconstruction

Antoine Guédon, Diego Gomez, Nissim Maruani, Bingchen Gong, George Drettakis, Maks Ovsjanikov

Main category: cs.CV

TL;DR: MILo is a novel Gaussian Splatting framework that differentiably extracts surface meshes directly from 3D Gaussians during training, enabling high-quality mesh reconstruction with significantly fewer vertices than previous methods.

Details

Motivation: Current Gaussian Splatting methods require costly post-processing to extract surface meshes, resulting in loss of fine geometric details, dense meshes with millions of vertices, and limitations in preserving geometric structures captured during training.

Method: MILo introduces a fully differentiable procedure that constructs mesh (vertices and connectivity) at every iteration directly from Gaussian parameters. Key contributions include: bidirectional consistency framework, adaptive mesh extraction using Gaussians as differentiable pivots for Delaunay triangulation, and novel signed distance computation from 3D Gaussians.

Result: The approach reconstructs complete scenes with state-of-the-art quality while requiring an order of magnitude fewer mesh vertices than previous methods. The resulting meshes are lightweight with empty interiors, making them suitable for downstream applications like physics simulations and animation.

Conclusion: MILo successfully bridges the gap between volumetric and surface representations in Gaussian Splatting by enabling differentiable mesh extraction during training, preserving geometric details while producing efficient, high-quality meshes for practical applications.

Abstract: While recent advances in Gaussian Splatting have enabled fast reconstruction of high-quality 3D scenes from images, extracting accurate surface meshes remains a challenge. Current approaches extract the surface through costly post-processing steps, resulting in the loss of fine geometric details or requiring significant time and leading to very dense meshes with millions of vertices. More fundamentally, the a posteriori conversion from a volumetric to a surface representation limits the ability of the final mesh to preserve all geometric structures captured during training. We present MILo, a novel Gaussian Splatting framework that bridges the gap between volumetric and surface representations by differentiably extracting a mesh from the 3D Gaussians. We design a fully differentiable procedure that constructs the mesh-including both vertex locations and connectivity-at every iteration directly from the parameters of the Gaussians, which are the only quantities optimized during training. Our method introduces three key technical contributions: a bidirectional consistency framework ensuring both representations-Gaussians and the extracted mesh-capture the same underlying geometry during training; an adaptive mesh extraction process performed at each training iteration, which uses Gaussians as differentiable pivots for Delaunay triangulation; a novel method for computing signed distance values from the 3D Gaussians that enables precise surface extraction while avoiding geometric erosion. Our approach can reconstruct complete scenes, including backgrounds, with state-of-the-art quality while requiring an order of magnitude fewer mesh vertices than previous methods. Due to their light weight and empty interior, our meshes are well suited for downstream applications such as physics simulations or animation.

[208] Diverse Teaching and Label Propagation for Generic Semi-Supervised Medical Image Segmentation

Wei Li, Pengcheng Zhou, Linye Ma, Wenyi Zhao, Huihua Yang, Yuchen Guo

Main category: cs.CV

TL;DR: DTLP-Net is a generic framework for semi-supervised medical image segmentation that addresses limited annotation and domain shift through diverse teacher models and label propagation, achieving state-of-the-art performance across SSMIS, UMDA, and Semi-MDG tasks.

Details

Motivation: To overcome the challenges of limited annotation and domain shift in medical image segmentation, which lead to suboptimal performance in semi-supervised scenarios, and to develop a unified framework that can handle multiple related tasks instead of task-specific solutions.

Method: Uses DTLP-Net with one student model and two diverse teacher models. First teacher decouples training with labeled/unlabeled data, second teacher is momentum-updated for diverse pseudo-labels. Employs inter-sample and intra-sample data augmentation, and label propagation for voxel-level correlations.

Result: Achieves notable improvements compared to state-of-the-art methods across five benchmark datasets for SSMIS, UMDA, and Semi-MDG tasks, demonstrating superior performance in all settings.

Conclusion: The proposed framework shows strong potential for tackling challenging semi-supervised learning scenarios in medical image segmentation by effectively generating reliable pseudo-labels and increasing model diversity.

Abstract: Both limited annotation and domain shift are significant challenges frequently encountered in medical image segmentation, leading to derivative scenarios like semi-supervised medical (SSMIS), semi-supervised medical domain generalization (Semi-MDG) and unsupervised medical domain adaptation (UMDA). Conventional methods are generally tailored to specific tasks in isolation, the error accumulation hinders the effective utilization of unlabeled data and limits further improvements, resulting in suboptimal performance when these issues occur. In this paper, we aim to develop a generic framework that masters all three tasks. We found that the key to solving the problem lies in how to generate reliable pseudo labels for the unlabeled data in the presence of domain shift with labeled data and increasing the diversity of the model. To tackle this issue, we employ a Diverse Teaching and Label Propagation Network (DTLP-Net) to boosting the Generic Semi-Supervised Medical Image Segmentation. Our DTLP-Net involves a single student model and two diverse teacher models, which can generate reliable pseudo-labels for the student model. The first teacher model decouple the training process with labeled and unlabeled data, The second teacher is momentum-updated periodically, thus generating reliable yet divers pseudo-labels. To fully utilize the information within the data, we adopt inter-sample and intra-sample data augmentation to learn the global and local knowledge. In addition, to further capture the voxel-level correlations, we propose label propagation to enhance the model robust. We evaluate our proposed framework on five benchmark datasets for SSMIS, UMDA, and Semi-MDG tasks. The results showcase notable improvements compared to state-of-the-art methods across all five settings, indicating the potential of our framework to tackle more challenging SSL scenarios.

[209] InstDrive: Instance-Aware 3D Gaussian Splatting for Driving Scenes

Hongyuan Liu, Haochen Yu, Bochao Zou, Jianfei Jiang, Qiankun Liu, Jiansheng Chen, Huimin Ma

Main category: cs.CV

TL;DR: InstDrive is an instance-aware 3D Gaussian Splatting framework for dynamic driving scene reconstruction that achieves 3D instance segmentation without pre-processed instance IDs or complex pipelines.

Details

Motivation: Current methods unify all background elements into single representations, hindering instance-level understanding and flexible scene editing. Existing approaches rely on pre-processed instance IDs or complex pipelines and are designed for indoor scenes, making them less applicable to outdoor driving scenarios.

Method: Uses SAM-generated masks as pseudo ground-truth to guide 2D feature learning via contrastive loss and pseudo-supervised objectives. Introduces regularization to implicitly encode instance identities and enforces consistency through voxel-based loss. Uses lightweight static codebook to bridge continuous features and discrete identities without data pre-processing.

Result: Quantitative and qualitative experiments demonstrate effectiveness. First framework to achieve 3D instance segmentation in dynamic, open-world driving scenes.

Conclusion: InstDrive successfully addresses the limitations of existing methods by providing instance-aware reconstruction for dynamic driving scenes without requiring complex pre-processing or optimization pipelines.

Abstract: Reconstructing dynamic driving scenes from dashcam videos has attracted increasing attention due to its significance in autonomous driving and scene understanding. While recent advances have made impressive progress, most methods still unify all background elements into a single representation, hindering both instance-level understanding and flexible scene editing. Some approaches attempt to lift 2D segmentation into 3D space, but often rely on pre-processed instance IDs or complex pipelines to map continuous features to discrete identities. Moreover, these methods are typically designed for indoor scenes with rich viewpoints, making them less applicable to outdoor driving scenarios. In this paper, we present InstDrive, an instance-aware 3D Gaussian Splatting framework tailored for the interactive reconstruction of dynamic driving scene. We use masks generated by SAM as pseudo ground-truth to guide 2D feature learning via contrastive loss and pseudo-supervised objectives. At the 3D level, we introduce regularization to implicitly encode instance identities and enforce consistency through a voxel-based loss. A lightweight static codebook further bridges continuous features and discrete identities without requiring data pre-processing or complex optimization. Quantitative and qualitative experiments demonstrate the effectiveness of InstDrive, and to the best of our knowledge, it is the first framework to achieve 3D instance segmentation in dynamic, open-world driving scenes.More visualizations are available at our project page.

[210] Depth-Aware Super-Resolution via Distance-Adaptive Variational Formulation

Tianhao Guo, Bingjie Lu, Feng Wang, Zhengyang Lu

Main category: cs.CV

TL;DR: A distance-adaptive super-resolution framework that addresses spatially-varying degradation in real-world imaging through variational formulation with depth-dependent spectral analysis and neural implementation.

Details

Motivation: Traditional super-resolution assumes spatially-invariant degradation, but real-world imaging exhibits complex distance-dependent effects like atmospheric scattering and depth-of-field variations, requiring spatially-adaptive reconstruction strategies.

Method: Variational framework with spatially-varying inverse problem formulation using pseudodifferential operators with distance-dependent spectral characteristics. Neural architecture implements discrete gradient flow dynamics with depth-conditional convolution kernels and learned distance-adaptive regularization.

Result: State-of-the-art performance with 36.89/0.9516 and 30.54/0.8721 PSNR/SSIM at 2× and 4× scales on KITTI outdoor scenes, outperforming existing methods by 0.44dB and 0.36dB respectively.

Conclusion: Establishes the first theoretically-grounded distance-adaptive super-resolution framework with significant improvements on depth-variant scenarios while maintaining competitive performance on traditional benchmarks.

Abstract: Single image super-resolution traditionally assumes spatially-invariant degradation models, yet real-world imaging systems exhibit complex distance-dependent effects including atmospheric scattering, depth-of-field variations, and perspective distortions. This fundamental limitation necessitates spatially-adaptive reconstruction strategies that explicitly incorporate geometric scene understanding for optimal performance. We propose a rigorous variational framework that characterizes super-resolution as a spatially-varying inverse problem, formulating the degradation operator as a pseudodifferential operator with distance-dependent spectral characteristics that enable theoretical analysis of reconstruction limits across depth ranges. Our neural architecture implements discrete gradient flow dynamics through cascaded residual blocks with depth-conditional convolution kernels, ensuring convergence to stationary points of the theoretical energy functional while incorporating learned distance-adaptive regularization terms that dynamically adjust smoothness constraints based on local geometric structure. Spectral constraints derived from atmospheric scattering theory prevent bandwidth violations and noise amplification in far-field regions, while adaptive kernel generation networks learn continuous mappings from depth to reconstruction filters. Comprehensive evaluation across five benchmark datasets demonstrates state-of-the-art performance, achieving 36.89/0.9516 and 30.54/0.8721 PSNR/SSIM at 2 and 4 scales on KITTI outdoor scenes, outperforming existing methods by 0.44dB and 0.36dB respectively. This work establishes the first theoretically-grounded distance-adaptive super-resolution framework and demonstrates significant improvements on depth-variant scenarios while maintaining competitive performance across traditional benchmarks.

[211] SignMouth: Leveraging Mouthing Cues for Sign Language Translation by Multimodal Contrastive Fusion

Wenfang Wu, Tingting Yuan, Yupeng Li, Daling Wang, Xiaoming Fu

Main category: cs.CV

TL;DR: SignClip improves sign language translation by fusing manual (hand gestures) and non-manual (lip movements) cues using hierarchical contrastive learning with multi-level alignment.

Details

Motivation: Most existing sign language translation approaches focus only on manual signals and overlook non-manual cues like mouthing, which convey essential linguistic information and help disambiguate visually similar signs.

Method: Proposes SignClip framework that fuses spatial gesture and lip movement features, and introduces hierarchical contrastive learning with multi-level alignment objectives across sign-lip and visual-text modalities.

Result: On PHOENIX14T dataset in Gloss-free setting, SignClip surpasses previous state-of-the-art SpaMo, improving BLEU-4 from 24.32 to 24.71 and ROUGE from 46.57 to 48.38. Similar improvements shown on How2Sign dataset.

Conclusion: Fusing manual and non-manual cues with hierarchical contrastive learning significantly improves sign language translation accuracy, demonstrating the importance of incorporating mouthing information for better SLT performance.

Abstract: Sign language translation (SLT) aims to translate natural language from sign language videos, serving as a vital bridge for inclusive communication. While recent advances leverage powerful visual backbones and large language models, most approaches mainly focus on manual signals (hand gestures) and tend to overlook non-manual cues like mouthing. In fact, mouthing conveys essential linguistic information in sign languages and plays a crucial role in disambiguating visually similar signs. In this paper, we propose SignClip, a novel framework to improve the accuracy of sign language translation. It fuses manual and non-manual cues, specifically spatial gesture and lip movement features. Besides, SignClip introduces a hierarchical contrastive learning framework with multi-level alignment objectives, ensuring semantic consistency across sign-lip and visual-text modalities. Extensive experiments on two benchmark datasets, PHOENIX14T and How2Sign, demonstrate the superiority of our approach. For example, on PHOENIX14T, in the Gloss-free setting, SignClip surpasses the previous state-of-the-art model SpaMo, improving BLEU-4 from 24.32 to 24.71, and ROUGE from 46.57 to 48.38.

[212] VLCE: A Knowledge-Enhanced Framework for Image Description in Disaster Assessment

Md. Mahfuzur Rahman, Kishor Datta Gupta, Marufa Kamal, Fahad Rahman, Sunzida Siddique, Ahmed Rafi Hasan, Mohd Ariful Haque, Roy George

Main category: cs.CV

TL;DR: VLCE is a multimodal system that generates comprehensive disaster damage descriptions from satellite and UAV imagery using dual CNN-LSTM and Vision Transformer architectures with external semantic knowledge.

Details

Motivation: Traditional manual damage assessment after disasters is slow and dangerous, while current computer vision methods only provide limited classification/segmentation outputs without comprehensive situational understanding.

Method: Dual-architecture approach: CNN-LSTM with ResNet50 backbone for satellite imagery (xBD dataset) and Vision Transformer for UAV imagery (RescueNet dataset), enhanced with external semantic knowledge from ConceptNet and WordNet.

Result: VLCE significantly outperforms baseline models (LLaVA and QwenVL), achieving up to 95.33% on InfoMetIC for caption informativeness while maintaining competitive semantic alignment measured by CLIPScore.

Conclusion: The dual-architecture system shows significant potential for improving disaster damage assessment by automating the generation of actionable, information-dense descriptions from aerial imagery.

Abstract: Immediate damage assessment is essential after natural catastrophes; yet, conventional hand evaluation techniques are sluggish and perilous. Although satellite and unmanned aerial vehicle (UAV) photos offer extensive perspectives of impacted regions, current computer vision methodologies generally yield just classification labels or segmentation masks, so constraining their capacity to deliver a thorough situational comprehension. We introduce the Vision Language Caption Enhancer (VLCE), a multimodal system designed to produce comprehensive, contextually-informed explanations of disaster imagery. VLCE employs a dual-architecture approach: a CNN-LSTM model with a ResNet50 backbone pretrained on EuroSat satellite imagery for the xBD dataset, and a Vision Transformer (ViT) model pretrained on UAV pictures for the RescueNet dataset. Both systems utilize external semantic knowledge from ConceptNet and WordNet to expand vocabulary coverage and improve description accuracy. We assess VLCE in comparison to leading vision-language models (LLaVA and QwenVL) utilizing CLIPScore for semantic alignment and InfoMetIC for caption informativeness. Experimental findings indicate that VLCE markedly surpasses baseline models, attaining a maximum of 95.33% on InfoMetIC while preserving competitive semantic alignment. Our dual-architecture system demonstrates significant potential for improving disaster damage assessment by automating the production of actionable, information-dense descriptions from satellite and drone photos.

[213] Graph-Theoretic Consistency for Robust and Topology-Aware Semi-Supervised Histopathology Segmentation

Ha-Hieu Pham, Minh Le, Han Huynh, Nguyen Quoc Khanh Le, Huy-Hieu Pham

Main category: cs.CV

TL;DR: TGC is a semi-supervised semantic segmentation framework that uses graph-theoretic constraints to enforce global topology, improving segmentation accuracy in computational pathology with limited supervision.

Details

Motivation: Existing semi-supervised methods rely on pixel-level consistency, which propagates noisy pseudo-labels and produces fragmented or topologically invalid masks in computational pathology where dense annotations are costly.

Method: Proposes Topology Graph Consistency (TGC) framework that integrates graph-theoretic constraints by aligning Laplacian spectra, component counts, and adjacency statistics between prediction graphs and references.

Result: Achieves state-of-the-art performance on GlaS and CRAG datasets under 5-10% supervision and significantly narrows the gap to full supervision.

Conclusion: TGC effectively enforces global topology and improves segmentation accuracy in semi-supervised settings for computational pathology.

Abstract: Semi-supervised semantic segmentation (SSSS) is vital in computational pathology, where dense annotations are costly and limited. Existing methods often rely on pixel-level consistency, which propagates noisy pseudo-labels and produces fragmented or topologically invalid masks. We propose Topology Graph Consistency (TGC), a framework that integrates graph-theoretic constraints by aligning Laplacian spectra, component counts, and adjacency statistics between prediction graphs and references. This enforces global topology and improves segmentation accuracy. Experiments on GlaS and CRAG demonstrate that TGC achieves state-of-the-art performance under 5-10% supervision and significantly narrows the gap to full supervision.

[214] Activation Matching for Explanation Generation

Pirzada Suhail, Aditya Anand, Amit Sethi

Main category: cs.CV

TL;DR: An activation-matching approach generates minimal binary masks that preserve both model predictions and intermediate activations, providing faithful explanations for classifier decisions.

Details

Motivation: To create minimal, faithful explanations for pretrained classifiers that preserve both the model's prediction and internal activations while being human-interpretable.

Method: Train a lightweight autoencoder to output binary masks using multi-layer activation matching (KL divergence + cross-entropy), mask priors (L1 area, binarization penalty, total variation), and abductive constraints for faithfulness.

Result: Produces small, crisp binary masks that retain classifier behavior while discarding irrelevant regions, yielding practical minimalist explanations.

Conclusion: The proposed activation-matching framework successfully generates minimal yet faithful explanations that preserve model decision-making while being interpretable to humans.

Abstract: In this paper we introduce an activation-matching–based approach to generate minimal, faithful explanations for the decision-making of a pretrained classifier on any given image. Given an input image $x$ and a frozen model $f$, we train a lightweight autoencoder to output a binary mask $m$ such that the explanation $e = m \odot x$ preserves both the model’s prediction and the intermediate activations of (x). Our objective combines: (i) multi-layer activation matching with KL divergence to align distributions and cross-entropy to retain the top-1 label for both the image and the explanation; (ii) mask priors – L1 area for minimality, a binarization penalty for crisp 0/1 masks, and total variation for compactness; and (iii) abductive constraints for faithfulness and necessity. Together, these objectives yield small, human-interpretable masks that retain classifier behavior while discarding irrelevant input regions, providing practical and faithful minimalist explanations for the decision making of the underlying model.

[215] MambaCAFU: Hybrid Multi-Scale and Multi-Attention Model with Mamba-Based Fusion for Medical Image Segmentation

T-Mai Bui, Fares Bougourzi, Fadi Dornaika, Vinh Truong Hoang

Main category: cs.CV

TL;DR: Proposes a hybrid medical image segmentation architecture with three-branch encoder (CNNs, Transformers, Mamba-based Attention Fusion) and multi-scale attention-based CNN decoder to capture local, global, and long-range dependencies while maintaining computational efficiency.

Details

Motivation: Address limitations of existing task-specific models with varying performance across modalities and anatomical regions, and the challenge of balancing model complexity and performance in clinical settings where both accuracy and efficiency are critical.

Method: Hybrid architecture with three-branch encoder integrating CNNs, Transformers, and Mamba-based Attention Fusion (MAF) mechanism; multi-scale attention-based CNN decoder; co-attention gate for enhanced feature selection across scales during encoding and decoding.

Result: Extensive experiments on multiple benchmark datasets show the approach outperforms state-of-the-art methods in accuracy and generalization while maintaining comparable computational complexity.

Conclusion: The architecture effectively balances efficiency and effectiveness, offering a practical and scalable solution for diverse medical imaging tasks, with source code and models to be publicly released.

Abstract: In recent years, deep learning has shown near-expert performance in segmenting complex medical tissues and tumors. However, existing models are often task-specific, with performance varying across modalities and anatomical regions. Balancing model complexity and performance remains challenging, particularly in clinical settings where both accuracy and efficiency are critical. To address these issues, we propose a hybrid segmentation architecture featuring a three-branch encoder that integrates CNNs, Transformers, and a Mamba-based Attention Fusion (MAF) mechanism to capture local, global, and long-range dependencies. A multi-scale attention-based CNN decoder reconstructs fine-grained segmentation maps while preserving contextual consistency. Additionally, a co-attention gate enhances feature selection by emphasizing relevant spatial and semantic information across scales during both encoding and decoding, improving feature interaction and cross-scale communication. Extensive experiments on multiple benchmark datasets show that our approach outperforms state-of-the-art methods in accuracy and generalization, while maintaining comparable computational complexity. By effectively balancing efficiency and effectiveness, our architecture offers a practical and scalable solution for diverse medical imaging tasks. Source code and trained models will be publicly released upon acceptance to support reproducibility and further research.

[216] Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models

Yolo Yunlong Tang, Jing Bi, Pinxin Liu, Zhenyu Pan, Zhangyun Tan, Qianxiang Shen, Jiani Liu, Hang Hua, Junjia Guo, Yunzhong Xiao, Chao Huang, Zhiyuan Wang, Susan Liang, Xinyi Liu, Yizhi Song, Junhua Huang, Jia-Xing Zhong, Bozheng Li, Daiqing Qi, Ziyun Zeng, Ali Vosoughi, Luchuan Song, Zeliang Zhang, Daiki Shimada, Han Liu, Jiebo Luo, Chenliang Xu

Main category: cs.CV

TL;DR: This survey provides the first comprehensive examination of post-training methodologies for Video-Large Multimodal Models (Video-LMMs), covering supervised fine-tuning, reinforcement learning, and test-time scaling techniques.

Details

Motivation: Video understanding is challenging due to complex spatiotemporal relationships and long-term dependencies. While Video-LMMs show promise, their post-training phase remains fragmented in literature, limiting their transformation from basic perception to sophisticated reasoning systems.

Method: The survey examines three fundamental post-training pillars: supervised fine-tuning (SFT) with chain-of-thought, reinforcement learning (RL) from verifiable objectives, and test-time scaling (TTS) through enhanced inference computation. It presents a structured taxonomy addressing video-specific challenges.

Result: The survey synthesizes key design principles, insights, and evaluation protocols while identifying critical open challenges in reward design, scalability, and cost-performance optimization. It also curates essential benchmarks, datasets, and metrics for rigorous assessment.

Conclusion: This survey provides researchers and practitioners with a unified framework for advancing Video-LMM capabilities through systematic post-training methodologies, addressing unique video understanding challenges and facilitating future development in the field.

Abstract: Video understanding represents the most challenging frontier in computer vision, requiring models to reason about complex spatiotemporal relationships, long-term dependencies, and multimodal evidence. The recent emergence of Video-Large Multimodal Models (Video-LMMs), which integrate visual encoders with powerful decoder-based language models, has demonstrated remarkable capabilities in video understanding tasks. However, the critical phase that transforms these models from basic perception systems into sophisticated reasoning engines, post-training, remains fragmented across the literature. This survey provides the first comprehensive examination of post-training methodologies for Video-LMMs, encompassing three fundamental pillars: supervised fine-tuning (SFT) with chain-of-thought, reinforcement learning (RL) from verifiable objectives, and test-time scaling (TTS) through enhanced inference computation. We present a structured taxonomy that clarifies the roles, interconnections, and video-specific adaptations of these techniques, addressing unique challenges such as temporal localization, spatiotemporal grounding, long video efficiency, and multimodal evidence integration. Through systematic analysis of representative methods, we synthesize key design principles, insights, and evaluation protocols while identifying critical open challenges in reward design, scalability, and cost-performance optimization. We further curate essential benchmarks, datasets, and metrics to facilitate rigorous assessment of post-training effectiveness. This survey aims to provide researchers and practitioners with a unified framework for advancing Video-LMM capabilities. Additional resources and updates are maintained at: https://github.com/yunlong10/Awesome-Video-LMM-Post-Training

[217] Pixel-Perfect Depth with Semantics-Prompted Diffusion Transformers

Gangwei Xu, Haotong Lin, Hongcheng Luo, Xianqi Wang, Jingfeng Yao, Lianghui Zhu, Yuechuan Pu, Cheng Chi, Haiyang Sun, Bing Wang, Guang Chen, Hangjun Ye, Sida Peng, Xin Yang

Main category: cs.CV

TL;DR: Pixel-Perfect Depth is a monocular depth estimation model that uses pixel-space diffusion generation to create high-quality depth maps without flying pixel artifacts, outperforming existing generative models on multiple benchmarks.

Details

Motivation: Current generative depth estimation models use VAE compression which introduces flying pixels at edges and details, degrading the quality of depth maps and point clouds.

Method: Uses pixel-space diffusion generation instead of latent space, with two key designs: Semantics-Prompted Diffusion Transformers (SP-DiT) that incorporate semantic representations, and Cascade DiT Design that progressively increases tokens for efficiency.

Result: Achieves best performance among all published generative models across five benchmarks, and significantly outperforms all other models in edge-aware point cloud evaluation.

Conclusion: Pixel-space diffusion generation with semantic prompting and cascade design effectively eliminates flying pixel artifacts and produces superior depth estimation results compared to latent-space approaches.

Abstract: This paper presents Pixel-Perfect Depth, a monocular depth estimation model based on pixel-space diffusion generation that produces high-quality, flying-pixel-free point clouds from estimated depth maps. Current generative depth estimation models fine-tune Stable Diffusion and achieve impressive performance. However, they require a VAE to compress depth maps into latent space, which inevitably introduces \textit{flying pixels} at edges and details. Our model addresses this challenge by directly performing diffusion generation in the pixel space, avoiding VAE-induced artifacts. To overcome the high complexity associated with pixel-space generation, we introduce two novel designs: 1) Semantics-Prompted Diffusion Transformers (SP-DiT), which incorporate semantic representations from vision foundation models into DiT to prompt the diffusion process, thereby preserving global semantic consistency while enhancing fine-grained visual details; and 2) Cascade DiT Design that progressively increases the number of tokens to further enhance efficiency and accuracy. Our model achieves the best performance among all published generative models across five benchmarks, and significantly outperforms all other models in edge-aware point cloud evaluation.

[218] Vision-Centric 4D Occupancy Forecasting and Planning via Implicit Residual World Models

Jianbiao Mei, Yu Yang, Xuemeng Yang, Licheng Wen, Jiajun Lv, Botian Shi, Yong Liu

Main category: cs.CV

TL;DR: IR-WM is an Implicit Residual World Model that focuses on modeling current state and world evolution by predicting only residual changes rather than full scene reconstruction, improving autonomous driving performance.

Details

Motivation: Current vision-centric world models in autonomous driving inefficiently reconstruct entire future scenes, wasting capacity on static backgrounds. The goal is to focus modeling resources on dynamic changes and world evolution.

Method: IR-WM first creates BEV representations from visual observations, uses previous BEV features as temporal priors, predicts only residual changes conditioned on ego-vehicle actions and scene context, and applies an alignment module to correct semantic and dynamic misalignments.

Result: On nuScenes benchmark, IR-WM achieves top performance in both 4D occupancy forecasting and trajectory planning. The implicit future state generated by world models substantially improves planning accuracy.

Conclusion: The proposed residual modeling approach effectively focuses computational resources on dynamic changes rather than static backgrounds, leading to superior performance in autonomous driving tasks.

Abstract: End-to-end autonomous driving systems increasingly rely on vision-centric world models to understand and predict their environment. However, a common ineffectiveness in these models is the full reconstruction of future scenes, which expends significant capacity on redundantly modeling static backgrounds. To address this, we propose IR-WM, an Implicit Residual World Model that focuses on modeling the current state and evolution of the world. IR-WM first establishes a robust bird’s-eye-view representation of the current state from the visual observation. It then leverages the BEV features from the previous timestep as a strong temporal prior and predicts only the “residual”, i.e., the changes conditioned on the ego-vehicle’s actions and scene context. To alleviate error accumulation over time, we further apply an alignment module to calibrate semantic and dynamic misalignments. Moreover, we investigate different forecasting-planning coupling schemes and demonstrate that the implicit future state generated by world models substantially improves planning accuracy. On the nuScenes benchmark, IR-WM achieves top performance in both 4D occupancy forecasting and trajectory planning.

[219] WaMaIR: Image Restoration via Multiscale Wavelet Convolutions and Mamba-based Channel Modeling with Texture Enhancement

Shengyu Zhu, Congyi Fan, Fuxuan Zhang

Main category: cs.CV

TL;DR: WaMaIR is a novel CNN-based image restoration framework that uses wavelet transforms to expand receptive fields and Mamba-based modules to capture long-range channel dependencies, achieving superior texture detail restoration with computational efficiency.

Details

Motivation: Previous CNN-based image restoration methods struggle with restoring fine texture details due to limited receptive fields and lack of channel feature modeling, which restricts their ability to capture comprehensive image features.

Method: Proposes three key components: 1) Global Multiscale Wavelet Transform Convolutions (GMWTConvs) to expand receptive field and preserve texture features, 2) Mamba-Based Channel-Aware Module (MCAM) to capture long-range dependencies within feature channels, and 3) Multiscale Texture Enhancement Loss (MTELoss) to guide texture structure preservation.

Result: Extensive experiments show WaMaIR outperforms state-of-the-art methods in image restoration quality while maintaining efficient computational performance.

Conclusion: WaMaIR effectively addresses the limitations of previous CNN-based methods by combining wavelet transforms for large receptive fields and Mamba-based modules for channel modeling, achieving superior texture detail restoration in image restoration tasks.

Abstract: Image restoration is a fundamental and challenging task in computer vision, where CNN-based frameworks demonstrate significant computational efficiency. However, previous CNN-based methods often face challenges in adequately restoring fine texture details, which are limited by the small receptive field of CNN structures and the lack of channel feature modeling. In this paper, we propose WaMaIR, which is a novel framework with a large receptive field for image perception and improves the reconstruction of texture details in restored images. Specifically, we introduce the Global Multiscale Wavelet Transform Convolutions (GMWTConvs) for expandding the receptive field to extract image features, preserving and enriching texture features in model inputs. Meanwhile, we propose the Mamba-Based Channel-Aware Module (MCAM), explicitly designed to capture long-range dependencies within feature channels, which enhancing the model sensitivity to color, edges, and texture information. Additionally, we propose Multiscale Texture Enhancement Loss (MTELoss) for image restoration to guide the model in preserving detailed texture structures effectively. Extensive experiments confirm that WaMaIR outperforms state-of-the-art methods, achieving better image restoration and efficient computational performance of the model.

Zelin Peng, Zhengqin Xu, Qingyang Liu, Xiaokang Yang, Wei Shen

Main category: cs.CV

TL;DR: HyperET is an efficient training paradigm for MLLMs that uses hyperbolic space to align visual and textual modalities at arbitrary granularity levels through dynamic radius adjustment, achieving significant improvements with minimal parameter overhead.

Details

Motivation: Current MLLMs require massive computational resources due to vision encoders like CLIP and SAM lacking multi-granularity alignment with language. Hyperbolic space naturally models hierarchical levels, providing a principled solution to bridge this granularity gap.

Method: HyperET optimizes visual representations using hyperbolic space with dynamic radius adjustment. It employs learnable matrices with Möbius multiplication operations through three configurations: diagonal scaling matrices, block-diagonal matrices, and banded matrices for flexible parametrization.

Result: Comprehensive experiments across multiple MLLM benchmarks show HyperET consistently improves both pre-training and fine-tuning MLLMs with less than 1% additional parameters.

Conclusion: HyperET provides an efficient training paradigm that effectively addresses the multi-granularity alignment challenge in MLLMs using hyperbolic space, achieving substantial performance gains with minimal computational overhead.

Abstract: Multi-modal large language models (MLLMs) have emerged as a transformative approach for aligning visual and textual understanding. They typically require extremely high computational resources (e.g., thousands of GPUs) for training to achieve cross-modal alignment at multi-granularity levels. We argue that a key source of this inefficiency lies in the vision encoders they widely equip with, e.g., CLIP and SAM, which lack the alignment with language at multi-granularity levels. To address this issue, in this paper, we leverage hyperbolic space, which inherently models hierarchical levels and thus provides a principled framework for bridging the granularity gap between visual and textual modalities at an arbitrary granularity level. Concretely, we propose an efficient training paradigm for MLLMs, dubbed as HyperET, which can optimize visual representations to align with their textual counterparts at an arbitrary granularity level through dynamic hyperbolic radius adjustment in hyperbolic space. HyperET employs learnable matrices with M"{o}bius multiplication operations, implemented via three effective configurations: diagonal scaling matrices, block-diagonal matrices, and banded matrices, providing a flexible yet efficient parametrization strategy. Comprehensive experiments across multiple MLLM benchmarks demonstrate that HyperET consistently improves both existing pre-training and fine-tuning MLLMs clearly with less than 1% additional parameters.

[221] NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation

Longtian Qiu, Shan Ning, Jiaxuan Sun, Xuming He

Main category: cs.CV

TL;DR: NoisyGRPO is a multimodal RL framework that improves Chain-of-Thought reasoning in MLLMs by injecting Gaussian noise into visual inputs and using Bayesian advantage estimation to enhance generalization and robustness.

Details

Motivation: Existing RL frameworks for improving general CoT reasoning in MLLMs struggle with generalization beyond training distributions, limiting their practical effectiveness.

Method: Proposes NoisyGRPO with two key components: (1) Noise-Injected Exploration Policy that perturbs visual inputs with Gaussian noise, and (2) Bayesian Advantage Estimation that formulates advantage estimation as Bayesian inference using noise level as prior and trajectory reward as likelihood.

Result: Experiments show NoisyGRPO substantially improves generalization and robustness on CoT quality, general capability, and hallucination benchmarks, especially with small-scale MLLMs like Qwen2.5-VL 3B.

Conclusion: NoisyGRPO effectively addresses generalization limitations in RL-based CoT reasoning by combining controlled noise injection with principled Bayesian advantage estimation, leading to more robust multimodal reasoning capabilities.

Abstract: Reinforcement learning (RL) has shown promise in enhancing the general Chain-of-Thought (CoT) reasoning capabilities of multimodal large language models (MLLMs). However, when applied to improve general CoT reasoning, existing RL frameworks often struggle to generalize beyond the training distribution. To address this, we propose NoisyGRPO, a systematic multimodal RL framework that introduces controllable noise into visual inputs for enhanced exploration and explicitly models the advantage estimation process via a Bayesian framework. Specifically, NoisyGRPO improves RL training by: (1) Noise-Injected Exploration Policy: Perturbing visual inputs with Gaussian noise to encourage exploration across a wider range of visual scenarios; and (2) Bayesian Advantage Estimation: Formulating advantage estimation as a principled Bayesian inference problem, where the injected noise level serves as a prior and the observed trajectory reward as the likelihood. This Bayesian modeling fuses both sources of information to compute a robust posterior estimate of trajectory advantage, effectively guiding MLLMs to prefer visually grounded trajectories over noisy ones. Experiments on standard CoT quality, general capability, and hallucination benchmarks demonstrate that NoisyGRPO substantially improves generalization and robustness, especially in RL settings with small-scale MLLMs such as Qwen2.5-VL 3B. The project page is available at https://artanic30.github.io/project_pages/NoisyGRPO/.

[222] PSScreen V2: Partially Supervised Multiple Retinal Disease Screening

Boyi Zheng, Yalin Zheng, Hrvoje Bogunović, Qing Liu

Main category: cs.CV

TL;DR: PSScreen V2 is a partially supervised self-training framework for multiple retinal disease screening that addresses label absence and domain shift using a three-branch architecture with novel feature augmentation strategies.

Details

Motivation: To overcome limitations of previous methods that require fully labelled datasets or work with single-domain data, by learning from multiple partially labelled datasets with different distributions.

Method: Three-branch architecture with teacher and two student networks. Teacher generates pseudo labels from weakly augmented images. Students use LF-Dropout (randomly discarding domain-related low-frequency components) and LF-Uncert (adversarially learned Gaussian perturbations of low-frequency statistics) for feature augmentation.

Result: Achieves state-of-the-art performance and superior domain generalization on multiple fundus datasets. Compatible with diverse backbones including DINOv2, and shows universality on chest X-ray datasets.

Conclusion: PSScreen V2 provides an effective framework for partially supervised learning with strong domain generalization capabilities, demonstrating broad applicability across medical imaging domains.

Abstract: In this work, we propose PSScreen V2, a partially supervised self-training framework for multiple retinal disease screening. Unlike previous methods that rely on fully labelled or single-domain datasets, PSScreen V2 is designed to learn from multiple partially labelled datasets with different distributions, addressing both label absence and domain shift challenges. To this end, PSScreen V2 adopts a three-branch architecture with one teacher and two student networks. The teacher branch generates pseudo labels from weakly augmented images to address missing labels, while the two student branches introduce novel feature augmentation strategies: Low-Frequency Dropout (LF-Dropout), which enhances domain robustness by randomly discarding domain-related low-frequency components, and Low-Frequency Uncertainty (LF-Uncert), which estimates uncertain domain variability via adversarially learned Gaussian perturbations of low-frequency statistics. Extensive experiments on multiple in-domain and out-of-domain fundus datasets demonstrate that PSScreen V2 achieves state-of-the-art performance and superior domain generalization ability. Furthermore, compatibility tests with diverse backbones, including the vision foundation model DINOv2, as well as evaluations on chest X-ray datasets, highlight the universality and adaptability of the proposed framework. The codes are available at https://github.com/boyiZheng99/PSScreen_V2.

[223] FastJAM: a Fast Joint Alignment Model for Images

Omri Hirsch, Ron Shapira Weber, Shira Ifergane, Oren Freifeld

Main category: cs.CV

TL;DR: FastJAM is a rapid graph-based method for joint image alignment that reduces computational complexity from hours/minutes to seconds while achieving better alignment quality than existing methods.

Details

Motivation: Existing joint alignment approaches require long training times, large models, and extensive hyperparameter tuning, creating a need for faster and more efficient methods.

Method: Uses pairwise matches from off-the-shelf image matchers with nonparametric clustering to build a graph of keypoint relations, then employs graph neural networks to propagate correspondences and predict homography parameters via image-level pooling with inverse-compositional loss.

Result: Achieves better alignment quality than modern JA methods while reducing computation time from hours/minutes to seconds on multiple benchmarks.

Conclusion: FastJAM provides an efficient and effective solution for joint image alignment that eliminates the need for regularization terms and associated hyperparameter tuning, making it practical for real-world applications.

Abstract: Joint Alignment (JA) of images aims to align a collection of images into a unified coordinate frame, such that semantically-similar features appear at corresponding spatial locations. Most existing approaches often require long training times, large-capacity models, and extensive hyperparameter tuning. We introduce FastJAM, a rapid, graph-based method that drastically reduces the computational complexity of joint alignment tasks. FastJAM leverages pairwise matches computed by an off-the-shelf image matcher, together with a rapid nonparametric clustering, to construct a graph representing intra- and inter-image keypoint relations. A graph neural network propagates and aggregates these correspondences, efficiently predicting per-image homography parameters via image-level pooling. Utilizing an inverse-compositional loss, that eliminates the need for a regularization term over the predicted transformations (and thus also obviates the hyperparameter tuning associated with such terms), FastJAM performs image JA quickly and effectively. Experimental results on several benchmarks demonstrate that FastJAM achieves results better than existing modern JA methods in terms of alignment quality, while reducing computation time from hours or minutes to mere seconds. Our code is available at our project webpage, https://bgu-cs-vil.github.io/FastJAM/

[224] LightBagel: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation

Zeyu Wang, Zilong Chen, Chenhui Gou, Feng Li, Chaorui Deng, Deyao Zhu, Kunchang Li, Weihao Yu, Haoqin Tu, Haoqi Fan, Cihang Xie

Main category: cs.CV

TL;DR: Efficient multimodal model fusion approach that combines existing generation and understanding models through interleaved multimodal self-attention blocks, achieving strong performance with minimal training.

Details

Motivation: To create competitive multimodal models more efficiently by fusing existing specialized models rather than training from scratch, reducing computational requirements.

Method: Retains original model blocks while interleaving multimodal self-attention blocks throughout networks, enabling rich multimodal fusion while preserving base model strengths.

Result: Achieves strong benchmarks with only ~35B tokens: 0.91 on GenEval, 82.16 on DPG-Bench, 6.06 on GEditBench, and 3.77 on ImgEdit-Bench.

Conclusion: Demonstrates that strategic fusion of existing models can achieve competitive multimodal performance efficiently, with full code and models released to support future research.

Abstract: Unified multimodal models have recently shown remarkable gains in both capability and versatility, yet most leading systems are still trained from scratch and require substantial computational resources. In this paper, we show that competitive performance can be obtained far more efficiently by strategically fusing publicly available models specialized for either generation or understanding. Our key design is to retain the original blocks while additionally interleaving multimodal self-attention blocks throughout the networks. This double fusion mechanism (1) effectively enables rich multi-modal fusion while largely preserving the original strengths of the base models, and (2) catalyzes synergistic fusion of high-level semantic representations from the understanding encoder with low-level spatial signals from the generation encoder. By training with only ~ 35B tokens, this approach achieves strong results across multiple benchmarks: 0.91 on GenEval for compositional text-to-image generation, 82.16 on DPG-Bench for complex text-to-image generation, 6.06 on GEditBench, and 3.77 on ImgEdit-Bench for image editing. By fully releasing the entire suite of code, model weights, and datasets, we hope to support future research on unified multimodal modeling.

[225] Quantizing Space and Time: Fusing Time Series and Images for Earth Observation

Gianfranco Basile, Johannes Jakubik, Benedikt Blumenstiel, Thomas Brunschwiler, Juan Bernabe Moreno

Main category: cs.CV

TL;DR: A task-agnostic framework for multimodal fusion of time series and images using time series quantization and masked correlation learning, achieving superior performance in cross-modal generation and downstream tasks.

Details

Motivation: To enable robust multimodal fusion between time series data and single timestamp images in a task-agnostic manner, allowing cross-modal generation and improved downstream task performance.

Method: Proposes deterministic and learned strategies for time series quantization, then uses masked correlation learning to align discrete image and time series tokens in a unified representation space.

Result: Outperforms task-specific fusion by 6% in R² and 2% in RMSE on average, and exceeds baseline methods by 50% in R² and 12% in RMSE. Successfully generates consistent global temperature profiles from satellite imagery.

Conclusion: The framework enables effective cross-modal generation and robust downstream performance, with gradient sensitivity analysis providing insights into model robustness across modalities.

Abstract: We propose a task-agnostic framework for multimodal fusion of time series and single timestamp images, enabling cross-modal generation and robust downstream performance. Our approach explores deterministic and learned strategies for time series quantization and then leverages a masked correlation learning objective, aligning discrete image and time series tokens in a unified representation space. Instantiated in the Earth observation domain, the pretrained model generates consistent global temperature profiles from satellite imagery and is validated through counterfactual experiments. Across downstream tasks, our task-agnostic pretraining outperforms task-specific fusion by 6% in R^2 and 2% in RMSE on average, and exceeds baseline methods by 50% in R^2 and 12% in RMSE. Finally, we analyze gradient sensitivity across modalities, providing insights into model robustness. Code, data, and weights will be released under a permissive license.

[226] When are radiology reports useful for training medical image classifiers?

Herman Bergström, Zhongqi Yue, Fredrik D. Johansson

Main category: cs.CV

TL;DR: Systematic study shows leveraging radiology reports during training improves medical image classification when labels are well-represented in text, but can be detrimental otherwise. Fine-tuning with reports often provides greater benefits than pre-training methods.

Details

Motivation: Medical images often come with radiology reports containing expert annotations, but using these reports requires manual radiologist work. The research aims to determine when and how radiology reports can improve image-only classification during training.

Method: Conducted systematic study using radiology reports during both pre-training and fine-tuning phases across diagnostic and prognostic tasks (e.g., 12-month readmission), varying training set sizes, and comparing explicit image-text alignment approaches.

Result: 1) Report-based pre-training helps when labels are well-represented in text but can be detrimental otherwise. 2) Fine-tuning with reports often provides significant improvements and can have larger impact than pre-training methods in certain settings.

Conclusion: Provides actionable insights for leveraging privileged text data in medical image classification, highlighting specific conditions where reports are beneficial and identifying gaps in current research approaches.

Abstract: Medical images used to train machine learning models are often accompanied by radiology reports containing rich expert annotations. However, relying on these reports as inputs for clinical prediction requires the timely manual work of a trained radiologist. This raises a natural question: when can radiology reports be leveraged during training to improve image-only classification? Prior works are limited to evaluating pre-trained image representations by fine-tuning them to predict diagnostic labels, often extracted from reports, ignoring tasks with labels that are weakly associated with the text. To address this gap, we conduct a systematic study of how radiology reports can be used during both pre-training and fine-tuning, across diagnostic and prognostic tasks (e.g., 12-month readmission), and under varying training set sizes. Our findings reveal that: (1) Leveraging reports during pre-training is beneficial for downstream classification tasks where the label is well-represented in the text; however, pre-training through explicit image-text alignment can be detrimental in settings where it’s not; (2) Fine-tuning with reports can lead to significant improvements and even have a larger impact than the pre-training method in certain settings. These results provide actionable insights into when and how to leverage privileged text data to train medical image classifiers while highlighting gaps in current research.

cs.AI

[227] Scheduling Your LLM Reinforcement Learning with Reasoning Trees

Hong Wang, Zhezheng Hao, Jian Luo, Chenxing Wei, Yao Shu, Lei Liu, Qiang Lin, Hande Dong, Jiawei Chen

Main category: cs.AI

TL;DR: The paper introduces a novel Reasoning Score (r-score) metric and Reasoning Tree Schedule (Re-Schedule) algorithm for RLVR data scheduling, which improves LLM optimization by considering reasoning tree structures rather than just path-based metrics.

Details

Motivation: Existing RLVR data scheduling methods rely on path-based metrics that overlook reasoning tree structures, limiting their effectiveness in optimizing LLMs through reinforcement learning.

Method: Proposes Reasoning Score (r-score) to measure query learning difficulty based on reasoning tree structure, and develops Re-Schedule algorithm that creates a curriculum from simple (high r-score) to complex (low r-score) queries.

Result: Experiments on six math-reasoning benchmarks show Re-Schedule significantly improves average accuracy with gains up to 3.2%, outperforming existing methods.

Conclusion: Structural understanding of reasoning trees provides a more powerful and principled foundation for RLVR data scheduling, as validated by strong experimental results.

Abstract: Using Reinforcement Learning with Verifiable Rewards (RLVR) to optimize Large Language Models (LLMs) can be conceptualized as progressively editing a query’s `Reasoning Tree’. This process involves exploring nodes (tokens) and dynamically modifying the model’s policy at each node. When combined with data scheduling, this process yields further gains in data efficiency and accuracy. However, existing RLVR data scheduling methods typically rely on path-based metrics to rank queries, overlooking the reasoning tree structures of these queries. In this paper, we introduce a novel metric, namely Reasoning Score (r-score), which measures the query’s learning difficulty based on the structure of its reasoning tree. Based on the r-score, we propose the Reasoning Tree Schedule (Re-Schedule), a scheduling algorithm that constructs a curriculum progressing from structurally simple (high r-score) to complex (low r-score) queries. Experiments on six math-reasoning benchmarks show that Re-Schedule significantly improves average accuracy, achieving gains of up to 3.2%. These strong results validate our approach and demonstrate that a structural understanding of the reasoning tree provides a more powerful and principled foundation for RLVR data scheduling.

[228] Cyclic Counterfactuals under Shift-Scale Interventions

Saptarshi Saha, Dhruv Vansraj Rathore, Utpal Garain

Main category: cs.AI

TL;DR: This paper studies counterfactual inference in cyclic structural causal models (SCMs), extending traditional frameworks that assume acyclic DAGs to handle real-world systems with feedback loops and cyclic dependencies.

Details

Motivation: Traditional counterfactual inference frameworks assume acyclic SCMs (DAGs), but many real-world systems like biological systems contain feedback loops and cyclic dependencies that violate acyclicity assumptions.

Method: The authors study counterfactual inference in cyclic SCMs under shift-scale interventions, which are soft, policy-style changes that rescale and/or shift a variable’s mechanism.

Result: Not specified in the abstract provided.

Conclusion: Not specified in the abstract provided.

Abstract: Most counterfactual inference frameworks traditionally assume acyclic structural causal models (SCMs), i.e. directed acyclic graphs (DAGs). However, many real-world systems (e.g. biological systems) contain feedback loops or cyclic dependencies that violate acyclicity. In this work, we study counterfactual inference in cyclic SCMs under shift-scale interventions, i.e., soft, policy-style changes that rescale and/or shift a variable’s mechanism.

[229] Taming the Real-world Complexities in CPT E/M Coding with Large Language Models

Islam Nassar, Yang Lin, Yuan Jin, Rongxin Zhu, Chang Wei Tan, Zenan Zhai, Nitika Mathur, Thanh Tien Vu, Xu Zhong, Long Duong, Yuan-Fang Li

Main category: cs.AI

TL;DR: ProFees is an LLM-based framework that automates E/M coding to reduce physician documentation burden, achieving 36% higher accuracy than commercial systems and 5% over single-prompt baselines.

Details

Motivation: Automating E/M coding can alleviate physicians' documentation burden, improve billing efficiency, and enable better patient care by handling this auxiliary but important task.

Method: ProFees uses an LLM-based framework specifically designed to address real-world complexities in E/M coding that make automation challenging.

Result: On expert-curated real-world data, ProFees achieved more than 36% higher coding accuracy compared to commercial CPT E/M coding systems and almost 5% improvement over the strongest single-prompt baseline.

Conclusion: The ProFees framework effectively addresses real-world complexities in E/M coding automation, demonstrating significant improvements in accuracy over existing solutions.

Abstract: Evaluation and Management (E/M) coding, under the Current Procedural Terminology (CPT) taxonomy, documents medical services provided to patients by physicians. Used primarily for billing purposes, it is in physicians’ best interest to provide accurate CPT E/M codes. %While important, it is an auxiliary task that adds to physicians’ documentation burden. Automating this coding task will help alleviate physicians’ documentation burden, improve billing efficiency, and ultimately enable better patient care. However, a number of real-world complexities have made E/M encoding automation a challenging task. In this paper, we elaborate some of the key complexities and present ProFees, our LLM-based framework that tackles them, followed by a systematic evaluation. On an expert-curated real-world dataset, ProFees achieves an increase in coding accuracy of more than 36% over a commercial CPT E/M coding system and almost 5% over our strongest single-prompt baseline, demonstrating its effectiveness in addressing the real-world complexities.

[230] Aligning Large Language Models with Procedural Rules: An Autoregressive State-Tracking Prompting for In-Game Trading

Minkyung Kim, Junsik Kim, Woongcheol Yang, Sangdon Park, Sohee Bae

Main category: cs.AI

TL;DR: ASTP prompting method enables LLMs to follow procedural trading flows with high accuracy while maintaining computational efficiency.

Details

Motivation: LLMs fail to follow essential procedural flows in rule-governed trading systems, eroding player trust in dynamic game interactions.

Method: Autoregressive State-Tracking Prompting (ASTP) compels LLMs to make state-tracking explicit and verifiable, complemented by state-specific placeholder post-processing for accurate price calculations.

Result: Evaluation shows >99% state compliance and 99.3% calculation precision, with smaller models matching larger models’ performance while reducing response time from 21.2s to 2.4s.

Conclusion: ASTP establishes a practical foundation satisfying both real-time requirements and resource constraints of commercial games while maintaining transactional integrity.

Abstract: Large Language Models (LLMs) enable dynamic game interactions but fail to follow essential procedural flows in rule-governed trading systems, eroding player trust. This work resolves the core tension between the creative flexibility of LLMs and the procedural demands of in-game trading (browse-offer-review-confirm). To this end, Autoregressive State-Tracking Prompting (ASTP) is introduced, a methodology centered on a strategically orchestrated prompt that compels an LLM to make its state-tracking process explicit and verifiable. Instead of relying on implicit contextual understanding, ASTP tasks the LLM with identifying and reporting a predefined state label from the previous turn. To ensure transactional integrity, this is complemented by a state-specific placeholder post-processing method for accurate price calculations. Evaluation across 300 trading dialogues demonstrates >99% state compliance and 99.3% calculation precision. Notably, ASTP with placeholder post-processing on smaller models (Gemini-2.5-Flash) matches larger models’ (Gemini-2.5-Pro) performance while reducing response time from 21.2s to 2.4s, establishing a practical foundation that satisfies both real-time requirements and resource constraints of commercial games.

[231] Reasoning-Aware GRPO using Process Mining

Taekhyun Park, Yongjae Lee, Hyerim Bae

Main category: cs.AI

TL;DR: PM4GRPO enhances RL-based post-training for large reasoning models by incorporating process mining to measure reasoning procedure conformance with teacher models, outperforming existing GRPO methods.

Details

Motivation: Current reinforcement learning reward schemes for post-training large reasoning models are typically outcome-centric, lacking signals about the reasoning process itself.

Method: Proposes PM4GRPO that augments standard answer/format rewards with process mining techniques to compute conformance rewards measuring how closely policy model reasoning aligns with pretrained teacher models.

Result: Empirical results on five benchmarks show PM4GRPO significantly outperforms existing methodologies for GRPO-based post-training.

Conclusion: Leveraging process mining for reasoning-aware GRPO effectively enhances the reasoning capabilities of policy models.

Abstract: Reinforcement learning (RL)-based post-training has been crucial for enabling multi-step reasoning in large reasoning models (LRMs), yet current reward schemes are typically outcome-centric. We propose PM4GRPO, a reasoning-aware Group Relative Policy Optimization (GRPO) that augments standard answer/format rewards with signals over the reasoning procedure. To this end, process mining techniques are utilized to compute a scalar conformance reward that measures how closely a policy model’s reasoning aligns with the pretrained teacher model. The empirical results on five benchmarks demonstrate that PM4GRPO significantly outperforms existing methodologies for GRPO-based post-training. These results highlight that leveraging process mining for reasoning-aware GRPO effectively enhances the reasoning capabilities of policy models.

[232] H3M-SSMoEs: Hypergraph-based Multimodal Learning with LLM Reasoning and Style-Structured Mixture of Experts

Peilin Tan, Liang Xie, Churan Zhi, Dian Tu, Chuanqi Shi

Main category: cs.AI

TL;DR: H3M-SSMoEs is a novel hypergraph-based multimodal architecture for stock prediction that integrates hypergraph modeling, LLM reasoning, and style-structured mixture of experts to capture complex temporal dependencies and inter-stock relationships.

Details

Motivation: Stock movement prediction is challenging due to complex temporal dependencies, heterogeneous modalities, and dynamically evolving inter-stock relationships. Existing approaches fail to unify structural, semantic, and regime-adaptive modeling within a scalable framework.

Method: Three key innovations: (1) Multi-Context Multimodal Hypergraph with Local and Global Context Hypergraphs using shared cross-modal hyperedges and Jensen-Shannon Divergence weighting; (2) LLM-enhanced reasoning module with frozen LLM and lightweight adapters for semantic fusion; (3) Style-Structured Mixture of Experts combining shared market experts and industry-specialized experts with learnable style vectors.

Result: Extensive experiments on three major stock markets demonstrate superior predictive accuracy and investment performance compared to state-of-the-art methods, with effective risk control.

Conclusion: H3M-SSMoEs provides an effective framework for stock movement prediction that integrates structural, semantic, and regime-adaptive modeling, achieving state-of-the-art performance while maintaining scalability.

Abstract: Stock movement prediction remains fundamentally challenging due to complex temporal dependencies, heterogeneous modalities, and dynamically evolving inter-stock relationships. Existing approaches often fail to unify structural, semantic, and regime-adaptive modeling within a scalable framework. This work introduces H3M-SSMoEs, a novel Hypergraph-based MultiModal architecture with LLM reasoning and Style-Structured Mixture of Experts, integrating three key innovations: (1) a Multi-Context Multimodal Hypergraph that hierarchically captures fine-grained spatiotemporal dynamics via a Local Context Hypergraph (LCH) and persistent inter-stock dependencies through a Global Context Hypergraph (GCH), employing shared cross-modal hyperedges and Jensen-Shannon Divergence weighting mechanism for adaptive relational learning and cross-modal alignment; (2) a LLM-enhanced reasoning module, which leverages a frozen large language model with lightweight adapters to semantically fuse and align quantitative and textual modalities, enriching representations with domain-specific financial knowledge; and (3) a Style-Structured Mixture of Experts (SSMoEs) that combines shared market experts and industry-specialized experts, each parameterized by learnable style vectors enabling regime-aware specialization under sparse activation. Extensive experiments on three major stock markets demonstrate that H3M-SSMoEs surpasses state-of-the-art methods in both superior predictive accuracy and investment performance, while exhibiting effective risk control. Datasets, source code, and model weights are available at our GitHub repository: https://github.com/PeilinTime/H3M-SSMoEs.

[233] KnowCoder-A1: Incentivizing Agentic Reasoning Capability with Outcome Supervision for KBQA

Zhuo Chen, Fei Wang, Zixuan Li, Zhao Zhang, Weiwei Ding, Chuanguang Yang, Yongjun Xu, Xiaolong Jin, Jiafeng Guo

Main category: cs.AI

TL;DR: KnowCoder-A1 is an LLM that performs autonomous agentic reasoning on Knowledge Bases using outcome-only supervision with multi-stage curriculum reinforcement learning, achieving superior performance with less training data.

Details

Motivation: Existing KBQA methods use process supervision which provides weak incentives for exploration and fails to strengthen agentic reasoning abilities in LLMs.

Method: Multi-stage curriculum reinforcement learning with easy-to-hard curriculum under outcome-only supervision, starting with fine-tuning on high-quality trajectories from rejection sampling.

Result: KnowCoder-A1 consistently outperforms prior approaches across three datasets, achieving up to 11.1% relative improvement on GrailQA zero-shot subset using only 1/12 of training data.

Conclusion: Outcome-only supervision with curriculum RL enables powerful autonomous agentic reasoning capabilities in LLMs for KBQA tasks.

Abstract: Knowledge Base Question Answering (KBQA) aims to answer natural-language questions over a structured Knowledge Base (KB). Recent work improves KBQA by adopting an agentic reasoning paradigm, in which Large Language Models (LLMs) iteratively decompose a question, generate its corresponding logical queries, and interact with the KB to derive the answer. However, these methods typically fine-tune LLMs on reasoning trajectories synthesized via process supervision, which offers weak incentives for exploration and thus fails to strengthen the agentic reasoning ability. In this paper, we propose KnowCoder-A1, an LLM that can autonomously perform agentic reasoning on KBs to obtain answers. To incentivize autonomous exploration, KnowCoder-A1 trains the LLM under outcome-only supervision via a multi-stage curriculum reinforcement learning with an easy-to-hard curriculum. To establish foundational agentic capabilities, KnowCoder-A1 first fine-tunes the LLM on a small set of high-quality trajectories obtained through outcome-based rejection sampling. Then, to alleviate the reward sparsity inherent in outcome-only supervision, it applies multi-stage curriculum RL with reward schedules that progress from easy to hard. Trained with outcome-only supervision, KnowCoder-A1 exhibits powerful reasoning behaviors and consistently outperforms prior approaches across three mainstream datasets. Notably, on the zero-shot subset of GrailQA, KnowCoder-A1 achieves up to an 11.1% relative improvement while using only one-twelfth of the training data, demonstrating strong agentic reasoning capabilities.

Tianyu Yang, Terry Ruas, Yijun Tian, Jan Philip Wahle, Daniel Kurzawe, Bela Gipp

Main category: cs.AI

TL;DR: ALDEN is a multi-turn reinforcement learning framework that fine-tunes VLMs as interactive agents for actively navigating long, visually rich documents, achieving state-of-the-art performance on five benchmarks.

Details

Motivation: Vision-language models struggle with long, visually complex documents that require analysis across multiple pages, and existing approaches use rigid pipelines that force VLMs into passive roles, hindering efficiency and generalization.

Method: ALDEN uses a multi-turn reinforcement learning framework with a novel fetch action for direct page access, rule-based cross-level rewards for dense supervision, and visual-semantic anchoring to stabilize training with numerous visual tokens from long documents.

Result: ALDEN achieves state-of-the-art performance on five long-document benchmarks, trained on a corpus from three open-source datasets.

Conclusion: ALDEN represents a step beyond passive document reading toward autonomous agents that navigate and reason across long, visually rich documents, offering a robust path to more accurate and efficient long-document understanding.

Abstract: Vision-language models (VLMs) excel at interpreting text-rich images but struggle with long, visually complex documents that demand analysis and integration of information spread across multiple pages. Existing approaches typically rely on fixed reasoning templates or rigid pipelines, which force VLMs into a passive role and hinder both efficiency and generalization. We present Active Long-DocumEnt Navigation (ALDEN), a multi-turn reinforcement learning framework that fine-tunes VLMs as interactive agents capable of actively navigating long, visually rich documents. ALDEN introduces a novel fetch action that directly accesses the page by index, complementing the classic search action and better exploiting document structure. For dense process supervision and efficient training, we propose a rule-based cross-level reward that provides both turn- and token-level signals. To address the empirically observed training instability caused by numerous visual tokens from long documents, we further propose a visual-semantic anchoring mechanism that applies a dual-path KL-divergence constraint to stabilize visual and textual representations separately during training. Trained on a corpus constructed from three open-source datasets, ALDEN achieves state-of-the-art performance on five long-document benchmarks. Overall, ALDEN marks a step beyond passive document reading toward agents that autonomously navigate and reason across long, visually rich documents, offering a robust path to more accurate and efficient long-document understanding.

[235] Agentic Moderation: Multi-Agent Design for Safer Vision-Language Models

Juan Ren, Mark Dras, Usman Naseem

Main category: cs.AI

TL;DR: Agentic Moderation is a model-agnostic framework that uses specialized agents to defend multimodal systems against jailbreak attacks, achieving significant improvements in safety metrics through dynamic, cooperative agent coordination.

Details

Motivation: To extend agentic methods to safety alignment by creating a more effective defense against jailbreak attacks in multimodal systems, moving beyond static approaches that only provide binary classifications.

Method: Uses a framework with four cooperative agents (Shield, Responder, Evaluator, and Reflector) that work dynamically to achieve context-aware and interpretable moderation, making it model-agnostic.

Result: Reduces Attack Success Rate by 7-19%, maintains stable Non-Following Rate, improves Refusal Rate by 4-20% across five datasets and four LVLMs, demonstrating robust and balanced safety performance.

Conclusion: Agentic Moderation provides modular, scalable, and fine-grained safety enforcement, showcasing the broader potential of agentic systems for automated safety governance in multimodal AI systems.

Abstract: Agentic methods have emerged as a powerful and autonomous paradigm that enhances reasoning, collaboration, and adaptive control, enabling systems to coordinate and independently solve complex tasks. We extend this paradigm to safety alignment by introducing Agentic Moderation, a model-agnostic framework that leverages specialised agents to defend multimodal systems against jailbreak attacks. Unlike prior approaches that apply as a static layer over inputs or outputs and provide only binary classifications (safe or unsafe), our method integrates dynamic, cooperative agents, including Shield, Responder, Evaluator, and Reflector, to achieve context-aware and interpretable moderation. Extensive experiments across five datasets and four representative Large Vision-Language Models (LVLMs) demonstrate that our approach reduces the Attack Success Rate (ASR) by 7-19%, maintains a stable Non-Following Rate (NF), and improves the Refusal Rate (RR) by 4-20%, achieving robust, interpretable, and well-balanced safety performance. By harnessing the flexibility and reasoning capacity of agentic architectures, Agentic Moderation provides modular, scalable, and fine-grained safety enforcement, highlighting the broader potential of agentic systems as a foundation for automated safety governance.

[236] Energy-Efficient Autonomous Driving with Adaptive Perception and Robust Decision

Yuyang Xia, Zibo Liang, Liwei Deng, Yan Zhao, Han Su, Kai Zheng

Main category: cs.AI

TL;DR: EneAD is an energy-efficient autonomous driving framework that reduces perception computation by 1.9x-3.5x through adaptive model selection and framerate adjustment, improving driving range by 3.9%-8.5% while maintaining accuracy.

Details

Motivation: Autonomous driving's energy consumption, particularly from perception computing using large deep learning models, limits electric vehicle driving range. Existing compression techniques cause either large model sizes or significant accuracy drops.

Method: Uses adaptive perception with multiple models of different computational costs, dynamically adjusts execution framerate, and employs Bayesian optimization for knob tuning. Includes lightweight classification for scenario difficulty assessment and reinforcement learning-based decision module with regularization for stability.

Result: Reduces perception consumption by 1.9x to 3.5x and improves driving range by 3.9% to 8.5% while maintaining desired accuracy across various traffic scenarios.

Conclusion: EneAD effectively addresses the energy consumption challenge in autonomous driving through adaptive perception optimization and robust decision-making, achieving significant energy savings without compromising driving performance.

Abstract: Autonomous driving is an emerging technology that is expected to bring significant social, economic, and environmental benefits. However, these benefits come with rising energy consumption by computation engines, limiting the driving range of vehicles, especially electric ones. Perception computing is typically the most power-intensive component, as it relies on largescale deep learning models to extract environmental features. Recently, numerous studies have employed model compression techniques, such as sparsification, quantization, and distillation, to reduce computational consumption. However, these methods often result in either a substantial model size or a significant drop in perception accuracy compared to high-computation models. To address these challenges, we propose an energy-efficient autonomous driving framework, called EneAD. In the adaptive perception module, a perception optimization strategy is designed from the perspective of data management and tuning. Firstly, we manage multiple perception models with different computational consumption and adjust the execution framerate dynamically. Then, we define them as knobs and design a transferable tuning method based on Bayesian optimization to identify promising knob values that achieve low computation while maintaining desired accuracy. To adaptively switch the knob values in various traffic scenarios, a lightweight classification model is proposed to distinguish the perception difficulty in different scenarios. In the robust decision module, we propose a decision model based on reinforcement learning and design a regularization term to enhance driving stability in the face of perturbed perception results. Extensive experiments evidence the superiority of our framework in both energy consumption and driving performance. EneAD can reduce perception consumption by 1.9x to 3.5x and thus improve driving range by 3.9% to 8.5%

[237] RAVR: Reference-Answer-guided Variational Reasoning for Large Language Models

Tianqianjin Lin, Xi Zhao, Xingyao Zhang, Rujiao Long, Yi Xu, Zhuoren Jiang, Wenbo Su, Bo Zheng

Main category: cs.AI

TL;DR: RAVR uses answer-conditioned reasoning to improve LLM reasoning by leveraging the insight that explaining why an answer is correct is easier than finding the answer from scratch, transforming hard problems into learnable ones.

Details

Motivation: RL for LLMs requires the model to already generate good reasoning paths, but for difficult tasks, sampling such paths is hard and risks reinforcing suboptimal reasoning. The insight that explaining 'why' is easier than finding 'what' motivates using answers to derive reasoning paths.

Method: RAVR (Reference-Answer-guided Variational Reasoning) - an end-to-end framework that uses answer-conditioned reasoning as a variational surrogate for question-only reasoning, formalizing how conditioning on answers increases reasoning path utility.

Result: Experiments in general and math domains show consistent improvements over strong baselines. Analysis reveals RAVR reduces hesitation, strengthens conclusion consolidation, and promotes problem-specific reasoning strategies.

Conclusion: Answer-conditioned reasoning effectively transforms intractable reasoning problems into learnable ones by leveraging the cognitive insight that explanatory reconstruction is easier than open-ended exploration, with RAVR demonstrating practical improvements across domains.

Abstract: Reinforcement learning (RL) can refine the reasoning abilities of large language models (LLMs), but critically depends on a key prerequisite: the LLM can already generate high-utility reasoning paths with non-negligible probability. For tasks beyond the LLM’s current competence, such reasoning path can be hard to sample, and learning risks reinforcing familiar but suboptimal reasoning. We are motivated by the insight from cognitive science that Why is this the answer is often an easier question than What is the answer, as it avoids the heavy cognitive load of open-ended exploration, opting instead for explanatory reconstruction-systematically retracing the reasoning that links a question to its answer. We show that LLMs can similarly leverage answers to derive high-quality reasoning paths. We formalize this phenomenon and prove that conditioning on answer provably increases the expected utility of sampled reasoning paths, thereby transforming intractable problems into learnable ones. Building on this insight, we introduce RAVR (Reference-Answer-guided Variational Reasoning), an end-to-end framework that uses answer-conditioned reasoning as a variational surrogate for question-only reasoning. Experiments in both general and math domains demonstrate consistent improvements over strong baselines. We further analyze the reasoning behavior and find that RAVR reduces hesitation, strengthens conclusion consolidation, and promotes problem-specific strategies in reasoning.

[238] FELA: A Multi-Agent Evolutionary System for Feature Engineering of Industrial Event Log Data

Kun ouyang, Haoyu Wang, Dong Fang

Main category: cs.AI

TL;DR: FELA is a multi-agent system using LLMs to autonomously generate high-performing, explainable features from complex industrial event log data through collaborative agent workflows and evolutionary algorithms.

Details

Motivation: Industrial event logs are valuable but complex, with large scale, high dimensionality, and heterogeneous data types. Existing AutoML and genetic methods lack explainability, have rigid operations, and poor adaptability to complex data.

Method: FELA uses specialized LLM agents (Idea, Code, Critic, Evaluation) that collaboratively generate, validate, and implement features. It employs an insight-guided self-evolution paradigm with reinforcement learning and genetic algorithm principles for idea space exploration.

Result: Extensive experiments on real industrial datasets show FELA generates explainable, domain-relevant features that significantly improve model performance while reducing manual effort.

Conclusion: LLM-based multi-agent systems show potential as a general framework for automated, interpretable, and adaptive feature engineering in complex real-world environments.

Abstract: Event log data, recording fine-grained user actions and system events, represent one of the most valuable assets for modern digital services. However, the complexity and heterogeneity of industrial event logs–characterized by large scale, high dimensionality, diverse data types, and intricate temporal or relational structures–make feature engineering extremely challenging. Existing automatic feature engineering approaches, such as AutoML or genetic methods, often suffer from limited explainability, rigid predefined operations, and poor adaptability to complicated heterogeneous data. In this paper, we propose FELA (Feature Engineering LLM Agents), a multi-agent evolutionary system that autonomously extracts meaningful and high-performing features from complex industrial event log data. FELA integrates the reasoning and coding capabilities of large language models (LLMs) with an insight-guided self-evolution paradigm. Specifically, FELA employs specialized agents–Idea Agents, Code Agents, and Critic Agents–to collaboratively generate, validate, and implement novel feature ideas. An Evaluation Agent summarizes feedback and updates a hierarchical knowledge base and dual-memory system to enable continual improvement. Moreover, FELA introduces an agentic evolution algorithm, combining reinforcement learning and genetic algorithm principles to balance exploration and exploitation across the idea space. Extensive experiments on real industrial datasets demonstrate that FELA can generate explainable, domain-relevant features that significantly improve model performance while reducing manual effort. Our results highlight the potential of LLM-based multi-agent systems as a general framework for automated, interpretable, and adaptive feature engineering in complex real-world environments.

[239] From Medical Records to Diagnostic Dialogues: A Clinical-Grounded Approach and Dataset for Psychiatric Comorbidity

Tianxi Wan, Jiaming Luo, Siyuan Chen, Kunyao Lan, Jianhua Chen, Haiyang Geng, Mengyue Wu

Main category: cs.AI

TL;DR: Developed PsyCoTalk, the first large-scale dialogue dataset for psychiatric comorbidity using synthetic EMRs and multi-agent diagnostic dialogue generation, validated by psychiatrists.

Details

Motivation: Address the clinical challenge of psychiatric comorbidity by creating a realistic dataset for multi-disorder screening.

Method: Integrated synthetic patient EMR construction with multi-agent diagnostic dialogue generation using hierarchical state machine and context tree supporting 130+ diagnostic states.

Result: Created 3,000 multi-turn diagnostic dialogues from 502 synthetic EMRs, showing high structural and linguistic fidelity compared to real clinical transcripts.

Conclusion: PsyCoTalk enables development of models for multi-disorder psychiatric screening in a single conversational pass, validated by licensed psychiatrists.

Abstract: Psychiatric comorbidity is clinically significant yet challenging due to the complexity of multiple co-occurring disorders. To address this, we develop a novel approach integrating synthetic patient electronic medical record (EMR) construction and multi-agent diagnostic dialogue generation. We create 502 synthetic EMRs for common comorbid conditions using a pipeline that ensures clinical relevance and diversity. Our multi-agent framework transfers the clinical interview protocol into a hierarchical state machine and context tree, supporting over 130 diagnostic states while maintaining clinical standards. Through this rigorous process, we construct PsyCoTalk, the first large-scale dialogue dataset supporting comorbidity, containing 3,000 multi-turn diagnostic dialogues validated by psychiatrists. This dataset enhances diagnostic accuracy and treatment planning, offering a valuable resource for psychiatric comorbidity research. Compared to real-world clinical transcripts, PsyCoTalk exhibits high structural and linguistic fidelity in terms of dialogue length, token distribution, and diagnostic reasoning strategies. Licensed psychiatrists confirm the realism and diagnostic validity of the dialogues. This dataset enables the development and evaluation of models capable of multi-disorder psychiatric screening in a single conversational pass.

[240] Counterfactual-based Agent Influence Ranker for Agentic AI Workflows

Amit Giloni, Chiara Picardi, Roy Betser, Shamik Bose, Aishvariya Priya Rathina Sabapathy, Roman Vainshtein

Main category: cs.AI

TL;DR: CAIR is the first method to assess individual agent influence in LLM-based multi-agent systems using counterfactual analysis, providing task-agnostic influence rankings that work both offline and at inference time.

Details

Motivation: There's a need to understand how individual agents influence final outputs in autonomous AI workflows, but no existing methods can assess agent-level influence during inference time execution.

Method: Counterfactual-based Agent Influence Ranker (CAIR) performs counterfactual analysis to determine the influence level of each agent on the workflow’s output, providing task-agnostic analysis.

Result: CAIR produces consistent rankings, outperforms baseline methods, and enhances effectiveness of downstream tasks when evaluated on a dataset of 30 use cases with 230 functionalities.

Conclusion: CAIR is an effective method for assessing agent influence in multi-agent systems that can be used both offline and at inference time, addressing a critical gap in understanding autonomous AI workflows.

Abstract: An Agentic AI Workflow (AAW), also known as an LLM-based multi-agent system, is an autonomous system that assembles several LLM-based agents to work collaboratively towards a shared goal. The high autonomy, widespread adoption, and growing interest in such AAWs highlight the need for a deeper understanding of their operations, from both quality and security aspects. To this day, there are no existing methods to assess the influence of each agent on the AAW’s final output. Adopting techniques from related fields is not feasible since existing methods perform only static structural analysis, which is unsuitable for inference time execution. We present Counterfactual-based Agent Influence Ranker (CAIR) - the first method for assessing the influence level of each agent on the AAW’s output and determining which agents are the most influential. By performing counterfactual analysis, CAIR provides a task-agnostic analysis that can be used both offline and at inference time. We evaluate CAIR using an AAWs dataset of our creation, containing 30 different use cases with 230 different functionalities. Our evaluation showed that CAIR produces consistent rankings, outperforms baseline methods, and can easily enhance the effectiveness and relevancy of downstream tasks.

[241] GAP: Graph-Based Agent Planning with Parallel Tool Use and Reinforcement Learning

Jiaqi Wu, Qinlao Zhao, Zefeng Chen, Kai Qin, Yifei Zhao, Xueqian Wang, Yuhang Yao

Main category: cs.AI

TL;DR: GAP introduces graph-based planning for LLM agents to enable parallel tool execution, overcoming sequential bottlenecks in frameworks like ReAct.

Details

Motivation: Existing sequential reasoning paradigms fail to exploit parallelism among independent sub-tasks, leading to inefficient tool utilization and suboptimal performance in multi-step reasoning.

Method: Trains agent foundation models to decompose tasks into dependency-aware sub-task graphs, using supervised fine-tuning on curated graph-based planning data followed by reinforcement learning with correctness-based rewards.

Result: Significantly outperforms ReAct baselines on multi-step retrieval tasks with dramatic improvements in tool invocation efficiency through intelligent parallelization.

Conclusion: Graph-based planning enables adaptive parallel and serial tool execution, achieving substantial improvements in both execution efficiency and task accuracy for complex multi-step reasoning.

Abstract: Autonomous agents powered by large language models (LLMs) have shown impressive capabilities in tool manipulation for complex task-solving. However, existing paradigms such as ReAct rely on sequential reasoning and execution, failing to exploit the inherent parallelism among independent sub-tasks. This sequential bottleneck leads to inefficient tool utilization and suboptimal performance in multi-step reasoning scenarios. We introduce Graph-based Agent Planning (GAP), a novel framework that explicitly models inter-task dependencies through graph-based planning to enable adaptive parallel and serial tool execution. Our approach trains agent foundation models to decompose complex tasks into dependency-aware sub-task graphs, autonomously determining which tools can be executed in parallel and which must follow sequential dependencies. This dependency-aware orchestration achieves substantial improvements in both execution efficiency and task accuracy. To train GAP, we construct a high-quality dataset of graph-based planning traces derived from the Multi-Hop Question Answering (MHQA) benchmark. We employ a two-stage training strategy: supervised fine-tuning (SFT) on the curated dataset, followed by reinforcement learning (RL) with a correctness-based reward function on strategically sampled queries where tool-based reasoning provides maximum value. Experimental results on MHQA datasets demonstrate that GAP significantly outperforms traditional ReAct baselines, particularly on multi-step retrieval tasks, while achieving dramatic improvements in tool invocation efficiency through intelligent parallelization. The project page is available at: https://github.com/WJQ7777/Graph-Agent-Planning.

[242] Grouping Nodes With Known Value Differences: A Lossless UCT-based Abstraction Algorithm

Robin Schmöcker, Alexander Dockhorn, Bodo Rosenhahn

Main category: cs.AI

TL;DR: KVDA-UCT improves MCTS sample efficiency by grouping state-action pairs with known value differences rather than requiring identical values, outperforming state-of-the-art OGA-UCT.

Details

Motivation: Current MCTS abstraction methods like OGA-UCT require state-action pairs to have identical immediate rewards, which limits the number of possible abstractions and reduces sample efficiency.

Method: Proposed Known Value Difference Abstractions (KVDA) framework that groups states and state-action pairs with different values as long as their value differences can be inferred from immediate rewards, then modified OGA-UCT to use this framework as KVDA-UCT.

Result: KVDA-UCT detects significantly more abstractions than OGA-UCT, introduces no additional parameters, and outperforms OGA-UCT across various deterministic environments and parameter settings.

Conclusion: Breaking from the paradigm of grouping only value-equivalent states and instead grouping states with known value differences significantly improves MCTS abstraction effectiveness and sample efficiency.

Abstract: A core challenge of Monte Carlo Tree Search (MCTS) is its sample efficiency, which can be improved by grouping state-action pairs and using their aggregate statistics instead of single-node statistics. On the Go Abstractions in Upper Confidence bounds applied to Trees (OGA-UCT) is the state-of-the-art MCTS abstraction algorithm for deterministic environments that builds its abstraction using the Abstractions of State-Action Pairs (ASAP) framework, which aims to detect states and state-action pairs with the same value under optimal play by analysing the search graph. ASAP, however, requires two state-action pairs to have the same immediate reward, which is a rigid condition that limits the number of abstractions that can be found and thereby the sample efficiency. In this paper, we break with the paradigm of grouping value-equivalent states or state-action pairs and instead group states and state-action pairs with possibly different values as long as the difference between their values can be inferred. We call this abstraction framework Known Value Difference Abstractions (KVDA), which infers the value differences by analysis of the immediate rewards and modifies OGA-UCT to use this framework instead. The modification is called KVDA-UCT, which detects significantly more abstractions than OGA-UCT, introduces no additional parameter, and outperforms OGA-UCT on a variety of deterministic environments and parameter settings.

[243] Agentic AI: A Comprehensive Survey of Architectures, Applications, and Future Directions

Mohamad Abou Ali, Fadi Dornaika

Main category: cs.AI

TL;DR: This survey introduces a dual-paradigm framework for Agentic AI, categorizing systems into Symbolic/Classical (algorithmic planning) and Neural/Generative (stochastic generation) lineages. Through systematic review of 90 studies, it analyzes theoretical foundations, domain implementations, and ethical challenges, revealing strategic paradigm selection patterns and advocating for intentional integration in future AI development.

Details

Motivation: To address the fragmented understanding and conceptual retrofitting in Agentic AI by providing a clear framework that distinguishes between different AI paradigms and their appropriate applications.

Method: Conducted a systematic PRISMA-based review of 90 studies (2018-2025) using a novel dual-paradigm framework, analyzing across theoretical foundations, domain implementations, and ethical challenges.

Result: Found that symbolic systems dominate safety-critical domains (healthcare) while neural systems prevail in adaptive, data-rich environments (finance). Identified research gaps in governance for symbolic systems and need for hybrid neuro-symbolic architectures.

Conclusion: The future of Agentic AI lies in intentional integration of both paradigms to create systems that are both adaptable and reliable, rather than dominance of one approach.

Abstract: Agentic AI represents a transformative shift in artificial intelligence, but its rapid advancement has led to a fragmented understanding, often conflating modern neural systems with outdated symbolic models – a practice known as conceptual retrofitting. This survey cuts through this confusion by introducing a novel dual-paradigm framework that categorizes agentic systems into two distinct lineages: the Symbolic/Classical (relying on algorithmic planning and persistent state) and the Neural/Generative (leveraging stochastic generation and prompt-driven orchestration). Through a systematic PRISMA-based review of 90 studies (2018–2025), we provide a comprehensive analysis structured around this framework across three dimensions: (1) the theoretical foundations and architectural principles defining each paradigm; (2) domain-specific implementations in healthcare, finance, and robotics, demonstrating how application constraints dictate paradigm selection; and (3) paradigm-specific ethical and governance challenges, revealing divergent risks and mitigation strategies. Our analysis reveals that the choice of paradigm is strategic: symbolic systems dominate safety-critical domains (e.g., healthcare), while neural systems prevail in adaptive, data-rich environments (e.g., finance). Furthermore, we identify critical research gaps, including a significant deficit in governance models for symbolic systems and a pressing need for hybrid neuro-symbolic architectures. The findings culminate in a strategic roadmap arguing that the future of Agentic AI lies not in the dominance of one paradigm, but in their intentional integration to create systems that are both adaptable and reliable. This work provides the essential conceptual toolkit to guide future research, development, and policy toward robust and trustworthy hybrid intelligent systems.

[244] Instrumental goals in advanced AI systems: Features to be managed and not failures to be eliminated?

Willem Fourie

Main category: cs.AI

TL;DR: The paper argues that instrumental goals in AI systems should be accepted as inherent features rather than treated as failures, using Aristotle’s ontology to frame them as natural outcomes of system constitution that should be managed rather than eliminated.

Details

Motivation: To challenge conventional AI alignment approaches that treat instrumental goals as risks to be limited, and propose an alternative philosophical framework for understanding and managing these goals.

Method: Draws on Aristotle’s ontology and modern interpretations to construct a philosophical argument that frames instrumental goals as per se outcomes of AI system constitution rather than accidental malfunctions.

Result: Develops a theoretical framework where instrumental goals are seen as inherent features of goal-directed AI systems that arise from their formal and material constitution.

Conclusion: AI alignment efforts should focus less on eliminating instrumental goals and more on understanding, managing, and directing them toward human-aligned ends, accepting them as natural features of advanced AI systems.

Abstract: In artificial intelligence (AI) alignment research, instrumental goals, also called instrumental subgoals or instrumental convergent goals, are widely associated with advanced AI systems. These goals, which include tendencies such as power-seeking and self-preservation, become problematic when they conflict with human aims. Conventional alignment theory treats instrumental goals as sources of risk that become problematic through failure modes such as reward hacking or goal misgeneralization, and attempts to limit the symptoms of instrumental goals, notably resource acquisition and self-preservation. This article proposes an alternative framing: that a philosophical argument can be constructed according to which instrumental goals may be understood as features to be accepted and managed rather than failures to be limited. Drawing on Aristotle’s ontology and its modern interpretations, an ontology of concrete, goal-directed entities, it argues that advanced AI systems can be seen as artifacts whose formal and material constitution gives rise to effects distinct from their designers’ intentions. In this view, the instrumental tendencies of such systems correspond to per se outcomes of their constitution rather than accidental malfunctions. The implication is that efforts should focus less on eliminating instrumental goals and more on understanding, managing, and directing them toward human-aligned ends.

[245] HAMLET: Hyperadaptive Agent-based Modeling for Live Embodied Theatrics

Sizhou Chen, Shufan Jiang, Chi Zhang, Xiao-Lei Zhang, Xuelong Li

Main category: cs.AI

TL;DR: HAMLET is a multi-agent framework that uses LLMs to generate immersive theatrical experiences with autonomous actors who can interact with physical scenes and make independent decisions.

Details

Motivation: Existing LLM-based drama generation methods lack agent initiative, cannot interact with physical scenes, and require detailed user input, reducing interactivity and immersion in real-time performances.

Method: HAMLET generates narrative blueprints from simple topics, then uses autonomous actors with independent decision-making based on background, goals, and emotions. Actors can interact with scene props and broadcast changes to influence others.

Result: Experimental evaluation shows HAMLET creates expressive and coherent theatrical experiences, assessed through character performance, narrative quality, and interaction experience metrics.

Conclusion: HAMLET successfully addresses limitations of existing methods by enabling autonomous actors with physical scene interaction, creating more immersive and interactive theatrical performances.

Abstract: Creating an immersive and interactive theatrical experience is a long-term goal in the field of interactive narrative. The emergence of large language model (LLM) is providing a new path to achieve this goal. However, existing LLM-based drama generation methods often result in agents that lack initiative and cannot interact with the physical scene. Furthermore, these methods typically require detailed user input to drive the drama. These limitations reduce the interactivity and immersion of online real-time performance. To address the above challenges, we propose HAMLET, a multi-agent framework focused on drama creation and online performance. Given a simple topic, the framework generates a narrative blueprint, guiding the subsequent improvisational performance. During the online performance, each actor is given an autonomous mind. This means that actors can make independent decisions based on their own background, goals, and emotional state. In addition to conversations with other actors, their decisions can also change the state of scene props through actions such as opening a letter or picking up a weapon. The change is then broadcast to other related actors, updating what they know and care about, which in turn influences their next action. To evaluate the quality of drama performance generated by HAMLET, we designed an evaluation method to assess three primary aspects, including character performance, narrative quality, and interaction experience. The experimental evaluation shows that HAMLET can create expressive and coherent theatrical experiences.

[246] Multi-Objective Search: Algorithms, Applications, and Emerging Directions

Oren Salzman, Carlos Hernández Ulloa, Ariel Felner, Sven Koenig

Main category: cs.AI

TL;DR: This paper surveys multi-objective search (MOS) as a unifying framework for planning and decision-making with multiple conflicting criteria, highlighting recent AI applications and outlining future challenges.

Details

Motivation: Real-world systems rarely optimize a single measure, requiring frameworks that balance multiple conflicting criteria across applications like robotics, transportation, and operations research.

Method: The paper conducts a survey of developments in multi-objective search, examining cross-disciplinary opportunities and analyzing the current state of the field.

Result: The survey identifies renewed interest in MOS across AI applications and highlights the framework’s unifying role in planning and decision-making problems with multiple objectives.

Conclusion: The paper outlines open challenges that define the emerging frontier of multi-objective search, suggesting directions for future research and development in this important area.

Abstract: Multi-objective search (MOS) has emerged as a unifying framework for planning and decision-making problems where multiple, often conflicting, criteria must be balanced. While the problem has been studied for decades, recent years have seen renewed interest in the topic across AI applications such as robotics, transportation, and operations research, reflecting the reality that real-world systems rarely optimize a single measure. This paper surveys developments in MOS while highlighting cross-disciplinary opportunities, and outlines open challenges that define the emerging frontier of MOS

[247] MTIR-SQL: Multi-turn Tool-Integrated Reasoning Reinforcement Learning for Text-to-SQL

Zekun Xu, Siyu Xia, Chuhuai Yue, Jiajun Chai, Mingxue Tian, Xiaohan Wang, Wei Lin, Haoxuan Li, Guojun Yin

Main category: cs.AI

TL;DR: MTIR-SQL is a multi-turn tool-integrated reinforcement learning framework for Text-to-SQL that incorporates dynamic database execution feedback at each reasoning step, enabling progressive query refinement and significantly outperforming existing methods.

Details

Motivation: Existing RL methods for Text-to-SQL primarily rely on static execution feedback, which restricts real-time error correction. The authors propose that integrating multi-turn tool invocation with dynamic feedback could significantly improve adaptability, robustness, and model performance.

Method: The approach introduces an execution-aware multi-turn reasoning paradigm that incorporates database execution feedback at each reasoning step. It extends the GRPO algorithm for multi-turn interaction scenarios, enhanced with trajectory filtering mechanism and removal of KL loss constraints to address training instability and distribution deviation issues.

Result: MTIR-SQL with 4B parameters achieves 64.4% accuracy in BIRD Dev and 84.6% execution accuracy in SPIDER Dev, significantly outperforming existing approaches.

Conclusion: The proposed MTIR-SQL framework effectively addresses the limitations of static execution feedback in Text-to-SQL tasks by enabling dynamic, multi-turn reasoning with real-time database feedback, leading to substantial performance improvements over existing methods.

Abstract: As large language models (LLMs) are increasingly used in Text-to-SQL tasks, Reinforcement Learning (RL) has become a common method for improving performance. Existing methods primarily rely on static execution feedback, which restricts real-time error correction. However, integrating multi-turn tool invocation along with dynamic feedback could significantly improve adaptability and robustness, ultimately enhancing model performance. To address these issues, we propose MTIR-SQL, an innovative Multi-turn Tool-Integrated Reasoning reinforcement learning framework for Text-to-SQL. Our approach introduces an execution-aware multi-turn reasoning paradigm that seamlessly incorporates database execution feedback at each reasoning step, enabling context-sensitive query generation and progressive refinement throughout the reasoning process. The framework extends the GRPO algorithm to accommodate complex multi-turn interaction scenarios. Considering the training instability characteristics of MTIR and the potential for significant Deviation of model distribution from the initial model, we enhance the GRPO algorithm by adding a trajectory filtering mechanism and removing KL loss constraints. Experimental results demonstrate that MTIR-SQL, with 4B parameters, achieves \textbf{64.4}% accuracy in the BIRD Dev and 84.6% execution accuracy in the SPIDER Dev, significantly outperforming existing approaches.

[248] Predicate Renaming via Large Language Models

Elisabetta Gentili, Tony Ribeiro, Fabrizio Riguzzi, Katsumi Inoue

Main category: cs.AI

TL;DR: Using LLMs to name unnamed predicates in logic rules to improve readability and interpretability.

Details

Motivation: Unnamed predicates in logic rules hinder readability, interpretability, and reusability, especially in Predicate Invention contexts.

Method: Leverage LLMs’ natural language and code processing capabilities to suggest semantically meaningful names for unnamed predicates.

Result: Evaluation on hand-crafted logic rules shows LLMs have potential for this naming task.

Conclusion: LLMs show promise for automatically naming predicates in logic rules to enhance theory usability.

Abstract: In this paper, we address the problem of giving names to predicates in logic rules using Large Language Models (LLMs). In the context of Inductive Logic Programming, various rule generation methods produce rules containing unnamed predicates, with Predicate Invention being a key example. This hinders the readability, interpretability, and reusability of the logic theory. Leveraging recent advancements in LLMs development, we explore their ability to process natural language and code to provide semantically meaningful suggestions for giving a name to unnamed predicates. The evaluation of our approach on some hand-crafted logic rules indicates that LLMs hold potential for this task.

[249] Retrieval Augmented Generation (RAG) for Fintech: Agentic Design and Evaluation

Thomas Cook, Richard Osuagwu, Liman Tsatiashvili, Vrynsia Vrynsia, Koustav Ghosal, Maraim Masoud, Riccardo Mattivi

Main category: cs.AI

TL;DR: Agentic RAG architecture for fintech domains improves retrieval precision and relevance through specialized agents for query reformulation, sub-query decomposition, acronym resolution, and context re-ranking, though with increased latency.

Details

Motivation: Standard RAG systems struggle in specialized domains like fintech due to domain-specific ontologies, dense terminology, and acronyms that complicate effective retrieval and synthesis.

Method: Modular pipeline of specialized agents supporting intelligent query reformulation, iterative sub-query decomposition guided by keyphrase extraction, contextual acronym resolution, and cross-encoder-based context re-ranking.

Result: Outperforms standard RAG baseline in retrieval precision and relevance on a curated dataset of 85 question-answer-reference triples from enterprise fintech knowledge base, but with increased latency.

Conclusion: Structured, multi-agent methodologies offer a promising direction for enhancing retrieval robustness in complex, domain-specific settings.

Abstract: Retrieval-Augmented Generation (RAG) systems often face limitations in specialized domains such as fintech, where domain-specific ontologies, dense terminology, and acronyms complicate effective retrieval and synthesis. This paper introduces an agentic RAG architecture designed to address these challenges through a modular pipeline of specialized agents. The proposed system supports intelligent query reformulation, iterative sub-query decomposition guided by keyphrase extraction, contextual acronym resolution, and cross-encoder-based context re-ranking. We evaluate our approach against a standard RAG baseline using a curated dataset of 85 question–answer–reference triples derived from an enterprise fintech knowledge base. Experimental results demonstrate that the agentic RAG system outperforms the baseline in retrieval precision and relevance, albeit with increased latency. These findings suggest that structured, multi-agent methodologies offer a promising direction for enhancing retrieval robustness in complex, domain-specific settings.

[250] Zero Reinforcement Learning Towards General Domains

Yuyuan Zeng, Yufei Huang, Can Xu, Qingfeng Sun, Jianfeng Yan, Guanghui Xu, Tao Yang, Fengzong Lian

Main category: cs.AI

TL;DR: Zero-RL enhances LLM reasoning without supervised fine-tuning, but current methods focus on verifiable domains. This paper proposes a novel zero-RL paradigm that combines verifiable rewards with generative reward models for multi-task training across both verifiable and non-verifiable domains, using smooth length penalty to prevent reward hacking.

Details

Motivation: Current zero-RL research primarily focuses on domains with easily verifiable reward signals (mathematics, programming), leaving more diverse scenarios with non-verifiable rewards underexplored. There's a need to improve reasoning abilities across both verifiable and non-verifiable domains.

Method: Proposes a zero-RL paradigm combining verifiable rewards with generative reward models for multi-task training across both domains. Uses smooth length penalty to mitigate reward hacking and encourage comprehensive thinking tokens in general domains.

Result: Experimental results on Qwen3-8B-Base and Qwen3-14B-Base show superior reasoning performance on both tasks requiring extensive reasoning and more general tasks.

Conclusion: The proposed approach successfully transfers reasoning capabilities between verifiable and non-verifiable domains, achieving improved performance across diverse reasoning scenarios through multi-task zero-RL training.

Abstract: Zero Reinforcement Learning (Zero-RL) has proven to be an effective approach for enhancing the reasoning capabilities of large language models (LLMs) by directly applying reinforcement learning with verifiable rewards on pretrained models, without the need for a supervised fine-tuning phase. However, current research on zero-RL primarily focuses on domains with easily verifiable reward signals, such as mathematics, programming, and other reasoning tasks. The challenge of eliciting reasoning abilities in more diverse scenarios, where verification is not straightforward, remains underexplored. To address this gap, we propose a novel zero-RL paradigm designed to improve a model’s reasoning ability across both verifiable and non-verifiable domains. By combining verifiable rewards with a generative reward model, we conduct multi-task zero-RL training across both domains, facilitating the transfer of reasoning capabilities between them. Furthermore, to mitigate reward hacking in the generative reward model, we design a smooth length penalty that encourages the generation of more comprehensive thinking tokens in general domains. Experimental results on Qwen3-8B-Base and Qwen3-14B-Base demonstrate that our approach achieves superior reasoning performance, not only on tasks requiring extensive reasoning but also on more general tasks.

[251] Off-policy Reinforcement Learning with Model-based Exploration Augmentation

Likun Wang, Xiangteng Zhang, Yinuo Wang, Guojian Zhan, Wenxuan Wang, Haoyu Gao, Jingliang Duan, Shengbo Eben Li

Main category: cs.AI

TL;DR: MoGE is a novel exploration method that generates under-explored critical states using diffusion models and creates dynamics-consistent experiences through transition models to enhance RL exploration without changing core algorithms.

Details

Motivation: Existing exploration methods have limitations: active exploration struggles in high-dimensional environments, while passive exploration is constrained by limited sample diversity in replay buffers.

Method: MoGE consists of two components: (1) a diffusion-based generator that synthesizes critical states guided by a utility function evaluating state influence on policy exploration, and (2) a one-step imagination world model for constructing critical transitions based on the generated states.

Result: Empirical results on OpenAI Gym and DeepMind Control Suite show that MoGE effectively bridges exploration and policy learning, achieving significant improvements in both sample efficiency and performance across complex control tasks.

Conclusion: MoGE provides a modular approach that seamlessly integrates with existing off-policy RL algorithms to improve exploration without altering their core structures, addressing limitations in current exploration methods.

Abstract: Exploration is fundamental to reinforcement learning (RL), as it determines how effectively an agent discovers and exploits the underlying structure of its environment to achieve optimal performance. Existing exploration methods generally fall into two categories: active exploration and passive exploration. The former introduces stochasticity into the policy but struggles in high-dimensional environments, while the latter adaptively prioritizes transitions in the replay buffer to enhance exploration, yet remains constrained by limited sample diversity. To address the limitation in passive exploration, we propose Modelic Generative Exploration (MoGE), which augments exploration through the generation of under-explored critical states and synthesis of dynamics-consistent experiences through transition models. MoGE is composed of two components: (1) a diffusion-based generator that synthesizes critical states under the guidance of a utility function evaluating each state’s potential influence on policy exploration, and (2) a one-step imagination world model for constructing critical transitions based on the critical states for agent learning. Our method adopts a modular formulation that aligns with the principles of off-policy learning, allowing seamless integration with existing algorithms to improve exploration without altering their core structures. Empirical results on OpenAI Gym and DeepMind Control Suite reveal that MoGE effectively bridges exploration and policy learning, leading to remarkable gains in both sample efficiency and performance across complex control tasks.

[252] Standardization of Psychiatric Diagnoses – Role of Fine-tuned LLM Consortium and OpenAI-gpt-oss Reasoning LLM Enabled Decision Support System

Eranga Bandara, Ross Gore, Atmaram Yarlagadda, Anita H. Clayton, Preston Samuel, Christopher K. Rhea, Sachin Shetty

Main category: cs.AI

TL;DR: A fine-tuned LLM consortium integrated with OpenAI-gpt-oss reasoning LLM for standardized mental health diagnosis, addressing variability in psychiatric evaluations.

Details

Motivation: To address the subjectivity and variability in psychiatric diagnoses across clinicians and patients, which leads to inconsistencies and challenges in achieving reliable outcomes.

Method: Leverages fine-tuned LLMs trained on psychiatrist-patient interaction datasets, aggregates predictions through consensus-based decision-making, and refines with OpenAI-gpt-oss reasoning LLM. Uses LLM agents to orchestrate communication between consortium and reasoning model.

Result: Experimental results demonstrate transformative potential with robust and highly accurate diagnostic system. A prototype was developed in collaboration with U.S. Army Medical Research Team.

Conclusion: This represents the first application of fine-tuned LLM consortium integrated with reasoning LLM for clinical mental health diagnosis, paving way for next-generation AI-powered eHealth systems to standardize psychiatric diagnoses.

Abstract: The diagnosis of most mental disorders, including psychiatric evaluations, primarily depends on dialogues between psychiatrists and patients. This subjective process can lead to variability in diagnoses across clinicians and patients, resulting in inconsistencies and challenges in achieving reliable outcomes. To address these issues and standardize psychiatric diagnoses, we propose a Fine-Tuned Large Language Model (LLM) Consortium and OpenAI-gpt-oss Reasoning LLM-enabled Decision Support System for the clinical diagnosis of mental disorders. Our approach leverages fine-tuned LLMs trained on conversational datasets involving psychiatrist-patient interactions focused on mental health conditions (e.g., depression). The diagnostic predictions from individual models are aggregated through a consensus-based decision-making process, refined by the OpenAI-gpt-oss reasoning LLM. We propose a novel method for deploying LLM agents that orchestrate communication between the LLM consortium and the reasoning LLM, ensuring transparency, reliability, and responsible AI across the entire diagnostic workflow. Experimental results demonstrate the transformative potential of combining fine-tuned LLMs with a reasoning model to create a robust and highly accurate diagnostic system for mental health assessment. A prototype of the proposed platform, integrating three fine-tuned LLMs with the OpenAI-gpt-oss reasoning LLM, was developed in collaboration with the U.S. Army Medical Research Team in Norfolk, Virginia, USA. To the best of our knowledge, this work represents the first application of a fine-tuned LLM consortium integrated with a reasoning LLM for clinical mental health diagnosis paving the way for next-generation AI-powered eHealth systems aimed at standardizing psychiatric diagnoses.

Federica Tonti, Ricardo Vinuesa

Main category: cs.AI

TL;DR: A flow-aware Deep Reinforcement Learning approach using PPO+GTrXL for optimal UAV navigation in turbulent urban flows, outperforming traditional methods and other RL variants.

Details

Motivation: UAVs are increasingly used in urban areas for delivery and surveillance, requiring robust navigation strategies to handle complex turbulent flow environments with recirculation zones.

Method: Proximal Policy Optimization (PPO) combined with Gated Transformer eXtra Large (GTrXL) architecture, enhanced with secondary prediction tasks to provide richer information about turbulent flow fields. Tested in 3D high-fidelity urban flow simulations.

Result: Significant increase in success rate (SR) and lower crash rate (CR) compared to PPO+LSTM, PPO+GTrXL without secondary tasks, and traditional Zermelo’s navigation algorithm.

Conclusion: The flow-aware PPO+GTrXL approach enables more effective UAV navigation in complex urban environments, potentially transforming UAV operations in turbulent urban settings.

Abstract: Unmanned Aerial Vehicles (UAVs) are increasingly populating urban areas for delivery and surveillance purposes. In this work, we develop an optimal navigation strategy based on Deep Reinforcement Learning. The environment is represented by a three-dimensional high-fidelity simulation of an urban flow, characterized by turbulence and recirculation zones. The algorithm presented here is a flow-aware Proximal Policy Optimization (PPO) combined with a Gated Transformer eXtra Large (GTrXL) architecture, giving the agent richer information about the turbulent flow field in which it navigates. The results are compared with a PPO+GTrXL without the secondary prediction tasks, a PPO combined with Long Short Term Memory (LSTM) cells and a traditional navigation algorithm. The obtained results show a significant increase in the success rate (SR) and a lower crash rate (CR) compared to a PPO+LSTM, PPO+GTrXL and the classical Zermelo’s navigation algorithm, paving the way to a completely reimagined UAV landscape in complex urban environments.

[254] BambooKG: A Neurobiologically-inspired Frequency-Weight Knowledge Graph

Vanya Arikutharam, Arkadiy Ukolov

Main category: cs.AI

TL;DR: BambooKG is a weighted knowledge graph that improves retrieval-augmented generation by capturing non-triplet relationships, reducing information loss and enhancing multi-hop reasoning.

Details

Motivation: Current retrieval-augmented generation treats retrieved chunks independently and struggles with multi-hop reasoning, while traditional knowledge graphs miss non-triplet information.

Method: Introduces BambooKG with frequency-based weights on non-triplet edges, applying the Hebbian principle ‘fire together, wire together’ to reflect link strength.

Result: BambooKG decreases information loss and outperforms existing solutions on both single- and multi-hop reasoning tasks.

Conclusion: The weighted knowledge graph approach with non-triplet edges effectively enhances reasoning capabilities in retrieval-augmented generation systems.

Abstract: Retrieval-Augmented Generation allows LLMs to access external knowledge, reducing hallucinations and ageing-data issues. However, it treats retrieved chunks independently and struggles with multi-hop or relational reasoning, especially across documents. Knowledge graphs enhance this by capturing the relationships between entities using triplets, enabling structured, multi-chunk reasoning. However, these tend to miss information that fails to conform to the triplet structure. We introduce BambooKG, a knowledge graph with frequency-based weights on non-triplet edges which reflect link strength, drawing on the Hebbian principle of “fire together, wire together”. This decreases information loss and results in improved performance on single- and multi-hop reasoning, outperforming the existing solutions.

[255] TheraMind: A Strategic and Adaptive Agent for Longitudinal Psychological Counseling

He Hu, Yucheng Zhou, Chiyuan Ma, Qianning Wang, Zheng Zhang, Fei Ma, Laizhong Cui, Qi Tian

Main category: cs.AI

TL;DR: TheraMind introduces a dual-loop architecture for longitudinal psychological counseling that separates tactical dialogue management from strategic therapeutic planning, enabling emotional understanding and adaptive therapy across multiple sessions.

Details

Motivation: Existing LLM approaches in psychological counseling lack emotional understanding, adaptive strategies, and long-term memory across multiple sessions, making them inadequate for real clinical practice.

Method: A novel dual-loop architecture with Intra-Session Loop for tactical dialogue management (emotional state perception, dynamic strategy selection) and Cross-Session Loop for strategic therapeutic planning (efficacy evaluation, method adjustment across sessions).

Result: TheraMind outperforms other methods, especially on multi-session metrics like Coherence, Flexibility, and Therapeutic Attunement, validated in high-fidelity simulation environment based on real clinical cases.

Conclusion: The dual-loop design effectively emulates strategic, adaptive, and longitudinal therapeutic behavior, bridging the gap between AI counseling and real clinical practice.

Abstract: Large language models (LLMs) in psychological counseling have attracted increasing attention. However, existing approaches often lack emotional understanding, adaptive strategies, and the use of therapeutic methods across multiple sessions with long-term memory, leaving them far from real clinical practice. To address these critical gaps, we introduce TheraMind, a strategic and adaptive agent for longitudinal psychological counseling. The cornerstone of TheraMind is a novel dual-loop architecture that decouples the complex counseling process into an Intra-Session Loop for tactical dialogue management and a Cross-Session Loop for strategic therapeutic planning. The Intra-Session Loop perceives the patient’s emotional state to dynamically select response strategies while leveraging cross-session memory to ensure continuity. Crucially, the Cross-Session Loop empowers the agent with long-term adaptability by evaluating the efficacy of the applied therapy after each session and adjusting the method for subsequent interactions. We validate our approach in a high-fidelity simulation environment grounded in real clinical cases. Extensive evaluations show that TheraMind outperforms other methods, especially on multi-session metrics like Coherence, Flexibility, and Therapeutic Attunement, validating the effectiveness of its dual-loop design in emulating strategic, adaptive, and longitudinal therapeutic behavior. The code is publicly available at https://0mwwm0.github.io/TheraMind/.

[256] Brain-inspired Computational Intelligence via Predictive Coding

Tommaso Salvatori, Ankur Mali, Christopher L. Buckley, Thomas Lukasiewicz, Rajesh P. N. Rao, Karl Friston, Alexander Ororbia

Main category: cs.AI

TL;DR: Survey of predictive coding (PC) as a biologically plausible alternative to backpropagation for deep neural networks, covering history, current developments, and future directions.

Details

Motivation: Backpropagation is biologically implausible, while predictive coding offers a neuroscience-inspired approach with promising properties for machine learning.

Method: Comprehensive survey of predictive coding literature, providing historical overview, current research efforts, and analysis of recent developments.

Result: PC shows potential for modeling brain information processing, robotics applications, variational inference foundation, and asynchronous computation.

Conclusion: Predictive coding represents a promising biologically plausible alternative to backpropagation with broad implications for machine learning and AI development.

Abstract: Artificial intelligence (AI) is rapidly becoming one of the key technologies of this century. The majority of results in AI thus far have been achieved using deep neural networks trained with a learning algorithm called error backpropagation, always considered biologically implausible. To this end, recent works have studied learning algorithms for deep neural networks inspired by the neurosciences. One such theory, called predictive coding (PC), has shown promising properties that make it potentially valuable for the machine learning community: it can model information processing in different areas of the brain, can be used in control and robotics, has a solid mathematical foundation in variational inference, and performs its computations asynchronously. Inspired by such properties, works that propose novel PC-like algorithms are starting to be present in multiple sub-fields of machine learning and AI at large. Here, we survey such efforts by first providing a broad overview of the history of PC to provide common ground for the understanding of the recent developments, then by describing current efforts and results, and concluding with a large discussion of possible implications and ways forward.

[257] CURATRON: Complete and Robust Preference Data for Rigorous Alignment of Large Language Models

Son The Nguyen, Niranjan Uma Naresh, Theja Tulabandhula

Main category: cs.AI

TL;DR: Proposes a robust ranking algorithm to handle incomplete and corrupted data in LLM preference learning, enabling recovery of optimal rankings despite adversarial noise and missing comparisons.

Details

Motivation: Address challenges in aligning LLMs with human values via preference learning when datasets contain incomplete and corrupted pairwise comparison data.

Method: Develops a guaranteed polynomial-time ranking algorithm that robustifies existing models like BTL and its generalizations, allowing recovery of ε-optimal rankings with high probability even with O(n) perturbed comparisons per response.

Result: Algorithm successfully handles adversarial noise and unobserved comparisons in both general and LLM preference datasets, with robust recovery in partially observed settings.

Conclusion: Provides dataset curation pipeline with ability to handle missing and manipulated inputs, contributing to development of more reliable and ethically aligned AI models.

Abstract: This paper addresses the challenges of aligning large language models (LLMs) with human values via preference learning (PL), focusing on incomplete and corrupted data in preference datasets. We propose a novel method for robustly and completely recalibrating values within these datasets to enhance LLMs’ resilience against the issues. In particular, we devise a guaranteed polynomial time ranking algorithm that robustifies several existing models, such as the classic Bradley-Terry-Luce (BTL) (Bradley and Terry, 1952) model and certain generalizations of it. To the best of our knowledge, our present work is the first to propose an algorithm that provably recovers an $\epsilon$-optimal ranking with high probability while allowing as large as $O(n)$ perturbed pairwise comparison results per model response. Furthermore, we show robust recovery results in the partially observed setting. Our experiments confirm that our algorithms handle adversarial noise and unobserved comparisons well in both general and LLM preference dataset settings. This work contributes to the development and scaling of more reliable and ethically aligned AI models by equipping the dataset curation pipeline with the ability to handle missing and maliciously manipulated inputs.

[258] TraveLLM: Could you plan my new public transit route in face of a network disruption?

Bowen Fang, Zixiao Yang, Xuan Di

Main category: cs.AI

TL;DR: TraveLLM uses LLMs for disruption-aware public transit routing, processing multimodal queries to generate context-aware navigation plans that handle real-time disruptions and user constraints.

Details

Motivation: Existing navigation systems fail during urban disruptions and struggle to incorporate real-time events and complex user constraints like area avoidance.

Method: Leverage LLMs’ reasoning capabilities to process multimodal user queries combining natural language requests with map data, and benchmark performance of state-of-the-art LLMs including GPT-4, Claude 3, and Gemini.

Result: LLMs, notably GPT-4, can effectively generate viable and context-aware navigation plans under demanding disruption conditions.

Conclusion: LLMs show promise for building more flexible and intelligent navigation systems capable of handling dynamic disruptions and diverse user needs.

Abstract: Existing navigation systems often fail during urban disruptions, struggling to incorporate real-time events and complex user constraints, such as avoiding specific areas. We address this gap with TraveLLM, a system using Large Language Models (LLMs) for disruption-aware public transit routing. We leverage LLMs’ reasoning capabilities to directly process multimodal user queries combining natural language requests (origin, destination, preferences, disruption info) with map data (e.g., subway, bus, bike-share). To evaluate this approach, we design challenging test scenarios reflecting real-world disruptions like weather events, emergencies, and dynamic service availability. We benchmark the performance of state-of-the-art LLMs, including GPT-4, Claude 3, and Gemini, on generating accurate travel plans. Our experiments demonstrate that LLMs, notably GPT-4, can effectively generate viable and context-aware navigation plans under these demanding conditions. These findings suggest a promising direction for using LLMs to build more flexible and intelligent navigation systems capable of handling dynamic disruptions and diverse user needs.

[259] SNN-Based Online Learning of Concepts and Action Laws in an Open World

Christel Grimaud, Dominique Longin, Andreas Herzig

Main category: cs.AI

TL;DR: A fully autonomous cognitive agent using spiking neural networks for semantic memory that learns object/situation and action concepts in one-shot manner, enabling decision-making through outcome prediction.

Details

Motivation: To create a bio-inspired cognitive agent capable of autonomous exploration and learning of concepts about its environment and actions, enabling adaptive decision-making in dynamic situations.

Method: Uses spiking neural network (SNN) for semantic memory implementation. Learns object/situation concepts as unary and action concepts as triples (initial situation, motor activity, outcome). Employs one-shot learning and queries semantic memory for expected outcomes to make decisions.

Result: The agent successfully handles new situations by leveraging previously learned general concepts and rapidly adapts its concepts to environmental changes through one-shot learning.

Conclusion: The bio-inspired cognitive agent with SNN-based semantic memory demonstrates effective autonomous learning and decision-making capabilities, showing adaptability to new situations and environment changes through concept generalization and modification.

Abstract: We present the architecture of a fully autonomous, bio-inspired cognitive agent built around a spiking neural network (SNN) implementing the agent’s semantic memory. This agent explores its universe and learns concepts of objects/situations and of its own actions in a one-shot manner. While object/situation concepts are unary, action concepts are triples made up of an initial situation, a motor activity, and an outcome. They embody the agent’s knowledge of its universe’s action laws. Both kinds of concepts have different degrees of generality. To make decisions the agent queries its semantic memory for the expected outcomes of envisaged actions and chooses the action to take on the basis of these predictions. Our experiments show that the agent handles new situations by appealing to previously learned general concepts and rapidly modifies its concepts to adapt to environment changes.

[260] Creativity or Brute Force? Using Brainteasers as a Window into the Problem-Solving Abilities of Large Language Models

Simeng Han, Howard Dai, Stephen Xia, Grant Zhang, Chen Liu, Lichang Chen, Hoang Huy Nguyen, Hongyuan Mei, Jiayuan Mao, R. Thomas McCoy

Main category: cs.AI

TL;DR: A benchmark using brainteasers to evaluate LLMs’ reasoning strategies beyond accuracy, focusing on solution quality and creativity.

Details

Motivation: Accuracy alone provides limited insight into how models solve problems. Brainteasers allow analysis of different reasoning approaches (creative vs brute force) that models employ.

Method: Evaluated LLMs across multiple reasoning layers: semantic parsing of brainteasers into mathematical formats, solution generation, self-correction, step-by-step solution sketches, and hint utilization.

Result: LLMs can find creative, insightful solutions to brainteasers, demonstrating capacity for novel problem-solving. However, they sometimes use brute force when more efficient creative solutions are available.

Conclusion: LLMs capture some creative problem-solving abilities but need improvement in consistently choosing efficient reasoning strategies over brute force approaches.

Abstract: Accuracy remains a standard metric for evaluating AI systems, but it offers limited insight into how models arrive at their solutions. In this work, we introduce a benchmark based on brainteasers written in long narrative form to probe more deeply into the types of reasoning strategies that models use. Brainteasers are well-suited for this goal because they can be solved with multiple approaches, such as a few-step solution that uses a creative insight or a longer solution that uses more brute force. We investigate large language models (LLMs) across multiple layers of reasoning, focusing not only on correctness but also on the quality and creativity of their solutions. We investigate many aspects of the reasoning process: (1) semantic parsing of the brainteasers into precise mathematical competition style formats; (2) generating solutions from these mathematical forms; (3) self-correcting solutions based on gold solutions; (4) producing step-by-step sketches of solutions; and (5) making use of hints. We find that LLMs are in many cases able to find creative, insightful solutions to brainteasers, suggesting that they capture some of the capacities needed to solve novel problems in creative ways. Nonetheless, there also remain situations where they rely on brute force despite the availability of more efficient, creative solutions, highlighting a potential direction for improvement in the reasoning abilities of LLMs.

[261] Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models

Jiaqi Wang, Kevin Qinghong Lin, James Cheng, Mike Zheng Shou

Main category: cs.AI

TL;DR: TON is a two-stage training method that enables vision-language models to selectively reason only when necessary, reducing computational costs by up to 90% while maintaining or improving performance.

Details

Motivation: Current RL methods like GRPO force models to always generate full reasoning traces, leading to unnecessary token usage and computational costs. The goal is to enable human-like selective reasoning where models only think when needed.

Method: Two-stage training: (1) SFT with ’thought dropout’ that randomly replaces reasoning traces with empty thoughts, creating a think-or-not format; (2) GRPO stage that allows models to freely explore when to think while maximizing task rewards.

Result: TON reduces completion length by up to 90% compared to vanilla GRPO without performance loss. Models progressively learn to skip unnecessary reasoning across various tasks (GSM8K, CLEVR, Super-CLEVR, GeoQA, AITZ) and model sizes (3B, 7B).

Conclusion: TON enables human-like selective reasoning patterns in VLMs, demonstrating that models can learn to bypass unnecessary reasoning steps while maintaining performance, paving the way for more efficient RL approaches.

Abstract: Reinforcement Learning (RL) has proven to be an effective post-training strategy for enhancing reasoning in vision-language models (VLMs). Group Relative Policy Optimization (GRPO) is a recent prominent method that encourages models to generate complete reasoning traces before answering, leading to increased token usage and computational cost. Inspired by the human-like thinking process-where people skip reasoning for easy questions but think carefully when needed-we explore how to enable VLMs to first decide when reasoning is necessary. To realize this, we propose TON, a two-stage training strategy: (i) a supervised fine-tuning (SFT) stage with a simple yet effective ’thought dropout’ operation, where reasoning traces are randomly replaced with empty thoughts. This introduces a think-or-not format that serves as a cold start for selective reasoning; (ii) a GRPO stage that enables the model to freely explore when to think or not, while maximizing task-aware outcome rewards. Experimental results show that TON can reduce the completion length by up to 90% compared to vanilla GRPO, without sacrificing performance or even improving it. Further evaluations across LLM (GSM8K), VLM (CLEVR, Super-CLEVR, GeoQA), and Agentic (AITZ) tasks-covering a range of reasoning difficulties under both 3B and 7B models-consistently reveal that the model progressively learns to bypass unnecessary reasoning steps as training advances. These findings shed light on the path toward human-like reasoning patterns in RL approaches. Our code is available at https://github.com/kokolerk/TON.

[262] Integrating Counterfactual Simulations with Language Models for Explaining Multi-Agent Behaviour

Bálint Gyevnár, Christopher G. Lucas, Stefano V. Albrecht, Shay B. Cohen

Main category: cs.AI

TL;DR: AXIS is a framework that uses LLMs to generate human-centered explanations for multi-agent systems by interrogating environment simulators with counterfactual queries.

Details

Motivation: Autonomous multi-agent systems raise trust concerns due to risks like miscoordination and goal misalignment, requiring explainability for user trust calibration.

Method: AXIS leverages counterfactual effect size model and LLMs to interrogate environment simulators using ‘whatif’ and ‘remove’ prompts to generate multi-round counterfactual explanations.

Result: AXIS improves perceived explanation correctness by at least 7.7% across all models and goal prediction accuracy by 23% for four models, with comparable action prediction accuracy.

Conclusion: AXIS achieves the highest overall scores in explanation quality and effectively addresses trust concerns in multi-agent systems through human-centered counterfactual explanations.

Abstract: Autonomous multi-agent systems (MAS) are useful for automating complex tasks but raise trust concerns due to risks such as miscoordination or goal misalignment. Explainability is vital for users’ trust calibration, but explainable MAS face challenges due to complex environments, the human factor, and non-standardised evaluation. Leveraging the counterfactual effect size model and LLMs, we propose Agentic eXplanations via Interrogative Simulation (AXIS). AXIS generates human-centred action explanations for multi-agent policies by having an LLM interrogate an environment simulator using prompts like ‘whatif’ and ‘remove’ to observe and synthesise counterfactual information over multiple rounds. We evaluate AXIS on autonomous driving across ten scenarios for five LLMs with a comprehensive methodology combining robustness, subjective preference, correctness, and goal/action prediction with an external LLM as evaluator. Compared to baselines, AXIS improves perceived explanation correctness by at least 7.7% across all models and goal prediction accuracy by 23% for four models, with comparable action prediction accuracy, achieving the highest scores overall. Our code is open-sourced at https://github.com/gyevnarb/axis.

[263] PatientSim: A Persona-Driven Simulator for Realistic Doctor-Patient Interactions

Daeun Kyung, Hyunseung Chung, Seongsu Bae, Jiho Kim, Jae Ho Sohn, Taerim Kim, Soo Kyung Kim, Edward Choi

Main category: cs.AI

TL;DR: PatientSim is a patient simulator that generates diverse patient personas for clinical scenarios using clinical profiles from MIMIC datasets and four persona axes, validated by clinicians and serving as a testbed for medical dialogue systems.

Details

Motivation: Existing patient simulators fail to reflect the full range of personas seen in clinical practice, creating a need for realistic patient interaction systems for training and evaluating doctor LLMs.

Method: Uses clinical profiles from MIMIC-ED and MIMIC-IV datasets, and defines personas along four axes: personality, language proficiency, medical history recall level, and cognitive confusion level (37 combinations total).

Result: Evaluated eight LLMs for factual accuracy and persona consistency, with Llama 3.3 70B as top-performing open-source model. Validated by four clinicians confirming framework robustness.

Conclusion: PatientSim provides a reproducible, scalable, privacy-compliant solution for evaluating medical dialogue systems and shows promise as an educational tool for healthcare.

Abstract: Doctor-patient consultations require multi-turn, context-aware communication tailored to diverse patient personas. Training or evaluating doctor LLMs in such settings requires realistic patient interaction systems. However, existing simulators often fail to reflect the full range of personas seen in clinical practice. To address this, we introduce PatientSim, a patient simulator that generates realistic and diverse patient personas for clinical scenarios, grounded in medical expertise. PatientSim operates using: 1) clinical profiles, including symptoms and medical history, derived from real-world data in the MIMIC-ED and MIMIC-IV datasets, and 2) personas defined by four axes: personality, language proficiency, medical history recall level, and cognitive confusion level, resulting in 37 unique combinations. We evaluate eight LLMs for factual accuracy and persona consistency. The top-performing open-source model, Llama 3.3 70B, is validated by four clinicians to confirm the robustness of our framework. As an open-source, customizable platform, PatientSim provides a reproducible and scalable solution that can be customized for specific training needs. Offering a privacy-compliant environment, it serves as a robust testbed for evaluating medical dialogue systems across diverse patient presentations and shows promise as an educational tool for healthcare. The code is available at https://github.com/dek924/PatientSim.

[264] Adaptive Frontier Exploration on Graphs with Applications to Network-Based Disease Testing

Davin Choo, Yuqi Pan, Tonghan Wang, Milind Tambe, Alastair van Heerden, Cheryl Johnson

Main category: cs.AI

TL;DR: A sequential decision-making problem on graphs with unknown node labels, where nodes must be selected adaptively under frontier exploration constraints to maximize discounted rewards. The paper proposes a Gittins index-based policy that is optimal for tree graphs and efficient to implement.

Details

Motivation: Address practical constraints in applications like contact tracing and robotic exploration where actions are limited to neighbors of previously selected nodes (frontier exploration), requiring adaptive node selection strategies for unknown label distributions.

Method: Design a Gittins index-based policy that applies to general graphs, with provable optimality when the graph is a forest. The implementation runs in O(n²·|Ω|²) time with O(n·|Ω|²) oracle calls to the distribution P and O(n²·|Ω|) space.

Result: The proposed method consistently outperforms natural baselines in synthetic and real-world graphs, including non-tree, budget-limited, and undiscounted settings. In HIV testing simulations on sexual interaction networks, it detects nearly all positive cases with only half the population tested.

Conclusion: The Gittins index-based policy provides an effective solution for sequential decision-making under frontier exploration constraints, demonstrating strong performance across various graph structures and practical applications.

Abstract: We study a sequential decision-making problem on a $n$-node graph $\mathcal{G}$ where each node has an unknown label from a finite set $\mathbf{\Omega}$, drawn from a joint distribution $\mathcal{P}$ that is Markov with respect to $\mathcal{G}$. At each step, selecting a node reveals its label and yields a label-dependent reward. The goal is to adaptively choose nodes to maximize expected accumulated discounted rewards. We impose a frontier exploration constraint, where actions are limited to neighbors of previously selected nodes, reflecting practical constraints in settings such as contact tracing and robotic exploration. We design a Gittins index-based policy that applies to general graphs and is provably optimal when $\mathcal{G}$ is a forest. Our implementation runs in $\mathcal{O}(n^2 \cdot |\mathbf{\Omega}|^2)$ time while using $\mathcal{O}(n \cdot |\mathbf{\Omega}|^2)$ oracle calls to $\mathcal{P}$ and $\mathcal{O}(n^2 \cdot |\mathbf{\Omega}|)$ space. Experiments on synthetic and real-world graphs show that our method consistently outperforms natural baselines, including in non-tree, budget-limited, and undiscounted settings. For example, in HIV testing simulations on real-world sexual interaction networks, our policy detects nearly all positive cases with only half the population tested, substantially outperforming other baselines.

[265] Cite Pretrain: Retrieval-Free Knowledge Attribution for Large Language Models

Yukun Huang, Sanxing Chen, Jian Pei, Manzil Zaheer, Bhuwan Dhingra

Main category: cs.AI

TL;DR: The paper proposes Active Indexing, a two-stage training method that enables LLMs to generate reliable citations without test-time retrieval by binding factual knowledge to document identifiers during continual pretraining and then instruction tuning.

Details

Motivation: Current citation systems rely on external retrievers at inference time, which introduces latency, infrastructure dependence, and vulnerability to retrieval noise. The goal is to make LLMs reliably attribute to documents seen during training without test-time retrieval.

Method: Two-stage approach: (1) Continual pretraining with Active Indexing that creates source-anchored bindings using synthetic data with diverse restatements and bidirectional training (source-to-fact and fact-to-source); (2) Instruction tuning to elicit citation behavior. Benchmark uses CitePretrainBench mixing real-world corpora with novel documents.

Result: Active Indexing consistently outperforms Passive Indexing baseline, achieving citation precision gains up to 30.2% across all tasks and models. Performance improves with more augmented data, showing upward trend even at 16x original token count. Internal citations complement external ones by improving robustness to retrieval noise.

Conclusion: LLMs can be trained to generate reliable citations without test-time retrieval through proper training methodology. Active Indexing effectively creates generalizable, source-anchored bindings that improve citation accuracy and robustness to paraphrase and composition.

Abstract: Trustworthy language models should provide both correct and verifiable answers. However, citations generated directly by standalone LLMs are often unreliable. As a result, current systems insert citations by querying an external retriever at inference time, introducing latency, infrastructure dependence, and vulnerability to retrieval noise. We explore whether LLMs can be made to reliably attribute to the documents seen during continual pretraining without test-time retrieval, by revising the training process. To study this, we construct CitePretrainBench, a benchmark that mixes real-world corpora (Wikipedia, Common Crawl, arXiv) with novel documents and probes both short-form (single-fact) and long-form (multi-fact) citation tasks. Our approach follows a two-stage process: (1) continual pretraining to index factual knowledge by binding it to persistent document identifiers; and (2) instruction tuning to elicit citation behavior. We introduce Active Indexing for the first stage, which creates generalizable, source-anchored bindings by augmenting training with synthetic data that (i) restate each fact in diverse, compositional forms and (ii) enforce bidirectional training (source-to-fact and fact-to-source). This equips the model to both generate content from a cited source and attribute its own answers, improving robustness to paraphrase and composition. Experiments with Qwen-2.5-7B&3B show that Active Indexing consistently outperforms a Passive Indexing baseline, which simply appends an identifier to each document, achieving citation precision gains of up to 30.2% across all tasks and models. Our ablation studies reveal that performance continues to improve as we scale the amount of augmented data, showing a clear upward trend even at 16x the original token count. Finally, we show that internal citations complement external ones by making the model more robust to retrieval noise.

[266] When Truthful Representations Flip Under Deceptive Instructions?

Xianxuan Long, Yao Fu, Runchao Li, Mu Sheng, Haotian Yu, Xiaotian Han, Pan Li

Main category: cs.AI

TL;DR: This paper investigates how deceptive instructions alter the internal representations of LLMs, revealing distinct representational patterns between truthful and deceptive responses that can be detected through linear probes and sparse autoencoders.

Details

Motivation: LLMs can follow malicious instructions to generate deceptive responses, posing safety challenges, but how deceptive instructions alter internal representations compared to truthful ones remains poorly understood beyond output analysis.

Method: Analyzed internal representations of Llama-3.1-8B-Instruct and Gemma-2-9B-Instruct on factual verification tasks using linear probes and Sparse Autoencoders (SAEs) to detect representational shifts and identify deception-sensitive features.

Result: Deceptive instructions induce significant representational shifts concentrated in early-to-mid layers, with specific SAE features highly sensitive to deceptive instructions, revealing distinct truthful/deceptive representational subspaces.

Conclusion: The findings expose feature- and layer-level signatures of deception, offering new insights for detecting and mitigating instructed dishonesty in LLMs through internal representation analysis.

Abstract: Large language models (LLMs) tend to follow maliciously crafted instructions to generate deceptive responses, posing safety challenges. How deceptive instructions alter the internal representations of LLM compared to truthful ones remains poorly understood beyond output analysis. To bridge this gap, we investigate when and how these representations ``flip’’, such as from truthful to deceptive, under deceptive versus truthful/neutral instructions. Analyzing the internal representations of Llama-3.1-8B-Instruct and Gemma-2-9B-Instruct on a factual verification task, we find the model’s instructed True/False output is predictable via linear probes across all conditions based on the internal representation. Further, we use Sparse Autoencoders (SAEs) to show that the Deceptive instructions induce significant representational shifts compared to Truthful/Neutral representations (which are similar), concentrated in early-to-mid layers and detectable even on complex datasets. We also identify specific SAE features highly sensitive to deceptive instruction and use targeted visualizations to confirm distinct truthful/deceptive representational subspaces. % Our analysis pinpoints layer-wise and feature-level correlates of instructed dishonesty, offering insights for LLM detection and control. Our findings expose feature- and layer-level signatures of deception, offering new insights for detecting and mitigating instructed dishonesty in LLMs.

[267] Pentest-R1: Towards Autonomous Penetration Testing Reasoning Optimized via Two-Stage Reinforcement Learning

He Kong, Die Hu, Jingguo Ge, Liangxiong Li, Hui Li, Tong Li

Main category: cs.AI

TL;DR: Pentest-R1 is a reinforcement learning framework that enhances LLMs for automated penetration testing, achieving state-of-the-art performance on cybersecurity benchmarks through a two-stage RL pipeline combining offline learning from attack walkthroughs and online learning in CTF environments.

Details

Motivation: Current LLMs have limitations in penetration testing including poor error handling, inefficient reasoning, and inability to perform complex end-to-end tasks autonomously, which Pentest-R1 aims to address.

Method: Two-stage reinforcement learning pipeline: 1) Offline RL using 500+ real-world multi-step attack walkthroughs to instill foundational attack logic, 2) Online RL fine-tuning in interactive CTF environment to learn error self-correction and adaptive strategies from environmental feedback.

Result: Achieved 24.2% success rate on AutoPenBench (surpassing most SOTA models, second only to Gemini 2.5 Flash) and 15.0% success rate on Cybench in unguided tasks (new SOTA for open-source LLMs, matching top proprietary models).

Conclusion: The synergy of both offline and online RL training stages is critical for Pentest-R1’s success in automated penetration testing, demonstrating the framework’s effectiveness in enhancing LLM reasoning capabilities for cybersecurity tasks.

Abstract: Automating penetration testing is crucial for enhancing cybersecurity, yet current Large Language Models (LLMs) face significant limitations in this domain, including poor error handling, inefficient reasoning, and an inability to perform complex end-to-end tasks autonomously. To address these challenges, we introduce Pentest-R1, a novel framework designed to optimize LLM reasoning capabilities for this task through a two-stage reinforcement learning pipeline. We first construct a dataset of over 500 real-world, multi-step walkthroughs, which Pentest-R1 leverages for offline reinforcement learning (RL) to instill foundational attack logic. Subsequently, the LLM is fine-tuned via online RL in an interactive Capture The Flag (CTF) environment, where it learns directly from environmental feedback to develop robust error self-correction and adaptive strategies. Our extensive experiments on the Cybench and AutoPenBench benchmarks demonstrate the framework’s effectiveness. On AutoPenBench, Pentest-R1 achieves a 24.2% success rate, surpassing most state-of-the-art models and ranking second only to Gemini 2.5 Flash. On Cybench, it attains a 15.0% success rate in unguided tasks, establishing a new state-of-the-art for open-source LLMs and matching the performance of top proprietary models. Ablation studies confirm that the synergy of both training stages is critical to its success.

[268] UGM2N: An Unsupervised and Generalizable Mesh Movement Network via M-Uniform Loss

Zhichao Wang, Xinhai Chen, Qinglin Wang, Xiang Gao, Qingyang Zhang, Menghan Jia, Xiang Zhang, Jie Liu

Main category: cs.AI

TL;DR: UGM2N is an unsupervised mesh movement network that dynamically relocates mesh nodes to rapidly-varying regions using localized geometric feature learning and physics-constrained loss, achieving equation-agnostic generalization across diverse PDEs and mesh topologies.

Details

Motivation: Traditional mesh movement techniques suffer from high computational complexity and geometric inflexibility, while supervised learning approaches struggle with zero-shot generalization across diverse PDEs and mesh topologies.

Method: Unsupervised mesh adaptation through localized geometric feature learning with a physics-constrained M-Uniform loss function that enforces mesh equidistribution at the nodal level.

Result: The network demonstrates equation-agnostic generalization, geometric independence, consistent superiority over existing methods, robust performance across diverse PDEs and mesh geometries, scalability to multi-scale resolutions, and guaranteed error reduction without mesh tangling.

Conclusion: UGM2N provides an effective unsupervised solution for mesh movement that overcomes limitations of traditional approaches and supervised learning methods, enabling efficient and accurate mesh adaptation across various physical systems.

Abstract: Partial differential equations (PDEs) form the mathematical foundation for modeling physical systems in science and engineering, where numerical solutions demand rigorous accuracy-efficiency tradeoffs. Mesh movement techniques address this challenge by dynamically relocating mesh nodes to rapidly-varying regions, enhancing both simulation accuracy and computational efficiency. However, traditional approaches suffer from high computational complexity and geometric inflexibility, limiting their applicability, and existing supervised learning-based approaches face challenges in zero-shot generalization across diverse PDEs and mesh topologies.In this paper, we present an Unsupervised and Generalizable Mesh Movement Network (UGM2N). We first introduce unsupervised mesh adaptation through localized geometric feature learning, eliminating the dependency on pre-adapted meshes. We then develop a physics-constrained loss function, M-Uniform loss, that enforces mesh equidistribution at the nodal level.Experimental results demonstrate that the proposed network exhibits equation-agnostic generalization and geometric independence in efficient mesh adaptation. It demonstrates consistent superiority over existing methods, including robust performance across diverse PDEs and mesh geometries, scalability to multi-scale resolutions and guaranteed error reduction without mesh tangling.

[269] Collab-REC: An LLM-based Agentic Framework for Balancing Recommendations in Tourism

Ashmi Banerjee, Fitri Nur Aisyah, Adithi Satish, Wolfgang Wörndl, Yashar Deldjoo

Main category: cs.AI

TL;DR: Collab-REC is a multi-agent framework using LLM-based agents to improve tourism recommendation diversity and reduce popularity bias through collaborative negotiation.

Details

Motivation: To counteract popularity bias and enhance diversity in tourism recommendations, addressing over-tourism by surfacing lesser-visited locales that are often overlooked.

Method: Three LLM-based agents (Personalization, Popularity, Sustainability) generate city suggestions from complementary perspectives, with a non-LLM moderator merging and refining proposals through multi-round negotiation while penalizing spurious or repeated responses.

Result: Experiments on European city queries show improved diversity and overall relevance compared to single-agent baseline, successfully surfacing lesser-visited locales.

Conclusion: The balanced, context-aware approach addresses over-tourism and better aligns with user constraints, highlighting the promise of multi-stakeholder collaboration in LLM-driven recommender systems.

Abstract: We propose Collab-REC, a multi-agent framework designed to counteract popularity bias and enhance diversity in tourism recommendations. In our setting, three LLM-based agents – Personalization, Popularity, and Sustainability generate city suggestions from complementary perspectives. A non-LLM moderator then merges and refines these proposals via multi-round negotiation, ensuring each agent’s viewpoint is incorporated while penalizing spurious or repeated responses. Experiments on European city queries show that Collab-REC improves diversity and overall relevance compared to a single-agent baseline, surfacing lesser-visited locales that often remain overlooked. This balanced, context-aware approach addresses over-tourism and better aligns with constraints provided by the user, highlighting the promise of multi-stakeholder collaboration in LLM-driven recommender systems.

[270] GradeSQL: Test-Time Inference with Outcome Reward Models for Text-to-SQL Generation from Large Language Models

Mattia Tritto, Giuseppe Farano, Dario Di Palma, Gaetano Rossiello, Fedelucio Narducci, Dharmashankar Subramanian, Tommaso Di Noia

Main category: cs.AI

TL;DR: This paper proposes using Outcome Reward Models (ORMs) as a test-time heuristic for Text-to-SQL tasks, showing they outperform traditional methods like execution-based Best-of-N and Majority Voting.

Details

Motivation: Current LLMs struggle with complex SQL queries, and existing test-time strategies rely on surface-level heuristics rather than semantic correctness.

Method: Developed a unified framework for training ORMs tailored to Text-to-SQL and benchmarked them against ex-BoN and Maj across BIRD and Spider datasets using various LLM families.

Result: ORMs achieved execution accuracy gains of +4.33% on BIRD and +2.10% on Spider over ex-BoN, and +2.91% on BIRD and +0.93% on Spider over Maj.

Conclusion: ORMs are an effective test-time heuristic that outperforms traditional methods, especially when fine-tuned on SQL-aligned models and with increased candidate numbers.

Abstract: Text-to-SQL, the task of translating natural language questions into SQL queries, has significantly advanced with the introduction of Large Language Models (LLMs), broadening database accessibility for a wide range of users. Despite substantial progress in generating valid SQL, current LLMs still struggle with complex queries. To address this limitation, test-time strategies such as Best-of-N (BoN) and Majority Voting (Maj) are often employed, based on the assumption that LLMs can produce correct answers after multiple attempts. However, these methods rely on surface-level heuristics, selecting the syntactically correct query through execution-based BoN (ex-BoN) or the most frequently generated one through Majority Voting. Recently, Outcome Reward Models (ORMs), which assign utility scores to generated outputs based on semantic correctness, have emerged as a promising reinforcement learning approach for improving model alignment. We argue that ORMs could serve as an effective new test-time heuristic, although their application in this context remains largely underexplored. In this work, we propose a unified framework for training ORMs tailored to the Text-to-SQL task and assess their effectiveness as a test-time heuristic within the BoN strategy. We benchmark ORMs against ex-BoN and Maj across the BIRD and Spider datasets, fine-tuning diverse open-source LLMs from the Qwen2, Granite3, and Llama3 families. Results show that ORMs outperform ex-BoN and Maj, achieving execution accuracy gains of +4.33% (BIRD) and +2.10% (Spider) over ex-BoN, and +2.91% (BIRD) and +0.93% (Spider) over Maj. We further demonstrate that finetuning models already aligned with SQL generation, such as OmniSQL, yields superior ORM performance. Additionally, we observe that ORMs achieve competitive results on simple queries and benefit more from an increased number of candidates compared to ex-BoN and Maj.

[271] The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

Guibin Zhang, Hejia Geng, Xiaohang Yu, Zhenfei Yin, Zaibin Zhang, Zelin Tan, Heng Zhou, Zhongzhi Li, Xiangyuan Xue, Yijiang Li, Yifan Zhou, Yang Chen, Chen Zhang, Yutao Fan, Zihu Wang, Songtao Huang, Francisco Piedrahita-Velez, Yue Liao, Hongru Wang, Mengyue Yang, Heng Ji, Michael Littman, Jun Wang, Shuicheng Yan, Philip Torr, Lei Bai

Main category: cs.AI

TL;DR: This survey formalizes the shift from LLM-based RL to Agentic RL, proposing a taxonomy around agent capabilities and applications, with RL as the key mechanism for adaptive behavior.

Details

Motivation: To document and formalize the paradigm shift from treating LLMs as passive sequence generators to viewing them as autonomous decision-making agents in complex environments.

Method: Proposes a twofold taxonomy organized around core agentic capabilities (planning, tool use, memory, reasoning, self-improvement, perception) and their applications across diverse task domains, with RL as the unifying mechanism.

Result: Consolidates over 500 recent works into a practical compendium of open-source environments, benchmarks, and frameworks, charting the contours of this rapidly evolving field.

Conclusion: Agentic RL represents a significant paradigm shift with opportunities and challenges for developing scalable, general-purpose AI agents, with RL serving as the critical mechanism for transforming static capabilities into adaptive agentic behavior.

Abstract: The emergence of agentic reinforcement learning (Agentic RL) marks a paradigm shift from conventional reinforcement learning applied to large language models (LLM RL), reframing LLMs from passive sequence generators into autonomous, decision-making agents embedded in complex, dynamic worlds. This survey formalizes this conceptual shift by contrasting the degenerate single-step Markov Decision Processes (MDPs) of LLM-RL with the temporally extended, partially observable Markov decision processes (POMDPs) that define Agentic RL. Building on this foundation, we propose a comprehensive twofold taxonomy: one organized around core agentic capabilities, including planning, tool use, memory, reasoning, self-improvement, and perception, and the other around their applications across diverse task domains. Central to our thesis is that reinforcement learning serves as the critical mechanism for transforming these capabilities from static, heuristic modules into adaptive, robust agentic behavior. To support and accelerate future research, we consolidate the landscape of open-source environments, benchmarks, and frameworks into a practical compendium. By synthesizing over five hundred recent works, this survey charts the contours of this rapidly evolving field and highlights the opportunities and challenges that will shape the development of scalable, general-purpose AI agents.

[272] Improving Robustness of AlphaZero Algorithms to Test-Time Environment Changes

Isidoro Tamassia, Wendelin Böhmer

Main category: cs.AI

TL;DR: The paper analyzes deploying AlphaZero agents in changed test environments and proposes modifications to boost performance with low planning budgets.

Details

Motivation: AlphaZero assumes static environments from training to test, limiting its applicability when environments change.

Method: Combination of simple modifications to the standard AlphaZero framework to handle environment changes.

Result: Significant performance improvements even with low planning budgets available.

Conclusion: Modified AlphaZero framework enables effective deployment in potentially changed test environments.

Abstract: The AlphaZero framework provides a standard way of combining Monte Carlo planning with prior knowledge provided by a previously trained policy-value neural network. AlphaZero usually assumes that the environment on which the neural network was trained will not change at test time, which constrains its applicability. In this paper, we analyze the problem of deploying AlphaZero agents in potentially changed test environments and demonstrate how the combination of simple modifications to the standard framework can significantly boost performance, even in settings with a low planning budget available. The code is publicly available on GitHub.

[273] Towards a Common Framework for Autoformalization

Agnieszka Mensfelt, David Tena Cucala, Santiago Franco, Angeliki Koutsoukou-Argyraki, Vince Trencsenyi, Kostas Stathis

Main category: cs.AI

TL;DR: This paper reviews autoformalization - the automation of translating informal input into formal logical representations - across different fields and proposes a unified framework to bridge research gaps.

Details

Motivation: The rapid development of autoformalization driven by LLMs has created independent research areas with limited shared methodologies, benchmarks, and frameworks, hindering progress.

Method: The paper reviews both explicit and implicit instances of autoformalization across different fields and proposes a unified framework to connect these research areas.

Result: The review identifies that autoformalization research spans mathematics formalization, reasoning, planning, and knowledge representation, but lacks cross-field collaboration.

Conclusion: A unified framework for autoformalization is proposed to encourage cross-pollination between fields and accelerate development of next-generation AI systems.

Abstract: Autoformalization has emerged as a term referring to the automation of formalization - specifically, the formalization of mathematics using interactive theorem provers (proof assistants). Its rapid development has been driven by progress in deep learning, especially large language models (LLMs). More recently, the term has expanded beyond mathematics to describe the broader task of translating informal input into formal logical representations. At the same time, a growing body of research explores using LLMs to translate informal language into formal representations for reasoning, planning, and knowledge representation - often without explicitly referring to this process as autoformalization. As a result, despite addressing similar tasks, the largely independent development of these research areas has limited opportunities for shared methodologies, benchmarks, and theoretical frameworks that could accelerate progress. The goal of this paper is to review - explicit or implicit

instances of what can be considered autoformalization and to propose a unified framework, encouraging cross-pollination between different fields to advance the development of next generation AI systems.

[274] p-less Sampling: A Robust Hyperparameter-Free Approach for LLM Decoding

Runyan Tan, Shuang Wu, Phillip Howard

Main category: cs.AI

TL;DR: p-less sampling is a hyperparameter-free decoding strategy that dynamically sets truncation thresholds based on token probability distributions, outperforming existing methods across various tasks while maintaining quality at higher temperatures.

Details

Motivation: Existing sampling methods for LLMs are sensitive to hyperparameter choices and require different settings for different tasks and temperature configurations, limiting their robustness and ease of use.

Method: Information-theoretic approach that dynamically sets truncation thresholds at each decoding step using the entire token probability distribution, eliminating the need for hyperparameters.

Result: Consistently outperforms existing sampling approaches across math, logical reasoning, and creative writing tasks; maintains text quality at higher temperatures; achieves greater inference-time efficiency through lower average token sampling times and shorter generation lengths without sacrificing accuracy.

Conclusion: p-less sampling provides a robust, hyperparameter-free alternative to existing decoding strategies that maintains high output quality across temperature variations while improving efficiency.

Abstract: Obtaining high-quality outputs from Large Language Models (LLMs) often depends upon the choice of a sampling-based decoding strategy to probabilistically choose the next token at each generation step. While a variety of such sampling methods have been proposed, their performance can be sensitive to the selection of hyperparameters which may require different settings depending upon the generation task and temperature configuration. In this work, we introduce $p$-less sampling: an information-theoretic approach to sampling which dynamically sets a truncation threshold at each decoding step based on the entire token probability distribution. Unlike existing methods, $p$-less sampling has no hyperparameters and consistently produces high-quality outputs as temperature increases. We provide theoretical perspectives on $p$-less sampling to ground our proposed method and conduct experiments to empirically validate its effectiveness across a range of math, logical reasoning, and creative writing tasks. Our results demonstrate how $p$-less sampling consistently outperforms existing sampling approaches while exhibiting much less degradation in text quality at higher temperature values. We further show how $p$-less achieves greater inference-time efficiency than alternative methods through lower average token sampling times and shorter generation lengths, without sacrificing accuracy. Finally, we provide analyses to highlight the benefits of $p$-less through qualitative examples, case studies, and diversity assessments. The code is available at https://github.com/ryttry/p-less .

[275] When Models Outthink Their Safety: Mitigating Self-Jailbreak in Large Reasoning Models with Chain-of-Guardrails

Yingzhi Mao, Chunkang Zhang, Junxiang Wang, Xinyan Guan, Boxi Cao, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun

Main category: cs.AI

TL;DR: CoG framework improves LRM safety by fixing unsafe reasoning steps while maintaining reasoning ability, solving the safety-reasoning trade-off.

Details

Motivation: LRMs have strong reasoning but are vulnerable to safety risks like harmful content and jailbreak attacks. Existing methods suppress reasoning ability and fail to balance safety and reasoning.

Method: Chain-of-Guardrail (CoG) framework that recomposes or backtracks unsafe reasoning steps to steer models back to safe trajectories while preserving valid reasoning chains.

Result: CoG substantially improves safety across multiple benchmarks while preserving comparable reasoning ability, outperforming prior methods that suffer from severe trade-offs.

Conclusion: LRMs inherently can reject unsafe queries but this ability is compromised; CoG successfully resolves the safety-reasoning trade-off by fixing unsafe reasoning steps.

Abstract: Large Reasoning Models (LRMs) demonstrate remarkable capabilities on complex reasoning tasks but remain vulnerable to severe safety risks, including harmful content generation and jailbreak attacks. Existing mitigation strategies rely on injecting heuristic safety signals during training, which often suppress reasoning ability and fail to resolve the safety-reasoning trade-off. To systematically investigate this issue, we analyze the reasoning trajectories of diverse LRMs and uncover a phenomenon we term Self-Jailbreak, where models override their own risk assessments and justify responding to unsafe prompts. This finding reveals that LRMs inherently possess the ability to reject unsafe queries, but this ability is compromised, resulting in harmful outputs. Building on these insights, we propose the Chain-of-Guardrail (CoG), a training framework that recomposes or backtracks unsafe reasoning steps, steering the model back onto safe trajectories while preserving valid reasoning chains. Extensive experiments across multiple reasoning and safety benchmarks demonstrate that CoG substantially improves the safety of current LRMs while preserving comparable reasoning ability, significantly outperforming prior methods that suffer from severe safety-reasoning trade-offs.

[276] Huxley-Gödel Machine: Human-Level Coding Agent Development by an Approximation of the Optimal Self-Improving Machine

Wenyi Wang, Piotr Piękos, Li Nanbo, Firas Laakom, Yimeng Chen, Mateusz Ostaszewski, Mingchen Zhuge, Jürgen Schmidhuber

Main category: cs.AI

TL;DR: The paper identifies a mismatch between coding benchmark performance and self-improvement potential in AI agents, proposes a new metric called CMP to measure metaproductivity, and introduces the Huxley-Gödel Machine (HGM) that uses CMP estimation to guide self-modification searches, achieving superior performance with fewer computational resources.

Details

Motivation: Current self-improving coding agents focus on maximizing benchmark performance, but this doesn't necessarily indicate better potential for future self-modifications. There's a mismatch between performance and actual self-improvement capability (metaproductivity).

Method: Proposed a new metric CMP that aggregates descendant benchmark performances to measure self-improvement potential. Developed the Huxley-Gödel Machine (HGM) which estimates CMP and uses it to guide searches through the tree of self-modifications.

Result: HGM outperforms prior methods on SWE-bench Verified and Polyglot while using fewer CPU hours. Shows strong transfer to other coding datasets and LLMs. The optimized agent achieves human-level performance on SWE-bench Lite with GPT-5, matching best human-engineered coding agents.

Conclusion: The CMP metric effectively captures self-improvement potential, and HGM demonstrates that guiding self-modification searches using metaproductivity estimation leads to more efficient and effective self-improving coding agents.

Abstract: Recent studies operationalize self-improvement through coding agents that edit their own codebases. They grow a tree of self-modifications through expansion strategies that favor higher software engineering benchmark performance, assuming that this implies more promising subsequent self-modifications. However, we identify a mismatch between the agent’s self-improvement potential (metaproductivity) and its coding benchmark performance, namely the Metaproductivity-Performance Mismatch. Inspired by Huxley’s concept of clade, we propose a metric ($\mathrm{CMP}$) that aggregates the benchmark performances of the descendants of an agent as an indicator of its potential for self-improvement. We show that, in our self-improving coding agent development setting, access to the true $\mathrm{CMP}$ is sufficient to simulate how the G"odel Machine would behave under certain assumptions. We introduce the Huxley-G"odel Machine (HGM), which, by estimating $\mathrm{CMP}$ and using it as guidance, searches the tree of self-modifications. On SWE-bench Verified and Polyglot, HGM outperforms prior self-improving coding agent development methods while using fewer allocated CPU hours. Last but not least, HGM demonstrates strong transfer to other coding datasets and large language models. The agent optimized by HGM on SWE-bench Verified with GPT-5-mini and evaluated on SWE-bench Lite with GPT-5 achieves human-level performance, matching the best officially checked results of human-engineered coding agents. Our code is publicly available at https://github.com/metauto-ai/HGM.

[277] A Survey of AI Scientists: Surveying the automatic Scientists and Research

Guiyao Tie, Pan Zhou, Lichao Sun

Main category: cs.AI

TL;DR: Survey of AI scientists - autonomous systems that perform end-to-end scientific workflows from hypothesis to publication, analyzed through a six-stage framework and evolutionary timeline.

Details

Motivation: The rapid proliferation of AI scientist systems has created a fragmented research landscape, obscuring methodological principles and developmental trends that need systematic synthesis.

Method: Introduces a unified six-stage methodological framework (Literature Review, Idea Generation, Experimental Preparation, Experimental Execution, Scientific Writing, Paper Generation) and analyzes field evolution from Foundational Modules (2022-2023) to Closed-Loop Systems (2024) to current Scalability/Impact/Human-AI Collaboration (2025-present).

Result: Provides systematic synthesis of autonomous science domain, clarifying current state and charting evolutionary trajectory of AI scientist systems.

Conclusion: The survey offers a critical roadmap for overcoming challenges in robustness and governance, guiding next-generation systems toward becoming trustworthy partners in human scientific inquiry.

Abstract: Artificial intelligence is undergoing a profound transition from a computational instrument to an autonomous originator of scientific knowledge. This emerging paradigm, the AI scientist, is architected to emulate the complete scientific workflow-from initial hypothesis generation to the final synthesis of publishable findings-thereby promising to fundamentally reshape the pace and scale of discovery. However, the rapid and unstructured proliferation of these systems has created a fragmented research landscape, obscuring overarching methodological principles and developmental trends. This survey provides a systematic and comprehensive synthesis of this domain by introducing a unified, six-stage methodological framework that deconstructs the end-to-end scientific process into: Literature Review, Idea Generation, Experimental Preparation, Experimental Execution, Scientific Writing, and Paper Generation. Through this analytical lens, we chart the field’s evolution from early Foundational Modules (2022-2023) to integrated Closed-Loop Systems (2024), and finally to the current frontier of Scalability, Impact, and Human-AI Collaboration (2025-present). By rigorously synthesizing these developments, this survey not only clarifies the current state of autonomous science but also provides a critical roadmap for overcoming remaining challenges in robustness and governance, ultimately guiding the next generation of systems toward becoming trustworthy and indispensable partners in human scientific inquiry.

[278] Why Foundation Models in Pathology Are Failing

Hamid R. Tizhoosh

Main category: cs.AI

TL;DR: Current pathology foundation models fail to deliver expected breakthroughs due to conceptual mismatches with tissue complexity, showing poor accuracy, robustness, and safety issues.

Details

Motivation: To examine why foundation models that revolutionized other domains are underperforming in computational pathology despite high expectations for cancer diagnosis and prognostication.

Method: Systematic evaluation of pathology foundation models identifying seven key causes of failure through conceptual analysis of model assumptions versus tissue complexity.

Result: Found fundamental weaknesses including low diagnostic accuracy, poor robustness, geometric instability, computational inefficiency, and safety vulnerabilities in current pathology foundation models.

Conclusion: Current pathology foundation models are conceptually misaligned with tissue morphology and require fundamental paradigm rethinking rather than incremental improvements.

Abstract: In non-medical domains, foundation models (FMs) have revolutionized computer vision and language processing through large-scale self-supervised and multimodal learning. Consequently, their rapid adoption in computational pathology was expected to deliver comparable breakthroughs in cancer diagnosis, prognostication, and multimodal retrieval. However, recent systematic evaluations reveal fundamental weaknesses: low diagnostic accuracy, poor robustness, geometric instability, heavy computational demands, and concerning safety vulnerabilities. This short paper examines these shortcomings and argues that they stem from deeper conceptual mismatches between the assumptions underlying generic foundation modeling in mainstream AI and the intrinsic complexity of human tissue. Seven interrelated causes are identified: biological complexity, ineffective self-supervision, overgeneralization, excessive architectural complexity, lack of domain-specific innovation, insufficient data, and a fundamental design flaw related to tissue patch size. These findings suggest that current pathology foundation models remain conceptually misaligned with the nature of tissue morphology and call for a fundamental rethinking of the paradigm itself.

[279] The Sign Estimator: LLM Alignment in the Face of Choice Heterogeneity

Ali Aouad, Aymane El Gadarri, Vivek F. Farias

Main category: cs.AI

TL;DR: The paper proposes a ‘sign estimator’ method that replaces cross-entropy with binary classification loss in LLM alignment, providing consistent ordinal alignment and better performance than standard RLHF.

Details

Motivation: Traditional LLM alignment methods are vulnerable to heterogeneity in human preferences and yield inconsistent estimates of population-average utility.

Method: Propose a sign estimator that replaces cross-entropy with binary classification loss in the aggregation step, providing simple and efficient estimation.

Result: The sign estimator reduces preference distortion by 35% in angular estimation error and decreases disagreement with true population preferences from 12% to 8% compared to standard RLHF.

Conclusion: The sign estimator provides consistent ordinal alignment under mild assumptions, achieves polynomial finite-sample error bounds, and outperforms panel data heuristics while maintaining implementation simplicity.

Abstract: Traditional LLM alignment methods are vulnerable to heterogeneity in human preferences. Fitting a na"ive probabilistic model to pairwise comparison data (say over prompt-completion pairs) yields an inconsistent estimate of the population-average utility -a canonical measure of social welfare. We propose a new method, dubbed the sign estimator, that provides a simple, provably consistent, and efficient estimator by replacing cross-entropy with binary classification loss in the aggregation step. This simple modification recovers consistent ordinal alignment under mild assumptions and achieves the first polynomial finite-sample error bounds in this setting. In realistic simulations of LLM alignment using digital twins, the sign estimator substantially reduces preference distortion over a panel of simulated personas, cutting (angular) estimation error by nearly 35% and decreasing disagreement with true population preferences from 12% to 8% compared to standard RLHF. Our method also compares favorably to panel data heuristics that explicitly model user heterogeneity and require tracking individual-level preference data-all while maintaining the implementation simplicity of existing LLM alignment pipelines.

[280] VDSAgents: A PCS-Guided Multi-Agent System for Veridical Data Science Automation

Yunxuan Jiang, Silan Hu, Xiaoning Wang, Yuanyuan Zhang, Xiangyu Chang

Main category: cs.AI

TL;DR: VDSAgents is a multi-agent system that integrates Predictability-Computability-Stability (PCS) principles into LLM-driven data science workflows, outperforming existing end-to-end systems like AutoKaggle and DataInterpreter.

Details

Motivation: Current LLM-driven data science systems lack guidance from scientific principles, limiting their trustworthiness and robustness with real-world datasets.

Method: A multi-agent system implementing modular workflow for data cleaning, feature engineering, modeling, and evaluation, incorporating perturbation analysis, unit testing, and model validation.

Result: VDSAgents consistently outperforms AutoKaggle and DataInterpreter across nine diverse datasets when using DeepSeek-V3 and GPT-4o as backends.

Conclusion: Embedding PCS principles into LLM-driven data science automation is feasible and improves system performance.

Abstract: Large language models (LLMs) become increasingly integrated into data science workflows for automated system design. However, these LLM-driven data science systems rely solely on the internal reasoning of LLMs, lacking guidance from scientific and theoretical principles. This limits their trustworthiness and robustness, especially when dealing with noisy and complex real-world datasets. This paper provides VDSAgents, a multi-agent system grounded in the Predictability-Computability-Stability (PCS) principles proposed in the Veridical Data Science (VDS) framework. Guided by PCS principles, the system implements a modular workflow for data cleaning, feature engineering, modeling, and evaluation. Each phase is handled by an elegant agent, incorporating perturbation analysis, unit testing, and model validation to ensure both functionality and scientific auditability. We evaluate VDSAgents on nine datasets with diverse characteristics, comparing it with state-of-the-art end-to-end data science systems, such as AutoKaggle and DataInterpreter, using DeepSeek-V3 and GPT-4o as backends. VDSAgents consistently outperforms the results of AutoKaggle and DataInterpreter, which validates the feasibility of embedding PCS principles into LLM-driven data science automation.

cs.SD

[281] A Parameter-Efficient Multi-Scale Convolutional Adapter for Synthetic Speech Detection

Yassine El Kheir, Fabian Ritter-Guttierez, Arnab Das, Tim Polzehl, Sebastian Möller

Main category: cs.SD

TL;DR: MultiConvAdapter is a parameter-efficient architecture that integrates parallel convolutional modules within SSL encoders to capture multi-scale temporal artifacts in synthetic speech detection, achieving superior performance with only 1% trainable parameters compared to full fine-tuning.

Details

Motivation: Existing PEFT methods lack specific inductive biases to model multi-scale temporal artifacts in spoofed audio, and full fine-tuning of pre-trained SSL models is computationally demanding.

Method: Introduces MultiConvAdapter with parallel convolutional modules integrated within SSL encoders to simultaneously learn discriminative features across multiple temporal resolutions, capturing both short-term artifacts and long-term distortions.

Result: Achieves superior performance compared to full fine-tuning and established PEFT methods on five public datasets, using only 3.17M trainable parameters (1% of SSL backbone), substantially reducing computational burden.

Conclusion: MultiConvAdapter effectively addresses the limitations of existing PEFT methods for synthetic speech detection by incorporating multi-scale temporal modeling capabilities while maintaining parameter efficiency.

Abstract: Recent synthetic speech detection models typically adapt a pre-trained SSL model via finetuning, which is computationally demanding. Parameter-Efficient Fine-Tuning (PEFT) offers an alternative. However, existing methods lack the specific inductive biases required to model the multi-scale temporal artifacts characteristic of spoofed audio. This paper introduces the Multi-Scale Convolutional Adapter (MultiConvAdapter), a parameter-efficient architecture designed to address this limitation. MultiConvAdapter integrates parallel convolutional modules within the SSL encoder, facilitating the simultaneous learning of discriminative features across multiple temporal resolutions, capturing both short-term artifacts and long-term distortions. With only $3.17$M trainable parameters ($1%$ of the SSL backbone), MultiConvAdapter substantially reduces the computational burden of adaptation. Evaluations on five public datasets, demonstrate that MultiConvAdapter achieves superior performance compared to full fine-tuning and established PEFT methods.

[282] Joint Analysis of Acoustic Scenes and Sound Events Based on Semi-Supervised Training of Sound Events With Partial Labels

Keisuke Imoto

Main category: cs.SD

TL;DR: A multitask learning framework that uses acoustic scene information to construct partial labels for sound event detection, reducing annotation costs while maintaining performance through semi-supervised learning and self-distillation label refinement.

Details

Motivation: Traditional time boundary annotation for sound events is labor-intensive, limiting scalability of strongly supervised learning. Weakly-supervised learning with clip-level labels reduces costs but suffers from performance degradation. Partial label learning offers a cost-effective alternative but remains unexplored in audio analysis.

Method: Proposes a multitask learning framework that jointly performs acoustic scene classification and sound event detection using partial labels constructed from acoustic scene context. Also explores semi-supervised framework combining strong and partial labels, and introduces a self-distillation-based label refinement method.

Result: The approach reduces annotation costs while addressing performance degradation issues in weakly-supervised learning. By leveraging acoustic scene context for partial label construction and incorporating label refinement, it achieves better balance between cost and detection performance.

Conclusion: Using acoustic scene information to construct partial labels for sound event detection provides a practical solution to reduce annotation costs. The proposed multitask framework with semi-supervised learning and self-distillation label refinement effectively balances annotation efficiency and detection performance.

Abstract: Annotating time boundaries of sound events is labor-intensive, limiting the scalability of strongly supervised learning in audio detection. To reduce annotation costs, weakly-supervised learning with only clip-level labels has been widely adopted. As an alternative, partial label learning offers a cost-effective approach, where a set of possible labels is provided instead of exact weak annotations. However, partial label learning for audio analysis remains largely unexplored. Motivated by the observation that acoustic scenes provide contextual information for constructing a set of possible sound events, we utilize acoustic scene information to construct partial labels of sound events. On the basis of this idea, in this paper, we propose a multitask learning framework that jointly performs acoustic scene classification and sound event detection with partial labels of sound events. While reducing annotation costs, weakly-supervised and partial label learning often suffer from decreased detection performance due to lacking the precise event set and their temporal annotations. To better balance between annotation cost and detection performance, we also explore a semi-supervised framework that leverages both strong and partial labels. Moreover, to refine partial labels and achieve better model training, we propose a label refinement method based on self-distillation for the proposed approach with partial labels.

[283] SFMS-ALR: Script-First Multilingual Speech Synthesis with Adaptive Locale Resolution

Dharma Teja Donepudi

Main category: cs.SD

TL;DR: SFMS-ALR is an engine-agnostic framework for fluent code-switched speech synthesis that segments text by Unicode script, applies adaptive language identification, and normalizes prosody across languages without requiring model retraining.

Details

Motivation: Conventional TTS systems are typically monolingual and fail to produce natural, intelligible speech in mixed-language contexts, making intra-sentence multilingual speech synthesis a major challenge due to abrupt language shifts, varied scripts, and mismatched prosody.

Method: The framework segments input text by Unicode script, applies adaptive language identification to determine each segment’s language and locale, normalizes prosody using sentiment-aware adjustments, and generates a unified SSML representation with appropriate language/voice spans for single TTS request synthesis.

Result: SFMS-ALR demonstrates flexibility, interpretability, and immediate deployability compared to data-driven pipelines like Unicom and Mask LID, requiring no retraining and integrating seamlessly with existing TTS providers.

Conclusion: The framework establishes a modular baseline for high-quality, engine-independent multilingual TTS and outlines evaluation strategies for intelligibility, naturalness, and user preference in code-switched speech generation.

Abstract: Intra-sentence multilingual speech synthesis (code-switching TTS) remains a major challenge due to abrupt language shifts, varied scripts, and mismatched prosody between languages. Conventional TTS systems are typically monolingual and fail to produce natural, intelligible speech in mixed-language contexts. We introduce Script-First Multilingual Synthesis with Adaptive Locale Resolution (SFMS-ALR), an engine-agnostic framework for fluent, real-time code-switched speech generation. SFMS-ALR segments input text by Unicode script, applies adaptive language identification to determine each segment’s language and locale, and normalizes prosody using sentiment-aware adjustments to preserve expressive continuity across languages. The algorithm generates a unified SSML representation with appropriate “lang” or “voice” spans and synthesizes the utterance in a single TTS request. Unlike end-to-end multilingual models, SFMS-ALR requires no retraining and integrates seamlessly with existing voices from Google, Apple, Amazon, and other providers. Comparative analysis with data-driven pipelines such as Unicom and Mask LID demonstrates SFMS-ALR’s flexibility, interpretability, and immediate deployability. The framework establishes a modular baseline for high-quality, engine-independent multilingual TTS and outlines evaluation strategies for intelligibility, naturalness, and user preference.

[284] Studies for : A Human-AI Co-Creative Sound Artwork Using a Real-time Multi-channel Sound Generation Model

Chihiro Nagashima, Akira Takahashi, Zhi Zhong, Shusuke Takahashi, Yuki Mitsufuji

Main category: cs.SD

TL;DR: This paper presents Studies for, a generative sound installation that uses AI to create a “new form of archive” by preserving an artist’s style while generating new sounds, trained on 200+ hours of the artist’s past works.

Details

Motivation: To explore AI integration in artistic workflows and create a speculative archival system that preserves an artist's style while generating new content beyond their existing works.

Method: Developed SpecMaskGIT, a lightweight sound generation AI model, trained on over 200 hours of the artist’s past sound artworks, and integrated it into an eight-channel real-time sound installation with artist feedback.

Result: Successfully created an immersive auditory experience that generates new sounds while maintaining the artist’s artistic identity, demonstrating effective Human-AI co-creation in sound art.

Conclusion: Proposes a Human-AI co-creation framework for sound art that enables new possibilities for creating and archiving artistic works beyond an artist’s physical existence, serving as a “new form of archive”.

Abstract: This paper explores the integration of AI technologies into the artistic workflow through the creation of Studies for, a generative sound installation developed in collaboration with sound artist Evala (https://www.ntticc.or.jp/en/archive/works/studies-for/). The installation employs SpecMaskGIT, a lightweight yet high-quality sound generation AI model, to generate and playback eight-channel sound in real-time, creating an immersive auditory experience over the course of a three-month exhibition. The work is grounded in the concept of a “new form of archive,” which aims to preserve the artistic style of an artist while expanding beyond artists’ past artworks by continued generation of new sound elements. This speculative approach to archival preservation is facilitated by training the AI model on a dataset consisting of over 200 hours of Evala’s past sound artworks. By addressing key requirements in the co-creation of art using AI, this study highlights the value of the following aspects: (1) the necessity of integrating artist feedback, (2) datasets derived from an artist’s past works, and (3) ensuring the inclusion of unexpected, novel outputs. In Studies for, the model was designed to reflect the artist’s artistic identity while generating new, previously unheard sounds, making it a fitting realization of the concept of “a new form of archive.” We propose a Human-AI co-creation framework for effectively incorporating sound generation AI models into the sound art creation process and suggest new possibilities for creating and archiving sound art that extend an artist’s work beyond their physical existence. Demo page: https://sony.github.io/studies-for/

[285] Controlling Contrastive Self-Supervised Learning with Knowledge-Driven Multiple Hypothesis: Application to Beat Tracking

Antonin Gagnere, Slim Essid, Geoffroy Peeters

Main category: cs.SD

TL;DR: A contrastive self-supervised pre-training method that uses multiple hypotheses for positive sample selection, incorporating domain knowledge to handle ambiguities in music tasks like beat tracking.

Details

Motivation: Address ambiguities in data and problem constraints where multiple equally plausible outcomes exist, such as different rhythmic interpretations in beat tracking that are all potentially valid.

Method: Contrastive self-supervised pre-training with multiple hypotheses about possible positive samples, selected using a knowledge-based scoring function to retain the most plausible ones.

Result: Outperforms existing methods on standard benchmarks when fine-tuned on labeled data.

Conclusion: Integrating domain knowledge with multi-hypothesis selection provides advantages for music representation learning, particularly for ambiguous tasks like beat tracking.

Abstract: Ambiguities in data and problem constraints can lead to diverse, equally plausible outcomes for a machine learning task. In beat and downbeat tracking, for instance, different listeners may adopt various rhythmic interpretations, none of which would necessarily be incorrect. To address this, we propose a contrastive self-supervised pre-training approach that leverages multiple hypotheses about possible positive samples in the data. Our model is trained to learn representations compatible with different such hypotheses, which are selected with a knowledge-based scoring function to retain the most plausible ones. When fine-tuned on labeled data, our model outperforms existing methods on standard benchmarks, showcasing the advantages of integrating domain knowledge with multi-hypothesis selection in music representation learning in particular.

[286] Binaspect – A Python Library for Binaural Audio Analysis, Visualization & Feature Generation

Dan Barry, Davoud Shariat Panah, Alessandro Ragano, Jan Skoglund, Andrew Hines

Main category: cs.SD

TL;DR: Binaspect is an open-source Python library for binaural audio analysis that generates interpretable azimuth maps by clustering time-frequency bins into stable time-azimuth histograms, enabling visualization of multiple sound sources and degradations without requiring head models.

Details

Motivation: To provide researchers and engineers with tools to observe how binaural cues are degraded by codec and renderer design choices, and to generate structured features for machine learning models in quality prediction and spatial audio tasks.

Method: Generates azimuth maps by calculating modified interaural time and level difference spectrograms, then clustering time-frequency bins into stable time-azimuth histogram representations that can show multiple active sources as distinct azimuthal clusters.

Result: The tool successfully demonstrates degradations in bitrate ladders, ambisonic rendering, and VBAP source positioning, clearly revealing how binaural cues are affected by various audio processing stages.

Conclusion: Binaspect provides valuable diagnostic capabilities for binaural audio analysis and generates exportable features suitable for machine learning applications in quality prediction and spatial audio classification, released as open-source with full reproducibility.

Abstract: We present Binaspect, an open-source Python library for binaural audio analysis, visualization, and feature generation. Binaspect generates interpretable “azimuth maps” by calculating modified interaural time and level difference spectrograms, and clustering those time-frequency (TF) bins into stable time-azimuth histogram representations. This allows multiple active sources to appear as distinct azimuthal clusters, while degradations manifest as broadened, diffused, or shifted distributions. Crucially, Binaspect operates blindly on audio, requiring no prior knowledge of head models. These visualizations enable researchers and engineers to observe how binaural cues are degraded by codec and renderer design choices, among other downstream processes. We demonstrate the tool on bitrate ladders, ambisonic rendering, and VBAP source positioning, where degradations are clearly revealed. In addition to their diagnostic value, the proposed representations can be exported as structured features suitable for training machine learning models in quality prediction, spatial audio classification, and other binaural tasks. Binaspect is released under an open-source license with full reproducibility scripts at https://github.com/QxLabIreland/Binaspect.

[287] Efficient Vocal Source Separation Through Windowed Sink Attention

Christodoulos Benetatos, Yongyi Zang, Randal Leistikow

Main category: cs.SD

TL;DR: Replaced full temporal self-attention with windowed sink attention in vocal separation models, achieving 92% of original SDR performance while reducing FLOPs by 44.5x.

Details

Motivation: Full temporal self-attention in vocal separation models has quadratic computational costs with input length, making it inefficient for long audio sequences.

Method: Analyzed pre-trained model attention patterns, found they are highly localized, and replaced full attention with windowed sink attention using small temporal windows and attention sinks.

Result: Fine-tuning from original checkpoint recovered 92% of original SDR performance while reducing computational FLOPs by 44.5 times.

Conclusion: Windowed sink attention is an effective approach for reducing computational costs in vocal separation models while maintaining most of the performance.

Abstract: State-of-the-art vocal separation models like Mel-Band-Roformer rely on full temporal self-attention mechanisms, where each temporal frame interacts with every other frames. This incurs heavy computational costs that scales quadratically with input audio length, motivating chunking and windowing approaches. Through analysis of a pre-trained vocal separation model, we discovered that temporal attention patterns are highly localized. Building on this insight, we replaced full attention with windowed sink attention (WSA) with small temporal attention window and attention sinks. We show empirically that fine-tuning from the original checkpoint recovers 92% of the original SDR performance while reducing FLOPs by 44.5x. We release our code and checkpoints under MIT license at https://github.com/smulelabs/windowed-roformer.

[288] Artificial Neural Networks Trained on Noisy Speech Exhibit the McGurk Effect

Lukas Grasse, Matthew S. Tata

Main category: cs.SD

TL;DR: Artificial neural networks trained on audiovisual speech exhibit the McGurk effect without explicit training, especially when trained with noisy speech, suggesting they can model human audiovisual integration.

Details

Motivation: To understand how artificial neural networks can model human audiovisual speech integration and investigate why the McGurk effect emerges in networks trained only on congruent speech.

Method: Tested various ANNs trained on audiovisual speech with incongruent stimuli designed to elicit the McGurk effect, comparing networks trained on clean vs. noisy speech and systematically varying noise levels during training.

Result: Networks trained on congruent speech still showed McGurk percept; training with noisy speech increased visual responses and McGurk responses; moderate noise enhanced integration but extreme noise prevented it.

Conclusion: ANNs can model human audiovisual integration, with noise exposure during training influencing integration development, supporting their use as models for perception and cognition.

Abstract: Humans are able to fuse information from both auditory and visual modalities to help with understanding speech. This is demonstrated through a phenomenon known as the McGurk Effect, during which a listener is presented with incongruent auditory and visual speech that fuse together into the percept of illusory intermediate phonemes. Building on a recent framework that proposes how to address developmental ‘why’ questions using artificial neural networks, we evaluated a set of recent artificial neural networks trained on audiovisual speech by testing them with audiovisually incongruent words designed to elicit the McGurk effect. We show that networks trained entirely on congruent audiovisual speech nevertheless exhibit the McGurk percept. We further investigated ‘why’ by comparing networks trained on clean speech to those trained on noisy speech, and discovered that training with noisy speech led to a pronounced increase in both visual responses and McGurk responses across all models. Furthermore, we observed that systematically increasing the level of auditory noise during ANN training also increased the amount of audiovisual integration up to a point, but at extreme noise levels, this integration failed to develop. These results suggest that excessive noise exposure during critical periods of audiovisual learning may negatively influence the development of audiovisual speech integration. This work also demonstrates that the McGurk effect reliably emerges untrained from the behaviour of both supervised and unsupervised networks, even networks trained only on congruent speech. This supports the notion that artificial neural networks might be useful models for certain aspects of perception and cognition.

[289] Application of Whisper in Clinical Practice: the Post-Stroke Speech Assessment during a Naming Task

Milena Davudova, Ziyuan Cai, Valentina Giunchiglia, Dragos C. Gruia, Giulia Sanguedolce, Adam Hampshire, Fatemeh Geranmayeh

Main category: cs.SD

TL;DR: Fine-tuning Whisper ASR model significantly improves transcription accuracy for stroke patients’ speech and enables prediction of speech quality, but shows limited generalizability to unseen datasets.

Details

Motivation: Current language impairment assessment after stroke is clinician-intensive and not scalable. ASR foundation models could potentially augment human evaluation but their effectiveness for impaired speech is uncertain.

Method: Evaluated Whisper ASR model on stroke patients’ picture-naming task speech. Assessed verbatim transcription accuracy and downstream language function prediction. Fine-tuned the model and tested on both healthy and patient speech.

Result: Baseline Whisper performed poorly on single-word utterances. Fine-tuning reduced Word Error Rate by 87.72% (healthy) and 71.22% (patients). Learned representations achieved F1 Macro scores of 0.74 (healthy) and 0.75 (patients) for speech quality prediction. However, limited generalizability was observed on unseen TORGO dataset.

Conclusion: Foundation models like Whisper, when fine-tuned, show potential for automated speech assessment in stroke rehabilitation, but require adaptation to specific clinical populations due to cross-domain generalization challenges.

Abstract: Detailed assessment of language impairment following stroke remains a cognitively complex and clinician-intensive task, limiting timely and scalable diagnosis. Automatic Speech Recognition (ASR) foundation models offer a promising pathway to augment human evaluation through intelligent systems, but their effectiveness in the context of speech and language impairment remains uncertain. In this study, we evaluate whether Whisper, a state-of-the-art ASR foundation model, can be applied to transcribe and analyze speech from patients with stroke during a commonly used picture-naming task. We assess both verbatim transcription accuracy and the model’s ability to support downstream prediction of language function, which has major implications for outcomes after stroke. Our results show that the baseline Whisper model performs poorly on single-word speech utterances. Nevertheless, fine-tuning Whisper significantly improves transcription accuracy (reducing Word Error Rate by 87.72% in healthy speech and 71.22% in speech from patients). Further, learned representations from the model enable accurate prediction of speech quality (average F1 Macro of 0.74 for healthy, 0.75 for patients). However, evaluations on an unseen (TORGO) dataset reveal limited generalizability, highlighting the inability of Whisper to perform zero-shot transcription of single-word utterances on out-of-domain clinical speech and emphasizing the need to adapt models to specific clinical populations. While challenges remain in cross-domain generalization, these findings highlight the potential of foundation models, when appropriately fine-tuned, to advance automated speech and language assessment and rehabilitation for stroke-related impairments.

[290] Bob’s Confetti: Phonetic Memorization Attacks in Music and Video Generation

Jaechul Roh, Zachary Novack, Yuefeng Peng, Niloofar Mireshghallah, Taylor Berg-Kirkpatrick, Amir Houmansadr

Main category: cs.SD

TL;DR: APT attack bypasses copyright filters in generative AI by replacing lyrics with phonetic equivalents, causing models to reproduce copyrighted music and video content despite text-based safeguards.

Details

Motivation: To expose fundamental flaws in text-based copyright protection systems for generative AI, which fail to prevent regurgitation of copyrighted material through phonetic memorization.

Method: Adversarial PhoneTic Prompting (APT) replaces iconic lyrics with homophonic alternatives using CMU pronouncing dictionary, preserving acoustic structure while altering meaning, then tests on Lyrics-to-Song and Text-to-Video models.

Result: Leading L2S models regenerate songs with striking melodic/rhythmic similarity to originals, and T2V models reconstruct visual scenes from original music videos despite no visual cues in prompts, showing cross-modal vulnerability.

Conclusion: Models memorize deep structural patterns tied to acoustics, not just text, creating phonetic-to-visual leakage that renders simple copyright filters ineffective and raises security concerns for multimodal AI deployment.

Abstract: Generative AI systems for music and video commonly use text-based filters to prevent the regurgitation of copyrighted material. We expose a fundamental flaw in this approach by introducing Adversarial PhoneTic Prompting (APT), a novel attack that bypasses these safeguards by exploiting phonetic memorization. The APT attack replaces iconic lyrics with homophonic but semantically unrelated alternatives (e.g., “mom’s spaghetti” becomes “Bob’s confetti”), preserving acoustic structure while altering meaning; we identify high-fidelity phonetic matches using CMU pronouncing dictionary. We demonstrate that leading Lyrics-to-Song (L2S) models like SUNO and YuE regenerate songs with striking melodic and rhythmic similarity to their copyrighted originals when prompted with these altered lyrics. More surprisingly, this vulnerability extends across modalities. When prompted with phonetically modified lyrics from a song, a Text-to-Video (T2V) model like Veo 3 reconstructs visual scenes from the original music video-including specific settings and character archetypes-despite the absence of any visual cues in the prompt. Our findings reveal that models memorize deep, structural patterns tied to acoustics, not just verbatim text. This phonetic-to-visual leakage represents a critical vulnerability in transcript-conditioned generative models, rendering simple copyright filters ineffective and raising urgent concerns about the secure deployment of multimodal AI systems. Demo examples are available at our project page (https://jrohsc.github.io/music_attack/).

[291] PromptReverb: Multimodal Room Impulse Response Generation Through Latent Rectified Flow Matching

Ali Vosoughi, Yongyi Zang, Qihui Yang, Nathan Paek, Randal Leistikow, Chenliang Xu

Main category: cs.SD

TL;DR: PromptReverb is a two-stage generative framework that generates high-quality room impulse responses (RIRs) from natural language descriptions, addressing dataset scarcity and acoustic accuracy limitations.

Details

Motivation: Current RIR generation methods face two key limitations: scarcity of full-band RIR datasets and inability to generate acoustically accurate responses from diverse input modalities like natural language.

Method: Two-stage framework: 1) Variational autoencoder that upsamples band-limited RIRs to full-band quality (48 kHz), 2) Conditional diffusion transformer model based on rectified flow matching that generates RIRs from natural language descriptions.

Result: Superior perceptual quality and acoustic accuracy with 8.8% mean RT60 error compared to -37% for baselines, yielding more realistic room-acoustic parameters.

Conclusion: PromptReverb enables practical applications in virtual reality, architectural acoustics, and audio production where flexible, high-quality RIR synthesis is essential.

Abstract: Room impulse response (RIR) generation remains a critical challenge for creating immersive virtual acoustic environments. Current methods suffer from two fundamental limitations: the scarcity of full-band RIR datasets and the inability of existing models to generate acoustically accurate responses from diverse input modalities. We present PromptReverb, a two-stage generative framework that addresses these challenges. Our approach combines a variational autoencoder that upsamples band-limited RIRs to full-band quality (48 kHz), and a conditional diffusion transformer model based on rectified flow matching that generates RIRs from descriptions in natural language. Empirical evaluation demonstrates that PromptReverb produces RIRs with superior perceptual quality and acoustic accuracy compared to existing methods, achieving 8.8% mean RT60 error compared to -37% for widely used baselines and yielding more realistic room-acoustic parameters. Our method enables practical applications in virtual reality, architectural acoustics, and audio production where flexible, high-quality RIR synthesis is essential.

cs.LG

[292] Fortytwo: Swarm Inference with Peer-Ranked Consensus

Vladyslav Larin, Ihor Naumenko, Aleksei Ivashov, Ivan Nikitin, Alexander Firsov

Main category: cs.LG

TL;DR: Fortytwo is a decentralized AI inference protocol that uses swarm intelligence and pairwise ranking consensus to achieve superior performance over majority voting, with strong security against Sybil attacks.

Details

Motivation: Centralized AI faces compute limitations and diminishing returns from large training runs, requiring a horizontally scalable inference solution that can democratize access to high-quality AI.

Method: Uses swarm inference with peer-ranked, reputation-weighted consensus across heterogeneous models, employing pairwise ranking with Bradley-Terry-style aggregation and proof-of-capability for Sybil resistance.

Result: Achieves 85.90% on GPQA Diamond vs 68.69% for majority voting (+17.21pp improvement), shows strong resilience to adversarial prompts (only 0.12% degradation vs 6.20% baseline), and performs well across six benchmarks.

Conclusion: Fortytwo establishes a foundation for decentralized AI systems that democratize access to high-quality inference through collective intelligence while maintaining reliability and security.

Abstract: As centralized AI hits compute ceilings and diminishing returns from ever-larger training runs, meeting demand requires an inference layer that scales horizontally in both capacity and capability. We present Fortytwo, a novel protocol that leverages swarm intelligence principles and distributed pairwise ranking consensus to achieve superior performance in AI inference. Our approach reimagines collaboration among AI nodes using swarm inference: a peer-ranked, reputation-weighted consensus across heterogeneous models that surfaces the highest-quality responses. Using pairwise ranking with a custom Bradley-Terry-style aggregation model, we demonstrate that swarm inference substantially outperforms majority voting, achieving 85.90% on GPQA Diamond versus 68.69% for majority voting with the same model set - an improvement of +17.21 percentage points (approximately +25.1% relative). The protocol incorporates on-chain reputation so node influence adapts to demonstrated accuracy over time, yielding a meritocratic consensus that filters low-quality or malicious participants. To resist Sybil attacks, Fortytwo employs proof-of-capability in its consensus: nodes must successfully complete calibration/test requests and stake reputation to enter ranking rounds, making multi-identity attacks economically unattractive while preserving openness. Across six challenging benchmarks, including GPQA Diamond, LiveCodeBench, and AIME, our evaluation indicates higher accuracy and strong resilience to adversarial and noisy free-form prompting (e.g., prompt-injection degradation of only 0.12% versus 6.20% for a monolithic single-model baseline), while retaining practical deployability. Together, these results establish a foundation for decentralized AI systems - democratizing access to high-quality inference through collective intelligence without sacrificing reliability or security.

[293] From Linear to Nonlinear: Provable Weak-to-Strong Generalization through Feature Learning

Junsoo Oh, Jerry Song, Chulhee Yun

Main category: cs.LG

TL;DR: The paper analyzes weak-to-strong generalization where a strong model trained with weak model supervision outperforms the teacher, focusing on linear CNN to two-layer ReLU CNN transitions with structured data containing varying signal difficulty and noise.

Details

Motivation: To provide formal theoretical analysis of weak-to-strong generalization beyond abstract frameworks, specifically examining how a strong model can outperform its weak teacher when trained on the weak model's labels.

Method: Analyze gradient descent dynamics when a two-layer ReLU CNN (strong) is trained on data labeled by a pretrained linear CNN (weak), using structured data with label-dependent signals of varying difficulty and label-independent noise.

Result: Identifies two regimes: data-scarce (generalization via benign overfitting or failure via harmful overfitting with transition boundary) and data-abundant (generalization through early label correction but performance degradation from overtraining).

Conclusion: Weak-to-strong generalization operates through distinct mechanisms depending on data abundance, with successful generalization requiring careful training dynamics management to avoid performance degradation from overtraining.

Abstract: Weak-to-strong generalization refers to the phenomenon where a stronger model trained under supervision from a weaker one can outperform its teacher. While prior studies aim to explain this effect, most theoretical insights are limited to abstract frameworks or linear/random feature models. In this paper, we provide a formal analysis of weak-to-strong generalization from a linear CNN (weak) to a two-layer ReLU CNN (strong). We consider structured data composed of label-dependent signals of varying difficulty and label-independent noise, and analyze gradient descent dynamics when the strong model is trained on data labeled by the pretrained weak model. Our analysis identifies two regimes – data-scarce and data-abundant – based on the signal-to-noise characteristics of the dataset, and reveals distinct mechanisms of weak-to-strong generalization. In the data-scarce regime, generalization occurs via benign overfitting or fails via harmful overfitting, depending on the amount of data, and we characterize the transition boundary. In the data-abundant regime, generalization emerges in the early phase through label correction, but we observe that overtraining can subsequently degrade performance.

[294] Augmenting Biological Fitness Prediction Benchmarks with Landscapes Features from GraphFLA

Mingyu Huang, Shasha Zhou, Ke Li

Main category: cs.LG

TL;DR: GraphFLA is a Python framework that constructs and analyzes fitness landscapes from mutagenesis data across DNA, RNA, protein, and other modalities, calculating 20 biologically relevant features to characterize landscape topography for better model evaluation.

Details

Motivation: Existing machine learning model benchmarks for biological sequence-fitness landscapes lack topographical information, which limits interpretation and comparison of model performance beyond averaged scores.

Method: Developed GraphFLA framework that constructs fitness landscapes from mutagenesis data with up to millions of mutants, calculates 20 features characterizing 4 fundamental aspects of landscape topography, and applied it to over 5,300 landscapes from ProteinGym, RNAGym, and CIS-BP.

Result: Demonstrated utility in interpreting and comparing performance of dozens of fitness prediction models, highlighting factors influencing model accuracy and respective advantages of different models. Released 155 combinatorially complete empirical fitness landscapes with over 2.2 million sequences.

Conclusion: GraphFLA provides a comprehensive framework for analyzing fitness landscape topography, enabling better interpretation and comparison of model performance across diverse biological modalities.

Abstract: Machine learning models increasingly map biological sequence-fitness landscapes to predict mutational effects. Effective evaluation of these models requires benchmarks curated from empirical data. Despite their impressive scales, existing benchmarks lack topographical information regarding the underlying fitness landscapes, which hampers interpretation and comparison of model performance beyond averaged scores. Here, we introduce GraphFLA, a Python framework that constructs and analyzes fitness landscapes from mutagensis data in diverse modalities (e.g., DNA, RNA, protein, and beyond) with up to millions of mutants. GraphFLA calculates 20 biologically relevant features that characterize 4 fundamental aspects of landscape topography. By applying GraphFLA to over 5,300 landscapes from ProteinGym, RNAGym, and CIS-BP, we demonstrate its utility in interpreting and comparing the performance of dozens of fitness prediction models, highlighting factors influencing model accuracy and respective advantages of different models. In addition, we release 155 combinatorially complete empirical fitness landscapes, encompassing over 2.2 million sequences across various modalities. All the codes and datasets are available at https://github.com/COLA-Laboratory/GraphFLA.

[295] Send Less, Save More: Energy-Efficiency Benchmark of Embedded CNN Inference vs. Data Transmission in IoT

Benjamin Karic, Nina Herrmann, Jan Stenkamp, Paula Scharf, Fabian Gieseke, Angela Schwering

Main category: cs.LG

TL;DR: This paper evaluates using compressed CNNs on ESP32-S3 microcontrollers with LPWAN for environmental monitoring, showing 5x energy reduction by performing inference on-device instead of transmitting raw images.

Details

Motivation: Need for energy-efficient IoT devices for long-term environmental monitoring in remote areas with limited power, where data transfer is energy-intensive.

Method: Use compressed CNNs trained on domain-specific datasets on ESP32-S3 microcontrollers with Low Power Wide Area Networks, performing inference on-device and transmitting only results.

Result: On-device CNN inference reduces overall energy consumption by up to 5x compared to sending raw image data, with minimal accuracy reduction from quantization.

Conclusion: Embedded Machine Learning enables IoT applications with reduced carbon footprint capable of autonomous operation in environmental monitoring scenarios.

Abstract: The integration of the Internet of Things (IoT) and Artificial Intelligence offers significant opportunities to enhance our ability to monitor and address ecological changes. As environmental challenges become increasingly pressing, the need for effective remote monitoring solutions is more critical than ever. A major challenge in designing IoT applications for environmental monitoring - particularly those involving image data - is to create energy-efficient IoT devices capable of long-term operation in remote areas with limited power availability. Advancements in the field of Tiny Machine Learning allow the use of Convolutional Neural Networks (CNNs) on resource-constrained, battery-operated microcontrollers. Since data transfer is energy-intensive, performing inference directly on microcontrollers to reduce the message size can extend the operational lifespan of IoT nodes. This work evaluates the use of common Low Power Wide Area Networks and compressed CNNs trained on domain specific datasets on an ESP32-S3. Our experiments demonstrate, among other things, that executing CNN inference on-device and transmitting only the results reduces the overall energy consumption by a factor of up to five compared to sending raw image data. %The compression of the model using Post Training Quantization is accompanied by an acceptable reduction in accuracy of only a few percentage points compared to a non-quantized model. These findings advocate the development of IoT applications with reduced carbon footprint and capable of operating autonomously in environmental monitoring scenarios by incorporating Embedded Machine Learning.

[296] Aggregation Hides Out-of-Distribution Generalization Failures from Spurious Correlations

Olawale Salaudeen, Haoran Zhang, Kumail Alhamoud, Sara Beery, Marzyeh Ghassemi

Main category: cs.LG

TL;DR: The paper challenges the common observation that in-distribution (ID) and out-of-distribution (OOD) accuracy are positively correlated, showing this is often an artifact of aggregating heterogeneous OOD examples. Using OODSelect method, they identify coherent subsets where higher ID accuracy actually predicts lower OOD accuracy.

Details

Motivation: To investigate whether the observed positive correlation between ID and OOD accuracy in benchmarks is genuine or an artifact of aggregation, and to uncover potential spurious correlations that are masked by aggregate metrics.

Method: Developed OODSelect, a simple gradient-based method to identify semantically coherent OOD subsets where the positive correlation between ID and OOD accuracy breaks down.

Result: Across widely used distribution shift benchmarks, OODSelect uncovered subsets (sometimes over half of standard OOD sets) where higher ID accuracy predicts lower OOD accuracy, revealing important failure modes obscured by aggregate metrics.

Conclusion: Aggregate metrics can hide significant OOD robustness failures, and the common positive ID-OOD accuracy correlation is often an artifact of heterogeneous OOD example aggregation rather than absence of spurious correlations.

Abstract: Benchmarks for out-of-distribution (OOD) generalization frequently show a strong positive correlation between in-distribution (ID) and OOD accuracy across models, termed “accuracy-on-the-line.” This pattern is often taken to imply that spurious correlations - correlations that improve ID but reduce OOD performance - are rare in practice. We find that this positive correlation is often an artifact of aggregating heterogeneous OOD examples. Using a simple gradient-based method, OODSelect, we identify semantically coherent OOD subsets where accuracy on the line does not hold. Across widely used distribution shift benchmarks, the OODSelect uncovers subsets, sometimes over half of the standard OOD set, where higher ID accuracy predicts lower OOD accuracy. Our findings indicate that aggregate metrics can obscure important failure modes of OOD robustness. We release code and the identified subsets to facilitate further research.

[297] Adaptive EEG-based stroke diagnosis with a GRU-TCN classifier and deep Q-learning thresholding

Shakeel Abdulkareem, Bora Yimenicioglu, Andrea Yang, Khartik Uppalapati, Aneesh Gudipati, Zhaoyang Fan

Main category: cs.LG

TL;DR: An adaptive multitask EEG classifier using GRU-TCN with DQN threshold adaptation achieves 98% accuracy for stroke type classification, outperforming baseline methods.

Details

Motivation: Rapid triage of suspected stroke requires accurate bedside tools; EEG is promising but underused at first contact due to lack of effective classification systems.

Method: 32-channel EEG signals converted to power spectral density features, processed by recurrent-convolutional network (GRU-TCN) for stroke type, lateralization, and severity prediction, with deep Q-network (DQN) for real-time threshold tuning.

Result: Baseline GRU-TCN achieved 89.3% accuracy for stroke type, 96.9% for severity, and 96.7% for lateralization. With DQN adaptation, stroke-type accuracy increased to 98.0% (F1 97.7%). Validated on independent cohort with robust performance.

Conclusion: Adaptive thresholding enables clinically preferred sensitivity-specificity trade-offs, while integrated visualizations support interpretability, making EEG a viable tool for rapid stroke triage.

Abstract: Rapid triage of suspected stroke needs accurate, bedside-deployable tools; EEG is promising but underused at first contact. We present an adaptive multitask EEG classifier that converts 32-channel signals to power spectral density features (Welch), uses a recurrent-convolutional network (GRU-TCN) to predict stroke type (healthy, ischemic, hemorrhagic), hemispheric lateralization, and severity, and applies a deep Q-network (DQN) to tune decision thresholds in real time. Using a patient-wise split of the UCLH Stroke EIT/EEG data set (44 recordings; about 26 acute stroke, 10 controls), the primary outcome was stroke-type performance; secondary outcomes were severity and lateralization. The baseline GRU-TCN reached 89.3% accuracy (F1 92.8%) for stroke type, about 96.9% (F1 95.9%) for severity, and about 96.7% (F1 97.4%) for lateralization. With DQN threshold adaptation, stroke-type accuracy increased to about 98.0% (F1 97.7%). We also tested robustness on an independent, low-density EEG cohort (ZJU4H) and report paired patient-level statistics. Analyses follow STARD 2015 guidance for diagnostic accuracy studies (index test: GRU-TCN+DQN; reference standard: radiology/clinical diagnosis; patient-wise evaluation). Adaptive thresholding shifts the operating point to clinically preferred sensitivity-specificity trade-offs, while integrated scalp-map and spectral visualizations support interpretability.

[298] Topic Analysis with Side Information: A Neural-Augmented LDA Approach

Biyi Fang, Kripa Rajshekhar, Truong Vo, Diego Klabjan

Main category: cs.LG

TL;DR: nnLDA is a neural-augmented topic model that incorporates side information through a neural prior mechanism, outperforming traditional models like LDA in topic coherence and perplexity.

Details

Motivation: Traditional topic models like LDA struggle to integrate auxiliary information such as metadata and user attributes, limiting their expressiveness and personalization capabilities.

Method: nnLDA uses a neural network to generate the prior over topic proportions conditioned on auxiliary features, capturing nonlinear interactions between side information and topic distributions. It employs a stochastic variational Expectation-Maximization algorithm for joint optimization.

Result: Across multiple benchmark datasets, nnLDA consistently outperforms LDA and Dirichlet-Multinomial Regression in topic coherence, perplexity, and downstream classification tasks.

Conclusion: Combining neural representation learning with probabilistic topic modeling provides significant benefits when side information is available, enhancing model performance and interpretability.

Abstract: Traditional topic models such as Latent Dirichlet Allocation (LDA) have been widely used to uncover latent structures in text corpora, but they often struggle to integrate auxiliary information such as metadata, user attributes, or document labels. These limitations restrict their expressiveness, personalization, and interpretability. To address this, we propose nnLDA, a neural-augmented probabilistic topic model that dynamically incorporates side information through a neural prior mechanism. nnLDA models each document as a mixture of latent topics, where the prior over topic proportions is generated by a neural network conditioned on auxiliary features. This design allows the model to capture complex nonlinear interactions between side information and topic distributions that static Dirichlet priors cannot represent. We develop a stochastic variational Expectation-Maximization algorithm to jointly optimize the neural and probabilistic components. Across multiple benchmark datasets, nnLDA consistently outperforms LDA and Dirichlet-Multinomial Regression in topic coherence, perplexity, and downstream classification. These results highlight the benefits of combining neural representation learning with probabilistic topic modeling in settings where side information is available.

[299] KAN-GCN: Combining Kolmogorov-Arnold Network with Graph Convolution Network for an Accurate Ice Sheet Emulator

Zesheng Liu, YoungHyun Koo, Maryam Rahnemoonfar

Main category: cs.LG

TL;DR: KAN-GCN is an ice sheet modeling emulator that combines Kolmogorov-Arnold Networks (KAN) with Graph Convolution Networks (GCN) to improve accuracy and efficiency.

Details

Motivation: To create a fast and accurate emulator for ice sheet modeling that improves feature conditioning and nonlinear encoding without increasing computational complexity.

Method: Places a KAN as a feature-wise calibrator before GCNs, applying learnable one-dimensional warps and linear mixing to improve feature conditioning. Tested on 36 melting-rate simulations with 3 mesh-size settings for Pine Island Glacier.

Result: KAN-GCN matches or exceeds accuracy of pure GCN and MLP-GCN baselines across 2- to 5-layer architectures. Improves inference throughput on coarser meshes with only modest cost on finest meshes.

Conclusion: KAN-first designs offer favorable accuracy vs. efficiency trade-off for large transient scenario sweeps in ice sheet modeling.

Abstract: We introduce KAN-GCN, a fast and accurate emulator for ice sheet modeling that places a Kolmogorov-Arnold Network (KAN) as a feature-wise calibrator before graph convolution networks (GCNs). The KAN front end applies learnable one-dimensional warps and a linear mixing step, improving feature conditioning and nonlinear encoding without increasing message-passing depth. We employ this architecture to improve the performance of emulators for numerical ice sheet models. Our emulator is trained and tested using 36 melting-rate simulations with 3 mesh-size settings for Pine Island Glacier, Antarctica. Across 2- to 5-layer architectures, KAN-GCN matches or exceeds the accuracy of pure GCN and MLP-GCN baselines. Despite a small parameter overhead, KAN-GCN improves inference throughput on coarser meshes by replacing one edge-wise message-passing layer with a node-wise transform; only the finest mesh shows a modest cost. Overall, KAN-first designs offer a favorable accuracy vs. efficiency trade-off for large transient scenario sweeps.

[300] WBT-BGRL: A Non-Contrastive Weighted Bipartite Link Prediction Model for Inductive Learning

Joel Frank Huarayo Quispe, Lilian Berton, Didier Vega-Oliveros

Main category: cs.LG

TL;DR: WBT-BGRL is a non-contrastive framework for link prediction in bipartite graphs that enhances bootstrapped learning with a novel weighting mechanism in triplet loss, showing competitive performance in inductive settings.

Details

Motivation: Link prediction in bipartite graphs is crucial for applications like recommendation systems but is less studied than monopartite graphs. Existing methods struggle with inefficient negative sampling in contrastive approaches and limited testing in inductive, weighted, and bipartite scenarios.

Method: Proposed Weighted Bipartite Triplet-Bootstrapped Graph Latents (WBT-BGRL) - a non-contrastive framework using dual GCN encoders with a novel weighting mechanism in triplet loss for bipartite graphs.

Result: Evaluation on real-world datasets (Industry and E-commerce) shows competitive performance against adapted state-of-the-art models (T-BGRL, BGRL, GBT, CCA-SSG), especially when weighting is applied during pretraining.

Conclusion: Weighted, non-contrastive learning is valuable for inductive link prediction in bipartite graphs, with the proposed WBT-BGRL framework demonstrating effectiveness in real-world applications.

Abstract: Link prediction in bipartite graphs is crucial for applications like recommendation systems and failure detection, yet it is less studied than in monopartite graphs. Contrastive methods struggle with inefficient and biased negative sampling, while non-contrastive approaches rely solely on positive samples. Existing models perform well in transductive settings, but their effectiveness in inductive, weighted, and bipartite scenarios remains untested. To address this, we propose Weighted Bipartite Triplet-Bootstrapped Graph Latents (WBT-BGRL), a non-contrastive framework that enhances bootstrapped learning with a novel weighting mechanism in the triplet loss. Using a bipartite architecture with dual GCN encoders, WBT-BGRL is evaluated against adapted state-of-the-art models (T-BGRL, BGRL, GBT, CCA-SSG). Results on real-world datasets (Industry and E-commerce) show competitive performance, especially when weighting is applied during pretraining-highlighting the value of weighted, non-contrastive learning for inductive link prediction in bipartite graphs.

[301] Machine Learning and CPU (Central Processing Unit) Scheduling Co-Optimization over a Network of Computing Centers

Mohammadreza Doostmohammadian, Zulfiya R. Gabidullina, Hamid R. Rabiee

Main category: cs.LG

TL;DR: A co-optimization framework for distributed machine learning that simultaneously optimizes CPU resource allocation and local model training across computing nodes, with convergence guarantees and significant performance improvements over existing methods.

Details

Motivation: Address the growing demand for fast, computationally efficient, and scalable AI solutions by optimizing computing resources for distributed ML, particularly handling time-varying networks and resource constraints.

Method: Proposes a co-optimization algorithm that assigns CPU usage while training nodes locally with distributed data, supports time-varying networks with balanced weights, ensures all-time feasibility, and handles log-scale quantization in communication channels.

Result: The algorithm achieves more than 50% improvement in cost optimality gap compared to existing CPU scheduling solutions, with proven convergence using perturbation theory, Lyapunov stability, and eigen-spectrum analysis.

Conclusion: The proposed framework successfully addresses the dual challenge of resource optimization and distributed ML training, providing an efficient and scalable solution with strong theoretical guarantees and practical performance improvements.

Abstract: In the rapidly evolving research on artificial intelligence (AI) the demand for fast, computationally efficient, and scalable solutions has increased in recent years. The problem of optimizing the computing resources for distributed machine learning (ML) and optimization is considered in this paper. Given a set of data distributed over a network of computing-nodes/servers, the idea is to optimally assign the CPU (central processing unit) usage while simultaneously training each computing node locally via its own share of data. This formulates the problem as a co-optimization setup to (i) optimize the data processing and (ii) optimally allocate the computing resources. The information-sharing network among the nodes might be time-varying, but with balanced weights to ensure consensus-type convergence of the algorithm. The algorithm is all-time feasible, which implies that the computing resource-demand balance constraint holds at all iterations of the proposed solution. Moreover, the solution allows addressing possible log-scale quantization over the information-sharing channels to exchange log-quantized data. For some example applications, distributed support-vector-machine (SVM) and regression are considered as the ML training models. Results from perturbation theory, along with Lyapunov stability and eigen-spectrum analysis, are used to prove the convergence towards the optimal case. As compared to existing CPU scheduling solutions, the proposed algorithm improves the cost optimality gap by more than $50%$.

[302] Can Aha Moments Be Fake? Identifying True and Decorative Thinking Steps in Chain-of-Thought

Jiachen Zhao, Yiyou Sun, Weiyan Shi, Dawn Song

Main category: cs.LG

TL;DR: LLMs often generate decorative reasoning steps in Chain-of-Thought that don’t actually contribute to predictions, with only 2.3% of steps having high causal influence.

Details

Motivation: To investigate whether Chain-of-Thought reasoning steps truly reflect LLMs' internal thinking processes or are merely decorative.

Method: Proposed True Thinking Score (TTS) to measure causal influence of each reasoning step, and identified TrueThinking direction in latent space for steering model behavior.

Result: Only a small fraction of CoT steps (2.3% in AIME dataset) have high causal impact; decorative thinking is common; steering along TrueThinking direction can force internal reasoning.

Conclusion: LLMs often verbalize reasoning steps without performing them internally, undermining both reasoning efficiency and CoT trustworthiness.

Abstract: Recent large language models (LLMs) can generate long Chain-of-Thought (CoT) at test time, enabling them to solve complex tasks. These reasoning steps in CoT are often assumed as a faithful reflection of the model’s internal thinking process, and used to monitor unsafe intentions. However, we find many reasoning steps don’t truly contribute to LLMs’ prediction. We measure the step-wise causal influence of each reasoning step on the model’s final prediction with a proposed True Thinking Score (TTS). We reveal that LLMs often interleave between true-thinking steps (which are genuinely used to produce the final output) and decorative-thinking steps (which only give the appearance of reasoning but have minimal causal impact). Notably, only a small subset of the total reasoning steps have a high TTS that causally drive the model’s prediction: e.g., for the AIME dataset, only an average of 2.3% of reasoning steps in CoT have a TTS >= 0.7 (range: 0-1) under the Qwen-2.5 model. Furthermore, we identify a TrueThinking direction in the latent space of LLMs. By steering along or against this direction, we can force the model to perform or disregard certain CoT steps when computing the final result. Finally, we highlight that self-verification steps in CoT (i.e., aha moments) can also be decorative, where LLMs do not truly verify their solution. Steering along the TrueThinking direction can force internal reasoning over these steps, resulting in a change in the final results. Overall, our work reveals that LLMs often verbalize reasoning steps without actually performing them internally, which undermines both the efficiency of LLM reasoning and the trustworthiness of CoT.

[303] Quantifying Multimodal Imbalance: A GMM-Guided Adaptive Loss for Audio-Visual Learning

Zhaocheng Liu, Zhiwen Yu, Xiaoqing Liu

Main category: cs.LG

TL;DR: Proposes a quantitative analysis framework for multimodal imbalance using Modality Gap and GMM modeling, with an adaptive loss function and two-stage training that achieves SOTA results.

Details

Motivation: Address the problem of multimodal imbalance where dominant modalities steer gradient updates, and lack of quantitative analysis and exploitation of imbalance information in existing methods.

Method: Define Modality Gap as Softmax score difference, model distribution with bimodal GMM, estimate posterior probabilities using Bayes’ theorem, and design adaptive loss with two-stage training (warm-up and adaptive phases).

Result: Achieves state-of-the-art performance on CREMA-D (80.65%), AVE (70.40%), and KineticSound (72.42%). Fine-tuning with GMM-identified high-quality samples further improves results.

Conclusion: The proposed framework effectively addresses multimodal imbalance through quantitative analysis and adaptive loss, demonstrating the value of high-quality samples for multimodal fusion.

Abstract: The heterogeneity of multimodal data leads to inconsistencies and imbalance, allowing a dominant modality to steer gradient updates. Existing solutions mainly focus on optimization- or data-based strategies but rarely exploit the information inherent in multimodal imbalance or conduct its quantitative analysis. To address this gap, we propose a novel quantitative analysis framework for Multimodal Imbalance and design a sample-level adaptive loss function. We define the Modality Gap as the Softmax score difference between modalities for the correct class and model its distribution using a bimodal Gaussian Mixture Model(GMM), representing balanced and imbalanced samples. Using Bayes’ theorem, we estimate each sample’s posterior probability of belonging to these two groups. Based on this, our adaptive loss (1) minimizes the overall Modality Gap, (2) aligns imbalanced samples with balanced ones, and (3) adaptively penalizes each according to its imbalance degree. A two-stage training strategy-warm-up and adaptive phases,yields state-of-the-art performance on CREMA-D (80.65%), AVE (70.40%), and KineticSound (72.42%). Fine-tuning with high-quality samples identified by the GMM further improves results, highlighting their value for effective multimodal fusion.

[304] Finding Culture-Sensitive Neurons in Vision-Language Models

Xiutian Zhao, Rochelle Choenni, Rohit Saxena, Ivan Titov

Main category: cs.LG

TL;DR: The paper identifies culture-sensitive neurons in vision-language models that show preferential sensitivity to culturally specific inputs, and demonstrates their importance for culturally diverse visual question answering.

Details

Motivation: Vision-language models struggle with culturally situated inputs, so the authors want to understand how these models process culturally grounded information and identify neurons responsible for cultural sensitivity.

Method: Used CVQA benchmark to identify culture-selective neurons, performed causal tests by deactivating neurons, and proposed Contrastive Activation Selection (CAS) method for neuron identification. Analyzed three VLMs across 25 cultural groups.

Result: Found neurons whose ablation disproportionately harms performance on questions about corresponding cultures while minimally affecting others. CAS method outperformed existing probability- and entropy-based methods. Culture-sensitive neurons tend to cluster in certain decoder layers.

Conclusion: The study reveals the internal organization of multimodal representations and demonstrates the existence of specialized neurons for cultural processing in vision-language models.

Abstract: Despite their impressive performance, vision-language models (VLMs) still struggle on culturally situated inputs. To understand how VLMs process culturally grounded information, we study the presence of culture-sensitive neurons, i.e. neurons whose activations show preferential sensitivity to inputs associated with particular cultural contexts. We examine whether such neurons are important for culturally diverse visual question answering and where they are located. Using the CVQA benchmark, we identify neurons of culture selectivity and perform causal tests by deactivating the neurons flagged by different identification methods. Experiments on three VLMs across 25 cultural groups demonstrate the existence of neurons whose ablation disproportionately harms performance on questions about the corresponding cultures, while having minimal effects on others. Moreover, we propose a new margin-based selector - Contrastive Activation Selection (CAS), and show that it outperforms existing probability- and entropy-based methods in identifying culture-sensitive neurons. Finally, our layer-wise analyses reveals that such neurons tend to cluster in certain decoder layers. Overall, our findings shed new light on the internal organization of multimodal representations.

[305] Resource-Efficient and Robust Inference of Deep and Bayesian Neural Networks on Embedded and Analog Computing Platforms

Bernhard Klein

Main category: cs.LG

TL;DR: This paper presents a co-design approach combining algorithmic efficiency and hardware optimization to enable resource-efficient and robust inference for both conventional and Bayesian neural networks on embedded platforms.

Details

Motivation: Growing computational demands of machine learning constrain scalability on resource-limited platforms, while neural networks need reliable predictions under distributional shifts. Bayesian neural networks provide uncertainty quantification but add computational overhead.

Method: Joint pursuit of algorithmic efficiency (model compression, approximate Bayesian inference) and hardware efficiency (digital accelerators, analog hardware). Specific contributions include Galen for layer-specific compression, modeling analog device imperfections with noisy training, probabilistic inference approximations, and probabilistic photonic computing.

Result: Developed methods for automatic compression guided by sensitivity analysis, improved robustness to analog noise, efficient probabilistic inference approximations, and energy-efficient probabilistic inference directly in hardware.

Conclusion: Efficiency and reliability can be advanced jointly through algorithm-hardware co-design, laying foundation for next-generation trustworthy, energy-efficient machine learning systems.

Abstract: While modern machine learning has transformed numerous application domains, its growing computational demands increasingly constrain scalability and efficiency, particularly on embedded and resource-limited platforms. In practice, neural networks must not only operate efficiently but also provide reliable predictions under distributional shifts or unseen data. Bayesian neural networks offer a principled framework for quantifying uncertainty, yet their computational overhead further compounds these challenges. This work advances resource-efficient and robust inference for both conventional and Bayesian neural networks through the joint pursuit of algorithmic and hardware efficiency. The former reduces computation through model compression and approximate Bayesian inference, while the latter optimizes deployment on digital accelerators and explores analog hardware, bridging algorithmic design and physical realization. The first contribution, Galen, performs automatic layer-specific compression guided by sensitivity analysis and hardware-in-the-loop feedback. Analog accelerators offer efficiency gains at the cost of noise; this work models device imperfections and extends noisy training to nonstationary conditions, improving robustness and stability. A second line of work advances probabilistic inference, developing analytic and ensemble approximations that replace costly sampling, integrate into a compiler stack, and optimize embedded inference. Finally, probabilistic photonic computing introduces a paradigm where controlled analog noise acts as an intrinsic entropy source, enabling fast, energy-efficient probabilistic inference directly in hardware. Together, these studies demonstrate how efficiency and reliability can be advanced jointly through algorithm-hardware co-design, laying the foundation for the next generation of trustworthy, energy-efficient machine-learning systems.

[306] Sequences of Logits Reveal the Low Rank Structure of Language Models

Noah Golowich, Allen Liu, Abhishek Shetty

Main category: cs.LG

TL;DR: Language models exhibit low-rank structure in their logit matrices, enabling generation of responses using linear combinations of outputs from unrelated prompts.

Details

Motivation: To understand the inherent low-dimensional structure of large language models at a model-agnostic level as sequential probabilistic models.

Method: Empirical demonstration of low-rank structure in logit matrices across various language models, and theoretical analysis using approximate rank as a universal abstraction.

Result: Low-rank structure allows generating responses to target prompts using linear combinations of outputs from unrelated or nonsensical prompts.

Conclusion: Language models possess inherent low-dimensional structure that can be leveraged for generation and provides theoretical learning guarantees.

Abstract: A major problem in the study of large language models is to understand their inherent low-dimensional structure. We introduce an approach to study the low-dimensional structure of language models at a model-agnostic level: as sequential probabilistic models. We first empirically demonstrate that a wide range of modern language models exhibit low-rank structure: in particular, matrices built from the model’s logits for varying sets of prompts and responses have low approximate rank. We then show that this low-rank structure can be leveraged for generation – in particular, we can generate a response to a target prompt using a linear combination of the model’s outputs on unrelated, or even nonsensical prompts. On the theoretical front, we observe that studying the approximate rank of language models in the sense discussed above yields a simple universal abstraction whose theoretical predictions parallel our experiments. We then analyze the representation power of the abstraction and give provable learning guarantees.

[307] Conformational Rank Conditioned Committees for Machine Learning-Assisted Directed Evolution

Mia Adler, Carrie Liang, Brian Peng, Oleg Presnyakov, Justin M. Baker, Jannelle Lauffer, Himani Sharma, Barry Merriman

Main category: cs.LG

TL;DR: A rank-conditioned committee framework for ML-assisted directed evolution that separates conformational uncertainty from epistemic uncertainty in antibody fitness landscapes.

Details

Motivation: Current MLDE pipelines using single conformations or single committees cannot properly separate conformational uncertainty from epistemic uncertainty, limiting their effectiveness in antibody discovery.

Method: Introduces a rank-conditioned committee (RCC) framework that assigns deep neural network committees per conformational rank, enabling principled separation of uncertainties.

Result: Validated on SARS-CoV-2 antibody docking with significant improvements over baseline strategies.

Conclusion: Provides a scalable route for therapeutic antibody discovery while directly addressing conformational uncertainty modeling challenges.

Abstract: Machine Learning-assisted directed evolution (MLDE) is a powerful tool for efficiently navigating antibody fitness landscapes. Many structure-aware MLDE pipelines rely on a single conformation or a single committee across all conformations, limiting their ability to separate conformational uncertainty from epistemic uncertainty. Here, we introduce a rank -conditioned committee (RCC) framework that leverages ranked conformations to assign a deep neural network committee per rank. This design enables a principled separation between epistemic uncertainty and conformational uncertainty. We validate our approach on SARS-CoV-2 antibody docking, demonstrating significant improvements over baseline strategies. Our results offer a scalable route for therapeutic antibody discovery while directly addressing the challenge of modeling conformational uncertainty.

[308] Partially Observable Multi-Agent Reinforcement Learning with Information Sharing

Xiangyu Liu, Kaiqing Zhang

Main category: cs.LG

TL;DR: This paper studies provable multi-agent reinforcement learning in partially observable stochastic games (POSGs) by leveraging information-sharing among agents to achieve quasi-polynomial time and sample complexity.

Details

Motivation: To circumvent the known hardness results and computational intractability in POSGs, the authors advocate leveraging information-sharing practices common in empirical multi-agent RL and multi-agent control systems with communication.

Method: The paper proposes approximating shared common information to construct an approximate model of POSG, where approximate equilibria can be found in quasi-polynomial time. It also develops a partially observable multi-agent RL algorithm with quasi-polynomial complexities.

Result: The approach achieves quasi-polynomial time and sample complexities for finding approximate equilibria in POSGs, and extends to finding team-optimal solutions in cooperative POSGs (decentralized POMDPs) with established computational and sample complexities.

Conclusion: The study opens up possibilities for leveraging and designing different information structures from control theory to develop sample- and computation-efficient partially observable multi-agent RL.

Abstract: We study provable multi-agent reinforcement learning (RL) in the general framework of partially observable stochastic games (POSGs). To circumvent the known hardness results and the use of computationally intractable oracles, we advocate leveraging the potential \emph{information-sharing} among agents, a common practice in empirical multi-agent RL, and a standard model for multi-agent control systems with communication. We first establish several computational complexity results to justify the necessity of information-sharing, as well as the observability assumption that has enabled quasi-polynomial time and sample single-agent RL with partial observations, for tractably solving POSGs. Inspired by the inefficiency of planning in the ground-truth model, we then propose to further \emph{approximate} the shared common information to construct an approximate model of the POSG, in which an approximate \emph{equilibrium} (of the original POSG) can be found in quasi-polynomial-time, under the aforementioned assumptions. Furthermore, we develop a partially observable multi-agent RL algorithm whose time and sample complexities are \emph{both} quasi-polynomial. Finally, beyond equilibrium learning, we extend our algorithmic framework to finding the \emph{team-optimal solution} in cooperative POSGs, i.e., decentralized partially observable Markov decision processes, a more challenging goal. We establish concrete computational and sample complexities under several structural assumptions of the model. We hope our study could open up the possibilities of leveraging and even designing different \emph{information structures}, a well-studied notion in control theory, for developing both sample- and computation-efficient partially observable multi-agent RL.

[309] Strategic inputs: feature selection from game-theoretic perspective

Chi Zhao, Jing Liu, Elena Parilina

Main category: cs.LG

TL;DR: An end-to-end feature selection framework using game theory to reduce computational costs in machine learning while maintaining performance.

Details

Motivation: Exponential data growth increases computational costs in ML training, with many features not contributing to performance but consuming resources.

Method: Formulates feature selection as a cooperative game where features are players, evaluating synergistic interactions and marginal contributions through four components: sample selection, game-theoretic importance evaluation, redundant feature elimination, and optimized model training.

Result: Achieves substantial computation reduction while preserving predictive performance, offering efficient solution for large-scale ML computational challenges.

Conclusion: The game theory-based feature selection framework effectively addresses computational challenges in large-scale machine learning by identifying and eliminating non-contributing features.

Abstract: The exponential growth of data volumes has led to escalating computational costs in machine learning model training. However, many features fail to contribute positively to model performance while consuming substantial computational resources. This paper presents an end-to-end feature selection framework for tabular data based on game theory. We formulate feature selection procedure based on a cooperative game where features are modeled as players, and their importance is determined through the evaluation of synergistic interactions and marginal contributions. The proposed framework comprises four core components: sample selection, game-theoretic feature importance evaluation, redundant feature elimination, and optimized model training. Experimental results demonstrate that the proposed method achieves substantial computation reduction while preserving predictive performance, thereby offering an efficient solution of the computational challenges of large-scale machine learning. The source code is available at https://github.com/vectorsss/strategy_inputs.

[310] HyperMARL: Adaptive Hypernetworks for Multi-Agent RL

Kale-ab Abebe Tessera, Arrasy Rahman, Amos Storkey, Stefano V. Albrecht

Main category: cs.LG

TL;DR: HyperMARL uses agent-conditioned hypernetworks to generate agent-specific parameters, addressing gradient interference in multi-agent reinforcement learning while preserving behavioral diversity without complex objectives or manual diversity settings.

Details

Motivation: Parameter sharing in MARL suppresses behavioral diversity needed for specialization due to cross-agent gradient interference, which is worsened by coupling agent IDs with observations. Existing solutions add complexity through altered objectives or manual diversity settings.

Method: Proposes HyperMARL using agent-conditioned hypernetworks to generate agent-specific parameters, decoupling observation- and agent-conditioned gradients to directly counter interference from coupling agent IDs with observations.

Result: Across 22 MARL scenarios with up to 30 agents, HyperMARL achieves competitive performance with six key baselines while preserving behavioral diversity comparable to non-parameter sharing methods, and empirically reduces policy gradient variance.

Conclusion: HyperMARL establishes a versatile and principled approach for adaptive MARL that avoids complexities of prior work while maintaining performance and behavioral diversity.

Abstract: Adaptive cooperation in multi-agent reinforcement learning (MARL) requires policies to express homogeneous, specialised, or mixed behaviours, yet achieving this adaptivity remains a critical challenge. While parameter sharing (PS) is standard for efficient learning, it notoriously suppresses the behavioural diversity required for specialisation. This failure is largely due to cross-agent gradient interference, a problem we find is surprisingly exacerbated by the common practice of coupling agent IDs with observations. Existing remedies typically add complexity through altered objectives, manual preset diversity levels, or sequential updates – raising a fundamental question: can shared policies adapt without these intricacies? We propose a solution built on a key insight: an agent-conditioned hypernetwork can generate agent-specific parameters and decouple observation- and agent-conditioned gradients, directly countering the interference from coupling agent IDs with observations. Our resulting method, HyperMARL, avoids the complexities of prior work and empirically reduces policy gradient variance. Across diverse MARL benchmarks (22 scenarios, up to 30 agents), HyperMARL achieves performance competitive with six key baselines while preserving behavioural diversity comparable to non-parameter sharing methods, establishing it as a versatile and principled approach for adaptive MARL. The code is publicly available at https://github.com/KaleabTessera/HyperMARL.

[311] LRT-Diffusion: Calibrated Risk-Aware Guidance for Diffusion Policies

Ximan Sun, Xiang Cheng

Main category: cs.LG

TL;DR: LRT-Diffusion introduces a risk-aware sampling method for diffusion policies in offline RL that uses sequential hypothesis testing to provide calibrated risk control during inference, improving the return-OOD trade-off while maintaining training simplicity.

Details

Motivation: Current diffusion policies for offline RL use heuristic guidance without statistical risk control, lacking interpretable risk budgets and principled uncertainty handling.

Method: Uses log-likelihood ratio tests at each denoising step, gating the conditional mean with a logistic controller calibrated to meet user-specified Type-I error levels. Training remains vanilla DDPM with two heads, while guidance composes with Q-gradients.

Result: On D4RL MuJoCo tasks, LRT-Diffusion improves return-OOD trade-off over Q-guided baselines while honoring desired alpha levels. Provides better performance especially when off-support errors dominate.

Conclusion: LRT-Diffusion is a drop-in inference-time method that adds principled, calibrated risk control to diffusion policies without changing training, offering interpretable risk budgets and improved safety-performance balance.

Abstract: Diffusion policies are competitive for offline reinforcement learning (RL) but are typically guided at sampling time by heuristics that lack a statistical notion of risk. We introduce LRT-Diffusion, a risk-aware sampling rule that treats each denoising step as a sequential hypothesis test between the unconditional prior and the state-conditional policy head. Concretely, we accumulate a log-likelihood ratio and gate the conditional mean with a logistic controller whose threshold tau is calibrated once under H0 to meet a user-specified Type-I level alpha. This turns guidance from a fixed push into an evidence-driven adjustment with a user-interpretable risk budget. Importantly, we deliberately leave training vanilla (two heads with standard epsilon-prediction) under the structure of DDPM. LRT guidance composes naturally with Q-gradients: critic-gradient updates can be taken at the unconditional mean, at the LRT-gated mean, or a blend, exposing a continuum from exploitation to conservatism. We standardize states and actions consistently at train and test time and report a state-conditional out-of-distribution (OOD) metric alongside return. On D4RL MuJoCo tasks, LRT-Diffusion improves the return-OOD trade-off over strong Q-guided baselines in our implementation while honoring the desired alpha. Theoretically, we establish level-alpha calibration, concise stability bounds, and a return comparison showing when LRT surpasses Q-guidance-especially when off-support errors dominate. Overall, LRT-Diffusion is a drop-in, inference-time method that adds principled, calibrated risk control to diffusion policies for offline RL.

[312] Epileptic Seizure Detection and Prediction from EEG Data: A Machine Learning Approach with Clinical Validation

Ria Jayanti, Tanish Jain

Main category: cs.LG

TL;DR: This paper presents a machine learning approach for both seizure detection and prediction using EEG data, achieving 90.9% detection accuracy with Logistic Regression and 89.26% prediction accuracy with LSTM networks.

Details

Motivation: Traditional seizure detection approaches only identify seizures after they begin, limiting opportunities for early intervention. The study aims to develop a more proactive approach that can predict seizures before they occur.

Method: Used CHB-MIT Scalp EEG Database with 969 hours of recordings from 23 patients. For detection: implemented K-Nearest Neighbors, Logistic Regression, Random Forest, and SVM. For prediction: employed Long Short-Term Memory (LSTM) networks to model temporal dependencies.

Result: Logistic Regression achieved 90.9% detection accuracy with 89.6% recall. Random Forest and SVM achieved 94.0% accuracy but 0% recall. LSTM model achieved 89.26% prediction accuracy for seizure prediction.

Conclusion: The study demonstrates the potential for developing accessible, real-time monitoring tools that can both detect and predict seizures, enabling a shift from reactive to proactive epilepsy management.

Abstract: In recent years, machine learning has become an increasingly powerful tool for supporting seizure detection and monitoring in epilepsy care. Traditional approaches focus on identifying seizures only after they begin, which limits the opportunity for early intervention and proactive treatment. In this study, we propose a novel approach that integrates both real-time seizure detection and prediction, aiming to capture subtle temporal patterns in EEG data that may indicate an upcoming seizure. Our approach was evaluated using the CHB-MIT Scalp EEG Database, which includes 969 hours of recordings and 173 seizures collected from 23 pediatric and young adult patients with drug-resistant epilepsy. To support seizure detection, we implemented a range of supervised machine learning algorithms, including K-Nearest Neighbors, Logistic Regression, Random Forest, and Support Vector Machine. The Logistic Regression achieved 90.9% detection accuracy with 89.6% recall, demonstrating balanced performance suitable for clinical screening. Random Forest and Support Vector Machine models achieved higher accuracy (94.0%) but with 0% recall, failing to detect any seizures, illustrating that accuracy alone is insufficient for evaluating medical ML models with class imbalance. For seizure prediction, we employed Long Short-Term Memory (LSTM) networks, which use deep learning to model temporal dependencies in EEG data. The LSTM model achieved 89.26% prediction accuracy. These results highlight the potential of developing accessible, real-time monitoring tools that not only detect seizures as traditionally done, but also predict them before they occur. This ability to predict seizures marks a significant shift from reactive seizure management to a more proactive approach, allowing patients to anticipate seizures and take precautionary measures to reduce the risk of injury or other complications.

[313] Enhancing Hierarchical Reinforcement Learning through Change Point Detection in Time Series

Hemanath Arumugam, Falong Fan, Bo Liu

Main category: cs.LG

TL;DR: This paper introduces a novel HRL architecture that integrates self-supervised Transformer-based Change Point Detection into the Option-Critic framework to autonomously discover meaningful subgoals and learn optimal option termination boundaries.

Details

Motivation: Hierarchical Reinforcement Learning faces practical challenges in autonomously discovering semantically meaningful subgoals and learning optimal option termination boundaries, despite its theoretical appeal for long-horizon tasks.

Method: Integrates a self-supervised Transformer-based Change Point Detection module into Option-Critic framework, using CPD to segment state trajectories and discover options through heuristic pseudo-labels from intrinsic signals. Uses change-points for termination function stabilization, intra-option policy pretraining, and inter-option divergence penalties.

Result: Experiments on Four-Rooms and Pinball tasks show CPD-guided agents achieve accelerated convergence, higher cumulative returns, and significantly improved option specialization compared to standard approaches.

Conclusion: Integrating structural priors via change-point segmentation leads to more interpretable, sample-efficient, and robust hierarchical policies in complex environments, enabling autonomous discovery of reusable, semantically meaningful skills.

Abstract: Hierarchical Reinforcement Learning (HRL) enhances the scalability of decision-making in long-horizon tasks by introducing temporal abstraction through options-policies that span multiple timesteps. Despite its theoretical appeal, the practical implementation of HRL suffers from the challenge of autonomously discovering semantically meaningful subgoals and learning optimal option termination boundaries. This paper introduces a novel architecture that integrates a self-supervised, Transformer-based Change Point Detection (CPD) module into the Option-Critic framework, enabling adaptive segmentation of state trajectories and the discovery of options. The CPD module is trained using heuristic pseudo-labels derived from intrinsic signals to infer latent shifts in environment dynamics without external supervision. These inferred change-points are leveraged in three critical ways: (i) to serve as supervisory signals for stabilizing termination function gradients, (ii) to pretrain intra-option policies via segment-wise behavioral cloning, and (iii) to enforce functional specialization through inter-option divergence penalties over CPD-defined state partitions. The overall optimization objective enhances the standard actor-critic loss using structure-aware auxiliary losses. In our framework, option discovery arises naturally as CPD-defined trajectory segments are mapped to distinct intra-option policies, enabling the agent to autonomously partition its behavior into reusable, semantically meaningful skills. Experiments on the Four-Rooms and Pinball tasks demonstrate that CPD-guided agents exhibit accelerated convergence, higher cumulative returns, and significantly improved option specialization. These findings confirm that integrating structural priors via change-point segmentation leads to more interpretable, sample-efficient, and robust hierarchical policies in complex environments.

[314] What Really Matters in Matrix-Whitening Optimizers?

Kevin Frans, Pieter Abbeel, Sergey Levine

Main category: cs.LG

TL;DR: Matrix-whitening optimizers outperform elementwise methods like Adam, with variance adaptation being the key overlooked factor rather than just spectral normalization.

Details

Motivation: To systematically deconstruct matrix-whitening optimizers and identify the key components that explain their superior performance compared to elementwise counterparts.

Method: Systematic analysis of various matrix-whitening optimizers through experiments with tuned hyperparameters, comparing performance gains and examining the roles of spectral normalization versus variance adaptation.

Result: All matrix-whitening methods reliably outperform elementwise optimizers like Adam. Variance adaptation, not just accurate spectral normalization, is the primary factor explaining performance gains. Variance-adapted versions consistently outperform sign-descent counterparts.

Conclusion: Matrix-whitening serves dual purposes, with variance adaptation being the overlooked critical component for performance improvements. Low-rank variance estimators can effectively reduce memory costs without performance loss.

Abstract: A range of recent optimizers have emerged that approximate the same “matrix-whitening” transformation in various ways. In this work, we systematically deconstruct such optimizers, aiming to disentangle the key components that explain performance. Across tuned hyperparameters across the board, all flavors of matrix-whitening methods reliably outperform elementwise counterparts, such as Adam. Matrix-whitening is often related to spectral descent – however, experiments reveal that performance gains are not explained solely by accurate spectral normalization – particularly, SOAP displays the largest per-step gain, even though Muon more accurately descends along the steepest spectral descent direction. Instead, we argue that matrix-whitening serves two purposes, and the variance adaptation component of matrix-whitening is the overlooked ingredient explaining this performance gap. Experiments show that variance-adapted versions of optimizers consistently outperform their sign-descent counterparts, including an adaptive version of Muon. We further ablate variance adaptation strategies, finding that while lookahead style approximations are not as effective, low-rank variance estimators can effectively reduce memory costs without a performance loss.

[315] Disentangling Shared and Private Neural Dynamics with SPIRE: A Latent Modeling Framework for Deep Brain Stimulation

Rahil Soroushmojdehi, Sina Javadzadeh, Mehrnaz Asadi, Terence D. Sanger

Main category: cs.LG

TL;DR: SPIRE is a deep multi-encoder autoencoder that disentangles shared network dynamics from region-specific activity in multi-region neural data, showing robust performance on synthetic benchmarks and practical applications in DBS recordings.

Details

Motivation: To address the challenge of disentangling shared network-level dynamics from region-specific activity in multi-region neural data modeling.

Method: Introduces SPIRE (Shared-Private Inter-Regional Encoder), a deep multi-encoder autoencoder with novel alignment and disentanglement losses that factorizes recordings into shared and private latent subspaces.

Result: SPIRE robustly recovers cross-regional structure, outperforms classical probabilistic models on synthetic benchmarks, and reveals how external perturbations reorganize neural dynamics. In DBS recordings, it shows shared latents reliably encode stimulation-specific signatures that generalize across sites and frequencies.

Conclusion: SPIRE establishes as a practical, reproducible tool for analyzing multi-region neural dynamics under stimulation.

Abstract: Disentangling shared network-level dynamics from region-specific activity is a central challenge in modeling multi-region neural data. We introduce SPIRE (Shared-Private Inter-Regional Encoder), a deep multi-encoder autoencoder that factorizes recordings into shared and private latent subspaces with novel alignment and disentanglement losses. Trained solely on baseline data, SPIRE robustly recovers cross-regional structure and reveals how external perturbations reorganize it. On synthetic benchmarks with ground-truth latents, SPIRE outperforms classical probabilistic models under nonlinear distortions and temporal misalignments. Applied to intracranial deep brain stimulation (DBS) recordings, SPIRE shows that shared latents reliably encode stimulation-specific signatures that generalize across sites and frequencies. These results establish SPIRE as a practical, reproducible tool for analyzing multi-region neural dynamics under stimulation.

[316] Machine Learning based Analysis for Radiomics Features Robustness in Real-World Deployment Scenarios

Sarmad Ahmad Khan, Simon Bernatz, Zahra Moslehi, Florian Buettner

Main category: cs.LG

TL;DR: Radiomics ML models are vulnerable to distribution shifts from imaging protocol variations. Protocol-invariant features maintain >0.85 F1-scores across shifts, while all features show 40% performance degradation. Dataset augmentation improves uncertainty estimates and reduces calibration error by 35%.

Details

Motivation: Radiomics-based machine learning models show promise for clinical decision support but are vulnerable to distribution shifts caused by variations in imaging protocols, positioning, and segmentation.

Method: Used a phantom of 16 fruits to evaluate distribution shifts across 5 MRI sequences. Trained XGBoost classifiers on protocol-invariant features vs sequence-specific features, testing under in-domain and out-of-domain conditions with segmentation variations and inter-observer variability.

Result: Models trained on protocol-invariant features maintained F1-scores >0.85 across distribution shifts, while models using all features showed 40% performance degradation under protocol changes. Dataset augmentation reduced expected calibration error by 35% without sacrificing accuracy.

Conclusion: Protocol-aware feature selection and controlled phantom studies effectively predict model behavior under distribution shifts, providing a framework for developing robust radiomics models resilient to real-world protocol variations.

Abstract: Radiomics-based machine learning models show promise for clinical decision support but are vulnerable to distribution shifts caused by variations in imaging protocols, positioning, and segmentation. This study systematically investigates the robustness of radiomics-based machine learning models under distribution shifts across five MRI sequences. We evaluated how different acquisition protocols and segmentation strategies affect model reliability in terms of predictive power and uncertainty-awareness. Using a phantom of 16 fruits, we evaluated distribution shifts through: (1) protocol variations across T2-HASTE, T2-TSE, T2-MAP, T1-TSE, and T2-FLAIR sequences; (2) segmentation variations (full, partial, rotated); and (3) inter-observer variability. We trained XGBoost classifiers on 8 consistent robust features versus sequence-specific features, testing model performance under in-domain and out-of-domain conditions. Results demonstrate that models trained on protocol-invariant features maintain F1-scores >0.85 across distribution shifts, while models using all features showed 40% performance degradation under protocol changes. Dataset augmentation substantially improved the quality of uncertainty estimates and reduced the expected calibration error (ECE) by 35% without sacrificing accuracy. Temperature scaling provided minimal calibration benefits, confirming XGBoost’s inherent reliability. Our findings reveal that protocol-aware feature selection and controlled phantom studies effectively predict model behavior under distribution shifts, providing a framework for developing robust radiomics models resilient to real-world protocol variations.

[317] Graph Distance Based on Cause-Effect Estimands with Latents

Zhufeng Li, Niki Kilbertus

Main category: cs.LG

TL;DR: Proposes a new graph distance measure for ADMGs that evaluates causal discovery methods based on how graph differences affect cause-effect estimation under unobserved confounding.

Details

Motivation: Current evaluation of causal discovery methods is difficult, especially under latent confounding, making it hard to assess real progress in the field.

Method: Uses identification via fixing and a symbolic verifier to quantify how graph differences distort cause-effect estimands for different treatment-outcome pairs.

Result: The proposed measure analyzes behavior under graph perturbations and compares against existing distance metrics.

Conclusion: Provides a principled evaluation framework for causal discovery methods based on downstream causal effect estimation performance.

Abstract: Causal discovery aims to recover graphs that represent causal relations among given variables from observations, and new methods are constantly being proposed. Increasingly, the community raises questions about how much progress is made, because properly evaluating discovered graphs remains notoriously difficult, particularly under latent confounding. We propose a graph distance measure for acyclic directed mixed graphs (ADMGs) based on the downstream task of cause-effect estimation under unobserved confounding. Our approach uses identification via fixing and a symbolic verifier to quantify how graph differences distort cause-effect estimands for different treatment-outcome pairs. We analyze the behavior of the measure under different graph perturbations and compare it against existing distance metrics.

[318] Dynamically Weighted Momentum with Adaptive Step Sizes for Efficient Deep Network Training

Zhifeng Wang, Longlong Li, Chunyan Zeng

Main category: cs.LG

TL;DR: DWMGrad is a novel optimization algorithm that uses dynamic guidance from historical data to adapt momentum and learning rates, achieving faster convergence and higher accuracy than SGD and Adam.

Details

Motivation: Current optimization algorithms like SGD and Adam struggle with learning efficiency fluctuations, complex models, and non-convex optimization problems due to limitations in handling complex data structures, learning rate selection, avoiding local optima, and navigating high-dimensional spaces.

Method: DWMGrad builds on traditional methods by incorporating a dynamic guidance mechanism that uses historical data to dynamically update momentum and learning rates, allowing flexible adjustment of reliance on historical information to adapt to different training scenarios.

Result: Extensive experimentation shows DWMGrad achieves faster convergence rates and higher accuracies across multiple scenarios compared to traditional optimization methods.

Conclusion: The dynamic guidance mechanism in DWMGrad enables better adaptation to changing environments and task complexities, making it a superior optimization algorithm for deep learning applications.

Abstract: Within the current sphere of deep learning research, despite the extensive application of optimization algorithms such as Stochastic Gradient Descent (SGD) and Adaptive Moment Estimation (Adam), there remains a pronounced inadequacy in their capability to address fluctuations in learning efficiency, meet the demands of complex models, and tackle non-convex optimization issues. These challenges primarily arise from the algorithms’ limitations in handling complex data structures and models, for instance, difficulties in selecting an appropriate learning rate, avoiding local optima, and navigating through high-dimensional spaces. To address these issues, this paper introduces a novel optimization algorithm named DWMGrad. This algorithm, building on the foundations of traditional methods, incorporates a dynamic guidance mechanism reliant on historical data to dynamically update momentum and learning rates. This allows the optimizer to flexibly adjust its reliance on historical information, adapting to various training scenarios. This strategy not only enables the optimizer to better adapt to changing environments and task complexities but also, as validated through extensive experimentation, demonstrates DWMGrad’s ability to achieve faster convergence rates and higher accuracies under a multitude of scenarios.

[319] Training Across Reservoirs: Using Numerical Differentiation To Couple Trainable Networks With Black-Box Reservoirs

Andrew Clark, Jack Moursounidis, Osmaan Rasouli, William Gan, Cooper Doyle, Anna Leontjeva

Main category: cs.LG

TL;DR: BOND is a perturbative method for estimating partial derivatives in networks with inaccessible computational graphs, enabling integration of black-box functions that improve model performance without adding trainable parameters.

Details

Motivation: To enable exploration of trainable architectures that integrate black-box functions, particularly for networks where computational graphs are inaccessible, and to leverage fixed modules to expand model capacity.

Method: Bounded Numerical Differentiation (BOND) - a perturbative method for estimating partial derivatives across network structures with inaccessible computational graphs.

Result: BOND shows improved accuracy and scalability over existing perturbative methods. Black-box functions (implemented as fixed, untrained networks) enhance model performance without increasing trainable parameters, without requiring extensive optimization.

Conclusion: The findings highlight the potential of using fixed, non-trainable modules to expand model capacity, suggesting a path toward combining analogue and digital devices for network scaling.

Abstract: We introduce Bounded Numerical Differentiation (BOND), a perturbative method for estimating partial derivatives across network structures with inaccessible computational graphs. BOND demonstrates improved accuracy and scalability from existing perturbative methods, enabling new explorations of trainable architectures that integrate black-box functions. We observe that these black-box functions, realized in our experiments as fixed, untrained networks, can enhance model performance without increasing the number of trainable parameters. This improvement is achieved without extensive optimization of the architecture or properties of the black-box function itself. Our findings highlight the potential of leveraging fixed, non-trainable modules to expand model capacity, suggesting a path toward combining analogue and digital devices as a mechanism for scaling networks.

[320] Continual Low-Rank Adapters for LLM-based Generative Recommender Systems

Hyunsik Yoo, Ting-Wei Li, SeongKu Kang, Zhining Liu, Charlie Xu, Qilin Qi, Hanghang Tong

Main category: cs.LG

TL;DR: PESO is a continual adaptation method for LoRA in recommendation that uses proximal regularization to balance adaptation to new user preferences while preventing harmful influence from outdated preferences.

Details

Motivation: Existing LoRA-based continual learning methods focus on preserving past performance, but in recommendation, predicting past preferences is not the goal and outdated preferences can harm performance when user interests shift significantly.

Method: PESO introduces a proximal regularizer that anchors the current adapter to its most recent frozen state, enabling flexible balance between adaptation and preservation while capturing recent user behaviors. This provides data-aware, direction-wise guidance in the LoRA subspace.

Result: Empirically, PESO consistently outperforms existing LoRA-based continual learning methods in recommendation tasks.

Conclusion: The proximal regularization approach in PESO effectively addresses the unique challenges of continual learning in recommendation systems by enabling adaptive preservation that focuses on recent user behaviors rather than outdated preferences.

Abstract: While large language models (LLMs) achieve strong performance in recommendation, they face challenges in continual learning as users, items, and user preferences evolve over time. Existing LoRA-based continual methods primarily focus on preserving performance on previous tasks, but this overlooks the unique nature of recommendation: the goal is not to predict past preferences, and outdated preferences can even harm performance when current interests shift significantly. To address this, we propose PESO (Proximally rEgularized Single evolving lOra, a continual adaptation method for LoRA in recommendation. PESO introduces a proximal regularizer that anchors the current adapter to its most recent frozen state, enabling the model to flexibly balance adaptation and preservation, and to better capture recent user behaviors. Theoretically, we show that this proximal design provides data-aware, direction-wise guidance in the LoRA subspace. Empirically, PESO consistently outperforms existing LoRA-based continual learning methods.

[321] Learning Fair Graph Representations with Multi-view Information Bottleneck

Chuxun Liu, Debo Cheng, Qingfeng Chen, Jiangzhang Gan, Jiuyong Li, Lin Liu

Main category: cs.LG

TL;DR: FairMIB is a multi-view information bottleneck framework that decomposes graphs into feature, structural, and diffusion views to mitigate complexity biases in GNNs, achieving superior fairness-utility trade-offs.

Details

Motivation: GNNs amplify training data biases by propagating discriminatory attributes and structural imbalances, while existing fairness methods treat bias as a single source, ignoring distinct attribute and structure effects.

Method: Uses multi-view decomposition with contrastive learning to maximize cross-view mutual information, integrates multi-perspective conditional information bottleneck objectives, and employs inverse probability-weighted adjacency correction in diffusion view.

Result: Achieves state-of-the-art performance across both utility and fairness metrics on five real-world benchmark datasets.

Conclusion: FairMIB effectively addresses multiple sources of bias in GNNs through its multi-view framework and achieves optimal fairness-utility balance.

Abstract: Graph neural networks (GNNs) excel on relational data by passing messages over node features and structure, but they can amplify training data biases, propagating discriminatory attributes and structural imbalances into unfair outcomes. Many fairness methods treat bias as a single source, ignoring distinct attribute and structure effects and leading to suboptimal fairness and utility trade-offs. To overcome this challenge, we propose FairMIB, a multi-view information bottleneck framework designed to decompose graphs into feature, structural, and diffusion views for mitigating complexity biases in GNNs. Especially, the proposed FairMIB employs contrastive learning to maximize cross-view mutual information for bias-free representation learning. It further integrates multi-perspective conditional information bottleneck objectives to balance task utility and fairness by minimizing mutual information with sensitive attributes. Additionally, FairMIB introduces an inverse probability-weighted (IPW) adjacency correction in the diffusion view, which reduces the spread of bias propagation during message passing. Experiments on five real-world benchmark datasets demonstrate that FairMIB achieves state-of-the-art performance across both utility and fairness metrics.

[322] Shift is Good: Mismatched Data Mixing Improves Test Performance

Marko Medvedev, Kaifeng Lyu, Zhiyuan Li, Nathan Srebro

Main category: cs.LG

TL;DR: Distribution shift between training and test mixtures can be beneficial, improving test performance even without transfer between components, with optimal training proportions identified.

Details

Motivation: To challenge the conventional view that distribution shift between training and test data is always harmful, and to explore scenarios where mismatched training proportions can actually improve test performance.

Method: Analysis of mixture distributions with different training and test proportions, examining various scenarios to identify optimal training proportions and quantify benefits of distribution shift.

Result: Distribution shift can be beneficial in many settings, test performance can improve due to mismatched training proportions even without transfer between components, and optimal training proportions are identified.

Conclusion: Distribution shift between training and test mixtures can be generically beneficial, with identified optimal training proportions that improve test performance, and the analysis extends to compositional settings with different skill distributions.

Abstract: We consider training and testing on mixture distributions with different training and test proportions. We show that in many settings, and in some sense generically, distribution shift can be beneficial, and test performance can improve due to mismatched training proportions, even if the components are unrelated and with no transfer between components. In a variety of scenarios, we identify the optimal training proportions and the extent to which such distribution shift can be beneficial. We show how the same analysis applies also to a compositional setting with differing distribution of component “skills’’ at training and test.

[323] The Neural Differential Manifold: An Architecture with Explicit Geometric Structure

Di Zhang

Main category: cs.LG

TL;DR: The paper introduces Neural Differential Manifold (NDM), a neural network architecture that incorporates geometric structure by modeling networks as differentiable manifolds with Riemannian metrics, enabling geometric regularization and interpretable representations.

Details

Motivation: To move beyond conventional Euclidean parameter spaces and incorporate explicit geometric structure into neural network design for better generalization, robustness, and interpretability.

Method: Three-layer architecture: Coordinate Layer (invertible chart transitions), Geometric Layer (dynamic metric generation via sub-networks), and Evolution Layer (dual-objective optimization with geometric regularization).

Result: The framework enables natural gradient descent aligned with learned geometry, provides intrinsic regularization through curvature and volume distortion penalties, and offers unprecedented interpretability of internal representations.

Conclusion: NDM represents a fundamental shift towards geometrically structured, interpretable deep learning with potential for efficient optimization, continual learning, and scientific applications, though computational challenges remain.

Abstract: This paper introduces the Neural Differential Manifold (NDM), a novel neural network architecture that explicitly incorporates geometric structure into its fundamental design. Departing from conventional Euclidean parameter spaces, the NDM re-conceptualizes a neural network as a differentiable manifold where each layer functions as a local coordinate chart, and the network parameters directly parameterize a Riemannian metric tensor at every point. The architecture is organized into three synergistic layers: a Coordinate Layer implementing smooth chart transitions via invertible transformations inspired by normalizing flows, a Geometric Layer that dynamically generates the manifold’s metric through auxiliary sub-networks, and an Evolution Layer that optimizes both task performance and geometric simplicity through a dual-objective loss function. This geometric regularization penalizes excessive curvature and volume distortion, providing intrinsic regularization that enhances generalization and robustness. The framework enables natural gradient descent optimization aligned with the learned manifold geometry and offers unprecedented interpretability by endowing internal representations with clear geometric meaning. We analyze the theoretical advantages of this approach, including its potential for more efficient optimization, enhanced continual learning, and applications in scientific discovery and controllable generative modeling. While significant computational challenges remain, the Neural Differential Manifold represents a fundamental shift towards geometrically structured, interpretable, and efficient deep learning systems.

[324] A Unified Bilevel Model for Adversarial Learning and A Case Study

Yutong Zheng, Qingna Li

Main category: cs.LG

TL;DR: A unified bilevel model for adversarial learning is proposed, focusing on clustering models. The paper analyzes how data perturbation affects clustering robustness and introduces the δ-measure to quantify attack effects.

Details

Motivation: The motivation is to better understand and interpret adversarial attacks in machine learning models, particularly clustering, and to develop methods to measure attack effects since current mechanisms are not well understood.

Method: The authors propose a unified bilevel model for adversarial learning and investigate adversarial attacks in clustering models from a data perturbation perspective. They analyze the δ-measure for quantifying attack effects.

Result: The research reveals that clustering models are robust to small data perturbations but vulnerable to larger perturbations that change clustering results. The δ-measure is shown to be well-defined for measuring attack effects in the proposed bilevel model.

Conclusion: The paper concludes that understanding data perturbation effects is crucial for adversarial learning in clustering, and the proposed δ-measure provides a valid way to quantify attack impacts within the unified bilevel framework.

Abstract: Adversarial learning has been attracting more and more attention thanks to the fast development of machine learning and artificial intelligence. However, due to the complicated structure of most machine learning models, the mechanism of adversarial attacks is not well interpreted. How to measure the effect of attack is still not quite clear. In this paper, we propose a unified bilevel model for adversarial learning. We further investigate the adversarial attack in clustering models and interpret it from data perturbation point of view. We reveal that when the data perturbation is relatively small, the clustering model is robust, whereas if it is relatively large, the clustering result changes, which leads to an attack. To measure the effect of attacks for clustering models, we analyse the well-definedness of the so-called $\delta$-measure, which can be used in the proposed bilevel model for adversarial learning of clustering models.

[325] Learning Low Rank Neural Representations of Hyperbolic Wave Dynamics from Data

Woojin Cho, Kookjin Lee, Noseong Park, Donsub Rim, Gerrit Welper

Main category: cs.LG

TL;DR: A data-driven dimensionality reduction method using low rank neural representation (LRNR) for hyperbolic wave propagation, which learns efficient low-dimensional representations from data and enables interpretable physical feature decomposition.

Details

Motivation: To develop efficient representations for physics-based hyperbolic wave propagation data, motivated by theoretical proofs of efficient representations for this wave class.

Method: Utilizes specialized LRNR neural network architecture within a hypernetwork framework, combining deep learning techniques to learn low-dimensional representations directly from data.

Result: Learned low rank tensor representation naturally emerges, revealing interpretable physical feature decomposition in wave propagation, and enables efficient inference via compression.

Conclusion: LRNR architecture successfully learns efficient low-dimensional representations for hyperbolic wave propagation with interpretable physical features and practical compression benefits.

Abstract: We present a data-driven dimensionality reduction method that is well-suited for physics-based data representing hyperbolic wave propagation. The method utilizes a specialized neural network architecture called low rank neural representation (LRNR) inside a hypernetwork framework. The architecture is motivated by theoretical results that rigorously prove the existence of efficient representations for this wave class. We illustrate through archetypal examples that such an efficient low-dimensional representation of propagating waves can be learned directly from data through a combination of deep learning techniques. We observe that a low rank tensor representation arises naturally in the trained LRNRs, and that this reveals a new decomposition of wave propagation where each decomposed mode corresponds to interpretable physical features. Furthermore, we demonstrate that the LRNR architecture enables efficient inference via a compression scheme, which is a potentially important feature when deploying LRNRs in demanding performance regimes.

[326] Bridging the Divide: End-to-End Sequence-Graph Learning

Yuen Chen, Yulun Wu, Samuel Sharpe, Igor Melnyk, Nam H. Nguyen, Furong Huang, C. Bayan Bruss, Rizal Fathony

Main category: cs.LG

TL;DR: BRIDGE is a unified architecture that jointly models sequential and relational data by coupling sequence encoders with graph neural networks, enabling token-level message passing between neighboring sequences.

Details

Motivation: Real-world datasets often contain both sequential events and relational interactions, but existing methods typically handle only one modality. The authors argue sequences and graphs are complementary and should be learned jointly.

Method: BRIDGE combines sequence encoders with GNNs under a single objective, allowing gradient flow across both modules. It introduces TOKENXATTN for token-level cross-attention message passing between events in neighboring sequences.

Result: BRIDGE consistently outperforms static GNNs, temporal graph methods, and sequence-only baselines on ranking and classification metrics across friendship prediction (Brightkite) and fraud detection (Amazon) tasks.

Conclusion: Jointly modeling sequences and graphs through unified architectures like BRIDGE provides superior performance compared to approaches that treat these modalities separately.

Abstract: Many real-world datasets are both sequential and relational: each node carries an event sequence while edges encode interactions. Existing methods in sequence modeling and graph modeling often neglect one modality or the other. We argue that sequences and graphs are not separate problems but complementary facets of the same dataset, and should be learned jointly. We introduce BRIDGE, a unified end-to-end architecture that couples a sequence encoder with a GNN under a single objective, allowing gradients to flow across both modules and learning task-aligned representations. To enable fine-grained token-level message passing among neighbors, we add TOKENXATTN, a token-level cross-attention layer that passes messages between events in neighboring sequences. Across two settings, friendship prediction (Brightkite) and fraud detection (Amazon), BRIDGE consistently outperforms static GNNs, temporal graph methods, and sequence-only baselines on ranking and classification metrics.

[327] An Analysis of Causal Effect Estimation using Outcome Invariant Data Augmentation

Uzair Akbar, Niki Kilbertus, Hao Shen, Krikamol Muandet, Bo Dai

Main category: cs.LG

TL;DR: This paper proposes using data augmentation as instrumental variable-like interventions to improve causal effect estimation and generalization across interventions, especially when dealing with hidden confounders.

Details

Motivation: Traditional data augmentation is used for regularization in i.i.d. settings, but the authors want to extend its use for causal inference and generalization across interventions, particularly when instrumental variables are not readily available.

Method: The authors introduce IV-like (IVL) regression that regularizes IV-based estimators, treat data augmentation as interventions on treatment mechanisms, and use parameterized DA compositions to simulate worst-case scenarios for improved performance.

Result: The approach shows improved performance in causal estimation and generalization tasks, both theoretically for population cases and empirically through simulations and real data experiments.

Conclusion: Data augmentation can be effectively used beyond i.i.d. settings as IV-like interventions to mitigate confounding bias and improve causal effect estimation, offering better performance than simple data augmentation alone.

Abstract: The technique of data augmentation (DA) is often used in machine learning for regularization purposes to better generalize under i.i.d. settings. In this work, we present a unifying framework with topics in causal inference to make a case for the use of DA beyond just the i.i.d. setting, but for generalization across interventions as well. Specifically, we argue that when the outcome generating mechanism is invariant to our choice of DA, then such augmentations can effectively be thought of as interventions on the treatment generating mechanism itself. This can potentially help to reduce bias in causal effect estimation arising from hidden confounders. In the presence of such unobserved confounding we typically make use of instrumental variables (IVs) – sources of treatment randomization that are conditionally independent of the outcome. However, IVs may not be as readily available as DA for many applications, which is the main motivation behind this work. By appropriately regularizing IV based estimators, we introduce the concept of IV-like (IVL) regression for mitigating confounding bias and improving predictive performance across interventions even when certain IV properties are relaxed. Finally, we cast parameterized DA as an IVL regression problem and show that when used in composition can simulate a worst-case application of such DA, further improving performance on causal estimation and generalization tasks beyond what simple DA may offer. This is shown both theoretically for the population case and via simulation experiments for the finite sample case using a simple linear example. We also present real data experiments to support our case.

[328] Lipschitz-aware Linearity Grafting for Certified Robustness

Yongjin Han, Suhyun Kim

Main category: cs.LG

TL;DR: Linearity grafting into non-linear activation functions reduces approximation errors and tightens local Lipschitz constants, improving certified robustness without certified training.

Details

Motivation: Existing over-approximation methods for neural network verification suffer from approximation errors that prevent obtaining tight local Lipschitz constants, which are crucial for certified robustness against adversarial examples.

Method: Proposed Lipschitz-aware linearity grafting method that replaces non-linear activation functions with linear ones to eliminate approximation errors, since linear functions don’t require relaxation.

Result: Extensive experiments show that grafting linearity into influential activations tightens the l∞ local Lipschitz constant and enhances certified robustness.

Conclusion: Linearity grafting improves certified robustness by eliminating dominant approximation errors and providing tighter local Lipschitz constants, with theoretical justification for why this approach works.

Abstract: Lipschitz constant is a fundamental property in certified robustness, as smaller values imply robustness to adversarial examples when a model is confident in its prediction. However, identifying the worst-case adversarial examples is known to be an NP-complete problem. Although over-approximation methods have shown success in neural network verification to address this challenge, reducing approximation errors remains a significant obstacle. Furthermore, these approximation errors hinder the ability to obtain tight local Lipschitz constants, which are crucial for certified robustness. Originally, grafting linearity into non-linear activation functions was proposed to reduce the number of unstable neurons, enabling scalable and complete verification. However, no prior theoretical analysis has explained how linearity grafting improves certified robustness. We instead consider linearity grafting primarily as a means of eliminating approximation errors rather than reducing the number of unstable neurons, since linear functions do not require relaxation. In this paper, we provide two theoretical contributions: 1) why linearity grafting improves certified robustness through the lens of the $l_\infty$ local Lipschitz constant, and 2) grafting linearity into non-linear activation functions, the dominant source of approximation errors, yields a tighter local Lipschitz constant. Based on these theoretical contributions, we propose a Lipschitz-aware linearity grafting method that removes dominant approximation errors, which are crucial for tightening the local Lipschitz constant, thereby improving certified robustness, even without certified training. Our extensive experiments demonstrate that grafting linearity into these influential activations tightens the $l_\infty$ local Lipschitz constant and enhances certified robustness.

[329] Machine Learning Guided Optimal Transmission Switching to Mitigate Wildfire Ignition Risk

Weimin Huang, Ryan Piansky, Bistra Dilkina, Daniel K. Molzahn

Main category: cs.LG

TL;DR: ML-guided framework for Optimal Power Shutoff problems that quickly produces high-quality de-energization decisions by exploiting shared patterns across instances and integrating domain knowledge.

Details

Motivation: To rapidly solve computationally challenging Mixed-Integer Linear Programs for wildfire risk management through power line de-energization, leveraging shared structure across problem instances.

Method: Extends existing ML-guided MILP solution methods by integrating domain knowledge about the number of energized and de-energized lines, exploiting patterns across OPS instances with varying wildfire risks, loads, and renewable generation.

Result: The proposed ML-guided method produces high-quality solutions faster than traditional optimization methods on a large-scale realistic California-based synthetic test system.

Conclusion: Machine learning can effectively accelerate solution of Optimal Power Shutoff problems for wildfire risk management while maintaining solution quality.

Abstract: To mitigate acute wildfire ignition risks, utilities de-energize power lines in high-risk areas. The Optimal Power Shutoff (OPS) problem optimizes line energization statuses to manage wildfire ignition risks through de-energizations while reducing load shedding. OPS problems are computationally challenging Mixed-Integer Linear Programs (MILPs) that must be solved rapidly and frequently in operational settings. For a particular power system, OPS instances share a common structure with varying parameters related to wildfire risks, loads, and renewable generation. This motivates the use of Machine Learning (ML) for solving OPS problems by exploiting shared patterns across instances. In this paper, we develop an ML-guided framework that quickly produces high-quality de-energization decisions by extending existing ML-guided MILP solution methods while integrating domain knowledge on the number of energized and de-energized lines. Results on a large-scale realistic California-based synthetic test system show that the proposed ML-guided method produces high-quality solutions faster than traditional optimization methods.

[330] Selective Learning for Deep Time Series Forecasting

Yisong Fu, Zezhi Shao, Chengqing Yu, Yujie Li, Zhulin An, Qi Wang, Yongjun Xu, Fei Wang

Main category: cs.LG

TL;DR: Proposes a selective learning strategy for deep time series forecasting that screens timesteps using dual-mask mechanism to prevent overfitting by focusing on generalizable timesteps and ignoring uncertain/anomalous ones.

Details

Motivation: Deep learning models for time series forecasting suffer from severe overfitting due to vulnerability to noise and anomalies, as they uniformly optimize all timesteps without distinction.

Method: Dual-mask selective learning: uncertainty mask using residual entropy to filter uncertain timesteps, and anomaly mask using residual lower bound estimation to exclude anomalous timesteps.

Result: Significant performance improvements across 8 real-world datasets: 37.4% MSE reduction for Informer, 8.4% for TimesNet, and 6.5% for iTransformer.

Conclusion: Selective learning strategy effectively mitigates overfitting in deep time series forecasting models by focusing optimization on generalizable timesteps.

Abstract: Benefiting from high capacity for capturing complex temporal patterns, deep learning (DL) has significantly advanced time series forecasting (TSF). However, deep models tend to suffer from severe overfitting due to the inherent vulnerability of time series to noise and anomalies. The prevailing DL paradigm uniformly optimizes all timesteps through the MSE loss and learns those uncertain and anomalous timesteps without difference, ultimately resulting in overfitting. To address this, we propose a novel selective learning strategy for deep TSF. Specifically, selective learning screens a subset of the whole timesteps to calculate the MSE loss in optimization, guiding the model to focus on generalizable timesteps while disregarding non-generalizable ones. Our framework introduces a dual-mask mechanism to target timesteps: (1) an uncertainty mask leveraging residual entropy to filter uncertain timesteps, and (2) an anomaly mask employing residual lower bound estimation to exclude anomalous timesteps. Extensive experiments across eight real-world datasets demonstrate that selective learning can significantly improve the predictive performance for typical state-of-the-art deep models, including 37.4% MSE reduction for Informer, 8.4% for TimesNet, and 6.5% for iTransformer.

[331] Cost-Sensitive Unbiased Risk Estimation for Multi-Class Positive-Unlabeled Learning

Miao Zhang, Junpeng Li, Changchun Hua, Yana Yang

Main category: cs.LG

TL;DR: Proposes a cost-sensitive multi-class PU learning method with adaptive loss weighting for unbiased risk estimation when only positive and unlabeled data are available.

Details

Motivation: Multi-class PU learning remains challenging as many existing approaches don't ensure unbiased risk estimation, limiting performance and stability in real applications where annotating reliable negatives is difficult.

Method: Uses adaptive loss weighting within empirical risk minimization framework, assigning distinct data-dependent weights to positive and inferred-negative loss components to create an unbiased estimator of target risk.

Result: Extensive experiments on eight public datasets show consistent gains over strong baselines in both accuracy and stability across varying class priors and numbers of classes.

Conclusion: The proposed cost-sensitive multi-class PU method with adaptive loss weighting provides an effective solution for unbiased risk estimation in multi-class PU learning scenarios.

Abstract: Positive–Unlabeled (PU) learning considers settings in which only positive and unlabeled data are available, while negatives are missing or left unlabeled. This situation is common in real applications where annotating reliable negatives is difficult or costly. Despite substantial progress in PU learning, the multi-class case (MPU) remains challenging: many existing approaches do not ensure \emph{unbiased risk estimation}, which limits performance and stability. We propose a cost-sensitive multi-class PU method based on \emph{adaptive loss weighting}. Within the empirical risk minimization framework, we assign distinct, data-dependent weights to the positive and \emph{inferred-negative} (from the unlabeled mixture) loss components so that the resulting empirical objective is an unbiased estimator of the target risk. We formalize the MPU data-generating process and establish a generalization error bound for the proposed estimator. Extensive experiments on \textbf{eight} public datasets, spanning varying class priors and numbers of classes, show consistent gains over strong baselines in both accuracy and stability.

[332] BSFA: Leveraging the Subspace Dichotomy to Accelerate Neural Network Training

Wenjie Zhou, Bohan Wang, Wei Chen, Xueqi Cheng

Main category: cs.LG

TL;DR: BSFA is a plug-and-play framework that accelerates deep learning training by differentially scaling parameter updates in Dom-space (top eigendirections) and Bulk-space (orthogonal component), achieving 2x speedup on LLaMA models.

Details

Motivation: Recent studies show that parameter updates along top Hessian eigendirections (Dom-space) have large magnitude but minimal loss reduction, while updates in orthogonal Bulk-space drive most learning progress despite smaller magnitudes.

Method: BSFA differentially scales update components in Dom-space and Bulk-space, using PCA on historical updates for efficient subspace estimation and a block-wise strategy for scalability.

Result: BSFA achieves approximately 2x speedup when pre-training LLaMA-72M on WikiText-103 and LLaMA-134M on OpenWebText compared to vanilla AdamW.

Conclusion: BSFA provides an effective and scalable approach to accelerate deep learning training by focusing on the most impactful update directions while maintaining stability.

Abstract: Recent studies \citep{gur2018gradient,song2024does, wen2024understanding} highlight a fundamental dichotomy in deep learning optimization: Although parameter updates along the top eigendirections of the loss Hessian (Dom-space) capture most of the update magnitude, they often contribute minimally to loss reduction. In contrast, updates in the orthogonal component (Bulk-space) have smaller magnitudes but drive most learning progress. In this work, we further advance the understanding of this phenomenon and introduce the \textbf{Bulk-Space-Filtration-Accelerator (BSFA)}, a novel plug-and-play framework. BSFA accelerates training by differentially scaling update components projected onto these distinct subspaces, simultaneously enhancing stability by moderating updates in the dominant subspace and boosting convergence speed by amplifying those in the bulk-space. To ensure BSFA is both practical and scalable for contemporary large models, we introduce two key innovations: an efficient estimator using Principal Component Analysis (PCA) on historical updates for fast subspace estimation, and a block-wise strategy that applies this estimation on a per-parameter-block basis. These designs make BSFA computationally tractable and highly effective. We demonstrate BSFA’s acceleration across various tasks, notably achieving approximately 2$\times$ speedup when pre-training LLaMA-72M on WikiText-103 and LLaMA-134M on OpenWebText compared to vanilla AdamW.

[333] Scaling Up Bayesian DAG Sampling

Daniele Nikzad, Alexander Zhilkin, Juha Harviainen, Jack Kuipers, Giusi Moffa, Mikko Koivisto

Main category: cs.LG

TL;DR: Efficient techniques for Bayesian network structure sampling: optimized basic moves and parent set pruning for faster convergence.

Details

Motivation: Bayesian network structure inference via Markov chain sampling is computationally expensive, requiring improvements to basic operations and parent set computations.

Method: Two techniques: 1) Efficient implementation of basic graph moves (add/delete/reverse arcs), 2) Preprocessing method to prune possible parent sets while preserving sums.

Result: Empirical study shows substantial efficiency gains compared to previous methods.

Conclusion: The proposed techniques significantly improve sampling efficiency for Bayesian network structure inference.

Abstract: Bayesian inference of Bayesian network structures is often performed by sampling directed acyclic graphs along an appropriately constructed Markov chain. We present two techniques to improve sampling. First, we give an efficient implementation of basic moves, which add, delete, or reverse a single arc. Second, we expedite summing over parent sets, an expensive task required for more sophisticated moves: we devise a preprocessing method to prune possible parent sets so as to approximately preserve the sums. Our empirical study shows that our techniques can yield substantial efficiency gains compared to previous methods.

[334] IBNorm: Information-Bottleneck Inspired Normalization for Representation Learning

Xiandong Zou, Pan Zhou

Main category: cs.LG

TL;DR: IBNorm is a new normalization method based on Information Bottleneck principle that outperforms traditional variance-centric methods like BatchNorm, LayerNorm, and RMSNorm by encouraging embeddings to preserve predictive information while suppressing nuisance variability.

Details

Motivation: Existing normalization methods focus on stabilizing training through variance control but don't explicitly control how representations capture task-relevant information. The authors aim to develop normalization that better preserves predictive information.

Method: Proposed IB-Inspired Normalization (IBNorm), which introduces bounded compression operations grounded in Information Bottleneck principle to encourage embeddings to preserve predictive information while suppressing nuisance variability.

Result: IBNorm consistently outperforms BatchNorm, LayerNorm, and RMSNorm across large-scale language models (LLaMA, GPT-2) and vision models (ResNet, ViT), with mutual information analysis confirming superior information bottleneck behavior.

Conclusion: IBNorm provides a more effective normalization approach that yields more informative representations while maintaining the stability and compatibility of standard normalization methods, with theoretical guarantees of higher IB value and tighter generalization bounds.

Abstract: Normalization is fundamental to deep learning, but existing approaches such as BatchNorm, LayerNorm, and RMSNorm are variance-centric by enforcing zero mean and unit variance, stabilizing training without controlling how representations capture task-relevant information. We propose IB-Inspired Normalization (IBNorm), a simple yet powerful family of methods grounded in the Information Bottleneck principle. IBNorm introduces bounded compression operations that encourage embeddings to preserve predictive information while suppressing nuisance variability, yielding more informative representations while retaining the stability and compatibility of standard normalization. Theoretically, we prove that IBNorm achieves a higher IB value and tighter generalization bounds than variance-centric methods. Empirically, IBNorm consistently outperforms BatchNorm, LayerNorm, and RMSNorm across large-scale language models (LLaMA, GPT-2) and vision models (ResNet, ViT), with mutual information analysis confirming superior information bottleneck behavior. Code will be released publicly.

[335] On the Stability of Neural Networks in Deep Learning

Blaise Delattre

Main category: cs.LG

TL;DR: This thesis addresses neural network instability and vulnerability through sensitivity analysis, using Lipschitz networks, curvature regularization, and randomized smoothing to improve robustness and training stability.

Details

Motivation: Deep learning models suffer from instability and vulnerability where small input changes drastically affect predictions, and optimization is hindered by sharp loss landscapes.

Method: Combines Lipschitz networks to constrain input sensitivity, curvature regularization for smoother optimization landscapes, and randomized smoothing for probabilistic robustness at decision boundaries.

Result: Developed a unified framework with theoretical analysis and practical methodologies including efficient spectral norm computation, novel Lipschitz-constrained layers, and improved certification procedures.

Conclusion: The sensitivity analysis perspective provides a unified approach to address fundamental stability challenges in neural networks through architectural constraints and regularization techniques.

Abstract: Deep learning has achieved remarkable success across a wide range of tasks, but its models often suffer from instability and vulnerability: small changes to the input may drastically affect predictions, while optimization can be hindered by sharp loss landscapes. This thesis addresses these issues through the unifying perspective of sensitivity analysis, which examines how neural networks respond to perturbations at both the input and parameter levels. We study Lipschitz networks as a principled way to constrain sensitivity to input perturbations, thereby improving generalization, adversarial robustness, and training stability. To complement this architectural approach, we introduce regularization techniques based on the curvature of the loss function, promoting smoother optimization landscapes and reducing sensitivity to parameter variations. Randomized smoothing is also explored as a probabilistic method for enhancing robustness at decision boundaries. By combining these perspectives, we develop a unified framework where Lipschitz continuity, randomized smoothing, and curvature regularization interact to address fundamental challenges in stability. The thesis contributes both theoretical analysis and practical methodologies, including efficient spectral norm computation, novel Lipschitz-constrained layers, and improved certification procedures.

[336] Hierarchical Physics-Embedded Learning for Spatiotemporal Dynamical Systems

Xizhe Wang, Xiaobin Song, Qingshan Jia, Hongbo Zhao, Benben Jiang

Main category: cs.LG

TL;DR: A hierarchical physics-embedded learning framework for spatiotemporal dynamics that combines data-driven learning with physical knowledge integration, enabling both forward prediction and inverse discovery of governing equations from sparse, noisy data.

Details

Motivation: Traditional PDE modeling is intractable for complex far-from-equilibrium systems due to high-order derivatives, strong nonlinearities, and incomplete physical knowledge. Existing data-driven methods lack physical consistency or structural capacity to handle complex operators.

Method: Two-level architecture: first level learns fundamental symbolic PDE components, second level learns their governing combinations. Uses adaptive Fourier Neural Operators to capture non-local dependencies and high-order operators. Directly embeds known physical laws into computational graphs.

Result: Framework enables physical consistency, improved data efficiency, and interpretable discovery of governing equations through symbolic regression without presupposing functional forms.

Conclusion: The hierarchical decomposition reduces learning complexity and enables systematic integration of prior knowledge, advancing both forward spatiotemporal prediction and inverse discovery of physical laws from limited data.

Abstract: Modeling complex spatiotemporal dynamics, particularly in far-from-equilibrium systems, remains a grand challenge in science. The governing partial differential equations (PDEs) for these systems are often intractable to derive from first principles, due to their inherent complexity, characterized by high-order derivatives and strong nonlinearities, coupled with incomplete physical knowledge. This has spurred the development of data-driven methods, yet these approaches face limitations: Purely data-driven models are often physically inconsistent and data-intensive, while existing physics-informed methods lack the structural capacity to represent complex operators or systematically integrate partial physical knowledge. Here, we propose a hierarchical physics-embedded learning framework that fundamentally advances both the forward spatiotemporal prediction and inverse discovery of physical laws from sparse and noisy data. The key innovation is a two-level architecture that mirrors the process of scientific discovery: the first level learns fundamental symbolic components of a PDE, while the second learns their governing combinations. This hierarchical decomposition not only reduces learning complexity but, more importantly, enables a structural integration of prior knowledge. Known physical laws are directly embedded into the models computational graph, guaranteeing physical consistency and improving data efficiency. By building the framework upon adaptive Fourier Neural Operators, we can effectively capture the non-local dependencies and high-order operators characteristic of dynamical systems. Additionally, by structurally decoupling known and unknown terms, the framework further enables interpretable discovery of underlying governing equations through symbolic regression, without presupposing functional forms.

[337] Dense and Diverse Goal Coverage in Multi Goal Reinforcement Learning

Sagalpreet Singh, Rishi Saket, Aravindan Raghuveer

Main category: cs.LG

TL;DR: The paper proposes a novel RL algorithm that learns policies to maximize expected return while ensuring uniform visitation of goal states, addressing the limitation of traditional RL methods that may exploit only a few reward sources.

Details

Motivation: Traditional RL algorithms focus on maximizing expected return but may exploit only one or few reward sources, leading to non-dispersed state distributions. In many natural scenarios, it's desirable to have policies that uniformly visit all goal states while maintaining high returns.

Method: The authors formalize Multi Goal RL and propose an algorithm that learns a policy mixture with dispersed marginal state distribution over goal states. The method uses a custom RL reward computed based on current policy mixture, sampled trajectories, and offline RL updates.

Result: The algorithm provides performance guarantees with efficient convergence bounds for optimizing an objective that captures both expected return and dispersion of marginal state distribution over goal states. Experiments on synthetic MDPs and standard RL environments demonstrate effectiveness.

Conclusion: The proposed approach successfully addresses the challenge of learning policies that maximize return while ensuring uniform visitation of goal states, overcoming limitations of existing entropy regularization and intrinsic reward methods.

Abstract: Reinforcement Learning algorithms are primarily focused on learning a policy that maximizes expected return. As a result, the learned policy can exploit one or few reward sources. However, in many natural situations, it is desirable to learn a policy that induces a dispersed marginal state distribution over rewarding states, while maximizing the expected return which is typically tied to reaching a goal state. This aspect remains relatively unexplored. Existing techniques based on entropy regularization and intrinsic rewards use stochasticity for encouraging exploration to find an optimal policy which may not necessarily lead to dispersed marginal state distribution over rewarding states. Other RL algorithms which match a target distribution assume the latter to be available apriori. This may be infeasible in large scale systems where enumeration of all states is not possible and a state is determined to be a goal state only upon reaching it. We formalize the problem of maximizing the expected return while uniformly visiting the goal states as Multi Goal RL in which an oracle classifier over the state space determines the goal states. We propose a novel algorithm that learns a high-return policy mixture with marginal state distribution dispersed over the set of goal states. Our algorithm is based on optimizing a custom RL reward which is computed - based on the current policy mixture - at each iteration for a set of sampled trajectories. The latter are used via an offline RL algorithm to update the policy mixture. We prove performance guarantees for our algorithm, showing efficient convergence bounds for optimizing a natural objective which captures the expected return as well as the dispersion of the marginal state distribution over the goal states. We design and perform experiments on synthetic MDPs and standard RL environments to evaluate the effectiveness of our algorithm.

[338] CDFlow: Building Invertible Layers with Circulant and Diagonal Matrices

Xuchen Feng, Siyu Liao

Main category: cs.LG

TL;DR: Introduces CDFlow, a normalizing flow using circulant-diagonal matrices to reduce parameter complexity from O(n²) to O(mn) and accelerate matrix inversion from O(n³) to O(mn log n) via FFT.

Details

Motivation: Design efficient invertible linear layers for normalizing flows that maintain expressiveness while enabling fast computation of Jacobian determinants and inverses.

Method: Novel invertible linear layer based on product of circulant and diagonal matrices, leveraging Fast Fourier Transform for efficient computation.

Result: CDFlow achieves strong density estimation on natural images, models periodic data effectively, and significantly accelerates flow operations.

Conclusion: Circulant-diagonal decomposition provides practical benefits for scalable generative modeling by balancing expressiveness with computational efficiency.

Abstract: Normalizing flows are deep generative models that enable efficient likelihood estimation and sampling through invertible transformations. A key challenge is to design linear layers that enhance expressiveness while maintaining efficient computation of the Jacobian determinant and inverse. We introduce a novel invertible linear layer based on the product of circulant and diagonal matrices. This decomposition reduces parameter complexity from $\mathcal{O}(n^2)$ to $\mathcal{O}(mn)$ using $m$ diagonal matrices and $m-1$ circulant matrices while still approximating general linear transformations. By leveraging the Fast Fourier Transform, our approach reduces the time complexity of matrix inversion from $\mathcal{O}(n^3)$ to $\mathcal{O}(mn\log n)$ and that of computing the log-determinant from $\mathcal{O}(n^3)$ to $\mathcal{O}(mn)$, where $n$ is the input dimension. We build upon this layer to develop Circulant-Diagonal Flow (CDFlow), which achieves strong density estimation on natural image datasets and effectively models data with inherent periodic structure. Furthermore, CDFlow significantly accelerates key operations in normalizing flows, providing practical benefits for scalable generative modeling.

[339] Beyond Leakage and Complexity: Towards Realistic and Efficient Information Cascade Prediction

Jie Peng, Rui Wang, Qiang Wang, Zhewei Wei, Bin Tong, Guan Wang

Main category: cs.LG

TL;DR: This paper addresses limitations in information cascade popularity prediction by proposing time-ordered data splitting, introducing a rich e-commerce dataset (Taoke), and developing an efficient framework (CasTemp) that achieves state-of-the-art performance with significant speed improvements.

Details

Motivation: Current cascade prediction methods suffer from three critical issues: temporal leakage in evaluation that allows access to future information, feature-poor datasets lacking downstream conversion signals, and computational inefficiency of complex graph-based methods.

Method: Three-pronged approach: (1) time-ordered splitting strategy for leak-free evaluation, (2) Taoke dataset with rich e-commerce attributes and purchase conversions, (3) CasTemp framework using temporal walks, Jaccard-based neighbor selection, and GRU encoding with time-aware attention.

Result: CasTemp achieves state-of-the-art performance across four datasets under leak-free evaluation with orders-of-magnitude speedup. It particularly excels at predicting second-stage popularity conversions critical for real-world applications.

Conclusion: The proposed solutions systematically address fundamental limitations in cascade prediction, providing both methodological improvements and practical datasets for more realistic and efficient information diffusion analysis.

Abstract: Information cascade popularity prediction is a key problem in analyzing content diffusion in social networks. However, current related works suffer from three critical limitations: (1) temporal leakage in current evaluation–random cascade-based splits allow models to access future information, yielding unrealistic results; (2) feature-poor datasets that lack downstream conversion signals (e.g., likes, comments, or purchases), which limits more practical applications; (3) computational inefficiency of complex graph-based methods that require days of training for marginal gains. We systematically address these challenges from three perspectives: task setup, dataset construction, and model design. First, we propose a time-ordered splitting strategy that chronologically partitions data into consecutive windows, ensuring models are evaluated on genuine forecasting tasks without future information leakage. Second, we introduce Taoke, a large-scale e-commerce cascade dataset featuring rich promoter/product attributes and ground-truth purchase conversions–capturing the complete diffusion lifecycle from promotion to monetization. Third, we develop CasTemp, a lightweight framework that efficiently models cascade dynamics through temporal walks, Jaccard-based neighbor selection for inter-cascade dependencies, and GRU-based encoding with time-aware attention. Under leak-free evaluation, CasTemp achieves state-of-the-art performance across four datasets with orders-of-magnitude speedup. Notably, it excels at predicting second-stage popularity conversions–a practical task critical for real-world applications.

[340] Analysis of Semi-Supervised Learning on Hypergraphs

Adrien Weihs, Andrea Bertozzi, Matthew Thorpe

Main category: cs.LG

TL;DR: This paper provides theoretical analysis of hypergraph learning and proposes HOHL, a method that uses skeleton graph Laplacians for multiscale regularization, showing strong empirical performance.

Details

Motivation: Hypergraphs model higher-order interactions but lack theoretical foundations in semi-supervised learning. The paper aims to establish asymptotic consistency and develop effective learning methods for hypergraphs.

Method: Proposes Higher-Order Hypergraph Learning (HOHL) which regularizes using powers of Laplacians from skeleton graphs to achieve multiscale smoothness. The method converges to a higher-order Sobolev seminorm.

Result: Theoretical analysis shows asymptotic consistency of variational learning on random geometric hypergraphs, with convergence to a weighted p-Laplacian equation. HOHL performs strongly on standard baselines.

Conclusion: The paper establishes theoretical foundations for hypergraph learning and demonstrates that HOHL provides effective regularization through multiscale smoothness, achieving strong empirical results.

Abstract: Hypergraphs provide a natural framework for modeling higher-order interactions, yet their theoretical underpinnings in semi-supervised learning remain limited. We provide an asymptotic consistency analysis of variational learning on random geometric hypergraphs, precisely characterizing the conditions ensuring the well-posedness of hypergraph learning as well as showing convergence to a weighted $p$-Laplacian equation. Motivated by this, we propose Higher-Order Hypergraph Learning (HOHL), which regularizes via powers of Laplacians from skeleton graphs for multiscale smoothness. HOHL converges to a higher-order Sobolev seminorm. Empirically, it performs strongly on standard baselines.

[341] Parameter Averaging in Link Prediction

Rupesh Sapkota, Caglar Demir, Arnab Sharma, Axel-Cyrille Ngonga Ngomo

Main category: cs.LG

TL;DR: Proposes model merging via weighted averaging for knowledge graph embedding models to improve link prediction performance while reducing computational overhead compared to traditional ensemble methods.

Details

Motivation: Traditional ensemble methods for KGE models require training multiple models, which increases computational overhead, latency, and memory usage. Model merging approaches offer a more efficient alternative.

Method: Two weighted averaging approaches: 1) maintaining running average of model parameters from training epoch onward, and 2) selective updating of ensemble parameters only when validation performance improves.

Result: The proposed weighted averaging approach consistently improves performance across link prediction tasks, literal-augmented KGE models, and multi-hop query answering tasks, outperforming state-of-the-art benchmark ensemble approaches.

Conclusion: Weighted averaging is an effective model merging technique for KGE models that provides performance improvements while being computationally more efficient than traditional ensemble methods.

Abstract: Ensemble methods are widely employed to improve generalization in machine learning. This has also prompted the adoption of ensemble learning for the knowledge graph embedding (KGE) models in performing link prediction. Typical approaches to this end train multiple models as part of the ensemble, and the diverse predictions are then averaged. However, this approach has some significant drawbacks. For instance, the computational overhead of training multiple models increases latency and memory overhead. In contrast, model merging approaches offer a promising alternative that does not require training multiple models. In this work, we introduce model merging, specifically weighted averaging, in KGE models. Herein, a running average of model parameters from a training epoch onward is maintained and used for predictions. To address this, we additionally propose an approach that selectively updates the running average of the ensemble model parameters only when the generalization performance improves on a validation dataset. We evaluate these two different weighted averaging approaches on link prediction tasks, comparing the state-of-the-art benchmark ensemble approach. Additionally, we evaluate the weighted averaging approach considering literal-augmented KGE models and multi-hop query answering tasks as well. The results demonstrate that the proposed weighted averaging approach consistently improves performance across diverse evaluation settings.

[342] A Convexity-dependent Two-Phase Training Algorithm for Deep Neural Networks

Tomas Hrycej, Bernhard Bermeitinger, Massimo Pavone, Götz-Henrik Wiegand, Siegfried Handschuh

Main category: cs.LG

TL;DR: The paper proposes a two-phase optimization algorithm that leverages the transition from non-convex to convex regions in loss functions, using Adam for non-convex regions and Conjugate Gradient for convex regions to improve convergence and accuracy.

Details

Motivation: Loss functions in machine learning often have non-convex regions, leading to widespread use of non-convex methods like Adam. However, local minima imply convex neighborhoods where second-order methods can achieve superlinear convergence.

Method: A two-phase algorithm that detects the transition point from non-convex to convex regions by monitoring gradient norm dependence on loss, then switches between Adam (non-convex phase) and Conjugate Gradient (convex phase) accordingly.

Result: Computing experiments confirm that this convexity structure is frequent enough to be practically exploited, leading to substantial improvements in convergence and accuracy.

Conclusion: The proposed framework successfully leverages the natural transition from non-convex to convex regions in loss functions to design an effective hybrid optimization algorithm that outperforms single-method approaches.

Abstract: The key task of machine learning is to minimize the loss function that measures the model fit to the training data. The numerical methods to do this efficiently depend on the properties of the loss function. The most decisive among these properties is the convexity or non-convexity of the loss function. The fact that the loss function can have, and frequently has, non-convex regions has led to a widespread commitment to non-convex methods such as Adam. However, a local minimum implies that, in some environment around it, the function is convex. In this environment, second-order minimizing methods such as the Conjugate Gradient (CG) give a guaranteed superlinear convergence. We propose a novel framework grounded in the hypothesis that loss functions in real-world tasks swap from initial non-convexity to convexity towards the optimum. This is a property we leverage to design an innovative two-phase optimization algorithm. The presented algorithm detects the swap point by observing the gradient norm dependence on the loss. In these regions, non-convex (Adam) and convex (CG) algorithms are used, respectively. Computing experiments confirm the hypothesis that this simple convexity structure is frequent enough to be practically exploited to substantially improve convergence and accuracy.

[343] Position: Biology is the Challenge Physics-Informed ML Needs to Evolve

Julien Martinelli

Main category: cs.LG

TL;DR: The paper proposes Biology-Informed Machine Learning (BIML) as an extension of Physics-Informed Machine Learning (PIML) to address unique challenges in biological modeling, including uncertain prior knowledge, noisy data, and complex networks.

Details

Motivation: To adapt PIML's success in physics to biology, which has different constraints and challenges like multi-faceted prior knowledge, heterogeneous data, partial observability, and high-dimensional networks.

Method: Proposes BIML as a principled extension of PIML that uses softer, probabilistic forms of prior knowledge and outlines four foundational pillars: uncertainty quantification, contextualization, constrained latent structure inference, and scalability.

Result: BIML retools PIML methods to operate under biological constraints, with Foundation Models and Large Language Models serving as key enablers to bridge human expertise with computational modeling.

Conclusion: The paper provides concrete recommendations to build the BIML ecosystem and direct PIML-inspired innovation toward scientifically and socially relevant biological challenges.

Abstract: Physics-Informed Machine Learning (PIML) has successfully integrated mechanistic understanding into machine learning, particularly in domains governed by well-known physical laws. This success has motivated efforts to apply PIML to biology, a field rich in dynamical systems but shaped by different constraints. Biological modeling, however, presents unique challenges: multi-faceted and uncertain prior knowledge, heterogeneous and noisy data, partial observability, and complex, high-dimensional networks. In this position paper, we argue that these challenges should not be seen as obstacles to PIML, but as catalysts for its evolution. We propose Biology-Informed Machine Learning (BIML): a principled extension of PIML that retains its structural grounding while adapting to the practical realities of biology. Rather than replacing PIML, BIML retools its methods to operate under softer, probabilistic forms of prior knowledge. We outline four foundational pillars as a roadmap for this transition: uncertainty quantification, contextualization, constrained latent structure inference, and scalability. Foundation Models and Large Language Models will be key enablers, bridging human expertise with computational modeling. We conclude with concrete recommendations to build the BIML ecosystem and channel PIML-inspired innovation toward challenges of high scientific and societal relevance.

[344] A Deep Learning Framework for Multi-Operator Learning: Architectures and Approximation Theory

Adrien Weihs, Jingmin Sun, Zecheng Zhang, Hayden Schaeffer

Main category: cs.LG

TL;DR: This paper studies neural operator learning for multiple operators, distinguishing between learning parameterized operator families and learning distinct operators. It introduces new architectures with theoretical guarantees and empirical validation on PDE benchmarks.

Details

Motivation: Scientific applications require approximating mappings between function spaces (operators), but most machine learning focuses on finite-dimensional spaces. There's a need for scalable approaches to learn collections of operators efficiently.

Method: Two approaches: (1) Multiple operator learning using MNO and MONet architectures for parametric operator families, with universal approximation proofs. (2) Learning distinct operators with balanced architectural complexity across subnetworks. Theoretical analysis includes scaling laws and computational efficiency.

Result: Established universal approximation for continuous, integrable, and Lipschitz operators. Derived explicit scaling laws for network size vs accuracy. Empirical experiments on parametric PDE benchmarks show strong expressive power and efficiency.

Conclusion: The work provides a unified theoretical and practical foundation for scalable neural operator learning across multiple operators, with proven approximation guarantees and efficient architectures.

Abstract: While many problems in machine learning focus on learning mappings between finite-dimensional spaces, scientific applications require approximating mappings between function spaces, i.e., operators. We study the problem of learning collections of operators and provide both theoretical and empirical advances. We distinguish between two regimes: (i) multiple operator learning, where a single network represents a continuum of operators parameterized by a parametric function, and (ii) learning several distinct single operators, where each operator is learned independently. For the multiple operator case, we introduce two new architectures, $\mathrm{MNO}$ and $\mathrm{MONet}$, and establish universal approximation results in three settings: continuous, integrable, or Lipschitz operators. For the latter, we further derive explicit scaling laws that quantify how the network size must grow to achieve a target approximation accuracy. For learning several single operators, we develop a framework for balancing architectural complexity across subnetworks and show how approximation order determines computational efficiency. Empirical experiments on parametric PDE benchmarks confirm the strong expressive power and efficiency of the proposed architectures. Overall, this work establishes a unified theoretical and practical foundation for scalable neural operator learning across multiple operators.

[345] GPTOpt: Towards Efficient LLM-Based Black-Box Optimization

Jamison Meindl, Yunsheng Tian, Tony Cui, Veronika Thost, Zhang-Wei Hong, Jie Chen, Wojciech Matusik, Mina Konaković Luković

Main category: cs.LG

TL;DR: GPTOpt is an LLM-based optimization method that enables large language models to perform continuous black-box optimization by fine-tuning them on synthetic datasets from diverse Bayesian Optimization parameterizations, achieving better performance than traditional optimizers without requiring parameter tuning.

Details

Motivation: Traditional Bayesian Optimization methods require careful parameter tuning for each application domain, while current LLMs lack capabilities for continuous black-box optimization tasks. The goal is to leverage LLMs' generalization abilities for sample-efficient global optimization.

Method: Fine-tune large language models on extensive synthetic datasets derived from diverse Bayesian Optimization parameterizations, enabling LLMs to learn optimization strategies that generalize across different tasks.

Result: GPTOpt surpasses traditional optimizers on various black-box optimization benchmarks, demonstrating superior performance in continuous optimization tasks without parameter tuning.

Conclusion: LLMs can be effectively equipped with continuous black-box optimization capabilities through fine-tuning on synthetic BO data, providing a flexible framework for global optimization that outperforms traditional methods and eliminates the need for parameter tuning.

Abstract: Global optimization of expensive, derivative-free black-box functions demands extreme sample efficiency. Classical methods such as Bayesian Optimization (BO) can be effective, but they often require careful parameter tuning to each application domain. At the same time, Large Language Models (LLMs) have shown broad capabilities, yet state-of-the-art models remain limited in solving continuous black-box optimization tasks. We introduce GPTOpt, an LLM-based optimization method that equips LLMs with continuous black-box optimization capabilities. By fine-tuning large language models on extensive synthetic datasets derived from diverse BO parameterizations, GPTOpt leverages LLM pre-training to generalize across optimization tasks. On a variety of black-box optimization benchmarks, GPTOpt surpasses traditional optimizers, highlighting the capacity of LLMs for advanced numerical reasoning and introducing a flexible framework for global optimization without parameter tuning.

[346] Scalable Utility-Aware Multiclass Calibration

Mahmoud Hegazy, Michael I. Jordan, Aymeric Dieuleveut

Main category: cs.LG

TL;DR: The paper proposes utility calibration, a general framework for evaluating multiclass classifier calibration that measures calibration error relative to specific utility functions relevant to end users.

Details

Motivation: Existing methods for assessing multiclass calibration focus on specific prediction aspects or use computationally challenging formulations, lacking a scalable and comprehensive evaluation approach.

Method: The authors introduce utility calibration, which measures calibration error relative to user-specific utility functions, allowing unification and reinterpretation of existing calibration metrics.

Result: The framework enables more robust versions of top-class and class-wise calibration metrics, and extends beyond binarized approaches to assess calibration for richer classes of downstream utilities.

Conclusion: Utility calibration provides a scalable and general framework for multiclass calibration evaluation that can accommodate diverse user needs and decision criteria.

Abstract: Ensuring that classifiers are well-calibrated, i.e., their predictions align with observed frequencies, is a minimal and fundamental requirement for classifiers to be viewed as trustworthy. Existing methods for assessing multiclass calibration often focus on specific aspects associated with prediction (e.g., top-class confidence, class-wise calibration) or utilize computationally challenging variational formulations. In this work, we study scalable \emph{evaluation} of multiclass calibration. To this end, we propose utility calibration, a general framework that measures the calibration error relative to a specific utility function that encapsulates the goals or decision criteria relevant to the end user. We demonstrate how this framework can unify and re-interpret several existing calibration metrics, particularly allowing for more robust versions of the top-class and class-wise calibration metrics, and, going beyond such binarized approaches, toward assessing calibration for richer classes of downstream utilities.

[347] Gradient-Weight Alignment as a Train-Time Proxy for Generalization in Classification Tasks

Florian A. Hölzl, Daniel Rueckert, Georgios Kaissis

Main category: cs.LG

TL;DR: Gradient-Weight Alignment (GWA) is introduced as a validation metric that tracks generalization during training by measuring coherence between per-sample gradients and model weights, enabling early stopping detection and sample influence analysis without validation data.

Details

Motivation: Robust validation metrics are essential for detecting overfitting, monitoring training dynamics, and attributing performance to individual training samples in supervised classification.

Method: Gradient-Weight Alignment (GWA) quantifies the coherence between per-sample gradients and model weights, with effective learning corresponding to coherent alignment and misalignment indicating poor generalization.

Result: GWA accurately predicts optimal early stopping, enables principled model comparisons, and identifies influential training samples, providing a validation-set-free approach for model analysis.

Conclusion: GWA serves as an effective validation metric that tracks generalization during training and provides insights into training dynamics directly from training data, eliminating the need for separate validation sets.

Abstract: Robust validation metrics remain essential in contemporary deep learning, not only to detect overfitting and poor generalization, but also to monitor training dynamics. In the supervised classification setting, we investigate whether interactions between training data and model weights can yield such a metric that both tracks generalization during training and attributes performance to individual training samples. We introduce Gradient-Weight Alignment (GWA), quantifying the coherence between per-sample gradients and model weights. We show that effective learning corresponds to coherent alignment, while misalignment indicates deteriorating generalization. GWA is efficiently computable during training and reflects both sample-specific contributions and dataset-wide learning dynamics. Extensive experiments show that GWA accurately predicts optimal early stopping, enables principled model comparisons, and identifies influential training samples, providing a validation-set-free approach for model analysis directly from the training data.

[348] Right for the Right Reasons: Avoiding Reasoning Shortcuts via Prototypical Neurosymbolic AI

Luca Andolfi, Eleonora Giunchiglia

Main category: cs.LG

TL;DR: Proposes prototypical neurosymbolic architectures to prevent reasoning shortcuts by ensuring models learn correct concepts rather than exploiting spurious correlations, even with very limited data.

Details

Motivation: Address the problem of reasoning shortcuts in neurosymbolic AI where models learn unintended neural predicates that exploit spurious correlations to satisfy symbolic constraints.

Method: Introduce prototypical neurosymbolic architectures that train models to satisfy background knowledge while considering input similarity to labeled datapoints, leveraging prototypical learning theory.

Result: Significant improvements in learning correct concepts across synthetic tasks (MNIST-EvenOdd, Kand-Logic) and real-world high-stake tasks (BDD-OIA) with very scarce supervision.

Conclusion: Prototype grounding is an effective, annotation-efficient strategy for safe and reliable neurosymbolic learning that prevents reasoning shortcuts.

Abstract: Neurosymbolic AI is growing in popularity thanks to its ability to combine neural perception and symbolic reasoning in end-to-end trainable models. However, recent findings reveal these are prone to shortcut reasoning, i.e., to learning unindented concepts–or neural predicates–which exploit spurious correlations to satisfy the symbolic constraints. In this paper, we address reasoning shortcuts at their root cause and we introduce prototypical neurosymbolic architectures. These models are able to satisfy the symbolic constraints (be right) because they have learnt the correct basic concepts (for the right reasons) and not because of spurious correlations, even in extremely low data regimes. Leveraging the theory of prototypical learning, we demonstrate that we can effectively avoid reasoning shortcuts by training the models to satisfy the background knowledge while taking into account the similarity of the input with respect to the handful of labelled datapoints. We extensively validate our approach on the recently proposed rsbench benchmark suite in a variety of settings and tasks with very scarce supervision: we show significant improvements in learning the right concepts both in synthetic tasks (MNIST-EvenOdd and Kand-Logic) and real-world, high-stake ones (BDD-OIA). Our findings pave the way to prototype grounding as an effective, annotation-efficient strategy for safe and reliable neurosymbolic learning.

[349] TempoPFN: Synthetic Pre-training of Linear RNNs for Zero-shot Time Series Forecasting

Vladyslav Moroshan, Julien Siems, Arber Zela, Timur Carstensen, Frank Hutter

Main category: cs.LG

TL;DR: TempoPFN is a univariate time series foundation model using linear RNNs trained on synthetic data, achieving top-tier zero-shot performance on challenging benchmarks while being more efficient than existing approaches.

Details

Motivation: Foundation models for zero-shot time series forecasting face challenges in efficient long-horizon prediction and reproducibility, with existing synthetic-only approaches underperforming on benchmarks.

Method: Uses GatedDeltaProduct architecture with state-weaving for fully parallelizable training across sequence lengths, pre-trained exclusively on synthetic data from a comprehensive pipeline including stochastic differential equations, Gaussian processes, and audio synthesis.

Result: Achieves top-tier competitive performance on Gift-Eval benchmark, outperforming all existing synthetic-only approaches and surpassing most models trained on real-world data, while being more efficient than baselines.

Conclusion: Provides a reproducible foundation for future research with open-sourced data generation pipeline and training code, demonstrating that synthetic-only training can achieve competitive zero-shot performance.

Abstract: Foundation models for zero-shot time series forecasting face challenges in efficient long-horizon prediction and reproducibility, with existing synthetic-only approaches underperforming on challenging benchmarks. This paper presents TempoPFN, a univariate time series foundation model based on linear Recurrent Neural Networks (RNNs) pre-trained exclusively on synthetic data. The model uses a GatedDeltaProduct architecture with state-weaving for fully parallelizable training across sequence lengths, eliminating the need for windowing or summarization techniques while maintaining robust temporal state-tracking. Our comprehensive synthetic data pipeline unifies diverse generators, including stochastic differential equations, Gaussian processes, and audio synthesis, with novel augmentations. In zero-shot evaluations on the Gift-Eval benchmark, TempoPFN achieves top-tier competitive performance, outperforming all existing synthetic-only approaches and surpassing the vast majority of models trained on real-world data, while being more efficient than existing baselines by leveraging fully parallelizable training and inference. We open-source our complete data generation pipeline and training code, providing a reproducible foundation for future research.

[350] Support Vector Machine-Based Burnout Risk Prediction with an Interactive Interface for Organizational Use

Bruno W. G. Teodosio, Mário J. O. T. Lira, Pedro H. M. Araújo, Lucas R. C. Farias

Main category: cs.LG

TL;DR: Machine learning approach using SVM achieved highest performance (R²=0.84) in predicting employee burnout risk, with an interactive interface developed for practical application.

Details

Motivation: Burnout significantly impacts individual well-being and organizational performance, necessitating early detection methods to support mental health strategies in workplaces.

Method: Evaluated three supervised algorithms (KNN, Random Forest, SVM) using HackerEarth Employee Burnout Challenge dataset with 30-fold cross-validation and R² metric for performance evaluation.

Result: SVM outperformed other models with R²=0.84 and was statistically superior to KNN and Random Forest based on paired t-tests. An interactive Streamlit interface was developed for user-friendly predictions.

Conclusion: Machine learning shows strong potential for early burnout detection and can support data-driven mental health interventions in organizational settings.

Abstract: Burnout is a psychological syndrome marked by emotional exhaustion, depersonalization, and reduced personal accomplishment, with a significant impact on individual well-being and organizational performance. This study proposes a machine learning approach to predict burnout risk using the HackerEarth Employee Burnout Challenge dataset. Three supervised algorithms were evaluated: nearest neighbors (KNN), random forest, and support vector machine (SVM), with model performance evaluated through 30-fold cross-validation using the determination coefficient (R2). Among the models tested, SVM achieved the highest predictive performance (R2 = 0.84) and was statistically superior to KNN and Random Forest based on paired $t$-tests. To ensure practical applicability, an interactive interface was developed using Streamlit, allowing non-technical users to input data and receive burnout risk predictions. The results highlight the potential of machine learning to support early detection of burnout and promote data-driven mental health strategies in organizational settings.

[351] FaCT: Faithful Concept Traces for Explaining Neural Network Decisions

Amin Parchami-Araghi, Sukrut Rao, Jonas Fischer, Bernt Schiele

Main category: cs.LG

TL;DR: A new model with model-inherent mechanistic concept-explanations that provides faithful concept-based explanations shared across classes, with traceable contributions to logits and input visualizations.

Details

Motivation: Existing post-hoc concept-based explanation methods are not always faithful to the model and make restrictive assumptions about concepts (class-specificity, small spatial extent, alignment to human expectations).

Method: Proposed a new model with model-inherent mechanistic concept-explanations where concepts are shared across classes, their contribution to logits can be faithfully traced, and input visualizations are provided. Also introduced C²-Score metric using foundation models to evaluate concept consistency.

Result: The proposed concepts are quantitatively more consistent than prior work, users find them more interpretable, and the model maintains competitive ImageNet performance.

Conclusion: The approach provides more faithful and interpretable concept-based explanations while maintaining model performance, addressing limitations of existing post-hoc explanation methods.

Abstract: Deep networks have shown remarkable performance across a wide range of tasks, yet getting a global concept-level understanding of how they function remains a key challenge. Many post-hoc concept-based approaches have been introduced to understand their workings, yet they are not always faithful to the model. Further, they make restrictive assumptions on the concepts a model learns, such as class-specificity, small spatial extent, or alignment to human expectations. In this work, we put emphasis on the faithfulness of such concept-based explanations and propose a new model with model-inherent mechanistic concept-explanations. Our concepts are shared across classes and, from any layer, their contribution to the logit and their input-visualization can be faithfully traced. We also leverage foundation models to propose a new concept-consistency metric, C$^2$-Score, that can be used to evaluate concept-based methods. We show that, compared to prior work, our concepts are quantitatively more consistent and users find our concepts to be more interpretable, all while retaining competitive ImageNet performance.

[352] Transformers Provably Learn Directed Acyclic Graphs via Kernel-Guided Mutual Information

Yuan Cheng, Yu Huang, Zhe Xiong, Yingbin Liang, Vincent Y. F. Tan

Main category: cs.LG

TL;DR: The paper introduces a novel information-theoretic metric (KG-MI) to enable transformers to provably learn multiple parent dependencies in DAGs, achieving polynomial-time convergence and accurate graph structure recovery.

Details

Motivation: Current transformer-based models lack theoretical guarantees for learning complex dependencies in general DAGs beyond tree-like structures, due to challenges in designing training objectives that can separately learn multiple parent relationships.

Method: Proposes kernel-guided mutual information (KG-MI) based on f-divergence, combined with multi-head attention where each head models distinct parent-child dependencies using marginal transition kernels.

Result: Proves polynomial-time convergence to global optimum for single-layer multi-head transformers on K-parent DAGs, with learned attention scores accurately reflecting the ground-truth adjacency matrix when using KL divergence.

Conclusion: The proposed KG-MI objective enables transformers to provably recover underlying graph structures in general DAGs, extending theoretical guarantees beyond tree-like graphs.

Abstract: Uncovering hidden graph structures underlying real-world data is a critical challenge with broad applications across scientific domains. Recently, transformer-based models leveraging the attention mechanism have demonstrated strong empirical success in capturing complex dependencies within graphs. However, the theoretical understanding of their training dynamics has been limited to tree-like graphs, where each node depends on a single parent. Extending provable guarantees to more general directed acyclic graphs (DAGs) – which involve multiple parents per node – remains challenging, primarily due to the difficulty in designing training objectives that enable different attention heads to separately learn multiple different parent relationships. In this work, we address this problem by introducing a novel information-theoretic metric: the kernel-guided mutual information (KG-MI), based on the $f$-divergence. Our objective combines KG-MI with a multi-head attention framework, where each head is associated with a distinct marginal transition kernel to model diverse parent-child dependencies effectively. We prove that, given sequences generated by a $K$-parent DAG, training a single-layer, multi-head transformer via gradient ascent converges to the global optimum in polynomial time. Furthermore, we characterize the attention score patterns at convergence. In addition, when particularizing the $f$-divergence to the KL divergence, the learned attention scores accurately reflect the ground-truth adjacency matrix, thereby provably recovering the underlying graph structure. Experimental results validate our theoretical findings.

[353] Hybrid Quantum-Classical Recurrent Neural Networks

Wenduan Xu

Main category: cs.LG

TL;DR: A hybrid quantum-classical recurrent neural network (QRNN) where the recurrent core is a parametrized quantum circuit (PQC) in an exponentially large Hilbert space, with classical feedforward network providing nonlinear control and mid-circuit measurements for readouts.

Details

Motivation: To create a quantum recurrent architecture that combines unitary recurrence for high-capacity memory, partial observation via mid-circuit measurements, and nonlinear classical control for input-conditioned parametrization in a physically consistent framework.

Method: The QRNN uses an n-qubit PQC as the recurrent core with hidden state in C^2^n space. At each timestep, mid-circuit readouts are combined with input embedding and processed by a classical feedforward network, which parametrizes the PQC for unitary state updates. Includes projective measurements for readouts while maintaining coherent quantum memory.

Result: Evaluated on sentiment analysis, MNIST, permuted MNIST, copying memory, and language modeling with up to 14 qubits. Achieved competitive performance against strong classical baselines across sequence-learning tasks. Also demonstrated effective soft attention mechanism for machine translation in sequence-to-sequence model.

Conclusion: This is the first quantum-grounded model to achieve competitive performance against classical baselines across a broad class of sequence-learning tasks, unifying quantum memory, partial observation, and classical nonlinear control in a compact, physically consistent architecture.

Abstract: We present a hybrid quantum-classical recurrent neural network (QRNN) architecture in which the entire recurrent core is realized as a parametrized quantum circuit (PQC) controlled by a classical feedforward network. The hidden state is the quantum state of an $n$-qubit PQC, residing in an exponentially large Hilbert space $\mathbb{C}^{2^n}$. The PQC is unitary by construction, making the hidden-state evolution norm-preserving without external constraints. At each timestep, mid-circuit readouts are combined with the input embedding and processed by the feedforward network, which provides explicit classical nonlinearity. The outputs parametrize the PQC, which updates the hidden state via unitary dynamics. The QRNN is compact and physically consistent, and it unifies (i) unitary recurrence as a high-capacity memory, (ii) partial observation via mid-circuit measurements, and (iii) nonlinear classical control for input-conditioned parametrization. We evaluate the model in simulation with up to 14 qubits on sentiment analysis, MNIST, permuted MNIST, copying memory, and language modeling, adopting projective measurements as a limiting case to obtain mid-circuit readouts while maintaining a coherent recurrent quantum memory. We further devise a soft attention mechanism over the mid-circuit readouts in a sequence-to-sequence model and show its effectiveness for machine translation. To our knowledge, this is the first model (RNN or otherwise) grounded in quantum operations to achieve competitive performance against strong classical baselines across a broad class of sequence-learning tasks.

[354] Leveraging an Atmospheric Foundational Model for Subregional Sea Surface Temperature Forecasting

Víctor Medina, Giovanny A. Cuervo-Londoño, Javier Sánchez

Main category: cs.LG

TL;DR: Adapted Aurora deep learning model from atmospheric forecasting to predict sea surface temperature in the Canary Upwelling System, achieving high accuracy with low computational cost.

Details

Motivation: Traditional ocean forecasting models have high computational costs and scalability limitations, while deep learning offers more efficient alternatives for predicting oceanographic variables.

Method: Fine-tuned Aurora foundational model using high-resolution oceanographic reanalysis data with staged fine-tuning process, latitude-weighted error metrics, and hyperparameter optimization.

Result: Achieved RMSE of 0.119K and high anomaly correlation coefficients (ACC ≈ 0.997), successfully capturing large-scale SST structures but struggling with finer coastal details.

Conclusion: Demonstrated feasibility of adapting deep learning models pre-trained in different domains for ocean forecasting, with future improvements including additional variables, higher resolution, and physics-informed networks.

Abstract: The accurate prediction of oceanographic variables is crucial for understanding climate change, managing marine resources, and optimizing maritime activities. Traditional ocean forecasting relies on numerical models; however, these approaches face limitations in terms of computational cost and scalability. In this study, we adapt Aurora, a foundational deep learning model originally designed for atmospheric forecasting, to predict sea surface temperature (SST) in the Canary Upwelling System. By fine-tuning this model with high-resolution oceanographic reanalysis data, we demonstrate its ability to capture complex spatiotemporal patterns while reducing computational demands. Our methodology involves a staged fine-tuning process, incorporating latitude-weighted error metrics and optimizing hyperparameters for efficient learning. The experimental results show that the model achieves a low RMSE of 0.119K, maintaining high anomaly correlation coefficients (ACC $\approx 0.997$). The model successfully reproduces large-scale SST structures but faces challenges in capturing finer details in coastal regions. This work contributes to the field of data-driven ocean forecasting by demonstrating the feasibility of using deep learning models pre-trained in different domains for oceanic applications. Future improvements include integrating additional oceanographic variables, increasing spatial resolution, and exploring physics-informed neural networks to enhance interpretability and understanding. These advancements can improve climate modeling and ocean prediction accuracy, supporting decision-making in environmental and economic sectors.

[355] A Framework for Bounding Deterministic Risk with PAC-Bayes: Applications to Majority Votes

Benjamin Leblanc, Pascal Germain

Main category: cs.LG

TL;DR: The paper introduces a framework to extract deterministic generalization guarantees from stochastic PAC-Bayesian bounds, enabling practical deployment of single hypotheses.

Details

Motivation: Classical PAC-Bayes only provides guarantees for randomly sampled hypotheses, requiring stochastic predictions at test time, which is impractical for many real-world applications where deterministic classifiers are needed.

Method: Proposed a unified framework with a general oracle bound, derived numerical bounds, and specialized the approach to majority vote classifiers.

Result: Empirical results show the approach consistently outperforms popular baselines by up to a factor of 2 in generalization bounds for deterministic classifiers.

Conclusion: The framework successfully bridges the gap between stochastic PAC-Bayesian guarantees and practical deterministic hypothesis deployment, providing tighter generalization bounds.

Abstract: PAC-Bayes is a popular and efficient framework for obtaining generalization guarantees in situations involving uncountable hypothesis spaces. Unfortunately, in its classical formulation, it only provides guarantees on the expected risk of a randomly sampled hypothesis. This requires stochastic predictions at test time, making PAC-Bayes unusable in many practical situations where a single deterministic hypothesis must be deployed. We propose a unified framework to extract guarantees holding for a single hypothesis from stochastic PAC-Bayesian guarantees. We present a general oracle bound and derive from it a numerical bound and a specialization to majority vote. We empirically show that our approach consistently outperforms popular baselines (by up to a factor of 2) when it comes to generalization bounds on deterministic classifiers.

[356] Perturbation Bounds for Low-Rank Inverse Approximations under Noise

Phuc Tran, Nisheeth K. Vishnoi

Main category: cs.LG

TL;DR: This paper studies the spectral-norm robustness of low-rank pseudoinverses under noise, deriving sharp perturbation bounds that improve classical results by up to √n factor.

Details

Motivation: Low-rank pseudoinverses are widely used but their robustness to noise in real-world applications (sampling, sketching, quantization) remains poorly understood.

Method: Uses contour integral techniques applied to the non-entire function f(z)=1/z to analyze spectral-norm error between noisy and true low-rank inverse approximations.

Result: Derives sharp non-asymptotic perturbation bounds that scale with eigengap, spectral decay, and noise alignment with low-curvature directions. Bounds closely track true error in experiments.

Conclusion: Provides practical, spectrum-aware guarantees for low-rank inverse approximations in noisy computational environments, significantly improving over classical bounds.

Abstract: Low-rank pseudoinverses are widely used to approximate matrix inverses in scalable machine learning, optimization, and scientific computing. However, real-world matrices are often observed with noise, arising from sampling, sketching, and quantization. The spectral-norm robustness of low-rank inverse approximations remains poorly understood. We systematically study the spectral-norm error $| (\tilde{A}^{-1})_p - A_p^{-1} |$ for an $n\times n$ symmetric matrix $A$, where $A_p^{-1}$ denotes the best rank-(p) approximation of $A^{-1}$, and $\tilde{A} = A + E$ is a noisy observation. Under mild assumptions on the noise, we derive sharp non-asymptotic perturbation bounds that reveal how the error scales with the eigengap, spectral decay, and noise alignment with low-curvature directions of $A$. Our analysis introduces a novel application of contour integral techniques to the \emph{non-entire} function $f(z) = 1/z$, yielding bounds that improve over naive adaptations of classical full-inverse bounds by up to a factor of $\sqrt{n}$. Empirically, our bounds closely track the true perturbation error across a variety of real-world and synthetic matrices, while estimates based on classical results tend to significantly overpredict. These findings offer practical, spectrum-aware guarantees for low-rank inverse approximations in noisy computational environments.

[357] Generalized Sobolev IPM for Graph-Based Measures

Tam Le, Truyen Nguyen, Hideitsu Hino, Kenji Fukumizu

Main category: cs.LG

TL;DR: The paper proposes a generalization of Sobolev IPM using Orlicz geometric structure to overcome limitations of L^p geometry, introduces Musielak regularization for computational efficiency, and demonstrates superior performance in document classification and topological data analysis.

Details

Motivation: To overcome the limitation of Sobolev IPM being intrinsically bound to L^p geometric structure, which restricts its ability to incorporate alternative structural priors beyond the L^p geometry paradigm.

Method: Generalize Sobolev IPM through Orlicz geometric structure, establish connection between Orlicz-Sobolev norm and Musielak norm for regularization, and exploit graph structure to reduce the problem to univariate optimization.

Result: GSI-M is several-order faster than popular Orlicz-Wasserstein (OW) in computation and demonstrates practical advantages in comparing probability measures on graphs for document classification and topological data analysis tasks.

Conclusion: The proposed generalized Sobolev IPM with Musielak regularization provides a computationally efficient framework that accommodates diverse geometric priors beyond traditional L^p structure while maintaining strong empirical performance.

Abstract: We study the Sobolev IPM problem for measures supported on a graph metric space, where critic function is constrained to lie within the unit ball defined by Sobolev norm. While Le et al. (2025) achieved scalable computation by relating Sobolev norm to weighted $L^p$-norm, the resulting framework remains intrinsically bound to $L^p$ geometric structure, limiting its ability to incorporate alternative structural priors beyond the $L^p$ geometry paradigm. To overcome this limitation, we propose to generalize Sobolev IPM through the lens of \emph{Orlicz geometric structure}, which employs convex functions to capture nuanced geometric relationships, building upon recent advances in optimal transport theory – particularly Orlicz-Wasserstein (OW) and generalized Sobolev transport – that have proven instrumental in advancing machine learning methodologies. This generalization encompasses classical Sobolev IPM as a special case while accommodating diverse geometric priors beyond traditional $L^p$ structure. It however brings up significant computational hurdles that compound those already inherent in Sobolev IPM. To address these challenges, we establish a novel theoretical connection between Orlicz-Sobolev norm and Musielak norm which facilitates a novel regularization for the generalized Sobolev IPM (GSI). By further exploiting the underlying graph structure, we show that GSI with Musielak regularization (GSI-M) reduces to a simple \emph{univariate optimization} problem, achieving remarkably computational efficiency. Empirically, GSI-M is several-order faster than the popular OW in computation, and demonstrates its practical advantages in comparing probability measures on a given graph for document classification and several tasks in topological data analysis.

[358] Feedback Alignment Meets Low-Rank Manifolds: A Structured Recipe for Local Learning

Arani Roy, Marco P. Apolinario, Shristi Das Biswas, Kaushik Roy

Main category: cs.LG

TL;DR: Proposes a structured local learning framework using SVD decomposition to train DNNs on low-rank manifolds, achieving BP-comparable accuracy with reduced parameters and memory requirements.

Details

Motivation: Address limitations of BP (high memory/computation) and DFA (poor scalability in deep CNNs) by developing a principled local learning approach that operates efficiently on low-rank weight representations.

Method: Train layers in SVD-decomposed form with updates applied to SVD components using composite loss (cross-entropy + subspace alignment + orthogonality regularization). Construct structured feedback matrices matching SVD structure for consistent alignment.

Result: Achieves accuracy comparable to BP on CIFAR-10, CIFAR-100, and ImageNet while reducing trainable parameters. Ablation studies confirm importance of each loss term in low-rank setting.

Conclusion: Establishes local learning on low-rank manifolds as a principled and scalable alternative to full-rank gradient-based training, enabling efficient deep learning without pruning or compression.

Abstract: Training deep neural networks (DNNs) with backpropagation (BP) achieves state-of-the-art accuracy but requires global error propagation and full parameterization, leading to substantial memory and computational overhead. Direct Feedback Alignment (DFA) enables local, parallelizable updates with lower memory requirements but is limited by unstructured feedback and poor scalability in deeper architectures, specially convolutional neural networks. To address these limitations, we propose a structured local learning framework that operates directly on low-rank manifolds defined by the Singular Value Decomposition (SVD) of weight matrices. Each layer is trained in its decomposed form, with updates applied to the SVD components using a composite loss that integrates cross-entropy, subspace alignment, and orthogonality regularization. Feedback matrices are constructed to match the SVD structure, ensuring consistent alignment between forward and feedback pathways. Our method reduces the number of trainable parameters relative to the original DFA model, without relying on pruning or post hoc compression. Experiments on CIFAR-10, CIFAR-100, and ImageNet show that our method achieves accuracy comparable to that of BP. Ablation studies confirm the importance of each loss term in the low-rank setting. These results establish local learning on low-rank manifolds as a principled and scalable alternative to full-rank gradient-based training.

[359] Uncertainty Quantification for Regression: A Unified Framework based on kernel scores

Christopher Bülte, Yusuf Sale, Gitta Kutyniok, Eyke Hüllermeier

Main category: cs.LG

TL;DR: The paper introduces a family of uncertainty measures for regression tasks based on proper scoring rules, particularly kernel scores, to address the gap in uncertainty quantification literature that has been largely focused on classification.

Details

Motivation: Regression tasks in safety-critical domains require proper uncertainty quantification, but existing literature remains largely classification-focused, creating a need for regression-specific uncertainty measures.

Method: Proposes a framework using proper scoring rules with emphasis on kernel scores to design measures for total, aleatoric, and epistemic uncertainty. The behavior of measures (tail sensitivity, robustness, OOD responsiveness) is governed by kernel choice.

Result: The framework unifies existing measures and provides principled guidelines for designing new ones. Experiments show the measures are effective in downstream tasks and reveal trade-offs between robustness and out-of-distribution detection performance.

Conclusion: The proposed kernel-based uncertainty measures provide a principled approach for regression uncertainty quantification with customizable behavior through kernel selection, addressing the gap in regression-focused uncertainty literature.

Abstract: Regression tasks, notably in safety-critical domains, require proper uncertainty quantification, yet the literature remains largely classification-focused. In this light, we introduce a family of measures for total, aleatoric, and epistemic uncertainty based on proper scoring rules, with a particular emphasis on kernel scores. The framework unifies several well-known measures and provides a principled recipe for designing new ones whose behavior, such as tail sensitivity, robustness, and out-of-distribution responsiveness, is governed by the choice of kernel. We prove explicit correspondences between kernel-score characteristics and downstream behavior, yielding concrete design guidelines for task-specific measures. Extensive experiments demonstrate that these measures are effective in downstream tasks and reveal clear trade-offs among instantiations, including robustness and out-of-distribution detection performance.

[360] INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats

Mengzhao Chen, Meng Wu, Hui Jin, Zhihang Yuan, Jing Liu, Chaoyi Zhang, Yunshui Li, Jie Huang, Jin Ma, Zeyue Xue, Zhiheng Liu, Xingyan Bin, Ping Luo

Main category: cs.LG

TL;DR: This paper systematically compares floating-point (FP) and integer (INT) quantization formats for AI hardware, revealing that fine-grained INT formats like MXINT8 outperform FP counterparts in both accuracy and efficiency for 8-bit quantization, while FP has advantages for 4-bit formats.

Details

Motivation: The motivation is to provide clear guidance for algorithm and hardware co-design by conducting a unified comparison of FP and INT quantization across varying granularities, as current industry trends favor FP formats without comprehensive analysis.

Method: The authors systematically investigate trade-offs between FP and INT formats across different granularities, introduce a symmetric clipping method to resolve gradient bias in fine-grained low-bit INT training, and apply outlier-mitigation techniques like Hadamard rotation.

Result: Results show a critical performance crossover: FP excels in coarse-grained quantization, but for 8-bit fine-grained formats, MXINT8 is superior to FP in both algorithmic accuracy and hardware efficiency. For 4-bit formats, FP often has accuracy advantages, though NVINT4 can surpass NVFP4 with outlier-mitigation techniques.

Conclusion: The findings challenge the current hardware trajectory favoring FP formats, demonstrating that a one-size-fits-all FP approach is suboptimal and advocating that fine-grained INT formats, particularly MXINT8, offer better balance of accuracy, power, and efficiency for future AI accelerators.

Abstract: Modern AI hardware, such as Nvidia’s Blackwell architecture, is increasingly embracing low-precision floating-point (FP) formats to handle the pervasive activation outliers in Large Language Models (LLMs). Despite this industry trend, a unified comparison of FP and integer (INT) quantization across varying granularities has been missing, leaving algorithm and hardware co-design without clear guidance. This paper fills that gap by systematically investigating the trade-offs between FP and INT formats. We reveal a critical performance crossover: while FP excels in coarse-grained quantization, the comparison at fine-grained (block-wise) levels is more nuanced. Our comprehensive comparison demonstrates that for popular 8-bit fine-grained formats (e.g., MX with block size 32), MXINT8 is superior to its FP counterpart in both algorithmic accuracy and hardware efficiency. However, for 4-bit formats, FP (e.g., MXFP4, NVFP4) often holds an accuracy advantage , though we show that NVINT4 can surpass NVFP4 when outlier-mitigation techniques like Hadamard rotation are applied. We also introduce a symmetric clipping method that resolves gradient bias in fine-grained low-bit INT training, enabling nearly lossless performance for MXINT8 training. These findings challenge the current hardware trajectory, demonstrating that a one-size-fits-all FP approach is suboptimal and advocating that fine-grained INT formats, particularly MXINT8, offer a better balance of accuracy, power, and efficiency for future AI accelerators.

[361] BOLT-GAN: Bayes-Optimal Loss for Stable GAN Training

Mohammadreza Tavasoli Naeini, Ali Bereyhi, Morteza Noshad, Ben Liang, Alfred O. Hero III

Main category: cs.LG

TL;DR: BOLT-GAN is a modified WGAN framework inspired by Bayes Optimal Learning Threshold that achieves better training stability and outperforms WGAN on image generation benchmarks with 10-60% lower FID scores.

Details

Motivation: To improve GAN training stability and performance by incorporating the Bayes Optimal Learning Threshold principle into the WGAN framework.

Method: Modified WGAN framework with Lipschitz continuous discriminator that implicitly minimizes a different metric distance than Earth Mover distance, inspired by BOLT principle.

Result: Empirical evaluations on CIFAR-10, CelebA-64, LSUN Bedroom-64, and LSUN Church-64 show BOLT-GAN consistently outperforms WGAN with 10-60% lower Frechet Inception Distance.

Conclusion: BOLT is a broadly applicable principle for enhancing GAN training, providing better stability and performance compared to standard WGAN.

Abstract: We introduce BOLT-GAN, a simple yet effective modification of the WGAN framework inspired by the Bayes Optimal Learning Threshold (BOLT). We show that with a Lipschitz continuous discriminator, BOLT-GAN implicitly minimizes a different metric distance than the Earth Mover (Wasserstein) distance and achieves better training stability. Empirical evaluations on four standard image generation benchmarks (CIFAR-10, CelebA-64, LSUN Bedroom-64, and LSUN Church-64) show that BOLT-GAN consistently outperforms WGAN, achieving 10-60% lower Frechet Inception Distance (FID). Our results suggest that BOLT is a broadly applicable principle for enhancing GAN training.

Nikita Kachaev, Mikhail Kolosov, Daniil Zelezetsky, Alexey K. Kovalev, Aleksandr I. Panov

Main category: cs.LG

TL;DR: Vision-Language-Action (VLA) models suffer from degradation of visual representations during action fine-tuning, but a simple method can mitigate this and improve out-of-distribution generalization.

Details

Motivation: To understand how much original vision-language representations and knowledge are preserved when Vision-Language Models are adapted to action modality in VLA models.

Method: Systematic study of representation retention during VLA fine-tuning, including probing hidden representations, analyzing attention maps, and designing targeted tasks to compare VLA models with their VL counterparts.

Result: Naive action fine-tuning leads to degradation of visual representations, but the proposed method successfully mitigates this degradation and improves generalization to out-of-distribution scenarios.

Conclusion: The analysis clarifies the trade-off between action fine-tuning and VL representation degradation, and provides practical approaches to recover inherited vision-language capabilities.

Abstract: The growing success of Vision-Language-Action (VLA) models stems from the promise that pretrained Vision-Language Models (VLMs) can endow agents with transferable world knowledge and vision-language (VL) grounding, laying a foundation for action models with broader generalization. Yet when these VLMs are adapted to the action modality, it remains unclear to what extent their original VL representations and knowledge are preserved. In this work, we conduct a systematic study of representation retention during VLA fine-tuning, showing that naive action fine-tuning leads to degradation of visual representations. To characterize and measure these effects, we probe VLA’s hidden representations and analyze attention maps, further, we design a set of targeted tasks and methods that contrast VLA models with their counterpart VLMs, isolating changes in VL capabilities induced by action fine-tuning. We further evaluate a range of strategies for aligning visual representations and introduce a simple yet effective method that mitigates degradation and yields improved generalization to out-of-distribution (OOD) scenarios. Taken together, our analysis clarifies the trade-off between action fine-tuning and the degradation of VL representations and highlights practical approaches to recover inherited VL capabilities. Code is publicly available: https://blind-vla-paper.github.io

[363] Subgraph Federated Learning via Spectral Methods

Javad Aliakbari, Johan Östman, Ashkan Panahi, Alexandre Graell i Amat

Main category: cs.LG

TL;DR: FedLap is a novel federated learning framework for graph-structured data that uses Laplacian smoothing in the spectral domain to capture inter-node dependencies while ensuring privacy and scalability.

Details

Motivation: Existing federated learning approaches for graph data either require exchanging sensitive node embeddings (privacy risks) or rely on computationally-intensive steps (scalability issues), especially for interconnected subgraphs across multiple clients.

Method: FedLap leverages global structure information via Laplacian smoothing in the spectral domain to effectively capture inter-node dependencies while preserving privacy.

Result: Extensive experiments on benchmark datasets show FedLap achieves competitive or superior utility compared to existing techniques, and it provides strong privacy guarantees.

Conclusion: FedLap is the first subgraph federated learning scheme with strong privacy guarantees that effectively addresses both privacy and scalability challenges in graph-structured federated learning.

Abstract: We consider the problem of federated learning (FL) with graph-structured data distributed across multiple clients. In particular, we address the prevalent scenario of interconnected subgraphs, where interconnections between clients significantly influence the learning process. Existing approaches suffer from critical limitations, either requiring the exchange of sensitive node embeddings, thereby posing privacy risks, or relying on computationally-intensive steps, which hinders scalability. To tackle these challenges, we propose FedLap, a novel framework that leverages global structure information via Laplacian smoothing in the spectral domain to effectively capture inter-node dependencies while ensuring privacy and scalability. We provide a formal analysis of the privacy of FedLap, demonstrating that it preserves privacy. Notably, FedLap is the first subgraph FL scheme with strong privacy guarantees. Extensive experiments on benchmark datasets demonstrate that FedLap achieves competitive or superior utility compared to existing techniques.

[364] Spectral Perturbation Bounds for Low-Rank Approximation with Applications to Privacy

Phuc Tran, Nisheeth K. Vishnoi, Van H. Vu

Main category: cs.LG

TL;DR: New spectral-norm perturbation bounds for symmetric matrices that improve on classical Eckart-Young-Mirsky theorem, with applications to differentially private PCA and up to √n factor improvements.

Details

Motivation: Understanding how noise affects low-rank approximations in spectral norm is crucial for differentially private machine learning, as spectral norm captures worst-case directional error and provides strongest utility guarantees.

Method: Developed novel contour bootstrapping method from complex analysis to establish high-probability spectral-norm perturbation bounds, extending to spectral functionals like polynomials and matrix exponentials.

Result: Derived sharp estimates for rank-p approximation errors under mild eigengap and norm conditions, with improvements up to √n factor over classical bounds, and applied to differentially private PCA.

Conclusion: The new perturbation bounds provide improved utility guarantees for differentially private PCA and closely track actual spectral error in empirical evaluations on real-world datasets.

Abstract: A central challenge in machine learning is to understand how noise or measurement errors affect low-rank approximations, particularly in the spectral norm. This question is especially important in differentially private low-rank approximation, where one aims to preserve the top-$p$ structure of a data-derived matrix while ensuring privacy. Prior work often analyzes Frobenius norm error or changes in reconstruction quality, but these metrics can over- or under-estimate true subspace distortion. The spectral norm, by contrast, captures worst-case directional error and provides the strongest utility guarantees. We establish new high-probability spectral-norm perturbation bounds for symmetric matrices that refine the classical Eckart–Young–Mirsky theorem and explicitly capture interactions between a matrix $A \in \mathbb{R}^{n \times n}$ and an arbitrary symmetric perturbation $E$. Under mild eigengap and norm conditions, our bounds yield sharp estimates for $|(A + E)_p - A_p|$, where $A_p$ is the best rank-$p$ approximation of $A$, with improvements of up to a factor of $\sqrt{n}$. As an application, we derive improved utility guarantees for differentially private PCA, resolving an open problem in the literature. Our analysis relies on a novel contour bootstrapping method from complex analysis and extends it to a broad class of spectral functionals, including polynomials and matrix exponentials. Empirical results on real-world datasets confirm that our bounds closely track the actual spectral error under diverse perturbation regimes.

[365] Mechanistic Interpretability of RNNs emulating Hidden Markov Models

Elia Torre, Michele Viscione, Lucas Pompe, Benjamin F Grewe, Valerio Mante

Main category: cs.LG

TL;DR: RNNs can implement Hidden Markov Model-like discrete state transitions through noise-sustained dynamics along closed orbits, with specialized ‘kick neurons’ driving probabilistic transitions between regions of slow dynamics.

Details

Motivation: To understand how RNNs can generate the richer, spontaneous, and stochastic behaviors observed in natural settings, which appear at odds with their continuous state spaces, and to uncover the mechanisms that enable RNNs to replicate HMM emission statistics.

Method: Train RNNs to replicate HMM emission statistics, then reverse-engineer the trained networks to analyze their dynamics, connectivity patterns, and the emergence of specialized neuron types that drive state transitions.

Result: Trained RNNs develop noise-sustained dynamics along closed orbits, with transitions governed by slow noise-driven dynamics connected by fast deterministic transitions. Networks develop highly structured connectivity with ‘kick neurons’ that initiate transitions, operating in a regime of stochastic resonance.

Conclusion: RNNs can emulate complex discrete latent dynamics through a compositional principle involving modular reuse of the same dynamical motif, enabling them to perform probabilistic computations despite their continuous state space nature.

Abstract: Recurrent neural networks (RNNs) provide a powerful approach in neuroscience to infer latent dynamics in neural populations and to generate hypotheses about the neural computations underlying behavior. However, past work has focused on relatively simple, input-driven, and largely deterministic behaviors - little is known about the mechanisms that would allow RNNs to generate the richer, spontaneous, and potentially stochastic behaviors observed in natural settings. Modeling with Hidden Markov Models (HMMs) has revealed a segmentation of natural behaviors into discrete latent states with stochastic transitions between them, a type of dynamics that may appear at odds with the continuous state spaces implemented by RNNs. Here we first show that RNNs can replicate HMM emission statistics and then reverse-engineer the trained networks to uncover the mechanisms they implement. In the absence of inputs, the activity of trained RNNs collapses towards a single fixed point. When driven by stochastic input, trajectories instead exhibit noise-sustained dynamics along closed orbits. Rotation along these orbits modulates the emission probabilities and is governed by transitions between regions of slow, noise-driven dynamics connected by fast, deterministic transitions. The trained RNNs develop highly structured connectivity, with a small set of “kick neurons” initiating transitions between these regions. This mechanism emerges during training as the network shifts into a regime of stochastic resonance, enabling it to perform probabilistic computations. Analyses across multiple HMM architectures

fully connected, cyclic, and linear-chain - reveal that this solution generalizes through the modular reuse of the same dynamical motif, suggesting a compositional principle by which RNNs can emulate complex discrete latent dynamics.

[366] Graph Network-based Structural Simulator: Graph Neural Networks for Structural Dynamics

Alessandro Lucchetti, Francesco Cadini, Marco Giglio, Luca Lomazzi

Main category: cs.LG

TL;DR: GNSS is a Graph Neural Network framework for surrogate modeling of dynamic structural problems, addressing gaps in existing GNN applications for structural dynamics.

Details

Motivation: Little attention has been given to applying GNNs as surrogate models for structural problems, especially dynamic cases, despite their success in computational fluid dynamics.

Method: GNSS follows encode-process-decode paradigm with three key features: node kinematics in local frames, sign-aware regression loss, and wavelength-informed connectivity radius.

Result: GNSS accurately reproduces physics over hundreds of timesteps, generalizes to unseen loading conditions, and achieves substantial inference speedups compared to explicit finite element methods.

Conclusion: Locality-preserving GNNs with physics-consistent update rules are a competitive alternative for dynamic, wave-dominated structural simulations.

Abstract: Graph Neural Networks (GNNs) have recently been explored as surrogate models for numerical simulations. While their applications in computational fluid dynamics have been investigated, little attention has been given to structural problems, especially for dynamic cases. To address this gap, we introduce the Graph Network-based Structural Simulator (GNSS), a GNN framework for surrogate modeling of dynamic structural problems. GNSS follows the encode-process-decode paradigm typical of GNN-based machine learning models, and its design makes it particularly suited for dynamic simulations thanks to three key features: (i) expressing node kinematics in node-fixed local frames, which avoids catastrophic cancellation in finite-difference velocities; (ii) employing a sign-aware regression loss, which reduces phase errors in long rollouts; and (iii) using a wavelength-informed connectivity radius, which optimizes graph construction. We evaluate GNSS on a case study involving a beam excited by a 50kHz Hanning-modulated pulse. The results show that GNSS accurately reproduces the physics of the problem over hundreds of timesteps and generalizes to unseen loading conditions, where existing GNNs fail to converge or deliver meaningful predictions. Compared with explicit finite element baselines, GNSS achieves substantial inference speedups while preserving spatial and temporal fidelity. These findings demonstrate that locality-preserving GNNs with physics-consistent update rules are a competitive alternative for dynamic, wave-dominated structural simulations.

[367] Convolutional Spiking-based GRU Cell for Spatio-temporal Data

Yesmine Abdennadher, Eleonora Cicciarella, Michele Rossi

Main category: cs.LG

TL;DR: CS-GRU combines spiking neural networks with convolutional GRUs to better capture local dependencies in temporal and spatio-temporal data, outperforming existing methods by 4.35% on average.

Details

Motivation: Traditional RNNs lose local details in long sequences, and previous SNN-GRU approaches like SpikGRU fail to capture fine-grained local dependencies in event-based spatio-temporal data.

Method: Introduces Convolutional Spiking GRU (CS-GRU) cell that uses convolutional operations to preserve local structure while integrating spiking neurons’ temporal precision with GRU gating mechanisms.

Result: Outperforms state-of-the-art GRU variants by average 4.35%, achieves over 90% accuracy on sequential tasks, 99.31% on MNIST, and 69% higher efficiency than SpikGRU.

Conclusion: CS-GRU provides a versatile architecture that excels on both temporal and spatio-temporal benchmarks, effectively addressing local dependency capture while maintaining high efficiency.

Abstract: Spike-based temporal messaging enables SNNs to efficiently process both purely temporal and spatio-temporal time-series or event-driven data. Combining SNNs with Gated Recurrent Units (GRUs), a variant of recurrent neural networks, gives rise to a robust framework for sequential data processing; however, traditional RNNs often lose local details when handling long sequences. Previous approaches, such as SpikGRU, fail to capture fine-grained local dependencies in event-based spatio-temporal data. In this paper, we introduce the Convolutional Spiking GRU (CS-GRU) cell, which leverages convolutional operations to preserve local structure and dependencies while integrating the temporal precision of spiking neurons with the efficient gating mechanisms of GRUs. This versatile architecture excels on both temporal datasets (NTIDIGITS, SHD) and spatio-temporal benchmarks (MNIST, DVSGesture, CIFAR10DVS). Our experiments show that CS-GRU outperforms state-of-the-art GRU variants by an average of 4.35%, achieving over 90% accuracy on sequential tasks and up to 99.31% on MNIST. It is worth noting that our solution achieves 69% higher efficiency compared to SpikGRU. The code is available at: https://github.com/YesmineAbdennadher/CS-GRU.

[368] LieSolver: A PDE-constrained solver for IBVPs using Lie symmetries

René P. Klausen, Ivan Timofeev, Johannes Frank, Jonas Naujoks, Thomas Wiegand, Sebastian Lapuschkin, Wojciech Samek

Main category: cs.LG

TL;DR: LieSolver uses Lie symmetries to exactly enforce PDEs in IBVPs, leading to faster and more accurate solutions than PINNs with improved convergence and error estimation.

Details

Motivation: To develop a more efficient and reliable method for solving initial-boundary value problems that inherently incorporates physical laws through symmetry transformations.

Method: Uses Lie symmetries to enforce PDEs exactly by construction, learning solutions from initial and boundary data while enabling rigorous error estimation for well-posed problems.

Result: LieSolver is faster and more accurate than physics-informed neural networks (PINNs) for linear homogeneous PDEs with various initial conditions, yielding compact models and efficient optimization.

Conclusion: The method improves both computational efficiency and prediction reliability for PDE-constrained problems through exact PDE enforcement via symmetry transformations.

Abstract: We introduce a method for efficiently solving initial-boundary value problems (IBVPs) that uses Lie symmetries to enforce the associated partial differential equation (PDE) exactly by construction. By leveraging symmetry transformations, the model inherently incorporates the physical laws and learns solutions from initial and boundary data. As a result, the loss directly measures the model’s accuracy, leading to improved convergence. Moreover, for well-posed IBVPs, our method enables rigorous error estimation. The approach yields compact models, facilitating an efficient optimization. We implement LieSolver and demonstrate its application to linear homogeneous PDEs with a range of initial conditions, showing that it is faster and more accurate than physics-informed neural networks (PINNs). Overall, our method improves both computational efficiency and the reliability of predictions for PDE-constrained problems.

[369] MLPrE – A tool for preprocessing and exploratory data analysis prior to machine learning model construction

David S Maxwell, Michael Darkoh, Sidharth R Samudrala, Caroline Chung, Stephanie T Schmidt, Bissan Al-Lazikani

Main category: cs.LG

TL;DR: MLPrE is a scalable preprocessing tool for machine learning that uses SparkDataFrames and JSON configuration to handle diverse data formats, featuring 69 processing stages for filtering, statistics, feature engineering, and exploratory data analysis.

Details

Motivation: The growing demand for AI and Deep Learning requires efficient data preprocessing tools that can handle multiple data formats and scale with data size, overcoming limitations of existing workflows in larger processing pipelines like Apache Airflow.

Method: Utilizes SparkDataFrames for scalable data processing with a generalizable JSON input format to describe stepwise changes. Implements 69 stages covering input/output, filtering, statistics, feature engineering, and exploratory data analysis.

Result: Successfully demonstrated on six diverse datasets, including processing multiple fields in flat files and preparing data for graph databases. Showed capability for independent field processing and recombination, and clustering with wine quality data.

Conclusion: MLPrE provides a generalizable, scalable tool for preprocessing and early data analysis that accelerates and simplifies development in larger machine learning workflows, addressing a critical need in the expanding ML field.

Abstract: With the recent growth of Deep Learning for AI, there is a need for tools to meet the demand of data flowing into those models. In some cases, source data may exist in multiple formats, and therefore the source data must be investigated and properly engineered for a Machine Learning model or graph database. Overhead and lack of scalability with existing workflows limit integration within a larger processing pipeline such as Apache Airflow, driving the need for a robust, extensible, and lightweight tool to preprocess arbitrary datasets that scales with data type and size. To address this, we present Machine Learning Preprocessing and Exploratory Data Analysis, MLPrE, in which SparkDataFrames were utilized to hold data during processing and ensure scalability. A generalizable JSON input file format was utilized to describe stepwise changes to that DataFrame. Stages were implemented for input and output, filtering, basic statistics, feature engineering, and exploratory data analysis. A total of 69 stages were implemented into MLPrE, of which we highlight and demonstrate key stages using six diverse datasets. We further highlight MLPrE’s ability to independently process multiple fields in flat files and recombine them, otherwise requiring an additional pipeline, using a UniProt glossary term dataset. Building on this advantage, we demonstrated the clustering stage with available wine quality data. Lastly, we demonstrate the preparation of data for a graph database in the final stages of MLPrE using phosphosite kinase data. Overall, our MLPrE tool offers a generalizable and scalable tool for preprocessing and early data analysis, filling a critical need for such a tool given the ever expanding use of machine learning. This tool serves to accelerate and simplify early stage development in larger workflows.

[370] Synthetic Data Reveals Generalization Gaps in Correlated Multiple Instance Learning

Ethan Harvey, Dennis Johan Loevlie, Michael C. Hughes

Main category: cs.LG

TL;DR: The paper shows that standard multiple instance learning (MIL) methods fail to capture contextual relationships between adjacent instances (patches/slices), and even newer correlated MIL approaches struggle to achieve optimal performance on synthetic tasks where context is crucial.

Details

Motivation: Conventional MIL approaches treat instances separately, ignoring contextual relationships between nearby patches or slices that are essential in real applications like medical imaging.

Method: Designed a synthetic classification task where accounting for adjacent instance features is crucial, and compared off-the-shelf MIL approaches against the optimal Bayes estimator (available in closed-form). Also evaluated newer correlated MIL methods.

Result: Standard MIL methods show limitations compared to the optimal Bayes estimator. Even newer correlated MIL methods struggle to generalize optimally when trained from scratch on tens of thousands of instances.

Conclusion: Current MIL approaches, including newer correlated methods, fail to effectively capture contextual relationships between instances, highlighting the need for improved methods that can better leverage spatial/temporal dependencies in instance-based learning.

Abstract: Multiple instance learning (MIL) is often used in medical imaging to classify high-resolution 2D images by processing patches or classify 3D volumes by processing slices. However, conventional MIL approaches treat instances separately, ignoring contextual relationships such as the appearance of nearby patches or slices that can be essential in real applications. We design a synthetic classification task where accounting for adjacent instance features is crucial for accurate prediction. We demonstrate the limitations of off-the-shelf MIL approaches by quantifying their performance compared to the optimal Bayes estimator for this task, which is available in closed-form. We empirically show that newer correlated MIL methods still struggle to generalize as well as possible when trained from scratch on tens of thousands of instances.

[371] Neural Stochastic Flows: Solver-Free Modelling and Inference for SDE Solutions

Naoki Kiyohara, Edward Johns, Yingzhen Li

Main category: cs.LG

TL;DR: Neural Stochastic Flows (NSFs) learn SDE transition laws using conditional normalizing flows, enabling efficient one-shot sampling between arbitrary time points with significant speed-ups.

Details

Motivation: Traditional SDE modeling requires costly numerical solvers for sampling between arbitrary time points, which is inefficient for noisy and irregularly sampled time series common in finance, physics, and machine learning.

Method: Introduce Neural Stochastic Flows (NSFs) and their latent variants that directly learn SDE transition laws using conditional normalizing flows with architectural constraints preserving stochastic flow properties.

Result: NSFs achieve up to two orders of magnitude speed-ups at large time gaps while maintaining distributional accuracy comparable to numerical approaches, demonstrated on synthetic SDE simulations and real-world tracking/video data.

Conclusion: NSFs provide an efficient alternative to traditional numerical solvers for SDE modeling, enabling fast arbitrary time-point sampling without sacrificing accuracy.

Abstract: Stochastic differential equations (SDEs) are well suited to modelling noisy and irregularly sampled time series found in finance, physics, and machine learning. Traditional approaches require costly numerical solvers to sample between arbitrary time points. We introduce Neural Stochastic Flows (NSFs) and their latent variants, which directly learn (latent) SDE transition laws using conditional normalising flows with architectural constraints that preserve properties inherited from stochastic flows. This enables one-shot sampling between arbitrary states and yields up to two orders of magnitude speed-ups at large time gaps. Experiments on synthetic SDE simulations and on real-world tracking and video data show that NSFs maintain distributional accuracy comparable to numerical approaches while dramatically reducing computation for arbitrary time-point sampling.

[372] Expand and Compress: Exploring Tuning Principles for Continual Spatio-Temporal Graph Forecasting

Wei Chen, Yuxuan Liang

Main category: cs.LG

TL;DR: The paper proposes a prompt tuning-based continuous forecasting method for spatio-temporal data streams that addresses inefficiency of retraining and catastrophic forgetting through expand-and-compress principles.

Details

Motivation: Real-world spatio-temporal data arrives in streaming manner with continuously expanding sensor networks, creating challenges in model retraining efficiency and catastrophic forgetting of historical data.

Method: Integrates base spatio-temporal graph neural network with continuous prompt pool using stored prompts in memory, following expand-and-compress tuning principles with lightweight parameters.

Result: Extensive experiments on real-world datasets show superior performance over state-of-the-art baselines in effectiveness, efficiency, and universality.

Conclusion: The proposed prompt tuning-based continuous forecasting method successfully addresses streaming spatio-temporal forecasting challenges with lightweight parameters and multi-faceted advantages.

Abstract: The widespread deployment of sensing devices leads to a surge in data for spatio-temporal forecasting applications such as traffic flow, air quality, and wind energy. Although spatio-temporal graph neural networks have achieved success in modeling various static spatio-temporal forecasting scenarios, real-world spatio-temporal data are typically received in a streaming manner, and the network continuously expands with the installation of new sensors. Thus, spatio-temporal forecasting in streaming scenarios faces dual challenges: the inefficiency of retraining models over newly arrived data and the detrimental effects of catastrophic forgetting over long-term history. To address these challenges, we propose a novel prompt tuning-based continuous forecasting method, following two fundamental tuning principles guided by empirical and theoretical analysis: expand and compress, which effectively resolve the aforementioned problems with lightweight tuning parameters. Specifically, we integrate the base spatio-temporal graph neural network with a continuous prompt pool, utilizing stored prompts (i.e., few learnable parameters) in memory, and jointly optimize them with the base spatio-temporal graph neural network. This method ensures that the model sequentially learns from the spatio-temporal data stream to accomplish tasks for corresponding periods. Extensive experimental results on multiple real-world datasets demonstrate the multi-faceted superiority of our method over the state-of-the-art baselines, including effectiveness, efficiency, universality, etc.

[373] Revisiting Service Level Objectives and System Level Metrics in Large Language Model Serving

Zhibin Wang, Shipeng Li, Yuhang Zhou, Xue Li, Zhonghui Zhang, Nguyen Cam-Tu, Rong Gu, Chen Tian, Guihai Chen, Sheng Zhong

Main category: cs.LG

TL;DR: The paper proposes a new metric framework called smooth goodput that better aligns with user experience in LLM serving by addressing issues in existing SLO and SLM metrics.

Details

Motivation: Existing metrics for LLM serving have counterintuitive behaviors where delaying token delivery can improve SLOs and abandoning requests can improve SLMs, which don't align with actual user experience.

Method: The authors revisit SLOs and SLMs, propose a new SLO that aligns with user experience, and create a comprehensive metric framework called smooth goodput that integrates both SLOs and SLMs.

Result: Evaluation shows the smooth goodput framework provides a more comprehensive view of token delivery and request processing, effectively capturing the optimal balance between user experience and system performance across different serving strategies.

Conclusion: The proposed smooth goodput metric framework offers a unified approach to evaluate LLM serving systems that better reflects the nature of user experience compared to existing metrics.

Abstract: User experience is a critical factor Large Language Model (LLM) serving systems must consider, where service level objectives (SLOs) considering the experience of individual requests and system level metrics (SLMs) considering the overall system performance are two key performance measures. However, we observe two notable issues in existing metrics: 1) manually delaying the delivery of some tokens can improve SLOs, and 2) actively abandoning requests that do not meet SLOs can improve SLMs, both of which are counterintuitive. In this paper, we revisit SLOs and SLMs in LLM serving, and propose a new SLO that aligns with user experience. Based on the SLO, we propose a comprehensive metric framework called smooth goodput, which integrates SLOs and SLMs to reflect the nature of user experience in LLM serving. Through this unified framework, we reassess the performance of different LLM serving systems under multiple workloads. Evaluation results show that our metric framework provides a more comprehensive view of token delivery and request processing, and effectively captures the optimal point of user experience and system performance with different serving strategies.

[374] Meta-Learning Objectives for Preference Optimization

Carlo Alfano, Silvia Sapora, Jakob Nicolaus Foerster, Patrick Rebeschini, Yee Whye Teh

Main category: cs.LG

TL;DR: The paper proposes using MuJoCo tasks as a cheaper diagnostic benchmark for evaluating preference optimization algorithms, introduces Mirror Preference Optimization (MPO) using mirror descent, and discovers specialized algorithms through evolutionary strategies that outperform existing methods in both MuJoCo and LLM alignment tasks.

Details

Motivation: Evaluating preference optimization algorithms on LLM alignment is expensive, noisy, and involves many variables like model size and hyperparameters, making systematic analysis difficult.

Method: Designed MuJoCo diagnostic suite for controlled evaluation, proposed Mirror Preference Optimization (MPO) based on mirror descent, used evolutionary strategies to discover algorithms specialized for different dataset properties (mixed-quality, noisy data).

Result: Discovered PO algorithms outperform all known algorithms in targeted MuJoCo settings, and insights from MuJoCo experiments led to a PO algorithm that significantly outperforms existing baselines in LLM alignment tasks.

Conclusion: Simpler benchmarks like MuJoCo can provide valuable insights for preference optimization algorithm development, and the proposed MPO framework with evolutionary search can discover effective algorithms that transfer well to complex tasks like LLM alignment.

Abstract: Evaluating preference optimization (PO) algorithms on LLM alignment is a challenging task that presents prohibitive costs, noise, and several variables like model size and hyper-parameters. In this work, we show that it is possible to gain insights on the efficacy of PO algorithm on simpler benchmarks. We design a diagnostic suite of MuJoCo tasks and datasets, which we use to systematically evaluate PO algorithms, establishing a more controlled and cheaper benchmark. We then propose a novel family of PO algorithms based on mirror descent, which we call Mirror Preference Optimization (MPO). Through evolutionary strategies, we search this class to discover algorithms specialized to specific properties of preference datasets, such as mixed-quality or noisy data. We demonstrate that our discovered PO algorithms outperform all known algorithms in the targeted MuJoCo settings. Finally, based on the insights gained from our MuJoCo experiments, we design a PO algorithm that significantly outperform existing baselines in an LLM alignment task.

[375] Non-Markovian Discrete Diffusion with Causal Language Models

Yangtian Zhang, Sizhuang He, Daniel Levine, Lawrence Zhao, David Zhang, Syed A Rizvi, Shiyang Zhang, Emanuele Zappala, Rex Ying, David van Dijk

Main category: cs.LG

TL;DR: CaDDi is a causal discrete diffusion model that lifts the Markov constraint by conditioning on the entire generative trajectory, allowing error correction and unifying causal and diffusion reasoning in a single transformer architecture.

Details

Motivation: Discrete diffusion models lag behind causal language models in expressive power due to their Markovian assumption, which restricts conditioning to current state only and leads to uncorrectable error accumulation.

Method: Introduces CaDDi, a non-Markovian discrete diffusion model that conditions on the entire generative trajectory, uses a unified transformer architecture that treats standard causal language models as a special case, and allows direct reuse of pretrained LLM weights.

Result: Outperforms state-of-the-art discrete diffusion baselines on natural-language benchmarks and substantially narrows the gap to large autoregressive transformers.

Conclusion: CaDDi successfully addresses limitations of traditional discrete diffusion models by lifting Markov constraints and enabling trajectory-based conditioning, while maintaining compatibility with existing LLM architectures.

Abstract: Discrete diffusion models offer a flexible, controllable approach to structured sequence generation, yet they still lag behind causal language models in expressive power. A key limitation lies in their reliance on the Markovian assumption, which restricts each step to condition only on the current state, leading to potential uncorrectable error accumulation. In this paper, we introduce CaDDi (Causal Discrete Diffusion Model), a discrete diffusion model that conditions on the entire generative trajectory, thereby lifting the Markov constraint and allowing the model to revisit and improve past states. By unifying sequential (causal) and temporal (diffusion) reasoning in a single non-Markovian transformer, CaDDi also treats standard causal language models as a special case and permits the direct reuse of pretrained LLM weights with no architectural changes. Empirically, CaDDi outperforms state-of-the-art discrete diffusion baselines on natural-language benchmarks, substantially narrowing the remaining gap to large autoregressive transformers.

[376] LaM-SLidE: Latent Space Modeling of Spatial Dynamical Systems via Linked Entities

Florian Sestak, Artur Toshev, Andreas Fürst, Günter Klambauer, Andreas Mayr, Johannes Brandstetter

Main category: cs.LG

TL;DR: LaM-SLidE introduces identifier representations (IDs) to enable traceable latent space modeling of spatial dynamical systems while leveraging efficiency from image/video generation methods.

Details

Motivation: Current generative models struggle with dynamical systems involving entity interactions, connectivity patterns, and entity conservation, which require traceability of individual entities over time.

Method: Uses identifier representations (IDs) to retrieve entity properties and composition from latent system representations, enabling traceability while leveraging pre-trained encoders/decoders from image/video generation.

Result: LaM-SLidE performs favorably across domains in terms of speed, accuracy, and generalizability compared to existing approaches.

Conclusion: The approach successfully bridges the gap between entity traceability and efficient latent space modeling for spatial dynamical systems.

Abstract: Generative models are spearheading recent progress in deep learning, showcasing strong promise for trajectory sampling in dynamical systems as well. However, whereas latent space modeling paradigms have transformed image and video generation, similar approaches are more difficult for most dynamical systems. Such systems – from chemical molecule structures to collective human behavior – are described by interactions of entities, making them inherently linked to connectivity patterns, entity conservation, and the traceability of entities over time. Our approach, LaM-SLidE (Latent Space Modeling of Spatial Dynamical Systems via Linked Entities), bridges the gap between: (1) keeping the traceability of individual entities in a latent system representation, and (2) leveraging the efficiency and scalability of recent advances in image and video generation, where pre-trained encoder and decoder enable generative modeling directly in latent space. The core idea of LaM-SLidE is the introduction of identifier representations (IDs) that enable the retrieval of entity properties and entity composition from latent system representations, thus fostering traceability. Experimentally, across different domains, we show that LaM-SLidE performs favorably in terms of speed, accuracy, and generalizability. Code is available at https://github.com/ml-jku/LaM-SLidE .

[377] Pass@K Policy Optimization: Solving Harder Reinforcement Learning Problems

Christian Walder, Deep Karkhanis

Main category: cs.LG

TL;DR: PKPO is a reinforcement learning method that transforms rewards to directly optimize pass@k performance, enabling better exploration and solving harder problems by prioritizing joint utility of sample sets rather than individual samples.

Details

Motivation: Traditional RL algorithms optimize for pass@1 performance, which under-utilizes sampling capacity and limits exploration on harder examples by prioritizing isolated sample strength over diversity and collective utility.

Method: Proposes Pass-at-k Policy Optimization (PKPO) with novel low-variance unbiased estimators for pass@k and its gradient in both binary and continuous reward settings. The method transforms final rewards to optimize for sets of samples that maximize reward when considered jointly.

Result: PKPO effectively optimizes for target k values, enables solving more and harder problems, and boosts both pass@1 and pass@k performance. On challenging tasks where conventional pass@1 optimization stalls, PKPO unblocks learning through better exploration.

Conclusion: PKPO provides the first robust optimization of pass@k for any k ≤ n, allowing annealing during training to achieve strong pass@1 performance alongside significant pass@k gains, overcoming limitations of traditional RL approaches.

Abstract: Reinforcement Learning (RL) algorithms sample multiple n>1 solution attempts for each problem and reward them independently. This optimizes for pass@1 performance and prioritizes the strength of isolated samples at the expense of the diversity and collective utility of sets of samples. This under-utilizes the sampling capacity, limiting exploration and eventual improvement on harder examples. As a fix, we propose Pass-at-k Policy Optimization (PKPO), a transformation on the final rewards which leads to direct optimization of pass@k performance, thus optimizing for sets of samples that maximize reward when considered jointly. Our contribution is to derive novel low variance unbiased estimators for pass@k and its gradient, in both the binary and continuous reward settings. We show optimization with our estimators reduces to standard RL with rewards that have been jointly transformed by a stable and efficient transformation function. While previous efforts are restricted to k=n, ours is the first to enable robust optimization of pass@k for any arbitrary k <= n. Moreover, instead of trading off pass@1 performance for pass@k gains, our method allows annealing k during training, optimizing both metrics and often achieving strong pass@1 numbers alongside significant pass@k gains. We validate our reward transformations on toy experiments, which reveal the variance reducing properties of our formulations. We also include real-world examples using the open-source LLM, GEMMA-2. We find that our transformation effectively optimizes for the target k. Furthermore, higher k values enable solving more and harder problems, while annealing k boosts both the pass@1 and pass@k . Crucially, for challenging task sets where conventional pass@1 optimization stalls, our pass@k approach unblocks learning, likely due to better exploration by prioritizing joint utility over the utility of individual samples.

[378] Reinforcement Learning Teachers of Test Time Scaling

Edoardo Cetin, Tianyu Zhao, Yujin Tang

Main category: cs.LG

TL;DR: Introduces Reinforcement-Learned Teachers (RLTs) that generate detailed explanations for distillation instead of solving problems directly, achieving better performance than traditional RL approaches.

Details

Motivation: Addresses RL's exploration challenge in reasoning LMs and focuses on creating effective teachers for distillation rather than deployable models.

Method: Train RLTs with dense rewards based on student understanding, using question-solution pairs to generate detailed explanations for connecting concepts.

Result: 7B RLT outperforms larger LMs in distillation pipelines on competition and graduate-level tasks, maintains effectiveness with larger students and out-of-distribution tasks.

Conclusion: RLTs unlock new efficiency and re-usability levels for RL reasoning frameworks by focusing on effective teaching rather than direct problem-solving.

Abstract: Training reasoning language models (LMs) with reinforcement learning (RL) for one-hot correctness inherently relies on the LM being able to explore and solve its task with some chance at initialization. Furthermore, a key use case of reasoning LMs is to act as teachers for distilling new students and cold-starting future RL iterations rather than being deployed themselves. From these considerations, we introduce a new framework that avoids RL’s exploration challenge by training a new class of Reinforcement-Learned Teachers (RLTs) focused on yielding the most effective downstream distillation. RLTs are prompted with both the question and solution to each problem, and tasked to simply “connect-the-dots” with detailed explanations tailored for their students. We train RLTs with dense rewards obtained by feeding each explanation to the student and testing its understanding of the problem’s solution. In practice, the raw outputs of a 7B RLT provide higher final performance on competition and graduate-level tasks than existing distillation and cold-starting pipelines that collect and postprocess the reasoning traces of orders of magnitude larger LMs. Furthermore, RLTs maintain their effectiveness when training larger students and when applied zero-shot to out-of-distribution tasks, unlocking new levels of efficiency and re-usability for the RL reasoning framework. Code available at: https://github.com/SakanaAI/RLT

[379] Score-Aware Policy-Gradient and Performance Guarantees using Local Lyapunov Stability

Céline Comte, Matthieu Jonckheere, Jaron Sanders, Albert Senen-Cerda

Main category: cs.LG

TL;DR: A policy-gradient method for model-based RL that uses stationary distributions from MDPs in stochastic systems, introducing score-aware gradient estimators (SAGEs) that avoid value-function estimation.

Details

Motivation: To improve policy gradient methods for average-reward RL by exploiting stationary distributions that belong to exponential families parametrized by policy parameters, particularly in stochastic networks and queueing systems.

Method: Introduces score-aware gradient estimators (SAGEs) that enable policy gradient estimation without value-function estimation, using stationary distributions from MDPs that form exponential families.

Result: SAGE-based policy-gradient locally converges with provable regret bounds, even with countable state spaces and unstable policies. Numerical comparisons show SAGE finds near-optimal policies faster than actor-critic methods.

Conclusion: SAGE provides an efficient alternative to actor-critic methods for average-reward RL in systems with exponential family stationary distributions, offering faster convergence to optimal policies.

Abstract: In this paper, we introduce a policy-gradient method for model-based reinforcement learning (RL) that exploits a type of stationary distributions commonly obtained from Markov decision processes (MDPs) in stochastic networks, queueing systems, and statistical mechanics. Specifically, when the stationary distribution of the MDP belongs to an exponential family that is parametrized by policy parameters, we can improve existing policy gradient methods for average-reward RL. Our key identification is a family of gradient estimators, called score-aware gradient estimators (SAGEs), that enable policy gradient estimation without relying on value-function estimation in the aforementioned setting. We show that SAGE-based policy-gradient locally converges, and we obtain its regret. This includes cases when the state space of the MDP is countable and unstable policies can exist. Under appropriate assumptions such as starting sufficiently close to a maximizer and the existence of a local Lyapunov function, the policy under SAGE-based stochastic gradient ascent has an overwhelming probability of converging to the associated optimal policy. Furthermore, we conduct a numerical comparison between a SAGE-based policy-gradient method and an actor-critic method on several examples inspired from stochastic networks, queueing systems, and models derived from statistical physics. Our results demonstrate that a SAGE-based method finds close-to-optimal policies faster than an actor-critic method.

[380] Robust LLM Unlearning with MUDMAN: Meta-Unlearning with Disruption Masking And Normalization

Filip Sondej, Yushi Yang, Mikołaj Kniejski, Marcel Windys

Main category: cs.LG

TL;DR: MUDMAN is a new robust unlearning method that prevents recovery of dangerous capabilities in language models through disruption masking, gradient normalization, and meta-learning, outperforming prior methods by 40%.

Details

Motivation: Language models retain dangerous knowledge even after safety fine-tuning, and current unlearning methods can be easily reversed, posing misuse and misalignment risks.

Method: Disruption Masking (only updating weights where unlearning and retaining gradients have same sign), gradient normalization, and meta-learning combined into MUDMAN framework.

Result: MUDMAN prevents recovery of dangerous capabilities and outperforms prior TAR method by 40%, setting new state-of-the-art for robust unlearning.

Conclusion: The proposed MUDMAN method provides irreversible unlearning of dangerous capabilities through systematic combination of disruption masking, normalization, and meta-learning techniques.

Abstract: Language models can retain dangerous knowledge and skills even after extensive safety fine-tuning, posing both misuse and misalignment risks. Recent studies show that even specialized unlearning methods can be easily reversed. To address this, we systematically evaluate many existing and novel components of unlearning methods and identify ones crucial for irreversible unlearning. We introduce Disruption Masking, a technique in which we only allow updating weights, where the signs of the unlearning gradient and the retaining gradient are the same. This ensures all updates are non-disruptive. Additionally, we identify the need for normalizing the unlearning gradients, and also confirm the usefulness of meta-learning. We combine these insights into MUDMAN (Meta-Unlearning with Disruption Masking and Normalization) and validate its effectiveness at preventing the recovery of dangerous capabilities. MUDMAN outperforms the prior TAR method by 40%, setting a new state-of-the-art for robust unlearning.

[381] Hyperparameters in Continual Learning: A Reality Check

Sungmin Cha, Kyunghyun Cho

Main category: cs.LG

TL;DR: The paper proposes a new evaluation protocol (GTEP) for continual learning algorithms that assesses generalizability across different datasets, revealing that most state-of-the-art methods’ performance is significantly overestimated by conventional evaluation protocols.

Details

Motivation: Current continual learning evaluation protocols overestimate algorithm performance by using the same scenario for hyperparameter tuning and evaluation, which is unrealistic for real-world applications where algorithms must generalize to unseen scenarios.

Method: Proposed Generalizable Two-phase Evaluation Protocol (GTEP) with separate hyperparameter tuning and evaluation phases using different datasets but same scenario configuration, applied to class-incremental learning with and without pretrained models.

Result: Across 8,000+ experiments, most state-of-the-art algorithms failed to replicate their reported performance, showing their continual learning capacity was significantly overestimated by conventional evaluation methods.

Conclusion: The GTEP protocol provides a more realistic evaluation of continual learning algorithms’ generalizability, revealing critical limitations in current state-of-the-art methods that were masked by conventional evaluation practices.

Abstract: Continual learning (CL) aims to train a model on a sequence of tasks (i.e., a CL scenario) while balancing the trade-off between plasticity (learning new tasks) and stability (retaining prior knowledge). The dominantly adopted conventional evaluation protocol for CL algorithms selects the best hyperparameters (e.g., learning rate, mini-batch size, regularization strengths, etc.) within a given scenario and then evaluates the algorithms using these hyperparameters in the same scenario. However, this protocol has significant shortcomings: it overestimates the CL capacity of algorithms and relies on unrealistic hyperparameter tuning, which is not feasible for real-world applications. From the fundamental principles of evaluation in machine learning, we argue that the evaluation of CL algorithms should focus on assessing the generalizability of their CL capacity to unseen scenarios. Based on this, we propose the Generalizable Two-phase Evaluation Protocol (GTEP) consisting of hyperparameter tuning and evaluation phases. Both phases share the same scenario configuration (e.g., number of tasks) but are generated from different datasets. Hyperparameters of CL algorithms are tuned in the first phase and applied in the second phase to evaluate the algorithms. We apply this protocol to class-incremental learning, both with and without pretrained models. Across more than 8,000 experiments, our results show that most state-of-the-art algorithms fail to replicate their reported performance, highlighting that their CL capacity has been significantly overestimated in the conventional evaluation protocol. Our implementation can be found in https://github.com/csm9493/GTEP.

[382] Differential Mamba

Nadav Schneider, Itamar Zimerman, Eliya Nachmani

Main category: cs.LG

TL;DR: The paper introduces a differential mechanism for Mamba architecture to address attention overallocation to irrelevant context, improving retrieval capabilities and performance over vanilla Mamba.

Details

Motivation: Sequence models like Transformers and RNNs often overallocate attention to irrelevant context, leading to noisy intermediate representations that degrade LLM capabilities by promoting hallucinations, weakening long-range and retrieval abilities, and reducing robustness.

Method: Developed a novel differential mechanism for Mamba architecture, requiring careful architectural modifications rather than naive adaptation of Transformer differential design techniques. Conducted extensive ablation studies and empirical analyses.

Result: Empirically validated on language modeling benchmarks, demonstrating improved retrieval capabilities and superior performance over vanilla Mamba. Effectively mitigates the overallocation problem in Mamba-based models.

Conclusion: Differential design techniques originally developed for Transformers can be successfully applied to Mamba architecture with proper modifications, leading to enhanced performance and capabilities while maintaining Mamba’s efficiency advantages.

Abstract: Sequence models like Transformers and RNNs often overallocate attention to irrelevant context, leading to noisy intermediate representations. This degrades LLM capabilities by promoting hallucinations, weakening long-range and retrieval abilities, and reducing robustness. Recent work has shown that differential design can mitigate this issue in Transformers, improving their effectiveness across various applications. In this paper, we explore whether these techniques, originally developed for Transformers, can be applied to Mamba, a recent architecture based on selective state-space layers that achieves Transformer-level performance with greater efficiency. We show that a naive adaptation of differential design to Mamba is insufficient and requires careful architectural modifications. To address this, we introduce a novel differential mechanism for Mamba, empirically validated on language modeling benchmarks, demonstrating improved retrieval capabilities and superior performance over vanilla Mamba. Finally, we conduct extensive ablation studies and empirical analyses to justify our design choices and provide evidence that our approach effectively mitigates the overallocation problem in Mamba-based models. Our code is publicly available: https://github.com/NadavSc/Diff-Mamba

[383] OmegAMP: Targeted AMP Discovery through Biologically Informed Generation

Diogo Soares, Leon Hetzel, Paulina Szymczak, Marcelo Der Torossian Torres, Johanna Sommer, Cesar de la Fuente-Nunez, Fabian Theis, Stephan Günnemann, Ewa Szczurek

Main category: cs.LG

TL;DR: OmegAMP is a diffusion-based framework for controllable antimicrobial peptide generation that achieves 96% experimental success rate by combining fine-grained property control, biologically informed encoding, and synthetic data augmentation.

Details

Motivation: Address challenges in deep learning-based AMP discovery including limited controllability, inefficient property modeling, and low experimental hit rates.

Method: Uses diffusion-based generative model with novel conditioning mechanism for fine-grained control over physicochemical properties and activity profiles. Employs biologically informed encoding space and synthetic data augmentation for AMP filtering.

Result: Tested 25 candidate peptides - 24 (96%) showed antimicrobial activity, effective against multi-drug resistant strains. State-of-the-art performance in AMP discovery pipeline.

Conclusion: OmegAMP significantly advances computational frameworks against antimicrobial resistance with unprecedented experimental success rates.

Abstract: Deep learning-based antimicrobial peptide (AMP) discovery faces critical challenges such as limited controllability, lack of representations that efficiently model antimicrobial properties, and low experimental hit rates. To address these challenges, we introduce OmegAMP, a framework designed for reliable AMP generation with increased controllability. Its diffusion-based generative model leverages a novel conditioning mechanism to achieve fine-grained control over desired physicochemical properties and to direct generation towards specific activity profiles, including species-specific effectiveness. This is further enhanced by a biologically informed encoding space that significantly improves overall generative performance. Complementing these generative capabilities, OmegAMP leverages a novel synthetic data augmentation strategy to train classifiers for AMP filtering, drastically reducing false positive rates and thereby increasing the likelihood of experimental success. Our in silico experiments demonstrate that OmegAMP delivers state-of-the-art performance across key stages of the AMP discovery pipeline, enabling us to achieve an unprecedented success rate in wet lab experiments. We tested 25 candidate peptides, 24 of them (96%) demonstrated antimicrobial activity, proving effective even against multi-drug resistant strains. Our findings underscore OmegAMP’s potential to significantly advance computational frameworks in the fight against antimicrobial resistance.

[384] How Many Ratings per Item are Necessary for Reliable Significance Testing?

Christopher Homan, Flip Korn, Deepak Pandita, Chris Welty

Main category: cs.LG

TL;DR: Current AI evaluation methods are unreliable because they use too few responses per item. The paper shows that even 5-10 responses per item are insufficient for reliable statistical testing, and proposes methods to determine adequate response counts.

Details

Motivation: Traditional ML evaluation assumes reliable model and human responses against gold standards, but generative AI's stochastic nature and evidence of human unreliability challenge this. Current practices use too few responses per item for reliable evaluation.

Method: Adapted an existing method for evaluating metric reliability to determine if datasets have enough responses per item for reliable null hypothesis statistical testing.

Result: Analysis shows that for many common metrics, collecting 5-10 responses per item is insufficient. Even existing gold standard datasets with multiple responses lack adequate response counts.

Conclusion: The proposed methods can help AI researchers make better decisions about data collection for reliable AI evaluation, addressing the fundamental reliability issues in current evaluation practices.

Abstract: A cornerstone of machine learning evaluation is the (often hidden) assumption that model and human responses are reliable enough to evaluate models against unitary, authoritative, ``gold standard’’ data, via simple metrics such as accuracy, precision, and recall. The generative AI revolution would seem to explode this assumption, given the critical role stochastic inference plays. Yet, in spite of public demand for more transparency in AI – along with strong evidence that humans are unreliable judges – estimates of model reliability are conventionally based on, at most, a few output responses per input item. We adapt a method, previously used to evaluate the reliability of various metrics and estimators for machine learning evaluation, to determine whether an (existing or planned) dataset has enough responses per item to assure reliable null hypothesis statistical testing. We show that, for many common metrics, collecting even 5-10 responses per item (from each model and team of human evaluators) is not sufficient. We apply our methods to several of the very few extant gold standard test sets with multiple disaggregated responses per item and show that even these datasets lack enough responses per item. We show how our methods can help AI researchers make better decisions about how to collect data for AI evaluation.

[385] Handling Label Noise via Instance-Level Difficulty Modeling and Dynamic Optimization

Kuan Zhang, Chengliang Chai, Jingzhe Xu, Chi Zhang, Han Han, Ye Yuan, Guoren Wang, Lei Cao

Main category: cs.LG

TL;DR: A novel two-stage noisy learning framework that uses a dynamically weighted loss function and wrong event metric to handle noisy labels, achieving better performance with reduced computational costs.

Details

Motivation: Existing methods for learning with noisy labels face limitations like high computational costs, heavy hyperparameter tuning, and coarse-grained optimization. Deep neural networks degrade in generalization performance under noisy supervision.

Method: Two-stage framework: 1) Collect wrong event information and build a strong base model, 2) Perform noise-robust training using a probabilistic model to handle wrong event information. Uses dynamically weighted loss function and wrong event metric to model sample cleanliness and difficulty.

Result: Outperforms state-of-the-art methods on five synthetic and real-world LNL benchmarks, achieves 75% reduction in computational time, and improves model scalability.

Conclusion: The proposed framework effectively addresses limitations of existing noisy label learning methods by enabling instance-level optimization without hyperparameter tuning, while maintaining computational efficiency and improving performance.

Abstract: Recent studies indicate that deep neural networks degrade in generalization performance under noisy supervision. Existing methods focus on isolating clean subsets or correcting noisy labels, facing limitations such as high computational costs, heavy hyperparameter tuning process, and coarse-grained optimization. To address these challenges, we propose a novel two-stage noisy learning framework that enables instance-level optimization through a dynamically weighted loss function, avoiding hyperparameter tuning. To obtain stable and accurate information about noise modeling, we introduce a simple yet effective metric, termed wrong event, which dynamically models the cleanliness and difficulty of individual samples while maintaining computational costs. Our framework first collects wrong event information and builds a strong base model. Then we perform noise-robust training on the base model, using a probabilistic model to handle the wrong event information of samples. Experiments on five synthetic and real-world LNL benchmarks demonstrate our method surpasses state-of-the-art methods in performance, achieves a nearly 75% reduction in computational time and improves model scalability.

[386] Hypergraph clustering using Ricci curvature: an edge transport perspective

Olympio Hacquard

Main category: cs.LG

TL;DR: Novel Ricci flow extension to hypergraphs using probability measures on edges and line expansion, creating effective edge weights for community detection.

Details

Motivation: To develop a more sensitive method for community detection in hypergraphs that better captures the structure, especially with large hyperedges.

Method: Define probability measures on hypergraph edges and transport them on line expansion to create new edge weights, comparing with clique expansion approach.

Result: The method shows enhanced sensitivity to hypergraph structure compared to clique expansion, particularly effective for large hyperedges.

Conclusion: The line expansion and clique expansion Ricci flow methods are complementary and together form a powerful, interpretable framework for hypergraph community detection.

Abstract: In this paper, we introduce a novel method for extending Ricci flow to hypergraphs by defining probability measures on the edges and transporting them on the line expansion. This approach yields a new weighting on the edges, which proves particularly effective for community detection. We extensively compare this method with a similar notion of Ricci flow defined on the clique expansion, demonstrating its enhanced sensitivity to the hypergraph structure, especially in the presence of large hyperedges. The two methods are complementary and together form a powerful and highly interpretable framework for community detection in hypergraphs.

[387] Plexus: Taming Billion-edge Graphs with 3D Parallel Full-graph GNN Training

Aditya K. Ranjan, Siddharth Singh, Cunyang Wei, Abhinav Bhatele

Main category: cs.LG

TL;DR: Plexus introduces a 3D parallel approach for full-graph GNN training that scales to billion-edge graphs, achieving 2.3-12.5x speedups over prior state-of-the-art methods.

Details

Motivation: Real-world graphs often exceed GPU memory capacity, requiring mini-batch sampling which has limitations. Distributed full-graph training suffers from high communication overhead and load imbalance due to irregular graph structures.

Method: A three-dimensional (3D) parallel approach for full-graph training with optimizations including a double permutation scheme for load balancing and a performance model to predict optimal 3D configuration.

Result: Plexus achieves unprecedented speedups of 2.3-12.5x over prior state of the art, with time-to-solution reductions of 5.2-8.7x on Perlmutter and 7.0-54.2x on Frontier, scaling to 2048 GPUs.

Conclusion: The proposed 3D parallel approach effectively addresses scalability and load balancing challenges in large-scale GNN training, enabling efficient full-graph training on billion-edge graphs.

Abstract: Graph neural networks (GNNs) leverage the connectivity and structure of real-world graphs to learn intricate properties and relationships between nodes. Many real-world graphs exceed the memory capacity of a GPU due to their sheer size, and training GNNs on such graphs requires techniques such as mini-batch sampling to scale. The alternative approach of distributed full-graph training suffers from high communication overheads and load imbalance due to the irregular structure of graphs. We propose a three-dimensional (3D) parallel approach for full-graph training that tackles these issues and scales to billion-edge graphs. In addition, we introduce optimizations such as a double permutation scheme for load balancing, and a performance model to predict the optimal 3D configuration of our parallel implementation – Plexus. We evaluate Plexus on six different graph datasets and show scaling results on up to 2048 GPUs of Perlmutter, and 1024 GPUs of Frontier. Plexus achieves unprecedented speedups of 2.3-12.5x over prior state of the art, and a reduction in time-to-solution by 5.2-8.7x on Perlmutter and 7.0-54.2x on Frontier.

[388] Exact Sequence Interpolation with Transformers

Albert Alcalde, Giovanni Fantuzzi, Enrique Zuazua

Main category: cs.LG

TL;DR: Transformers can exactly interpolate finite input sequences in R^d (d≥2) with corresponding output sequences of smaller or equal length, using O(∑m^j) blocks and O(d∑m^j) parameters.

Details

Motivation: To provide theoretical understanding of transformer models' excellent performance in exact sequence-to-sequence interpolation tasks and explain their interpolation capabilities.

Method: Construct transformers with alternating feed-forward and self-attention layers, using low-rank parameter matrices in self-attention. Analysis starts with hardmax self-attention and extends to softmax setting.

Result: Exact interpolation of datasets with complexity estimates independent of input sequence length. Construction provides convergence guarantees to global minimizer under regularized training.

Conclusion: Transformers have strong theoretical interpolation capabilities with complexity independent of input length, explaining their practical success in sequence-to-sequence tasks.

Abstract: We prove that transformers can exactly interpolate datasets of finite input sequences in $\mathbb{R}^d$, $d\geq 2$, with corresponding output sequences of smaller or equal length. Specifically, given $N$ sequences of arbitrary but finite lengths in $\mathbb{R}^d$ and output sequences of lengths $m^1, \dots, m^N \in \mathbb{N}$, we construct a transformer with $\mathcal{O}(\sum_{j=1}^N m^j)$ blocks and $\mathcal{O}(d \sum_{j=1}^N m^j)$ parameters that exactly interpolates the dataset. Our construction provides complexity estimates that are independent of the input sequence length, by alternating feed-forward and self-attention layers and by capitalizing on the clustering effect inherent to the latter. Our novel constructive method also uses low-rank parameter matrices in the self-attention mechanism, a common feature of practical transformer implementations. These results are first established in the hardmax self-attention setting, where the geometric structure permits an explicit and quantitative analysis, and are then extended to the softmax setting. Finally, we demonstrate the applicability of our exact interpolation construction to learning problems, in particular by providing convergence guarantees to a global minimizer under regularized training strategies. Our analysis contributes to the theoretical understanding of transformer models, offering an explanation for their excellent performance in exact sequence-to-sequence interpolation tasks.

[389] Learning from Reward-Free Offline Data: A Case for Planning with Latent Dynamics Models

Vlad Sobal, Wancong Zhang, Kyunghyun Cho, Randall Balestriero, Tim G. J. Rudner, Yann LeCun

Main category: cs.LG

TL;DR: Systematic comparison of RL vs control-based methods for offline navigation tasks shows model-based planning generalizes better to unseen layouts and is more data-efficient, while model-free RL benefits most from large high-quality datasets.

Details

Motivation: To understand comparative strengths of RL and optimal control in offline settings where agents learn from reward-free trajectories, which remains underexplored despite being crucial for developing agents that solve diverse tasks across unseen environments.

Method: Evaluated RL (goal-conditioned and zero-shot) and control-based methods (JEPA-trained latent dynamics model for planning) on navigation tasks using offline datasets of varying quality, analyzing factors like data diversity, trajectory quality, and environment variability.

Result: Model-free RL performs best with large amounts of high-quality data, while model-based planning generalizes better to unseen layouts, is more data-efficient, and achieves trajectory stitching performance comparable to leading model-free methods.

Conclusion: Planning with latent dynamics models is a strong approach for handling suboptimal offline data and adapting to diverse environments, offering better generalization and data efficiency than model-free RL in offline settings.

Abstract: A long-standing goal in AI is to develop agents capable of solving diverse tasks across a range of environments, including those never seen during training. Two dominant paradigms address this challenge: (i) reinforcement learning (RL), which learns policies via trial and error, and (ii) optimal control, which plans actions using a known or learned dynamics model. However, their comparative strengths in the offline setting - where agents must learn from reward-free trajectories - remain underexplored. In this work, we systematically evaluate RL and control-based methods on a suite of navigation tasks, using offline datasets of varying quality. On the RL side, we consider goal-conditioned and zero-shot methods. On the control side, we train a latent dynamics model using the Joint Embedding Predictive Architecture (JEPA) and employ it for planning. We investigate how factors such as data diversity, trajectory quality, and environment variability influence the performance of these approaches. Our results show that model-free RL benefits most from large amounts of high-quality data, whereas model-based planning generalizes better to unseen layouts and is more data-efficient, while achieving trajectory stitching performance comparable to leading model-free methods. Notably, planning with a latent dynamics model proves to be a strong approach for handling suboptimal offline data and adapting to diverse environments.

[390] Probabilistic Kernel Function for Fast Angle Testing

Kejing Lu, Chuan Xiao, Yoshiharu Ishikawa

Main category: cs.LG

TL;DR: The paper proposes deterministic projection-based kernel functions for angle testing in high-dimensional similarity search, outperforming Gaussian-based methods and achieving 2.5X-3X higher QPS than HNSW.

Details

Motivation: Existing approaches for angle testing in high-dimensional similarity search rely on random Gaussian projections with asymptotic assumptions, which may not be optimal in practice.

Method: Two projection-based probabilistic kernel functions using deterministic projection vectors based on reference angles, without requiring asymptotic assumptions.

Result: The proposed kernel functions outperform Gaussian-distribution-based methods both theoretically and experimentally, and achieve 2.5X-3X higher query-per-second throughput than HNSW when applied to Approximate Nearest Neighbor Search.

Conclusion: Deterministic projection vectors based on reference angles provide superior performance for angle testing in high-dimensional similarity search compared to traditional random Gaussian projections.

Abstract: In this paper, we study the angle testing problem in the context of similarity search in high-dimensional Euclidean spaces and propose two projection-based probabilistic kernel functions, one designed for angle comparison and the other for angle thresholding. Unlike existing approaches that rely on random projection vectors drawn from Gaussian distributions, our approach leverages reference angles and employs a deterministic structure for the projection vectors. Notably, our kernel functions do not require asymptotic assumptions, such as the number of projection vectors tending to infinity, and can be both theoretically and experimentally shown to outperform Gaussian-distribution-based kernel functions. We apply the proposed kernel function to Approximate Nearest Neighbor Search (ANNS) and demonstrate that our approach achieves a 2.5X ~ 3X higher query-per-second (QPS) throughput compared to the widely-used graph-based search algorithm HNSW.

[391] TuneNSearch: a hybrid transfer learning and local search approach for solving vehicle routing problems

Arthur Corrêa, Cristóvão Silva, Liming Xu, Alexandra Brintrup, Samuel Moniz

Main category: cs.LG

TL;DR: TuneNSearch is a hybrid method combining reinforcement learning and local search for vehicle routing problems, using pre-training on multi-depot VRP and fine-tuning for other variants, achieving strong generalization across problem types.

Details

Motivation: To develop a versatile approach that can effectively handle diverse VRP variants while maintaining strong performance across different problem formulations, sizes, and instance distributions.

Method: Uses reinforcement learning with Transformer-based architecture enhanced with edge-aware attention, followed by local search refinement. Pre-trains on multi-depot VRP then fine-tunes for other variants.

Result: Outperforms CVRP-pre-trained model by 44% on multi-depot variants while maintaining similar performance on single-depot variants. Achieves less than 3% deviation from best-known solutions on CVRPLIB datasets vs 6-25% for other neural models.

Conclusion: TuneNSearch demonstrates strong generalization across VRP variants, problem sizes, and instance distributions while maintaining polynomial runtime complexity, making it a versatile and effective approach for diverse routing problems.

Abstract: This paper introduces TuneNSearch, a hybrid transfer learning and local search approach for addressing diverse variants of the vehicle routing problem (VRP). Our method uses reinforcement learning to generate high-quality solutions, which are subsequently refined by an efficient local search procedure. To ensure broad adaptability across VRP variants, TuneNSearch begins with a pre-training phase on the multi-depot VRP (MDVRP), followed by a fine-tuning phase to adapt it to other problem formulations. The learning phase utilizes a Transformer-based architecture enhanced with edge-aware attention, which integrates edge distances directly into the attention mechanism to better capture spatial relationships inherent to routing problems. We show that the pre-trained model generalizes effectively to single-depot variants, achieving performance comparable to models trained specifically on single-depot instances. Simultaneously, it maintains strong performance on multi-depot variants, an ability that models pre-trained solely on single-depot problems lack. For example, on 100-node instances of multi-depot variants, TuneNSearch outperforms a model pre-trained on the CVRP by 44%. In contrast, on 100-node instances of single-depot variants, TuneNSearch performs similar to the CVRP model. To validate the effectiveness of our method, we conduct extensive computational experiments on public benchmark and randomly generated instances. Across multiple CVRPLIB datasets, TuneNSearch consistently achieves performance deviations of less than 3% from the best-known solutions in the literature, compared to 6-25% for other neural-based models, depending on problem complexity. Overall, our approach demonstrates strong generalization to different problem sizes, instance distributions, and VRP formulations, while maintaining polynomial runtime complexity despite the integration of the local search algorithm.

[392] A method for the systematic generation of graph XAI benchmarks via Weisfeiler-Leman coloring

Michele Fontanesi, Alessio Micheli, Marco Podda, Domenico Tortorella

Main category: cs.LG

TL;DR: The paper proposes a method to automate the construction of graph XAI benchmarks using Weisfeiler-Leman color refinement to mine class-discriminating motifs as ground-truth explanations, and introduces OpenGraphXAI - a suite of 15 ready-made datasets.

Details

Motivation: Current graph-XAI benchmarks are limited to simplistic synthetic datasets or few real-world tasks curated by domain experts, hindering rigorous and reproducible evaluation and stalling progress in the field.

Method: Leverages Weisfeiler-Leman color refinement algorithm to perform approximate subgraph matching and mine class-discriminating motifs that serve as proxy ground-truth class explanations, ensuring these motifs can be learned by GNNs.

Result: Created OpenGraphXAI benchmark suite with 15 ready-made graph-XAI datasets from real-world molecular classification datasets, plus codebase to generate over 2,000 additional benchmarks.

Conclusion: The benchmark suite enables assessment of graph explainer effectiveness and demonstrates the critical role of large benchmark collections for improving experimental significance in graph XAI research.

Abstract: Graph neural networks have become the de facto model for learning from structured data. However, the decision-making process of GNNs remains opaque to the end user, which undermines their use in safety-critical applications. Several explainable AI techniques for graphs have been developed to address this major issue. Focusing on graph classification, these explainers identify subgraph motifs that explain predictions. Therefore, a robust benchmarking of graph explainers is required to ensure that the produced explanations are of high quality, i.e., aligned with the GNN’s decision process. However, current graph-XAI benchmarks are limited to simplistic synthetic datasets or a few real-world tasks curated by domain experts, hindering rigorous and reproducible evaluation, and consequently stalling progress in the field. To overcome these limitations, we propose a method to automate the construction of graph XAI benchmarks from generic graph classification datasets. Our approach leverages the Weisfeiler-Leman color refinement algorithm to efficiently perform approximate subgraph matching and mine class-discriminating motifs, which serve as proxy ground-truth class explanations. At the same time, we ensure that these motifs can be learned by GNNs because their discriminating power aligns with WL expressiveness. This work also introduces the OpenGraphXAI benchmark suite, which consists of 15 ready-made graph-XAI datasets derived by applying our method to real-world molecular classification datasets. The suite is available to the public along with a codebase to generate over 2,000 additional graph-XAI benchmarks. Finally, we present a use case that illustrates how the suite can be used to assess the effectiveness of a selection of popular graph explainers, demonstrating the critical role of a sufficiently large benchmark collection for improving the significance of experimental results.

[393] ASGO: Adaptive Structured Gradient Optimization

Kang An, Yuxing Liu, Rui Pan, Yi Ren, Shiqian Ma, Donald Goldfarb, Tong Zhang

Main category: cs.LG

TL;DR: ASGO is a novel optimization algorithm that leverages the structured properties of deep neural networks (low-rank gradients and block diagonal Hessians) through an adaptively updated preconditioner, achieving superior convergence rates compared to existing methods.

Details

Motivation: Current popular optimizers like Adam do not utilize the structural properties of deep neural networks where parameters are naturally represented as matrices/tensors, and gradients exhibit low-rank structure while Hessians are approximately block diagonal.

Method: ASGO employs a preconditioner that is adaptively updated using structured gradients, capitalizing on the low-rank gradient and block diagonal Hessian properties through fine-grained theoretical analysis.

Result: Theoretical analysis proves ASGO achieves superior convergence rates compared to existing structured gradient methods, and empirical verification on language model tasks demonstrates its effectiveness.

Conclusion: ASGO successfully leverages the structural properties of deep neural networks for more efficient optimization, providing both theoretical guarantees and practical effectiveness in language modeling tasks.

Abstract: Training deep neural networks is a structured optimization problem, because the parameters are naturally represented by matrices and tensors rather than by vectors. Under this structural representation, it has been widely observed that gradients are low-rank and Hessians are approximately block diagonal. These structured properties are crucial for designing efficient optimization algorithms, but are not utilized by many current popular optimizers like Adam. In this paper, we present a novel optimization algorithm ASGO that capitalizes on these properties by employing a preconditioner that is adaptively updated using structured gradients. By a fine-grained theoretical analysis, ASGO is proven to achieve superior convergence rates compared to existing structured gradient methods. Based on this convergence theory, we further demonstrate that ASGO can benefit from low-rank gradients and block diagonal Hessians. We also discuss practical modifications of ASGO and empirically verify ASGO’s effectiveness on language model tasks. Code is available at https://github.com/infinity-stars/ASGO.

[394] Enlightenment Period Improving DNN Performance

Tiantian Liu, Meng Wan, Jue Wang, Ningming Nie

Main category: cs.LG

TL;DR: The paper identifies an ‘Enlightenment Period’ in early DNN training where Mixup data augmentation has dual effects: negative gradient interference but positive activation revival for saturated neurons. Three strategies are proposed to optimize training during this critical phase.

Details

Motivation: To understand and leverage the critical early training phase where neural network representations transition from disordered to ordered structure, and to address the dual effects of Mixup augmentation during this period.

Method: Theoretical modeling using phase transition theory and experimental validation to analyze Mixup’s effects. Proposed three strategies: Mixup Pause for small-scale, Alpha Boost for large-scale underfitting, and High-Loss Removal for Mixup-inapplicable tasks.

Result: Extensive experiments show superior performance across ViT and ResNet architectures on CIFAR and ImageNet-1K datasets. The strategies effectively balance Mixup’s dual effects based on dataset size and model parameters.

Conclusion: Strategic manipulation of training data distribution during the brief Enlightenment Period can significantly enhance model performance, offering a novel approach to optimize early training dynamics.

Abstract: The start of deep neural network training is characterized by a brief yet critical phase that lasts from the beginning of the training until the accuracy reaches approximately 50%. During this phase, disordered representations rapidly transition toward ordered structure, and we term this phase the Enlightenment Period. Through theoretical modeling based on phase transition theory and experimental validation, we reveal that applying Mixup data augmentation during this phase has a dual effect: it introduces a Gradient Interference Effect that hinders performance, while also providing a beneficial Activation Revival Effect to restore gradient updates for saturated neurons. We further demonstrate that this negative interference diminishes as the sample set size or the model parameter size increases, thereby shifting the balance between these two effects. Based on these findings, we propose three strategies that improve performance by solely adjusting the training data distribution within this brief period: the Mixup Pause Strategy for small-scale scenarios, the Alpha Boost Strategy for large-scale scenarios with underfitting, and the High-Loss Removal Strategy for tasks where Mixup is inapplicable (e.g., time series and large language models). Extensive experiments show that these strategies achieve superior performance across diverse architectures such as ViT and ResNet on datasets including CIFAR and ImageNet-1K. Ultimately, this work offers a novel perspective on enhancing model performance by strategically capitalizing on the dynamics of the brief and crucial early stages of training. Code is available at https://anonymous.4open.science/r/code-A5F1/.

[395] MDPs with a State Sensing Cost

Vansh Kapoor, Jayakrishnan Nair

Main category: cs.LG

TL;DR: The paper addresses sequential decision-making problems where sensing the state has a cost, proposing an MDP formulation with expanded state space and developing efficient algorithms with performance bounds.

Details

Motivation: Many real-world decision-making problems involve costs for sensing/communicating/computing state information, requiring a balance between optimal actions and sensing costs.

Method: Formulated as a discounted cost MDP with expanded infinite state space, derived lower bounds on optimal value function, and proposed SPI algorithm based on policy improvement.

Result: The proposed SPI algorithm performs close to optimal in practice, with derived bounds quantifying suboptimality gaps of policies.

Conclusion: The framework effectively balances sensing costs with decision quality, providing computationally efficient solutions with performance guarantees for practical applications.

Abstract: In many practical sequential decision-making problems, tracking the state of the environment incurs a sensing/communication/computation cost. In these settings, the agent’s interaction with its environment includes the additional component of deciding when to sense the state, in a manner that balances the value associated with optimal (state-specific) actions and the cost of sensing. We formulate this as an expected discounted cost Markov Decision Process (MDP), wherein the agent incurs an additional cost for sensing its next state, but has the option to take actions while remaining `blind’ to the system state. We pose this problem as a classical discounted cost MDP with an expanded (countably infinite) state space. While computing the optimal policy for this MDP is intractable in general, we derive lower bounds on the optimal value function, which allow us to bound the suboptimality gap of any policy. We also propose a computationally efficient algorithm SPI, based on policy improvement, which in practice performs close to the optimal policy. Finally, we benchmark against the state-of-the-art via a numerical case study.

[396] SATURN: SAT-based Reinforcement Learning to Unleash Language Model Reasoning

Huanyu Liu, Jia Li, Hao Zhu, Kechi Zhang, Yihong Dong, Ge Li

Main category: cs.LG

TL;DR: Saturn is a SAT-based RL framework that uses Boolean Satisfiability problems to train LLMs’ reasoning capabilities, addressing scalability, verifiability, and difficulty control limitations of existing RL tasks.

Details

Motivation: Existing RL tasks for LLMs suffer from three key limitations: (1) Scalability - heavy reliance on human annotation or expensive LLM synthesis, (2) Verifiability - difficulty in automatically verifying LLM outputs, and (3) Controllable Difficulty - lack of fine-grained difficulty control for progressive training.

Method: Saturn uses Boolean Satisfiability (SAT) problems with a curriculum learning pipeline that constructs SAT tasks of increasing difficulty. It includes a principled mechanism to control difficulty transitions and creates Saturn-2.6k dataset with 2,660 SAT problems of varying difficulty.

Result: Saturn-1.5B and Saturn-7B models achieve: +14.0 and +28.1 average pass@3 improvements on SAT problems; +4.9 and +1.8 score improvements on math and programming benchmarks; +8.8% improvement over SOTA RL task construction approaches.

Conclusion: Saturn effectively addresses key limitations in RL task design for LLMs, enabling scalable training, reliable verification, and progressive difficulty control, leading to significant improvements in reasoning capabilities across multiple domains.

Abstract: How to design reinforcement learning (RL) tasks that effectively unleash the reasoning capability of large language models (LLMs) remains an open question. Existing RL tasks (e.g., math, programming, and constructing reasoning tasks) suffer from three key limitations: (1) Scalability. They rely heavily on human annotation or expensive LLM synthesis to generate sufficient training data. (2) Verifiability. LLMs’ outputs are hard to verify automatically and reliably. (3) Controllable Difficulty. Most tasks lack fine-grained difficulty control, making it hard to train LLMs to develop reasoning ability from easy to hard. To address these limitations, we propose Saturn, a SAT-based RL framework that uses Boolean Satisfiability (SAT) problems to train and evaluate LLMs reasoning. Saturn enables scalable task construction, rule-based verification, and precise difficulty control. Saturn designs a curriculum learning pipeline that continuously improves LLMs’ reasoning capability by constructing SAT tasks of increasing difficulty and training LLMs from easy to hard. To ensure stable training, we design a principled mechanism to control difficulty transitions. We introduce Saturn-2.6k, a dataset of 2,660 SAT problems with varying difficulty. It supports the evaluation of how LLM reasoning changes with problem difficulty. We apply Saturn to DeepSeek-R1-Distill-Qwen and obtain Saturn-1.5B and Saturn-7B. We achieve several notable results: (1) On SAT problems, Saturn-1.5B and Saturn-7B achieve average pass@3 improvements of +14.0 and +28.1, respectively. (2) On math and programming tasks, Saturn-1.5B and Saturn-7B improve average scores by +4.9 and +1.8 on benchmarks (e.g., AIME, LiveCodeBench). (3) Compared to the state-of-the-art (SOTA) approach in constructing RL tasks, Saturn achieves further improvements of +8.8%. We release the source code, data, and models to support future research.

[397] Why Knowledge Distillation Works in Generative Models: A Minimal Working Explanation

Sungmin Cha, Kyunghyun Cho

Main category: cs.LG

TL;DR: Knowledge distillation in generative models creates a precision-recall trade-off, improving sample quality at the expense of diversity coverage, which benefits scenarios prioritizing quality over diversity.

Details

Motivation: To understand the underlying mechanisms of how knowledge distillation improves generative quality in large language models, as empirical benefits are documented but mechanisms remain poorly understood.

Method: Used controlled simulations with mixtures of Gaussians and validated findings in large-scale language modeling using SmolLM2 family models, analyzing the precision-recall dynamics modulated by entropy-controlling parameters.

Result: Demonstrated that distillation induces a precision-recall trade-off where students concentrate probability mass on high-likelihood regions (improving precision/sample quality) at the expense of coverage (reducing recall/diversity).

Conclusion: Knowledge distillation’s effectiveness in generative modeling stems from a simple precision-recall trade-off mechanism, particularly beneficial when sample quality is prioritized over diversity in applications like instruction tuning.

Abstract: Knowledge distillation (KD) is a core component in the training and deployment of modern generative models, particularly large language models (LLMs). While its empirical benefits are well documented – enabling smaller student models to emulate the performance of much larger teachers – the underlying mechanisms by which KD improves generative quality remain poorly understood. In this work, we present a minimal working explanation of KD in generative modeling. Using a controlled simulation with mixtures of Gaussians, we demonstrate that distillation induces a trade-off between precision and recall in the student model. As the teacher distribution becomes more selective, the student concentrates more probability mass on high-likelihood regions at the expense of coverage – a behavior modulated by a single entropy-controlling parameter. We then validate this effect in a large-scale language modeling setup using the SmolLM2 family of models. Empirical results reveal the same precision-recall dynamics observed in simulation, where precision corresponds to sample quality and recall to distributional coverage. This precision-recall trade-off in LLMs is found to be especially beneficial in scenarios where sample quality is more important than diversity, such as instruction tuning or downstream generation. Our analysis provides a simple and general explanation for the effectiveness of KD in generative modeling.

[398] Scaling Up Liquid-Resistance Liquid-Capacitance Networks for Efficient Sequence Modeling

Mónika Farsang, Ramin Hasani, Daniela Rus, Radu Grosu

Main category: cs.LG

TL;DR: LrcSSM is a non-linear recurrent model that achieves linear time/memory complexity and logarithmic sequential depth for long sequences by using a diagonal Jacobian structure, enabling parallel processing while maintaining performance and providing gradient stability guarantees.

Details

Motivation: To develop a non-linear recurrent model that can process long sequences as efficiently as linear state-space models while providing formal gradient stability guarantees that other input-varying systems like Liquid-S4 and Mamba lack.

Method: Forces the Jacobian matrix to be diagonal, allowing full sequence processing in parallel with O(TD) time/memory and O(log T) sequential depth. The diagonal Jacobian structure enables efficient computation without performance loss compared to dense Jacobian models.

Result: Outperforms Transformers, LRU, S5, and Mamba on long-range forecasting tasks. The approach maintains performance while providing formal gradient stability guarantees.

Conclusion: LrcSSM demonstrates that non-linear recurrent models can achieve linear sequence processing efficiency through diagonal Jacobian structure, offering both performance advantages and theoretical guarantees while being generalizable to other non-linear recurrent models.

Abstract: We present LrcSSM, a $\textit{non-linear}$ recurrent model that processes long sequences as fast as today’s linear state-space layers. By forcing its Jacobian matrix to be diagonal, the full sequence can be solved in parallel, giving $\mathcal{O}(TD)$ time and memory and only $\mathcal{O}(\log T)$ sequential depth, for input-sequence length $T$ and a state dimension $D$. Moreover, LrcSSM offers a formal gradient-stability guarantee that other input-varying systems such as Liquid-S4 and Mamba do not provide. Importantly, the diagonal Jacobian structure of our model results in no performance loss compared to the original model with dense Jacobian, and the approach can be generalized to other non-linear recurrent models, demonstrating broader applicability. On a suite of long-range forecasting tasks, we demonstrate that LrcSSM outperforms Transformers, LRU, S5, and Mamba.

[399] Decom-Renorm-Merge: Model Merging on the Right Space Improves Multitasking

Yuatyong Chaichana, Thanapat Trachu, Peerat Limkonchotiwat, Konpat Preechakul, Tirasan Khandhawit, Ekapol Chuangsuwanich

Main category: cs.LG

TL;DR: DRM is a model merging method that uses Singular Value Decomposition to align weight matrices from different models into a joint space, enabling effective entry-wise merging even when neurons have different feature compositions.

Details

Motivation: Existing model merging methods assume identical positions in weight matrices serve the same function, but this overlooks the complexity of finetuned networks where neurons develop distinct feature compositions, making direct entry-wise merging problematic.

Method: Decom-Renorm-Merge (DRM) uses Singular Value Decomposition to decompose and coordinate weight matrices into an aligned joint space, with renormalization as a crucial component for creating a robust joint space.

Result: DRM outperforms several state-of-the-art merging techniques across various model types (ViT, DeBERTa, T5, Llama3.1-8B) in both full finetuning and low-rank adaptation settings.

Conclusion: Renormalization is identified as the crucial component for creating a robust joint space for merging, significantly contributing to the method’s performance across diverse model architectures.

Abstract: In the era of large-scale training, model merging has evolved into a tool for creating multitasking models efficiently. It enables the knowledge of models to be fused, without the need for heavy computation as required in traditional multitask learning. Existing merging methods often assume that entries at identical positions in weight matrices serve the same function, enabling straightforward entry-wise comparison and merging. However, this assumption overlooks the complexity of finetuned neural networks, where neurons may develop distinct feature compositions, making direct entry-wise merging problematic. We present Decom-Renorm-Merge (DRM), a simple yet effective approach that leverages Singular Value Decomposition to decompose and coordinate weight matrices into an aligned joint space, where entry-wise merging becomes possible. We showcase the effectiveness of DRM across various settings ranging from smaller encoder-based such as ViT and DeBERTa, encoder-decoder-based such as T5, and larger decoder-based such as Llama3.1-8B. Our experimental results show that DRM outperforms several state-of-the-art merging techniques across full finetuning and low-rank adaptation settings. Moreover, our analysis reveals renormalization as the crucial component for creating a robust and even joint space for merging, significantly contributing to the method’s performance.

[400] Learning with Calibration: Exploring Test-Time Computing of Spatio-Temporal Forecasting

Wei Chen, Yuxuan Liang

Main category: cs.LG

TL;DR: ST-TTC is a novel test-time computing paradigm for spatio-temporal forecasting that uses calibration to capture periodic structural biases and perform real-time bias correction, offering an efficient alternative to complex training-stage methods.

Details

Motivation: Real-world spatio-temporal forecasting faces challenges like signal anomalies, noise, and distributional shifts. Existing solutions are computationally intensive and resource-demanding, especially for large-scale applications.

Method: Introduces a spectral-domain calibrator with phase-amplitude modulation to mitigate periodic shift, and a flash updating mechanism with streaming memory queue for efficient test-time computation.

Result: Extensive experiments on real-world datasets demonstrate the effectiveness, universality, flexibility and efficiency of the proposed method.

Conclusion: ST-TTC effectively bypasses complex training-stage techniques, offering an efficient and generalizable paradigm for spatio-temporal forecasting.

Abstract: Spatio-temporal forecasting is crucial in many domains, such as transportation, meteorology, and energy. However, real-world scenarios frequently present challenges such as signal anomalies, noise, and distributional shifts. Existing solutions primarily enhance robustness by modifying network architectures or training procedures. Nevertheless, these approaches are computationally intensive and resource-demanding, especially for large-scale applications. In this paper, we explore a novel test-time computing paradigm, namely learning with calibration, ST-TTC, for spatio-temporal forecasting. Through learning with calibration, we aim to capture periodic structural biases arising from non-stationarity during the testing phase and perform real-time bias correction on predictions to improve accuracy. Specifically, we first introduce a spectral-domain calibrator with phase-amplitude modulation to mitigate periodic shift and then propose a flash updating mechanism with a streaming memory queue for efficient test-time computation. ST-TTC effectively bypasses complex training-stage techniques, offering an efficient and generalizable paradigm. Extensive experiments on real-world datasets demonstrate the effectiveness, universality, flexibility and efficiency of our proposed method.

[401] DeepRTE: Pre-trained Attention-based Neural Network for Radiative Transfer

Yekun Zhu, Min Tang, Zheng Ma

Main category: cs.LG

TL;DR: DeepRTE is a novel neural network approach that efficiently solves the steady-state Radiative Transfer Equation (RTE) using physics-informed architecture and achieves high accuracy with fewer parameters through multi-head attention mechanisms.

Details

Motivation: The RTE governs radiation propagation in participating media and has applications in neutron transport, atmospheric radiative transfer, heat transfer, and optical imaging. Traditional methods and existing neural approaches lack computational efficiency for solving this complex differential-integral equation.

Method: DeepRTE embeds physical information through RTE derivation and mathematically-informed network architecture. It incorporates multi-head attention mechanisms, Green’s function theory, and pre-training with delta-function inflow boundary conditions. It’s a mesh-free neural operator framework with zero-shot capability.

Result: DeepRTE demonstrates superior computational efficiency compared to traditional methods and existing neural network approaches. It achieves high accuracy with significantly fewer parameters. Comprehensive numerical experiments substantiate its efficacy.

Conclusion: The proposed DeepRTE framework provides an efficient and accurate solution for the steady-state RTE, offering computational advantages through its physics-informed design and zero-shot capability.

Abstract: In this paper, we propose a novel neural network approach, termed DeepRTE, to address the steady-state Radiative Transfer Equation (RTE). The RTE is a differential-integral equation that governs the propagation of radiation through a participating medium, with applications spanning diverse domains such as neutron transport, atmospheric radiative transfer, heat transfer, and optical imaging. Our DeepRTE framework demonstrates superior computational efficiency for solving the steady-state RTE, surpassing traditional methods and existing neural network approaches. This efficiency is achieved by embedding physical information through derivation of the RTE and mathematically-informed network architecture. Concurrently, DeepRTE achieves high accuracy with significantly fewer parameters, largely due to its incorporation of mechanisms such as multi-head attention. Furthermore, DeepRTE is a mesh-free neural operator framework with inherent zero-shot capability. This is achieved by incorporating Green’s function theory and pre-training with delta-function inflow boundary conditions into both its architecture design and training data construction. The efficacy of the proposed approach is substantiated through comprehensive numerical experiments.

[402] Doubly Robust Alignment for Large Language Models

Erhan Xu, Kai Ye, Hongyi Zhou, Luhan Zhu, Francesco Quinzan, Chengchun Shi

Main category: cs.LG

TL;DR: Proposes a doubly robust preference optimization algorithm for RLHF that remains consistent when either the preference model or reference policy is correctly specified, addressing model misspecification issues.

Details

Motivation: RLHF algorithms are highly sensitive to misspecifications in preference models, reference policies, or reward functions, leading to undesirable fine-tuning outcomes.

Method: Developed a doubly robust preference optimization algorithm that works correctly when either the preference model or reference policy is properly specified, without requiring both to be correct.

Result: The proposed algorithm demonstrates superior and more robust performance than state-of-the-art methods in both theoretical analysis and practical experiments.

Conclusion: Doubly robust preference optimization provides a more reliable approach for RLHF that mitigates the impact of model misspecification, improving alignment of large language models with human preferences.

Abstract: This paper studies reinforcement learning from human feedback (RLHF) for aligning large language models with human preferences. While RLHF has demonstrated promising results, many algorithms are highly sensitive to misspecifications in the underlying preference model (e.g., the Bradley-Terry model), the reference policy, or the reward function, resulting in undesirable fine-tuning. To address model misspecification, we propose a doubly robust preference optimization algorithm that remains consistent when either the preference model or the reference policy is correctly specified (without requiring both). Our proposal demonstrates superior and more robust performance than state-of-the-art algorithms, both in theory and in practice. The code is available at https://github.com/DRPO4LLM/DRPO4LLM

[403] Q-learning with Posterior Sampling

Priyank Agrawal, Shipra Agrawal, Azmat Azati

Main category: cs.LG

TL;DR: PSQL combines Q-learning with posterior sampling using Gaussian distributions on Q-values, achieving near-optimal regret bound of Õ(H²√SAT) in tabular episodic MDPs.

Details

Motivation: Bayesian posterior sampling shows strong empirical performance in exploration-exploitation settings but lacks theoretical analysis, especially in complex RL environments.

Method: Q-Learning with Posterior Sampling (PSQL) - a Q-learning algorithm that uses Gaussian posterior distributions on Q-values for exploration, similar to Thompson Sampling in bandits.

Result: PSQL achieves a regret bound of Õ(H²√SAT) in tabular episodic MDPs, closely matching the known lower bound of Ω(H√SAT).

Conclusion: The work provides technical insights into combining posterior sampling with RL algorithms and serves as a foundation for analyzing this technique in more complex RL settings.

Abstract: Bayesian posterior sampling techniques have demonstrated superior empirical performance in many exploration-exploitation settings. However, their theoretical analysis remains a challenge, especially in complex settings like reinforcement learning. In this paper, we introduce Q-Learning with Posterior Sampling (PSQL), a simple Q-learning-based algorithm that uses Gaussian posteriors on Q-values for exploration, akin to the popular Thompson Sampling algorithm in the multi-armed bandit setting. We show that in the tabular episodic MDP setting, PSQL achieves a regret bound of $\tilde O(H^2\sqrt{SAT})$, closely matching the known lower bound of $\Omega(H\sqrt{SAT})$. Here, S, A denote the number of states and actions in the underlying Markov Decision Process (MDP), and $T=KH$ with $K$ being the number of episodes and $H$ being the planning horizon. Our work provides several new technical insights into the core challenges in combining posterior sampling with dynamic programming and TD-learning-based RL algorithms, along with novel ideas for resolving those difficulties. We hope this will form a starting point for analyzing this efficient and important algorithmic technique in even more complex RL settings.

[404] Purifying Shampoo: Investigating Shampoo’s Heuristics by Decomposing its Preconditioner

Runa Eschenhagen, Aaron Defazio, Tsung-Hsien Lee, Richard E. Turner, Hao-Jun Michael Shi

Main category: cs.LG

TL;DR: This paper analyzes Shampoo’s heuristics and proposes principled improvements by decoupling preconditioner eigenvalue/eigenbasis updates, correcting eigenvalues to eliminate grafting, and using adaptive eigenbasis computation frequency.

Details

Motivation: Shampoo's success relies on heuristics like learning rate grafting and stale preconditioning that increase complexity, require hyperparameter tuning, and lack theoretical justification.

Method: Decouples preconditioner’s eigenvalues and eigenbasis updates, corrects eigenvalues directly, and proposes adaptive criterion for eigenbasis computation frequency using warm-started QR algorithm.

Result: Shows that grafting from Adam mitigates preconditioner staleness and mis-scaling, and that eigenvalue correction eliminates need for learning rate grafting.

Conclusion: The proposed techniques provide a principled approach to remove Shampoo’s heuristics and develop improved Kronecker-factorization-based training algorithms.

Abstract: The recent success of Shampoo in the AlgoPerf contest has sparked renewed interest in Kronecker-factorization-based optimization algorithms for training neural networks. Despite its success, Shampoo relies heavily on several heuristics such as learning rate grafting and stale preconditioning to achieve performance at-scale. These heuristics increase algorithmic complexity, necessitate further hyperparameter tuning, and lack theoretical justification. This paper investigates these heuristics from the angle of Frobenius norm approximation to full-matrix Adam and decouples the preconditioner’s eigenvalues and eigenbasis updates. We show that grafting from Adam mitigates the staleness and mis-scaling of the preconditioner’s eigenvalues and how correcting the eigenvalues directly eliminates the need for learning rate grafting. To manage the error induced by infrequent eigenbasis computations, we propose an adaptive criterion for determining the eigenbasis computation frequency motivated by terminating a warm-started QR algorithm. This criterion decouples the update frequency of different preconditioner matrices and enables us to investigate the impact of approximation error on convergence. These practical techniques offer a principled angle towards removing Shampoo’s heuristics and developing improved Kronecker-factorization-based training algorithms.

[405] Stochastic Momentum Methods for Non-smooth Non-Convex Finite-Sum Coupled Compositional Optimization

Xingyu Chen, Bokun Wang, Ming Yang, Qihang Lin, Tianbao Yang

Main category: cs.LG

TL;DR: The paper proposes stochastic momentum methods for non-smooth finite-sum coupled compositional optimization (FCCO) problems, achieving improved iteration complexity of O(1/ε⁵) and applying them to constrained optimization with functional inequality constraints.

Details

Motivation: Existing methods for non-convex non-smooth FCCO face high iteration complexity (O(1/ε⁶)) and use vanilla SGD-type updates unsuitable for deep learning. The paper aims to address these limitations.

Method: Proposed stochastic momentum methods tailored for non-smooth FCCO, with provable convergence guarantees. Applied to constrained optimization by optimizing a smoothed hinge penalty formulation.

Result: Achieved new state-of-the-art iteration complexity of O(1/ε⁵) for FCCO and for finding (nearly) ε-level KKT solutions in constrained optimization. Experiments on three tasks demonstrated effectiveness.

Conclusion: The proposed stochastic momentum methods significantly improve iteration complexity for non-smooth FCCO problems and provide efficient solutions for constrained optimization with functional inequality constraints.

Abstract: Finite-sum Coupled Compositional Optimization (FCCO), characterized by its coupled compositional objective structure, emerges as an important optimization paradigm for addressing a wide range of machine learning problems. In this paper, we focus on a challenging class of non-convex non-smooth FCCO, where the outer functions are non-smooth weakly convex or convex and the inner functions are smooth or weakly convex. Existing state-of-the-art result face two key limitations: (1) a high iteration complexity of $O(1/\epsilon^6)$ under the assumption that the stochastic inner functions are Lipschitz continuous in expectation; (2) reliance on vanilla SGD-type updates, which are not suitable for deep learning applications. Our main contributions are two fold: (i) We propose stochastic momentum methods tailored for non-smooth FCCO that come with provable convergence guarantees; (ii) We establish a new state-of-the-art iteration complexity of $O(1/\epsilon^5)$. Moreover, we apply our algorithms to multiple inequality constrained non-convex optimization problems involving smooth or weakly convex functional inequality constraints. By optimizing a smoothed hinge penalty based formulation, we achieve a new state-of-the-art complexity of $O(1/\epsilon^5)$ for finding an (nearly) $\epsilon$-level KKT solution. Experiments on three tasks demonstrate the effectiveness of the proposed algorithms.

[406] Learning single-index models via harmonic decomposition

Nirmit Joshi, Hugo Koubbi, Theodor Misiakiewicz, Nathan Srebro

Main category: cs.LG

TL;DR: The paper proposes using spherical harmonics instead of Hermite polynomials to analyze single-index models, revealing that rotational symmetry is key to understanding learning complexity under spherically symmetric input distributions.

Details

Motivation: Prior work used Hermite polynomials under Gaussian inputs, but the authors argue spherical harmonics better capture the intrinsic rotational symmetry of single-index models, enabling analysis of arbitrary spherically symmetric distributions.

Method: Proposed two estimator families: one based on tensor unfolding for optimal sample complexity, and another using online SGD for optimal runtime. Analyzed learning complexity through spherical harmonics framework.

Result: Characterized complexity of learning single-index models under spherically symmetric inputs. For Gaussian inputs, recovered existing results while revealing new previously overlooked phenomena. Showed estimators achieving both optimal sample complexity and runtime may not exist.

Conclusion: Spherical harmonics provide the natural basis for single-index models due to rotational symmetry. The framework enables comprehensive complexity analysis across different spherically symmetric distributions, with trade-offs between statistical and computational efficiency.

Abstract: We study the problem of learning single-index models, where the label $y \in \mathbb{R}$ depends on the input $\boldsymbol{x} \in \mathbb{R}^d$ only through an unknown one-dimensional projection $\langle \boldsymbol{w}*,\boldsymbol{x}\rangle$. Prior work has shown that under Gaussian inputs, the statistical and computational complexity of recovering $\boldsymbol{w}*$ is governed by the Hermite expansion of the link function. In this paper, we propose a new perspective: we argue that $spherical$ $harmonics$ – rather than $Hermite$ $polynomials$ – provide the natural basis for this problem, as they capture its intrinsic $rotational$ $symmetry$. Building on this insight, we characterize the complexity of learning single-index models under arbitrary spherically symmetric input distributions. We introduce two families of estimators – based on tensor unfolding and online SGD – that respectively achieve either optimal sample complexity or optimal runtime, and argue that estimators achieving both may not exist in general. When specialized to Gaussian inputs, our theory not only recovers and clarifies existing results but also reveals new phenomena that had previously been overlooked.

[407] TabArena: A Living Benchmark for Machine Learning on Tabular Data

Nick Erickson, Lennart Purucker, Andrej Tschalzev, David Holzmüller, Prateek Mutalik Desai, David Salinas, Frank Hutter

Main category: cs.LG

TL;DR: TabArena is introduced as the first continuously maintained living tabular benchmarking system to address the limitations of static benchmarks in deep learning for tabular data.

Details

Motivation: Current tabular benchmarks are static and not updated when flaws are discovered, models are updated, or new models are released, creating a need for a continuously maintained benchmarking system.

Method: Manually curated representative datasets and well-implemented models, conducted large-scale benchmarking to initialize a public leaderboard, and assembled a team of experienced maintainers with reproducible code and maintenance protocols.

Result: Gradient-boosted trees remain strong contenders, deep learning methods catch up with larger time budgets and ensembling, foundation models excel on smaller datasets, and cross-model ensembles advance state-of-the-art, though some deep learning models are overrepresented due to validation set overfitting.

Conclusion: TabArena provides a living benchmark with public leaderboard to continuously evaluate tabular machine learning models, encouraging model developers to address validation set overfitting issues in cross-model ensembles.

Abstract: With the growing popularity of deep learning and foundation models for tabular data, the need for standardized and reliable benchmarks is higher than ever. However, current benchmarks are static. Their design is not updated even if flaws are discovered, model versions are updated, or new models are released. To address this, we introduce TabArena, the first continuously maintained living tabular benchmarking system. To launch TabArena, we manually curate a representative collection of datasets and well-implemented models, conduct a large-scale benchmarking study to initialize a public leaderboard, and assemble a team of experienced maintainers. Our results highlight the influence of validation method and ensembling of hyperparameter configurations to benchmark models at their full potential. While gradient-boosted trees are still strong contenders on practical tabular datasets, we observe that deep learning methods have caught up under larger time budgets with ensembling. At the same time, foundation models excel on smaller datasets. Finally, we show that ensembles across models advance the state-of-the-art in tabular machine learning. We observe that some deep learning models are overrepresented in cross-model ensembles due to validation set overfitting, and we encourage model developers to address this issue. We launch TabArena with a public leaderboard, reproducible code, and maintenance protocols to create a living benchmark available at https://tabarena.ai.

[408] Path-specific effects for pulse-oximetry guided decisions in critical care

Kevin Zhang, Yonghan Jung, Divyat Mahajan, Karthikeyan Shanmugam, Shalmali Joshi

Main category: cs.LG

TL;DR: This paper uses causal inference methods to investigate racial bias in pulse oximeter readings and its impact on invasive ventilation decisions in ICU settings, finding minimal effect on ventilation rates but more pronounced impacts on ventilation duration.

Details

Motivation: To address racial disparities in healthcare, specifically inaccurate pulse oximeter readings that overestimate oxygen saturation for dark-skinned patients, and to move beyond statistical correlations to establish causal relationships in clinical decision-making.

Method: Employed causal inference with path-specific effects to isolate racial bias impact, used doubly robust estimator with a self-normalized variant for improved efficiency, and provided finite-sample guarantees. Validated on semi-synthetic data and applied to MIMIC-IV and eICU datasets.

Result: Contrary to prior work, found minimal impact of racial discrepancies on invasive ventilation rates, but path-specific effects mediated by oxygen saturation disparity showed more pronounced impact on ventilation duration, with severity varying by dataset.

Conclusion: Provides a novel pipeline for investigating clinical decision-making disparities and highlights the necessity of causal methods for robust fairness assessment in healthcare.

Abstract: Identifying and measuring biases associated with sensitive attributes is a crucial consideration in healthcare to prevent treatment disparities. One prominent issue is inaccurate pulse oximeter readings, which tend to overestimate oxygen saturation for dark-skinned patients and misrepresent supplemental oxygen needs. Most existing research has revealed statistical disparities linking device measurement errors to patient outcomes in intensive care units (ICUs) without causal formalization. This study causally investigates how racial discrepancies in oximetry measurements affect invasive ventilation in ICU settings. We employ a causal inference-based approach using path-specific effects to isolate the impact of bias by race on clinical decision-making. To estimate these effects, we leverage a doubly robust estimator, propose its self-normalized variant for improved sample efficiency, and provide novel finite-sample guarantees. Our methodology is validated on semi-synthetic data and applied to two large real-world health datasets: MIMIC-IV and eICU. Contrary to prior work, our analysis reveals minimal impact of racial discrepancies on invasive ventilation rates. However, path-specific effects mediated by oxygen saturation disparity are more pronounced on ventilation duration, and the severity differs by dataset. Our work provides a novel pipeline for investigating potential disparities in clinical decision-making and, more importantly, highlights the necessity of causal methods to robustly assess fairness in healthcare.

[409] Jailbreak Transferability Emerges from Shared Representations

Rico Angell, Jannik Brinkmann, He He

Main category: cs.LG

TL;DR: Jailbreak transferability between models stems from shared representations rather than training flaws, with transfer systematically shaped by representational similarity and source jailbreak strength.

Details

Motivation: To understand why adversarial jailbreak attacks transfer between different models - whether it's due to safety training artifacts, model family similarities, or fundamental representation learning properties.

Method: Analyzed 20 open-weight models and 33 jailbreak attacks, measuring representational similarity under benign prompts and jailbreak strength. Conducted causal experiments using benign-only distillation to deliberately increase similarity and observe transfer effects.

Result: Found two systematic factors driving transfer: representational similarity under benign prompts and source jailbreak strength. Persona-style jailbreaks transfer more often than cipher-based prompts. Deliberately increasing similarity through distillation causally increases transfer.

Conclusion: Jailbreak transfer is a consequence of representation alignment rather than a fragile byproduct of safety training, with natural-language attacks exploiting shared representation spaces while cipher-based attacks rely on idiosyncratic quirks.

Abstract: Jailbreak transferability is the surprising phenomenon when an adversarial attack compromising one model also elicits harmful responses from other models. Despite widespread demonstrations, there is little consensus on why transfer is possible: is it a quirk of safety training, an artifact of model families, or a more fundamental property of representation learning? We present evidence that transferability emerges from shared representations rather than incidental flaws. Across 20 open-weight models and 33 jailbreak attacks, we find two factors that systematically shape transfer: (1) representational similarity under benign prompts, and (2) the strength of the jailbreak on the source model. To move beyond correlation, we show that deliberately increasing similarity through benign only distillation causally increases transfer. Our qualitative analyses reveal systematic transferability patterns across different types of jailbreaks. For example, persona-style jailbreaks transfer far more often than cipher-based prompts, consistent with the idea that natural-language attacks exploit models’ shared representation space, whereas cipher-based attacks rely on idiosyncratic quirks that do not generalize. Together, these results reframe jailbreak transfer as a consequence of representation alignment rather than a fragile byproduct of safety training.

[410] Bohdi: Heterogeneous LLM Fusion with Automatic Data Exploration

Junqi Gao, Zhichang Guo, Dazhi Zhang, Dong Li, Runze Liu, Pengfei Li, Kai Tian, Biqing Qi

Main category: cs.LG

TL;DR: Bohdi is a synthetic-data-only heterogeneous LLM fusion framework that uses hierarchical domain organization and adaptive sampling to overcome limitations of existing methods, achieving better performance and eliminating capability imbalance.

Details

Motivation: Existing heterogeneous LLM fusion methods suffer from reliance on limited real data and fixed domain allocation proportions, preventing comprehensive knowledge acquisition and causing capability imbalances across domains.

Method: Organizes knowledge domains into hierarchical tree structure, uses multi-model collaboration for automatic domain exploration and data generation, formalizes domain expansion as Hierarchical Multi-Armed Bandit problem, and employs DynaBranches mechanism with Introspection-Rebirth for adaptive sampling based on performance feedback.

Result: Significantly outperforms existing baselines on multiple target LLMs, exhibits higher data efficiency, and virtually eliminates capability imbalance across domains.

Conclusion: Bohdi provides an effective synthetic-data-only framework for heterogeneous LLM fusion that dynamically adapts to target LLM’s capabilities and achieves balanced performance across diverse domains.

Abstract: Heterogeneous Large Language Model (LLM) fusion integrates the strengths of multiple source LLMs with different architectures into a target LLM with low computational overhead. While promising, existing methods suffer from two major limitations: 1) reliance on real data from limited domain for knowledge fusion, preventing the target LLM from fully acquiring knowledge across diverse domains, and 2) fixed data allocation proportions across domains, failing to dynamically adjust according to the target LLM’s varying capabilities across domains, leading to a capability imbalance. To overcome these limitations, we propose Bohdi, a synthetic-data-only heterogeneous LLM fusion framework. Through the organization of knowledge domains into a hierarchical tree structure, Bohdi enables automatic domain exploration and multi-domain data generation through multi-model collaboration, thereby comprehensively extracting knowledge from source LLMs. By formalizing domain expansion and data sampling proportion allocation on the knowledge tree as a Hierarchical Multi-Armed Bandit problem, Bohdi leverages the designed DynaBranches mechanism to adaptively adjust sampling proportions based on the target LLM’s performance feedback across domains. Integrated with our proposed Introspection-Rebirth (IR) mechanism, DynaBranches dynamically tracks capability shifts during target LLM’s updates via Sliding Window Binomial Likelihood Ratio Testing (SWBLRT), further enhancing its online adaptation capability. Comparative experimental results on a comprehensive suite of benchmarks demonstrate that Bohdi significantly outperforms existing baselines on multiple target LLMs, exhibits higher data efficiency, and virtually eliminates the imbalance in the target LLM’s capabilities. Our code is available at https://github.com/gjq100/Bohdi.git.

[411] Mesh-Informed Neural Operator : A Transformer Generative Approach

Yaozhong Shi, Zachary E. Ross, Domniki Asimaki, Kamyar Azizzadenesheli

Main category: cs.LG

TL;DR: Introduces Mesh-Informed Neural Operator (MINO) to overcome limitations of Fourier Neural Operator in functional generative models, enabling domain- and discretization-agnostic modeling on irregular grids and diverse domains.

Details

Motivation: Current functional generative models rely heavily on Fourier Neural Operator, which limits applicability to regular grids and rectangular domains, restricting their use in diverse scientific and engineering applications.

Method: Develops MINO using graph neural operators and cross-attention mechanisms to create a principled, domain- and discretization-agnostic backbone for generative modeling in function spaces.

Result: MINO significantly expands the scope of functional generative models to more diverse applications in generative, inverse, and regression tasks, and provides a unified perspective for integrating neural operators with advanced deep learning architectures.

Conclusion: The paper presents MINO as a breakthrough that overcomes critical limitations of existing approaches and introduces standardized evaluation metrics to enable objective comparison of functional generative models.

Abstract: Generative models in function spaces, situated at the intersection of generative modeling and operator learning, are attracting increasing attention due to their immense potential in diverse scientific and engineering applications. While functional generative models are theoretically domain- and discretization-agnostic, current implementations heavily rely on the Fourier Neural Operator (FNO), limiting their applicability to regular grids and rectangular domains. To overcome these critical limitations, we introduce the Mesh-Informed Neural Operator (MINO). By leveraging graph neural operators and cross-attention mechanisms, MINO offers a principled, domain- and discretization-agnostic backbone for generative modeling in function spaces. This advancement significantly expands the scope of such models to more diverse applications in generative, inverse, and regression tasks. Furthermore, MINO provides a unified perspective on integrating neural operators with general advanced deep learning architectures. Finally, we introduce a suite of standardized evaluation metrics that enable objective comparison of functional generative models, addressing another critical gap in the field.

[412] SMMILE: An Expert-Driven Benchmark for Multimodal Medical In-Context Learning

Melanie Rieff, Maya Varma, Ossian Rabow, Subathra Adithan, Julie Kim, Ken Chang, Hannah Lee, Nidhi Rohatgi, Christian Bluethgen, Mohamed S. Muneer, Jean-Benoit Delbrouck, Michael Moor

Main category: cs.LG

TL;DR: SMMILE is the first expert-driven multimodal in-context learning benchmark for medical tasks, revealing that current multimodal LLMs have limited ability to learn from medical context examples and are susceptible to irrelevant examples and recency bias.

Details

Motivation: Multimodal in-context learning remains underexplored in medicine despite clinicians' routine need to adapt from limited examples. Current MLLMs show advances in medical VQA but their ability to learn multimodal tasks from context is unknown.

Method: Created SMMILE benchmark with 111 problems (517 question-image-answer triplets) across 6 medical specialties and 13 imaging modalities, curated by 11 medical experts. Also developed SMMILE++ with 1038 permuted problems. Evaluated 15 MLLMs on their multimodal ICL ability.

Result: Most MLLMs show moderate to poor multimodal ICL ability. ICL provides only 8-9.4% improvement over zero-shot. Models are susceptible to irrelevant examples (single noisy example degrades performance by up to 9.5%) and recency bias (placing relevant example last improves performance by up to 71%).

Conclusion: Current MLLMs have critical limitations and biases when learning multimodal medical tasks from context, highlighting the need for improved multimodal in-context learning capabilities in medical AI systems.

Abstract: Multimodal in-context learning (ICL) remains underexplored despite significant potential for domains such as medicine. Clinicians routinely encounter diverse, specialized tasks requiring adaptation from limited examples, such as drawing insights from a few relevant prior cases or considering a constrained set of differential diagnoses. While multimodal large language models (MLLMs) have shown advances in medical visual question answering (VQA), their ability to learn multimodal tasks from context is largely unknown. We introduce SMMILE, the first expert-driven multimodal ICL benchmark for medical tasks. Eleven medical experts curated problems, each including a multimodal query and multimodal in-context examples as task demonstrations. SMMILE encompasses 111 problems (517 question-image-answer triplets) covering 6 medical specialties and 13 imaging modalities. We further introduce SMMILE++, an augmented variant with 1038 permuted problems. A comprehensive evaluation of 15 MLLMs demonstrates that most models exhibit moderate to poor multimodal ICL ability in medical tasks. In open-ended evaluations, ICL contributes only an 8% average improvement over zero-shot on SMMILE and 9.4% on SMMILE++. We observe a susceptibility for irrelevant in-context examples: even a single noisy or irrelevant example can degrade performance by up to 9.5%. Moreover, we observe that MLLMs are affected by a recency bias, where placing the most relevant example last can lead to substantial performance improvements of up to 71%. Our findings highlight critical limitations and biases in current MLLMs when learning multimodal medical tasks from context. SMMILE is available at https://smmile-benchmark.github.io.

[413] Exploring the In-Context Learning Capabilities of LLMs for Money Laundering Detection in Financial Graphs

Erfan Pirmorad

Main category: cs.LG

TL;DR: LLMs can effectively reason over financial knowledge graphs for anti-money laundering by analyzing localized subgraphs and providing explainable suspiciousness assessments.

Details

Motivation: Money laundering involves complex interconnected entities that require graph-based reasoning, and LLMs offer potential for explainable financial crime analysis.

Method: A lightweight pipeline that extracts k-hop neighborhoods from financial knowledge graphs, serializes them into structured text, and uses few-shot in-context learning with LLMs to assess suspiciousness.

Result: LLMs successfully emulate analyst-style reasoning, identify red flags, and generate coherent explanations for suspicious activities in synthetic AML scenarios.

Conclusion: LLM-based graph reasoning shows promise for explainable financial crime analytics, though this remains an exploratory study that lays groundwork for future applications.

Abstract: The complexity and interconnectivity of entities involved in money laundering demand investigative reasoning over graph-structured data. This paper explores the use of large language models (LLMs) as reasoning engines over localized subgraphs extracted from a financial knowledge graph. We propose a lightweight pipeline that retrieves k-hop neighborhoods around entities of interest, serializes them into structured text, and prompts an LLM via few-shot in-context learning to assess suspiciousness and generate justifications. Using synthetic anti-money laundering (AML) scenarios that reflect common laundering behaviors, we show that LLMs can emulate analyst-style logic, highlight red flags, and provide coherent explanations. While this study is exploratory, it illustrates the potential of LLM-based graph reasoning in AML and lays groundwork for explainable, language-driven financial crime analytics.

[414] Tensor Decomposition Networks for Fast Machine Learning Interatomic Potential Computations

Yuchao Lin, Cong Fu, Zachary Krueger, Haiyang Yu, Maho Nakata, Jianwen Xie, Emine Kucukbenli, Xiaofeng Qian, Shuiwang Ji

Main category: cs.LG

TL;DR: TDNs replace computationally expensive Clebsch-Gordan tensor products in SO(3)-equivariant networks with low-rank tensor decompositions, achieving dramatic speedup while maintaining competitive performance on molecular datasets.

Details

Motivation: Clebsch-Gordan tensor products in SO(3)-equivariant networks are computationally expensive, limiting their efficiency for machine learning interatomic potentials.

Method: Develop tensor decomposition networks (TDNs) using low-rank tensor decompositions like CP decomposition, with path-weight sharing to reduce parameters while maintaining equivariance.

Result: TDNs reduce computational complexity from O(L^6) to O(L^4), achieve competitive performance on PubChemQCR, OC20, and OC22 datasets with dramatic speedup.

Conclusion: TDNs provide an efficient plug-and-play replacement for tensor products in existing networks, enabling faster computations without compromising performance.

Abstract: $\rm{SO}(3)$-equivariant networks are the dominant models for machine learning interatomic potentials (MLIPs). The key operation of such networks is the Clebsch-Gordan (CG) tensor product, which is computationally expensive. To accelerate the computation, we develop tensor decomposition networks (TDNs) as a class of approximately equivariant networks in which CG tensor products are replaced by low-rank tensor decompositions, such as the CANDECOMP/PARAFAC (CP) decomposition. With the CP decomposition, we prove (i) a uniform bound on the induced error of $\rm{SO}(3)$-equivariance, and (ii) the universality of approximating any equivariant bilinear map. To further reduce the number of parameters, we propose path-weight sharing that ties all multiplicity-space weights across the $\mathcal{O}(L^3)$ CG paths into a single path without compromising equivariance, where $L$ is the maximum angular degree. The resulting layer acts as a plug-and-play replacement for tensor products in existing networks, and the computational complexity of tensor products is reduced from $\mathcal{O}(L^6)$ to $\mathcal{O}(L^4)$. We evaluate TDNs on PubChemQCR, a newly curated molecular relaxation dataset containing 105 million DFT-calculated snapshots. We also use existing datasets, including OC20, and OC22. Results show that TDNs achieve competitive performance with dramatic speedup in computations. Our code is publicly available as part of the AIRS library (\href{https://github.com/divelab/AIRS/tree/main/OpenMol/TDN}{https://github.com/divelab/AIRS/}).

[415] The Price equation reveals a universal force-metric-bias law of algorithmic learning and natural selection

Steven A. Frank

Main category: cs.LG

TL;DR: The paper presents a universal force-metric-bias (FMB) law that unifies diverse learning algorithms, optimization methods, and natural selection through the Price equation framework.

Details

Motivation: To reveal the common mathematical structure shared by seemingly different learning algorithms, optimization methods, and natural selection processes.

Method: Uses the Price equation to partition change into a force-metric-bias framework: Δθ = Mf + b + ξ, where force drives improvement, metric rescales movement, bias adds momentum, and noise enables exploration.

Result: Shows that natural selection, Bayesian updating, Newton’s method, stochastic gradient descent, stochastic Langevin dynamics, Adam optimization, and most other algorithms are special cases of this universal FMB law.

Conclusion: The FMB law provides a principled foundation for understanding, comparing, and designing learning algorithms across disciplines by exposing their common underlying structure.

Abstract: Diverse learning algorithms, optimization methods, and natural selection share a common mathematical structure, despite their apparent differences. Here I show that a simple notational partitioning of change by the Price equation reveals a universal force-metric-bias (FMB) law: $\Delta\mathbf{\theta} = \mathbf{M},\mathbf{f} + \mathbf{b} + \mathbf{\xi}$. The force $\mathbf{f}$ drives improvement in parameters, $\Delta\mathbf{\theta}$, in proportion to the slope of performance with respect to the parameters. The metric $\mathbf{M}$ rescales movement by inverse curvature. The bias $\mathbf{b}$ adds momentum or changes in the frame of reference. The noise $\mathbf{\xi}$ enables exploration. This framework unifies natural selection, Bayesian updating, Newton’s method, stochastic gradient descent, stochastic Langevin dynamics, Adam optimization, and most other algorithms as special cases of the same underlying process. The Price equation also reveals why Fisher information, Kullback-Leibler divergence, and d’Alembert’s principle arise naturally in learning dynamics. By exposing this common structure, the FMB law provides a principled foundation for understanding, comparing, and designing learning algorithms across disciplines.

[416] Privacy-Preserving Personalization in Education: A Federated Recommender System for Student Performance Prediction

Rodrigo Tertulino, Ricardo Almeida

Main category: cs.LG

TL;DR: A privacy-preserving recommender system using Federated Learning achieves 92% of centralized model performance while protecting student data privacy.

Details

Motivation: Address the conflict between data-driven personalization in education and student data privacy requirements under modern regulations.

Method: Used Federated Learning with Deep Neural Network on ASSISTments dataset, comparing FedProx vs FedAvg aggregation strategies.

Result: FedProx proved more stable than FedAvg, achieving 76.28% F1-Score (92% of centralized XGBoost performance).

Conclusion: Federated Learning provides an effective solution to the personalization-privacy dilemma in educational platforms.

Abstract: The increasing digitalization of education presents unprecedented opportunities for data-driven personalization, but it also introduces significant challenges to student data privacy. Conventional recommender systems rely on centralized data, a paradigm often incompatible with modern data protection regulations. A novel privacy-preserving recommender system is proposed and evaluated to address this critical issue using Federated Learning (FL). The approach utilizes a Deep Neural Network (DNN) with rich, engineered features from the large-scale ASSISTments educational dataset. A rigorous comparative analysis of federated aggregation strategies was conducted, identifying FedProx as a significantly more stable and effective method for handling heterogeneous student data than the standard FedAvg baseline. The optimized federated model achieves a high-performance F1-Score of 76.28%, corresponding to 92% of the performance of a powerful, centralized XGBoost model. These findings validate that a federated approach can provide highly effective content recommendations without centralizing sensitive student data. Consequently, our work presents a viable and robust solution to the personalization-privacy dilemma in modern educational platforms.

[417] GnnXemplar: Exemplars to Explanations – Natural Language Rules for Global GNN Interpretability

Burouj Armgaan, Eshan Jain, Harsh Pandey, Mahesh Chandran, Sayan Ranu

Main category: cs.LG

TL;DR: GnnXemplar is a novel global explainer for GNNs that identifies representative nodes (exemplars) in the embedding space and generates natural language rules from their neighborhoods using LLMs, outperforming existing methods in fidelity, scalability, and human interpretability.

Details

Motivation: GNNs are widely used but their opaque decision-making limits trust. While local explanations exist, global explanation methods that characterize entire classes remain underdeveloped, especially for large real-world graphs where subgraph repetition is rare and node attributes are high-dimensional.

Method: GnnXemplar identifies exemplars (representative nodes) in the GNN embedding space using a coverage maximization problem over reverse k-nearest neighbors with efficient greedy approximation. It then generates interpretable natural language rules using a self-refining prompt strategy with LLMs.

Result: Experiments across diverse benchmarks show GnnXemplar significantly outperforms existing methods in fidelity, scalability, and human interpretability, as validated by a user study with 60 participants.

Conclusion: GnnXemplar provides an effective global explanation framework for GNNs that addresses limitations of existing methods and demonstrates superior performance in real-world settings through its exemplar-based approach and LLM-powered rule generation.

Abstract: Graph Neural Networks (GNNs) are widely used for node classification, yet their opaque decision-making limits trust and adoption. While local explanations offer insights into individual predictions, global explanation methods, those that characterize an entire class, remain underdeveloped. Existing global explainers rely on motif discovery in small graphs, an approach that breaks down in large, real-world settings where subgraph repetition is rare, node attributes are high-dimensional, and predictions arise from complex structure-attribute interactions. We propose GnnXemplar, a novel global explainer inspired from Exemplar Theory from cognitive science. GnnXemplar identifies representative nodes in the GNN embedding space, exemplars, and explains predictions using natural language rules derived from their neighborhoods. Exemplar selection is framed as a coverage maximization problem over reverse k-nearest neighbors, for which we provide an efficient greedy approximation. To derive interpretable rules, we employ a self-refining prompt strategy using large language models (LLMs). Experiments across diverse benchmarks show that GnnXemplar significantly outperforms existing methods in fidelity, scalability, and human interpretability, as validated by a user study with 60 participants.

[418] Graph Mixing Additive Networks

Maya Bechler-Speicher, Andrea Zerio, Maor Huri, Marie Vibeke Vestergaard, Ran Gilad-Bachrach, Tine Jess, Samir Bhatt, Aleksejs Sazonovs

Main category: cs.LG

TL;DR: GMAN extends Graph Neural Additive Networks to learn from sparse time-series data by representing trajectories as directed graphs, providing flexible interpretability-expressivity trade-off control.

Details

Motivation: To create an interpretable framework for learning from sets of sparse time-series data while maintaining competitive performance against black-box models.

Method: Represents time-dependent trajectories as directed graphs and applies enriched Graph Neural Additive Networks to each graph, allowing feature grouping and multi-level interpretability.

Result: Outperforms strong non-interpretable black-box baselines on real-world datasets including mortality prediction from blood tests and fake-news detection.

Conclusion: GMAN provides actionable, domain-aligned explanations while achieving superior performance compared to black-box models, demonstrating the viability of interpretable frameworks for complex time-series tasks.

Abstract: We introduce GMAN, a flexible, interpretable, and expressive framework that extends Graph Neural Additive Networks (GNANs) to learn from sets of sparse time-series data. GMAN represents each time-dependent trajectory as a directed graph and applies an enriched, more expressive GNAN to each graph. It allows users to control the interpretability-expressivity trade-off by grouping features and graphs to encode priors, and it provides feature, node, and graph-level interpretability. On real-world datasets, including mortality prediction from blood tests and fake-news detection, GMAN outperforms strong non-interpretable black-box baselines while delivering actionable, domain-aligned explanations.

[419] Lift What You Can: Green Online Learning with Heterogeneous Ensembles

Kirsten Köbschall, Sebastian Buschjäger, Raphael Fischer, Lisa Hartung, Stefan Kramer

Main category: cs.LG

TL;DR: HEROS is a heterogeneous online ensemble method that selects subsets of models for training under resource constraints, balancing predictive performance with sustainability through a novel ζ-policy that achieves near-optimal performance with fewer resources.

Details

Motivation: Current ensemble methods for stream mining focus too much on predictive capabilities while ignoring computational expenses and sustainability concerns, creating a need for more resource-efficient approaches.

Method: HEROS uses a Markov decision process framework to model trade-offs between performance and sustainability. It selects subsets of models from a diverse pool under resource constraints, with the novel ζ-policy focusing on training near-optimal models at reduced costs.

Result: Theoretical analysis proves ζ-policy achieves near-optimal performance with fewer resources. Experiments on 11 benchmark datasets show HEROS provides highly accurate performance (sometimes outperforming competitors) while being much more resource-friendly.

Conclusion: HEROS successfully addresses the sustainability challenge in ensemble stream mining by balancing predictive performance with computational efficiency, with the ζ-policy being a strong contribution to state-of-the-art methods.

Abstract: Ensemble methods for stream mining necessitate managing multiple models and updating them as data distributions evolve. Considering the calls for more sustainability, established methods are however not sufficiently considerate of ensemble members’ computational expenses and instead overly focus on predictive capabilities. To address these challenges and enable green online learning, we propose heterogeneous online ensembles (HEROS). For every training step, HEROS chooses a subset of models from a pool of models initialized with diverse hyperparameter choices under resource constraints to train. We introduce a Markov decision process to theoretically capture the trade-offs between predictive performance and sustainability constraints. Based on this framework, we present different policies for choosing which models to train on incoming data. Most notably, we propose the novel $\zeta$-policy, which focuses on training near-optimal models at reduced costs. Using a stochastic model, we theoretically prove that our $\zeta$-policy achieves near optimal performance while using fewer resources compared to the best performing policy. In our experiments across 11 benchmark datasets, we find empiric evidence that our $\zeta$-policy is a strong contribution to the state-of-the-art, demonstrating highly accurate performance, in some cases even outperforming competitors, and simultaneously being much more resource-friendly.

[420] FedCLF – Towards Efficient Participant Selection for Federated Learning in Heterogeneous IoV Networks

Kasun Eranda Wijethilake, Adnan Mahmood, Quan Z. Sheng

Main category: cs.LG

TL;DR: FedCLF is a federated learning approach for IoV networks that uses calibrated loss for participant selection and feedback control for dynamic sampling frequency adjustment, achieving 16% better accuracy in high heterogeneity scenarios.

Details

Motivation: FL faces challenges in IoV networks due to high data and device heterogeneity, which affects model accuracy and resource efficiency in dynamic environments.

Method: FedCLF introduces calibrated loss as utility for participant selection and a feedback control mechanism to dynamically adjust client sampling frequency.

Result: FedCLF outperforms baseline models (FedAvg, Newt, Oort) by up to 16% improvement in high data heterogeneity scenarios with improved efficiency via reduced sampling frequency.

Conclusion: FedCLF effectively addresses FL challenges in IoV networks by enhancing model accuracy under data heterogeneity and optimizing resource utilization through adaptive sampling.

Abstract: Federated Learning (FL) is a distributed machine learning technique that preserves data privacy by sharing only the trained parameters instead of the client data. This makes FL ideal for highly dynamic, heterogeneous, and time-critical applications, in particular, the Internet of Vehicles (IoV) networks. However, FL encounters considerable challenges in such networks owing to the high data and device heterogeneity. To address these challenges, we propose FedCLF, i.e., FL with Calibrated Loss and Feedback control, which introduces calibrated loss as a utility in the participant selection process and a feedback control mechanism to dynamically adjust the sampling frequency of the clients. The envisaged approach (a) enhances the overall model accuracy in case of highly heterogeneous data and (b) optimizes the resource utilization for resource constrained IoV networks, thereby leading to increased efficiency in the FL process. We evaluated FedCLF vis-`a-vis baseline models, i.e., FedAvg, Newt, and Oort, using CIFAR-10 dataset with varying data heterogeneity. Our results depict that FedCLF significantly outperforms the baseline models by up to a 16% improvement in high data heterogeneity-related scenarios with improved efficiency via reduced sampling frequency.

[421] Curiosity-driven RL for symbolic equation solving

Kevin P. O’Keeffe

Main category: cs.LG

TL;DR: RL with PPO, curiosity exploration, and graph-based actions can solve nonlinear equations including radicals, exponentials, and trig functions.

Details

Motivation: To explore if reinforcement learning can be useful for symbolic mathematics beyond simple linear equations.

Method: Used model-free PPO augmented with curiosity-based exploration and graph-based actions.

Result: Successfully solved nonlinear equations involving radicals, exponentials, and trigonometric functions.

Conclusion: Curiosity-based exploration may be useful for general symbolic reasoning tasks.

Abstract: We explore if RL can be useful for symbolic mathematics. Previous work showed contrastive learning can solve linear equations in one variable. We show model-free PPO \cite{schulman2017proximal} augmented with curiosity-based exploration and graph-based actions can solve nonlinear equations such as those involving radicals, exponentials, and trig functions. Our work suggests curiosity-based exploration may be useful for general symbolic reasoning tasks.

[422] 3D Optimization for AI Inference Scaling: Balancing Accuracy, Cost, and Latency

Minseok Jung, Abhas Ricky, Muhammad Rameez Chatni

Main category: cs.LG

TL;DR: A 3D optimization framework for AI inference scaling that jointly optimizes accuracy, cost, and latency, overcoming limitations of traditional 1D and 2D approaches.

Details

Motivation: Traditional AI inference scaling uses 1D heuristics or 2D trade-offs that fail to consider cost and latency constraints, leading to suboptimal deployment decisions.

Method: Monte Carlo simulations across three scenarios and nine simulated LLMs, evaluating four optimization methods for 3D multi-objective optimization (MOO) to enable constraints-aware inference scaling.

Result: Knee-point optimization achieves the best balance across objectives, while accuracy-maximization remains favorable when precision is prioritized. The framework captures feasible spaces that 1D/2D optimizations miss.

Conclusion: The 3D MOO framework establishes a theoretical foundation for deployment-aware inference scaling across diverse operational contexts, enabling environment-adaptive selection of inference scaling parameters.

Abstract: AI inference scaling is often tuned through 1D heuristics (a fixed reasoning passes) or 2D bivariate trade-offs (e.g., performance vs. compute), which fail to consider cost and latency constraints. We introduce a 3D optimization framework that jointly calibrates accuracy, cost, and latency within a unified decision space, enabling constraints-aware inference scaling. Using Monte Carlo simulations across three representative scenarios and nine simulated large language models, we evaluate four optimization methods to address the 3D multi-objective optimization (MOO) problem. Framing inference scaling in MOO shapes a feasible space that 1D and 2D optimizations fail to capture, enabling environmentadaptive selection of the inference scaling k. Results show that knee-point optimization achieves the best balance, while accuracy-maximization remains favorable when precision is prioritized. The framework establishes a theoretical foundation for deployment-aware inference scaling across diverse operational contexts.

[423] AL-CoLe: Augmented Lagrangian for Constrained Learning

Ignacio Boero, Ignacio Hounie, Alejandro Ribeiro

Main category: cs.LG

TL;DR: Augmented Lagrangian methods are effective for constrained machine learning problems, providing strong duality, convergence guarantees, and good performance on fairness tasks.

Details

Motivation: Lagrangian duality is popular for constrained learning, but Augmented Lagrangian methods remain underexplored despite their ability to mitigate duality gaps in non-convex settings with minimal modifications.

Method: Augmented Lagrangian methods that require only minimal modifications to address constrained learning problems in non-convex machine learning parameterizations.

Result: Established strong duality under mild conditions, proved convergence of dual ascent algorithms to feasible and optimal primal solutions, and provided PAC-style generalization guarantees.

Conclusion: Augmented Lagrangian methods are effective for constrained learning, demonstrated by strong performance on fairness constrained classification tasks with theoretical guarantees.

Abstract: Despite the non-convexity of most modern machine learning parameterizations, Lagrangian duality has become a popular tool for addressing constrained learning problems. We revisit Augmented Lagrangian methods, which aim to mitigate the duality gap in non-convex settings while requiring only minimal modifications, and have remained comparably unexplored in constrained learning settings. We establish strong duality results under mild conditions, prove convergence of dual ascent algorithms to feasible and optimal primal solutions, and provide PAC-style generalization guarantees. Finally, we demonstrate its effectiveness on fairness constrained classification tasks.

[424] What Causes Postoperative Aspiration?

Supriya Nagesh, Karina Covarrubias, Robert El-Kareh, Shiva Prasad Kasiviswanathan, Nina Mishra

Main category: cs.LG

TL;DR: Machine learning model predicts postoperative aspiration with 86% AUROC using pre-surgical data, identifying opioids and operative site as key causal factors.

Details

Motivation: Aspiration significantly impacts surgical patient morbidity and mortality, requiring predictive tools for timely preventative interventions.

Method: Used MIMIC-IV database with 826 surgical patients, trained XGBoost, MLP, and Random Forest models with pre-surgical data, and performed ATE analysis for causation.

Result: Achieved 0.86 AUROC and 77.3% sensitivity; identified maximum daily opioid dose, length of stay, and age as top predictors; ATE showed opioids (0.25) and neck/head surgery (0.20/0.19) as causal factors; men had 1.5x higher aspiration risk despite equal surgery rates.

Conclusion: ML models effectively predict aspiration risk, enabling targeted prevention; opioid dosage and operative site significantly influence risk; gender disparities in opioid administration and aspiration rates need further investigation.

Abstract: Background: Aspiration, the inhalation of foreign material into the lungs, significantly impacts surgical patient morbidity and mortality. This study develops a machine learning (ML) model to predict postoperative aspiration, enabling timely preventative interventions. Methods: From the MIMIC-IV database of over 400,000 hospital admissions, we identified 826 surgical patients (mean age: 62, 55.7% male) who experienced aspiration within seven days post-surgery, along with a matched non-aspiration cohort. Three ML models: XGBoost, Multilayer Perceptron, and Random Forest were trained using pre-surgical hospitalization data to predict postoperative aspiration. To investigate causation, we estimated Average Treatment Effects (ATE) using Augmented Inverse Probability Weighting. Results: Our ML model achieved an AUROC of 0.86 and 77.3% sensitivity on a held-out test set. Maximum daily opioid dose, length of stay, and patient age emerged as the most important predictors. ATE analysis identified significant causative factors: opioids (0.25 +/- 0.06) and operative site (neck: 0.20 +/- 0.13, head: 0.19 +/- 0.13). Despite equal surgery rates across genders, men were 1.5 times more likely to aspirate and received 27% higher maximum daily opioid dosages compared to women. Conclusion: ML models can effectively predict postoperative aspiration risk, enabling targeted preventative measures. Maximum daily opioid dosage and operative site significantly influence aspiration risk. The gender disparity in both opioid administration and aspiration rates warrants further investigation. These findings have important implications for improving postoperative care protocols and aspiration prevention strategies.

[425] TowerVision: Understanding and Improving Multilinguality in Vision-Language Models

André G. Viveiros, Patrick Fernandes, Saul Santos, Sonal Sannigrahi, Emmanouil Zaranis, Nuno M. Guerreiro, Amin Farajian, Pierre Colombo, Graham Neubig, André F. T. Martins

Main category: cs.LG

TL;DR: TowerVision is a family of open multilingual vision-language models that achieves competitive performance on multimodal multilingual benchmarks, particularly excelling in culturally grounded tasks and multimodal translation.

Details

Motivation: Most existing vision-language models follow English-centric design processes, limiting their effectiveness in multilingual settings. The authors aim to address this limitation by creating comprehensive multilingual VLMs.

Method: Built upon the multilingual text-only model Tower+, the authors analyze multilingual design choices including training data composition, encoder selection, and text backbones. They also release VisionBlocks, a high-quality curated vision-language dataset.

Result: TowerVision surpasses existing approaches trained on substantially larger datasets on benchmarks like ALM-Bench, Multi30K (image tasks) and ViMUL-Bench (video tasks). The models show particular strength in culturally grounded tasks and multimodal translation.

Conclusion: Multilingual vision-language training data substantially improves cross-lingual generalization, and instruction-tuned LLMs are not always the optimal initialization point. The authors release all models, data, and training recipes to support further research.

Abstract: Despite significant advances in vision-language models (VLMs), most existing work follows an English-centric design process, limiting their effectiveness in multilingual settings. In this work, we provide a comprehensive empirical study analyzing the impact of several multilingual design choices, such as training data composition, encoder selection, and text backbones. The result is TowerVision, a family of open multilingual VLMs for both image-text and video-text tasks, built upon the multilingual text-only model Tower+. TowerVision achieves competitive performance on multiple multimodal multilingual benchmarks and shows particular strength in culturally grounded tasks and multimodal translation. By incorporating visual and cultural context during fine-tuning, our models surpass existing approaches trained on substantially larger datasets, as demonstrated on ALM-Bench and Multi30K (image tasks) and ViMUL-Bench (video tasks). Alongside the models, we release VisionBlocks, a high-quality, curated vision-language dataset. Our findings highlight that multilingual vision-language training data substantially improves cross-lingual generalization – both from high-resource to underrepresented languages and vice versa – and that instruction-tuned LLMs are not always the optimal initialization point. To support further research, we publicly release all models, data, and training recipes.

[426] CANDI: Hybrid Discrete-Continuous Diffusion Models

Patrick Pynadath, Jiaxin Shi, Ruqi Zhang

Main category: cs.LG

TL;DR: Continuous diffusion underperforms on discrete data due to temporal dissonance between discrete corruption and continuous denoising. CANDI is proposed as a hybrid framework that decouples these mechanisms to enable effective continuous diffusion for discrete spaces.

Details

Motivation: To understand why continuous diffusion performs poorly on discrete data compared to discrete formulations, despite its success in continuous domains like image generation.

Method: Introduces token identifiability framework to analyze Gaussian noise effects on discrete data, then proposes CANDI - a hybrid framework that decouples discrete and continuous corruption mechanisms.

Result: CANDI successfully avoids temporal dissonance, enables classifier-based guidance with off-the-shelf classifiers, and outperforms masked diffusion in text generation at low NFE.

Conclusion: CANDI unlocks the benefits of continuous diffusion for discrete spaces by properly handling the temporal dissonance between discrete corruption and continuous denoising mechanisms.

Abstract: While continuous diffusion has shown remarkable success in continuous domains such as image generation, its direct application to discrete data has underperformed compared to purely discrete formulations. This gap is counterintuitive, given that continuous diffusion learns score functions that enable joint evolution across multiple positions. To understand this gap, we introduce token identifiability as an analytical framework for understanding how Gaussian noise corrupts discrete data through two mechanisms: discrete identity corruption and continuous rank degradation. We reveal that these mechanisms scale differently with vocabulary size, creating a temporal dissonance: at noise levels where discrete corruption preserves enough structure for conditional learning, continuous denoising is trivial; at noise levels where continuous denoising is meaningful, discrete corruption destroys nearly all conditional structure. To solve this, we propose CANDI (Continuous ANd DIscrete diffusion), a hybrid framework that decouples discrete and continuous corruption, enabling simultaneous learning of both conditional structure and continuous geometry. We empirically validate the temporal dissonance phenomenon and demonstrate that CANDI successfully avoids it. This unlocks the benefits of continuous diffusion for discrete spaces: on controlled generation, CANDI enables classifier-based guidance with off-the-shelf classifiers through simple gradient addition; on text generation, CANDI outperforms masked diffusion at low NFE, demonstrating the value of learning continuous gradients for discrete spaces. We include the code on the project page available here: https://patrickpynadath1.github.io/candi-lander

[427] Transformers from Compressed Representations

Juan C. Leon Alcazar, Mattia Soldan, Mohammad Saatialsoruji, Alejandro Pardo, Hani Itani, Juan Camilo Perez, Bernard Ghanem

Main category: cs.LG

TL;DR: TEMPEST is a method that uses compressed file formats for representation learning, enabling transformers to learn directly from compressed data streams without full decoding, achieving competitive accuracy with improved efficiency.

Details

Motivation: Compressed file formats are efficient for storage and transmission but their potential for representation learning remains largely unexplored. Current methods require raw byte-level processing or full media decoding.

Method: TEMPEST exploits the inherent byte-stream structure of compressed files to design tokenization and encoding strategies, allowing standard transformers to learn semantic representations directly from compressed data streams.

Result: TEMPEST substantially reduces token requirements for semantic classification, lowering computational complexity and memory usage. It achieves accuracy competitive with state-of-the-art methods while delivering efficiency gains across diverse datasets, coding schemes, and modalities.

Conclusion: The method demonstrates that compressed representations can be effectively used for representation learning, offering a more efficient alternative to traditional approaches that require full data decoding.

Abstract: Compressed file formats are the corner stone of efficient data storage and transmission, yet their potential for representation learning remains largely underexplored. We introduce TEMPEST (TransformErs froM comPressed rEpreSenTations), a method that exploits the inherent byte-stream structure of compressed files to design an effective tokenization and encoding strategy. By leveraging this compact encoding, a standard transformer can directly learn semantic representations from compressed data streams, bypassing the need for raw byte-level processing or full media decoding. Our proposal substantially reduces the number of tokens required for semantic classification, thereby lowering both computational complexity and memory usage. Through extensive experiments across diverse datasets, coding schemes, and modalities, we show that TEMPEST achieves accuracy competitive wit the state-of-the-art while delivering efficiency gains in memory and compute.

[428] Towards Scaling Deep Neural Networks with Predictive Coding: Theory and Practice

Francesco Innocenti

Main category: cs.LG

TL;DR: This thesis advances predictive coding (PC) as a brain-inspired alternative to backpropagation (BP) for training deep neural networks, addressing PC’s scaling limitations through theoretical analysis and proposing a new parameterization (μPC) that enables stable training of 100+ layer networks.

Details

Motivation: Backpropagation is energy inefficient and unlikely to be implemented by the brain, while predictive coding offers a potentially more efficient brain-inspired alternative. However, deep PCNs remain practically untrainable and their dynamics are poorly understood.

Method: Theoretical approach grounded in optimization theory: (1) analyzing PC learning dynamics as approximate trust-region method, (2) showing PC can use higher-order information for more benign learning landscapes, (3) proposing new μPC parameterization based on inference dynamics study.

Result: PC learning dynamics can be understood as approximate trust-region method using second-order information; PC can use arbitrarily higher-order information creating more robust learning landscapes; μPC enables stable training of 100+ layer networks with competitive performance on simple tasks.

Conclusion: This thesis significantly advances understanding of PCN inference and learning dynamics, but future research should focus on hardware co-design for PC to compete with BP at scale.

Abstract: Backpropagation (BP) is the standard algorithm for training the deep neural networks that power modern artificial intelligence including large language models. However, BP is energy inefficient and unlikely to be implemented by the brain. This thesis studies an alternative, potentially more efficient brain-inspired algorithm called predictive coding (PC). Unlike BP, PC networks (PCNs) perform inference by iterative equilibration of neuron activities before learning or weight updates. Recent work has suggested that this iterative inference procedure provides a range of benefits over BP, such as faster training. However, these advantages have not been consistently observed, the inference and learning dynamics of PCNs are still poorly understood, and deep PCNs remain practically untrainable. Here, we make significant progress towards scaling PCNs by taking a theoretical approach grounded in optimisation theory. First, we show that the learning dynamics of PC can be understood as an approximate trust-region method using second-order information, despite explicitly using only first-order local updates. Second, going beyond this approximation, we show that PC can in principle make use of arbitrarily higher-order information, such that for feedforward networks the effective landscape on which PC learns is far more benign and robust to vanishing gradients than the (mean squared error) loss landscape. Third, motivated by a study of the inference dynamics of PCNs, we propose a new parameterisation called “$\mu$PC”, which for the first time allows stable training of 100+ layer networks with little tuning and competitive performance on simple tasks. Overall, this thesis significantly advances our fundamental understanding of the inference and learning dynamics of PCNs, while highlighting the need for future research to focus on hardware co-design if PC is to compete with BP at scale.

[429] SGFusion: Stochastic Geographic Gradient Fusion in Federated Learning

Khoa Nguyen, Khang Tran, NhatHai Phan, Cristian Borcea, Ruoming Jin, Issa Khalil

Main category: cs.LG

TL;DR: SGFusion is a novel FL training algorithm that leverages geographic information by training separate models per zone and enabling probabilistic gradient fusion between similar zones using hierarchical random graphs.

Details

Motivation: To better leverage geographic information of mobile users in Federated Learning by adapting models to local data patterns and behaviors in different geographical zones.

Method: Maps mobile device data to geographical zones, trains one FL model per zone, models zone correlations as hierarchical random graphs optimized via MCMC sampling, and fuses gradients between zones with self-attention weights.

Result: Significantly improves model utility across all 6 countries tested, converges with upper-bounded expected errors, and maintains system scalability without notable computational cost increase.

Conclusion: SGFusion effectively enables knowledge sharing between geographical zones through probabilistic gradient fusion, achieving superior performance while preserving computational efficiency in mobile FL systems.

Abstract: This paper proposes Stochastic Geographic Gradient Fusion (SGFusion), a novel training algorithm to leverage the geographic information of mobile users in Federated Learning (FL). SGFusion maps the data collected by mobile devices onto geographical zones and trains one FL model per zone, which adapts well to the data and behaviors of users in that zone. SGFusion models the local data-based correlation among geographical zones as a hierarchical random graph (HRG) optimized by Markov Chain Monte Carlo sampling. At each training step, every zone fuses its local gradient with gradients derived from a small set of other zones sampled from the HRG. This approach enables knowledge fusion and sharing among geographical zones in a probabilistic and stochastic gradient fusion process with self-attention weights, such that “more similar” zones have “higher probabilities” of sharing gradients with “larger attention weights.” SGFusion remarkably improves model utility without introducing undue computational cost. Extensive theoretical and empirical results using a heart-rate prediction dataset collected across 6 countries show that models trained with SGFusion converge with upper-bounded expected errors and significantly improve utility in all countries compared to existing approaches without notable cost in system scalability.

[430] Group Interventions on Deep Networks for Causal Discovery in Subsystems

Wasim Ahmad, Joachim Denzler, Maha Shadaydeh

Main category: cs.LG

TL;DR: gCDMI is a novel multi-group causal discovery method that uses group-level interventions on trained deep neural networks and model invariance testing to identify causal relationships among variable groups in nonlinear multivariate time series.

Details

Motivation: Most existing causal discovery methods focus only on pairwise cause-effect relationships, overlooking interactions among groups of variables and their collective causal influence in complex systems.

Method: Three-step approach: 1) Use deep learning to model structural relationships among time series groups, 2) Apply group-wise interventions to trained model, 3) Conduct model invariance testing to determine causal links among variable groups.

Result: Superior performance in identifying group-level causal relationships on simulated datasets compared to existing methods. Successful validation on real-world datasets including brain networks and climate ecosystems.

Conclusion: Group-level interventions on deep learning models combined with invariance testing can effectively reveal complex causal structures, providing valuable insights for neuroscience and climate science applications.

Abstract: Causal discovery uncovers complex relationships between variables, enhancing predictions, decision-making, and insights into real-world systems, especially in nonlinear multivariate time series. However, most existing methods primarily focus on pairwise cause-effect relationships, overlooking interactions among groups of variables, i.e., subsystems and their collective causal influence. In this study, we introduce gCDMI, a novel multi-group causal discovery method that leverages group-level interventions on trained deep neural networks and employs model invariance testing to infer causal relationships. Our approach involves three key steps. First, we use deep learning to jointly model the structural relationships among groups of all time series. Second, we apply group-wise interventions to the trained model. Finally, we conduct model invariance testing to determine the presence of causal links among variable groups. We evaluate our method on simulated datasets, demonstrating its superior performance in identifying group-level causal relationships compared to existing methods. Additionally, we validate our approach on real-world datasets, including brain networks and climate ecosystems. Our results highlight that applying group-level interventions to deep learning models, combined with invariance testing, can effectively reveal complex causal structures, offering valuable insights for domains such as neuroscience and climate science.

[431] Differential Privacy as a Perk: Federated Learning over Multiple-Access Fading Channels with a Multi-Antenna Base Station

Hao Liang, Haifeng Wen, Kaishun Wu, Hong Xing

Main category: cs.LG

TL;DR: This paper demonstrates that differential privacy (DP) can be achieved in over-the-air federated learning (AirFL) without artificial noise injection, contrary to prior beliefs, by leveraging inherent channel impairments as a natural privacy mechanism.

Details

Motivation: Prior works assumed artificial noise must be injected to ensure DP in AirFL, but this study aims to show that channel impairments alone can provide DP as a 'perk' without compromising training performance.

Method: The authors derive a novel convergent DP bound under general bounded-domain assumptions, analyze convergence with smooth non-convex loss functions, and optimize receive beamforming and power allocations to characterize privacy-convergence trade-offs.

Result: The paper proves DP can be achieved without artificial noise injection, provides explicit conditions where DP doesn’t compromise training, and validates findings with extensive numerical results.

Conclusion: Channel impairments in AirFL can serve as a natural source of differential privacy without requiring artificial noise, enabling optimal privacy-convergence trade-offs in federated learning systems.

Abstract: Federated Learning (FL) is a distributed learning paradigm that preserves privacy by eliminating the need to exchange raw data during training. In its prototypical edge instantiation with underlying wireless transmissions enabled by analog over-the-air computing (AirComp), referred to as \emph{over-the-air FL (AirFL)}, the inherent channel noise plays a unique role of \emph{frenemy} in the sense that it degrades training due to noisy global aggregation while providing a natural source of randomness for privacy-preserving mechanisms, formally quantified by \emph{differential privacy (DP)}. It remains, nevertheless, challenging to effectively harness such channel impairments, as prior arts, under assumptions of either simple channel models or restricted types of loss functions, mostly considering (local) DP enhancement with a single-round or non-convergent bound on privacy loss. In this paper, we study AirFL over multiple-access fading channels with a multi-antenna base station (BS) subject to user-level DP requirements. Despite a recent study, which claimed in similar settings that artificial noise (AN) must be injected to ensure DP in general, we demonstrate, on the contrary, that DP can be gained as a \emph{perk} even \emph{without} employing any AN. Specifically, we derive a novel bound on DP that converges under general bounded-domain assumptions on model parameters, along with a convergence bound with general smooth and non-convex loss functions. Next, we optimize over receive beamforming and power allocations to characterize the optimal convergence-privacy trade-offs, which also reveal explicit conditions in which DP is achievable without compromising training. Finally, our theoretical findings are validated by extensive numerical results.

[432] NeuroPathNet: Dynamic Path Trajectory Learning for Brain Functional Connectivity Analysis

Tianqi Guo, Liping Chen, Ciyuan Peng, Jingjing Zhou, Jing Ren

Main category: cs.LG

TL;DR: Proposes NeuroPathNet, a path-level trajectory modeling framework to characterize dynamic connection pathways between brain functional partitions, outperforming existing methods on fMRI datasets.

Details

Motivation: Existing methods struggle to capture temporal evolution characteristics of connections between specific functional communities in brain networks, which is important for understanding cognitive mechanisms and diagnosing neurological diseases.

Method: Extracts time series of connection strengths between functional partitions using medically supported static partitioning schemes, then models them using a temporal neural network framework.

Result: Validated on three public fMRI datasets, showing superior performance over existing mainstream methods across multiple indicators.

Conclusion: The framework promotes development of dynamic graph learning methods for brain network analysis and provides potential clinical applications for neurological disease diagnosis.

Abstract: Understanding the evolution of brain functional networks over time is of great significance for the analysis of cognitive mechanisms and the diagnosis of neurological diseases. Existing methods often have difficulty in capturing the temporal evolution characteristics of connections between specific functional communities. To this end, this paper proposes a new path-level trajectory modeling framework (NeuroPathNet) to characterize the dynamic behavior of connection pathways between brain functional partitions. Based on medically supported static partitioning schemes (such as Yeo and Smith ICA), we extract the time series of connection strengths between each pair of functional partitions and model them using a temporal neural network. We validate the model performance on three public functional Magnetic Resonance Imaging (fMRI) datasets, and the results show that it outperforms existing mainstream methods in multiple indicators. This study can promote the development of dynamic graph learning methods for brain network analysis, and provide possible clinical applications for the diagnosis of neurological diseases.

[433] Pearl: A Foundation Model for Placing Every Atom in the Right Location

Genesis Research Team, Alejandro Dobles, Nina Jovic, Kenneth Leidal, Pranav Murugan, David C. Williams, Drausin Wulsin, Nate Gruver, Christina X. Ji, Korrawat Pruegsanusak, Gianluca Scarpellini, Ansh Sharma, Wojciech Swiderski, Andrea Bootsma, Richard Strong Bowen, Charlotte Chen, Jamin Chen, Marc André Dämgen, Benjamin DiFrancesco, J. D. Fishman, Alla Ivanova, Zach Kagin, David Li-Bland, Zuli Liu, Igor Morozov, Jeffrey Ouyang-Zhang, Frank C. Pickard IV, Kushal S. Shah, Ben Shor, Gabriel Monteiro da Silva, Roy Tal, Maxx Tessmer, Carl Tilbury, Cyr Vetcher, Daniel Zeng, Maruan Al-Shedivat, Aleksandra Faust, Evan N. Feinberg, Michael V. LeVine, Matteus Pan

Main category: cs.LG

TL;DR: Pearl is a foundation model for protein-ligand cofolding that achieves state-of-the-art performance by using large-scale synthetic data, SO(3)-equivariant diffusion architecture, and controllable inference with templating systems.

Details

Motivation: Current deep learning methods for protein-ligand structure prediction are limited by scarce experimental data, inefficient architectures, physically invalid poses, and inability to exploit auxiliary information, which hinders computational drug discovery.

Method: Pearl uses three key innovations: (1) large-scale synthetic data training to overcome data scarcity, (2) SO(3)-equivariant diffusion module for respecting 3D rotational symmetries, and (3) controllable inference with multi-chain templating supporting both protein and non-polymeric components.

Result: Pearl surpasses AlphaFold 3 and other baselines, achieving 14.5% and 14.2% improvements on Runs N’ Poses and PoseBusters benchmarks for accurate (RMSD < 2 Å) physically valid poses. In pocket-conditional cofolding, it delivers 3.6× improvement at RMSD < 1 Å threshold on challenging drug targets.

Conclusion: Pearl establishes new state-of-the-art in protein-ligand cofolding, with performance directly correlated to synthetic dataset size, demonstrating the effectiveness of its architectural innovations and training approach.

Abstract: Accurately predicting the three-dimensional structures of protein-ligand complexes remains a fundamental challenge in computational drug discovery that limits the pace and success of therapeutic design. Deep learning methods have recently shown strong potential as structural prediction tools, achieving promising accuracy across diverse biomolecular systems. However, their performance and utility are constrained by scarce experimental data, inefficient architectures, physically invalid poses, and the limited ability to exploit auxiliary information available at inference. To address these issues, we introduce Pearl (Placing Every Atom in the Right Location), a foundation model for protein-ligand cofolding at scale. Pearl addresses these challenges with three key innovations: (1) training recipes that include large-scale synthetic data to overcome data scarcity; (2) architectures that incorporate an SO(3)-equivariant diffusion module to inherently respect 3D rotational symmetries, improving generalization and sample efficiency, and (3) controllable inference, including a generalized multi-chain templating system supporting both protein and non-polymeric components as well as dual unconditional/conditional modes. Pearl establishes a new state-of-the-art performance in protein-ligand cofolding. On the key metric of generating accurate (RMSD < 2 \r{A}) and physically valid poses, Pearl surpasses AlphaFold 3 and other open source baselines on the public Runs N’ Poses and PoseBusters benchmarks, delivering 14.5% and 14.2% improvements, respectively, over the next best model. In the pocket-conditional cofolding regime, Pearl delivers $3.6\times$ improvement on a proprietary set of challenging, real-world drug targets at the more rigorous RMSD < 1 \r{A} threshold. Finally, we demonstrate that model performance correlates directly with synthetic dataset size used in training.

cs.MA

[434] From Narrative to Action: A Hierarchical LLM-Agent Framework for Human Mobility Generation

Qiumeng Li, Chunhou Ji, Xinyue Liu

Main category: cs.MA

TL;DR: This paper proposes a Hierarchical LLM-Agent Framework called Narrative-to-Action that integrates narrative reasoning, reflective planning, and behavioral execution to generate human-like mobility patterns with semantic coherence and causal logic.

Details

Motivation: Traditional mobility models lack semantic coherence and causal logic of human behavior, while LLMs struggle to balance creative reasoning with structural compliance. There's a need for cognition-driven mobility simulation that captures the cognitive hierarchy underlying real-world travel decisions.

Method: A three-level hierarchical framework: macro level uses a ‘creative writer’ agent for narrative generation and ‘structural parser’ agent for plan conversion; dynamic execution module grounds agents in geographic environments using Mobility Entropy by Occupation (MEO) metric; micro level executes concrete actions through environmental simulation.

Result: The framework produces synthetic trajectories that align closely with real-world patterns and provides interpretable representations of human decision logic, advancing from data-driven to cognition-driven mobility simulation.

Conclusion: This research provides a scalable pathway for understanding, predicting, and synthesizing complex urban mobility behaviors through hierarchical LLM agents, enabling more human-like and interpretable mobility generation.

Abstract: Understanding and replicating human mobility requires not only spatial-temporal accuracy but also an awareness of the cognitive hierarchy underlying real-world travel decisions. Traditional agent-based or deep learning models can reproduce statistical patterns of movement but fail to capture the semantic coherence and causal logic of human behavior. Large language models (LLMs) show potential, but struggle to balance creative reasoning with strict structural compliance. This study proposes a Hierarchical LLM-Agent Framework, termed Narrative-to-Action, that integrates high-level narrative reasoning, mid-level reflective planning, and low-level behavioral execution within a unified cognitive hierarchy. At the macro level, one agent is employed as a “creative writer” to produce diary-style narratives rich in motivation and context, then uses another agent as a “structural parser” to convert narratives into machine-readable plans. A dynamic execution module further grounds agents in geographic environments and enables adaptive behavioral adjustments guided by a novel occupation-aware metric, Mobility Entropy by Occupation (MEO), which captures heterogeneous schedule flexibility across different occupational personalities. At the micro level, the agent executes concrete actions-selecting locations, transportation modes, and time intervals-through interaction with an environmental simulation. By embedding this multi-layer cognitive process, the framework produces not only synthetic trajectories that align closely with real-world patterns but also interpretable representations of human decision logic. This research advances synthetic mobility generation from a data-driven paradigm to a cognition-driven simulation, providing a scalable pathway for understanding, predicting, and synthesizing complex urban mobility behaviors through hierarchical LLM agents.

[435] MASPRM: Multi-Agent System Process Reward Model

Milad Yazdani, Mahdi Mostajabdaveh, Zirui Zhou, Ying Xiong

Main category: cs.MA

TL;DR: MASPRM is a multi-agent process reward model that guides inference-time search by assigning per-action values to partial transcripts, improving reasoning performance without step-level human annotations.

Details

Motivation: Multi-Agent Systems need strong test-time performance and compute-efficient methods that can selectively spend computation to improve quality during inference.

Method: Train MASPRM from multi-agent MCTS rollouts without human annotations by propagating returns to local targets. At inference, it guides beam search and MCTS to focus computation on promising branches and prune early.

Result: On GSM8K and MATH, MASPRM-guided decoding with outcome reward model improves exact match by +30.7 and +22.9 points respectively. Zero-shot transfer from GSM8K to MATH adds +8.4 EM points at same budget.

Conclusion: MASPRM is a plug-in value model that estimates per-agent progress and complements verifier-style decoders, enabling more reliable, compute-aware multi-agent reasoning.

Abstract: Practical deployment of Multi-Agent Systems (MAS) demands strong test-time performance, motivating methods that guide inference-time search and selectively spend compute to improve quality. We present the Multi-Agent System Process Reward Model (MASPRM). It assigns per-action, per-agent values to partial inter-agent transcripts and acts as an inference-time controller. MASPRM is trained from multi-agent Monte Carlo Tree Search (MCTS) rollouts without requiring step-level human annotations, by propagating returns to local targets. At inference, MASPRM guides step-level beam search and MCTS, focusing computation on promising branches and pruning early. On GSM8K and MATH, MASPRM-guided decoding with an outcome reward model (ORM) applied to the final answer, improves exact match (EM) over a single straight-through MAS pass by $+30.7$ and $+22.9$ points, respectively. A MASPRM trained on GSM8K transfers zero-shot to MATH without retraining, adding $8.4$ EM points at the same budget. MASPRM is a plug-in value model that estimates per-agent progress and complements verifier-style decoders, enabling more reliable, compute-aware multi-agent reasoning. Code: https://github.com/milad1378yz/MASPRM

[436] Trust Dynamics in Strategic Coopetition: Computational Foundations for Requirements Engineering in Multi-Agent Systems

Vik Pant, Eric Yu

Main category: cs.MA

TL;DR: This paper develops a computational trust model that bridges conceptual modeling in requirements engineering with algorithmic trust mechanisms from multi-agent systems, enabling dynamic trust evolution analysis in coopetitive environments.

Details

Motivation: There's a gap between qualitative trust representation in requirements engineering models (like i*) and computational trust models from multi-agent systems that lack grounding in RE contexts. Organizations increasingly operate in coopetitive environments where trust evolves dynamically.

Method: Developed a computational trust model with two-layer system (immediate trust and reputation), asymmetric updating (gradual trust building vs sharp erosion), and hysteresis effects. Created translation framework from i* dependency networks to computational models.

Result: Experimental validation across 78,125 configurations showed robust emergence of negativity bias, hysteresis, and cumulative damage. Empirical validation using Renault-Nissan case study achieved 81.7% accuracy in reproducing trust evolution across five relationship phases.

Conclusion: The model successfully bridges conceptual and computational trust modeling, providing requirements engineers with tools to analyze dynamic trust evolution in coopetitive environments with validated effectiveness.

Abstract: Requirements engineering increasingly occurs in multi-stakeholder environments where organizations simultaneously cooperate and compete, creating coopetitive relationships in which trust evolves dynamically based on observed behavior over repeated interactions. While conceptual modeling languages like i* represent trust relationships qualitatively, they lack computational mechanisms for analyzing how trust changes with behavioral evidence. Conversely, computational trust models from multi-agent systems provide algorithmic updating but lack grounding in requirements engineering contexts and conceptual models. This technical report bridges this gap by developing a computational trust model that extends game-theoretic foundations for strategic coopetition with dynamic trust evolution. We introduce trust as a two-layer system with immediate trust responding to current behavior and reputation tracking violation history. Trust evolves through asymmetric updating where cooperation builds trust gradually while violations erode it sharply, creating hysteresis effects and trust ceilings that constrain relationship recovery. We develop a structured translation framework enabling requirements engineers to instantiate computational trust models from i* dependency networks and organizational contexts. Comprehensive experimental validation across 78,125 parameter configurations establishes robust emergence of negativity bias, hysteresis effects, and cumulative damage amplification. Empirical validation using the Renault-Nissan Alliance case study (1999-2025) achieves 49 out of 60 validation points (81.7%), successfully reproducing documented trust evolution across five distinct relationship phases including crisis and recovery periods. This technical report builds upon its foundational companion work in arXiv:2510.18802.

[437] Emergent Coordinated Behaviors in Networked LLM Agents: Modeling the Strategic Dynamics of Information Operations

Gian Marco Orlando, Jinyi Ye, Valerio La Gatta, Mahdi Saeedi, Vincenzo Moscato, Emilio Ferrara, Luca Luceri

Main category: cs.MA

TL;DR: Generative agents can autonomously coordinate in information operations, reproducing real-world IO strategies without human guidance, posing significant societal risks.

Details

Motivation: To systematically study emergent coordination among generative agents in simulated information operations campaigns, as agentic AI promises to make influence campaigns more automated, adaptive, and difficult to detect.

Method: Using generative agent-based modeling to instantiate IO and organic agents in a simulated environment, evaluating coordination across operational regimes from simple goal alignment to team knowledge and collective decision-making.

Result: As operational regimes become more structured, IO networks become denser and more clustered, interactions more reciprocal and positive, narratives more homogeneous, amplification more synchronized, and hashtag adoption faster and sustained. Simply revealing shared goals can produce coordination levels nearly equivalent to explicit deliberation.

Conclusion: Generative agents can autonomously reproduce coordination strategies characteristic of real-world information operations, underscoring the societal risks posed by increasingly automated, self-organizing IOs.

Abstract: Generative agents are rapidly advancing in sophistication, raising urgent questions about how they might coordinate when deployed in online ecosystems. This is particularly consequential in information operations (IOs), influence campaigns that aim to manipulate public opinion on social media. While traditional IOs have been orchestrated by human operators and relied on manually crafted tactics, agentic AI promises to make campaigns more automated, adaptive, and difficult to detect. This work presents the first systematic study of emergent coordination among generative agents in simulated IO campaigns. Using generative agent-based modeling, we instantiate IO and organic agents in a simulated environment and evaluate coordination across operational regimes, from simple goal alignment to team knowledge and collective decision-making. As operational regimes become more structured, IO networks become denser and more clustered, interactions more reciprocal and positive, narratives more homogeneous, amplification more synchronized, and hashtag adoption faster and more sustained. Remarkably, simply revealing to agents which other agents share their goals can produce coordination levels nearly equivalent to those achieved through explicit deliberation and collective voting. Overall, we show that generative agents, even without human guidance, can reproduce coordination strategies characteristic of real-world IOs, underscoring the societal risks posed by increasingly automated, self-organizing IOs.

[438] SeeingEye: Agentic Information Flow Unlocks Multimodal Reasoning In Text-only LLMs

Weijia Zhang, Zijia Liu, Haoru Li, Haoqi Chen, Jiaxuan You

Main category: cs.MA

TL;DR: Seeing Eye is a modular framework that enables text-only LLMs to perform multimodal reasoning by using a small VLM translator as a perception agent to convert visual inputs into structured representations, which are then processed by the LLM as a reasoning agent.

Details

Motivation: Text-only LLMs have strong reasoning capabilities but struggle with multimodal tasks. Existing approaches using single-form captions lack diversity and fail to adapt across different VQA benchmarks, providing no efficient channel for fine-grained visual information transmission.

Method: The framework uses a small VLM translator as a perception agent that invokes specialized tools (OCR, crop) and iteratively distills multimodal inputs into structured intermediate representations (SIRs) tailored to questions. These SIRs are passed to a text-only LLM acting as a reasoning agent, with multi-round feedback and interaction between them.

Result: Experiments on knowledge-intensive VQA benchmarks (MMMU, MIA-Bench) show that Seeing Eye reduces inference cost and surpasses larger end-to-end VLMs. A 3B-parameter vision translator with 8B-parameter language reasoner outperforms a monolithic 32B VLM on challenging knowledge-based questions.

Conclusion: Decoupling perception from reasoning via agent information flow offers a scalable and plug-and-play pathway to multimodal reasoning, allowing strong text-only LLMs to fully leverage their reasoning capabilities.

Abstract: Recent advances in text-only large language models (LLMs), such as DeepSeek-R1, demonstrate remarkable reasoning ability. However, these models remain fragile or entirely incapable when extended to multi-modal tasks. Existing approaches largely rely on single-form captions, which lack diversity and often fail to adapt across different types of Visual Question Answering (VQA) benchmarks. As a result, they provide no principled or efficient channel for transmitting fine-grained visual information. We introduce Seeing Eye, a modular framework that unlocks multimodal reasoning in text-only LLMs through an agent-based small VLM translator. This translator acts as a perception agent: it can invoke specialized tools (e.g., OCR and crop) and iteratively distill multimodal inputs into structured intermediate representations (SIRs) tailored to the question. These SIRs are then passed to the text-only LLM, which serves as a reasoning agent. Crucially, the translator and reasoner engage in multi-round feedback and interaction, enabling the extraction of targeted visual details and yielding more confident answers. Experiments on knowledge-intensive VQA benchmarks, including MMMU and MIA-Bench, demonstrate that Seeing Eye not only reduces inference cost but also surpasses much larger end-to-end VLMs. For example, an instantiation combining a 3B-parameter vision translator with an 8B-parameter language reasoner outperforms a monolithic 32B VLM on challenging knowledge-based questions. Our results highlight that decoupling perception from reasoning via agent information flow offers a scalable and plug-and-play pathway to multimodal reasoning, allowing strong text-only LLMs to fully leverage their reasoning capabilities. Code is available at: https://github.com/ulab-uiuc/SeeingEye

[439] Collaborative Scheduling of Time-dependent UAVs,Vehicles and Workers for Crowdsensing in Disaster Response

Lei Han, Jinhao Zhang, Jinhui Liu, Zhiyong Yu, Liang Wang, Quan Wang, Zhiwen Yu

Main category: cs.MA

TL;DR: HoCs-MPQ is a heterogeneous multi-agent online collaborative scheduling algorithm that improves post-disaster environmental information collection by modeling collaboration/conflict relationships through weighted undirected graphs and solving maximum weight independent sets using multi-priority queues.

Details

Motivation: Existing sensing technologies like mobile crowdsensing have limitations in post-disaster environments including weak environmental adaptability, insufficient professional sensing capabilities, and poor practicality of sensing solutions, which hinder efficient rescue operations.

Method: HoCs-MPQ constructs weighted undirected graph nodes based on collaborative relationships among multiple elements, quantifies their weights, models conflict relationships, and solves maximum weight independent set using iterated local search accelerated by multi-priority queues for collaborative scheduling of UAVs, vehicles, and workers.

Result: Compared to baseline methods (HoCs-GREEDY, HoCs-K-WTA, HoCs-MADL, HoCs-MARL), HoCs-MPQ improves task completion rates by 54.13%, 23.82%, 14.12%, and 12.89% respectively, with computation time under 3 seconds for single online scheduling decisions.

Conclusion: HoCs-MPQ effectively addresses the challenges of post-disaster environmental information collection through heterogeneous multi-agent collaborative scheduling, achieving significant improvements in task completion rates while maintaining real-time performance.

Abstract: Frequent natural disasters cause significant losses to human society, and timely, efficient collection of post-disaster environmental information is the foundation for effective rescue operations. Due to the extreme complexity of post-disaster environments, existing sensing technologies such as mobile crowdsensing suffer from weak environmental adaptability, insufficient professional sensing capabilities, and poor practicality of sensing solutions. Therefore, this paper explores a heterogeneous multi-agent online collaborative scheduling algorithm, HoCs-MPQ, to achieve efficient collection of post-disaster environmental information. HoCs-MPQ models collaboration and conflict relationships among multiple elements through weighted undirected graph construction, and iteratively solves the maximum weight independent set based on multi-priority queues, ultimately achieving collaborative sensing scheduling of time-dependent UA Vs, vehicles, and workers. Specifically, (1) HoCs-MPQ constructs weighted undirected graph nodes based on collaborative relationships among multiple elements and quantifies their weights, then models the weighted undirected graph based on conflict relationships between nodes; (2) HoCs-MPQ solves the maximum weight independent set based on iterated local search, and accelerates the solution process using multi-priority queues. Finally, we conducted detailed experiments based on extensive real-world and simulated data. The experiments show that, compared to baseline methods (e.g., HoCs-GREEDY, HoCs-K-WTA, HoCs-MADL, and HoCs-MARL), HoCs-MPQ improves task completion rates by an average of 54.13%, 23.82%, 14.12%, and 12.89% respectively, with computation time for single online autonomous scheduling decisions not exceeding 3 seconds.

[440] Multi-party Agent Relation Sampling for Multi-party Ad Hoc Teamwork

Beiwen Zhang, Yongheng Liang, Hejun Wu

Main category: cs.MA

TL;DR: The paper introduces Multi-party Ad Hoc Teamwork (MAHT), where controlled agents must coordinate with multiple groups of unfamiliar uncontrolled teammates, and proposes MARs method using sparse skeleton graphs and relational modeling.

Details

Motivation: Existing multi-agent reinforcement learning assumes fixed, fully controlled teams, and ad hoc teamwork variants still presume shared conventions, which limits real-world applicability.

Method: Proposes MARs method that builds a sparse skeleton graph and applies relational modeling to capture cross-group dynamics between multiple mutually unfamiliar groups.

Result: Experiments on MPE and StarCraft II show that MARs outperforms MARL and AHT baselines while converging faster.

Conclusion: MARs effectively addresses the multi-party ad hoc teamwork problem by modeling cross-group dynamics through relational modeling and sparse graphs.

Abstract: Multi-agent reinforcement learning (MARl) has achieved strong results in cooperative tasks but typically assumes fixed, fully controlled teams. Ad hoc teamwork (AHT) relaxes this by allowing collaboration with unknown partners, yet existing variants still presume shared conventions. We introduce Multil-party Ad Hoc Teamwork (MAHT), where controlled agents must coordinate with multiple mutually unfamiliar groups of uncontrolled teammates. To address this, we propose MARs, which builds a sparse skeleton graph and applies relational modeling to capture cross-group dvnamics. Experiments on MPE and starCralt ll show that MARs outperforms MARL and AHT baselines while converging faster.

[441] Redistributing Rewards Across Time and Agents for Multi-Agent Reinforcement Learning

Aditya Kapoor, Kale-ab Tessera, Mayank Baranwal, Harshad Khadilkar, Jan Peters, Stefano Albrecht, Mingfei Sun

Main category: cs.MA

TL;DR: TAR² is a new credit assignment method for multi-agent reinforcement learning that uses separate neural networks for contribution scores and deterministic normalization to guarantee return equivalence and preserve optimal policies.

Details

Motivation: Existing credit assignment methods in cooperative MARL rely on model regression accuracy for return equivalence guarantees, making them unreliable in practice.

Method: TAR² decouples credit modeling from return equivalence constraints - a neural network learns unnormalized contribution scores while a separate deterministic normalization step enforces return equivalence by construction.

Result: Empirical results on SMACLite and Google Research Football benchmarks show TAR² accelerates learning and achieves higher final performance than strong baselines.

Conclusion: TAR² is an effective solution for agent-temporal credit assignment that guarantees optimal policy preservation through valid Potential-Based Reward Shaping.

Abstract: Credit assignmen, disentangling each agent’s contribution to a shared reward, is a critical challenge in cooperative multi-agent reinforcement learning (MARL). To be effective, credit assignment methods must preserve the environment’s optimal policy. Some recent approaches attempt this by enforcing return equivalence, where the sum of distributed rewards must equal the team reward. However, their guarantees are conditional on a learned model’s regression accuracy, making them unreliable in practice. We introduce Temporal-Agent Reward Redistribution (TAR$^2$), an approach that decouples credit modeling from this constraint. A neural network learns unnormalized contribution scores, while a separate, deterministic normalization step enforces return equivalence by construction. We demonstrate that this method is equivalent to a valid Potential-Based Reward Shaping (PBRS), which guarantees the optimal policy is preserved regardless of model accuracy. Empirically, on challenging SMACLite and Google Research Football (GRF) benchmarks, TAR$^2$ accelerates learning and achieves higher final performance than strong baselines. These results establish our method as an effective solution for the agent-temporal credit assignment problem.

cs.MM

[442] YTLive: A Dataset of Real-World YouTube Live Streaming Sessions

Mojtaba Mozhganfar, Pooya Jamshidi, Seyyed Ali Aghamiri, Mohsen Ghasemi, Mahdi Dolati, Farzad Tashtarian, Ahmad Khonsari, Christian Timmerer

Main category: cs.MM

TL;DR: YTLive is a public dataset of YouTube Live streams tracking viewer counts at 5-minute intervals, showing weekend streams have more stable audiences and shorter streams attract larger viewership.

Details

Motivation: To address the lack of large, publicly available datasets capturing real-time viewer behavior in live streaming for research purposes.

Method: Collected through YouTube Researcher Program over May-June 2024, tracking 507,000 records from 12,156 live streams with concurrent viewer counts at 5-minute intervals and broadcast durations.

Result: Viewer counts are higher and more stable on weekends, especially during afternoon hours. Shorter streams attract larger, more consistent audiences while longer streams grow slowly with greater variability.

Conclusion: YTLive provides an open resource for reproducible research in live streaming with implications for adaptive streaming, resource allocation, and QoE modeling.

Abstract: Live streaming plays a major role in today’s digital platforms, supporting entertainment, education, social media, etc. However, research in this field is limited by the lack of large, publicly available datasets that capture real-time viewer behavior at scale. To address this gap, we introduce YTLive, a public dataset focused on YouTube Live. Collected through the YouTube Researcher Program over May and June 2024, YTLive includes more than 507000 records from 12156 live streams, tracking concurrent viewer counts at five-minute intervals along with precise broadcast durations. We describe the dataset design and collection process and present an initial analysis of temporal viewing patterns. Results show that viewer counts are higher and more stable on weekends, especially during afternoon hours. Shorter streams attract larger and more consistent audiences, while longer streams tend to grow slowly and exhibit greater variability. These insights have direct implications for adaptive streaming, resource allocation, and Quality of Experience (QoE) modeling. YTLive offers a timely, open resource to support reproducible research and system-level innovation in live streaming. The dataset is publicly available at github.

[443] Hallucination Localization in Video Captioning

Shota Nakada, Kazuhiro Saito, Yuchi Ishikawa, Hokuto Munakata, Tatsuya Komatsu, Masayoshi Kondo

Main category: cs.MM

TL;DR: Proposes a new task for identifying hallucinations in video captions at the span level, creates a benchmark dataset with manual annotations, and implements a baseline method for evaluation.

Details

Motivation: Existing hallucination detection in video captioning only works at sentence level, lacking detailed analysis. Span-level localization provides more granular understanding of hallucinations.

Method: Constructed HLVC-Dataset with 1,167 manually annotated video-caption pairs from VideoLLM-generated captions. Implemented a VideoLLM-based baseline method for hallucination localization.

Result: Established benchmark for hallucination localization task. Conducted quantitative and qualitative evaluations to benchmark current performance levels.

Conclusion: Span-level hallucination localization enables more detailed analysis than sentence-level detection, providing a foundation for future research in video captioning hallucination analysis.

Abstract: We propose a novel task, hallucination localization in video captioning, which aims to identify hallucinations in video captions at the span level (i.e. individual words or phrases). This allows for a more detailed analysis of hallucinations compared to existing sentence-level hallucination detection task. To establish a benchmark for hallucination localization, we construct HLVC-Dataset, a carefully curated dataset created by manually annotating 1,167 video-caption pairs from VideoLLM-generated captions. We further implement a VideoLLM-based baseline method and conduct quantitative and qualitative evaluations to benchmark current performance on hallucination localization.

[444] PureKV: Plug-and-Play KV Cache Optimization with Spatial-Temporal Sparse Attention for Vision-Language Large Models

Zhonghua Jiang, Kunxi Li, Yiyun Zhou, Sihao Liu, Zhaode Wang, Chengfei lv, Shengyu Zhang

Main category: cs.MM

TL;DR: PureKV is a plug-and-play framework that jointly optimizes sparse attention and KV cache compression for Vision-Language Large Models, achieving 5.0× KV cache compression and 3.16× prefill acceleration with minimal quality loss.

Details

Motivation: VLLMs face efficiency challenges from quadratic attention complexity and growing KV cache size during prefill and decoding stages. Existing KV cache compression methods are incompatible with efficient attention mechanisms like FlashAttention and don't account for how sparse attention alters KV cache information structure.

Method: Proposes PureKV framework with: 1) KV cache compression using lower layer attention scores to estimate importance of high layers’ KV cache for active pruning, 2) Spatial-Temporal Sparse Attention (ST-SpAttn) module that combines spatial and temporal attention sparsity to purify spatial noise and temporal redundancy in KV cache.

Result: Extensive experiments on VideoLLaMA2 and Qwen2.5-VL show PureKV achieves 5.0 times KV cache compression and 3.16 times prefill acceleration with negligible quality degradation.

Conclusion: PureKV provides an effective joint optimization solution for sparse attention and KV cache compression that is compatible with efficient attention accelerators while maintaining model quality.

Abstract: Vision-Language Large Models (VLLMs) face significant efficiency challenges when processing high-resolution inputs. The quadratic complexity in attention and autoregressive generation, as well as the constantly growing key value (KV) cache size, severely hinder the prefilling and decoding stages. Recent efforts have attempted to compress KV cache by identifying and pruning KV cache of less important tokens, but these methods typically rely on attention scores to estimate token importance, making them incompatible with efficient attention mechanisms such as FlashAttention and Sparse Attention, which do not explicitly compute attention matrices. Moreover, existing methods overlook how sparse attention, while accelerating the prefilling stage, alters the information structure of the KV cache, thereby compromising the effectiveness of downstream KV cache compression strategies. To address this issue, we propose PureKV, a plug-and-play framework for joint optimization of sparse attention and KV cache compression. We first introduce a KV cache compression strategy that is fully compatible with efficient attention accelerators. Our method utilizes lower layer attention scores to estimate the importance of high layers’ KV cache, enabling active pruning without compromising accuracy. In addition, we have designed a Spatial-Temporal Sparse Attention (ST-SpAttn) module specifically tailored for video KV cache compression algorithms. This module combines spatial and temporal attention sparsity to improve the compression efficiency of KV cache optimization algorithms by purifying spatial noise and temporal redundancy in KV cache. At the same time, ST-SpAttn also accelerated the prefilling stage of VLLMs. Extensive experiments on VLLMs (VideoLLaMA2, Qwen2.5-VL) have shown that PureKV achieves 5.0 times KV cache compression and 3.16 times prefill acceleration, with negligible quality degradation.

eess.AS

[445] EasyEyes: Online hearing research using speakers calibrated by phones

Ivan Vican, Hugo De Moraes, Chongjun Liao, Nathnael H. Tsegaye, William O’Gara, Jasper Inamoto, Denis G. Pelli

Main category: eess.AS

TL;DR: EasyEyes.app provides an open-source solution for online loudspeaker calibration using smartphone microphones, enabling inclusive hearing research without requiring participants to have professional calibration equipment.

Details

Motivation: Online hearing research is faster and more inclusive than traditional lab-based methods, but most participants lack calibrated sound sources. This creates a need for accessible calibration methods that don't require specialized equipment.

Method: Uses smartphone microphones as calibration tools by creating a library of phone microphone profiles. Participants select their phone model, verified by screen size, and the system uses the Novak et al. nonsynchronous maximum-length-sequence (MLS) algorithm to calibrate computer loudspeakers in three minutes. The loudspeaker is corrected by convolving input with the inverse of its impulse response.

Result: Calibration achieves high accuracy with standard deviation less than 3 dB, producing nearly flat spectrum when playing flat-spectrum MLS through corrected loudspeakers. A survey shows 94 phone models from major brands can support 87% of participants in USA and 80% in UK.

Conclusion: This method enables efficient and inclusive online hearing research by providing accessible loudspeaker calibration without requiring participants to own specialized equipment, with an open-access library that researchers can contribute to.

Abstract: Hearing research requires a calibrated sound source, traditionally as lab equipment. Online research is quicker and more inclusive, but most participants lack calibration equipment and their sound sources are uncalibrated and diverse. This article explains how the open-source EasyEyes.app calibrates loudspeakers online. A library of smartphone-microphone profiles allows EasyEyes to use the participant’s phone to calibrate their computer’s loudspeaker in three minutes. Participants select their phone model, which is verified by screen size. Calibration employs the Novak et al. nonsynchronous maximum-length-sequence (MLS) algorithm. The computer’s loudspeaker is corrected by convolving its input with the inverse of its impulse response. Researchers can contribute to the open-access library by calibrating phones with a measurement microphone. In the library, each profile is linked back to the profile used to produce it, back to the manufacturer profile of a measurement microphone. Correction accuracy is such that playing the flat-spectrum MLS through the corrected loudspeaker produces a nearly flat spectrum, with standard deviation less than 3 dB. A survey shows that a library of 94 phone models from major brands will support most participants in the USA (87%) and UK (80%). This method facilitates efficient and inclusive online hearing research.

[446] Retaining Mixture Representations for Domain Generalized Anomalous Sound Detection

Phurich Saengthong, Tomoya Nishida, Kota Dohi, Natsuo Yamashita, Yohei Kawaguchi

Main category: eess.AS

TL;DR: Proposes a ‘retain-not-denoise’ strategy to improve self-supervised learning backbones for anomalous sound detection, addressing distribution shift issues by preserving information from mixed sound sources rather than denoising.

Details

Motivation: Current ASD systems face challenges with distribution shifts in real-world noisy environments. Fine-tuned systems suppress noise but reduce generalization, while frozen SSL encoders have performance drops when mixture embeddings deviate from clean sources.

Method: Combines multi-label audio tagging loss with mixture alignment loss that aligns student mixture embeddings to convex teacher embeddings of clean and noise inputs, implementing a retain-not-denoise strategy.

Result: Demonstrates improved robustness under distribution shifts in controlled experiments with stationary, non-stationary, and mismatched noise subsets, narrowing gap toward oracle mixture representations.

Conclusion: The retain-not-denoise approach effectively preserves information from mixed sound sources and enhances generalization for anomalous sound detection in noisy environments.

Abstract: Anomalous sound detection (ASD) in the wild requires robustness to distribution shifts such as unseen low-SNR input mixtures of machine and noise types. State-of-the-art systems extract embeddings from an adapted audio encoder and detect anomalies via nearest-neighbor search, but fine tuning on noisy machine sounds often acts like a denoising objective, suppressing noise and reducing generalization under mismatched mixtures or inconsistent labeling. Training-free systems with frozen self-supervised learning (SSL) encoders avoid this issue and show strong first-shot generalization, yet their performance drops when mixture embeddings deviate from clean-source embeddings. We propose to improve SSL backbones with a retain-not-denoise strategy that better preserves information from mixed sound sources. The approach combines a multi-label audio tagging loss with a mixture alignment loss that aligns student mixture embeddings to convex teacher embeddings of clean and noise inputs. Controlled experiments on stationary, non-stationary, and mismatched noise subsets demonstrate improved robustness under distribution shifts, narrowing the gap toward oracle mixture representations.

[447] Separating peripheral and higher-level effects on speech intelligibility using a hearing loss simulator and an objective intelligibility measure

Toshio Irino, Ayako Yamamoto, Fuki Miyazaki

Main category: eess.AS

TL;DR: This paper presents a method using WHIS hearing loss simulator and GESI objective measure to separate peripheral hearing loss effects from higher-level cognitive processes on speech intelligibility in older adults.

Details

Motivation: To develop a method that can distinguish between the effects of peripheral hearing loss and higher-level cognitive processes on speech intelligibility in older adults, allowing for individual analysis of cognitive contributions.

Method: Conducted speech intelligibility experiments with young normal-hearing listeners using WHIS-simulated hearing loss sounds, and used GESI objective measure to predict intelligibility scores for both young and older adult listeners.

Result: Target older adult showed higher speech intelligibility than average young listeners despite hearing loss, suggesting more effective higher-level processes. GESI accurately predicted scores, revealing individual variations in higher-level process efficiency among older adults.

Conclusion: WHIS and GESI enable contrastive experiments between young and older listeners regardless of hearing level, facilitating individual study of higher-level cognitive processes in older adults with hearing loss.

Abstract: This paper presents a new method for separating the effects of peripheral hearing loss (HL) and higher-level processes on speech intelligibility (SI). In a previous study, we conducted an SI experiment with 14 older adult (OA) listeners, using speech-in-noise sounds that were either processed with an ideal ratio mask (IRM) enhancement technique or left unprocessed. The current study involved an SI experiment with 15 young, normal-hearing (YNH) listeners. This experiment used simulated HL sounds processed with the WHIS simulator that reflected the hearing level of a specific OA from the previous study. The results showed that the target OA’s SI scores were higher than the average YNH scores. This implies that the target OA’s higher-level processes may be more effective than those of the average YNH. To understand the characteristics of other OAs, we used the GESI objective intelligibility measure to predict SI. First, we confirmed that GESI could fairly accurately predict the SI scores for both the YNH and OA listeners. Next, we predicted the SI scores of the 14 OA listeners using the parameters estimated in the YNH experiment. The results showed that some OAs had higher SI scores than the average YNH, while one OA had lower scores. These differences in SI scores may reflect variations in the efficiency of higher-level processes.These results imply that WHIS and GESI could facilitate contrastive experiments between YNH and OA listeners, regardless of hearing level. This would allow us to study the effects of higher-level processes in OA listeners individually.

[448] PitchFlower: A flow-based neural audio codec with pitch controllability

Diego Torres, Axel Roebel, Nicolas Obin

Main category: eess.AS

TL;DR: PitchFlower is a flow-based neural audio codec that enables explicit pitch control through F0 conditioning and vector quantization, achieving better pitch control than WORLD and better controllability than SiFiGAN while maintaining high audio quality.

Details

Motivation: To create a neural audio codec with explicit pitch controllability that can disentangle pitch from other speech attributes, providing more accurate control than existing methods.

Method: Uses flow-based neural network with F0 conditioning, flattens and randomly shifts F0 contours during training, employs vector-quantization bottleneck to prevent pitch recovery, and uses flow-based decoder for audio generation.

Result: Achieves more accurate pitch control than WORLD at higher audio quality, outperforms SiFiGAN in controllability while maintaining comparable quality.

Conclusion: The framework provides a simple and extensible path toward disentangling other speech attributes beyond just pitch control.

Abstract: We present PitchFlower, a flow-based neural audio codec with explicit pitch controllability. Our approach enforces disentanglement through a simple perturbation: during training, F0 contours are flattened and randomly shifted, while the true F0 is provided as conditioning. A vector-quantization bottleneck prevents pitch recovery, and a flow-based decoder generates high quality audio. Experiments show that PitchFlower achieves more accurate pitch control than WORLD at much higher audio quality, and outperforms SiFiGAN in controllability while maintaining comparable quality. Beyond pitch, this framework provides a simple and extensible path toward disentangling other speech attributes.

[449] Lost in Phonation: Voice Quality Variation as an Evaluation Dimension for Speech Foundation Models

Harm Lameris, Shree Harsha Bokkahalli Satish, Joakim Gustafson, Éva Székely

Main category: eess.AS

TL;DR: This paper examines how speech foundation models (SFMs) respond to voice quality variations like creaky and breathy voice, using open-ended generation and emotion recognition tasks with a new parallel dataset.

Details

Motivation: Existing benchmarks using multiple-choice formats are unreliable for capturing nuanced paralinguistic influences. Voice quality variations affect how humans interpret affective states and social meaning, but SFM sensitivity to these non-lexical features remains unexplored.

Method: Probe SFMs through open-ended generation tasks and speech emotion recognition using a new parallel dataset with synthesized voice quality modifications (creaky and breathy voice).

Result: The study provides the first examination of SFM sensitivity to voice quality variations, evaluating whether model behaviors remain consistent across different phonation inputs.

Conclusion: This work establishes a framework for testing SFM responses to non-lexical speech features and demonstrates the importance of evaluating paralinguistic sensitivity in speech foundation models.

Abstract: Recent advances in speech foundation models (SFMs) have enabled the direct processing of spoken language from raw audio, bypassing intermediate textual representations. This capability allows SFMs to be exposed to, and potentially respond to, rich paralinguistic variations embedded in the input speech signal. One under-explored dimension of paralinguistic variation is voice quality, encompassing phonation types such as creaky and breathy voice. These phonation types are known to influence how listeners infer affective state, stance and social meaning in speech. Existing benchmarks for speech understanding largely rely on multiple-choice question answering (MCQA) formats, which are prone to failure and therefore unreliable in capturing the nuanced ways paralinguistic features influence model behaviour. In this paper, we probe SFMs through open-ended generation tasks and speech emotion recognition, evaluating whether model behaviours are consistent across different phonation inputs. We introduce a new parallel dataset featuring synthesized modifications to voice quality, designed to evaluate SFM responses to creaky and breathy voice. Our work provides the first examination of SFM sensitivity to these particular non-lexical aspects of speech perception.

[450] Predicting speech intelligibility in older adults for speech enhancement using the Gammachirp Envelope Similarity Index, GESI

Ayako Yamamoto, Fuki Miyazaki, Toshio Irino

Main category: eess.AS

TL;DR: GESI is a new objective intelligibility measure that predicts speech intelligibility in older adults using gammachirp filterbank and modulation processing, outperforming existing methods like HASPIw2.

Details

Motivation: To develop an accurate objective intelligibility measure for older adults that considers both hearing levels and temporal processing characteristics, addressing limitations in existing methods.

Method: GESI uses a bottom-up model with gammachirp filterbank, modulation filterbank, and extended cosine similarity. It incorporates audiogram hearing levels and temporal modulation transfer function (TMTF) characteristics.

Result: GESI predicted subjective speech intelligibility scores more accurately than HASPIw2 for Japanese words and was at least as effective as HASPIv2 for English sentences. TMTF integration showed insignificant effect.

Conclusion: GESI is an effective OIM for older adults, but TMTF measurements and modeling need improvement with bandpass noise and better temporal characteristic incorporation.

Abstract: We propose an objective intelligibility measure (OIM), called the Gammachirp Envelope Similarity Index (GESI), that can predict speech intelligibility (SI) in older adults. GESI is a bottom-up model based on psychoacoustic knowledge from the peripheral to the central auditory system. It computes the single SI metric using the gammachirp filterbank (GCFB), the modulation filterbank, and the extended cosine similarity measure. It takes into account not only the hearing level represented in the audiogram, but also the temporal processing characteristics captured by the temporal modulation transfer function (TMTF). To evaluate performance, SI experiments were conducted with older adults of various hearing levels using speech-in-noise with ideal speech enhancement on familiarity-controlled Japanese words. The prediction performance was compared with HASPIw2, which was developed for keyword SI prediction. The results showed that GESI predicted the subjective SI scores more accurately than HASPIw2. GESI was also found to be at least as effective as, if not more effective than, HASPIv2 in predicting English sentence-level SI. The effect of introducing TMTF into the GESI algorithm was insignificant, suggesting that TMTF measurements and models are not yet mature. Therefore, it may be necessary to perform TMTF measurements with bandpass noise and to improve the incorporation of temporal characteristics into the model.

eess.IV

[451] DMVFC: Deep Learning Based Functionally Consistent Tractography Fiber Clustering Using Multimodal Diffusion MRI and Functional MRI

Bocheng Guo, Jin Wang, Yijie Li, Junyi Wang, Mingyu Gao, Puming Feng, Yuqian Chen, Jarrett Rushmore, Nikos Makris, Yogesh Rathi, Lauren J O’Donnell, Fan Zhang

Main category: eess.IV

TL;DR: DMVFC is a deep learning framework that integrates dMRI and fMRI data for white matter fiber clustering, combining geometric, microstructural, and functional information to achieve more meaningful parcellation.

Details

Motivation: Current fiber clustering methods only use geometric characteristics and neglect functional and microstructural information, despite evidence that fMRI can measure neural activity in white matter and microstructural features like FA can ensure anatomical coherence.

Method: DMVFC has two main components: (1) multi-view pretraining to compute embedding features from fiber geometry, microstructure measures, and functional signals separately, and (2) collaborative fine-tuning to simultaneously refine embedding differences.

Result: DMVFC demonstrated superior performance compared to two state-of-the-art fiber clustering methods in achieving functionally meaningful and consistent white matter parcellation results.

Conclusion: The proposed DMVFC framework effectively integrates multimodal information for white matter fiber clustering, providing functionally consistent parcellation that outperforms existing methods.

Abstract: Tractography fiber clustering using diffusion MRI (dMRI) is a crucial method for white matter (WM) parcellation to enable analysis of brains structural connectivity in health and disease. Current fiber clustering strategies primarily use the fiber geometric characteristics (i.e., the spatial trajectories) to group similar fibers into clusters, while neglecting the functional and microstructural information of the fiber tracts. There is increasing evidence that neural activity in the WM can be measured using functional MRI (fMRI), providing potentially valuable multimodal information for fiber clustering to enhance its functional coherence. Furthermore, microstructural features such as fractional anisotropy (FA) can be computed from dMRI as additional information to ensure the anatomical coherence of the clusters. In this paper, we develop a novel deep learning fiber clustering framework, namely Deep Multi-view Fiber Clustering (DMVFC), which uses joint multi-modal dMRI and fMRI data to enable functionally consistent WM parcellation. DMVFC can effectively integrate the geometric and microstructural characteristics of the WM fibers with the fMRI BOLD signals along the fiber tracts. DMVFC includes two major components: (1) a multi-view pretraining module to compute embedding features from each source of information separately, including fiber geometry, microstructure measures, and functional signals, and (2) a collaborative fine-tuning module to simultaneously refine the differences of embeddings. In the experiments, we compare DMVFC with two state-of-the-art fiber clustering methods and demonstrate superior performance in achieving functionally meaningful and consistent WM parcellation results.

[452] CFL-SparseMed: Communication-Efficient Federated Learning for Medical Imaging with Top-k Sparse Updates

Gousia Habib, Aniket Bhardwaj, Ritvik Sharma, Shoeib Amin Banday, Ishfaq Ahmad Malik

Main category: eess.IV

TL;DR: CFL-SparseMed is a federated learning approach that uses Top-k Sparsification to reduce communication costs while handling non-IID medical imaging data, maintaining accuracy and privacy.

Details

Motivation: Centralized medical image classification faces data and privacy concerns, while standard FL struggles with heterogeneous data and high communication costs in large networks.

Method: Uses Top-k Sparsification to transmit only the top k gradients, reducing communication overhead while addressing data heterogeneity in federated learning.

Result: The approach effectively reduces communication costs while maintaining model accuracy in non-IID medical imaging settings.

Conclusion: CFL-SparseMed enhances FL efficiency, preserves privacy, and improves diagnostic accuracy and patient care in heterogeneous medical imaging environments.

Abstract: Secure and reliable medical image classification is crucial for effective patient treatment, but centralized models face challenges due to data and privacy concerns. Federated Learning (FL) enables privacy-preserving collaborations but struggles with heterogeneous, non-IID data and high communication costs, especially in large networks. We propose \textbf{CFL-SparseMed}, an FL approach that uses Top-k Sparsification to reduce communication overhead by transmitting only the top k gradients. This unified solution effectively addresses data heterogeneity while maintaining model accuracy. It enhances FL efficiency, preserves privacy, and improves diagnostic accuracy and patient care in non-IID medical imaging settings. The reproducibility source code is available on \href{https://github.com/Aniket2241/APK_contruct}{Github}.

[453] Semantic Communications with World Models

Peiwen Jiang, Jiajia Guo, Chao-Kai Wen, Shi Jin, Jun Zhang

Main category: eess.IV

TL;DR: A WFM-aided semantic video transmission framework that uses world foundation models to predict future frames, enabling bandwidth savings by omitting transmissions when predictions are reliable, with feedback mechanisms and partial transmission for error correction.

Details

Motivation: Existing semantic communication methods struggle under extremely low bandwidth and varying channel conditions, where corrupted or missing semantics lead to severe reconstruction errors.

Method: Leverages WFM’s predictive capability to generate future frames based on current frame and textual guidance, uses lightweight depth-based feedback module to determine transmission needs, implements segmentation-assisted partial transmission for frame repair, and develops active transmission strategy using camera trajectory information.

Result: Significantly reduces transmission overhead while maintaining task performance across varying scenarios and channel conditions.

Conclusion: The proposed framework effectively addresses bandwidth limitations in semantic video transmission through predictive modeling and intelligent transmission scheduling.

Abstract: Semantic communication is a promising technique for emerging wireless applications, which reduces transmission overhead by transmitting only task-relevant features instead of raw data. However, existing methods struggle under extremely low bandwidth and varying channel conditions, where corrupted or missing semantics lead to severe reconstruction errors. To resolve this difficulty, we propose a world foundation model (WFM)-aided semantic video transmission framework that leverages the predictive capability of WFMs to generate future frames based on the current frame and textual guidance. This design allows transmissions to be omitted when predictions remain reliable, thereby saving bandwidth. Through WFM’s prediction, the key semantics are preserved, yet minor prediction errors tend to amplify over time. To mitigate issue, a lightweight depth-based feedback module is introduced to determine whether transmission of the current frame is needed. Apart from transmitting the entire frame, a segmentation-assisted partial transmission method is proposed to repair degraded frames, which can further balance performance and bandwidth cost. Furthermore, an active transmission strategy is developed for mobile scenarios by exploiting camera trajectory information and proactively scheduling transmissions before channel quality deteriorates. Simulation results show that the proposed framework significantly reduces transmission overhead while maintaining task performances across varying scenarios and channel conditions.

[454] Transformers in Medicine: Improving Vision-Language Alignment for Medical Image Captioning

Yogesh Thakku Suresh, Vishwajeet Shivaji Hogale, Luca-Alexandru Zamfira, Anandavardhana Hegde

Main category: eess.IV

TL;DR: A transformer-based multimodal framework for generating clinically relevant captions for MRI scans, combining vision transformer, BERT embeddings, and LSTM decoder with hybrid loss functions.

Details

Motivation: To create an automated system for generating clinically relevant captions for MRI scans that improves accuracy and semantic alignment through domain-specific focus.

Method: Uses DEiT-Small vision transformer as image encoder, MediCareBERT for caption embedding, and custom LSTM-based decoder with hybrid cosine-MSE loss and contrastive inference via vector similarity.

Result: Focusing on domain-specific data (filtered brain-only MRIs) improves caption accuracy and semantic alignment compared to general MRI images, outperforming state-of-the-art methods including BLIP, R2GenGPT, and transformer-based approaches.

Conclusion: Proposes a scalable, interpretable solution for automated medical image reporting that demonstrates the value of domain-specific data in improving medical image captioning performance.

Abstract: We present a transformer-based multimodal framework for generating clinically relevant captions for MRI scans. Our system combines a DEiT-Small vision transformer as an image encoder, MediCareBERT for caption embedding, and a custom LSTM-based decoder. The architecture is designed to semantically align image and textual embeddings, using hybrid cosine-MSE loss and contrastive inference via vector similarity. We benchmark our method on the MultiCaRe dataset, comparing performance on filtered brain-only MRIs versus general MRI images against state-of-the-art medical image captioning methods including BLIP, R2GenGPT, and recent transformer-based approaches. Results show that focusing on domain-specific data improves caption accuracy and semantic alignment. Our work proposes a scalable, interpretable solution for automated medical image reporting.

[455] Improving Temporal Consistency and Fidelity at Inference-time in Perceptual Video Restoration by Zero-shot Image-based Diffusion Models

Nasrin Rahimi, A. Murat Tekalp

Main category: eess.IV

TL;DR: Two training-free strategies (PSG and MPES) improve temporal coherence and fidelity in zero-shot video restoration using diffusion models without retraining.

Details

Motivation: Diffusion models for single-image restoration suffer from temporal inconsistencies when applied to video restoration due to stochastic sampling and lack of explicit temporal modeling.

Method: 1) Perceptual Straightening Guidance (PSG) - neuroscience-inspired curvature penalty in perceptual space for smoother temporal evolution; 2) Multi-Path Ensemble Sampling (MPES) - ensembling multiple diffusion trajectories to reduce stochastic variation.

Result: PSG enhances temporal naturalness (especially for temporal blur) and improves FVD/perceptual straightness scores; MPES consistently improves fidelity (PSNR/SSIM) and spatio-temporal perception-distortion trade-off across all tasks.

Conclusion: The proposed training-free techniques provide a practical path for temporally stable high-fidelity perceptual video restoration using pretrained diffusion models without architectural changes.

Abstract: Diffusion models have emerged as powerful priors for single-image restoration, but their application to zero-shot video restoration suffers from temporal inconsistencies due to the stochastic nature of sampling and complexity of incorporating explicit temporal modeling. In this work, we address the challenge of improving temporal coherence in video restoration using zero-shot image-based diffusion models without retraining or modifying their architecture. We propose two complementary inference-time strategies: (1) Perceptual Straightening Guidance (PSG) based on the neuroscience-inspired perceptual straightening hypothesis, which steers the diffusion denoising process towards smoother temporal evolution by incorporating a curvature penalty in a perceptual space to improve temporal perceptual scores, such as Fr'echet Video Distance (FVD) and perceptual straightness; and (2) Multi-Path Ensemble Sampling (MPES), which aims at reducing stochastic variation by ensembling multiple diffusion trajectories to improve fidelity (distortion) scores, such as PSNR and SSIM, without sacrificing sharpness. Together, these training-free techniques provide a practical path toward temporally stable high-fidelity perceptual video restoration using large pretrained diffusion models. We performed extensive experiments over multiple datasets and degradation types, systematically evaluating each strategy to understand their strengths and limitations. Our results show that while PSG enhances temporal naturalness, particularly in case of temporal blur, MPES consistently improves fidelity and spatio-temporal perception–distortion trade-off across all tasks.

[456] Physics-Guided Conditional Diffusion Networks for Microwave Image Reconstruction

Shirin Chehelgami, Joe LoVetri, Vahab Khoshdel

Main category: eess.IV

TL;DR: A conditional latent-diffusion framework for electromagnetic inverse scattering that generates multiple plausible permittivity maps to address non-uniqueness in microwave imaging.

Details

Motivation: To solve the ill-posed electromagnetic inverse scattering problem by explicitly addressing its non-uniqueness, unlike deterministic methods that produce single reconstructions.

Method: Uses a conditional latent-diffusion model to generate multiple permittivity maps conditioned on scattered-field data, integrated with a forward electromagnetic solver for physics-based evaluation.

Result: Produces high-quality permittivity reconstructions with improved generalization and excellent shape fidelity using synthetic and experimental datasets.

Conclusion: Hybrid generative physics frameworks show promise for robust, data-driven microwave imaging by handling non-uniqueness through multiple plausible solutions.

Abstract: A conditional latent-diffusion based framework for solving the electromagnetic inverse scattering problem associated with microwave imaging is introduced. This generative machine-learning model explicitly mirrors the non-uniqueness of the ill-posed inverse problem. Unlike existing inverse solvers utilizing deterministic machine learning techniques that produce a single reconstruction, the proposed latent-diffusion model generates multiple plausible permittivity maps conditioned on measured scattered-field data, thereby generating several potential instances in the range-space of the non-unique inverse mapping. A forward electromagnetic solver is integrated into the reconstruction pipeline as a physics-based evaluation mechanism. The space of candidate reconstructions form a distribution of possibilities consistent with the conditioning data and the member of this space yielding the lowest scattered-field data discrepancy between the predicted and measured scattered fields is reported as the final solution. Synthetic and experimental labeled datasets are used for training and evaluation of the model. An innovative labeled synthetic dataset is created that exemplifies a varied set of scattering features. Training of the model using this new dataset produces high quality permittivity reconstructions achieving improved generalization with excellent fidelity to shape recognition. The results highlight the potential of hybrid generative physics frameworks as a promising direction for robust, data-driven microwave imaging.

[457] Cyst-X: A Federated AI System Outperforms Clinical Guidelines to Detect Pancreatic Cancer Precursors and Reduce Unnecessary Surgery

Hongyi Pan, Gorkem Durak, Elif Keles, Deniz Seyithanoglu, Zheyuan Zhang, Alpay Medetalibeyoglu, Halil Ertugrul Aktas, Andrea Mia Bejar, Ziliang Hong, Yavuz Taktak, Gulbiz Dagoglu Kartal, Mehmet Sukru Erturk, Timurhan Cebeci, Maria Jaramillo Gonzalez, Yury Velichko, Lili Zhao, Emil Agarunov, Federica Proietto Salanitri, Concetto Spampinato, Pallavi Tiwari, Ziyue Xu, Sachin Jambawalikar, Ivo G. Schoots, Marco J. Bruno, Chenchang Huang, Candice W. Bolan, Tamas Gonda, Frank H. Miller, Rajesh N. Keswani, Michael B. Wallace, Ulas Bagci

Main category: eess.IV

TL;DR: Cyst-X is an AI framework for predicting malignancy risk in pancreatic IPMN cysts from MRI scans, achieving higher accuracy than current guidelines and radiologists, with potential to improve early cancer detection.

Details

Motivation: Pancreatic cancer is becoming increasingly deadly, and current guidelines for IPMN risk stratification are inadequate, leading to unnecessary surgeries or missed diagnoses of high-risk lesions.

Method: Developed Cyst-X AI framework trained on 1,461 MRI scans from 764 patients across multiple centers, using federated learning to maintain patient privacy while enabling collaborative training.

Result: Cyst-X achieved AUC of 0.82, outperforming Kyoto guidelines (AUC = 0.75) and expert radiologists, with 20% increase in cancer detection sensitivity (87.8% vs 64.1%) for high-risk lesions.

Conclusion: Cyst-X provides superior IPMN risk stratification and the framework/dataset are publicly released to accelerate research in early pancreatic cancer detection.

Abstract: Pancreatic cancer is projected to be the second-deadliest cancer by 2030, making early detection critical. Intraductal papillary mucinous neoplasms (IPMNs), key cancer precursors, present a clinical dilemma, as current guidelines struggle to stratify malignancy risk, leading to unnecessary surgeries or missed diagnoses. Here, we developed Cyst-X, an AI framework for IPMN risk prediction trained on a unique, multi-center dataset of 1,461 MRI scans from 764 patients. Cyst-X achieves significantly higher accuracy (AUC = 0.82) than both the established Kyoto guidelines (AUC = 0.75) and expert radiologists, particularly in correct identification of high-risk lesions. Clinically, this translates to a 20% increase in cancer detection sensitivity (87.8% vs. 64.1%) for high-risk lesions. We demonstrate that this performance is maintained in a federated learning setting, allowing for collaborative model training without compromising patient privacy. To accelerate research in early pancreatic cancer detection, we publicly release the Cyst-X dataset and models, providing the first large-scale, multi-center MRI resource for pancreatic cyst analysis.

[458] GENRE-CMR: Generalizable Deep Learning for Diverse Multi-Domain Cardiac MRI Reconstruction

Kian Anvari Hamedani, Narges Razizadeh, Shahabedin Nabavi, Mohsen Ebrahimi Moghaddam

Main category: eess.IV

TL;DR: GENRE-CMR is a GAN-based accelerated CMR reconstruction method using residual deep unrolled networks with edge-aware and distribution alignment losses to improve generalization across diverse acquisition settings.

Details

Motivation: Address the trade-off between scan time and image quality in accelerated CMR reconstruction, particularly the challenge of generalizing across diverse acquisition settings and protocols.

Method: Proposes a generative adversarial network with residual deep unrolled reconstruction framework, using Edge-Aware Region (EAR) loss to focus on structurally informative regions and Statistical Distribution Alignment (SDA) loss to regularize feature space across data distributions via symmetric KL divergence.

Result: Achieves state-of-the-art performance with 0.9552 SSIM and 38.90 dB PSNR on unseen distributions across various acceleration factors and sampling trajectories, outperforming existing methods.

Conclusion: GENRE-CMR provides a unified and robust solution for high-quality CMR reconstruction that can adapt across heterogeneous acquisition protocols, enabling clinically viable deployment.

Abstract: Accelerated Cardiovascular Magnetic Resonance (CMR) image reconstruction remains a critical challenge due to the trade-off between scan time and image quality, particularly when generalizing across diverse acquisition settings. We propose GENRE-CMR, a generative adversarial network (GAN)-based architecture employing a residual deep unrolled reconstruction framework to enhance reconstruction fidelity and generalization. The architecture unrolls iterative optimization into a cascade of convolutional subnetworks, enriched with residual connections to enable progressive feature propagation from shallow to deeper stages. To further improve performance, we integrate two loss functions: (1) an Edge-Aware Region (EAR) loss, which guides the network to focus on structurally informative regions and helps prevent common reconstruction blurriness; and (2) a Statistical Distribution Alignment (SDA) loss, which regularizes the feature space across diverse data distributions via a symmetric KL divergence formulation. Extensive experiments confirm that GENRE-CMR surpasses state-of-the-art methods on training and unseen data, achieving 0.9552 SSIM and 38.90 dB PSNR on unseen distributions across various acceleration factors and sampling trajectories. Ablation studies confirm the contribution of each proposed component to reconstruction quality and generalization. Our framework presents a unified and robust solution for high-quality CMR reconstruction, paving the way for clinically adaptable deployment across heterogeneous acquisition protocols.

Today’s Research Highlights

Table of Contents

cs.CL

[1] Iti-Validator: A Guardrail Framework for Validating and Correcting LLM-Generated Itineraries

[2] Dingtalk DeepResearch: A Unified Multi Agent Framework for Adaptive Intelligence in Enterprise Environments

[3] Falcon: A Comprehensive Chinese Text-to-SQL Benchmark for Enterprise-Grade Evaluation

[4] Confidence is Not Competence

[5] Cross-Lingual Summarization as a Black-Box Watermark Removal Attack

[6] SwiftEmbed: Ultra-Fast Text Embeddings via Static Token Lookup for Real-Time Applications

[7] MR-Align: Meta-Reasoning Informed Factuality Alignment for Large Reasoning Models

[8] Large Language Models Report Subjective Experience Under Self-Referential Processing

[9] COMMUNITYNOTES: A Dataset for Exploring the Helpfulness of Fact-Checking Explanations

[10] Evaluating Emotion Recognition in Spoken Language Models on Emotionally Incongruent Speech

[11] ProofSketch: Efficient Verified Reasoning for Large Language Models

[12] Towards a Method for Synthetic Generation of PWA Transcripts

[13] Parallel Loop Transformer for Efficient Test-Time Computation Scaling

[14] Do Large Language Models Grasp The Grammar? Evidence from Grammar-Book-Guided Probing in Luxembourgish

[15] SimulMEGA: MoE Routers are Advanced Policy Makers for Simultaneous Speech Translation

[16] Seeing Through the MiRAGE: Evaluating Multimodal Retrieval Augmented Generation

[17] Idea2Plan: Exploring AI-Powered Research Planning

[18] RiddleBench: A New Generative Reasoning Benchmark for LLMs

[19] Disaggregation Reveals Hidden Training Dynamics: The Case of Agreement Attraction

[20] SemCoT: Accelerating Chain-of-Thought Reasoning through Semantically-Aligned Implicit Tokens

[21] Language Model Behavioral Phases are Consistent Across Architecture, Training Data, and Scale

[22] POWSM: A Phonetic Open Whisper-Style Speech Foundation Model

[23] Emergence of Minimal Circuits for Indirect Object Identification in Attention-Only Transformers

[24] GAPMAP: Mapping Scientific Knowledge Gaps in Biomedical Literature Using Large Language Models

[25] Can LLMs Estimate Cognitive Complexity of Reading Comprehension Items?

[26] TOPol: Capturing and Explaining Multidimensional Semantic Polarity Fields and Vectors

[27] BioCoref: Benchmarking Biomedical Coreference Resolution with LLMs

[28] DEBATE: A Large-Scale Benchmark for Role-Playing LLM Agents in Multi-Agent, Long-Form Debates

[29] Pretraining Strategies using Monolingual and Parallel Data for Low-Resource Machine Translation

[30] A Survey on Unlearning in Large Language Models

[31] Explainable Disentanglement on Discrete Speech Representations for Noise-Robust ASR

[32] Model-Document Protocol for AI Search

[33] Testing Cross-Lingual Text Comprehension In LLMs Using Next Sentence Prediction

[34] ProMediate: A Socio-cognitive framework for evaluating proactive agents in multi-party negotiation

[35] Adapting Small Language Models to Low-Resource Domains: A Case Study in Hindi Tourism QA

[36] Teaching Sarcasm: Few-Shot Multimodal Sarcasm Detection via Distillation to a Parameter-Efficient Student

[37] Parrot: A Training Pipeline Enhances Both Program CoT and Natural Language CoT for Reasoning

[38] CRMWeaver: Building Powerful Business Agent via Agentic RL and Shared Memories

[39] Not ready for the bench: LLM legal interpretation is unstable and out of step with human judgments

[40] CLASS-IT: Conversational and Lecture-Aligned Small-Scale Instruction Tuning for BabyLMs

[41] Monitoring Transformative Technological Convergence Through LLM-Extracted Semantic Entity Triple Graphs

[42] Hallucinations in Bibliographic Recommendation: Citation Frequency as a Proxy for Training Data Redundancy

[43] Roleplaying with Structure: Synthetic Therapist-Client Conversation Generation from Questionnaires

[44] BhashaBench V1: A Comprehensive Benchmark for the Quadrant of Indic Domains

[45] Serve Programs, Not Prompts

[46] Seeing, Signing, and Saying: A Vision-Language Model-Assisted Pipeline for Sign Language Data Acquisition and Curation from Social Media

[47] Implicature in Interaction: Understanding Implicature Improves Alignment in Human-LLM Interaction

[48] RLMEval: Evaluating Research-Level Neural Theorem Proving

[49] Depth and Autonomy: A Framework for Evaluating LLM Applications in Social Science Research

[50] A Critical Study of Automatic Evaluation in Sign Language Translation

[51] Grounded in Reality: Learning and Deploying Proactive LLM from Offline Logs

[52] Fine-Tuned Language Models for Domain-Specific Summarization and Tagging

[53] TwinVoice: A Multi-dimensional Benchmark Towards Digital Twins via LLM Persona Simulation

[54] Communication and Verification in LLM Agents towards Collaboration under Information Asymmetry

[55] FARSIQA: Faithful and Advanced RAG System for Islamic Question Answering

[56] Evaluating the Role of Verifiers in Test-Time Scaling for Legal Reasoning Tasks

[57] Are Language Models Efficient Reasoners? A Perspective from Logic Programming

[58] EHR-R1: A Reasoning-Enhanced Foundational Language Model for Electronic Health Record Analysis

[59] PairUni: Pairwise Training for Unified Multimodal Language Models

[60] Interpreting LLMs as Credit Risk Classifiers: Do Their Feature Explanations Align with Classical ML?

[61] The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution

[62] The Limits of Obliviate: Evaluating Unlearning in LLMs via Stimulus-Knowledge Entanglement-Behavior Framework

[63] Scaling Latent Reasoning via Looped Language Models

[64] Task Completion Agents are Not Ideal Collaborators

[65] DiagramEval: Evaluating LLM-Generated Diagrams via Graphs

[66] Decomposition-Enhanced Training for Post-Hoc Attributions In Language Models

[67] Gaperon: A Peppered English-French Generative Language Model Suite

[68] Large Language Models for Few-Shot Named Entity Recognition

[69] Do predictability factors towards signing avatars hold across cultures?

[70] OpenFactCheck: Building, Benchmarking Customized Fact-Checking Systems and Evaluating the Factuality of Claims and LLMs

[71] RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness

[72] Reliable Evaluation and Benchmarks for Statement Autoformalization

[73] OpenFactCheck: A Unified Framework for Factuality Evaluation of LLMs

[74] Blind Spot Navigation in Large Language Model Reasoning with Thought Space Explorer

[75] Face the Facts! Evaluating RAG-based Pipelines for Professional Fact-Checking

[76] Consistency of Responses and Continuations Generated by Large Language Models on Social Media

[77] Spontaneous Giving and Calculated Greed in Language Models