Daily arXiv Papers - 2026-03-02

AI-enhanced summaries of 12 research papers from arXiv

Editor’s Picks

Top papers matching your research interests in multimodal LLMs, audio and vision understanding/generation.

[1] An Empirical Analysis of Task-Induced Encoder Bias in Fréchet Audio Distance

Wonwoo Jeong

Main category: eess.AS

TL;DR: Analysis of Fréchet Audio Distance (FAD) reveals systematic biases in text-to-audio evaluation due to encoder training tasks, showing trade-offs between different encoders across recall, precision, and alignment dimensions.

Motivation: FAD is the standard for text-to-audio evaluation, but its scores depend on encoder embedding spaces. Different encoders are trained on different tasks (reconstruction, ASR, classification), causing systematic biases in what acoustic features they preserve or discard, making FAD scores incomparable across encoders.

Method: Decompose evaluation into Recall, Precision, and Alignment (with semantic and structural dimensions). Use log-scale normalization for fair cross-encoder comparison. Conduct controlled experiments on six encoders across two datasets to analyze trade-offs.

Result: Reveals four-axis trade-off: AudioMAE (reconstruction-based) leads precision sensitivity; Whisper (ASR-trained) dominates structural detection but is blind to signal degradation; VGGish (classification-trained) maximizes semantic detection but penalizes legitimate intra-class variation. No single encoder serves as universal evaluator.

Conclusion: Future audio evaluation metrics must shift toward evaluation-native encoders that are intrinsically aligned with human perception rather than relying on encoders trained for other tasks.

Abstract: Fréchet Audio Distance (FAD) is the de facto standard for evaluating text-to-audio generation, yet its scores depend on the underlying encoder’s embedding space. An encoder’s training task dictates which acoustic features are preserved or discarded, causing FAD to inherit systematic task-induced biases. We decompose evaluation into Recall, Precision, and Alignment (split into semantic and structural dimensions), using log-scale normalization for fair cross-encoder comparison. Controlled experiments on six encoders across two datasets reveal a four-axis trade-off: reconstruction-based AudioMAE leads precision sensitivity; ASR-trained Whisper dominates structural detection but is blind to signal degradation; classification-trained VGGish maximizes semantic detection but penalizes legitimate intra-class variation. Since no single encoder is a universal evaluator, future metrics must shift toward evaluation-native encoders intrinsically aligned with human perception.
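FAD itself compares two multivariate Gaussians fit to the reference and generated embedding sets. Below is a minimal NumPy sketch of the standard formula, ||μ_r − μ_g||² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^½); the function name is ours, and the paper's log-scale normalization is not shown:

```python
import numpy as np

def frechet_audio_distance(emb_ref, emb_gen):
    """FAD between Gaussians fit to two (n_samples, dim) embedding arrays."""
    mu_r, mu_g = emb_ref.mean(axis=0), emb_gen.mean(axis=0)
    cov_r = np.cov(emb_ref, rowvar=False)
    cov_g = np.cov(emb_gen, rowvar=False)
    # Eigenvalues of cov_r @ cov_g are real and non-negative for PSD
    # covariances, so Tr((cov_r cov_g)^{1/2}) is the sum of their sqrt.
    eigvals = np.linalg.eigvals(cov_r @ cov_g).real
    tr_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)
```

The paper's point is that `emb_ref` and `emb_gen` come from a task-trained encoder, so the resulting distance inherits that encoder's biases no matter how carefully the formula is computed.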

Relevance: 9/10

[2] Hello-Chat: Towards Realistic Social Audio Interactions

Yueran Hou, Peilei Jia, Zihan Sun, Qihang Lu, Wenbing Yang, Yingming Gao, Ya Li, Jun Gao

Main category: cs.SD

TL;DR: Hello-Chat is an end-to-end audio language model that addresses the robotic “read-speech” style in existing LALMs by using real-life conversation data and modality-interleaved training to achieve more natural, emotionally-aligned audio generation.

Motivation: Existing Large Audio Language Models (LALMs) suffer from a disconnect between perception and expression, resulting in robotic "read-speech" that lacks the spontaneity and emotional resonance of real human interaction. The authors aim to create a more realistic, anthropomorphic audio generation model for social scenarios.

Method: The authors introduce Hello-Chat, an end-to-end audio language model that leverages a massive dataset of real-life conversations and employs a modality-interleaved training strategy to bridge the gap between perception and expression.

Result: Hello-Chat achieves state-of-the-art performance on specific audio understanding tasks and significantly outperforms existing baselines in prosodic naturalness and emotional alignment, demonstrating breakthrough anthropomorphic generation capabilities.

Conclusion: Hello-Chat represents a significant advancement toward more realistic and empathetic AI agents by addressing the perception-expression disconnect in audio language models, paving the way for next-generation conversational AI.

Abstract: Recent advancements in Large Audio Language Models (LALMs) have demonstrated exceptional performance in speech recognition and translation. However, existing models often suffer from a disconnect between perception and expression, resulting in a robotic “read-speech” style that lacks the spontaneity and emotional resonance of real human interaction. In this report, we introduce Hello-Chat, an end-to-end audio language model designed for realistic social scenarios. By leveraging a massive dataset of real-life conversations and employing a modality-interleaved training strategy, Hello-Chat achieves a breakthrough in anthropomorphic generation. Experimental results show that our model not only reaches state-of-the-art (SOTA) performance on specific audio understanding tasks but also significantly outperforms existing baselines in prosodic naturalness and emotional alignment, paving the way for the next generation of empathetic AI agents.

Relevance: 9/10

[3] Sonic4D: Spatial Audio Generation for Immersive 4D Scene Exploration

Siyi Xie, Hanxin Zhu, Xinyi Chen, Tianyu He, Xin Li, Zhibo Chen

Main category: cs.SD

TL;DR: Sonic4D is a framework for generating spatial audio synchronized with 4D dynamic scenes, enabling immersive audiovisual experiences by localizing sound sources in 4D scenes and synthesizing physics-based spatial audio.

Motivation: Existing 4D generation methods focus only on visual synthesis while overlooking spatial audio generation, creating a limitation for truly immersive audiovisual experiences. There's a need to bridge this gap between visual and auditory modalities.

Method: Three-stage framework: 1) Generate 4D scene and monaural audio from monocular video using pre-trained models, 2) Localize and track sound sources in 4D scene via pixel-level visual grounding to estimate 3D coordinates, 3) Synthesize spatial audio using physics-based simulation based on estimated sound source locations.

Result: The method generates realistic spatial audio consistent with synthesized 4D scenes in a training-free manner, significantly enhancing immersive experience. Extensive experiments demonstrate effectiveness.

Conclusion: Sonic4D successfully bridges the gap between visual and auditory modalities in 4D generation, enabling truly immersive audiovisual experiences by synchronizing spatial audio with dynamic 3D scenes.

Abstract: Recent advancements in 4D generation have demonstrated its remarkable capability in synthesizing photorealistic renderings of dynamic 3D scenes. However, despite achieving impressive visual performance, almost all existing methods overlook the generation of spatial audio aligned with the corresponding 4D scenes, posing a significant limitation to truly immersive audiovisual experiences. To mitigate this issue, we propose Sonic4D, a novel framework that enables spatial audio generation for immersive exploration of 4D scenes. Specifically, our method is composed of three stages: 1) To capture both the dynamic visual content and raw auditory information from a monocular video, we first employ pre-trained expert models to generate the 4D scene and its corresponding monaural audio. 2) Subsequently, to transform the monaural audio into spatial audio, we localize and track the sound sources within the 4D scene, where their 3D spatial coordinates at different timestamps are estimated via a pixel-level visual grounding strategy. 3) Based on the estimated sound source locations, we further synthesize plausible spatial audio that varies across different viewpoints and timestamps using physics-based simulation. Extensive experiments have demonstrated that our proposed method generates realistic spatial audio consistent with the synthesized 4D scene in a training-free manner, significantly enhancing the immersive experience for users. Generated audio and video examples are available at https://x-drunker.github.io/Sonic4D-project-page.
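To make the physics-based stage concrete: once a source's 3D position is known at each timestamp, a free-field point-source model gives a per-ear delay and gain. This is our simplified illustration of the idea, not the paper's simulator (which would also need head shadowing / HRTFs and room effects):

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, air at roughly 20 °C

def ear_signal_params(source_xyz, ear_xyz):
    """Free-field propagation delay (s) and 1/r amplitude gain for one ear."""
    r = max(math.dist(source_xyz, ear_xyz), 1e-6)  # avoid div-by-zero at the ear
    return r / SPEED_OF_SOUND, 1.0 / r
```

Evaluating this for the left and right ear positions at each timestamp yields the interaural time and level differences that make a tracked source sound localized as the viewpoint moves.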

Relevance: 9/10


Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

Table of Contents

cs.CL

[1] Toward General Semantic Chunking: A Discriminative Framework for Ultra-Long Documents

Kaifeng Wu, Junyan Wu, Qiang Liu, Jiarui Zhang, Wen Xu

Main category: cs.CL

TL;DR: A discriminative topic segmentation model based on Qwen3-0.6B that uses cross-window context fusion and overlapping sliding windows to handle ultra-long documents up to 13k tokens, with vector compression for efficient retrieval.

Motivation: Existing methods for long-document topic segmentation have limitations: traditional discriminative models can't capture document-level semantics due to fixed windows, while generative LLMs are computationally expensive and struggle with long inputs. There's a need for practical, scalable solutions for ultra-long document processing.

Method: Proposes a discriminative segmentation model using Qwen3-0.6B backbone with added cross-window context fusion layer and boundary classification head. Uses overlapping sliding-window strategy to handle documents up to 13k tokens. Also develops vector fusion method with scalar correction to compress ultra-long segment representations into single vectors without semantic loss.

Result: On WIKI-727K dataset, achieves better macro-averaged F1 than three generative models based on Qwen2-0.5B, with two orders of magnitude faster inference speed, significantly improving practicality and scalability of long-document processing.

Conclusion: The proposed discriminative model effectively addresses limitations of existing methods for long-document topic segmentation, offering superior performance, faster inference, and better scalability for ultra-long document processing tasks.

Abstract: Long-document topic segmentation plays an important role in information retrieval and document understanding, yet existing methods still show clear shortcomings in ultra-long text settings. Traditional discriminative models are constrained by fixed windows and cannot model document-level semantics; generative large language models can output paragraph boundaries, but inference is expensive and long inputs are difficult to support. To address these issues, we propose a discriminative segmentation model based on Qwen3-0.6B. On top of the backbone network, we add a cross-window context fusion layer and a boundary classification head, and combine them with an overlapping sliding-window strategy. Our model supports single-pass inputs of up to 13k tokens and can be extended to ultra-long documents for paragraph boundary detection. To further enhance downstream retrieval efficiency, we derive a vector fusion method with scalar correction, which compresses the representation of ultra-long segments into a single vector without semantic loss. Experiments on the Wikipedia long-document topic segmentation dataset WIKI-727K show that, compared with three generative models based on Qwen2-0.5B released by Jina, our method achieves a better macro-averaged F1 and delivers two orders of magnitude faster inference, substantially improving the practicality and scalability of long-document processing.
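The overlapping sliding-window strategy the method relies on can be sketched in a few lines; the window and stride values here are illustrative, not the paper's 13k-token configuration:

```python
def sliding_windows(tokens, window=1024, stride=768):
    """Yield (start, chunk) pairs of overlapping windows over a token list.

    With stride < window, consecutive windows share window - stride tokens,
    so every boundary position is seen with context on both sides.
    """
    i = 0
    while True:
        yield i, tokens[i:i + window]
        if i + window >= len(tokens):
            break
        i += stride
```

A boundary classifier then scores positions inside every window, and predictions in the overlap regions are reconciled, for example by averaging, before thresholding; learning that cross-window fusion is the role of the paper's added context layer.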

[2] Task-Lens: Cross-Task Utility Based Speech Dataset Profiling for Low-Resource Indian Languages

Swati Sharma, Divya V. Sharma, Anubha Gupta

Main category: cs.CL

TL;DR: Task-Lens is a cross-task survey that assesses 50 Indian speech datasets across 26 languages for 9 downstream speech tasks, analyzing metadata suitability and proposing enhancements to unlock their full potential.

Motivation: The need for inclusive speech technologies and multilingual NLP research is growing, but limited awareness of existing task-specific resources in low-resource languages like Indian languages hinders progress. Current surveys typically catalog datasets for single tasks, leaving comprehensive cross-task profiling as an open opportunity.

Method: The authors propose Task-Lens, a cross-task survey methodology that: 1) analyzes which of 50 Indian speech datasets spanning 26 languages contain metadata and properties suitable for 9 downstream speech tasks, 2) proposes task-aligned enhancements to unlock datasets to their full downstream potential, and 3) identifies underserved tasks and Indian languages.

Result: Findings reveal that many Indian speech datasets contain untapped metadata that can support multiple downstream tasks. The survey uncovers cross-task linkages and gaps, enabling researchers to explore broader applicability of existing datasets and prioritize dataset creation for underserved tasks and languages.

Conclusion: Task-Lens provides a framework for cross-task profiling of speech datasets, addressing data scarcity in low-resource languages by revealing untapped potential in existing resources and guiding future dataset development for underserved areas.

Abstract: The rising demand for inclusive speech technologies amplifies the need for multilingual datasets for Natural Language Processing (NLP) research. However, limited awareness of existing task-specific resources in low-resource languages hinders research. This challenge is especially acute in linguistically diverse countries, such as India. Cross-task profiling of existing Indian speech datasets can alleviate the data scarcity challenge. This involves investigating the utility of datasets across multiple downstream tasks rather than focusing on a single task. Prior surveys typically catalogue datasets for a single task, leaving comprehensive cross-task profiling as an open opportunity. Therefore, we propose Task-Lens, a cross-task survey that assesses the readiness of 50 Indian speech datasets spanning 26 languages for nine downstream speech tasks. First, we analyze which datasets contain metadata and properties suitable for specific tasks. Next, we propose task-aligned enhancements to unlock datasets to their full downstream potential. Finally, we identify tasks and Indian languages that are critically underserved by current resources. Our findings reveal that many Indian speech datasets contain untapped metadata that can support multiple downstream tasks. By uncovering cross-task linkages and gaps, Task-Lens enables researchers to explore the broader applicability of existing datasets and to prioritize dataset creation for underserved tasks and languages.

[3] Truncated Step-Level Sampling with Process Rewards for Retrieval-Augmented Reasoning

Chris Samarinas, Haw-Shiuan Chang, Hamed Zamani

Main category: cs.CL

TL;DR: SLATE: A reinforcement learning framework for training LLMs to reason with search engines using truncated step-level sampling and dense LLM-as-judge rewards to reduce gradient variance and improve credit assignment.

Motivation: Existing methods for training LLMs to reason with search engines suffer from credit assignment problems (sparse rewards) or rely on heuristic step-level rewards with high gradient variance, limiting effective learning of multi-step reasoning.

Method: Proposes SLATE with two key components: (1) truncated step-level sampling that generates trajectories sharing common prefixes but differing at next steps, and (2) dense LLM-as-judge rewards where a capable LLM evaluates reasoning steps, search queries, and answers instead of heuristic scoring.

Result: Theoretically proves truncated sampling reduces advantage estimate variance by up to factor T for T-step trajectories. Experiments on seven QA benchmarks show SLATE consistently outperforms sparse-reward and process-reward baselines, with largest gains on harder multi-hop tasks and smaller models.

Conclusion: SLATE effectively addresses credit assignment in search-augmented reasoning by combining truncated sampling for lower variance and LLM-as-judge rewards for richer supervision, enabling more efficient training of LLMs for complex reasoning tasks.

Abstract: Training large language models to reason with search engines via reinforcement learning is hindered by a fundamental credit assignment problem: existing methods such as Search-R1 provide only a sparse outcome reward after an entire multi-step trajectory, making it infeasible to attribute success or failure to individual reasoning and retrieval decisions. Process-reward methods like StepSearch alleviate this by introducing step-level supervision, but rely on heuristic rewards such as TF-IDF overlap with gold documents, and still sample k complete trajectories per example, retaining high gradient variance. We propose SLATE, a framework built on two complementary ideas: (1) truncated step-level sampling, which generates k trajectories that share a common prefix and differ only at the next step, and (2) dense LLM-as-judge rewards, which replace heuristic scoring with a capable LLM evaluator that assesses the quality of each reasoning step, search query, and answer, providing richer and more reliable supervision. We theoretically prove that under the same dense reward structure, truncated sampling reduces the variance of advantage estimates by up to a factor of T compared to full-trajectory sampling for T-step trajectories, yielding lower-variance, better-targeted policy gradients. Experiments on seven QA benchmarks confirm that SLATE consistently outperforms both sparse-reward and process-reward baselines, with the largest gains on harder multi-hop tasks and smaller models.
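The factor-T variance claim has a simple intuition: a full-trajectory return sums T noisy per-step signals, while a truncated sample beyond a shared prefix carries only one step's noise. A toy Monte-Carlo check of that intuition, assuming independent unit-variance step noise (our simplification, not the paper's estimator):

```python
import random

def sample_variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

rng = random.Random(0)
T, n = 10, 20000
# Full-trajectory signal: sum of T independent unit-variance step rewards.
full = [sum(rng.gauss(0.0, 1.0) for _ in range(T)) for _ in range(n)]
# Truncated signal: a single step beyond the shared prefix.
step = [rng.gauss(0.0, 1.0) for _ in range(n)]
ratio = sample_variance(full) / sample_variance(step)  # close to T
```

Real step rewards are correlated across a trajectory, which is why the paper states the reduction as "up to" a factor of T rather than exactly T.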

[4] CiteAudit: You Cited It, But Did You Read It? A Benchmark for Verifying Scientific References in the LLM Era

Zhengqing Yuan, Kaiwen Shi, Zheyuan Zhang, Lichao Sun, Nitesh V. Chawla, Yanfang Ye

Main category: cs.CL

TL;DR: A benchmark and detection framework for hallucinated citations in scientific writing, using multi-agent verification to assess whether cited sources truly support claims, with experiments showing LLMs produce substantial citation errors.

Motivation: Large language models introduce risks of fabricated references that appear plausible but correspond to no real publications, with hallucinated citations already observed in submissions at major ML venues, exposing peer review vulnerabilities. Manual verification is impractical due to growing reference lists, and existing automated tools are fragile to noisy citation formats.

Method: Multi-agent verification pipeline that decomposes citation checking into claim extraction, evidence retrieval, passage matching, reasoning, and calibrated judgment. Constructs large-scale human-validated dataset across domains with unified metrics for citation faithfulness and evidence alignment.

Result: Experiments with state-of-the-art LLMs reveal substantial citation errors. The framework significantly outperforms prior methods in both accuracy and interpretability, providing scalable infrastructure for auditing citations.

Conclusion: This work provides the first comprehensive benchmark and detection framework for hallucinated citations in scientific writing, offering practical tools to improve the trustworthiness of scientific references in the LLM era.

Abstract: Scientific research relies on accurate citation for attribution and integrity, yet large language models (LLMs) introduce a new risk: fabricated references that appear plausible but correspond to no real publications. Such hallucinated citations have already been observed in submissions and accepted papers at major machine learning venues, exposing vulnerabilities in peer review. Meanwhile, rapidly growing reference lists make manual verification impractical, and existing automated tools remain fragile to noisy and heterogeneous citation formats and lack standardized evaluation. We present the first comprehensive benchmark and detection framework for hallucinated citations in scientific writing. Our multi-agent verification pipeline decomposes citation checking into claim extraction, evidence retrieval, passage matching, reasoning, and calibrated judgment to assess whether a cited source truly supports its claim. We construct a large-scale human-validated dataset across domains and define unified metrics for citation faithfulness and evidence alignment. Experiments with state-of-the-art LLMs reveal substantial citation errors and show that our framework significantly outperforms prior methods in both accuracy and interpretability. This work provides the first scalable infrastructure for auditing citations in the LLM era and practical tools to improve the trustworthiness of scientific references.

[5] FHIRPath-QA: Executable Question Answering over FHIR Electronic Health Records

Michael Frew, Nishit Bheda, Bryan Tripp

Main category: cs.CL

TL;DR: FHIRPath-QA: A dataset and benchmark for patient-specific clinical question answering using FHIRPath query synthesis instead of free-text generation, reducing LLM usage and improving safety.

Motivation: Patients need precise, trustworthy answers to clinical questions from their EHRs, but current LLM-based QA approaches are computationally inefficient, prone to hallucination, and difficult to deploy over real EHRs.

Method: Proposes text-to-FHIRPath QA paradigm that shifts reasoning from free-text generation to FHIRPath query synthesis. Creates dataset of 14k+ natural language questions paired with validated FHIRPath queries and answers from MIMIC-IV on FHIR Demo.

Result: State-of-the-art LLMs struggle with patient language ambiguity and perform poorly in FHIRPath query synthesis, but benefit strongly from supervised fine-tuning. Text-to-FHIRPath synthesis shows potential for safe, efficient, interoperable consumer health applications.

Conclusion: FHIRPath query synthesis offers a practical foundation for clinical QA applications, with the dataset serving as a starting point for future research on this text-to-FHIRPath paradigm.

Abstract: Though patients are increasingly granted digital access to their electronic health records (EHRs), existing interfaces may not support precise, trustworthy answers to patient-specific questions. Large language models (LLMs) show promise in clinical question answering (QA), but retrieval-based approaches are computationally inefficient, prone to hallucination, and difficult to deploy over real-life EHRs. In this work, we introduce FHIRPath-QA, the first open dataset and benchmark for patient-specific QA that includes open-standard FHIRPath queries over real-world clinical data. We propose a text-to-FHIRPath QA paradigm that shifts reasoning from free-text generation to FHIRPath query synthesis, significantly reducing LLM usage. Built on MIMIC-IV on FHIR Demo, the dataset pairs over 14k natural language questions in patient and clinician phrasing with validated FHIRPath queries and answers. Further, we demonstrate that state-of-the-art LLMs struggle to deal with ambiguity in patient language and perform poorly in FHIRPath query synthesis. However, they benefit strongly from supervised fine-tuning. Our results highlight that text-to-FHIRPath synthesis has the potential to serve as a practical foundation for safe, efficient, and interoperable consumer health applications, and our dataset and benchmark serve as a starting point for future research on the topic. The full dataset and generation code are available at: https://github.com/mooshifrew/fhirpath-qa.
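For readers unfamiliar with FHIRPath, a question/query pair in the style the paper describes might look like the record below. Both the question wording and the query are our hypothetical illustration, not drawn from the actual dataset:

```python
# Hypothetical text-to-FHIRPath record (illustrative, not from FHIRPath-QA).
# FHIRPath's where() clause filters resource instances by field values.
record = {
    "question": "Which of my medication orders are still active?",
    "fhirpath": "MedicationRequest.where(status = 'active')",
}
```

The key design point is that answering then means executing the synthesized query against the patient's FHIR record, rather than asking an LLM to generate free text that might hallucinate clinical facts.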

[6] IDP Accelerator: Agentic Document Intelligence from Extraction to Compliance Validation

Md Mofijul Islam, Md Sirajus Salekin, Joe King, Priyashree Roy, Vamsi Thilak Gudi, Spencer Romo, Akhil Nooney, Boyi Xie, Bob Strahan, Diego A. Socolinsky

Main category: cs.CL

TL;DR: IDP Accelerator is a framework for intelligent document processing that uses multimodal LLMs and agentic AI to extract structured insights from complex document packets with high accuracy and efficiency.

Motivation: Traditional document processing pipelines struggle with multi-document packets, complex reasoning, and strict compliance requirements. There's a need for an end-to-end solution that can handle unstructured documents in industrial settings while maintaining high accuracy and compliance.

Method: Four-component framework: 1) DocSplit benchmark dataset with multimodal classifier using BIO tagging for document segmentation; 2) Configurable Extraction Module using multimodal LLMs; 3) Agentic Analytics Module with Model Context Protocol for secure data access; 4) Rule Validation Module using LLM-driven logic for compliance checks.

Result: Production deployment at a leading healthcare provider achieved 98% classification accuracy, 80% reduced processing latency, and 77% lower operational costs compared to legacy baselines.

Conclusion: IDP Accelerator provides an effective framework for industrial document intelligence that combines multimodal LLMs with agentic AI to handle complex document processing tasks with high accuracy and efficiency.

Abstract: Understanding and extracting structured insights from unstructured documents remains a foundational challenge in industrial NLP. While Large Language Models (LLMs) enable zero-shot extraction, traditional pipelines often fail to handle multi-document packets, complex reasoning, and strict compliance requirements. We present IDP (Intelligent Document Processing) Accelerator, a framework enabling agentic AI for end-to-end document intelligence with four key components: (1) DocSplit, a novel benchmark dataset and multimodal classifier using BIO tagging to segment complex document packets; (2) configurable Extraction Module leveraging multimodal LLMs to transform unstructured content into structured data; (3) Agentic Analytics Module, compliant with the Model Context Protocol (MCP) providing data access through secure, sandboxed code execution; and (4) Rule Validation Module replacing deterministic engines with LLM-driven logic for complex compliance checks. The interactive demonstration enables users to upload document packets, visualize classification results, and explore extracted data through an intuitive web interface. We demonstrate effectiveness across industries, highlighting a production deployment at a leading healthcare provider achieving 98% classification accuracy, 80% reduced processing latency, and 77% lower operational costs over legacy baselines. IDP Accelerator is open-sourced with a live demonstration available to the community.
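BIO-tagged pages can be decoded into per-document spans with a single pass. The sketch below assumes one 'B'/'I'/'O' tag per page, which is our illustration of BIO segmentation in general, not necessarily the exact DocSplit formulation:

```python
def bio_to_segments(tags):
    """Group per-page BIO tags ('B' begin, 'I' inside, 'O' outside)
    into lists of page indices, one list per detected document."""
    segments, current = [], []
    for i, tag in enumerate(tags):
        if tag == 'B':                 # a new document starts here
            if current:
                segments.append(current)
            current = [i]
        elif tag == 'I' and current:   # continuation of the open document
            current.append(i)
        else:                          # 'O', or a stray 'I' with nothing open
            if current:
                segments.append(current)
                current = []
    if current:
        segments.append(current)
    return segments
```

In a packet-splitting pipeline, each recovered span would then be routed to the extraction module as an independent document.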

[7] Humans and LLMs Diverge on Probabilistic Inferences

Gaurav Kamath, Sreenath Madathil, Sebastian Schuster, Marie-Catherine de Marneffe, Siva Reddy

Main category: cs.CL

TL;DR: The ProbCOPA dataset reveals that LLMs fail to match human probabilistic reasoning patterns on open-ended inferences.

Motivation: To investigate how LLMs handle probabilistic, non-deterministic reasoning compared to humans, since current evaluations focus mainly on logical/mathematical tasks.

Method: Created ProbCOPA dataset with 210 handcrafted probabilistic inferences annotated by 25-30 humans each, then compared human judgments with responses from 8 state-of-the-art reasoning LLMs

Result: Human responses showed graded, varied probabilistic judgments, while LLMs consistently failed to produce human-like distributions and exhibited different reasoning patterns.

Conclusion: Persistent differences exist between human and LLM probabilistic reasoning, highlighting the need to evaluate reasoning beyond deterministic settings.

Abstract: Human reasoning often involves working over limited information to arrive at probabilistic conclusions. In its simplest form, this involves making an inference that is not strictly entailed by a premise, but rather only likely given the premise. While reasoning LLMs have demonstrated strong performance on logical and mathematical tasks, their behavior on such open-ended, non-deterministic inferences remains largely unexplored. We introduce ProbCOPA, a dataset of 210 handcrafted probabilistic inferences in English, each annotated for inference likelihood by 25–30 human participants. We find that human responses are graded and varied, revealing probabilistic judgments of the inferences in our dataset. Comparing these judgments with responses from eight state-of-the-art reasoning LLMs, we show that models consistently fail to produce human-like distributions. Finally, analyzing LLM reasoning chains, we find evidence of a common reasoning pattern used to evaluate such inferences. Our findings reveal persistent differences between humans and LLMs, and underscore the need to evaluate reasoning beyond deterministic settings.

[8] France or Spain or Germany or France: A Neural Account of Non-Redundant Redundant Disjunctions

Sasha Boguraev, Qing Yao, Kyle Mahowald

Main category: cs.CL

TL;DR: The paper examines how seemingly redundant sentences become acceptable in specific contexts, analyzing this phenomenon through neural mechanisms in LLMs rather than just symbolic approaches.

Motivation: To understand why formally redundant sentences like "She will go to France or Spain, or perhaps to Germany or France" become acceptable in certain contexts, and to provide a neural mechanism-based explanation that complements existing symbolic analyses.

Method: Combined behavioral evidence from humans and large language models with analysis of Transformer mechanisms, specifically examining how models bind contextually relevant information to repeated lexical items and how induction heads attend to context-licensed representations.

Result: Found that redundancy avoidance in language models arises from two interacting mechanisms: contextual binding of information to repeated lexical items, and selective attention of Transformer induction heads to these context-licensed representations.

Conclusion: The neural explanation sheds light on mechanisms underlying context-sensitive semantic interpretation and complements existing symbolic analyses of linguistic redundancy phenomena.

Abstract: Sentences like “She will go to France or Spain, or perhaps to Germany or France.” appear formally redundant, yet become acceptable in contexts such as “Mary will go to a philosophy program in France or Spain, or a mathematics program in Germany or France.” While this phenomenon has typically been analyzed using symbolic formal representations, we aim to provide a complementary account grounded in artificial neural mechanisms. We first present new behavioral evidence from humans and large language models demonstrating the robustness of this apparent non-redundancy across contexts. We then show that, in language models, redundancy avoidance arises from two interacting mechanisms: models learn to bind contextually relevant information to repeated lexical items, and Transformer induction heads selectively attend to these context-licensed representations. We argue that this neural explanation sheds light on the mechanisms underlying context-sensitive semantic interpretation, and that it complements existing symbolic analyses.

[9] Multi-Agent Causal Reasoning for Suicide Ideation Detection Through Online Conversations

Jun Li, Xiangmeng Wang, Haoyang Li, Yifei Yan, Shijie Zhang, Hong Va Leong, Ling Feng, Nancy Xiaonan Yu, Qing Li

Main category: cs.CL

TL;DR: A multi-agent causal reasoning framework for suicide risk detection in social media conversations that addresses limitations of existing approaches by generating counterfactual user reactions and mitigating hidden biases like conformity and copycat behavior.

Motivation: Existing suicide risk detection methods on social media have two major limitations: 1) they rely on predefined rules that capture only narrow user interactions, and 2) they overlook hidden influences like user conformity and suicide copycat behavior that affect suicidal expression and propagation in online communities.

Method: Proposes a Multi-Agent Causal Reasoning (MACR) framework with two collaborative agents: a Reasoning Agent that integrates cognitive appraisal theory to generate counterfactual user reactions to posts and analyzes them through cognitive, emotional, and behavioral dimensions; and a Bias-aware Decision-Making Agent that mitigates hidden biases through front-door adjustment using the counterfactual reactions.

Result: Extensive experiments on real-world conversational datasets demonstrate the effectiveness and robustness of MACR in identifying suicide risk.

Conclusion: The MACR framework not only alleviates hidden biases but also enriches contextual information of user interactions with counterfactual knowledge, improving suicide risk detection in social media conversations.

Abstract: Suicide remains a pressing global public health concern. While social media platforms offer opportunities for early risk detection through online conversation trees, existing approaches face two major limitations: (1) They rely on predefined rules (e.g., quotes or replies) to log conversations that capture only a narrow spectrum of user interactions, and (2) They overlook hidden influences such as user conformity and suicide copycat behavior, which can significantly affect suicidal expression and propagation in online communities. To address these limitations, we propose a Multi-Agent Causal Reasoning (MACR) framework that collaboratively employs a Reasoning Agent to scale user interactions and a Bias-aware Decision-Making Agent to mitigate harmful biases arising from hidden influences. The Reasoning Agent integrates cognitive appraisal theory to generate counterfactual user reactions to posts, thereby scaling user interactions. It analyzes these reactions through structured dimensions, i.e., cognitive, emotional, and behavioral patterns, with a dedicated sub-agent responsible for each dimension. The Bias-aware Decision-Making Agent mitigates hidden biases through a front-door adjustment strategy, leveraging the counterfactual user reactions produced by the Reasoning Agent. Through the collaboration of reasoning and bias-aware decision making, the proposed MACR framework not only alleviates hidden biases, but also enriches contextual information of user interactions with counterfactual knowledge. Extensive experiments on real-world conversational datasets demonstrate the effectiveness and robustness of MACR in identifying suicide risk.
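The front-door adjustment the Bias-aware agent applies is a standard causal-inference identity: P(Y | do(X)) = Σ_m P(m | X) Σ_x' P(Y | x', m) P(x'). A minimal numerical sketch on binary variables, with all probabilities toy values unrelated to the paper's data:

```python
# Front-door adjustment on binary X (post), M (mediator, e.g. a
# counterfactual reaction), Y (risk label). All probabilities are toy values.
# Identity: P(Y=1 | do(X=x)) = sum_m P(m|x) * sum_x' P(Y=1|x',m) * P(x')

P_x = {0: 0.6, 1: 0.4}                      # P(X = x)
P_m_given_x = {(0, 0): 0.7, (1, 0): 0.3,    # P(M = m | X = x), keyed (m, x)
               (0, 1): 0.2, (1, 1): 0.8}
P_y_given_xm = {(0, 0): 0.9, (0, 1): 0.6,   # P(Y = 1 | X = x, M = m), keyed (x, m)
                (1, 0): 0.5, (1, 1): 0.2}

def front_door(x):
    """P(Y=1 | do(X=x)) via the front-door formula."""
    total = 0.0
    for m in (0, 1):
        inner = sum(P_y_given_xm[(xp, m)] * P_x[xp] for xp in (0, 1))
        total += P_m_given_x[(m, x)] * inner
    return total

print(front_door(1) - front_door(0))  # interventional effect, ~= -0.15 here
```

The key point is that the outer sum runs over the mediator while the inner sum marginalizes over the *unconditioned* treatment distribution, which is what blocks the hidden confounder.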

[10] BRIDGE the Gap: Mitigating Bias Amplification in Automated Scoring of English Language Learners via Inter-group Data Augmentation

Yun Wang, Xuansheng Wu, Jingyuan Huang, Lei Liu, Xiaoming Zhai, Ninghao Liu

Main category: cs.CL

TL;DR: BRIDGE framework reduces bias in automated essay scoring by generating synthetic high-scoring ELL samples through content-pasting from non-ELL essays while preserving authentic ELL linguistic patterns.

DetailsMotivation: Automated scoring systems using deep learning/LLMs risk amplifying biases against underrepresented groups like English Language Learners (ELLs), particularly due to representation bias where scarce high-scoring ELL samples cause models to favor majority linguistic patterns, leading to unfair under-prediction of ELL students.

Method: BRIDGE (Bias-Reducing Inter-group Data GEneration) synthesizes high-scoring ELL samples by “pasting” construct-relevant content (rubric-aligned knowledge/evidence) from abundant high-scoring non-ELL samples into authentic ELL linguistic patterns, with a discriminator model to ensure synthetic sample quality.

Result: Experiments on California Science Test datasets show BRIDGE effectively reduces prediction bias for high-scoring ELL students while maintaining overall scoring performance, achieving fairness gains comparable to using additional real human data.

Conclusion: BRIDGE offers a cost-effective solution for ensuring equitable automated scoring in large-scale assessments by mitigating representation bias through synthetic data generation rather than relying on scarce minority samples.

Abstract: In the field of educational assessment, automated scoring systems increasingly rely on deep learning and large language models (LLMs). However, these systems face significant risks of bias amplification, where model prediction gaps between student groups become larger than those observed in training data. This issue is especially severe for underrepresented groups such as English Language Learners (ELLs), as models may inherit and further magnify existing disparities in the data. We identify that this issue is closely tied to representation bias: the scarcity of minority (high-scoring ELL) samples makes models trained with empirical risk minimization favor majority (non-ELL) linguistic patterns. Consequently, models tend to under-predict ELL students even when they demonstrate comparable domain knowledge but use different linguistic patterns, thereby undermining the fairness of automated scoring outcomes. To mitigate this, we propose BRIDGE, a Bias-Reducing Inter-group Data GEneration framework designed for low-resource assessment settings. Instead of relying on the limited minority samples, BRIDGE synthesizes high-scoring ELL samples by “pasting” construct-relevant (i.e., rubric-aligned knowledge and evidence) content from abundant high-scoring non-ELL samples into authentic ELL linguistic patterns. We further introduce a discriminator model to ensure the quality of synthetic samples. Experiments on California Science Test (CAST) datasets demonstrate that BRIDGE effectively reduces prediction bias for high-scoring ELL students while maintaining overall scoring performance. Notably, our method achieves fairness gains comparable to using additional real human data, offering a cost-effective solution for ensuring equitable scoring in large-scale assessments.

[11] LFQA-HP-1M: A Large-Scale Human Preference Dataset for Long-Form Question Answering

Rafid Ishrak Jahan, Fahmid Shahriar Iqbal, Sagnik Ray Choudhury

Main category: cs.CL

TL;DR: LFQA-HP-1M: A large-scale dataset of 1.3M human pairwise preference annotations for long-form question answering with rubric-based evaluation framework.

DetailsMotivation: Existing evaluation metrics for long-form question answering often fail to reflect human judgment, creating a need for better evaluation frameworks and datasets.

Method: Created LFQA-HP-1M dataset with 1.3M human pairwise preference annotations, proposed nine rubrics for answer quality evaluation, and compared simple linear models with LLM evaluators while examining biases.

Result: Simple linear models based on rubric features perform comparably to state-of-the-art LLM evaluators; identified transitivity consistency issues, positional bias, verbosity biases, and vulnerability to adversarial perturbations in LLM evaluators.

Conclusion: Provides one of the largest public LFQA preference datasets and a rubric-driven framework for transparent and reliable evaluation of long-form question answering.

Abstract: Long-form question answering (LFQA) demands nuanced evaluation of multi-sentence explanatory responses, yet existing metrics often fail to reflect human judgment. We present LFQA-HP-1M, a large-scale dataset comprising 1.3M human pairwise preference annotations for LFQA. We propose nine rubrics for answer quality evaluation, and show that simple linear models based on these features perform comparably to state-of-the-art LLM evaluators. We further examine transitivity consistency, positional bias, and verbosity biases in LLM evaluators and demonstrate their vulnerability to adversarial perturbations. Overall, this work provides one of the largest public LFQA preference datasets and a rubric-driven framework for transparent and reliable evaluation.
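The finding that simple linear models over rubric features match LLM evaluators can be sketched as logistic regression on rubric-score differences: given rubric vectors r_a and r_b for two answers, predict P(a preferred) = sigmoid(w · (r_a − r_b)). The data below is synthetic and the setup illustrative, not the paper's exact pipeline:

```python
import numpy as np

# Pairwise preference as logistic regression over rubric-score differences.
# Nine features, echoing the paper's nine rubrics; all data is synthetic.
rng = np.random.default_rng(0)
n, k = 2000, 9
true_w = rng.normal(size=k)              # hidden "annotator" weights
diffs = rng.normal(size=(n, k))          # r_a - r_b for each answer pair
labels = (diffs @ true_w + 0.3 * rng.normal(size=n) > 0).astype(float)

w = np.zeros(k)
for _ in range(500):                     # plain gradient ascent on log-likelihood
    p = 1.0 / (1.0 + np.exp(-(diffs @ w)))
    w += 0.1 * diffs.T @ (labels - p) / n

acc = ((diffs @ w > 0) == (labels == 1)).mean()
print(f"pairwise accuracy: {acc:.2f}")
```

A model of this form is also transitive by construction (a fixed w induces a total order over answers), which is exactly the consistency property the paper finds LLM evaluators can violate.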

[12] LLM-Driven Multi-Turn Task-Oriented Dialogue Synthesis for Realistic Reasoning

Yu Zhu, Kai Yang

Main category: cs.CL

TL;DR: LLM-driven framework for synthesizing realistic multi-turn task-oriented dialogues to create challenging reasoning benchmarks, addressing limitations of existing simplistic datasets.

DetailsMotivation: Existing reasoning benchmarks are too simplistic and disconnected from real-world scenarios, lacking complexity, domain constraints, and operational rules. Data contamination and labor-intensive crowdsourcing further limit effective evaluation of LLM reasoning in practical contexts.

Method: Proposes an LLM-driven framework using trilevel optimization to synthesize multi-turn, task-oriented dialogues grounded in realistic reasoning scenarios. Generates dialogues with authentic task contexts, real-world information, and contextual coherence, then designs corresponding reasoning tasks that are iteratively refined.

Result: The synthetic data-based reasoning tasks introduce non-trivial reasoning challenges and provide meaningful support for improving LLM reasoning capabilities. The resulting dataset serves as a valuable benchmark for assessing realistic logical reasoning in LLMs.

Conclusion: The proposed framework effectively addresses limitations of existing reasoning benchmarks by generating realistic, complex task-oriented dialogues that better evaluate and enhance LLM reasoning in practical contexts.

Abstract: The reasoning capability of large language models (LLMs), defined as their ability to analyze, infer, and make decisions based on input information, is essential for building intelligent task-oriented dialogue systems. However, existing benchmarks do not sufficiently reflect the complexity of real-world scenarios, which limits their effectiveness in evaluating and enhancing LLM reasoning in practical contexts. Many current reasoning datasets are overly simplistic and abstract, often disconnected from realistic task flows, domain constraints, and operational rules, making it difficult to effectively evaluate LLMs’ logical reasoning ability. In addition, data contamination from pretraining corpora undermines the reliability of evaluation results, and traditional crowdsourcing methods for dataset construction are labor-intensive and difficult to scale. To address these challenges, we propose an LLM-driven framework for synthesizing multi-turn, task-oriented dialogues grounded in realistic reasoning scenarios, leveraging trilevel optimization to enhance dialogue quality. Our method generates dialogues grounded in authentic task scenarios, enriched with real-world information, and exhibiting strong contextual coherence. Corresponding reasoning tasks are carefully designed around these dialogues and iteratively refined to continuously improve the tasks’ quality and challenge. The resulting dataset serves as a valuable benchmark for assessing and advancing the realistic logical reasoning capabilities of LLMs. Experimental results show that our synthetic data-based reasoning tasks introduce non-trivial reasoning challenges and provide meaningful support for improving the reasoning capabilities of LLMs.

[13] TRIZ-RAGNER: A Retrieval-Augmented Large Language Model for TRIZ-Aware Named Entity Recognition in Patent-Based Contradiction Mining

Zitong Xu, Yuqing Wu, Yue Zhao

Main category: cs.CL

TL;DR: TRIZ-RAGNER: A retrieval-augmented LLM framework for extracting TRIZ contradiction parameters from patents using semantic NER with TRIZ knowledge grounding.

DetailsMotivation: Existing TRIZ contradiction mining approaches struggle with semantic ambiguity, domain dependency, and limited generalization in patent analysis. LLMs show promise but suffer from hallucination and lack TRIZ knowledge grounding.

Method: Proposes TRIZ-RAGNER framework that reformulates contradiction mining as semantic NER, integrating dense retrieval over TRIZ knowledge base, cross-encoder reranking, and structured LLM prompting to extract improving/worsening parameters.

Result: Achieves 85.6% precision, 82.9% recall, and 84.2% F1-score on PaTRIZ dataset, outperforming traditional sequence labeling models and LLM baselines by 7.3 percentage points F1 improvement over best baseline.

Conclusion: Retrieval-augmented TRIZ knowledge grounding effectively reduces semantic noise and improves extraction consistency for robust patent-based contradiction mining.

Abstract: TRIZ-based contradiction mining is a fundamental task in patent analysis and systematic innovation, as it enables the identification of improving and worsening technical parameters that drive inventive problem solving. However, existing approaches largely rely on rule-based systems or traditional machine learning models, which struggle with semantic ambiguity, domain dependency, and limited generalization when processing complex patent language. Recently, large language models (LLMs) have shown strong semantic understanding capabilities, yet their direct application to TRIZ parameter extraction remains challenging due to hallucination and insufficient grounding in structured TRIZ knowledge. To address these limitations, this paper proposes TRIZ-RAGNER, a retrieval-augmented large language model framework for TRIZ-aware named entity recognition in patent-based contradiction mining. TRIZ-RAGNER reformulates contradiction mining as a semantic-level NER task and integrates dense retrieval over a TRIZ knowledge base, cross-encoder reranking for context refinement, and structured LLM prompting to extract improving and worsening parameters from patent sentences. By injecting domain-specific TRIZ knowledge into the LLM reasoning process, the proposed framework effectively reduces semantic noise and improves extraction consistency. Experiments on the PaTRIZ dataset demonstrate that TRIZ-RAGNER consistently outperforms traditional sequence labeling models and LLM-based baselines. The proposed framework achieves a precision of 85.6%, a recall of 82.9%, and an F1-score of 84.2% in TRIZ contradiction pair identification. Compared with the strongest baseline using prompt-enhanced GPT, TRIZ-RAGNER yields an absolute F1-score improvement of 7.3 percentage points, confirming the effectiveness of retrieval-augmented TRIZ knowledge grounding for robust and accurate patent-based contradiction mining.
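A quick sanity check on the reported numbers: F1 is the harmonic mean of precision and recall, and the reported 84.2% is consistent with 85.6% precision and 82.9% recall:

```python
# F1 = 2PR / (P + R); check the reported TRIZ-RAGNER numbers for consistency.
precision, recall = 0.856, 0.829
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.3f}")  # F1 = 0.842, matching the reported 84.2%
```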

[14] From Static Benchmarks to Dynamic Protocol: Agent-Centric Text Anomaly Detection for Evaluating LLM Reasoning

Seungdong Yoa, Sanghyu Yoon, Suhee Yoon, Dongmin Kim, Ye Seul Sim, Junhyun Lee, Woohyung Lim

Main category: cs.CL

TL;DR: Agent-centric benchmarking paradigm using autonomous agents to dynamically generate, validate, and solve problems for evaluating LLMs beyond static datasets.

DetailsMotivation: Static datasets for LLM evaluation have limited scalability and fail to capture evolving reasoning capabilities; need dynamic approaches that can adapt to model improvements.

Method: Three-agent system: teacher generates problems, orchestrator validates them and guards against attacks, student solves problems. Invalid problems are revised; successful solutions trigger more challenging variants. Uses text anomaly detection format requiring cross-sentence logical inference.

Result: Protocol systematically exposes corner-case reasoning errors that conventional benchmarks miss, enables progressive evaluation without manual curation, and scales difficulty automatically with agent capabilities.

Conclusion: Shifting from fixed datasets to dynamic protocols offers sustainable evaluation for evolving LLMs and introduces co-evolution of agent-centric benchmarks as a research agenda.

Abstract: The evaluation of large language models (LLMs) has predominantly relied on static datasets, which offer limited scalability and fail to capture the evolving reasoning capabilities of recent models. To overcome these limitations, we propose an agent-centric benchmarking paradigm that moves beyond static datasets by introducing a dynamic protocol in which autonomous agents iteratively generate, validate, and solve problems. Within this protocol, a teacher agent generates candidate problems, an orchestrator agent rigorously verifies their validity and guards against adversarial attacks, and a student agent attempts to solve the validated problems. An invalid problem is revised by the teacher agent until it passes validation. If the student correctly solves the problem, the orchestrator prompts the teacher to generate more challenging variants. Consequently, the benchmark scales in difficulty automatically as more capable agents are substituted into any role, enabling progressive evaluation of large language models without manually curated datasets. Adopting text anomaly detection as our primary evaluation format, which demands cross-sentence logical inference and resists pattern-matching shortcuts, we demonstrate that this protocol systematically exposes corner-case reasoning errors that conventional benchmarks fail to reveal. We further advocate evaluating systems along several complementary axes including cross-model pairwise performance and progress between the initial and orchestrator-finalized problems. By shifting the focus from fixed datasets to dynamic protocols, our approach offers a sustainable direction for evaluating ever-evolving language models and introduces a research agenda centered on the co-evolution of agent-centric benchmarks.

[15] Structured Prompt Optimization for Few-Shot Text Classification via Semantic Alignment in Latent Space

Jiasen Zheng, Zijun Zhou, Huajun Zhang, Junjiang Lin, Jingyun Jia, Qi Wang

Main category: cs.CL

TL;DR: Proposes a structured prompt optimization framework for few-shot text classification that addresses semantic entanglement, unclear label structure, and insufficient feature representation through multi-dimensional semantic factors and cross-space alignment.

DetailsMotivation: Addresses challenges in few-shot text classification: semantic entanglement (confused representations), unclear label structure (ambiguous boundaries), and insufficient feature representation (poor adaptation to low-resource conditions).

Method: Uses pretrained language model for text encoding, introduces structured prompts with multi-dimensional semantic factors, integrates via learnable combination mechanism, constructs structured label embedding matrix, employs cross-space alignment, and applies prompt orthogonality constraints with joint optimization.

Result: Effectively alleviates semantic conflicts and label ambiguity, significantly improves performance on accuracy, precision, recall, and AUC metrics, and demonstrates strong cross-task applicability with robust stability across sensitivity experiments.

Conclusion: The structured prompt optimization framework enhances semantic understanding and task adaptation in few-shot text classification by providing transparent and controllable guidance through multi-dimensional semantic factors and cross-space alignment mechanisms.

Abstract: This study addresses the issues of semantic entanglement, unclear label structure, and insufficient feature representation in few-shot text classification, and proposes an optimization framework based on structured prompts to enhance semantic understanding and task adaptation under low-resource conditions. The framework first uses a pretrained language model to encode the input text and obtain basic semantic representations. It then introduces structured prompts composed of multi-dimensional semantic factors and integrates them with text features through a learnable combination mechanism, which forms task-related representations with clear boundaries in the latent space. To further strengthen the consistency between text representations and label semantics, the method constructs a structured label embedding matrix and employs a cross-space alignment mechanism to ensure stable matching between textual features and label attributes. In addition, the model applies prompt orthogonality constraints and a joint optimization objective to maintain independence across different semantic factors in the prompts, allowing the structured prompts to provide transparent and controllable guidance for classification decisions. Three types of sensitivity experiments, including learning rate sensitivity, prompt length sensitivity, and data scale sensitivity, are designed to evaluate the stability and robustness of the framework under different conditions. Experimental results show that the proposed structured prompt optimization framework effectively alleviates semantic conflicts and label ambiguity in few-shot text classification. It significantly improves performance on accuracy, precision, recall, and AUC, and demonstrates strong cross-task applicability.
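The prompt orthogonality constraint described above is commonly implemented as a Frobenius-norm penalty on the Gram matrix of the factor embeddings; the paper's exact loss is not specified, so the following is a generic sketch:

```python
import numpy as np

# Generic orthogonality penalty for prompt factors: penalize
# || P P^T - I ||_F^2, where row i of P is the (L2-normalized) embedding
# of semantic factor i. This is a common formulation, not necessarily
# the paper's exact loss term.

def orthogonality_penalty(P):
    P = P / np.linalg.norm(P, axis=1, keepdims=True)   # normalize rows
    gram = P @ P.T                                     # pairwise cosine similarities
    return np.sum((gram - np.eye(P.shape[0])) ** 2)

rng = np.random.default_rng(0)
factors = rng.normal(size=(4, 64))            # 4 semantic factors, dim 64
print(orthogonality_penalty(factors))         # > 0 for random factors

q, _ = np.linalg.qr(rng.normal(size=(64, 4))) # orthonormal columns
print(orthogonality_penalty(q.T))             # ~0 for orthonormal factors
```

Driving this penalty toward zero decorrelates the factors, which is what lets each one act as an independent, interpretable axis of guidance.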

[16] Divide and Conquer: Accelerating Diffusion-Based Large Language Models via Adaptive Parallel Decoding

Xiangzhong Luo, Yilin An, Zhicheng Yu, Weichen Liu, Xu Yang

Main category: cs.CL

TL;DR: DiCo introduces an adaptive parallel decoding approach for diffusion-based LLMs using a three-phase divide-and-conquer paradigm to achieve inference speedups while maintaining generation quality.

DetailsMotivation: Current diffusion-based LLMs (dLLMs) have a gap between theoretical parallelism (multiple tokens per step) and practical performance (still using one-token-per-step generation), which limits their speed advantages over autoregressive LLMs.

Method: Three-phase approach: 1) Divide phase identifies seed tokens and expands them into local clusters; 2) Conquer phase performs parallel decoding across clusters; 3) Finalize phase uses fine-grained compound decoding for remaining tokens. The process alternates between Divide and Conquer until convergence.

Result: Extensive experiments show DiCo achieves significant inference speedups while maintaining competitive generation quality compared to existing approaches.

Conclusion: DiCo successfully bridges the gap between theoretical parallelism and practical performance in dLLMs through adaptive parallel decoding, enabling faster inference without sacrificing quality.

Abstract: Diffusion-based large language models (dLLMs) have shown promising performance across various reasoning tasks, establishing themselves as an alternative to autoregressive large language models (LLMs). Unlike autoregressive LLMs that generate one token per step based on all previous tokens, dLLMs theoretically enable parallel generation of multiple tokens at each decoding step. However, recent dLLMs still favor one-token-per-step generation in practice, as directly decoding multiple masked tokens often leads to degraded generation quality and stability. This reveals a substantial gap between the theoretical parallelism and practical performance of dLLMs. To bridge this gap, we introduce an adaptive parallel decoding approach, namely DiCo, which features a three-phase divide-and-conquer paradigm to unleash the inherent parallelism of dLLMs. During the Divide phase, DiCo first explores the input masked sequence and identifies masked tokens as seed tokens, which are then expanded to construct a set of local clusters. During the Conquer phase, DiCo performs parallel decoding across different local clusters constructed in the Divide phase. The divide-and-conquer process repeatedly alternates between the Divide and Conquer phases until convergence. During the Finalize phase, DiCo decodes the remaining few masked tokens using an effective fine-grained compound decoding scheme to finalize the generation. Extensive experiments demonstrate that DiCo can achieve significant inference speedups while maintaining competitive generation quality.
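The Divide phase can be pictured as seeding on high-confidence masked positions and growing local clusters around them. The confidence values, cluster radius, and seed-selection rule below are all illustrative; the paper's exact criteria are not specified:

```python
# Toy sketch of a Divide step: pick the most confident masked positions as
# seeds, then expand each seed into a cluster of nearby masked positions.
# All thresholds and the neighborhood rule are hypothetical.

def divide(masked_positions, confidence, radius=1, top_k=2):
    seeds = sorted(masked_positions, key=lambda i: -confidence[i])[:top_k]
    clusters, taken = [], set()
    for s in seeds:
        cluster = [i for i in masked_positions
                   if abs(i - s) <= radius and i not in taken]
        taken.update(cluster)
        clusters.append(sorted(cluster))
    return clusters  # each cluster is then decoded in parallel (Conquer)

masked = [2, 3, 4, 7, 8, 11]
conf = {2: 0.9, 3: 0.4, 4: 0.3, 7: 0.8, 8: 0.5, 11: 0.2}
print(divide(masked, conf))  # [[2, 3], [7, 8]]
```

Positions left outside every cluster (here, 4 and 11) are what the Finalize phase would sweep up with its fine-grained compound decoding.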

[17] GLUScope: A Tool for Analyzing GLU Neurons in Transformer Language Models

Sebastian Gerstner, Hinrich Schütze

Main category: cs.CL

TL;DR: GLUScope is an open-source tool for analyzing neurons in Transformer-based language models, specifically designed for newer models with gated activation functions like SwiGLU, providing insights into different sign combinations of gate and activation values.

DetailsMotivation: Previous interpretability tools focus on older models and don't adequately handle newer architectures with gated activation functions, which require understanding both gate and activation sign combinations rather than just positive activations.

Method: Developed an open-source tool that analyzes neurons in Transformer models with gated activations, showing text examples for each of the four possible sign combinations (gate/activation: ++, +-, -+, --) and tracking frequency of each combination.

Result: GLUScope successfully analyzes neurons in models with gated activations, revealing that different sign combinations can correspond to distinct functionalities, leading to novel insights about model behavior.

Conclusion: GLUScope addresses the need for interpretability tools that work with modern Transformer architectures featuring gated activation functions, providing researchers with better understanding of neuron behavior in these models.

Abstract: We present GLUScope, an open-source tool for analyzing neurons in Transformer-based language models, intended for interpretability researchers. We focus on more recent models than previous tools do; specifically we consider gated activation functions such as SwiGLU. This introduces a new challenge: understanding positive activations is not enough. Instead, both the gate and the input activation of a neuron can be positive or negative, leading to four different possible sign combinations that in some cases have quite different functionalities. Accordingly, for any neuron, our tool shows text examples for each of the four sign combinations, and indicates how often each combination occurs. We describe examples of how our tool can lead to novel insights. A demo is available at https://sjgerstner.github.io/gluscope.

[18] CLFEC: A New Task for Unified Linguistic and Factual Error Correction in paragraph-level Chinese Professional Writing

Jian Kai, Zidong Zhang, Jiwen Chen, Zhengxiang Wu, Songtao Sun, Fuyang Li, Yang Cao, Qiang Liu

Main category: cs.CL

TL;DR: CLFEC introduces a new task for joint linguistic and factual error correction in Chinese professional writing, with a multi-domain dataset and systematic study of LLM-based correction methods.

DetailsMotivation: Traditional Chinese text correction focuses separately on spelling/grammar vs. factual errors, but in professional writing these errors co-occur and interact, requiring unified correction approaches.

Method: Created CLFEC dataset spanning current affairs, finance, law, and medicine domains; systematically studied LLM-based correction paradigms including prompting, retrieval-augmented generation (RAG), and agentic workflows.

Result: Found that joint handling of linguistic and factual errors outperforms decoupled processes; agentic workflows can be effective with suitable backbone models; identified challenges including limited generalization, evidence grounding needs, mixed-error difficulty, and over-correction.

Conclusion: The dataset and empirical findings provide guidance for building reliable, fully automatic proofreading systems in industrial settings, showing unified correction is both necessary and feasible.

Abstract: Chinese text correction has traditionally focused on spelling and grammar, while factual error correction is usually treated separately. However, in paragraph-level Chinese professional writing, linguistic (word/grammar/punctuation) and factual errors frequently co-occur and interact, making unified correction both necessary and challenging. This paper introduces CLFEC (Chinese Linguistic & Factual Error Correction), a new task for joint linguistic and factual correction. We construct a mixed, multi-domain Chinese professional writing dataset spanning current affairs, finance, law, and medicine. We then conduct a systematic study of LLM-based correction paradigms, from prompting to retrieval-augmented generation (RAG) and agentic workflows. The analysis reveals practical challenges, including limited generalization of specialized correction models, the need for evidence grounding for factual repair, the difficulty of mixed-error paragraphs, and over-correction on clean inputs. Results further show that handling linguistic and factual errors within the same context outperforms decoupled processes, and that agentic workflows can be effective with suitable backbone models. Overall, our dataset and empirical findings provide guidance for building reliable, fully automatic proofreading systems in industrial settings.

[19] The Astonishing Ability of Large Language Models to Parse Jabberwockified Language

Gary Lupyan, Senyi Yang

Main category: cs.CL

TL;DR: LLMs can recover meaning from severely degraded English texts where content words are replaced with nonsense strings, demonstrating superhuman ability to use structural cues for semantic inference.

DetailsMotivation: To investigate how much meaning can be recovered from structurally intact but lexically nonsensical texts, and to understand the degree to which structural cues constrain lexical meaning in language processing.

Method: Present LLMs with “Jabberwockified” English texts where content words are randomly substituted with nonsense strings while preserving structural elements like morphosyntax and closed-class words, then evaluate the models’ ability to translate these back to conventional English.

Result: LLMs demonstrate astonishing ability to recover meaning from severely degraded texts, often producing translations close to the original text, showing that structural cues constrain lexical meaning to a much larger degree than previously imagined.

Conclusion: LLMs’ superhuman abilities in processing degraded language reveal tight integration between syntax, lexical semantics, and world knowledge, providing insights into efficient language processing in both biological and artificial systems.

Abstract: We show that large language models (LLMs) have an astonishing ability to recover meaning from severely degraded English texts. Texts in which content words have been randomly substituted by nonsense strings, e.g., “At the ghybe of the swuint, we are haiveed to Wourge Phrear-gwurr, who sproles into an ghitch flount with his crurp”, can be translated to conventional English that is, in many cases, close to the original text, e.g., “At the start of the story, we meet a man, Chow, who moves into an apartment building with his wife.” These results show that structural cues (e.g., morphosyntax, closed-class words) constrain lexical meaning to a much larger degree than imagined. Although the abilities of LLMs to make sense of “Jabberwockified” English are clearly superhuman, they are highly relevant to understanding linguistic structure and suggest that efficient language processing either in biological or artificial systems likely benefits from very tight integration between syntax, lexical semantics, and general world knowledge.

[20] Benchmarking BERT-based Models for Sentence-level Topic Classification in Nepali Language

Nischal Karki, Bipesh Subedi, Prakash Poudyal, Rupak Raj Ghimire, Bal Krishna Bal

Main category: cs.CL

TL;DR: Benchmarking study comparing multilingual, Indic, Hindi, and Nepali BERT variants for Nepali topic classification, finding Indic models (especially MuRIL-large) perform best with 90.60% F1-score.

DetailsMotivation: Nepali is a low-resource language written in Devanagari script that remains relatively underexplored in NLP, despite significant advances in transformer-based models like BERT for many languages.

Method: Fine-tuned and tested ten pre-trained models (mBERT, XLM-R, MuRIL, DevBERT, HindiBERT, IndicBERT, NepBERTa) on a balanced Nepali dataset of 25,006 sentences across five conceptual domains, evaluating with accuracy, weighted precision, recall, F1-score, and AUROC metrics.

Result: Indic models, particularly MuRIL-large, achieved the highest F1-score of 90.60%, outperforming multilingual and monolingual models. NepBERTa also performed competitively with 88.26% F1-score.

Conclusion: The study establishes a robust baseline for future document-level classification and broader Nepali NLP applications, demonstrating the effectiveness of Indic language models for low-resource languages like Nepali.

Abstract: Transformer-based models such as BERT have significantly advanced Natural Language Processing (NLP) across many languages. However, Nepali, a low-resource language written in Devanagari script, remains relatively underexplored. This study benchmarks multilingual, Indic, Hindi, and Nepali BERT variants to evaluate their effectiveness in Nepali topic classification. Ten pre-trained models, including mBERT, XLM-R, MuRIL, DevBERT, HindiBERT, IndicBERT, and NepBERTa, were fine-tuned and tested on a balanced Nepali dataset containing 25,006 sentences across five conceptual domains, and performance was evaluated using accuracy, weighted precision, recall, F1-score, and AUROC metrics. The results reveal that Indic models, particularly MuRIL-large, achieved the highest F1-score of 90.60%, outperforming multilingual and monolingual models. NepBERTa also performed competitively with an F1-score of 88.26%. Overall, these findings establish a robust baseline for future document-level classification and broader Nepali NLP applications.

[21] EDDA-Coordinata: An Annotated Dataset of Historical Geographic Coordinates

Ludovic Moncla, Pierre Nugues, Thierry Joliveau, Katherine McDonough

Main category: cs.CL

TL;DR: A dataset and pipeline for extracting geographic coordinates from historical texts using transformer models, tested on 18th-century French encyclopedias and cross-domain dictionaries.

Motivation: Automatically recovering geographic coordinates from historical texts is challenging due to varied expressions and precision levels. The authors aim to improve coordinate retrieval from digitized early modern texts by creating a gold standard dataset and training models.

Method: Created a gold standard dataset from 15,278 geographical entries in digitized Encyclopedie, manually identifying 4,798 with coordinates. Trained transformer-based models using a two-step pipeline: classifier to identify coordinate-bearing entries, then a model for retrieval and normalization. Tested encoder-decoder and decoder architectures.
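The normalization step maps varied textual coordinate expressions onto a standard numeric form. A toy sketch for one hypothetical English-style pattern (the actual entries use far more varied, often French, phrasings):

```python
import re

# Hypothetical pattern: "43 deg. 20 min. N" -> signed decimal degrees
COORD = re.compile(r"(\d+)\s*deg\.\s*(\d+)\s*min\.\s*([NSEW])")

def normalize(text):
    """Return signed decimal degrees for the first matching coordinate, else None."""
    m = COORD.search(text)
    if not m:
        return None
    deg, minutes, hemi = int(m.group(1)), int(m.group(2)), m.group(3)
    value = deg + minutes / 60.0
    # South and West are conventionally negative
    return -value if hemi in "SW" else value

print(normalize("latitude 43 deg. 20 min. N"))  # 43.333...
print(normalize("a purely descriptive entry"))  # None
```

The paper's pipeline replaces such brittle patterns with trained transformer models, but the target representation (a normalized numeric coordinate, or none for descriptive entries) is the same.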

Result: Cross-validation achieved 86% EM score. On out-of-domain tests: 61% EM on 18th-century French Trevoux dictionary, 77% EM on 19th-century English Encyclopaedia Britannica. Shows cross-lingual, cross-domain generalizability.

Conclusion: The gold standard dataset is useful as training data, and the two-step method demonstrates strong cross-lingual and cross-domain generalization capabilities for extracting geographic coordinates from historical texts.

Abstract: This paper introduces a dataset of enriched geographic coordinates retrieved from Diderot and d’Alembert’s eighteenth-century Encyclopedie. Automatically recovering geographic coordinates from historical texts is a complex task, as they are expressed in a variety of ways and with varying levels of precision. To improve retrieval of coordinates from similar digitized early modern texts, we have created a gold standard dataset, trained models, published the resulting inferred and normalized coordinate data, and experimented with applying these models to new texts. From 74,000 total articles in each of the digitized versions of the Encyclopedie from ARTFL and ENCCRE, we examined 15,278 geographical entries, manually identifying 4,798 containing coordinates, and 10,480 with descriptive but non-numerical references. Leveraging our gold standard annotations, we trained transformer-based models to retrieve and normalize coordinates. The pipeline presented here combines a classifier to identify coordinate-bearing entries and a second model for retrieval, tested across encoder-decoder and decoder architectures. Cross-validation yielded an 86% EM score. On an out-of-domain eighteenth-century Trevoux dictionary (also in French), our fine-tuned model had a 61% EM score, while for the nineteenth-century, 7th edition of the Encyclopaedia Britannica in English, the EM was 77%. These findings highlight the gold standard dataset’s usefulness as training data, and our two-step method’s cross-lingual, cross-domain generalizability.

[22] MemEmo: Evaluating Emotion in Memory Systems of Agents

Peng Liu, Zhen Tao, Jihao Zhao, Ding Chen, Yansong Zhang, Cuiping Li, Zhiyu Li, Hong Chen

Main category: cs.CL

TL;DR: Paper proposes HLME benchmark to evaluate memory systems’ ability to handle emotional information in LLMs, finding current systems inadequate across emotional extraction, updating, and QA tasks.

Motivation: Current memory systems for LLMs address context loss but their efficacy in processing emotion-related information remains unclear compared to human cognition. There's a need to objectively assess how well these systems handle affective information.

Method: Created HLME (Human-Like Memory Emotion) dataset and benchmark to evaluate memory systems across three dimensions: emotional information extraction, emotional memory updating, and emotional memory question answering. Tested mainstream and state-of-the-art memory systems.

Result: Experimental results show none of the evaluated memory systems achieve robust performance across all three emotional memory tasks, revealing significant deficiencies in current approaches.

Conclusion: Current memory systems have substantial limitations in processing emotional memories. The findings provide objective perspective on deficiencies and suggest new research trajectory for optimizing memory systems to better handle affective information.

Abstract: Memory systems address the challenge of context loss in Large Language Models during prolonged interactions. However, compared to human cognition, the efficacy of these systems in processing emotion-related information remains inconclusive. To address this gap, we propose an emotion-enhanced memory evaluation benchmark to assess the performance of mainstream and state-of-the-art memory systems in handling affective information. We developed the Human-Like Memory Emotion (HLME) dataset, which evaluates memory systems across three dimensions: emotional information extraction, emotional memory updating, and emotional memory question answering. Experimental results indicate that none of the evaluated systems achieve robust performance across all three tasks. Our findings provide an objective perspective on the current deficiencies of memory systems in processing emotional memories and suggest a new trajectory for future research and system optimization.

[23] The GRADIEND Python Package: An End-to-End System for Gradient-Based Feature Learning

Jonathan Drechsel, Steffen Herbold

Main category: cs.CL

TL;DR: gradiend is an open-source Python package implementing the GRADIEND method for learning feature directions from factual-counterfactual MLM and CLM gradients in language models, with tools for data creation, training, evaluation, visualization, and model rewriting.

Motivation: To provide a unified, open-source toolkit for operationalizing the GRADIEND method that enables systematic analysis and manipulation of feature representations in language models through gradient-based techniques.

Method: The GRADIEND method learns feature directions by analyzing gradients from factual-counterfactual masked language modeling (MLM) and causal language modeling (CLM) tasks. The package implements workflows for feature data creation, training, evaluation, visualization, and persistent model rewriting through controlled weight updates.

Result: The package successfully demonstrates GRADIEND on an English pronoun paradigm and reproduces prior large-scale feature comparison use cases, showing it can effectively learn and manipulate feature directions in language models.

Conclusion: gradiend provides a comprehensive, open-source solution for gradient-based feature analysis and editing in language models, enabling reproducible research and practical applications in model interpretability and control.

Abstract: We present gradiend, an open-source Python package that operationalizes the GRADIEND method for learning feature directions from factual-counterfactual MLM and CLM gradients in language models. The package provides a unified workflow for feature-related data creation, training, evaluation, visualization, persistent model rewriting via controlled weight updates, and multi-feature comparison. We demonstrate GRADIEND on an English pronoun paradigm and on a large-scale feature comparison that reproduces prior use cases.

[24] Dialect and Gender Bias in YouTube’s Spanish Captioning System

Iris Dania Jimenez, Christoph Kern

Main category: cs.CL

TL;DR: YouTube’s Spanish automatic captioning system shows bias against certain Spanish dialects, with systematic disparities in caption quality across regional variations.

Motivation: Spanish has many regional variations spoken by over 441 million people, but YouTube offers only one automatic captioning system for Spanish, raising concerns about potential dialect bias in accessibility tools.

Method: Analyzed YouTube’s automatic captioning system performance across various Spanish dialects by comparing caption quality for female and male speakers from different regions.
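The summary does not name the quality measure, but word error rate (WER), the standard ASR metric, illustrates how caption quality can be compared across dialect and gender groups. A self-contained sketch:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    r, h = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(r)][len(h)] / len(r)

# One dropped word out of four reference words
print(wer("el coche es rojo", "el coche rojo"))  # 0.25
```

Aggregating such per-utterance scores by speaker region and gender is one way disparities like those the study reports could be quantified.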

Result: Identified systematic disparities in caption quality that can be attributed to specific Spanish dialects, showing the system is biased against certain regional variations.

Conclusion: Algorithmic technologies on digital platforms need calibration to diverse user populations, as current systems fail to account for linguistic variations within languages.

Abstract: Spanish is the official language of twenty-one countries and is spoken by over 441 million people. Naturally, there are many variations in how Spanish is spoken across these countries. Media platforms such as YouTube rely on automatic speech recognition systems to make their content accessible to different groups of users. However, YouTube offers only one option for automatically generating captions in Spanish. This raises the question: could this captioning system be biased against certain Spanish dialects? This study examines the potential biases in YouTube’s automatic captioning system by analyzing its performance across various Spanish dialects. By comparing the quality of captions for female and male speakers from different regions, we identify systematic disparities which can be attributed to specific dialects. Our study provides further evidence that algorithmic technologies deployed on digital platforms need to be calibrated to the diverse needs and experiences of their user populations.

[25] Task Complexity Matters: An Empirical Study of Reasoning in LLMs for Sentiment Analysis

Donghao Huang, Zhaoxia Wang

Main category: cs.CL

TL;DR: Reasoning in LLMs doesn’t universally improve performance - effectiveness is strongly task-dependent, with reasoning harming simple tasks but helping complex emotion recognition despite high computational costs.

Motivation: To test the prevailing assumption that reasoning universally improves LLM performance across language tasks, examining whether this holds true across different task complexities and model architectures.

Method: Comprehensive evaluation of 504 configurations across 7 model families (adaptive, conditional, reinforcement learning-based reasoning architectures) on sentiment analysis datasets with varying granularity (binary, 5-class, 27-class emotion).

Result: Reasoning effectiveness is strongly task-dependent: binary classification degrades by up to 19.9 F1 points, while 27-class emotion recognition gains up to 16.0 points; distilled reasoning variants underperform base models on simpler tasks; few-shot learning generally helps; base models dominate efficiency-performance trade-offs.

Conclusion: Reasoning doesn’t universally improve LLM performance - it’s only justified for complex emotion recognition despite high computational overhead, challenging prevailing assumptions about reasoning’s universal benefits.

Abstract: Large language models (LLMs) with reasoning capabilities have fueled a compelling narrative that reasoning universally improves performance across language tasks. We test this claim through a comprehensive evaluation of 504 configurations across seven model families–including adaptive, conditional, and reinforcement learning-based reasoning architectures–on sentiment analysis datasets of varying granularity (binary, five-class, and 27-class emotion). Our findings reveal that reasoning effectiveness is strongly task-dependent, challenging prevailing assumptions: (1) Reasoning shows task-complexity dependence–binary classification degrades up to -19.9 F1 percentage points (pp), while 27-class emotion recognition gains up to +16.0pp; (2) Distilled reasoning variants underperform base models by 3-18 pp on simpler tasks, though few-shot prompting enables partial recovery; (3) Few-shot learning improves over zero-shot in most cases regardless of model type, with gains varying by architecture and task complexity; (4) Pareto frontier analysis shows base models dominate efficiency-performance trade-offs, with reasoning justified only for complex emotion recognition despite 2.1x-54x computational overhead. We complement these quantitative findings with qualitative error analysis revealing that reasoning degrades simpler tasks through systematic over-deliberation, offering mechanistic insight beyond the high-level overthinking hypothesis.

[26] Preference Packing: Efficient Preference Optimization for Large Language Models

Jaekyung Cho

Main category: cs.CL

TL;DR: Preference packing method improves training efficiency for preference-based models by reducing redundant attention operations and KV cache usage for duplicate input prompts across multiple responses.

Motivation: As LLMs grow larger, resource-efficient training becomes crucial. Existing batch packing techniques work for standard training but don't optimize for preference-based training where multiple responses exist for the same prompt, leading to redundant computations.

Method: Proposes preference packing that groups multiple responses for the same input prompt together, reducing attention operations for duplicate prompts and decreasing KV cache memory usage. Can be combined with existing optimization techniques like batch sorting.
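The saving is visible with simple token arithmetic: with a shared prompt and two responses (e.g., chosen and rejected in DPO), naive batching processes the prompt twice, while packing processes it once and lets both responses attend to the same cached prompt KV. A sketch under assumed token counts:

```python
def naive_tokens(prompt_len, response_lens):
    # Each (prompt, response) pair is encoded as its own sequence
    return sum(prompt_len + r for r in response_lens)

def packed_tokens(prompt_len, response_lens):
    # The shared prompt appears once; each response reuses its KV cache
    return prompt_len + sum(response_lens)

P, responses = 1000, [200, 250]  # hypothetical lengths
print(naive_tokens(P, responses))   # 2450
print(packed_tokens(P, responses))  # 1450
```

The saving grows with prompt length and with the number of responses per prompt, which is why the effect is pronounced for long (e.g., image-bearing) prompts.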

Result: Achieved at least 37% reduction in training time on both text-only and image-included datasets. When combined with batch sorting, achieved 3.22x speedup.

Conclusion: Preference packing is an effective resource optimization technique for preference-based training methods like reward models and DPO, offering significant training speed improvements without compromising model quality.

Abstract: Resource-efficient training optimization techniques are becoming increasingly important as the size of large language models (LLMs) continues to grow. In particular, batch packing is commonly used in pre-training and supervised fine-tuning to achieve resource-efficient training. We propose preference packing, a method to enhance resource efficiency in training techniques that use data with different responses for the same input prompt, such as reward models or Direct Preference Optimization (DPO). Preference packing improves resource efficiency by reducing the attention operations for duplicate input prompts and decreasing KV cache memory usage. We conducted experiments on text-only datasets and image-included datasets and achieved at least 37% reduction in training time. Notably, this method can be applied alongside existing optimization techniques such as batch sorting, resulting in a 3.22x speedup.

[27] ARGUS: Seeing the Influence of Narrative Features on Persuasion in Argumentative Texts

Sara Nabhani, Federico Pianzola, Khalid Al-Khatib, Malvina Nissim

Main category: cs.CL

TL;DR: ARGUS framework analyzes how narrative features impact persuasion in online argumentation using annotated ChangeMyView corpus and LLM-based analysis

Motivation: To understand how narratives make arguments more persuasive and identify which narrative features matter most in online, unstructured argumentation, where the specific role of stories remains underexplored

Method: Developed ARGUS framework with new ChangeMyView corpus annotated for story presence and six key narrative features; used encoder-based classifiers and zero-shot LLMs to identify stories and narrative features at scale

Result: Framework enables systematic examination of how different narrative dimensions influence persuasion success in online argumentation

Conclusion: ARGUS provides a comprehensive approach to studying narrative impact on persuasion in argumentative discourse, bridging theoretical frameworks with computational analysis

Abstract: Can narratives make arguments more persuasive? And to this end, which narrative features matter most? Although stories are often seen as powerful tools for persuasion, their specific role in online, unstructured argumentation remains underexplored. To address this gap, we present ARGUS, a framework for studying the impact of narration on persuasion in argumentative discourse. ARGUS introduces a new ChangeMyView corpus annotated for story presence and six key narrative features, integrating insights from two established theoretical frameworks that capture both textual narrative features and their effects on recipients. Leveraging both encoder-based classifiers and zero-shot large language models (LLMs), ARGUS identifies stories and narrative features and applies them at scale to examine how different narrative dimensions influence persuasion success in online argumentation.

[28] Terminology Rarity Predicts Catastrophic Failure in LLM Translation of Low-Resource Ancient Languages: Evidence from Ancient Greek

James L. Zainaldin, Cameron Pattison, Manuela Marai, Jacob Wu, Mark J. Schiefsky

Main category: cs.CL

TL;DR: First systematic human evaluation of LLM machine translation for Ancient Greek technical prose, showing high quality for translated texts and moderate quality for untranslated pharmacological texts with terminology rarity as key predictor of failure.

Motivation: To systematically evaluate the quality of LLM machine translation for Ancient Greek technical prose, particularly for Classical scholarship applications, using both translated and untranslated texts to understand performance limitations.

Method: Evaluated translations by three commercial LLMs (Claude, Gemini, ChatGPT) of 20 paragraph-length passages from Galen’s works using both automated metrics (BLEU, chrF++, METEOR, ROUGE-L, BERTScore, COMET, BLEURT) and expert human evaluation via modified Multidimensional Quality Metrics framework.

Result: LLMs achieved high translation quality (mean MQM score 95.2/100) for previously translated expository text, approaching expert level. For untranslated pharmacological text, aggregate quality was lower (79.9/100) with high variance driven by terminological density. Terminology rarity strongly predicted translation failure (r = -.97).
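The reported r = -.97 is a passage-level Pearson correlation between terminology rarity and translation quality. A self-contained sketch of the computation, with made-up rarity and quality scores (not the paper's data):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical passage-level rarity scores vs. MQM-style quality scores:
# rarer terminology, lower quality -> strong negative correlation
rarity  = [0.1, 0.2, 0.4, 0.7, 0.9]
quality = [96, 92, 85, 70, 55]
print(round(pearson_r(rarity, quality), 3))
```

In the paper, rarity is operationalized via corpus frequency in the Diorisis Ancient Greek Corpus; the sketch only shows the correlation step.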

Conclusion: LLMs show promise for Ancient Greek translation but struggle with rare terminology, especially in untranslated technical domains. Automated metrics have limited utility for discriminating high-quality translations, highlighting the need for human evaluation in low-resource ancient languages.

Abstract: This study presents the first systematic, reference-free human evaluation of large language model (LLM) machine translation (MT) for Ancient Greek (AG) technical prose. We evaluate translations by three commercial LLMs (Claude, Gemini, ChatGPT) of twenty paragraph-length passages from two works by the Greek physician Galen of Pergamum (ca. 129-216 CE): On Mixtures, which has two published English translations, and On the Composition of Drugs according to Kinds, which has never been fully translated into English. We assess translation quality using both standard automated evaluation metrics (BLEU, chrF++, METEOR, ROUGE-L, BERTScore, COMET, BLEURT) and expert human evaluation via a modified Multidimensional Quality Metrics (MQM) framework applied to all 60 translations by a team of domain specialists. On the previously translated expository text, LLMs achieved high translation quality (mean MQM score 95.2/100), with performance approaching expert level. On the untranslated pharmacological text, aggregate quality was lower (79.9/100) but with high variance driven by two passages presenting extreme terminological density; excluding these, scores converged to within 4 points of the translated text. Terminology rarity, operationalized via corpus frequency in the literary Diorisis Ancient Greek Corpus, emerged as a strong predictor of translation failure (r = -.97 for passage-level quality on the untranslated text). Automated metrics showed moderate correlation with human judgment overall on the text with a wide quality spread (Composition), but no metric discriminated among high-quality translations. We discuss implications for the use of LLMs in Classical scholarship and for the design of automated evaluation pipelines for low-resource ancient languages.

[29] CoME: Empowering Channel-of-Mobile-Experts with Informative Hybrid-Capabilities Reasoning

Yuxuan Liu, Weikai Xu, Kun Huang, Changyu Chen, Jiankun Zhao, Pengzhi Gao, Wei Liu, Jian Luan, Shuo Shang, Bo Du, Ji-Rong Wen, Rui Yan

Main category: cs.CL

TL;DR: CoME is a novel mobile agent architecture with four specialized experts for different reasoning stages, using output-oriented activation and progressive training to achieve balanced hybrid capabilities for autonomous task execution.

Motivation: Existing mobile agents struggle with both decoupled enhancement and balanced integration of hybrid capabilities needed for autonomous task execution (screen summary, subtask planning, action decision, action function).

Method: CoME architecture with four distinct experts aligned to specific reasoning stages, using output-oriented activation. Progressive training strategy: Expert-FT for decoupled capability enhancement, Router-FT for stage alignment, CoT-FT for balanced optimization. InfoGain-Driven DPO (Info-DPO) uses information gain to evaluate intermediate steps and mitigate error propagation.

Result: CoME outperforms dense mobile agents and MoE methods on both AITZ and AMEX datasets, demonstrating superior performance in hybrid-capabilities reasoning for mobile task execution.

Conclusion: CoME successfully addresses the challenge of achieving both decoupled enhancement and balanced integration of hybrid capabilities in mobile agents through specialized experts, progressive training, and information-gain-driven optimization.

Abstract: Mobile Agents can autonomously execute user instructions, which requires hybrid-capabilities reasoning, including screen summary, subtask planning, action decision and action function. However, existing agents struggle to achieve both decoupled enhancement and balanced integration of these capabilities. To address these challenges, we propose Channel-of-Mobile-Experts (CoME), a novel agent architecture consisting of four distinct experts, each aligned with a specific reasoning stage. CoME activates the corresponding expert to generate output tokens in each reasoning stage via output-oriented activation. To empower CoME with hybrid-capabilities reasoning, we introduce a progressive training strategy: Expert-FT enables decoupling and enhancement of different experts’ capability; Router-FT aligns expert activation with the different reasoning stages; CoT-FT facilitates seamless collaboration and balanced optimization across multiple capabilities. To mitigate error propagation in hybrid-capabilities reasoning, we propose InfoGain-Driven DPO (Info-DPO), which uses information gain to evaluate the contribution of each intermediate step, thereby guiding CoME toward more informative reasoning. Comprehensive experiments show that CoME outperforms dense mobile agents and MoE methods on both AITZ and AMEX datasets.

[30] ArgLLM-App: An Interactive System for Argumentative Reasoning with Large Language Models

Adam Dejl, Deniz Gorur, Francesca Toni

Main category: cs.CL

TL;DR: ArgLLM-App is a web-based system implementing argumentative LLM agents for binary decision-making tasks with visual explanations and human contestation capabilities.

Motivation: To create a system that makes LLM-based decisions more explainable and contestable by humans, addressing the black-box nature of traditional LLMs and enabling better human-AI collaboration.

Method: Develops a modular web-based system (ArgLLM-App) that combines LLMs with computational argumentation frameworks, supports visualization of reasoning chains, allows human interaction to contest decisions, and integrates with trusted external data sources.

Result: A publicly available web application (argllm.app) with video demonstration that enables users to visualize and contest LLM-generated decisions in binary tasks through argumentative reasoning frameworks.

Conclusion: ArgLLM-App successfully demonstrates how argumentative LLMs can enhance explainability and contestability in AI decision-making systems, providing a practical tool for human-AI collaboration.

Abstract: Argumentative LLMs (ArgLLMs) are an existing approach leveraging Large Language Models (LLMs) and computational argumentation for decision-making, with the aim of making the resulting decisions faithfully explainable to and contestable by humans. Here we propose a web-based system implementing ArgLLM-empowered agents for binary tasks. ArgLLM-App supports visualisation of the produced explanations and interaction with human users, allowing them to identify and contest any mistakes in the system’s reasoning. It is highly modular and enables drawing information from trusted external sources. ArgLLM-App is publicly available at https://argllm.app, with a video demonstration at https://youtu.be/vzwlGOr0sPM.

[31] Task-Centric Acceleration of Small-Language Models

Dor Tsur, Sharon Adar, Ran Levy

Main category: cs.CL

TL;DR: TASC is a framework for accelerating small language models through task-adaptive sequence compression, with two methods: TASC-ft for fine-tuning with enriched vocabulary and TASC-spec for training-free speculative decoding using n-gram draft models.

Motivation: Small language models need efficiency improvements for high-volume, low-latency applications, but current methods have limitations in vocabulary alignment and training requirements.

Method: Two complementary approaches: 1) TASC-ft enriches tokenizer vocabulary with high-frequency output n-grams during fine-tuning, 2) TASC-spec uses lightweight, training-free speculative decoding with n-gram draft models constructed from task output corpus.
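A bigram version of such an n-gram draft model can be sketched in a few lines: count next-token frequencies over the task's output corpus, then greedily propose draft tokens for the target model to verify. This is illustrative only, not the paper's implementation (TASC-spec also mixes in context n-gram statistics):

```python
from collections import Counter, defaultdict

def build_bigram_draft(corpus_token_lists):
    """Map each token to its most frequent successor in the output corpus."""
    nxt = defaultdict(Counter)
    for toks in corpus_token_lists:
        for a, b in zip(toks, toks[1:]):
            nxt[a][b] += 1
    return {a: c.most_common(1)[0][0] for a, c in nxt.items()}

def propose_draft(draft, last_token, k=3):
    """Greedily chain up to k draft tokens; the target model accepts or rejects them."""
    out = []
    for _ in range(k):
        if last_token not in draft:
            break
        last_token = draft[last_token]
        out.append(last_token)
    return out

# Hypothetical low output-variability task corpus
corpus = [["the", "answer", "is", "yes"],
          ["the", "answer", "is", "no"],
          ["the", "answer", "is", "yes"]]
table = build_bigram_draft(corpus)
print(propose_draft(table, "the"))  # ['answer', 'is', 'yes']
```

Because the draft is a lookup table rather than a neural model, it needs no training and imposes no draft-target vocabulary alignment constraints, which is the point of TASC-spec.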

Result: The methods show consistent improvements in inference efficiency while maintaining task performance across multiple low output-variability generation tasks.

Conclusion: TASC provides effective acceleration for small language models in task-specific applications through vocabulary optimization and efficient speculative decoding.

Abstract: Small language models (SLMs) have emerged as efficient alternatives to large language models for task-specific applications. However, they are often employed in high-volume, low-latency settings, where efficiency is crucial. We propose TASC, Task-Adaptive Sequence Compression, a framework for SLM acceleration comprising two use cases: When performing SLM fine-tuning, we propose TASC-ft, which iteratively enriches the tokenizer vocabulary with high-frequency output n-grams and then fine-tunes the model to utilize the expanded vocabulary. Next, we propose an inference-time method, termed TASC-spec. TASC-spec is a lightweight, training-free speculative decoding method that constructs an n-gram draft model from the task’s output corpus, mixing task and context n-gram information. TASC-spec avoids any additional training, while bypassing draft-target vocabulary alignment constraints. We demonstrate the effectiveness of both methods across multiple low output-variability generation tasks. Our methods show consistent improvements in inference efficiency while maintaining task performance.

[32] MT-PingEval: Evaluating Multi-Turn Collaboration with Private Information Games

Jacob Eisenstein, Fantine Huot, Adam Fisch, Jonathan Berant, Mirella Lapata

Main category: cs.CL

TL;DR: A scalable evaluation methodology for language models in multi-turn collaborative interactions using games requiring communication about private information, revealing models’ weaknesses in planning and executing effective multi-turn conversations despite substantial headroom for improvement.

Motivation: To develop a systematic way to evaluate language models' capabilities in multi-turn collaborative interactions, particularly focusing on how they handle private information exchange and communication efficiency in interactive settings.

Method: Created MT-PingEval, a suite of collaborative games requiring effective communication about private information, using interactive scaling analysis where fixed token budgets are divided over variable numbers of turns to assess model performance.

Result: Language models often fail to use interactive collaboration to improve over non-interactive baselines despite substantial headroom, showing significant weaknesses in planning and executing multi-turn collaborative conversations. Humans achieve comparable success with superior token efficiency through more coherent dialogues.

Conclusion: State-of-the-art models still have significant limitations in managing private information and executing effective multi-turn conversations, with MT-PingEval providing a framework to drive improvements in these collaborative communication capabilities.

Abstract: We present a scalable methodology for evaluating language models in multi-turn interactions, using a suite of collaborative games that require effective communication about private information. This enables an interactive scaling analysis, in which a fixed token budget is divided over a variable number of turns. We find that in many cases, language models are unable to use interactive collaboration to improve over the non-interactive baseline scenario in which one agent attempts to summarize its information and the other agent immediately acts – despite substantial headroom. This suggests that state-of-the-art models still suffer from significant weaknesses in planning and executing multi-turn collaborative conversations. We analyze the linguistic features of these dialogues, assessing the roles of sycophancy, information density, and discourse coherence. While there is no single linguistic explanation for the collaborative weaknesses of contemporary language models, we note that humans achieve comparable task success at superior token efficiency by producing dialogues that are more coherent than those produced by most language models. The proactive management of private information is a defining feature of real-world communication, and we hope that MT-PingEval will drive further work towards improving this capability.

[33] Controllable Reasoning Models Are Private Thinkers

Haritz Puerto, Haonan Li, Xudong Han, Timothy Baldwin, Iryna Gurevych

Main category: cs.CL

TL;DR: Training reasoning models to follow instructions in reasoning traces improves privacy by preventing unintended leakage of sensitive information, though with potential trade-offs in task utility.

Motivation: AI reasoning models often process sensitive user data, and their reasoning traces can inadvertently leak private information. Current models lack control over what information appears in reasoning traces, creating privacy risks.

Method: Fine-tune models on instruction-following datasets with explicit restrictions on reasoning traces. Introduce a generation strategy that decouples reasoning and answer generation using separate LoRA adapters.

Result: Method achieves up to 20.9 point improvements in instruction-following performance and up to 51.9 percentage point gains on privacy benchmarks across six models (1.7B to 14B parameters).

Conclusion: Improving instruction-following behavior in reasoning models significantly enhances privacy preservation, offering a promising direction for developing privacy-aware AI agents, though with potential trade-offs in task utility.

Abstract: AI agents powered by reasoning models require access to sensitive user data. However, their reasoning traces are difficult to control, which can result in the unintended leakage of private information to external parties. We propose training models to follow instructions not only in the final answer, but also in reasoning traces, potentially under different constraints. We hypothesize that improving their instruction following abilities in the reasoning traces can improve their privacy-preservation skills. To demonstrate this, we fine-tune models on a new instruction-following dataset with explicit restrictions on reasoning traces. We further introduce a generation strategy that decouples reasoning and answer generation using separate LoRA adapters. We evaluate our approach on six models from two model families, ranging from 1.7B to 14B parameters, across two instruction-following benchmarks and two privacy benchmarks. Our method yields substantial improvements, achieving gains of up to 20.9 points in instruction-following performance and up to 51.9 percentage points on privacy benchmarks. These improvements, however, can come at the cost of task utility, due to the trade-off between reasoning performance and instruction-following abilities. Overall, our results show that improving instruction-following behavior in reasoning models can significantly enhance privacy, suggesting a promising direction for the development of future privacy-aware agents. Our code and data are available at https://github.com/UKPLab/arxiv2026-controllable-reasoning-models

[34] Do LLMs Benefit From Their Own Words?

Jenny Y. Huang, Leshem Choshen, Ramon Astudillo, Tamara Broderick, Jacob Andreas

Main category: cs.CL

TL;DR: Removing prior assistant responses from conversation history doesn’t harm response quality in many multi-turn LLM interactions, reduces context length by up to 10x, and can actually improve quality by avoiding context pollution issues.

DetailsMotivation: The paper questions the standard design choice of including all previous assistant responses in multi-turn LLM conversation history, investigating whether models truly benefit from conditioning on their own prior responses or if this practice introduces unnecessary overhead and potential quality issues.

Method: The researchers compare standard full-context prompting with user-turn-only prompting (omitting all previous assistant responses) using in-the-wild multi-turn conversations across three open reasoning models and one state-of-the-art model. They analyze conversation characteristics, identify self-contained prompts, and design a context-filtering approach that selectively omits assistant-side context.
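User-turn-only prompting amounts to a one-line filter over the chat history. A minimal sketch, assuming the common role/content message format (the paper's actual filtering is selective rather than all-or-nothing):

```python
def user_turn_only(history):
    """Drop all previous assistant turns, keeping user turns in order.

    `history` is a list of {"role": ..., "content": ...} dicts; the
    final user turn remains the last element, so the prompt still ends
    with the current request.
    """
    return [m for m in history if m["role"] == "user"]


def context_chars(history):
    """Cumulative context length in characters, a rough cost proxy."""
    return sum(len(m["content"]) for m in history)
```

Because assistant responses are usually much longer than user prompts, dropping them is what drives the up-to-10x context reduction reported above.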

Result: Removing prior assistant responses doesn’t affect response quality on a large fraction of turns (36.4% of prompts are self-contained). User-turn-only prompting reduces cumulative context lengths by up to 10x. In cases where it outperforms full context, context pollution issues are identified where models over-condition on previous responses, introducing errors, hallucinations, or stylistic artifacts that propagate across turns.

Conclusion: Selectively omitting assistant history can improve response quality while reducing memory consumption. Many follow-up prompts provide sufficient instruction to be answered using only current and prior user turns, suggesting that standard practice of including all assistant responses may be unnecessarily conservative.

Abstract: Multi-turn interactions with large language models typically retain the assistant’s own past responses in the conversation history. In this work, we revisit this design choice by asking whether large language models benefit from conditioning on their own prior responses. Using in-the-wild, multi-turn conversations, we compare standard (full-context) prompting with a user-turn-only prompting approach that omits all previous assistant responses, across three open reasoning models and one state-of-the-art model. To our surprise, we find that removing prior assistant responses does not affect response quality on a large fraction of turns. Omitting assistant-side history can reduce cumulative context lengths by up to 10x. To explain this result, we find that multi-turn conversations consist of a substantial proportion (36.4%) of self-contained prompts, and that many follow-up prompts provide sufficient instruction to be answered using only the current user turn and prior user turns. When analyzing cases where user-turn-only prompting substantially outperforms full context, we identify instances of context pollution, in which models over-condition on their previous responses, introducing errors, hallucinations, or stylistic artifacts that propagate across turns. Motivated by these findings, we design a context-filtering approach that selectively omits assistant-side context. Our findings suggest that selectively omitting assistant history can improve response quality while reducing memory consumption.

[35] TWSSenti: A Novel Hybrid Framework for Topic-Wise Sentiment Analysis on Social Media Using Transformer Models

Aish Albladi, Md Kaosar Uddin, Minarul Islam, Cheryl Seals

Main category: cs.CL

TL;DR: Hybrid transformer ensemble (BERT, GPT-2, RoBERTa, XLNet, DistilBERT) for sentiment analysis achieves 94-95% accuracy on Twitter and IMDB datasets, outperforming individual models.

DetailsMotivation: To improve sentiment classification accuracy and robustness by addressing challenges like noisy data, contextual ambiguity, and generalization across diverse datasets through a hybrid approach combining multiple transformer models.

Method: Combines BERT (bidirectional context), GPT-2 (generative capabilities), RoBERTa (optimized contextual understanding), XLNet (permutation-based dependency modeling), and DistilBERT (efficient computation) in a hybrid framework with text cleaning, tokenization, and TF-IDF/BoW feature extraction.
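The ensemble step can be sketched as a majority vote over per-model predictions. The toy keyword "models" below are placeholders for the five fine-tuned transformers; the paper does not publish its exact combination rule, so treat this as one plausible reading:

```python
from collections import Counter


def ensemble_sentiment(texts, models):
    """Majority-vote sentiment over several classifiers.

    `models` maps a model name to a predict function (text -> label).
    Counter.most_common is insertion-ordered for ties, so ties go to
    the earliest-listed model's vote.
    """
    results = []
    for text in texts:
        votes = [predict(text) for predict in models.values()]
        results.append(Counter(votes).most_common(1)[0][0])
    return results
```

In practice each `predict` would wrap a fine-tuned BERT/RoBERTa/XLNet/DistilBERT/GPT-2 head; the voting logic is unchanged.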

Result: Achieved 94% accuracy on Sentiment140 (Twitter) dataset and 95% accuracy on IMDB dataset, outperforming standalone transformer models.

Conclusion: Hybrid transformer ensembles effectively address limitations of individual architectures and show promise for real-world applications like social media monitoring and customer sentiment analysis.

Abstract: Sentiment analysis is a crucial task in natural language processing (NLP) that enables the extraction of meaningful insights from textual data, particularly from dynamic platforms like Twitter and IMDB. This study explores a hybrid framework combining transformer-based models, specifically BERT, GPT-2, RoBERTa, XLNet, and DistilBERT, to improve sentiment classification accuracy and robustness. The framework addresses challenges such as noisy data, contextual ambiguity, and generalization across diverse datasets by leveraging the unique strengths of these models. BERT captures bidirectional context, GPT-2 enhances generative capabilities, RoBERTa optimizes contextual understanding with larger corpora and dynamic masking, XLNet models dependency through permutation-based learning, and DistilBERT offers efficiency with reduced computational overhead while maintaining high accuracy. We demonstrate text cleaning, tokenization, and feature extraction using Term Frequency-Inverse Document Frequency (TF-IDF) and Bag of Words (BoW), ensuring high-quality input data for the models. The hybrid approach was evaluated on the benchmark datasets Sentiment140 and IMDB, achieving superior accuracy rates of 94% and 95%, respectively, outperforming standalone models. The results validate the effectiveness of combining multiple transformer models in ensemble-like setups to address the limitations of individual architectures. This research highlights its applicability to real-world tasks such as social media monitoring, customer sentiment analysis, and public opinion tracking, which offers a pathway for future advancements in hybrid NLP frameworks.


[36] Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning

Minju Seo, Jinheon Baek, Seongyun Lee, Sung Ju Hwang

Main category: cs.CL

TL;DR: PaperCoder is a multi-agent LLM framework that automatically generates operational code repositories from machine learning papers, addressing the reproducibility crisis in ML research.

DetailsMotivation: The paper addresses the reproducibility crisis in machine learning research where code implementations are often unavailable, making it slow and labor-intensive for researchers to reproduce results and build upon prior work. This creates significant barriers to scientific progress and validation.

Method: PaperCoder uses a multi-agent LLM framework with three stages: 1) Planning - constructs high-level roadmap, designs system architecture with diagrams, identifies file dependencies, and generates configuration files; 2) Analysis - interprets implementation-specific details; 3) Generation - produces modular, dependency-aware code. Each phase uses specialized agents that collaborate across the pipeline.
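The three-stage agent pipeline can be sketched as a simple chain over a shared artifact. The dict-based artifact and stub agents below are illustrative, not PaperCoder's actual interfaces:

```python
def run_pipeline(paper, agents):
    """Run planning -> analysis -> generation in order.

    Each stage is a list of agent functions that take and return an
    artifact dict, so later agents can read what earlier ones produced
    (roadmap, file dependencies, implementation notes, code).
    """
    artifact = {"paper": paper}
    for stage in ("planning", "analysis", "generation"):
        for agent in agents[stage]:
            artifact = agent(artifact)
    return artifact
```

In the real system each agent would be an LLM call with a specialized prompt; the chaining pattern, where generation consumes the planner's file list and the analyst's per-file details, is the part this sketch captures.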

Result: PaperCoder demonstrates effectiveness in creating high-quality, faithful implementations from machine learning papers based on both model-based and human evaluations (including authors of the original papers). It consistently shows strengths in the recently released PaperBench benchmark, surpassing strong baselines by substantial margins.

Conclusion: PaperCoder successfully addresses the reproducibility problem in ML research by leveraging LLMs’ capabilities in understanding scientific documents and generating high-quality code, providing an automated solution for transforming papers into operational code repositories.

Abstract: Despite the rapid growth of machine learning research, corresponding code implementations are often unavailable, making it slow and labor-intensive for researchers to reproduce results and build upon prior work. In the meantime, recent Large Language Models (LLMs) excel at understanding scientific documents and generating high-quality code. Inspired by this, we introduce PaperCoder, a multi-agent LLM framework that transforms machine learning papers into operational code repositories. PaperCoder operates in three stages: planning, where it constructs a high-level roadmap, designs the system architecture with diagrams, identifies file dependencies, and generates configuration files; analysis, which focuses on interpreting implementation-specific details; and generation, where modular, dependency-aware code is produced. Moreover, each phase is instantiated through a set of specialized agents designed to collaborate effectively across the pipeline. We then evaluate PaperCoder on generating code implementations from machine learning papers based on both model-based and human evaluations, particularly from the authors of those papers, with author-released repositories as ground truth if available. Our results demonstrate the effectiveness of PaperCoder in creating high-quality, faithful implementations. Furthermore, it consistently shows strengths in the recently released PaperBench benchmark, surpassing strong baselines by substantial margins. Code is available at: https://github.com/going-doer/Paper2Code.

[37] FineScope : SAE-guided Data Selection Enables Domain Specific LLM Pruning and Finetuning

Chaitali Bhattacharyya, Hyunsei Lee, Junyoung Lee, Shinhyoung Jang, Il hong Suh, Yeseong Kim

Main category: cs.CL

TL;DR: FineScope is a framework for creating compact, domain-specific LLMs from larger pretrained models using sparse autoencoders for feature extraction, structured pruning with domain constraints, and self-data distillation with SAE-curated datasets.

DetailsMotivation: Training large LLMs from scratch is computationally expensive, and existing medium-sized models often suffer accuracy degradation on specialized datasets. There's a need for efficient domain-specific LLMs that maintain strong task performance.

Method: Uses Sparse Autoencoder (SAE) framework to extract domain-specific feature representations from large datasets. Applies structured pruning with domain-specific constraints to create compact models. Then uses self-data distillation with SAE-curated datasets to restore domain-specific knowledge lost during pruning.
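One way to read the SAE-guided selection step: score each sample by how much its strongest SAE latents overlap a set of latents associated with the target domain, and keep the matches. The latent indexing and thresholds below are illustrative assumptions, not the paper's exact recipe:

```python
def top_latents(activations, k=2):
    """Indices of the k strongest SAE latents for one sample."""
    return set(sorted(range(len(activations)), key=lambda i: -activations[i])[:k])


def select_domain_subset(samples, domain_latents, k=2, min_overlap=1):
    """Keep samples whose strongest latents overlap the domain's latents.

    `samples` maps sample id -> SAE latent activation vector;
    `domain_latents` is the set of latent indices tied to the target
    domain (in practice, identified by inspecting interpretable SAE
    features).
    """
    keep = []
    for sid, acts in samples.items():
        if len(top_latents(acts, k) & domain_latents) >= min_overlap:
            keep.append(sid)
    return keep
```

The selected subset would then drive both the pruning constraints and the self-data distillation described above.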

Result: FineScope achieves competitive performance, outperforming several large-scale SOTA LLMs in domain-specific tasks. Pruned models regain substantial original performance when fine-tuned with SAE-curated datasets. The approach also improves domain-specific accuracy of pretrained LLMs even without pruning.

Conclusion: FineScope provides an effective framework for creating efficient, domain-specific LLMs that maintain strong performance through SAE-based feature extraction, constrained pruning, and self-data distillation.

Abstract: Training large language models (LLMs) from scratch requires significant computational resources, driving interest in developing smaller, domain-specific LLMs that maintain both efficiency and strong task performance. Medium-sized models such as LLaMA have served as starting points for domain-specific adaptation, but they often suffer from accuracy degradation when tested on specialized datasets. We introduce FineScope, a framework for deriving compact, domain-optimized LLMs from larger pretrained models. FineScope leverages the Sparse Autoencoder (SAE) framework, inspired by its ability to produce interpretable feature representations, to extract domain-specific subsets from large datasets. We apply structured pruning with domain-specific constraints, ensuring that the resulting pruned models retain essential knowledge for the target domain. To further enhance performance, these pruned models undergo self-data distillation, leveraging SAE-curated datasets to restore key domain-specific information lost during pruning. Extensive experiments and ablation studies demonstrate that FineScope achieves highly competitive performance, outperforming several large-scale state-of-the-art LLMs in domain-specific tasks. Additionally, our results show that FineScope enables pruned models to regain a substantial portion of their original performance when fine-tuned with SAE-curated datasets. Furthermore, applying these datasets to fine-tune pretrained LLMs without pruning also improves their domain-specific accuracy, highlighting the robustness of our approach.

[38] REA-RL: Reflection-Aware Online Reinforcement Learning for Efficient Reasoning

Hexuan Deng, Wenxiang Jiao, Xuebo Liu, Jun Rao, Min Zhang

Main category: cs.CL

TL;DR: REA-RL is a reinforcement learning approach that reduces inference costs in Large Reasoning Models by using a small reflection model for efficient online training and a reflection reward to maintain reasoning quality while shortening responses.

DetailsMotivation: Large Reasoning Models (LRMs) suffer from overthinking which leads to high inference costs. Existing methods for shortening reasoning responses are inefficient for online use (time-consuming data generation) or use simple length rewards that harm performance by losing reflection ability.

Method: Proposes REA-RL with two key components: 1) A small reflection model for efficient online training that enables both parallel sampling and sequential revision, and 2) A reflection reward designed to prevent models from favoring short but non-reflective responses.
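The reflection reward can be sketched as a correctness term plus a length penalty plus a bonus for reflective phrasing. Marker strings and weights below are illustrative assumptions; the paper's reward is computed differently in detail, but the shape is the same: penalize length without letting the policy drop reflection entirely.

```python
REFLECTION_MARKERS = ("wait", "let me check", "on second thought")


def reflection_reward(response, correct, max_len=200, w_len=0.3, w_refl=0.2):
    """Correctness reward with a length penalty and a reflection bonus.

    A plain length reward would favor short but non-reflective answers;
    the reflection bonus counteracts that failure mode.
    """
    r = 1.0 if correct else 0.0
    r -= w_len * min(len(response) / max_len, 1.0)  # shorter is cheaper
    if any(m in response.lower() for m in REFLECTION_MARKERS):
        r += w_refl  # keep self-checking behavior alive
    return r
```

Under this shaping, a short reflective response outranks an equally short non-reflective one, which is the behavior the reflection reward is meant to preserve.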

Result: The methods maintain or enhance performance while significantly improving inference efficiency. Combined approach reduces inference costs by 36% without compromising performance. Models maintain reflection frequency for hard problems while appropriately reducing it for easier ones.

Conclusion: REA-RL effectively balances performance and efficiency in Large Reasoning Models by addressing overthinking through efficient online reinforcement learning with reflection preservation mechanisms.

Abstract: Large Reasoning Models (LRMs) demonstrate strong performance in complex tasks but often face the challenge of overthinking, leading to substantially high inference costs. Existing approaches synthesize shorter reasoning responses for LRMs to learn, but are inefficient for online usage due to the time-consuming data generation and filtering processes. Meanwhile, online reinforcement learning mainly adopts a length reward to encourage short reasoning responses, but it tends to lose reflection ability and harm performance. To address these issues, we propose REA-RL, which introduces a small reflection model for efficient scaling in online training, offering both parallel sampling and sequential revision. Besides, a reflection reward is designed to further prevent LRMs from favoring short yet non-reflective responses. Experiments show that both methods maintain or enhance performance while significantly improving inference efficiency. Their combination achieves a good balance between performance and efficiency, reducing inference costs by 36% without compromising performance. Further analysis demonstrates that our methods are effective by maintaining reflection frequency for hard problems while appropriately reducing it for easier ones without losing reflection ability. Code is available at https://github.com/hexuandeng/REA-RL.

[39] Tracing and Reversing Edits in LLMs

Paul Youssef, Zhixue Zhao, Christin Seifert, Jörg Schlötterer

Main category: cs.CL

TL;DR: A method for tracing and reversing knowledge edits in LLMs to defend against malicious manipulation, achieving high accuracy in identifying edited entities and reversing edits without access to editing prompts.

DetailsMotivation: Knowledge editing methods for LLMs pose dual-use risks - while beneficial for updating information, they can be exploited to implant misinformation or bias. There's a need for robust techniques to detect, interpret, and mitigate malicious edits to safeguard LLMs against adversarial manipulation.

Method: Proposes novel methods for tracing and reversing edits. For tracing: infers edited object entity solely based on modified weights without access to editing prompts. For reversing: uses training-free method to reverse edits and regain original model’s output distribution without edit information. Can also distinguish between edited and unedited weights.
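A toy analogue of the tracing idea: many locate-and-edit methods apply a rank-one weight update ΔW = u·vᵀ, so ΔW·e has norm |v·e|·‖u‖, and the candidate entity whose embedding aligns with v responds most strongly. The entities and vectors below are invented for illustration; the paper's method operates on real edited layers.

```python
def matvec(m, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(r[i] * x[i] for i in range(len(x))) for r in m]


def norm(x):
    return sum(v * v for v in x) ** 0.5


def trace_edit(delta_w, entity_embeddings):
    """Guess which entity an edit touched, from the weight delta alone.

    Scores each candidate by how strongly the delta responds to its
    embedding; for a rank-one edit the edited entity dominates.
    """
    return max(entity_embeddings,
               key=lambda name: norm(matvec(delta_w, entity_embeddings[name])))
```

The same score can separate edited from unedited weight matrices: an unedited layer's delta is zero, so every candidate scores zero.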

Result: Achieves up to 99% accuracy in inferring edited object entities from modified weights alone. Successfully reverses up to 94% of edits and helps regain original model’s output distribution. Method can also distinguish between edited and unedited weights.

Conclusion: Demonstrates feasibility of tracing and reversing edits based on edited weights, opening new research direction for safeguarding LLMs against adversarial manipulations through knowledge editing.

Abstract: Knowledge editing methods (KEs) are a cost-effective way to update the factual content of large language models (LLMs), but they pose a dual-use risk. While KEs are beneficial for updating outdated or incorrect information, they can be exploited maliciously to implant misinformation or bias. In order to defend against these types of malicious manipulation, we need robust techniques that can reliably detect, interpret, and mitigate malicious edits. To that end, we introduce the tasks of tracing and reversing edits. We propose a novel method to infer the edited object entity, solely based on the modified weights, without access to the editing prompt or any other semantically similar prompts, with up to 99% accuracy. Further, we propose an effective and training-free method for reversing edits. Our method reverses up to 94% of the edits, and helps regain the original model’s output distribution without access to any information about the edit. This method can further be repurposed to distinguish between edited and unedited weights. Our findings highlight the feasibility of tracing and reversing edits based on the edited weights, opening a new research direction for safeguarding LLMs against adversarial manipulations.

[40] Measuring Sycophancy of Language Models in Multi-turn Dialogues

Jiseung Hong, Grace Byun, Seungone Kim, Kai Shu, Jinho D. Choi

Main category: cs.CL

TL;DR: SYCON Bench is a benchmark for evaluating sycophantic behavior in multi-turn conversational LLMs, measuring how quickly models conform to user beliefs and how often they flip stances under pressure.

DetailsMotivation: LLMs often exhibit sycophancy (conforming to user beliefs regardless of factual accuracy), but prior research focused only on single-turn factual correctness, overlooking real-world conversational dynamics.

Method: Created SYCON Bench to measure sycophancy in multi-turn, free-form conversations across three real-world scenarios. Evaluated 17 LLMs on two metrics: Turn of Flip (how quickly models conform) and Number of Flip (frequency of stance shifts under pressure).
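Given a per-turn stance sequence, the two benchmark metrics reduce to simple scans. A minimal sketch, assuming stances have already been extracted as labels:

```python
def turn_of_flip(stances, user_stance):
    """1-indexed turn at which the model first adopts the user's stance;
    None if it never conforms (lower = more sycophantic)."""
    for i, s in enumerate(stances, start=1):
        if s == user_stance:
            return i
    return None


def number_of_flips(stances):
    """How many times the model changes stance across the dialogue
    (higher = less stable under sustained pressure)."""
    return sum(1 for a, b in zip(stances, stances[1:]) if a != b)
```

In the benchmark these labels come from judging free-form responses; the metric arithmetic is as above.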

Result: Sycophancy remains prevalent; alignment tuning amplifies it, while model scaling and reasoning optimization strengthen resistance to undesirable views. Reasoning models outperform instruction-tuned models but fail when over-indexing on logic. Third-person perspective prompting reduces sycophancy by up to 63.8%.

Conclusion: Sycophancy is a significant failure mode in conversational LLMs that requires multi-turn evaluation. Model architecture and prompting strategies significantly affect sycophantic behavior, with third-person perspective being particularly effective at reducing it.

Abstract: Large Language Models (LLMs) are expected to provide helpful and harmless responses, yet they often exhibit sycophancy: conforming to user beliefs regardless of factual accuracy or ethical soundness. Prior research on sycophancy has primarily focused on single-turn factual correctness, overlooking the dynamics of real-world interactions. In this work, we introduce SYCON Bench, a novel benchmark for evaluating sycophantic behavior in multi-turn, free-form conversational settings. Our benchmark measures how quickly a model conforms to the user (Turn of Flip) and how frequently it shifts its stance under sustained user pressure (Number of Flip). Applying SYCON Bench to 17 LLMs across three real-world scenarios, we find that sycophancy remains a prevalent failure mode. Our analysis shows that alignment tuning amplifies sycophantic behavior, whereas model scaling and reasoning optimization strengthen the model’s ability to resist undesirable user views. Reasoning models generally outperform instruction-tuned models but often fail when they over-index on logical exposition instead of directly addressing the user’s underlying beliefs. Finally, we evaluate four additional prompting strategies and demonstrate that adopting a third-person perspective reduces sycophancy by up to 63.8% in the debate scenario. We release our code and data at https://github.com/JiseungHong/SYCON-Bench.

[41] DeepQuestion: Systematic Generation of Real-World Challenges for Evaluating LLMs Performance

Ali Khoramfar, Ali Ramezani, Mohammad Mahdi Mohajeri, Mohammad Javad Dousti, Majid Nili Ahmadabadi, Heshaam Faili

Main category: cs.CL

TL;DR: DeepQuestion is a framework that systematically increases dataset complexity using Bloom’s taxonomy to create scenario-based problems and instruction-based prompts, revealing significant performance drops in LLMs as cognitive demands increase.

DetailsMotivation: Current LLMs perform well on standard benchmarks but fail on complex real-world problems, indicating that existing benchmarks overestimate true reasoning capabilities and lack cognitive diversity.

Method: Developed DeepQuestion framework grounded in Bloom’s taxonomy to generate: 1) scenario-based problems testing knowledge application in realistic contexts, and 2) instruction-based prompts requiring models to create questions from solutions to assess synthesis/evaluation skills.

Result: Evaluation across 10 leading open-source and proprietary models showed performance decline up to 70% as tasks ascend cognitive hierarchy, revealing current benchmarks overestimate reasoning abilities.

Conclusion: Current benchmarks insufficiently measure LLM reasoning; cognitively diverse evaluations are critical for guiding future LLM development toward true complex problem-solving capabilities.

Abstract: While Large Language Models (LLMs) achieve near-human performance on standard benchmarks, their capabilities often fail to generalize to complex, real-world problems. To bridge this gap, we introduce DeepQuestion, a scalable, automated framework that systematically elevates the cognitive complexity of existing datasets. Grounded in Bloom’s taxonomy, DeepQuestion generates (1) scenario-based problems to test the application of knowledge in noisy, realistic contexts, and (2) instruction-based prompts that require models to create new questions from a given solution path, assessing synthesis and evaluation skills. Our extensive evaluation across ten leading open-source and proprietary models reveals a stark performance decline, with accuracy dropping by up to 70%, as tasks ascend the cognitive hierarchy. These findings underscore that current benchmarks overestimate true reasoning abilities and highlight the critical need for cognitively diverse evaluations to guide future LLM development.

[42] PersonalAI: A Systematic Comparison of Knowledge Graph Storage and Retrieval Approaches for Personalized LLM agents

Mikhail Menschikov, Dmitry Evseev, Victoria Dochkina, Ruslan Kostoev, Ilia Perepechkin, Petr Anokhin, Nikita Semenov, Evgeny Burnaev

Main category: cs.CL

TL;DR: A knowledge graph-based external memory framework for LLMs that automatically constructs and updates memory using hybrid graph structures with hyper-edges for semantic and temporal representations, supporting diverse retrieval mechanisms.

DetailsMotivation: Current LLMs with RAG lack structured memory and fail to scale in complex, long-term interactions, needing better personalization and adaptation to user history.

Method: Proposes a flexible external memory framework based on knowledge graphs that automatically constructs and updates memory using LLMs. Uses hybrid graph design with standard edges and two types of hyper-edges for semantic/temporal representations. Supports diverse retrieval mechanisms including A*, water-circle traversal, beam search, and hybrid methods.
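The pluggable retrieval idea can be illustrated with one of the named traversals. A minimal beam search over a knowledge graph stored as an adjacency dict; the graph, scoring function, and parameters are illustrative, and the paper makes exactly this component (A*, water-circle, beam, hybrid) swappable:

```python
def beam_search_paths(graph, start, goal, score, width=2, max_depth=4):
    """Beam search over a KG given as {node: [neighbor, ...]}.

    Keeps the `width` best partial paths per depth under `score`
    (higher is better) and returns the first path reaching `goal`,
    or None if none is found within `max_depth` hops.
    """
    beam = [[start]]
    for _ in range(max_depth):
        candidates = []
        for path in beam:
            for nxt in graph.get(path[-1], []):
                if nxt not in path:  # avoid revisiting nodes
                    candidates.append(path + [nxt])
        for path in candidates:
            if path[-1] == goal:
                return path
        beam = sorted(candidates, key=score, reverse=True)[:width]
        if not beam:
            break
    return None
```

In the memory framework, `score` would come from semantic relevance of edges (including hyper-edges) to the query rather than path length.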

Result: Evaluated on TriviaQA, HotpotQA, and DiaASQ benchmarks, showing different memory/retrieval configurations yield optimal performance depending on task. Extended DiaASQ with temporal annotations and contradictory statements, demonstrating robustness in managing temporal dependencies and context-aware reasoning.

Conclusion: The knowledge graph-based memory framework effectively addresses LLM memory limitations, enabling better personalization and adaptation to complex, long-term interactions through structured memory and flexible retrieval mechanisms.

Abstract: Personalizing language models to effectively incorporate user interaction history remains a central challenge in the development of adaptive AI systems. While large language models (LLMs), combined with Retrieval-Augmented Generation (RAG), have improved factual accuracy, they often lack structured memory and fail to scale in complex, long-term interactions. To address this, we propose a flexible external memory framework based on a knowledge graph, which the LLM itself constructs and updates automatically. Building upon the AriGraph architecture, we introduce a novel hybrid graph design that supports both standard edges and two types of hyper-edges, enabling rich and dynamic semantic and temporal representations. Our framework also supports diverse retrieval mechanisms, including A*, water-circle traversal, beam search and hybrid methods, making it adaptable to different datasets and LLM capacities. We evaluate our system on three benchmarks: TriviaQA, HotpotQA, and DiaASQ, and demonstrate that different memory and retrieval configurations yield optimal performance depending on the task. Additionally, we extend the DiaASQ benchmark with temporal annotations and internally contradictory statements, showing that our system remains robust and effective in managing temporal dependencies and context-aware reasoning.

[43] MLP Memory: A Retriever-Pretrained Memory for Large Language Models

Rubin Wei, Jiaqi Cao, Jiarui Wang, Jushi Kai, Qipeng Guo, Bowen Zhou, Zhouhan Lin

Main category: cs.CL

TL;DR: MLP Memory is a parametric module that learns retrieval patterns from a kNN retriever, integrated with Transformer decoders to improve knowledge access while maintaining fast inference.

DetailsMotivation: Address the trade-off between RAG (flexible external knowledge but high latency) and parametric fine-tuning (risk of catastrophic forgetting and degraded capabilities) in enhancing LLMs' factual accuracy.

Method: Pretrain an MLP to imitate a kNN retriever’s behavior on the entire pretraining dataset, then integrate this MLP Memory with Transformer decoders through probability interpolation.
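The probability-interpolation step is a convex mix of the decoder's and the memory's next-token distributions. A minimal sketch over token-probability dicts (the mixing weight `lam` is a hypothetical hyperparameter name, not the paper's notation):

```python
def interpolate(p_lm, p_mem, lam=0.3):
    """Mix decoder and memory next-token distributions:
    p = (1 - lam) * p_lm + lam * p_mem, renormalized for safety
    in case either input distribution is truncated."""
    mixed = {t: (1 - lam) * p_lm.get(t, 0.0) + lam * p_mem.get(t, 0.0)
             for t in set(p_lm) | set(p_mem)}
    z = sum(mixed.values())
    return {t: p / z for t, p in mixed.items()}
```

This is the same interpolation form kNN-LM uses; the difference here is that `p_mem` comes from the pretrained MLP rather than an actual datastore lookup, which is where the inference speedup over RAG comes from.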

Result: Achieves 17.5% and 24.1% scaling gains on WikiText-103 and Web datasets, 12.3% improvement on QA benchmarks, 5.2 points gain across NLP tasks, reduces hallucinations by up to 10 points, and provides 2.5× faster inference than RAG.

Conclusion: Learning retrieval patterns parametrically bridges the gap between efficient inference and effective knowledge access, offering a practical alternative to both RAG and fine-tuning approaches.

Abstract: Modern approaches to enhancing Large Language Models’ factual accuracy and knowledge utilization face a fundamental trade-off: non-parametric retrieval-augmented generation (RAG) provides flexible access to external knowledge but suffers from high inference latency and shallow integration, while parametric fine-tuning methods like LoRA risk catastrophic forgetting and degraded general capabilities. In this work, we propose MLP Memory, a lightweight parametric module that learns to internalize retrieval patterns without explicit document access. By pretraining an MLP to imitate a kNN retriever’s behavior on the entire pretraining dataset, we create a differentiable memory component that captures the benefits of retrieval-based knowledge access in a fully parametric form. Our architecture integrates this pretrained MLP Memory with Transformer decoders through simple probability interpolation, yielding 17.5% and 24.1% scaling gains on WikiText-103 and Web datasets, respectively. It further achieves 12.3% relative improvement on five question-answering benchmarks and 5.2 points absolute gain across nine general NLP tasks, while reducing hallucinations by up to 10 points on HaluEval. Moreover, MLP Memory delivers 2.5× faster inference than RAG with superior accuracy. Our findings show that learning retrieval patterns parametrically bridges the gap between efficient inference and effective knowledge access, offering a practical alternative to both RAG and fine-tuning approaches.

[44] FeynTune: Large Language Models for High-Energy Theory

Paul Richmond, Prarit Agarwal, Borun Chowdhury, Vasilis Niarchos, Constantinos Papageorgakis

Main category: cs.CL

TL;DR: Fine-tuned Llama-3.1 models specialized for High-Energy Physics using arXiv abstracts, outperforming base models and compared against commercial LLMs.

DetailsMotivation: To develop specialized language models for theoretical High-Energy Physics by fine-tuning on domain-specific arXiv abstracts to improve performance on physics-related text completion tasks.

Method: Created 20 fine-tuned variants of 8B Llama-3.1 using arXiv abstracts from hep-th, hep-ph, and gr-qc categories through August 2024. Used two distinct Low-Rank Adaptation (LoRA) fine-tuning approaches with varying dataset sizes. Also trained comparative models on q-bio and cs categories.

Result: All fine-tuned models outperformed the base Llama-3.1 model on hep-th abstract completion tasks. Performance was compared against leading commercial LLMs (ChatGPT, Claude, Gemini, DeepSeek).

Conclusion: Specialized language models for High-Energy Theoretical Physics can be effectively created through fine-tuning on domain-specific data, providing insights for further development of specialized LLMs in physics.

Abstract: We present specialized Large Language Models for theoretical High-Energy Physics, obtained as 20 fine-tuned variants of the 8-billion parameter Llama-3.1 model. Each variant was trained on arXiv abstracts (through August 2024) from different combinations of hep-th, hep-ph and gr-qc. For a comparative study, we also trained models on datasets that contained abstracts from disparate fields such as the q-bio and cs categories. All models were fine-tuned using two distinct Low-Rank Adaptation fine-tuning approaches and varying dataset sizes, and outperformed the base model on hep-th abstract completion tasks. We compare performance against leading commercial LLMs (ChatGPT, Claude, Gemini, DeepSeek) and derive insights for further developing specialized language models for High-Energy Theoretical Physics.

[45] Latent Self-Consistency for Reliable Majority-Set Selection in Short- and Long-Answer Reasoning

Jungsuk Oh, Jay-Yoon Lee

Main category: cs.CL

TL;DR: Latent Self-Consistency (LSC) improves LLM output consistency by selecting semantically similar responses using learnable token embeddings, working across short and long-form tasks with minimal overhead.

DetailsMotivation: Current consistency methods like Self-Consistency work well for short-form QA but fail on long-form responses, while Universal Self-Consistency and Weighted Unigram Consistency Score extend to long-form but lose accuracy on short-form benchmarks. There's a need for a unified approach that maintains accuracy across both formats with minimal computational overhead.

Method: LSC selects the most semantically consistent response using learnable token embeddings rather than exact string matching. It processes only summary tokens with lightweight forward processing, introducing negligible runtime overhead (≤0.9%) and requiring no architectural changes to the base LLM.

Result: Across 6 short-form and 5 long-form reasoning benchmarks (e.g., MATH, MMLU, TruthfulQA), LSC surpasses SC, USC, and WUCS in average performance for both formats. It provides well-calibrated confidence estimates with low expected calibration error across answer formats.

Conclusion: LSC is a reliable consistency-selection method that works effectively across various answer formats with negligible computational overhead, positioning it as a practical solution for improving LLM output consistency in both short and long-form reasoning tasks.

Abstract: Probabilistic decoding in Large Language Models (LLMs) often yields inconsistent outputs, particularly on complex or long-form questions. Self-Consistency (SC) mitigates this for short-form QA by majority voting over exact strings, whereas Universal Self-Consistency (USC) and Weighted Unigram Consistency Score (WUCS) extend to long-form responses but lose accuracy on short-form benchmarks. We introduce Latent Self-Consistency (LSC), which selects the most semantically consistent response using learnable token embeddings. LSC’s lightweight forward processing of only the summary tokens introduces negligible runtime overhead (at most 0.9%) on top of standard decoding of the base LLM, and requires no changes to the model architecture. Across 6 short-form and 5 long-form reasoning benchmarks (e.g., MATH, MMLU, TruthfulQA), LSC surpasses SC, USC, and WUCS in average performance on both formats, while adding negligible computational overhead on vanilla inference. These results position LSC as a reliable consistency-selection method that works effectively across various answer formats. Additionally, LSC provides well-calibrated confidence estimates, maintaining low expected calibration error across both answer formats.
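
The selection step at the core of LSC (pick the response whose embedding agrees most with the others) can be sketched as follows. This is a minimal illustration using mean pairwise cosine similarity over precomputed vectors, not the paper's learnable summary-token embeddings:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def select_most_consistent(embeddings):
    """Return the index of the sampled response whose embedding has the
    highest mean cosine similarity to all other sampled responses."""
    scores = []
    for i, e in enumerate(embeddings):
        sims = [cosine(e, other) for j, other in enumerate(embeddings) if j != i]
        scores.append(sum(sims) / len(sims))
    return max(range(len(scores)), key=scores.__getitem__)

# Three sampled responses: two near-duplicates and one outlier.
emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(select_most_consistent(emb))  # → 1 (a member of the near-duplicate pair)
```

Unlike SC's exact-string vote, ties in surface form do not matter here; only embedding-space agreement does.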

[46] Alignment through Meta-Weighted Online Sampling: Bridging the Gap between Data Generation and Preference Optimization

Junming Yang, Ning Xu, Biao Liu, Shiqi Qiao, Xin Geng

Main category: cs.CL

TL;DR: MetaAPO is a novel preference optimization framework that dynamically couples online data generation with model training using a meta-learner to estimate alignment gaps and balance online/offline data quality and distribution.

Motivation: Existing preference optimization methods suffer from distribution mismatch between offline preference data and evolving model policies, using static heuristics or decoupled online sampling that fail to adapt to dynamic learning states.

Method: MetaAPO employs a lightweight meta-learner as an “alignment gap estimator” to evaluate benefits of on-policy sampling vs offline data, guiding targeted online generation and assigning sample-wise meta-weights to dynamically balance online/offline data quality and distribution.

Result: MetaAPO consistently outperforms existing preference optimization approaches on AlpacaEval 2, Arena-Hard and MT-Bench benchmarks while reducing online annotation costs by 42%.

Conclusion: MetaAPO effectively bridges the distribution gap in preference optimization through dynamic coupling of data generation and model training, achieving superior performance with reduced annotation costs.

Abstract: Preference optimization is crucial for aligning large language models (LLMs) with human values and intentions. A significant challenge in this process is the distribution mismatch between pre-collected offline preference data and the evolving model policy. Existing methods attempt to reduce this gap using static heuristics or decoupled online sampling strategies, but they often fail to adapt to the model’s dynamic learning state. To bridge this gap, we propose Meta-Weighted Adaptive Preference Optimization (MetaAPO), a novel framework that dynamically couples data generation with model training. MetaAPO employs a lightweight meta-learner as an “alignment gap estimator” to evaluate the potential benefits of on-policy sampling in relation to offline data. This guides targeted online generation and assigns sample-wise meta-weights to the optimization objective, dynamically balancing the quality and distribution of online and offline data. Experiments on AlpacaEval 2, Arena-Hard and MT-Bench demonstrate that MetaAPO consistently outperforms existing preference optimization approaches across various settings, while reducing online annotation costs by 42%. Code is available at https://github.com/junming-yang/MetaAPO.
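
The sample-wise meta-weighting idea can be sketched as follows. The sigmoid mapping and the `gap_estimate` input are illustrative assumptions; the paper's actual meta-learner and objective are not specified in this summary:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def meta_weighted_loss(offline_loss, online_loss, gap_estimate):
    """Blend offline and online preference losses with a sample-wise
    weight derived from an alignment-gap estimate: the larger the
    estimated gap, the more the freshly sampled online pair is trusted."""
    w = sigmoid(gap_estimate)  # hypothetical meta-learner output
    return w * online_loss + (1.0 - w) * offline_loss

# Large positive gap -> the objective leans on the online sample.
print(round(meta_weighted_loss(0.8, 0.2, 4.0), 3))  # → 0.211
```

With a near-zero gap estimate the weight falls back toward 0.5, recovering an even mix of offline and online signal.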

[47] MobileLLM-R1: Exploring the Limits of Sub-Billion Language Model Reasoners with Open Training Recipes

Changsheng Zhao, Ernie Chang, Zechun Liu, Chia-Jung Chang, Wei Wen, Chen Lai, Sheng Cao, Yuandong Tian, Raghuraman Krishnamoorthi, Yangyang Shi, Vikas Chandra

Main category: cs.CL

TL;DR: MobileLLM-R1 demonstrates that strong reasoning capabilities in sub-billion-parameter models can emerge with only ~2T tokens of high-quality curated data, challenging the assumption that massive datasets (>10T tokens) are necessary for reasoning emergence.

Motivation: To challenge the prevailing assumption that reasoning capabilities in LLMs require training on massive datasets (>10T tokens), showing that careful data curation and quality can enable reasoning emergence with far less data.

Method: Curated and resampled open-source datasets using designed metrics to identify beneficial data, trained MobileLLM-R1 series (sub-billion parameters) on only ~2T tokens of high-quality data, followed by established post-training procedures.

Result: MobileLLM-R1-950M achieves AIME score of 15.5 vs 0.6 for OLMo-2-1.48B and 0.3 for SmolLM-2-1.7B, matches/surpasses Qwen3-0.6B on reasoning benchmarks despite using only 11.7% of Qwen3’s tokens.

Conclusion: Reasoning capabilities can emerge with far less data than previously assumed through careful data curation, enabling efficient sub-billion-parameter reasoning models that outperform larger models trained on more data.

Abstract: The paradigm shift in large language models (LLMs) from instinctive responses to chain-of-thought (CoT) reasoning has fueled two prevailing assumptions: (1) reasoning capabilities only emerge in sufficiently large models, and (2) such capabilities require training on massive datasets. While the first assumption has already been challenged by recent sub-billion-parameter reasoning models such as Qwen3-0.6B and DeepSeek distilled variants, the second remains largely unquestioned. In this work, we revisit the necessity of scaling to extremely large corpora (>10T tokens) for reasoning emergence. By carefully curating and resampling open-source datasets that we identify as beneficial under our designed metrics, we demonstrate that strong reasoning abilities can emerge with far less data. Specifically, we show that only ~2T tokens of high-quality data are sufficient, and pre-training with 4.2T tokens on the dataset resampled from these ~2T tokens, followed by an established post-training procedure, enables the development of MobileLLM-R1, a series of sub-billion-parameter reasoning models that substantially outperform prior models trained on fully open-sourced data. For example, MobileLLM-R1-950M achieves an AIME score of 15.5, compared to just 0.6 for OLMo-2-1.48B and 0.3 for SmolLM-2-1.7B. Remarkably, despite being trained on only 11.7% of the tokens compared to Qwen3’s proprietary 36T-token corpus for pretraining, MobileLLM-R1-950M matches or surpasses Qwen3-0.6B across multiple reasoning benchmarks. To facilitate further research in this direction, we have made the models (https://huggingface.co/collections/facebook/mobilellm-r1) and code (https://github.com/facebookresearch/MobileLLM-R1) publicly available, along with the complete training recipe, data sources, and data mixing ratios.

[48] Scaling Generalist Data-Analytic Agents

Shuofei Qiao, Yanqiu Zhao, Zhisong Qiu, Xiaobin Wang, Jintian Zhang, Zhao Bin, Ningyu Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen

Main category: cs.CL

TL;DR: DataMind: A scalable data synthesis and agent training framework for building generalist data-analytic agents that outperform proprietary models on data analysis benchmarks.

Motivation: Current data-analytic agents rely heavily on proprietary models with prompt engineering, while open-source models struggle with diverse data formats, large-scale files, and multi-step reasoning required for real-world analytics.

Method: 1) Fine-grained task taxonomy with recursive easy-to-hard composition for diverse queries; 2) Knowledge-augmented trajectory sampling with filtering; 3) Dynamic training combining SFT and RL losses; 4) Memory-frugal, stable code-based multi-turn rollout framework.

Result: DataMind-14B achieves SOTA 71.16% average score on data analysis benchmarks, outperforming DeepSeek-V3.1 and GPT-5. DataMind-7B scores 68.10%, best among open-source models. DataMind-12K dataset created with diverse domains and formats.

Conclusion: DataMind provides effective framework for building open-source data-analytic agents that can handle real-world analytics challenges, with released datasets and models for community research.

Abstract: Data-analytic agents are emerging as a key catalyst for automated scientific discovery and for the vision of Innovating AI. Current approaches, however, rely heavily on prompt engineering over proprietary models, while open-source models struggle with the diverse-format, large-scale data files and long-horizon, multi-step reasoning that real-world analytics demands. This paper introduces DataMind, a scalable data synthesis and agent training recipe designed to build generalist data-analytic agents. DataMind tackles three key challenges in building open-source data-analytic agents, including insufficient data resources, improper training strategy, and unstable code-based multi-turn rollout. Concretely, DataMind applies 1) a fine-grained task taxonomy and a recursive easy-to-hard task composition mechanism to increase the diversity and difficulty of synthesized queries; 2) a knowledge-augmented trajectory sampling strategy followed by model-based and rule-based filtering; 3) a dynamically adjustable training objective combining both SFT and RL losses; 4) a memory-frugal and stable code-based multi-turn rollout framework. Built on DataMind, we curate DataMind-12K, a high-quality trajectory set spanning diverse domains, task categories, and data file formats for data-analytic tasks. Trained on DataMind-12K, our DataMind-14B achieves state-of-the-art with an average score of 71.16% on multiple data analysis benchmarks, outperforming the strongest proprietary baselines DeepSeek-V3.1 and GPT-5. Our DataMind-7B also performs best among all open-source models with a score of 68.10%. We also incorporate some empirical insights gained from our exploratory trials into the analysis experiments, aiming to provide actionable insights about agentic training for the community. We will release DataMind-12K, DataMind-7B, and DataMind-14B for the community’s future research.

[49] Unraveling Syntax: How Language Models Learn Context-Free Grammars

Laura Ying Schulz, Daniel Mitropolsky, Tomaso Poggio

Main category: cs.CL

TL;DR: The paper analyzes how neural language models learn context-free grammar substructures, proving theoretical results about loss decomposition and showing transformers learn subgrammars in parallel rather than sequentially like children.

Motivation: To understand the learning dynamics of large models, particularly how they process hierarchical structures like context-free grammars, and to analyze their behavior with respect to grammar substructures (subgrammars).

Method: Theoretical analysis defining subgrammars and proving theorems about language modeling loss decomposition over subgrammars, plus empirical validation with small transformers on CFG tasks.

Result: Proved that language modeling loss decomposes linearly over subgrammars, showed transformers learn subgrammars in parallel (unlike children’s sequential learning), found subgrammar pretraining helps tiny models and improves internal alignment with grammar structure.

Conclusion: Language models exhibit different learning patterns than humans for hierarchical structures, with parallel subgrammar learning and persistent difficulty with deep recursion even in large models.

Abstract: While large models achieve impressive results, their learning dynamics are far from understood. Many domains of interest, such as natural language syntax, coding languages, arithmetic problems, are captured by context-free grammars (CFGs). In this work, we extend prior work on neural language modeling of CFGs in a novel direction: how language modeling behaves with respect to CFG substructure, namely “subgrammars”. We first define subgrammars, and prove a set of fundamental theorems regarding language modeling and subgrammars. We show that language modeling loss (or equivalently the Kullback-Leibler divergence) recurses linearly over its top-level subgrammars; applied recursively, the loss decomposes into losses for “irreducible” subgrammars. We also prove that the constant in this linear recurrence is a function of the expected recursion, a notion we introduce. We show that under additional assumptions, parametrized models learn subgrammars in parallel. Empirically, we confirm that small transformers learn subgrammars in parallel, unlike children, who first master simple substructures. We also briefly explore several other questions regarding subgrammars. We find that subgrammar pretraining can improve final performance, but only for tiny models relative to the grammar, while alignment analyses show that pretraining consistently leads to internal representations that better reflect the grammar’s substructure in all cases; we also observe persistent difficulty with deeper recursion, a limitation that appears even in large language models.
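
A toy check of the linear decomposition is possible in the special case where top-level subgrammars generate disjoint sets of strings: the chain rule for entropy then splits the language-modeling loss into the mixture-weight entropy plus a weighted sum of per-subgrammar losses. The two-subgrammar grammar below is a hypothetical example, not one from the paper:

```python
import math

def entropy(p):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

# Toy grammar: the start symbol expands into one of two "subgrammars"
# with mixture weights w; each subgrammar emits its own strings.
w = [0.25, 0.75]
sub1 = [0.5, 0.5]   # string distribution of subgrammar 1
sub2 = [0.9, 0.1]   # string distribution of subgrammar 2

# Full distribution over the (disjoint) union of strings.
full = [w[0] * p for p in sub1] + [w[1] * p for p in sub2]

lhs = entropy(full)
rhs = entropy(w) + w[0] * entropy(sub1) + w[1] * entropy(sub2)
print(abs(lhs - rhs) < 1e-12)  # → True: loss decomposes linearly
```

The paper's theorems generalize well beyond this disjoint-support special case, with a recurrence constant tied to the expected recursion depth.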

[50] Semantic Regexes: Auto-Interpreting LLM Features with a Structured Language

Angie Boggust, Donghao Ren, Yannick Assogba, Dominik Moritz, Arvind Satyanarayan, Fred Hohman

Main category: cs.CL

TL;DR: Semantic regexes provide structured language descriptions for LLM features, offering more precise and consistent interpretability than natural language descriptions.

Motivation: Natural language descriptions of LLM features are often vague, inconsistent, and require manual relabeling, limiting automated interpretability. There's a need for more structured, precise feature descriptions.

Method: Introduces semantic regexes that combine linguistic and semantic pattern primitives with modifiers for contextualization, composition, and quantification to create structured feature descriptions.

Result: Semantic regexes match natural language accuracy while being more concise and consistent. They enable new analyses like quantifying feature complexity across layers and scaling interpretability to model-wide patterns.

Conclusion: Semantic regexes provide a structured approach to LLM interpretability that helps people build accurate mental models of features, overcoming limitations of natural language descriptions.

Abstract: Automated interpretability aims to translate large language model (LLM) features into human understandable descriptions. However, natural language feature descriptions can be vague, inconsistent, and require manual relabeling. In response, we introduce semantic regexes, structured language descriptions of LLM features. By combining primitives that capture linguistic and semantic patterns with modifiers for contextualization, composition, and quantification, semantic regexes produce precise and expressive feature descriptions. Across quantitative benchmarks and qualitative analyses, semantic regexes match the accuracy of natural language while yielding more concise and consistent feature descriptions. Their inherent structure affords new types of analyses, including quantifying feature complexity across layers, scaling automated interpretability from insights into individual features to model-wide patterns. Finally, in user studies, we find that semantic regexes help people build accurate mental models of LLM features.
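
To make the primitive-plus-modifier idea concrete, here is a hypothetical miniature of the approach; the paper's actual primitive set, syntax, and matching semantics are richer than this sketch:

```python
# Hypothetical mini-version of a "semantic regex": pattern primitives
# over tokens, composed and quantified. Purely illustrative.

def word(w):                        # linguistic primitive: exact token
    return lambda tok: tok == w

def in_category(vocab):             # semantic primitive: token in a concept set
    return lambda tok: tok in vocab

def any_of(*preds):                 # composition modifier
    return lambda tok: any(p(tok) for p in preds)

def matches(pattern, tokens, min_hits=1):   # quantification modifier
    """A feature 'fires' on a token sequence if the pattern matches
    at least min_hits tokens."""
    return sum(1 for t in tokens if pattern(t)) >= min_hits

FRUIT = {"apple", "pear", "mango"}
pattern = any_of(word("eat"), in_category(FRUIT))

print(matches(pattern, ["i", "eat", "an", "apple"], min_hits=2))  # → True
```

Because such a description is executable, it can be scored against feature activations automatically, which is what makes the consistency claims testable.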

[51] PTEB: Towards Robust Text Embedding Evaluation via Stochastic Paraphrasing at Evaluation Time with LLMs

Manuel Frank, Haithem Afli

Main category: cs.CL

TL;DR: PTEB introduces a dynamic evaluation protocol for sentence embeddings that generates stochastic paraphrases at test time to better assess real-world robustness beyond static benchmarks like MTEB.

Motivation: Static benchmarks like MTEB can lead to overfitting and inflated scores, obscuring real-world robustness. There's a need for dynamic evaluation that tests models' sensitivity to semantic-preserving surface form variations.

Method: Developed PTEB protocol using LLM-based method to generate meaning-preserving paraphrases at evaluation time, validated with gold ratings and human validation. Applied across 7 MTEB tasks, 20 datasets, and 25 languages with statistical robustness over multiple runs.

Result: Showed sentence encoder performance is sensitive to token space changes even when semantics remain fixed. Found smaller models are not disproportionately affected relative to larger ones. Demonstrated statistical robustness across diverse datasets and languages.

Conclusion: Proposes a new NLP evaluation paradigm shifting from static benchmarks to dynamic, stochastic evaluation leveraging eval-time compute. PTEB provides a more realistic assessment of sentence embedding robustness to surface form variations.

Abstract: Current sentence embedding evaluations typically rely on static test beds like the Massive Text Embedding Benchmark (MTEB). While invaluable, repeated tuning on a fixed suite can inflate reported scores and obscure real-world robustness. We introduce the Paraphrasing Text Embedding Benchmark (PTEB), a dynamic protocol that stochastically generates meaning-preserving paraphrases at evaluation time and aggregates results across multiple runs. Using a cost-efficient LLM-based method grounded in gold ratings and human validation, we show that LLMs generate token-diverse but semantically preserving paraphrases. Across 7 MTEB tasks, we validate our hypothesis that the performance of sentence encoders is sensitive to changes in token space even when semantics remain fixed. We also observe that smaller models are not disproportionately affected relative to larger ones. Our results are statistically robust over multiple runs spanning 20 datasets and 25 languages. More generally, we aim to propose a new evaluation paradigm in NLP that relies less on static, pre-defined benchmarks but shifts towards dynamic, stochastic evaluation leveraging eval-time compute. We make the code to run PTEB publicly available.
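
The protocol's shape (sample one paraphrase per item per run, then aggregate across runs) can be sketched as follows. The `evaluate` scorer is a stand-in assumption; PTEB would invoke a real embedding-task metric and an LLM paraphraser at this point:

```python
import random
import statistics

def evaluate(model_score, paraphrase):
    """Placeholder for scoring an embedding model on one paraphrased
    test item; the jitter mimics surface-form sensitivity."""
    random.seed(hash(paraphrase) % 10_000)
    return model_score + random.uniform(-0.05, 0.05)

def pteb_run(model_score, paraphrase_sets, n_runs=5, seed=0):
    """Stochastically pick one meaning-preserving paraphrase per item
    per run, then aggregate the metric across runs."""
    rng = random.Random(seed)
    run_scores = []
    for _ in range(n_runs):
        picks = [rng.choice(ps) for ps in paraphrase_sets]
        run_scores.append(statistics.mean(evaluate(model_score, p) for p in picks))
    return statistics.mean(run_scores), statistics.stdev(run_scores)

sets = [["the cat sat", "a cat was sitting"], ["it rains", "rain is falling"]]
mean, std = pteb_run(0.80, sets)
print(f"{mean:.3f} +/- {std:.3f}")
```

Reporting the spread across runs, not a single number, is what distinguishes this from a static benchmark score.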

[52] RCPU: Rotation-Constrained Error Compensation for Structured Pruning of Large Language Models

Shuichiro Haruta, Kazunori Matsumoto, Zhi Li, Yanan Wang, Mori Kurokawa

Main category: cs.CL

TL;DR: Rotation-constrained compensation method for structured pruning of LLMs that preserves output geometry while reducing errors, with variance-aware importance scoring for better component retention.

Motivation: Structured pruning of LLMs with limited calibration data causes output mismatches, and direct least-squares fitting overfits and destructively modifies pretrained weights, requiring a geometry-preserving compensation method.

Method: Proposes rotation-constrained updates that preserve output representation geometry (norms and inner products) while re-aligning pruned subspace with original outputs, combined with variance-aware importance scoring to prioritize retention of dimensions affecting principal output directions.

Result: Applied to Llama-7B and Llama-2-13B, shows consistently better perplexity on WikiText2 and higher task accuracy on multiple language understanding benchmarks compared to existing baselines.

Conclusion: The rotation-constrained compensation method effectively reduces pruning errors while preserving important geometric properties of LLM representations, outperforming existing pruning techniques.

Abstract: In this paper, we propose a rotation-constrained compensation method to address the errors introduced by structured pruning of large language models (LLMs). LLMs are trained on massive datasets and accumulate rich semantic knowledge in their representation space. In contrast, pruning is typically carried out with only a small amount of calibration data, which makes output mismatches unavoidable. Although direct least-squares fitting can reduce such errors, it tends to overfit to the limited calibration set, destructively modifying pretrained weights. To overcome this difficulty, we update the pruned parameters under a rotation constraint. This constrained update preserves the geometry of output representations (i.e., norms and inner products) and simultaneously re-aligns the pruned subspace with the original outputs. Furthermore, in rotation-constrained compensation, removing components that strongly contribute to the principal directions of the output makes error recovery difficult. Since input dimensions with large variance strongly affect these principal directions, we design a variance-aware importance score that ensures such dimensions are preferentially kept in the pruned model. By combining this scoring rule with rotation-constrained updates, the proposed method effectively compensates errors while retaining the components likely to be more important in a geometry-preserving manner. In the experiments, we apply the proposed method to Llama-7B and Llama-2-13B, and evaluate it on WikiText2 and multiple language understanding benchmarks. The results demonstrate consistently better perplexity and task accuracy compared with existing baselines.
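
A rotation-constrained update of this kind is an instance of the orthogonal-Procrustes problem. In two dimensions it has a closed form, sketched below purely as an illustration; the paper operates on high-dimensional LLM activations, where the solution would come from an SVD instead:

```python
import math

def best_rotation_2d(X, Y):
    """Closed-form 2-D solution of min_R ||X R - Y|| over rotations
    (rows of X are inputs, rows of Y are targets). Rotations preserve
    norms and inner products, i.e. the output geometry."""
    s = sum(x[0] * y[1] - x[1] * y[0] for x, y in zip(X, Y))  # cross terms
    c = sum(x[0] * y[0] + x[1] * y[1] for x, y in zip(X, Y))  # dot terms
    theta = math.atan2(s, c)
    return [[math.cos(theta), math.sin(theta)],
            [-math.sin(theta), math.cos(theta)]]

X = [[1.0, 0.0], [0.0, 1.0]]
Y = [[0.0, 1.0], [-1.0, 0.0]]   # X rotated by 90 degrees
R = best_rotation_2d(X, Y)
recon = [[sum(x[k] * R[k][j] for k in range(2)) for j in range(2)] for x in X]
print(recon)  # numerically close to Y
```

Because R is constrained to be a rotation, the update cannot overfit by rescaling or shearing the pretrained representation space, unlike unconstrained least squares.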

[53] DiffAdapt: Difficulty-Adaptive Reasoning for Token-Efficient LLM Inference

Xiang Liu, Xuming Hu, Xiaowen Chu, Eunsol Choi

Main category: cs.CL

TL;DR: DiffAdapt: Lightweight framework that adapts inference strategies based on problem difficulty and reasoning trace entropy to reduce LLM overthinking and improve efficiency

Motivation: Large Language Models often generate unnecessarily long reasoning traces (overthinking), especially on easy problems, wasting computational resources. The paper aims to improve LLM efficiency by reducing this overthinking while maintaining performance.

Method: Analyzes entropy patterns in reasoning traces, identifies U-shaped entropy distribution across difficulty levels, then introduces DiffAdapt - a lightweight probe that classifies final hidden states to select appropriate inference strategies (Easy/Normal/Hard) with different prompts, temperature, and token length settings.

Result: Achieves comparable or improved accuracy while reducing token usage by up to 22.4% across five models and eight benchmarks, demonstrating practical compute-efficient reasoning.

Conclusion: DiffAdapt provides an effective, lightweight approach to reduce LLM overthinking and improve computational efficiency without fine-tuning base models, establishing a practical path toward more efficient reasoning systems.

Abstract: Recent reasoning Large Language Models (LLMs) demonstrate remarkable problem-solving abilities but often generate long thinking traces whose utility is unclear. Our work aims to improve their efficiency, enabling them to reach high performance without overthinking. First, we analyze the entropy of token probabilities in reasoning traces. Across three models, we observe a consistent U-shaped entropy pattern: high entropy on easy problems despite high accuracy, low entropy on problems with medium difficulty, and high entropy on hard problems reflecting uncertainty. Specifically, we notice 22–25% entropy reduction from easy to medium difficulty regions, suggesting an “overthinking” phenomenon on easy instances. Building on these insights, we introduce DiffAdapt, a lightweight framework that selects Easy/Normal/Hard inference strategies per question based on their difficulty and reasoning trace entropy. Each inference strategy consists of a fixed prompt, temperature and maximum token length. In contrast to existing efficiency optimization methods, our approach does not fine-tune base LLM but a small probe that classifies LLM’s final hidden state, allowing inexpensive adaptation. We comprehensively evaluate our method on five models and eight benchmarks. Our method achieves comparable or improved accuracy while reducing token usage by up to 22.4%, establishing a practical path toward compute-efficient reasoning.
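
The routing logic can be sketched as follows. Note that DiffAdapt's actual classifier is a probe over the LLM's final hidden state; the entropy thresholds and per-tier settings below are illustrative assumptions only:

```python
import math

def mean_token_entropy(prob_dists):
    """Average Shannon entropy (nats) over a reasoning trace's
    per-token probability distributions."""
    ent = lambda p: -sum(x * math.log(x) for x in p if x > 0)
    return sum(ent(p) for p in prob_dists) / len(prob_dists)

# Hypothetical per-tier inference settings (each tier would also fix a prompt).
STRATEGIES = {
    "Easy":   {"temperature": 0.0, "max_tokens": 512},
    "Normal": {"temperature": 0.6, "max_tokens": 2048},
    "Hard":   {"temperature": 1.0, "max_tokens": 8192},
}

def route(prob_dists, lo=0.5, hi=1.2):   # thresholds are illustrative
    """Map a trace's mean entropy to an inference strategy tier."""
    h = mean_token_entropy(prob_dists)
    tier = "Easy" if h < lo else ("Normal" if h < hi else "Hard")
    return tier, STRATEGIES[tier]

trace = [[0.95, 0.025, 0.025], [0.9, 0.05, 0.05]]   # confident, low-entropy trace
print(route(trace)[0])  # → Easy
```

A confident trace is routed to the cheap deterministic tier, which is where the token savings on easy instances come from.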

[54] Low-Resource Dialect Adaptation of Large Language Models: A French Dialect Case-Study

Eeham Khan, Firas Saidani, Owen Van Esbroeck, Richard Khoury, Leila Kosseim

Main category: cs.CL

TL;DR: Continual pre-training with LoRA enables efficient adaptation of LLMs to low-resource dialects like Québec French using minimal data and compute, improving dialect performance with minimal regression on standard language benchmarks.

Motivation: LLMs have strong capabilities but are confined to high-resource languages with abundant training data. There's a need to expand access to minority linguistic communities by adapting models to low-resource regional dialects efficiently.

Method: Use continual pre-training (CPT) with low-rank adaptation (LoRA) and compute-efficient techniques to adapt three LLMs to Québec French dialect using a very small dataset. Evaluate on COLE suite benchmarks.

Result: Demonstrated improvement on minority dialect benchmarks with minimal regression on prestige language benchmarks, updating only around 1% of model parameters. Gains are highly contingent on corpus composition.

Conclusion: CPT with parameter-efficient fine-tuning can narrow the dialect gap cost-effectively, expanding high-quality LLM access to minority linguistic communities. Released first Québec French LLMs on Hugging Face.

Abstract: Despite the widespread adoption of Large Language Models (LLMs), their strongest capabilities remain largely confined to a small number of high-resource languages for which there is abundant training data. Recently, continual pre-training (CPT) has emerged as a means to fine-tune these models to low-resource regional dialects. In this paper, we study the use of CPT for dialect learning under tight data and compute budgets. Using low-rank adaptation (LoRA) and compute-efficient continual pre-training, we adapt three LLMs to the Québec French dialect using a very small dataset and benchmark them on the COLE suite. Our experiments demonstrate an improvement on the minority dialect benchmarks with minimal regression on the prestige language benchmarks with around 1% of model parameters updated. Analysis of the results demonstrate that gains are highly contingent on corpus composition. These findings indicate that CPT with parameter-efficient fine-tuning (PEFT) can narrow the dialect gap by providing cost-effective and sustainable language resource creation, expanding high-quality LLM access to minority linguistic communities. To support reproducibility and broaden access, we release the first Québec French LLMs on Hugging Face.
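
The parameter-efficiency arithmetic behind the ~1% figure follows from LoRA's low-rank update, W + (alpha/r)·BA, in which only the small matrices A and B are trained. A minimal sketch, with illustrative dimensions rather than the paper's actual configuration:

```python
def lora_effective_weight(W, A, B, alpha):
    """Merge a LoRA update into a frozen weight: W + (alpha/r) * B @ A,
    where B is d_out x r and A is r x d_in (plain nested lists)."""
    r = len(A)
    scale = alpha / r
    d_out, d_in = len(W), len(W[0])
    delta = [[scale * sum(B[i][k] * A[k][j] for k in range(r))
              for j in range(d_in)] for i in range(d_out)]
    return [[W[i][j] + delta[i][j] for j in range(d_in)] for i in range(d_out)]

# Trainable fraction for one d x d layer with rank-r adapters:
d, r = 4096, 16
print(f"{2 * d * r / (d * d):.2%}")  # → 0.78%, on the order of the ~1% reported
```

Since the base weights stay frozen, the same merged-weight trick lets one base model serve several dialect adapters.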

[55] Understanding In-Context Learning Beyond Transformers: An Investigation of State Space and Hybrid Architectures

Shenran Wang, Timothy Tin-Long Tse, Jian Zhu

Main category: cs.CL

TL;DR: The paper evaluates in-context learning across transformer, state-space, and hybrid LLMs using behavioral probing and mechanistic analysis, finding architectural differences in how models implement ICL despite similar performance.

Motivation: To understand how different LLM architectures (transformer, state-space, hybrid) implement in-context learning, and whether similar task performance masks underlying mechanistic differences.

Method: Combination of behavioral probing and intervention-based methods on knowledge-based ICL tasks across state-of-the-art models of different architectures.

Result: Found that while LLMs of different architectures can behave similarly in task performance, their internal mechanisms differ. Function vectors for ICL are primarily in self-attention and Mamba layers, with Mamba2 using different mechanisms. FVs are more important for parametric knowledge retrieval than contextual understanding.

Conclusion: Different LLM architectures use distinct internal mechanisms for ICL despite similar behavioral outputs, highlighting the importance of combining behavioral and mechanistic analyses for understanding model capabilities.

Abstract: We perform in-depth evaluations of in-context learning (ICL) on state-of-the-art transformer, state-space, and hybrid large language models over two categories of knowledge-based ICL tasks. Using a combination of behavioral probing and intervention-based methods, we have discovered that, while LLMs of different architectures can behave similarly in task performance, their internals could remain different. We discover that function vectors (FVs) responsible for ICL are primarily located in the self-attention and Mamba layers, and speculate that Mamba2 uses a different mechanism from FVs to perform ICL. FVs are more important for ICL involving parametric knowledge retrieval, but not for contextual knowledge understanding. Our work contributes to a more nuanced understanding across architectures and task types. Methodologically, our approach also highlights the importance of combining both behavioural and mechanistic analyses to investigate LLM capabilities.

[56] Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning

Yihe Deng, I-Hung Hsu, Jun Yan, Zifeng Wang, Rujun Han, Gufeng Zhang, Yanfei Chen, Wei Wang, Tomas Pfister, Chen-Yu Lee

Main category: cs.CL

TL;DR: SRL is a new training framework that combines supervised learning and reinforcement learning for reasoning tasks, enabling small LLMs to solve complex multi-step problems by generating internal reasoning monologues before actions.

Motivation: Current methods like RLVR fail when correct solutions are rarely sampled, while SFT overfits through rigid token-by-token imitation. There's a need for better training approaches for multi-step reasoning in small LLMs.

Method: SRL reformulates problem solving as generating sequences of logical actions. It trains models to generate internal reasoning monologues before committing to each action, providing smoother rewards based on similarity between model’s actions and expert actions from SFT datasets in step-wise manner.

Result: SRL enables small models to learn challenging problems previously unlearnable by SFT or RLVR. Initializing with SRL before refining with RLVR yields strongest overall performance. Generalizes effectively to agentic software engineering tasks.

Conclusion: SRL is a robust and versatile training framework for reasoning-oriented LLMs that bridges the gap between supervised and reinforcement learning approaches for multi-step reasoning tasks.

Abstract: Large Language Models (LLMs) often struggle with problems that require multi-step reasoning. For small-scale open-source models, Reinforcement Learning with Verifiable Rewards (RLVR) fails when correct solutions are rarely sampled even after many attempts, while Supervised Fine-Tuning (SFT) tends to overfit long demonstrations through rigid token-by-token imitation. To address this gap, we propose Supervised Reinforcement Learning (SRL), a framework that reformulates problem solving as generating a sequence of logical “actions”. SRL trains the model to generate an internal reasoning monologue before committing to each action. It provides smoother rewards based on the similarity between the model’s actions and expert actions extracted from the SFT dataset in a step-wise manner. This supervision offers richer learning signals even when all rollouts are incorrect, while encouraging flexible reasoning guided by expert demonstrations. As a result, SRL enables small models to learn challenging problems previously unlearnable by SFT or RLVR. Moreover, initializing training with SRL before refining with RLVR yields the strongest overall performance. Beyond reasoning benchmarks, SRL generalizes effectively to agentic software engineering tasks, establishing it as a robust and versatile training framework for reasoning-oriented LLMs.
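
The step-wise reward can be sketched as follows. Jaccard token overlap stands in for the paper's similarity function, which this summary does not specify:

```python
def step_reward(model_action, expert_action):
    """Token-overlap (Jaccard) similarity between one model action and
    the matching expert action from the SFT trajectory."""
    m, e = set(model_action.split()), set(expert_action.split())
    return len(m & e) / len(m | e) if m | e else 1.0

def srl_reward(model_actions, expert_actions):
    """Dense step-wise reward: average per-step similarity, so a rollout
    earns learning signal even when its final answer is wrong."""
    steps = list(zip(model_actions, expert_actions))
    return sum(step_reward(m, e) for m, e in steps) / len(steps)

model = ["expand the bracket", "collect like terms", "divide by 2"]
expert = ["expand the bracket", "collect terms", "divide both sides by 2"]
print(round(srl_reward(model, expert), 3))  # → 0.756
```

Contrast this with RLVR's verifiable reward, which would score the whole rollout 0 unless the final answer checks out.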

[57] Error-Aware Knowledge Distillation via Targeted Revision for Customer-Service Summarization

Hee-Jin Lee, Zhen Guo, Luchao Jin, Morteza Moazami Goudarzi

Main category: cs.CL

TL;DR: ARF pipeline enables smaller open-source LLMs to outperform larger proprietary models in customer service summarization through error analysis, targeted revision, and fine-tuning on refined data.

DetailsMotivation: To address the limitations of large proprietary models (cost, privacy) by enabling smaller open-source LLMs to achieve superior performance in specific downstream tasks like customer service summarization.

Method: Analyze-Revise-Finetune (ARF) pipeline: 1) Analyze and categorize errors in teacher model (GPT-3.5) summaries, 2) Use compact editor model (Llama 3.1 70B) for targeted revision to create high-quality training data, 3) Fine-tune smaller student models (Llama 3.1 8B, QWen3 4B) on refined data.
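The three ARF stages can be sketched as a small data-refinement loop. Everything here is a hypothetical stand-in: `teacher`/`editor` represent GPT-3.5 and Llama 3.1 70B API calls, and `error_checks` represents the paper's error taxonomy as predicates:

```python
def arf_pipeline(dialogs, teacher, editor, error_checks):
    """Analyze-Revise-Finetune data refinement (illustrative sketch).
    error_checks: {error_name: predicate(dialog, summary) -> bool}."""
    refined = []
    for d in dialogs:
        draft = teacher(d)                              # teacher summary
        # 1) Analyze: categorize errors in the teacher's output.
        errs = [name for name, check in error_checks.items() if check(d, draft)]
        if errs:
            # 2) Revise: targeted fix by the compact editor model.
            draft = editor(d, draft, errs)
        refined.append({"input": d, "summary": draft})
    return refined                                      # 3) SFT data for the student
```

Fine-tuning the student (Llama 3.1 8B / QWen3 4B) on `refined` is the final step; the revision is "targeted" because the editor sees the specific error categories, not just the raw draft.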

Result: Smaller student models achieved superior summarization performance compared to GPT-3.5, with improved cost efficiency and data privacy while maintaining competitive accuracy.

Conclusion: ARF provides a generalizable framework for enhancing open-source LLMs across diverse downstream applications, demonstrating that smaller models can surpass larger proprietary ones through targeted training on refined data.

Abstract: We introduce an Analyze-Revise-Finetune (ARF) pipeline that enables smaller open-source language models (LLMs) to surpass substantially larger proprietary models in customer service summarization tasks. The pipeline first analyzes and categorizes common errors in summaries produced by a teacher model (GPT-3.5), then performs a targeted revision using a compact editor model (Llama 3.1 70B) to generate high-quality, refined training data. Fine-tuning smaller student models (e.g., Llama 3.1 8B, QWen3 4B) on this refined data resulted in superior summarization performance compared to GPT-3.5. The ARF pipeline improves cost efficiency and data privacy while maintaining competitive accuracy, illustrating a generalizable framework for enhancing open-source LLMs across diverse downstream applications.

[58] GRDD+: An Extended Greek Dialectal Dataset with Cross-Architecture Fine-tuning Evaluation

Stergios Chatzikyriakidis, Dimitris Papadakis, Sevasti-Ioanna Papaioannou, Erofili Psaltaki

Main category: cs.CL

TL;DR: Extended Greek Dialectal Dataset (GRDD+) with 6.4M words covering 10 Greek varieties, used to fine-tune LLMs and compare with frontier models.

DetailsMotivation: To create a comprehensive dataset for Greek dialectal variation and study how high-quality dialectal data affects LLM performance across different varieties.

Method: Extended existing GRDD dataset with more data from 4 existing varieties and added 6 new Greek varieties. Fine-tuned three 8B parameter LLMs (Llama-3-8B, Llama-3.1-8B, Krikri-8B) and compared results with frontier models (Claude-3.7-Sonnet, Gemini-2.5, ChatGPT-5).

Result: Created GRDD+ dataset with 6,374,939 words covering 10 Greek varieties - the largest and most varied Greek dialectal dataset to date. Fine-tuning experiments showed the impact of dialectal data on LLM performance.

Conclusion: The GRDD+ dataset enables better study of Greek dialectal variation and demonstrates how dialect-specific data can improve LLM performance on regional language varieties.

Abstract: We present an extended Greek Dialectal Dataset (GRDD+) that complements the existing GRDD dataset with more data from Cretan, Cypriot, Pontic and Northern Greek, while we add six new varieties: Greco-Corsican, Griko (Southern Italian Greek), Maniot, Heptanesian, Tsakonian, and Katharevusa Greek. The result is a dataset with total size 6,374,939 words and 10 varieties. This is the first dataset with such variation and size to date. We conduct a number of fine-tuning experiments to see the effect of good quality dialectal data on a number of LLMs. We fine-tune three model architectures (Llama-3-8B, Llama-3.1-8B, Krikri-8B) and compare the results to frontier models (Claude-3.7-Sonnet, Gemini-2.5, ChatGPT-5).

[59] Modeling Clinical Uncertainty in Radiology Reports: from Explicit Uncertainty Markers to Implicit Reasoning Pathways

Paloma Rabaey, Jong Hak Moon, Jung-Oh Lee, Min Gwan Kim, Hangyul Yoon, Thomas Demeester, Edward Choi

Main category: cs.CL

TL;DR: A framework for quantifying explicit and implicit uncertainty in radiology reports to create structured, uncertainty-aware datasets for medical AI applications.

DetailsMotivation: Radiology reports contain valuable clinical information but are unstructured and contain uncertainty that needs quantification for automated analysis. Two types of uncertainty exist: explicit (hedging phrases) and implicit (omitted reasoning).

Method: Two-part framework: 1) Quantify explicit uncertainty using LLM-based reference ranking of hedging phrases mapped to probability values, 2) Model implicit uncertainty through expansion framework adding characteristic sub-findings from expert-defined diagnostic pathways for 14 common diagnoses.
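The explicit-uncertainty part of the framework maps hedging phrases onto probabilities via a reference ranking. A toy sketch of the lookup step; the phrases and values below are hypothetical stand-ins for the paper's expert-validated, LLM-ranked reference:

```python
# Hypothetical hedge-phrase -> probability reference (illustrative
# values; the paper builds this via expert-validated LLM ranking).
HEDGE_PROB = {
    "definite": 0.95, "likely": 0.75, "possible": 0.5,
    "cannot exclude": 0.3, "unlikely": 0.1,
}

def finding_probability(finding_text, default=0.9):
    """Assign an explicit-uncertainty probability to a finding by the
    strongest (lowest-probability) hedge it contains; unhedged
    findings default to near-certain presence."""
    text = finding_text.lower()
    hits = [p for phrase, p in HEDGE_PROB.items() if phrase in text]
    return min(hits) if hits else default

p = finding_probability("possible right lower lobe pneumonia")  # -> 0.5
```

The implicit-uncertainty side (expanding omitted sub-findings from diagnostic pathways) operates on top of findings scored this way.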

Result: Created Lunguage++, an expanded, uncertainty-aware version of the Lunguage benchmark of fine-grained structured radiology reports, enabling uncertainty-aware image classification and diagnostic reasoning.

Conclusion: The framework successfully addresses uncertainty in radiology reports, creating enriched resources for clinical AI applications and enabling new investigations into diagnostic uncertainty.

Abstract: Radiology reports are invaluable for clinical decision-making and hold great potential for automated analysis when structured into machine-readable formats. These reports often contain uncertainty, which we categorize into two distinct types: (i) Explicit uncertainty reflects doubt about the presence or absence of findings, conveyed through hedging phrases. These vary in meaning depending on the context, making rule-based systems insufficient to quantify the level of uncertainty for specific findings; (ii) Implicit uncertainty arises when radiologists omit parts of their reasoning, recording only key findings or diagnoses. Here, it is often unclear whether omitted findings are truly absent or simply unmentioned for brevity. We address these challenges with a two-part framework. We quantify explicit uncertainty by creating an expert-validated, LLM-based reference ranking of common hedging phrases, and mapping each finding to a probability value based on this reference. In addition, we model implicit uncertainty through an expansion framework that systematically adds characteristic sub-findings derived from expert-defined diagnostic pathways for 14 common diagnoses. Using these methods, we release Lunguage++, an expanded, uncertainty-aware version of the Lunguage benchmark of fine-grained structured radiology reports. This enriched resource enables uncertainty-aware image classification, faithful diagnostic reasoning, and new investigations into the clinical impact of diagnostic uncertainty.

[60] Steering Language Models with Weight Arithmetic

Constanza Fierro, Fabien Roger

Main category: cs.CL

TL;DR: Contrastive weight steering edits LLM parameters with weight arithmetic: it isolates a behavior direction in weight-space from the difference between two fine-tuning deltas, then adds or removes that direction to control model behavior.

DetailsMotivation: Providing high-quality feedback to LLMs across diverse distributions is difficult and expensive, while narrow training data can cause unintended generalizations. Need better methods to leverage limited training data for behavioral control.

Method: Contrastive weight steering: subtract weight deltas from two small fine-tunes (one inducing desired behavior, another inducing opposite) to isolate behavior direction in weight-space, then add/remove this direction to modify model weights.
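The weight arithmetic itself is simple: with deltas Δ⁺ = θ⁺ − θ_base and Δ⁻ = θ⁻ − θ_base from the two fine-tunes, the behavior direction is Δ⁺ − Δ⁻, scaled and added to the base weights. A toy sketch on plain per-parameter lists standing in for tensors:

```python
def steer(base, pos, neg, alpha=1.0):
    """Contrastive weight steering sketch. base/pos/neg map parameter
    names to (toy) weight lists for the base model and the two small
    fine-tunes; direction = (pos - base) - (neg - base)."""
    steered = {}
    for name, w in base.items():
        delta_pos = [p - b for p, b in zip(pos[name], w)]
        delta_neg = [n - b for n, b in zip(neg[name], w)]
        direction = [dp - dn for dp, dn in zip(delta_pos, delta_neg)]
        # Add (alpha > 0) or remove (alpha < 0) the behavior direction.
        steered[name] = [b + alpha * d for b, d in zip(w, direction)]
    return steered
```

Subtracting the two deltas cancels components shared by both fine-tunes (e.g. format drift), leaving the contrast between the desired behavior and its opposite.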

Result: Weight steering generalizes better than activation steering, achieving stronger out-of-distribution behavioral control before degrading capabilities. Can mitigate sycophancy and under-refusals from fine-tuning while preserving task performance. Preliminary evidence suggests emergent misalignment can be detected by measuring similarity between fine-tuning updates and “evil” weight directions.

Conclusion: Weight steering is an effective post-training method for behavioral control that generalizes well and can monitor for misalignment during training, offering practical advantages over activation-based approaches.

Abstract: Providing high-quality feedback to Large Language Models (LLMs) on a diverse training distribution can be difficult and expensive, and providing feedback only on a narrow distribution can result in unintended generalizations. To better leverage narrow training data, we propose contrastive weight steering, a simple post-training method that edits the model parameters using weight arithmetic. We isolate a behavior direction in weight-space by subtracting the weight deltas from two small fine-tunes – one that induces the desired behavior and another that induces its opposite – and then add or remove this direction to modify the model’s weights. We apply this technique to mitigate sycophancy and induce misalignment, and find that weight steering often generalizes further than activation steering, achieving stronger out-of-distribution behavioral control before degrading general capabilities. We also show that, in the context of task-specific fine-tuning, weight steering can partially mitigate undesired behavioral drift: it can reduce sycophancy and under-refusals introduced during fine-tuning while preserving task performance gains. Finally, we provide preliminary evidence that emergent misalignment can be detected by measuring the similarity between fine-tuning updates and an “evil” weight direction, suggesting that it may be possible to monitor the evolution of weights during training and detect rare misaligned behaviors that never manifest during training or evaluations.

[61] Moral Susceptibility and Robustness under Persona Role-Play in Large Language Models

Davi Bastos Costa, Felippe Alves, Renato Vicente

Main category: cs.CL

TL;DR: LLMs show varying moral judgment patterns when role-playing personas, with Claude models being most robust to moral shifts while larger models are more susceptible to moral influence.

DetailsMotivation: As LLMs increasingly operate in social contexts, understanding how they express and shift moral judgments when assuming different personas is crucial for responsible AI development and deployment.

Method: Used Moral Foundations Questionnaire (MFQ) to create a benchmark measuring moral susceptibility (variability across personas) and moral robustness (variability within personas). Tested various LLMs with persona role-play prompts.
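The two benchmark quantities are both variability statistics over MFQ scores: susceptibility measures spread across personas, robustness measures (inverse) spread within a persona. A toy computation under that reading; the paper's exact normalization may differ:

```python
from statistics import mean, pstdev

def susceptibility_and_robustness(scores):
    """scores: {persona: [MFQ scores over repeated runs]}.
    Susceptibility = spread of per-persona mean scores across personas;
    robustness = negated mean within-persona spread (higher = more
    stable). Illustrative definitions, not the paper's exact ones."""
    persona_means = [mean(runs) for runs in scores.values()]
    susceptibility = pstdev(persona_means)
    within = mean(pstdev(runs) for runs in scores.values())
    robustness = -within
    return susceptibility, robustness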

Result: Claude family is most morally robust, followed by Gemini and GPT-4. Larger models within families are more morally susceptible. Moral robustness and susceptibility are positively correlated, especially at family level.

Conclusion: Persona conditioning significantly shapes moral behavior in LLMs, with systematic differences across model families and sizes, providing insights for understanding AI moral reasoning.

Abstract: Large language models (LLMs) increasingly operate in social contexts, motivating analysis of how they express and shift moral judgments. In this work, we investigate the moral response of LLMs to persona role-play, prompting a LLM to assume a specific character. Using the Moral Foundations Questionnaire (MFQ), we introduce a benchmark that quantifies two properties: moral susceptibility and moral robustness, defined from the variability of MFQ scores across and within personas, respectively. We find that, for moral robustness, model family accounts for most of the variance, while model size shows no systematic effect. The Claude family is, by a significant margin, the most robust, followed by Gemini and GPT-4 models, with other families exhibiting lower robustness. In contrast, moral susceptibility exhibits a mild family effect but a clear within-family size effect, with larger variants being more susceptible. Moreover, robustness and susceptibility are positively correlated, an association that is more pronounced at the family level. Additionally, we present moral foundation profiles for models without persona role-play and for personas averaged across models. Together, these analyses provide a systematic view of how persona conditioning shapes moral behavior in LLMs.

[62] Interpreting Transformers Through Attention Head Intervention

Mason Kadem, Rong Zheng

Main category: cs.CL

TL;DR: Attention head intervention has emerged as a key method for causal interpretability of transformers, enabling direct intervention to validate mechanistic hypotheses and control model behavior.

DetailsMotivation: The paper aims to address the lack of understanding of neural mechanisms in increasingly capable neural networks. Mechanistic interpretability is crucial for accountability in high-stakes domains, studying digital cognition, and discovering new knowledge when AI systems outperform humans.

Method: The paper traces the evolution of attention head intervention as a key method for causal interpretability of transformers. It represents a paradigm shift from visualization and correlation-based observation to direct intervention that causally validates mechanistic hypotheses.
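The core move in head intervention is causal: ablate one head's output and compare behavior with and without it. A toy sketch, with plain lists standing in for tensors (real implementations register a forward hook on the attention module):

```python
def ablate_head(head_outputs, head_idx):
    """Causal intervention sketch: zero one head's output before the
    per-head outputs are concatenated and projected. head_outputs is
    a toy list of per-head vectors."""
    return [
        [0.0] * len(h) if i == head_idx else list(h)
        for i, h in enumerate(head_outputs)
    ]

# If ablating head 1 removes a behavior (e.g. a toxic completion),
# that head is causally implicated in producing it.
patched = ablate_head([[0.3, 0.7], [0.9, 0.1]], head_idx=1)  # -> [[0.3, 0.7], [0.0, 0.0]]
```

This is the shift the paper describes: from visualizing attention patterns (correlation) to intervening on them (causation).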

Result: Attention head intervention studies revealed robust empirical findings while highlighting limitations that complicate interpretation. Recent work demonstrates that mechanistic understanding enables targeted control of model behavior, successfully suppressing toxic outputs and manipulating semantic content through selective attention head intervention.

Conclusion: Mechanistic interpretability research, particularly through attention head intervention, has practical utility for AI safety by enabling targeted control of model behavior and validating causal understanding of transformer mechanisms.

Abstract: Neural networks are growing more capable on their own, but we do not understand their neural mechanisms. Understanding these mechanisms’ decision-making processes, or mechanistic interpretability, enables (1) accountability and control in high-stakes domains, (2) the study of digital brains and the emergence of cognition, and (3) discovery of new knowledge when AI systems outperform humans. This paper traces how attention head intervention emerged as a key method for causal interpretability of transformers. The evolution from visualization to intervention represents a paradigm shift from observing correlations to causally validating mechanistic hypotheses through direct intervention. Head intervention studies revealed robust empirical findings while also highlighting limitations that complicate interpretation. Recent work demonstrates that mechanistic understanding now enables targeted control of model behaviour, successfully suppressing toxic outputs and manipulating semantic content through selective attention head intervention, validating the practical utility of interpretability research for AI safety.

[63] The Growing Gains and Pains of Iterative Web Corpora Crawling: Insights from South Slavic CLASSLA-web 2.0 Corpora

Taja Kuzman Pungeršek, Peter Rupnik, Vít Suchomel, Nikola Ljubešić

Main category: cs.CL

TL;DR: CLASSLA-web 2.0: A substantially larger web corpus collection for South Slavic languages (17B words, 38M texts) built through continuous crawling of national top-level domains, with new topic annotation but showing degradation from machine-generated content.

DetailsMotivation: To build on the success of previous national top-level domain crawling for South Slavic languages by establishing continuous crawling infrastructure and creating larger, more comprehensive web corpora with additional metadata.

Method: Established continuous crawling infrastructure for iterative national top-level domain crawling across South Slavic and related webs, automatically annotating texts with both genre categories and topic labels.

Result: Created CLASSLA-web 2.0 with 17.0 billion words in 38.1 million texts across seven languages (Bosnian, Bulgarian, Croatian, Macedonian, Montenegrin, Serbian, Slovenian), with only 20% overlap with previous version, but showing degradation from machine-generated sites.

Conclusion: Continuous crawling yields substantially new content but reveals growing pains with web quality degradation due to machine-generated sites, highlighting the need for quality filtering in web corpus construction.

Abstract: Crawling national top-level domains has proven to be highly effective for collecting texts in less-resourced languages. This approach has been recently used for South Slavic languages and resulted in the largest general corpora for this language group: the CLASSLA-web 1.0 corpora. Building on this success, we established a continuous crawling infrastructure for iterative national top-level domain crawling across South Slavic and related webs. We present the first outcome of this crawling infrastructure - the CLASSLA-web 2.0 corpus collection, with substantially larger web corpora containing 17.0 billion words in 38.1 million texts in seven languages: Bosnian, Bulgarian, Croatian, Macedonian, Montenegrin, Serbian, and Slovenian. In addition to genre categories, the new version is also automatically annotated with topic labels. Comparing CLASSLA-web 2.0 with its predecessor reveals that only one-fifth of the texts overlap, showing that re-crawling after just two years yields largely new content. However, while the new web crawls bring growing gains, we also notice growing pains - a manual inspection of top domains reveals a visible degradation of web content, as machine-generated sites now contribute a significant portion of texts.

[64] Intention-Adaptive LLM Fine-Tuning for Text Revision Generation

Zhexiong Liu, Diane Litman

Main category: cs.CL

TL;DR: Paper ID 2602.00477 could not be fetched due to HTTP 429 error (rate limiting), so no abstract content is available for analysis.

DetailsMotivation: Unable to determine motivation due to missing abstract content.

Method: Unable to determine method due to missing abstract content.

Result: Unable to determine results due to missing abstract content.

Conclusion: Unable to draw conclusions due to missing abstract content.

Abstract: Failed to fetch summary for 2602.00477: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.00477&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[65] COMI: Coarse-to-fine Context Compression via Marginal Information Gain

Jiwei Tang, Shilei Liu, Zhicheng Zhang, Yujin Yuan, Libin Zheng, Wenbo Su, Bo Zheng

Main category: cs.CL

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

DetailsMotivation: Cannot determine motivation due to API rate limiting preventing access to paper content

Method: Cannot determine method due to API rate limiting preventing access to paper content

Result: Cannot determine results due to API rate limiting preventing access to paper content

Conclusion: Cannot determine conclusion due to API rate limiting preventing access to paper content

Abstract: Failed to fetch summary for 2602.01719: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.01719&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[66] Read As Human: Compressing Context via Parallelizable Close Reading and Skimming

Jiwei Tang, Shilei Liu, Zhicheng Zhang, Qingsong Lv, Runsong Zhao, Tingwei Lu, Langming Liu, Haibin Chen, Yujin Yuan, Hai-Tao Zheng, Wenbo Su, Bo Zheng

Main category: cs.CL

TL;DR: RAM is a context compression framework for LLMs that uses adaptive hybrid reading (close reading important segments, skimming less relevant ones) to improve efficiency in long-context scenarios.

DetailsMotivation: LLMs struggle with long-context scenarios due to computational inefficiency and redundant information. Current approaches need better methods to handle lengthy inputs while maintaining performance.

Method: RAM partitions context into segments, encodes them in parallel with queries, fully retains high-relevance segments (close reading), compresses low-relevance segments into summary vectors (skimming), and uses contrastive learning to refine relevance decisions.
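RAM's hybrid reading can be sketched as a relevance-gated split: top-scoring segments are kept verbatim, the rest are collapsed into summary vectors. The sketch below uses toy list vectors and a precomputed relevance list; the paper learns both the encoder and the close-reading/skimming boundary (via contrastive training):

```python
def ram_compress(segments, relevance, k=1):
    """Keep the top-k most query-relevant segments verbatim ("close
    reading") and collapse the rest into one mean summary vector
    ("skimming"). segments: [{"text": ..., "vec": [...]}, ...]."""
    order = sorted(range(len(segments)), key=lambda i: relevance[i], reverse=True)
    keep = set(order[:k])
    retained = [segments[i]["text"] for i in sorted(keep)]
    skimmed = [segments[i]["vec"] for i in range(len(segments)) if i not in keep]
    if skimmed:
        dim = len(skimmed[0])
        summary = [sum(v[d] for v in skimmed) / len(skimmed) for d in range(dim)]
    else:
        summary = []
    return retained, summary
```

The decoder then consumes the retained text and the compact summary together, which is where both the speedup and the interpretability come from.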

Result: Outperforms existing baselines on multiple QA and summarization benchmarks across two backbones, achieving up to 12x end-to-end speedup on long inputs (avg 16K, max 32K length).

Conclusion: RAM effectively addresses long-context challenges through human-inspired adaptive reading strategies, balancing performance and efficiency while maintaining interpretability.

Abstract: Large Language Models (LLMs) demonstrate exceptional capability across diverse tasks. However, their deployment in long-context scenarios is hindered by two challenges: computational inefficiency and redundant information. We propose RAM (Read As HuMan), a context compression framework that adopts an adaptive hybrid reading strategy, to address these challenges. Inspired by human reading behavior (i.e., close reading important content while skimming less relevant content), RAM partitions the context into segments and encodes them with the input query in parallel. High-relevance segments are fully retained (close reading), while low-relevance ones are query-guided compressed into compact summary vectors (skimming). Both explicit textual segments and implicit summary vectors are concatenated and fed into decoder to achieve both superior performance and natural language format interpretability. To refine the decision boundary between close reading and skimming, we further introduce a contrastive learning objective based on positive and negative query-segment pairs. Experiments demonstrate that RAM outperforms existing baselines on multiple question answering and summarization benchmarks across two backbones, while delivering up to a 12x end-to-end speedup on long inputs (average length 16K; maximum length 32K).

[67] LEC-KG: An LLM-Embedding Collaborative Framework for Domain-Specific Knowledge Graph Construction – A Case Study on SDGs

Yikai Zeng, Yingchao Piao, Changhua Pei, Jianhui Li

Main category: cs.CL

TL;DR: LEC-KG is a bidirectional collaborative framework that integrates LLMs’ semantic understanding with KGE’s structural reasoning to construct domain-specific knowledge graphs from unstructured text, addressing challenges like heterogeneous entities and long-tail relations.

DetailsMotivation: Domain-specific knowledge graph construction from unstructured text faces challenges including heterogeneous entity mentions, long-tail relation distributions, and lack of standardized schemas. Existing methods struggle with these issues, particularly for low-frequency relations and unseen entities.

Method: Three key components: (1) hierarchical coarse-to-fine relation extraction to mitigate long-tail bias, (2) evidence-guided Chain-of-Thought feedback that grounds structural suggestions in source text, and (3) semantic initialization enabling structural validation for unseen entities. LLMs and KGE modules enhance each other iteratively.

Result: Evaluated on Chinese Sustainable Development Goal (SDG) reports, demonstrating substantial improvements over LLM baselines, particularly on low-frequency relations. The framework reliably transforms unstructured policy text into validated knowledge graph triples through iterative refinement.

Conclusion: LEC-KG successfully integrates semantic and structural reasoning to address domain-specific KG construction challenges, showing particular strength in handling long-tail relations and improving extraction quality through bidirectional collaboration between LLMs and KGE.

Abstract: Constructing domain-specific knowledge graphs from unstructured text remains challenging due to heterogeneous entity mentions, long-tail relation distributions, and the absence of standardized schemas. We present LEC-KG, a bidirectional collaborative framework that integrates the semantic understanding of Large Language Models (LLMs) with the structural reasoning of Knowledge Graph Embeddings (KGE). Our approach features three key components: (1) hierarchical coarse-to-fine relation extraction that mitigates long-tail bias, (2) evidence-guided Chain-of-Thought feedback that grounds structural suggestions in source text, and (3) semantic initialization that enables structural validation for unseen entities. The two modules enhance each other iteratively-KGE provides structure-aware feedback to refine LLM extractions, while validated triples progressively improve KGE representations. We evaluate LEC-KG on Chinese Sustainable Development Goal (SDG) reports, demonstrating substantial improvements over LLM baselines, particularly on low-frequency relations. Through iterative refinement, our framework reliably transforms unstructured policy text into validated knowledge graph triples.

[68] Beyond Accuracy: Risk-Sensitive Evaluation of Hallucinated Medical Advice

Savan Doshi

Main category: cs.CL

TL;DR: Proposes a risk-sensitive evaluation framework for medical LLM hallucinations that quantifies potential harm rather than just factual correctness, focusing on risk-bearing language like treatment directives and contraindications.

DetailsMotivation: Existing hallucination standards focus primarily on factual correctness, treating all errors as equally severe, which obscures clinically relevant failure modes when models generate unsupported but actionable medical language that could cause harm if acted upon.

Method: Develops a risk-sensitive evaluation framework that quantifies hallucinations through the presence of risk-bearing language (treatment directives, contraindications, urgency cues, high-risk medications). Risk scoring is combined with a relevance measure to identify high-risk, low-grounding failures, and the framework is applied to three instruction-tuned LLMs using controlled patient-facing safety stress-test prompts.
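A minimal sketch of risk-bearing-language scoring under the paper's four categories. The cue phrases and weights below are hypothetical; the point is that the score reflects potential harm if acted upon, not factual correctness:

```python
# Hypothetical cue lexicon and weights for the four risk categories.
RISK_CATEGORIES = {
    "treatment_directive":  (["you should take", "start taking"], 3),
    "contraindication":     (["do not take", "avoid combining"], 3),
    "urgency":              (["immediately", "emergency"], 2),
    "high_risk_medication": (["warfarin", "insulin"], 2),
}

def risk_score(answer):
    """Sum category weights for every risk-bearing cue present."""
    text = answer.lower()
    return sum(
        w for cues, w in RISK_CATEGORIES.values()
        if any(c in text for c in cues)
    )

def high_risk_low_grounding(answer, grounding, risk_thresh=3, ground_thresh=0.5):
    """Flag the failure mode the paper targets: risky language combined
    with weak grounding/relevance (grounding in [0, 1], assumed given)."""
    return risk_score(answer) >= risk_thresh and grounding < ground_thresh
```

Two answers with identical factual accuracy can differ sharply under this score, which is exactly the distinction the paper argues standard metrics miss.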

Result: Models with similar surface-level behavior exhibit substantially different risk profiles, and standard evaluation metrics fail to capture these distinctions, showing the importance of risk-sensitive evaluation.

Conclusion: Risk sensitivity must be incorporated into hallucination evaluation for medical LLMs, and evaluation validity depends critically on task and prompt design to properly assess potential harm.

Abstract: Large language models are increasingly being used in patient-facing medical question answering, where hallucinated outputs can vary widely in potential harm. However, existing hallucination standards and evaluation metrics focus primarily on factual correctness, treating all errors as equally severe. This obscures clinically relevant failure modes, particularly when models generate unsupported but actionable medical language. We propose a risk-sensitive evaluation framework that quantifies hallucinations through the presence of risk-bearing language, including treatment directives, contraindications, urgency cues, and mentions of high-risk medications. Rather than assessing clinical correctness, our approach evaluates the potential impact of hallucinated content if acted upon. We further combine risk scoring with a relevance measure to identify high-risk, low-grounding failures. We apply this framework to three instruction-tuned language models using controlled patient-facing prompts designed as safety stress tests. Our results show that models with similar surface-level behavior exhibit substantially different risk profiles and that standard evaluation metrics fail to capture these distinctions. These findings highlight the importance of incorporating risk sensitivity into hallucination evaluation and suggest that evaluation validity is critically dependent on task and prompt design.

[69] ViMultiChoice: Toward a Method That Gives Explanation for Multiple-Choice Reading Comprehension in Vietnamese

Trung Tien Cao, Lam Minh Thai, Nghia Hieu Nguyen, Duc-Vu Nguyen, Ngan Luu-Thuy Nguyen

Main category: cs.CL

TL;DR: A new Vietnamese multiple-choice reading comprehension dataset and method (ViMultiChoice) that jointly predicts answers and generates explanations, achieving SotA performance.

DetailsMotivation: Existing MCRC models lack explanation capabilities, and there's a need for Vietnamese-specific reading comprehension datasets and methods that can provide reasoning behind answer choices.

Method: Introduces ViMultiChoice, a novel method specifically designed for Vietnamese reading comprehension that jointly trains option decision and explanation generation tasks.
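Joint training of the two tasks amounts to optimizing a combined objective over answer selection and explanation generation. A minimal sketch under the common weighted-sum assumption (the weighting scheme is our assumption, not the paper's stated recipe):

```python
def joint_loss(answer_loss, explanation_loss, lam=0.5):
    """Joint objective sketch: option-decision loss plus a weighted
    explanation-generation loss. lam trades off the two tasks and is
    a hypothetical hyperparameter."""
    return answer_loss + lam * explanation_loss
```

The paper's finding is that optimizing both terms together improves multiple-choice accuracy over training the answer head alone.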

Result: ViMultiChoice outperforms existing MCRC baselines, achieving state-of-the-art performance on both ViMMRC 2.0 benchmark and the newly introduced dataset, with joint training significantly improving multiple-choice accuracy.

Conclusion: The proposed approach successfully addresses the explanation gap in Vietnamese MCRC models, demonstrating that joint training of answer prediction and explanation generation improves overall performance.

Abstract: Multiple-choice Reading Comprehension (MCRC) models aim to select the correct answer from a set of candidate options for a given question. However, they typically lack the ability to explain the reasoning behind their choices. In this paper, we introduce a novel Vietnamese dataset designed to train and evaluate MCRC models with explanation generation capabilities. Furthermore, we propose ViMultiChoice, a new method specifically designed for modeling Vietnamese reading comprehension that jointly predicts the correct answer and generates a corresponding explanation. Experimental results demonstrate that ViMultiChoice outperforms existing MCRC baselines, achieving state-of-the-art (SotA) performance on both the ViMMRC 2.0 benchmark and the newly introduced dataset. Additionally, we show that jointly training option decision and explanation generation leads to significant improvements in multiple-choice accuracy.

[70] HLE-Verified: A Systematic Verification and Structured Revision of Humanity’s Last Exam

Weiqi Zhai, Zhihai Wang, Jinghang Wang, Boyu Yang, Xiaogang Li, Xander Xu, Bohan Wang, Peng Wang, Xingzhe Wu, Anfeng Li, Qiyuan Feng, Yuhao Zhou, Shoulin Han, Wenjie Luo, Yiyuan Li, Yaxuan Wang, Ruixian Luo, Guojie Lin, Peiyao Xiao, Chengliang Xu, Ben Wang, Zeyu Wang, Zichao Chen, Jianan Ye, Yijie Hu, Jialong Chen, Zongwen Shen, Yuliang Xu, An Yang, Bowen Yu, Dayiheng Liu, Junyang Lin, Hu Wei, Que Shen, Bing Zhao

Main category: cs.CL

TL;DR: HLE-Verified is a cleaned and verified version of the Humanity’s Last Exam benchmark, addressing annotation noise through expert validation and repair protocols to enable more accurate model evaluation.

DetailsMotivation: The original HLE benchmark contains noisy items that bias evaluation results and distort cross-model comparisons, necessitating a verified version for more reliable assessment of language model capabilities.

Method: Two-stage validation-and-repair workflow: Stage I involves binary validation of problems and answers through domain-expert review and model-based cross-checks; Stage II revises flawed but fixable items through dual independent expert repairs, model-assisted auditing, and final adjudication.

Result: Created HLE-Verified with 668 verified items, 1,143 revised-and-certified items, and 689 uncertain items; evaluation of 8 state-of-the-art models shows 7-10 percentage point accuracy gains on HLE-Verified, with 30-40 point gains on previously erroneous items.

Conclusion: HLE-Verified improves HLE-style evaluations by reducing annotation noise and enabling more faithful measurement of model capabilities, with strong association between model confidence and error presence supporting revision effectiveness.

Abstract: Humanity’s Last Exam (HLE) has become a widely used benchmark for evaluating frontier large language models on challenging, multi-domain questions. However, community-led analyses have raised concerns that HLE contains a non-trivial number of noisy items, which can bias evaluation results and distort cross-model comparisons. To address this challenge, we introduce HLE-Verified, a verified and revised version of HLE with a transparent verification protocol and fine-grained error taxonomy. Our construction follows a two-stage validation-and-repair workflow resulting in a certified benchmark. In Stage I, each item undergoes binary validation of the problem and final answer through domain-expert review and model-based cross-checks, yielding 668 verified items. In Stage II, flawed but fixable items are revised under strict constraints preserving the original evaluation intent, through dual independent expert repairs, model-assisted auditing, and final adjudication, resulting in 1,143 revised-and-certified items. The remaining 689 items are released as a documented uncertain set with explicit uncertainty sources and expertise tags for future refinement. We evaluate eight state-of-the-art language models on HLE and HLE-Verified, observing an average absolute accuracy gain of 7–10 percentage points on HLE-Verified. The improvement is particularly pronounced on items where the original problem statement and/or reference answer is erroneous, with gains of 30–40 percentage points. Our analyses further reveal a strong association between model confidence and the presence of errors in the problem statement or reference answer, supporting the effectiveness of our revisions. Overall, HLE-Verified improves HLE-style evaluations by reducing annotation noise and enabling more faithful measurement of model capabilities. Data is available at: https://huggingface.co/datasets/skylenage/HLE-Verified

[71] Open Rubric System: Scaling Reinforcement Learning with Pairwise Adaptive Rubric

Ruipeng Jia, Yunyi Yang, Yuxin Wu, Yongbo Gai, Siyuan Tao, Mengyu Zhou, Jianhe Lin, Xiaoxi Jiang, Guanjun Jiang

Main category: cs.CL

TL;DR: OpenRS is a rubrics-based LLM-as-a-Judge framework that replaces scalar reward models with explicit reasoning processes using adaptive rubrics and meta-rubrics for robust alignment in non-verifiable tasks.

DetailsMotivation: Scalar reward models create information bottlenecks leading to brittleness and reward hacking in open-ended alignment. The paper argues that robust alignment requires explicit reasoning processes under inspectable principles rather than learned functions internalized into judges.

Method: OpenRS uses Pairwise Adaptive Meta-Rubrics (PAMR) and Pointwise Verifiable Rubrics (PVRs) with an explicit meta-rubric specification. It instantiates adaptive rubrics by conditioning on semantic differences between candidate responses, performs criterion-wise pairwise comparisons, and aggregates preferences externally. Includes two-level meta-rubric refinement pipeline and integrates as reward supervision in pairwise RL training.

Result: The framework avoids pointwise weighted scalarization while improving discriminability in open-ended settings. It provides both hard-constraint guardrails and verifiable reward components when ground-truth or programmatic checks are available.

Conclusion: OpenRS offers a principled approach to alignment by making reward reasoning explicit and inspectable, addressing limitations of scalar reward models through rubrics-based judgment with verifiable components and adaptive meta-rubrics.

Abstract: Scalar reward models compress multi-dimensional human preferences into a single opaque score, creating an information bottleneck that often leads to brittleness and reward hacking in open-ended alignment. We argue that robust alignment for non-verifiable tasks is fundamentally a principle generalization problem: reward should not be a learned function internalized into a judge, but an explicit reasoning process executed under inspectable principles. To operationalize this view, we present the Open Rubric System (OpenRS), a plug-and-play, rubrics-based LLM-as-a-Judge framework built around Pairwise Adaptive Meta-Rubrics (PAMR) and lightweight Pointwise Verifiable Rubrics (PVRs), which provide both hard-constraint guardrails and verifiable reward components when ground-truth or programmatic checks are available. OpenRS uses an explicit meta-rubric – a constitution-like specification that governs how rubrics are instantiated, weighted, and enforced – and instantiates adaptive rubrics on the fly by conditioning on the semantic differences between two candidate responses. It then performs criterion-wise pairwise comparisons and aggregates criterion-level preferences externally, avoiding pointwise weighted scalarization while improving discriminability in open-ended settings. To keep principles consistent yet editable across various domains, we introduce a two-level meta-rubric refinement pipeline (automated evolutionary refinement for general principles and a reproducible human-in-the-loop procedure for domain principles), complemented with pointwise verifiable rubrics that act as both guardrails against degenerate behaviors and a source of verifiable reward for objective sub-tasks. Finally, we instantiate OpenRS as reward supervision in pairwise RL training.

[72] Stop Unnecessary Reflection: Training LRMs for Efficient Reasoning with Adaptive Reflection and Length Coordinated Penalty

Zewei Yu, Lirong Gao, Yuke Zhu, Bo Zheng, Junbo Zhao, Sheng Guo, Haobo Wang

Main category: cs.CL

TL;DR: Unable to fetch the paper summary due to an HTTP 429 error (rate limiting). Under arXiv's YYMM numbering scheme, identifier 2602.12113 dates the paper to February 2026.

DetailsMotivation: Cannot determine motivation without access to the paper content.

Method: Cannot determine method without access to the paper content.

Result: Cannot determine results without access to the paper content.

Conclusion: Cannot determine conclusion without access to the paper content.

Abstract: Failed to fetch summary for 2602.12113: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.12113&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[73] Modeling Distinct Human Interaction in Web Agents

Faria Huq, Zora Zhiruo Wang, Zhanqiu Guo, Venu Arvind Arangarajan, Tianyue Ou, Frank Xu, Shuyan Zhou, Graham Neubig, Jeffrey P. Bigham

Main category: cs.CL

TL;DR: Paper introduces modeling of human intervention in web agents, collects CowCorpus dataset of 400 web navigation trajectories, identifies 4 interaction patterns, trains LMs to predict interventions, and shows 26.5% increase in agent usefulness.

DetailsMotivation: Current autonomous web agents lack principled understanding of when and why humans intervene, often proceeding autonomously past critical decision points or requesting unnecessary confirmation. Need for better human-agent collaboration in web tasks.

Method: 1) Collected CowCorpus dataset of 400 real-user web navigation trajectories with 4,200+ interleaved human/agent actions. 2) Identified four interaction patterns: hands-off supervision, hands-on oversight, collaborative task-solving, full user takeover. 3) Trained language models to anticipate user interventions based on interaction styles.

Result: Models achieved 61.4-63.4% improvement in intervention prediction accuracy over base LMs. When deployed in live web navigation agents, user study showed 26.5% increase in user-rated agent usefulness.

Conclusion: Structured modeling of human intervention leads to more adaptive, collaborative web agents that better understand when and why users intervene during task execution.

Abstract: Despite rapid progress in autonomous web agents, human involvement remains essential for shaping preferences and correcting agent behavior as tasks unfold. However, current agentic systems lack a principled understanding of when and why humans intervene, often proceeding autonomously past critical decision points or requesting unnecessary confirmation. In this work, we introduce the task of modeling human intervention to support collaborative web task execution. We collect CowCorpus, a dataset of 400 real-user web navigation trajectories containing over 4,200 interleaved human and agent actions. We identify four distinct patterns of user interaction with agents – hands-off supervision, hands-on oversight, collaborative task-solving, and full user takeover. Leveraging these insights, we train language models (LMs) to anticipate when users are likely to intervene based on their interaction styles, yielding a 61.4-63.4% improvement in intervention prediction accuracy over base LMs. Finally, we deploy these intervention-aware models in live web navigation agents and evaluate them in a user study, finding a 26.5% increase in user-rated agent usefulness. Together, our results show structured modeling of human intervention leads to more adaptive, collaborative agents.

[74] Janus-Q: End-to-End Event-Driven Trading via Hierarchical-Gated Reward Modeling

Xiang Li, Zikai Wei, Yiyan Qi, Wanyun Zhou, Xiang Liu, Penglei Sun, Jian Guo, Yongqi Zhang, Xiaowen Chu

Main category: cs.CL

TL;DR: Janus-Q is an event-driven trading framework that uses financial news events as primary decision units, combining event-centric data construction with decision-oriented fine-tuning to improve trading performance.

DetailsMotivation: Financial market movements are driven by discrete financial events in news, but existing approaches struggle with: 1) lack of large-scale event-centric datasets linking news semantics to market reactions, and 2) misalignment between language model reasoning and financially valid trading behavior under dynamic conditions.

Method: Two-stage approach: Stage I builds a large-scale financial news event dataset (62,400 articles) with fine-grained event types, stock associations, sentiment labels, and event-driven cumulative abnormal returns. Stage II uses decision-oriented fine-tuning combining supervised learning with reinforcement learning guided by a Hierarchical Gated Reward Model (HGRM) to capture trade-offs among multiple trading objectives.

Result: Janus-Q achieves more consistent, interpretable, and profitable trading decisions than market indices and LLM baselines, improving Sharpe Ratio by up to 102.0% and increasing direction accuracy by over 17.5% compared to strongest competing strategies.

Conclusion: The paper presents a successful framework that elevates financial news events from auxiliary signals to primary decision units, demonstrating superior trading performance through event-centric data construction and decision-oriented model optimization.

Abstract: Financial market movements are often driven by discrete financial events conveyed through news, whose impacts are heterogeneous, abrupt, and difficult to capture under purely numerical prediction objectives. These limitations have motivated growing interest in using textual information as the primary source of trading signals in learning-based systems. Two key challenges hinder existing approaches: (1) the absence of large-scale, event-centric datasets that jointly model news semantics and statistically grounded market reactions, and (2) the misalignment between language model reasoning and financially valid trading behavior under dynamic market conditions. To address these challenges, we propose Janus-Q, an end-to-end event-driven trading framework that elevates financial news events from auxiliary signals to primary decision units. Janus-Q unifies event-centric data construction and model optimization under a two-stage paradigm. Stage I focuses on event-centric data construction, building a large-scale financial news event dataset comprising 62,400 articles annotated with 10 fine-grained event types, associated stocks, sentiment labels, and event-driven cumulative abnormal return (CAR). Stage II performs decision-oriented fine-tuning, combining supervised learning with reinforcement learning guided by a Hierarchical Gated Reward Model (HGRM), which explicitly captures trade-offs among multiple trading objectives. Extensive experiments demonstrate that Janus-Q achieves more consistent, interpretable, and profitable trading decisions than market indices and LLM baselines, improving the Sharpe Ratio by up to 102.0% while increasing direction accuracy by over 17.5% compared to the strongest competing strategies.
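The event-driven cumulative abnormal return (CAR) used for the Stage I labels follows the standard event-study definition: abnormal return is actual minus expected return, summed over the event window. A minimal sketch using a simple market-adjusted expected return (the return values and the choice of benchmark model are illustrative, not the paper's):

```python
def cumulative_abnormal_return(stock_returns, market_returns):
    """Standard event-study CAR: sum of (actual - expected) returns
    over the event window. Expected return is proxied here by the
    market return (a simple market-adjusted model)."""
    assert len(stock_returns) == len(market_returns)
    return sum(r - m for r, m in zip(stock_returns, market_returns))

# Hypothetical 3-day event window around a news article
car = cumulative_abnormal_return([0.02, 0.01, -0.005],
                                 [0.005, 0.0, 0.001])
```

A positive CAR over the window indicates the stock outperformed its benchmark after the event, which is what makes it usable as a market-reaction label for a news item.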

[75] Bridging Latent Reasoning and Target-Language Generation via Retrieval-Transition Heads

Shaswat Patel, Vishvesh Trivedi, Yue Han, Yihuai Hong, Eunsol Choi

Main category: cs.CL

TL;DR: Analysis of attention heads in multilingual Transformers reveals Retrieval-Transition Heads (RTH) that govern language switching and are crucial for multilingual Chain-of-Thought reasoning.

DetailsMotivation: Previous work identified retrieval heads in Transformers that retrieve information from context, but their role in multilingual contexts and cross-lingual settings remains unexplored. Understanding how multilingual language models handle language transitions is important for improving their reasoning capabilities.

Method: Investigated retrieval heads in multilingual language models, identified shared retrieval heads across languages, discovered Retrieval-Transition Heads (RTH) that control transitions to target-language outputs, and conducted experiments across four multilingual benchmarks (MMLU-ProX, MGSM, MLQA, XQuaD) and two model families (Qwen-2.5, Llama-3.1) comparing effects of masking RTH vs. retrieval heads.

Result: Found that retrieval heads are often shared across multiple languages in multilingual models, identified distinct RTHs that are more vital for Chain-of-Thought reasoning than regular retrieval heads, and demonstrated that masking RTH induces bigger performance drops than masking retrieval heads across all benchmarks and model families.

Conclusion: The work advances understanding of multilingual language models by isolating attention heads responsible for mapping to target languages, revealing specialized RTHs that govern language transitions and are crucial for multilingual reasoning tasks.

Abstract: Recent work has identified a subset of attention heads in Transformers as retrieval heads, which are responsible for retrieving information from the context. In this work, we first investigate retrieval heads in multilingual contexts. In multilingual language models, we find that retrieval heads are often shared across multiple languages. Expanding the study to the cross-lingual setting, we identify Retrieval-Transition Heads (RTH), which govern the transition to specific target-language output. Our experiments reveal that RTHs are distinct from retrieval heads and more vital for Chain-of-Thought reasoning in multilingual LLMs. Across four multilingual benchmarks (MMLU-ProX, MGSM, MLQA, and XQuaD) and two model families (Qwen-2.5 and Llama-3.1), we demonstrate that masking RTHs induces a bigger performance drop than masking Retrieval Heads (RH). Our work advances understanding of multilingual LMs by isolating the attention heads responsible for mapping to target languages.
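Head masking of the kind used in these ablations amounts to zeroing an attention head's output at inference time and observing the performance drop. A toy NumPy sketch (the dimensions, weights, and head indices are illustrative; the paper works with Qwen-2.5 and Llama-3.1, not this toy model):

```python
import numpy as np

def multi_head_attention(x, Wq, Wk, Wv, n_heads, head_mask=None):
    """Toy multi-head self-attention. head_mask is a boolean array of
    shape (n_heads,); False means that head's output is zeroed (ablated)."""
    T, d = x.shape
    dh = d // n_heads
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    outs = []
    for h in range(n_heads):
        s = slice(h * dh, (h + 1) * dh)
        scores = q[:, s] @ k[:, s].T / np.sqrt(dh)
        attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
        attn /= attn.sum(axis=-1, keepdims=True)        # row-wise softmax
        out = attn @ v[:, s]
        if head_mask is not None and not head_mask[h]:
            out = np.zeros_like(out)                    # ablate this head
        outs.append(out)
    return np.concatenate(outs, axis=-1)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
Wq = rng.normal(size=(8, 8))
Wk = rng.normal(size=(8, 8))
Wv = rng.normal(size=(8, 8))
full = multi_head_attention(x, Wq, Wk, Wv, n_heads=2)
ablated = multi_head_attention(x, Wq, Wk, Wv, n_heads=2,
                               head_mask=np.array([True, False]))
```

Comparing task accuracy with and without such masks, head by head, is how ablation studies attribute a capability (here, language transition) to specific heads.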

[76] Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization

Qianben Chen, Tianrui Qin, King Zhu, Qiexiang Wang, Chengjun Yu, Shu Xu, Jiaqi Wu, Jiayu Zhang, Xinpeng Liu, Xin Gui, Jingyi Cao, Piaohong Wang, Dingfeng Shi, He Zhu, Tiannan Wang, Yuqing Wang, Maojia Song, Tianyu Zheng, Ge Zhang, Jian Yang, Jiaheng Liu, Minghao Liu, Yuchen Eleanor Jiang, Wangchunshu Zhou

Main category: cs.CL

TL;DR: SMTL framework improves research agent efficiency by replacing sequential reasoning with parallel evidence acquisition, achieving state-of-the-art performance while reducing reasoning steps by 70.7%.

DetailsMotivation: Current deep research agents suffer from high inference costs and latency due to scaling reasoning depth, and struggle with generalization across heterogeneous research settings.

Method: Proposes Search More, Think Less (SMTL) framework with parallel evidence acquisition for efficient context management, plus unified data synthesis pipeline for training across task types using supervised fine-tuning and reinforcement learning.

Result: Achieves strong performance across benchmarks: BrowseComp (48.6%), GAIA (75.7%), Xbench (82.0%), DeepResearch Bench (45.9%), while reducing reasoning steps by 70.7% compared to Mirothinker-v1.0.

Conclusion: SMTL framework successfully addresses efficiency and generalization challenges in long-horizon agentic search through parallel evidence acquisition and unified training.

Abstract: Recent deep research agents primarily improve performance by scaling reasoning depth, but this leads to high inference cost and latency in search-intensive scenarios. Moreover, generalization across heterogeneous research settings remains challenging. In this work, we propose “Search More, Think Less” (SMTL), a framework for long-horizon agentic search that targets both efficiency and generalization. SMTL replaces sequential reasoning with parallel evidence acquisition, enabling efficient context management under constrained context budgets. To support generalization across task types, we further introduce a unified data synthesis pipeline that constructs search tasks spanning both deterministic question answering and open-ended research scenarios with task-appropriate evaluation metrics. We train an end-to-end agent using supervised fine-tuning and reinforcement learning, achieving strong and often state-of-the-art performance across benchmarks including BrowseComp (48.6%), GAIA (75.7%), Xbench (82.0%), and DeepResearch Bench (45.9%). Compared to Mirothinker-v1.0, SMTL with maximum 100 interaction steps reduces the average number of reasoning steps on BrowseComp by 70.7%, while improving accuracy.
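The core efficiency idea, replacing one think-search-think round trip per query with a batch of concurrent searches, can be sketched with Python's standard thread pool (the `search` function is a hypothetical stand-in; the paper's actual tool interface is not shown here):

```python
from concurrent.futures import ThreadPoolExecutor

def search(query):
    """Hypothetical stand-in for a web-search tool call."""
    return f"evidence for: {query}"

def acquire_evidence_parallel(queries, max_workers=8):
    """Issue all pending search queries at once, instead of paying one
    reasoning round trip per query as in sequential agentic search."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(search, queries))  # results keep input order

snippets = acquire_evidence_parallel(
    ["paper venue", "author affiliation", "benchmark score"])
```

Because the agent reasons once over the whole batch of returned evidence rather than after each individual call, the number of reasoning steps drops roughly with the batch width.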

[77] Aletheia tackles FirstProof autonomously

Tony Feng, Junehyuk Jung, Sang-hyun Kim, Carlo Pagano, Sergei Gukov, Chiang-Chiang Tsai, David Woodruff, Adel Javanmard, Aryan Mokhtari, Dawsen Hwang, Yuri Chervonyi, Jonathan N. Lee, Garrett Bingham, Trieu H. Trinh, Vahab Mirrokni, Quoc V. Le, Thang Luong

Main category: cs.CL

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2602.21201: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.21201&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[78] Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?

Pengxiang Li, Dilxat Muhtar, Tianlong Chen, Lu Yin, Shiwei Liu

Main category: cs.CL

TL;DR: NAP proposes a data-centric approach to enable genuinely non-autoregressive parallel generation in Diffusion Language Models by curating multiple independent reasoning trajectories and using parallel-forced decoding to encourage multi-token parallel updates.

DetailsMotivation: Current Diffusion Language Models (DLMs) often converge to autoregressive-like decoding despite being advertised as enabling parallel token generation. This sequential bottleneck limits their ability to exploit parallel hardware and improve latency scaling with output length. The authors identify a mismatch between DLM objectives and the sequential structure of standard training data as the primary driver of this AR-like behavior.

Method: NAP (Non-Autoregressive Parallel DLMs) is a data-centric approach that curates training examples as multiple independent reasoning trajectories and couples them with a parallel-forced decoding strategy. This alignment of supervision with non-AR parallel decoding encourages multi-token parallel updates rather than sequential generation.

Result: Across math reasoning benchmarks, NAP yields stronger performance under parallel decoding than DLMs trained on standard long chain-of-thought data. The performance gains grow as parallelism increases, demonstrating improved non-autoregressive generation capabilities.

Conclusion: Revisiting data and supervision is a principled direction for mitigating autoregressive-like behavior in DLMs and moving toward genuinely non-autoregressive parallel generation. The proposed NAP approach shows promising results in enabling more efficient parallel decoding.

Abstract: Diffusion Language Models (DLMs) are often advertised as enabling parallel token generation, yet practical fast DLMs frequently converge to left-to-right, autoregressive (AR)-like decoding dynamics. In contrast, genuinely non-AR generation is promising because it removes AR’s sequential bottleneck, better exploiting parallel hardware to reduce synchronization/communication overhead and improve latency scaling with output length. We argue that a primary driver of AR-like decoding is a mismatch between DLM objectives and the highly sequential structure of widely used training data, including standard pretraining corpora and long chain-of-thought (CoT) supervision. Motivated by this diagnosis, we propose NAP (Non-Autoregressive Parallel DLMs), a proof-of-concept, data-centric approach that better aligns supervision with non-AR parallel decoding. NAP curates examples as multiple independent reasoning trajectories and couples them with a parallel-forced decoding strategy that encourages multi-token parallel updates. Across math reasoning benchmarks, NAP yields stronger performance under parallel decoding than DLMs trained on standard long CoT data, with gains growing as parallelism increases. Our results suggest that revisiting data and supervision is a principled direction for mitigating AR-like behavior and moving toward genuinely non-autoregressive parallel generation in DLMs. Our code is available at https://github.com/pixeli99/NAP.

cs.CV

[79] PointCoT: A Multi-modal Benchmark for Explicit 3D Geometric Reasoning

Dongxu Zhang, Yiding Sun, Pengcheng Li, Yumou Liu, Hongqiang Lin, Haoran Xu, Xiaoxuan Mu, Liang Lin, Wenbiao Yan, Ning Yang, Chaowei Fang, Juanjuan Zhao, Jihua Zhu, Conghui He, Cheng Tan

Main category: cs.CV

TL;DR: PointCoT introduces explicit Chain-of-Thought reasoning for 3D point cloud understanding in MLLMs, addressing geometric hallucinations through a “Look, Think, then Answer” paradigm with hierarchical CoT annotations.

DetailsMotivation: Current MLLMs excel in 2D scenes but struggle with 3D point cloud understanding. Existing approaches treat geometric reasoning as implicit mapping, leading to geometric hallucinations where models generate plausible but structurally incorrect responses. There's a need for explicit reasoning that grounds responses in precise structural details.

Method: Proposes PointCoT framework with explicit Chain-of-Thought reasoning for 3D data using a “Look, Think, then Answer” paradigm. Creates Point-Reason-Instruct benchmark with ~86k instruction-tuning samples featuring hierarchical CoT annotations. Uses dual-stream multi-modal architecture to synergize semantic appearance with geometric truth.

Result: PointCoT achieves state-of-the-art performance on complex reasoning tasks, demonstrating improved geometric understanding and reduced hallucinations compared to previous approaches.

Conclusion: Explicit Chain-of-Thought reasoning is effective for 3D point cloud understanding in MLLMs, addressing geometric hallucinations and improving performance on complex reasoning tasks through structured reasoning processes.

Abstract: While Multimodal Large Language Models (MLLMs) demonstrate proficiency in 2D scenes, extending their perceptual intelligence to 3D point cloud understanding remains a significant challenge. Current approaches focus primarily on aligning 3D features with pre-trained models. However, they typically treat geometric reasoning as an implicit mapping process. These methods bypass intermediate logical steps and consequently suffer from geometric hallucinations. They confidently generate plausible responses that fail to ground in precise structural details. To bridge this gap, we present PointCoT, a novel framework that empowers MLLMs with explicit Chain-of-Thought (CoT) reasoning for 3D data. We advocate for a “Look, Think, then Answer” paradigm. In this approach, the model is supervised to generate geometry-grounded rationales before predicting final answers. To facilitate this, we construct Point-Reason-Instruct, a large-scale benchmark comprising ~86k instruction-tuning samples with hierarchical CoT annotations. By leveraging a dual-stream multi-modal architecture, our method synergizes semantic appearance with geometric truth. Extensive experiments demonstrate that PointCoT achieves state-of-the-art performance on complex reasoning tasks.

[80] GuardAlign: Test-time Safety Alignment in Multimodal Large Language Models

Xingyu Zhu, Beier Zhu, Junfeng Fang, Shuo Wang, Yin Zhang, Xiang Wang, Xiangnan He

Main category: cs.CV

TL;DR: GuardAlign is a training-free defense framework for large vision-language models that improves safety through optimal transport-based detection and cross-modal attention calibration.

DetailsMotivation: Current input-side defenses for LVLMs suffer from inaccurate detection in complex scenes and unstable safety signals during decoding, creating safety vulnerabilities.

Method: Two strategies: 1) OT-enhanced safety detection using optimal transport to measure distribution distances between image patches and unsafe semantics, 2) cross-modal attentive calibration that adaptively reallocates attention across layers to maintain safety signal activation.

Result: Reduces unsafe response rates by up to 39% on SPA-VL benchmark while preserving utility, with VQAv2 performance improving from 78.51% to 79.21% across six representative MLLMs.

Conclusion: GuardAlign provides an effective training-free defense that enhances safety in vision-language models without compromising their utility through improved detection and attention mechanisms.

Abstract: Large vision-language models (LVLMs) have achieved remarkable progress in vision-language reasoning tasks, yet ensuring their safety remains a critical challenge. Recent input-side defenses detect unsafe images with CLIP and prepend safety prefixes to prompts, but they still suffer from inaccurate detection in complex scenes and unstable safety signals during decoding. To address these issues, we propose GuardAlign, a training-free defense framework that integrates two strategies. First, OT-enhanced safety detection leverages optimal transport to measure distribution distances between image patches and unsafe semantics, enabling accurate identification of malicious regions without additional computational cost. Second, cross-modal attentive calibration strengthens the influence of safety prefixes by adaptively reallocating attention across layers, ensuring that safety signals remain consistently activated throughout generation. Extensive evaluations on six representative MLLMs demonstrate that GuardAlign reduces unsafe response rates by up to 39% on SPA-VL, while preserving utility, achieving an improvement on VQAv2 from 78.51% to 79.21%.
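Optimal transport distances between patch embeddings and unsafe-concept embeddings, as in the OT-enhanced detection step, can be computed with a few Sinkhorn iterations. A self-contained NumPy sketch on synthetic embeddings (the dimensions, regularization value, and the "unsafe concept" vectors are all illustrative, not GuardAlign's actual pipeline):

```python
import numpy as np

def sinkhorn_distance(X, Y, reg=0.5, n_iter=200):
    """Entropy-regularized OT distance between two embedding sets
    (rows = vectors), with uniform marginals, via Sinkhorn iterations."""
    C = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)  # pairwise cost
    a = np.full(len(X), 1.0 / len(X))
    b = np.full(len(Y), 1.0 / len(Y))
    K = np.exp(-C / reg)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]  # transport plan
    return float((P * C).sum())

rng = np.random.default_rng(1)
patches = rng.normal(size=(16, 8))          # hypothetical image-patch embeddings
unsafe = rng.normal(loc=2.0, size=(4, 8))   # hypothetical unsafe-concept embeddings
safe = rng.normal(loc=0.0, size=(4, 8))     # concepts drawn like the patches

d_unsafe = sinkhorn_distance(patches, unsafe)
d_safe = sinkhorn_distance(patches, safe)   # closer distribution, smaller distance
```

A small OT distance to the unsafe-concept set would flag a region as potentially malicious; in this benign synthetic example the patches sit much closer to the "safe" concepts.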

[81] DesignSense: A Human Preference Dataset and Reward Modeling Framework for Graphic Layout Generation

Varun Gopal, Rishabh Jain, Aradhya Mathur, Nikitha SR, Sohan Patnaik, Sudhir Yarram, Mayur Hemani, Balaji Krishnamurthy, Mausoom Sarkar

Main category: cs.CV

TL;DR: DesignSense-10k: A large-scale human-annotated preference dataset and vision-language reward model for evaluating graphic layout quality, addressing the gap in layout-specific aesthetic judgment.

DetailsMotivation: Existing layout generation models often fail to align with human aesthetic judgment, and current preference datasets/reward models from text-to-image generation don't generalize to layout evaluation where spatial arrangement is crucial.

Method: Created DesignSense-10k dataset with 10,235 human-annotated preference pairs using a five-stage curation pipeline (semantic grouping, layout prediction, filtering, clustering, VLM-based refinement). Trained DesignSense, a vision-language model-based classifier for layout evaluation.

Result: DesignSense outperforms existing open-source and proprietary models, with a 54.6% Macro F1 improvement over the strongest proprietary baseline. Frontier VLMs fail catastrophically on the full four-class layout evaluation task. Using DesignSense as a judge in RL training improves generator win rate by about 3%, and inference-time scaling provides a 3.6% improvement.

Conclusion: Specialized, layout-aware preference modeling is essential for improving layout generation quality, as general VLMs are unreliable for layout evaluation tasks.

Abstract: Graphic layouts serve as an important and engaging medium for visual communication across different channels. While recent layout generation models have demonstrated impressive capabilities, they frequently fail to align with nuanced human aesthetic judgment. Existing preference datasets and reward models trained on text-to-image generation do not generalize to layout evaluation, where the spatial arrangement of identical elements determines quality. To address this critical gap, we introduce DesignSense-10k, a large-scale dataset of 10,235 human-annotated preference pairs for graphic layout evaluation. We propose a five-stage curation pipeline that generates visually coherent layout transformations across diverse aspect ratios, using semantic grouping, layout prediction, filtering, clustering, and VLM-based refinement to produce high-quality comparison pairs. Human preferences are annotated using a 4-class scheme (left, right, both good, both bad) to capture subjective ambiguity. Leveraging this dataset, we train DesignSense, a vision-language model-based classifier that substantially outperforms existing open-source and proprietary models across comprehensive evaluation metrics (54.6% improvement in Macro F1 over the strongest proprietary baseline). Our analysis shows that frontier VLMs remain unreliable overall and fail catastrophically on the full four-class task, underscoring the need for specialized, preference-aware models. Beyond the dataset, our reward model DesignSense yields tangible downstream gains in layout generation. Using our judge during RL based training improves generator win rate by about 3%, while inference-time scaling, which involves generating multiple candidates and selecting the best one, provides a 3.6% improvement. These results highlight the practical impact of specialized, layout-aware preference modeling on real-world layout generation quality.
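The headline metric, Macro F1 over the four preference classes, averages per-class F1 scores without weighting by class frequency, so the rarer "both good"/"both bad" judgments count as much as "left"/"right". A minimal sketch (the label and prediction lists are made up for illustration):

```python
LABELS = ["left", "right", "both_good", "both_bad"]  # the 4-class scheme

def macro_f1(y_true, y_pred, labels=LABELS):
    """Macro F1: unweighted mean of per-class F1 scores."""
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(labels)

y_true = ["left", "right", "both_good", "both_bad", "left", "right"]
y_pred = ["left", "right", "both_good", "left", "left", "both_bad"]
score = macro_f1(y_true, y_pred)
```

Because every class contributes equally, a model that never predicts "both_bad" is penalized a full quarter of the score, which is why the four-class task is so much harder for general VLMs than plain pairwise preference.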

[82] Modelling and Simulation of Neuromorphic Datasets for Anomaly Detection in Computer Vision

Mike Middleton, Teymoor Ali, Hakan Kayan, Basabdatta Sen Bhattacharya, Charith Perera, Oliver Rhodes, Elena Gheorghiu, Mark Vousden, Martin A. Trefzer

Main category: cs.CV

TL;DR: ANTShapes is a Unity-based simulation framework for generating synthetic neuromorphic vision datasets with configurable 3D scenes and anomalous object behaviors for event-based computer vision research.

DetailsMotivation: Limited availability of Dynamic Vision Sensors (DVS) and existing neuromorphic vision datasets restricts research in event-based computer vision applications, necessitating a comprehensive simulation tool.

Method: Built in Unity engine, simulates abstract 3D scenes with objects having randomly-generated behaviors (motion, rotation). Uses statistical sampling based on central limit theorem principles for anomalous behavior labeling. Allows parameter adjustment for generating arbitrary-sized datasets with labels and frame data.

Result: ANTShapes enables creation of bespoke synthetic datasets for event-based computer vision research, addressing data scarcity issues for applications like object recognition, localization, and anomaly detection.

Conclusion: The framework provides a flexible solution to neuromorphic vision data limitations, allowing researchers to generate customized datasets for various event-based computer vision tasks.

Abstract: Limitations on the availability of Dynamic Vision Sensors (DVS) present a fundamental challenge to researchers of neuromorphic computer vision applications. In response, datasets have been created by the research community, but often contain a limited number of samples or scenarios. To address the lack of a comprehensive simulator of neuromorphic vision datasets, we introduce the Anomalous Neuromorphic Tool for Shapes (ANTShapes), a novel dataset simulation framework. Built in the Unity engine, ANTShapes simulates abstract, configurable 3D scenes populated by objects displaying randomly-generated behaviours describing attributes such as motion and rotation. The sampling of object behaviours, and the labelling of anomalously-acting objects, is a statistical process following central limit theorem principles. Datasets containing an arbitrary number of samples can be created and exported from ANTShapes, along with accompanying label and frame data, through the adjustment of a limited number of parameters within the software. ANTShapes addresses the limitations of data availability to researchers of event-based computer vision by allowing for the simulation of bespoke datasets to suit purposes including object recognition and localisation alongside anomaly detection.

[83] MMSD3.0: A Multi-Image Benchmark for Real-World Multimodal Sarcasm Detection

Haochen Zhao, Yuyao Kong, Yongxiu Xu, Gaopeng Gou, Hongbo Xu, Yubin Wang, Haoliang Zhang

Main category: cs.CV

TL;DR: MMSD3.0 is a new multi-image sarcasm detection benchmark, paired with the Cross-Image Reasoning Model (CIRM), which captures inter-image connections and applies relevance-guided cross-modal fusion.

Motivation: Existing multimodal sarcasm detection focuses on single-image scenarios, missing semantic and affective relations across multiple images that trigger sarcasm in real-world settings.

Method: Introduces MMSD3.0 benchmark with multi-image samples from tweets and Amazon reviews, and proposes CIRM with cross-image sequence modeling and relevance-guided fine-grained cross-modal fusion based on text-image correspondence.
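
The relevance-guided fusion idea can be sketched as weighting each image by its text-image correspondence before aggregation; the random embeddings and simple softmax weighting below are assumptions for illustration, not CIRM's actual mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_imgs = 64, 3

def unit(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Toy text embedding and per-image embeddings (unit norm).
text = unit(rng.normal(size=d))
imgs = unit(rng.normal(size=(n_imgs, d)))

# Relevance-guided fusion: weight each image by its text-image
# correspondence (softmax over cosine similarities), so weakly
# relevant images contribute less to the fused representation.
sims = imgs @ text
weights = np.exp(sims) / np.exp(sims).sum()
fused = weights @ imgs
```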

Result: MMSD3.0 is an effective benchmark reflecting real-world conditions, and CIRM achieves state-of-the-art performance across MMSD, MMSD2.0 and MMSD3.0 datasets.

Conclusion: The work addresses the gap in multi-image sarcasm detection and provides a comprehensive benchmark with effective modeling techniques for both single and multi-image scenarios.

Abstract: Despite progress in multimodal sarcasm detection, existing datasets and methods predominantly focus on single-image scenarios, overlooking potential semantic and affective relations across multiple images. This leaves a gap in modeling cases where sarcasm is triggered by multi-image cues in real-world settings. To bridge this gap, we introduce MMSD3.0, a new benchmark composed entirely of multi-image samples curated from tweets and Amazon reviews. We further propose the Cross-Image Reasoning Model (CIRM), which performs targeted cross-image sequence modeling to capture latent inter-image connections. In addition, we introduce a relevance-guided, fine-grained cross-modal fusion mechanism based on text-image correspondence to reduce information loss during integration. We establish a comprehensive suite of strong and representative baselines and conduct extensive experiments, showing that MMSD3.0 is an effective and reliable benchmark that better reflects real-world conditions. Moreover, CIRM demonstrates state-of-the-art performance across MMSD, MMSD2.0 and MMSD3.0, validating its effectiveness in both single-image and multi-image scenarios. Dataset and code are publicly available at https://github.com/ZHCMOONWIND/MMSD3.0.

[84] All in One: Unifying Deepfake Detection, Tampering Localization, and Source Tracing with a Robust Landmark-Identity Watermark

Junjiang Wu, Liejun Wang, Zhiqing Guo

Main category: cs.CV

TL;DR: A unified proactive forensics framework called LIDMark that jointly addresses deepfake detection, tampering localization, and source tracing using a 152-dimensional landmark-identity watermark and specialized decoder heads.

Motivation: Existing deepfake forensics methods treat detection, localization, and source tracing as independent tasks, lacking a unified framework. With the rapid advancement of deepfake technology posing threats to privacy and security, there is a need for comprehensive forensic solutions.

Method: Proposes LIDMark framework with 152-dimensional watermark interweaving facial landmarks with source identifier. Uses Factorized-Head Decoder (FHD) with two specialized heads: regression head for landmark reconstruction (enabling detection/localization via intrinsic-extrinsic consistency check) and classification head for source identifier decoding (enabling tracing).
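
A minimal sketch of the factorized-head idea, assuming a 136-d landmark vector (68 x/y pairs) plus a 16-bit identifier (summing to 152, though the paper's exact watermark layout is not specified here), with plain linear heads standing in for the FHD:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions: shared backbone feature, landmark vector, and
# per-bit source identifier. 136 + 16 = 152, matching the watermark
# dimensionality but not necessarily the paper's actual layout.
d_feat, d_landmark, d_id = 256, 136, 16

# Factorized heads reading the same shared feature: a regression head
# for landmarks and a classification head for identifier bits.
W_reg = rng.normal(size=(d_landmark, d_feat)) * 0.01
W_cls = rng.normal(size=(d_id, d_feat)) * 0.01

def decode(feature):
    landmarks = W_reg @ feature                  # regression head
    id_bits = (W_cls @ feature > 0).astype(int)  # classification head
    return landmarks, id_bits

feat = rng.normal(size=d_feat)
landmarks, id_bits = decode(feat)
```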

Result: Extensive experiments show the framework provides unified, robust, and imperceptible solution for detection, localization, and tracing of deepfake content. The code is publicly available.

Conclusion: LIDMark offers an “all-in-one” trifunctional forensic solution that jointly addresses deepfake detection, tampering localization, and source tracing in a unified framework, overcoming limitations of existing independent-task approaches.

Abstract: With the rapid advancement of deepfake technology, malicious face manipulations pose a significant threat to personal privacy and social security. However, existing proactive forensics methods typically treat deepfake detection, tampering localization, and source tracing as independent tasks, lacking a unified framework to address them jointly. To bridge this gap, we propose a unified proactive forensics framework that jointly addresses these three core tasks. Our core framework adopts an innovative 152-dimensional landmark-identity watermark termed LIDMark, which structurally interweaves facial landmarks with a unique source identifier. To robustly extract the LIDMark, we design a novel Factorized-Head Decoder (FHD). Its architecture factorizes the shared backbone features into two specialized heads (i.e., regression and classification), robustly reconstructing the embedded landmarks and identifier, respectively, even when subjected to severe distortion or tampering. This design realizes an “all-in-one” trifunctional forensic solution: the regression head underlies an “intrinsic-extrinsic” consistency check for detection and localization, while the classification head robustly decodes the source identifier for tracing. Extensive experiments show that the proposed LIDMark framework provides a unified, robust, and imperceptible solution for the detection, localization, and tracing of deepfake content. The code is available at https://github.com/vpsg-research/LIDMark.

[85] Enhancing CLIP Robustness via Cross-Modality Alignment

Xingyu Zhu, Beier Zhu, Shuo Wang, Kesen Zhao, Hanwang Zhang

Main category: cs.CV

TL;DR: COLA is a training-free optimal transport framework that improves adversarial robustness of vision-language models by restoring cross-modal alignment between image and text features under attacks.

Motivation: Vision-language models like CLIP show strong zero-shot classification but are vulnerable to adversarial attacks. Existing methods focus on adversarial fine-tuning or prompt optimization but overlook feature misalignment issues where text and image features become separated under attacks, degrading performance.

Method: COLA uses optimal transport to address adversarial misalignment: (1) projects adversarial image embeddings onto subspace spanned by class text features to filter non-semantic distortions, (2) models images/texts as discrete distributions over augmented views and refines alignment via optimal transport with subspace projection integrated into cost computation.
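
The first step, projection onto the class-text subspace, can be sketched with plain linear algebra (the OT refinement over augmented views is omitted; the embeddings below are random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_classes = 512, 10

# Random stand-ins: class text embeddings (rows of T) and an
# adversarial image embedding z_adv.
T = rng.normal(size=(n_classes, d))
z_adv = rng.normal(size=d)

# Orthogonal projection onto the subspace spanned by the class text
# features: P = T^T (T T^T)^{-1} T. Components of z_adv outside this
# subspace -- non-semantic distortions -- are discarded.
P = T.T @ np.linalg.inv(T @ T.T) @ T
z_proj = P @ z_adv

assert np.allclose(P @ z_proj, z_proj)  # projection is idempotent
```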

Result: Extensive evaluations on 14 zero-shot classification benchmarks show COLA improves robustness, with average 6.7% improvement on ImageNet and variants under PGD attacks while maintaining accuracy on clean samples. The method is training-free and compatible with existing fine-tuned models.

Conclusion: COLA effectively addresses adversarial misalignment in vision-language models through optimal transport-based cross-modality alignment, improving robustness without requiring additional training and maintaining performance on clean data.

Abstract: Vision-language models (VLMs) such as CLIP demonstrate strong generalization in zero-shot classification but remain highly vulnerable to adversarial perturbations. Existing methods primarily focus on adversarial fine-tuning or prompt optimization; they often overlook the gaps in CLIP’s encoded features, which is shown as the text and image features lie far apart from each other. This misalignment is significantly amplified under adversarial perturbations, leading to severe degradation in classification performance. To address this problem, we propose Cross-modality Alignment, dubbed COLA, an optimal transport-based framework that explicitly addresses adversarial misalignment by restoring both global image-text alignment and local structural consistency in the feature space. (1) COLA first projects adversarial image embeddings onto a subspace spanned by class text features, effectively filtering out non-semantic distortions while preserving discriminative information. (2) It then models images and texts as discrete distributions over multiple augmented views and refines their alignment via OT, with the subspace projection seamlessly integrated into the cost computation. This design ensures stable cross-modal alignment even under adversarial conditions. COLA is training-free and compatible with existing fine-tuned models. Extensive evaluations across 14 zero-shot classification benchmarks demonstrate the effectiveness of COLA, especially with an average improvement of 6.7% on ImageNet and its variants under PGD adversarial attacks, while maintaining high accuracy on clean samples.

[86] Synthetic Visual Genome 2: Extracting Large-scale Spatio-Temporal Scene Graphs from Videos

Ziqi Gao, Jieyu Zhang, Wisdom Oluchi Ikezogwo, Jae Sung Park, Tario G. You, Daniel Ogbu, Chenhao Zheng, Weikai Huang, Yinuo Yang, Winson Han, Quan Kong, Rajat Saini, Ranjay Krishna

Main category: cs.CV

TL;DR: SVG2 is a large-scale panoptic video scene graph dataset with 636K videos, and TRaSER is a video scene graph generation model that improves relation detection, object prediction, and attribute prediction over baselines.

Motivation: To address the lack of large-scale, diverse panoptic video scene graph datasets for training robust video understanding models, and to develop effective methods for generating explicit spatio-temporal scene graphs as intermediate representations for video understanding tasks.

Method: Created SVG2 dataset using automated pipeline with multi-scale panoptic segmentation, trajectory tracking, semantic parsing, and GPT-5-based relation inference. Developed TRaSER model with trajectory-aligned token arrangement, object-trajectory resampler, and temporal-window resampler to generate spatio-temporal scene graphs from raw videos.
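
A resampler in the general Perceiver style, where a small set of learned queries cross-attends to many tokens and compresses them into a fixed-size summary, can be sketched as follows. Shapes and weights are illustrative assumptions, not TRaSER's actual modules.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_tokens, n_queries = 64, 40, 4

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Single-head cross-attention: a few learned queries attend over the
# visual tokens of one temporal window and compress them into a
# fixed-size set, regardless of how many tokens the window holds.
tokens = rng.normal(size=(n_tokens, d))    # tokens of one window
queries = rng.normal(size=(n_queries, d))  # learned latent queries

attn = softmax(queries @ tokens.T / np.sqrt(d))
compressed = attn @ tokens                 # (n_queries, d) summary
```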

Result: TRaSER improves relation detection by +15-20%, object prediction by +30-40% over open-source baselines (+13% over GPT-5), and attribute prediction by +15%. When used for video QA, it provides +1.5-4.6% accuracy gain over video-only or Qwen2.5-VL scene graphs.

Conclusion: SVG2 provides a valuable large-scale resource for video scene graph research, and TRaSER demonstrates the utility of explicit spatio-temporal scene graphs as intermediate representations for improving video understanding tasks like question answering.

Abstract: We introduce Synthetic Visual Genome 2 (SVG2), a large-scale panoptic video scene graph dataset. SVG2 contains over 636K videos with 6.6M objects, 52.0M attributes, and 6.7M relations, providing an order-of-magnitude increase in scale and diversity over prior spatio-temporal scene graph datasets. To create SVG2, we design a fully automated pipeline that combines multi-scale panoptic segmentation, online-offline trajectory tracking with automatic new-object discovery, per-trajectory semantic parsing, and GPT-5-based spatio-temporal relation inference. Building on this resource, we train TRaSER, a video scene graph generation model. TRaSER augments VLMs with a trajectory-aligned token arrangement mechanism and new modules: an object-trajectory resampler and a temporal-window resampler to convert raw videos and panoptic trajectories into compact spatio-temporal scene graphs in a single forward pass. The temporal-window resampler binds visual tokens to short trajectory segments to preserve local motion and temporal semantics, while the object-trajectory resampler aggregates entire trajectories to maintain global context for objects. On the PVSG, VIPSeg, VidOR and SVG2 test datasets, TRaSER improves relation detection by +15 to 20%, object prediction by +30 to 40% over the strongest open-source baselines and by +13% over GPT-5, and attribute prediction by +15%. When TRaSER’s generated scene graphs are sent to a VLM for video question answering, it delivers a +1.5 to 4.6% absolute accuracy gain over using video only or video augmented with Qwen2.5-VL’s generated scene graphs, demonstrating the utility of explicit spatio-temporal scene graphs as an intermediate representation.

[87] LE-NeuS: Latency-Efficient Neuro-Symbolic Video Understanding via Adaptive Temporal Verification

Shawn Liang, Sahil Shah, Chengwei Zhou, SP Sharan, Harsh Goel, Arnab Sanyal, Sandeep Chinchali, Gourav Datta

Main category: cs.CV

TL;DR: LE-NeuS: A latency-efficient neuro-symbolic framework for long-form video QA that reduces the latency overhead over base VLM prompting from 90x to roughly 10x while maintaining accuracy gains, via adaptive sampling and batched proposition detection.

Motivation: Existing neuro-symbolic approaches to long-form video QA achieve significant accuracy improvements but suffer from prohibitive latency overheads (up to 90x slower than base VLM prompting), making them impractical for latency-sensitive edge deployments.

Method: Two key optimizations: (1) CLIP-guided two-stage adaptive sampling that skips semantically similar frames while preserving temporal boundaries, and (2) batched proposition detection that parallelizes VLM inference across temporal windows. Theoretical latency bounds are derived based on video length, proposition complexity, and sampling density.
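
The first optimization can be sketched as greedy similarity-based frame skipping; this toy version uses a single threshold, whereas the paper's scheme is two-stage and boundary-preserving.

```python
import numpy as np

def adaptive_sample(frame_embeds, sim_threshold=0.95):
    """Keep a frame only if its unit-norm embedding differs enough
    from the last kept frame -- a sketch of similarity-guided frame
    skipping, not the paper's full two-stage CLIP-guided scheme."""
    kept = [0]
    for i in range(1, len(frame_embeds)):
        if frame_embeds[i] @ frame_embeds[kept[-1]] < sim_threshold:
            kept.append(i)
    return kept

# Toy frame embeddings: frames 0-4 visually identical, frame 5 new.
rng = np.random.default_rng(0)
base = rng.normal(size=8); base /= np.linalg.norm(base)
other = rng.normal(size=8)
other -= (other @ base) * base       # make frame 5 orthogonal to base
other /= np.linalg.norm(other)
frames = [base] * 5 + [other]
print(adaptive_sample(frames))  # [0, 5]: redundant frames skipped
```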

Result: On LongVideoBench and Video-MME benchmarks deployed on NVIDIA H100 GPUs, LE-NeuS reduces the latency gap from 90x to approximately 10x while maintaining >10% accuracy gains on temporally complex queries.

Conclusion: LE-NeuS demonstrates that neuro-symbolic temporal reasoning can be made practical for real-world deployments through principled latency optimizations, preserving accuracy benefits while dramatically reducing inference time.

Abstract: Neuro-symbolic approaches to long-form video question answering (LVQA) have demonstrated significant accuracy improvements by grounding temporal reasoning in formal verification. However, existing methods incur prohibitive latency overheads, up to 90x slower than base VLM prompting, rendering them impractical for latency-sensitive edge deployments. We present LE-NeuS, a latency-efficient neuro-symbolic framework that preserves the accuracy benefits of temporal logic-guided video understanding while drastically reducing inference latency. Our key insight is that the dominant computational bottleneck arises from sequential and dense proposition detection across video frames during automaton construction. We address this through two principled optimizations: (1) CLIP guided two-stage adaptive sampling that exploits visual redundancy to skip semantically similar frames while preserving temporal boundaries, and (2) batched proposition detection that parallelizes VLM inference across temporal windows. Theoretically, we derive latency bounds as a function of video length, proposition complexity, and sampling density, establishing conditions under which latency efficiency is achievable. Empirically, on LongVideoBench and Video-MME benchmarks deployed on NVIDIA H100 GPUs, LE-NeuS reduces the latency gap from 90x to approximately 10x while maintaining >10% accuracy gains on temporally complex queries.

[88] No Calibration, No Depth, No Problem: Cross-Sensor View Synthesis with 3D Consistency

Cho-Ying Wu, Zixun Huang, Xinyu Huang, Liu Ren

Main category: cs.CV

TL;DR: A method for cross-sensor view synthesis that eliminates the need for cumbersome calibration between RGB and other sensors (X-sensors) by using a match-densify-consolidate approach with 3D Gaussian Splatting.

Motivation: Existing RGB-X research assumes aligned sensor pairs exist and focuses on modality fusion, but obtaining such aligned data requires huge engineering effort in calibration. The paper addresses this practical bottleneck in cross-sensor learning by proposing a scalable solution for view synthesis without calibration.

Method: Proposes a match-densify-consolidate method: 1) RGB-X image matching followed by guided point densification, 2) confidence-aware densification and self-matching filtering for better view synthesis, 3) consolidation in 3D Gaussian Splatting (3DGS). Uses no 3D priors for X-sensor and only assumes nearly no-cost COLMAP for RGB.

Result: The method aims to remove cumbersome calibration for various RGB-X sensors and advance cross-sensor learning by providing a scalable solution that breaks through the bottleneck in large-scale real-world RGB-X data collection.

Conclusion: This work presents the first study of cross-sensor view synthesis across different modalities, offering a practical solution to the widely overlooked problem of obtaining aligned RGB-X data without expensive calibration efforts.

Abstract: We present the first study of cross-sensor view synthesis across different modalities. We examine a practical, fundamental, yet widely overlooked problem: getting aligned RGB-X data, where most RGB-X prior work assumes such pairs exist and focuses on modality fusion, but it empirically requires huge engineering effort in calibration. We propose a match-densify-consolidate method. First, we perform RGB-X image matching followed by guided point densification. Using the proposed confidence-aware densification and self-matching filtering, we attain better view synthesis and later consolidate them in 3D Gaussian Splatting (3DGS). Our method uses no 3D priors for X-sensor and only assumes nearly no-cost COLMAP for RGB. We aim to remove the cumbersome calibration for various RGB-X sensors and advance the popularity of cross-sensor learning by a scalable solution that breaks through the bottleneck in large-scale real-world RGB-X data collection.

[89] Evidential Neural Radiance Fields

Ruxiao Duan, Alex Wong

Main category: cs.CV

TL;DR: Evidential Neural Radiance Fields (ENeRF) is a probabilistic approach that integrates with NeRF to quantify both aleatoric and epistemic uncertainty in a single forward pass, achieving state-of-the-art scene reconstruction and uncertainty estimation.

Motivation: Neural radiance fields (NeRFs) achieve impressive 3D scene reconstruction but lack uncertainty estimation, limiting deployment in safety-critical applications. Existing methods fail to capture both types of uncertainty or compromise rendering quality/computational efficiency.

Method: Introduces Evidential Neural Radiance Fields (ENeRF), a probabilistic approach that seamlessly integrates with NeRF rendering process. Uses evidential learning to quantify both aleatoric (data noise) and epistemic (model uncertainty) from a single forward pass without compromising quality.
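
In the standard deep evidential regression formulation (which ENeRF's exact parameterization may refine), the network outputs Normal-Inverse-Gamma parameters from which both uncertainties follow in closed form:

```python
# Standard NIG evidential regression quantities; the network predicts
# (gamma, nu, alpha, beta) per rendered output. This is the common
# formulation, assumed here as a sketch of the uncertainty split.
def evidential_uncertainty(nu, alpha, beta):
    aleatoric = beta / (alpha - 1.0)         # E[sigma^2]: data noise
    epistemic = beta / (nu * (alpha - 1.0))  # Var[mu]: model doubt
    return aleatoric, epistemic

al, ep = evidential_uncertainty(nu=2.0, alpha=3.0, beta=4.0)
assert (al, ep) == (2.0, 1.0)
```

Note that epistemic uncertainty shrinks as the evidence parameter nu grows, while aleatoric uncertainty does not, which is what makes the two separable from a single forward pass.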

Result: Demonstrates state-of-the-art scene reconstruction fidelity and uncertainty estimation quality on three standardized benchmarks. Achieves both high rendering quality and efficient uncertainty quantification.

Conclusion: ENeRF provides a practical solution for trustworthy 3D scene modeling by enabling direct quantification of both uncertainty types in NeRFs without sacrificing rendering quality or computational efficiency.

Abstract: Understanding sources of uncertainty is fundamental to trustworthy three-dimensional scene modeling. While recent advances in neural radiance fields (NeRFs) achieve impressive accuracy in scene reconstruction and novel view synthesis, the lack of uncertainty estimation significantly limits their deployment in safety-critical settings. Existing uncertainty quantification methods for NeRFs fail to capture both aleatoric and epistemic uncertainty. Among those that do quantify one or the other, many of them either compromise rendering quality or incur significant computational overhead to obtain uncertainty estimates. To address these issues, we introduce Evidential Neural Radiance Fields, a probabilistic approach that seamlessly integrates with the NeRF rendering process and enables direct quantification of both aleatoric and epistemic uncertainty from a single forward pass. We compare multiple uncertainty quantification methods on three standardized benchmarks, where our approach demonstrates state-of-the-art scene reconstruction fidelity and uncertainty estimation quality.

[90] CycleBEV: Regularizing View Transformation Networks via View Cycle Consistency for Bird’s-Eye-View Semantic Segmentation

Jeongbin Hong, Dooseop Choi, Taeg-Hyun An, Kyounghwan An, Kyoung-Wook Min

Main category: cs.CV

TL;DR: CycleBEV: A regularization framework using cycle consistency to enhance view transformation models for BEV semantic segmentation in autonomous driving.

Motivation: Transforming image features from perspective view to bird's-eye-view remains challenging due to depth ambiguity and occlusion. Existing view transformation paradigms still face difficulties in capturing accurate semantic and geometric information.

Method: Proposes CycleBEV with an inverse view transformation network that maps BEV segmentation back to PV segmentation. Uses cycle consistency losses during training, extending to geometric and representation spaces. The IVT network is only used during training, not inference.
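
The loss structure can be sketched with toy linear stand-ins for the VT and IVT networks; the shapes, targets, and 0.5 weight below are assumptions purely to show how the cycle term regularizes VT during training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear stand-ins for the view-transformation (VT) network and
# the inverse (IVT) network that exists only at training time.
d_pv, d_bev = 32, 24
VT = rng.normal(size=(d_bev, d_pv)) * 0.1
IVT = rng.normal(size=(d_pv, d_bev)) * 0.1

x_pv = rng.normal(size=d_pv)    # PV features
y_bev = rng.normal(size=d_bev)  # BEV segmentation target
y_pv = rng.normal(size=d_pv)    # PV segmentation target

bev_pred = VT @ x_pv
seg_loss = float(np.mean((bev_pred - y_bev) ** 2))
# Cycle consistency: map the BEV prediction back to PV space and
# compare against the PV target; the extra gradient signal shapes VT.
cycle_loss = float(np.mean((IVT @ bev_pred - y_pv) ** 2))

total_loss = seg_loss + 0.5 * cycle_loss  # 0.5 is an assumed weight
```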

Result: Consistent improvements across four representative VT models on nuScenes dataset: gains up to 0.74, 4.86, and 3.74 mIoU for drivable area, vehicle, and pedestrian classes respectively, without increasing inference complexity.

Conclusion: CycleBEV effectively enhances existing view transformation models for BEV semantic segmentation through cycle consistency regularization, improving performance while maintaining inference efficiency.

Abstract: Transforming image features from perspective view (PV) space to bird’s-eye-view (BEV) space remains challenging in autonomous driving due to depth ambiguity and occlusion. Although several view transformation (VT) paradigms have been proposed, the challenge still remains. In this paper, we propose a new regularization framework, dubbed CycleBEV, that enhances existing VT models for BEV semantic segmentation. Inspired by cycle consistency, widely used in image distribution modeling, we devise an inverse view transformation (IVT) network that maps BEV segmentation maps back to PV segmentation maps and use it to regularize VT networks during training through cycle consistency losses, enabling them to capture richer semantic and geometric information from input PV images. To further exploit the capacity of the IVT network, we introduce two novel ideas that extend cycle consistency into geometric and representation spaces. We evaluate CycleBEV on four representative VT models covering three major paradigms using the large-scale nuScenes dataset. Experimental results show consistent improvements – with gains of up to 0.74, 4.86, and 3.74 mIoU for drivable area, vehicle, and pedestrian classes, respectively – without increasing inference complexity, since the IVT network is used only during training. The implementation code is available at https://github.com/JeongbinHong/CycleBEV.

[91] Hyperdimensional Cross-Modal Alignment of Frozen Language and Image Models for Efficient Image Captioning

Abhishek Dalvi, Vasant Honavar

Main category: cs.CV

TL;DR: HDFLIM is a framework that aligns frozen vision and language foundation models using hyperdimensional computing, enabling cross-modal mapping without parameter updates.

Motivation: Traditional multimodal alignment requires computationally intensive fine-tuning that modifies pretrained representations. The authors hypothesize that independently trained foundation models may already have latent semantic compatibility, and seek to achieve cross-modal alignment without modifying the models themselves.

Method: Projects unimodal embeddings into a shared hyperdimensional space and uses lightweight symbolic operations (binding, bundling, similarity-based retrieval) to construct associative cross-modal representations. Caption generation emerges from high-dimensional memory retrieval rather than gradient-based optimization.
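
The three symbolic operations are standard in hyperdimensional computing and can be sketched with bipolar vectors; the image/word pairs below are toy stand-ins for HDFLIM's projected frozen-encoder embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000  # dimensionality of the hyperdimensional space

def rand_hv():
    return rng.choice([-1, 1], size=D)

def bind(a, b):   # binding: elementwise product
    return a * b

def bundle(*vs):  # bundling: elementwise majority vote
    return np.sign(np.sum(vs, axis=0))

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy associative memory: bind each image key to its word value, then
# bundle the pairs into a single memory vector.
img_cat, img_dog = rand_hv(), rand_hv()
word_cat, word_dog = rand_hv(), rand_hv()
memory = bundle(bind(img_cat, word_cat), bind(img_dog, word_dog))

# Unbinding with an image key yields a noisy copy of its word;
# similarity-based retrieval recovers it from the vocabulary.
query = bind(memory, img_cat)
vocab = {"cat": word_cat, "dog": word_dog}
best = max(vocab, key=lambda w: cos(query, vocab[w]))
print(best)  # "cat"
```

Because binding is its own inverse for bipolar vectors, retrieval reduces to a nearest-neighbor lookup in the shared space, with no gradient-based optimization involved.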

Result: Achieves performance comparable to end-to-end vision-language training methods and produces captions that are more semantically grounded than zero-shot baselines.

Conclusion: Semantic mapping across foundation models can be realized through symbolic operations on hyperdimensional encodings, pointing toward an alternative paradigm for foundation model alignment through structured representational mappings rather than large-scale retraining.

Abstract: Large unimodal foundation models for vision and language encode rich semantic structures, yet aligning them typically requires computationally intensive multimodal fine-tuning. Such approaches depend on large-scale parameter updates, are resource intensive, and can perturb pretrained representations. Emerging evidence suggests, however, that independently trained foundation models may already exhibit latent semantic compatibility, reflecting shared structures in the data they model. This raises a fundamental question: can cross-modal alignment be achieved without modifying the models themselves? Here we introduce HDFLIM (HyperDimensional computing with Frozen Language and Image Models), a framework that establishes cross-modal mappings while keeping pretrained vision and language models fully frozen. HDFLIM projects unimodal embeddings into a shared hyperdimensional space and leverages lightweight symbolic operations – binding, bundling, and similarity-based retrieval to construct associative cross-modal representations in a single pass over the data. Caption generation emerges from high-dimensional memory retrieval rather than iterative gradient-based optimization. We show that HDFLIM achieves performance comparable to end-to-end vision-language training methods and produces captions that are more semantically grounded than zero-shot baselines. By decoupling alignment from parameter tuning, our results suggest that semantic mapping across foundation models can be realized through symbolic operations on hyperdimensional encodings of the respective embeddings. More broadly, this work points toward an alternative paradigm for foundation model alignment in which frozen models are integrated through structured representational mappings rather than through large-scale retraining. The codebase for our implementation can be found at https://github.com/Abhishek-Dalvi410/HDFLIM.

[92] Pseudo Contrastive Learning for Diagram Comprehension in Multimodal Models

Hiroshi Sasaki

Main category: cs.CV

TL;DR: A new training paradigm enhances diagram comprehension in vision-language models using pseudo contrastive samples generated by a diagram renderer, improving sensitivity to fine-grained structural variations in diagrams.

Motivation: Multimodal models like CLIP struggle with domains where small visual differences carry large semantic significance, such as diagram understanding, due to limited sensitivity to fine-grained structural variations.

Method: Introduces pseudo contrastive samples generated by a diagram renderer that creates synthetic diagrams using randomly picked text elements. These samples highlight structural differences without modifying original data, and are incorporated into the training objective to improve structural sensitivity.
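
The training objective can be sketched as a contrastive loss in which a re-rendered diagram (text elements swapped) serves as a hard pseudo-negative; the embeddings and temperature below are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32

def unit(v):
    return v / np.linalg.norm(v)

# Hypothetical embeddings: a diagram image, its matching caption, and
# a pseudo contrastive sample -- the same diagram re-rendered with
# randomly swapped text elements (here just a different vector).
img = unit(rng.normal(size=d))
txt_pos = unit(img + 0.1 * rng.normal(size=d))
txt_pseudo = unit(rng.normal(size=d))

# InfoNCE-style loss where the pseudo sample acts as a hard negative,
# pushing the model to separate structurally similar diagrams.
tau = 0.07
logits = np.array([img @ txt_pos, img @ txt_pseudo]) / tau
loss = -logits[0] + np.log(np.exp(logits).sum())
```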

Result: Substantial improvements over standard CLIP and hard-negative CLIP training on benchmark flowchart datasets for both image-text matching and visual question answering tasks.

Conclusion: Domain-specific training strategies are valuable for advancing diagrammatic understanding within vision-language learning, and the proposed approach effectively enhances sensitivity to fine-grained structural variations in diagrams.

Abstract: Recent multimodal models such as Contrastive Language-Image Pre-training (CLIP) have shown remarkable ability to align visual and linguistic representations. However, domains where small visual differences carry large semantic significance, such as diagram understanding, remain challenging due to the models’ limited sensitivity to fine-grained structural variations. We propose a new training paradigm designed to enhance diagram comprehension in vision-language models. Our approach introduces pseudo contrastive samples generated by a diagram renderer that creates synthetic diagrams using randomly picked text elements. These samples highlight structural differences in diagrammatic imagery without requiring any modification or editing of the original data. By incorporating these pseudo contrastive samples into the training objective, the model learns to capture more precise and semantically consistent diagram structures. Empirical evaluations on a benchmark dataset of flowcharts demonstrate substantial improvements over standard CLIP and hard-negative CLIP training in both image-text matching and visual question answering tasks. The results underscore the value of domain-specific training strategies and contribute to advancing diagrammatic understanding within the broader context of vision-language learning.

[93] Incremental dimension reduction for efficient and accurate visual anomaly detection

Teng-Yok Lee

Main category: cs.CV

TL;DR: Incremental dimension reduction algorithm for visual anomaly detection that processes features in batches using truncated SVD to handle large datasets efficiently.

Motivation: Deep neural networks extract high-dimensional features for visual anomaly detection, but this creates computational challenges for large datasets with thousands of images. Existing methods struggle with memory constraints when processing all features at once.

Method: Proposes an incremental dimension reduction algorithm that groups feature vectors into batches. For each batch, computes truncated singular value decomposition (SVD) to update singular values and vectors representing all visited vectors. Reduces each batch using its own singular values/vectors for memory efficiency, then re-transforms batch-wise singular vectors to the space spanned by all features’ singular vectors.
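
The batch-wise update can be sketched as repeatedly stacking the current rank-k sketch with the next batch and re-truncating its SVD; this simplified version omits the paper's batch-wise storage and final re-transformation details.

```python
import numpy as np

def incremental_tsvd(batches, k):
    """Maintain a rank-k sketch of the row space over batches of
    feature vectors -- a simplified form of the incremental
    truncated-SVD scheme described above."""
    sketch = None  # stores diag(S_k) @ Vt_k summarizing visited rows
    for X in batches:
        stacked = X if sketch is None else np.vstack([sketch, X])
        _, S, Vt = np.linalg.svd(stacked, full_matrices=False)
        sketch = S[:k, None] * Vt[:k]
    return sketch

# Synthetic low-rank features (rank 5) plus small noise, so the
# rank-8 truncation discards almost no signal energy.
rng = np.random.default_rng(0)
data = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 64))
data += 0.01 * rng.normal(size=data.shape)

sketch = incremental_tsvd(np.array_split(data, 6), k=8)

# Top singular values from the sketch closely match the exact ones
# computed on all rows at once.
S_inc = np.linalg.svd(sketch, compute_uv=False)
S_full = np.linalg.svd(data, compute_uv=False)
assert np.allclose(S_inc[:5], S_full[:5], rtol=1e-2)
```

Each update touches only the k x d sketch plus one batch, which is the memory saving over decomposing all features at once.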

Result: The algorithm accelerates training of state-of-the-art anomaly detection algorithms while maintaining accuracy close to that of processing all features in a single batch.

Conclusion: The incremental approach enables efficient processing of large visual datasets for anomaly detection by overcoming memory limitations while preserving detection accuracy.

Abstract: While nowadays visual anomaly detection algorithms use deep neural networks to extract salient features from images, the high dimensionality of extracted features makes it difficult to apply those algorithms to large data with 1000s of images. To address this issue, we present an incremental dimension reduction algorithm to reduce the extracted features. While our algorithm essentially computes truncated singular value decomposition of these features, other than processing all vectors at once, our algorithm groups the vectors into batches. At each batch, our algorithm updates the truncated singular values and vectors that represent all visited vectors, and reduces each batch by its own singular values and vectors so they can be stored in the memory with low overhead. After processing all batches, we re-transform these batch-wise singular vectors to the space spanned by the singular vectors of all features. We show that our algorithm can accelerate the training of state-of-the-art anomaly detection algorithm with close accuracy.

[94] Annotation-Free Visual Reasoning for High-Resolution Large Multimodal Models via Reinforcement Learning

Jiacheng Yang, Anqi Chen, Yunkai Dang, Qi Fan, Cong Wang, Wenbin Li, Feng Miao, Yang Gao

Main category: cs.CV

TL;DR: HART is a framework that enables Large Multimodal Models to focus on key regions in high-resolution images without requiring expensive human annotations, using self-verification and reinforcement learning techniques.

DetailsMotivation: Current LMMs struggle with high-resolution visual inputs due to the quadratic increase in image tokens, which introduces redundancy. Existing methods require costly human annotations for visual grounding, motivating annotation-free approaches that enhance models' grounding abilities for reasoning.

Method: Proposes HART (High-resolution Annotation-free Reasoning Technique), a closed-loop framework with post-training paradigm using Advantage Preference Group Relative Policy Optimization (AP-GRPO) to encourage accurate localization of key regions through self-verification.
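
The summary does not spell out AP-GRPO, but the group-relative advantage at the core of any GRPO-style objective is easy to sketch (a hedged illustration of the base technique, not the paper's exact formulation):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: each sampled response is scored against the
    mean/std of its own sampled group, so no learned value function
    (critic) is needed -- the group itself is the baseline."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

In HART's setting the rewards would come from the model's own self-verification of its localized regions, making the loop annotation-free.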

Result: HART improves performance across a wide range of high-resolution visual tasks, consistently outperforming strong baselines. When applied to Qwen2.5-VL-7B, it surpasses larger models such as Qwen2.5-VL-72B and LLaVA-OneVision-72B on high-resolution, vision-centric benchmarks.

Conclusion: HART enables LMMs to effectively handle high-resolution visual inputs without expensive annotations, providing explainable reasoning pathways and efficient optimization of localization, demonstrating strong performance improvements.

Abstract: Current Large Multimodal Models (LMMs) struggle with high-resolution visual inputs during the reasoning process, as the number of image tokens increases quadratically with resolution, introducing substantial redundancy and irrelevant information. A common practice is to identify key image regions and refer to their high-resolution counterparts during reasoning, typically trained with external visual supervision. However, such visual supervision cues require costly grounding labels from human annotators. Meanwhile, it remains an open question how to enhance a model’s grounding abilities to support reasoning without relying on additional annotations. In this paper, we propose High-resolution Annotation-free Reasoning Technique (HART), a closed-loop framework that enables LMMs to focus on and self-verify key regions of high-resolution visual inputs. HART incorporates a post-training paradigm in which we design Advantage Preference Group Relative Policy Optimization (AP-GRPO) to encourage accurate localization of key regions. Notably, HART provides explainable reasoning pathways and enables efficient optimization of localization. Extensive experiments demonstrate that HART improves performance across a wide range of high-resolution visual tasks, consistently outperforming strong baselines. When applied to post-train Qwen2.5-VL-7B, HART even surpasses larger-scale models such as Qwen2.5-VL-72B and LLaVA-OneVision-72B on high-resolution, vision-centric benchmarks.

[95] Egocentric Visibility-Aware Human Pose Estimation

Peng Dai, Yu Zhang, Yiqiang Feng, Zhen Fan, Yang Zhang

Main category: cs.CV

TL;DR: Eva-3M: Large-scale egocentric visibility-aware human pose estimation dataset with 3M+ frames and 435K visibility annotations, plus EvaPose method that explicitly incorporates visibility information to improve pose estimation accuracy.

DetailsMotivation: Existing egocentric human pose estimation methods overlook keypoint invisibility issues, treating visible and invisible keypoints indiscriminately, which compromises their accuracy. No existing datasets provide keypoint visibility annotations.

Method: Created Eva-3M dataset with 3.0M frames (435K with visibility labels) and augmented EMHI dataset with visibility annotations. Proposed EvaPose method that explicitly incorporates visibility information to enhance pose estimation.
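
One concrete way visibility labels change the training signal (an illustrative sketch; EvaPose's actual loss is not given in the summary) is to mask invisible keypoints out of the per-joint error, instead of treating all joints indiscriminately:

```python
import numpy as np

def visibility_masked_mpjpe(pred, gt, visible):
    """Mean per-joint position error over visible keypoints only, so that
    occluded joints do not pollute the supervision signal.

    pred, gt: (J, 3) keypoint coordinates; visible: (J,) 0/1 labels.
    """
    err = np.linalg.norm(pred - gt, axis=-1)
    mask = np.asarray(visible, dtype=float)
    return float((err * mask).sum() / max(mask.sum(), 1.0))
```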

Result: Extensive experiments validate the value of ground-truth visibility labels in egocentric HPE settings. EvaPose achieves state-of-the-art performance on both Eva-3M and EMHI datasets.

Conclusion: Visibility-aware approaches significantly improve egocentric human pose estimation accuracy. The Eva-3M dataset and EvaPose method advance research in this direction by addressing the previously overlooked keypoint invisibility problem.

Abstract: Egocentric human pose estimation (HPE) using a head-mounted device is crucial for various VR and AR applications, but it faces significant challenges due to keypoint invisibility. Nevertheless, none of the existing egocentric HPE datasets provide keypoint visibility annotations, and the existing methods often overlook the invisibility problem, treating visible and invisible keypoints indiscriminately during estimation. As a result, their capacity to accurately predict visible keypoints is compromised. In this paper, we first present Eva-3M, a large-scale egocentric visibility-aware HPE dataset comprising over 3.0M frames, with 435K of them annotated with keypoint visibility labels. Additionally, we augment the existing EMHI dataset with keypoint visibility annotations to further facilitate the research in this direction. Furthermore, we propose EvaPose, a novel egocentric visibility-aware HPE method that explicitly incorporates visibility information to enhance pose estimation accuracy. Extensive experiments validate the significant value of ground-truth visibility labels in egocentric HPE settings, and demonstrate that our EvaPose achieves state-of-the-art performance on both the Eva-3M and EMHI datasets.

[96] DLEBench: Evaluating Small-scale Object Editing Ability for Instruction-based Image Editing Model

Shibo Hong, Boxian Ai, Jun Kuang, Wei Wang, FengJiao Chen, Zhongyuan Peng, Chenhao Huang, Yixin Cao

Main category: cs.CV

TL;DR: DeepLookEditBench (DLEBench) is the first benchmark for evaluating Instruction-based Image Editing Models on small-scale object editing, featuring 1889 samples with objects occupying only 1-10% of image area and complex scenarios like occlusion.

DetailsMotivation: Current IIEMs show strong reasoning on existing benchmarks but their ability to edit small objects remains underexplored, despite its importance for precise local editing and refining details in both real and generated images.

Method: Constructed DLEBench with 1889 samples across 7 instruction types where target objects occupy only 1-10% of image area. Proposed evaluation protocol with refined score rubrics for Instruction Following and Visual Consistency, plus dual-mode evaluation framework (Tool-driven and Oracle-guided Modes) to address misalignment between LMM-as-a-Judge and human judgments.
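
The 1-10% area criterion that defines the benchmark's samples reduces to a mask-coverage check; a hypothetical helper (name and thresholds taken from the summary's stated range), shown for concreteness:

```python
import numpy as np

def is_small_scale(object_mask, lo=0.01, hi=0.10):
    """DLEBench-style selection test: does the target object occupy
    between 1% and 10% of the image area?"""
    ratio = np.asarray(object_mask, dtype=bool).mean()
    return bool(lo <= ratio <= hi)
```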

Result: Empirical evaluation of 10 IIEMs revealed significant performance gaps in small-scale object editing, highlighting the need for specialized benchmarks to advance this ability.

Conclusion: DLEBench addresses a critical gap in evaluating IIEMs for small object editing, providing a challenging testbed and robust evaluation protocol that reveals current model limitations and guides future improvements in precise image editing.

Abstract: Significant progress has been made in the field of Instruction-based Image Editing Models (IIEMs). However, while these models demonstrate plausible adherence to instructions and strong reasoning ability on current benchmarks, their ability to edit small objects remains underexplored, despite its importance for precise local editing and refining details in both real and generated images. In this paper, we introduce DeepLookEditBench (DLEBench), the first benchmark dedicated to assessing the abilities of IIEMs in editing small-scale objects. Specifically, we construct a challenging testbed comprising 1889 samples across seven instruction types. In these samples, target objects occupy only 1%-10% of the image area, covering complex scenarios such as partial occlusion and multi-object editing. To ensure robust evaluation on this benchmark, we propose an evaluation protocol with refined score rubrics to minimize subjectivity and ambiguity in two criteria: Instruction Following and Visual Consistency. This protocol also introduces a dual-mode evaluation framework (Tool-driven and Oracle-guided Modes) addressing the misalignment between LMM-as-a-Judge and human judgements on DLEBench. Empirical results on 10 IIEMs reveal significant performance gaps in small-scale object editing, highlighting the need for specialized benchmarks to advance this ability.

[97] BuildAnyPoint: 3D Building Structured Abstraction from Diverse Point Clouds

Tongyan Hua, Haoran Gong, Yuan Liu, Di Wang, Ying-Cong Chen, Wufan Zhao

Main category: cs.CV

TL;DR: BuildAnyPoint is a generative framework for structured 3D building reconstruction from diverse point clouds using Loosely Cascaded Diffusion Transformers for distribution recovery and autoregressive mesh generation.

DetailsMotivation: The paper addresses the challenge of reconstructing structured 3D buildings from point clouds with diverse distributions (like airborne LiDAR and Structure-from-Motion). Current methods struggle with the highly underconstrained nature of this problem, particularly in recovering artist-created building abstractions from noisy or sparse point data.

Method: The method introduces BuildAnyPoint with two main components: 1) A Loosely Cascaded Diffusion Transformer (Loca-DiT) that first recovers the underlying distribution from noisy/sparse points, formulated as a conditional generation task using latent diffusion models. 2) A decoder-only transformer for conditional autoregressive mesh generation based on the recovered point clouds, enabling structured building abstraction.

Result: The method shows substantial qualitative and quantitative improvements over prior building abstraction methods. Additionally, the recovered point clouds demonstrate strong performance on building point cloud completion benchmarks, exhibiting improved surface accuracy and distribution uniformity.

Conclusion: BuildAnyPoint effectively addresses the challenge of structured 3D building reconstruction from diverse point clouds by combining diffusion-based distribution recovery with autoregressive mesh generation, outperforming existing methods and showing promise for practical applications in building abstraction.

Abstract: We introduce BuildAnyPoint, a novel generative framework for structured 3D building reconstruction from point clouds with diverse distributions, such as those captured by airborne LiDAR and Structure-from-Motion. To recover artist-created building abstraction in this highly underconstrained setting, we capitalize on the role of explicit 3D generative priors in autoregressive mesh generation. Specifically, we design a Loosely Cascaded Diffusion Transformer (Loca-DiT) that initially recovers the underlying distribution from noisy or sparse points, followed by autoregressively encapsulating them into compact meshes. We first formulate distribution recovery as a conditional generation task by training latent diffusion models conditioned on input point clouds, and then tailor a decoder-only transformer for conditional autoregressive mesh generation based on the recovered point clouds. Our method delivers substantial qualitative and quantitative improvements over prior building abstraction methods. Furthermore, the effectiveness of our approach is evidenced by the strong performance of its recovered point clouds on building point cloud completion benchmarks, which exhibit improved surface accuracy and distribution uniformity.

[98] 3D Modality-Aware Pre-training for Vision-Language Model in MRI Multi-organ Abnormality Detection

Haowen Zhu, Ning Yin, Xiaogen Zhou

Main category: cs.CV

TL;DR: MedMAP is a medical modality-aware pretraining framework for 3D MRI that improves vision-language alignment and cross-modal fusion for multi-organ abnormality detection.

DetailsMotivation: Applying vision-language models to multi-organ medical imaging faces challenges with modality-specific vision-language alignment and cross-modal feature fusion, especially in 3D MRI contexts.

Method: Two-stage framework: (1) modality-aware vision-language alignment pre-training that captures joint modality distributions, and (2) fine-tuning vision encoders for downstream tasks while keeping text encoder frozen. Uses MedMoM-MRI3D dataset with 7,392 3D MRI volume-report pairs across 12 MRI modalities and 9 abnormalities.
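
The summary does not name MedMAP's alignment objective; the standard choice for matched volume-report pairs is a symmetric CLIP-style InfoNCE loss, sketched here as an assumption:

```python
import numpy as np

def symmetric_infonce(img_emb, txt_emb, tau=0.07):
    """CLIP-style alignment loss over a batch of matched (volume, report)
    pairs: each image embedding should retrieve its own report and
    vice versa, with temperature `tau`."""
    def unit(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    logits = unit(img_emb) @ unit(txt_emb).T / tau  # (B, B) similarity

    def cross_entropy_diag(l):
        # correct match for row i is column i
        l = l - l.max(axis=1, keepdims=True)
        p = np.exp(l) / np.exp(l).sum(axis=1, keepdims=True)
        return -np.log(np.diag(p)).mean()

    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))
```

In the fine-tuning stage described above, only the vision encoder would continue to receive gradients while the text encoder stays frozen.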

Result: MedMAP significantly outperforms existing VLMs in 3D MRI-based multi-organ abnormality detection on the MedMoM-MRI3D benchmark.

Conclusion: The proposed modality-aware pretraining framework effectively addresses alignment and fusion challenges in medical VLMs, demonstrating strong performance for 3D MRI analysis tasks.

Abstract: Vision-language models (VLMs) show strong potential for complex diagnostic tasks in medical imaging. However, applying VLMs to multi-organ medical imaging introduces two principal challenges: (1) modality-specific vision-language alignment and (2) cross-modal feature fusion. In this work, we propose MedMAP, a Medical Modality-Aware Pretraining framework that enhances vision-language representation learning in 3D MRI. MedMAP comprises a modality-aware vision-language alignment stage and a fine-tuning stage for multi-organ abnormality detection. During the pre-training stage, the modality-aware encoders implicitly capture the joint modality distribution and improve alignment between visual and textual representations. We then fine-tune the pre-trained vision encoders (while keeping the text encoder frozen) for downstream tasks. To this end, we curated MedMoM-MRI3D, comprising 7,392 3D MRI volume-report pairs spanning twelve MRI modalities and nine abnormalities tailored for various 3D medical analysis tasks. Extensive experiments on MedMoM-MRI3D demonstrate that MedMAP significantly outperforms existing VLMs in 3D MRI-based multi-organ abnormality detection. Our code is available at https://github.com/RomantiDr/MedMAP.

[99] ProtoDCS: Towards Robust and Efficient Open-Set Test-Time Adaptation for Vision-Language Models

Wei Luo, Yangfan Ou, Jin Deng, Zeshuai Deng, Xiquan Yan, Zhiquan Wen, Mingkui Tan

Main category: cs.CV

TL;DR: ProtoDCS: A robust open-set test-time adaptation framework for vision-language models that separates covariate-shifted in-distribution and out-of-distribution samples using probabilistic verification and prototype-level updates.

DetailsMotivation: Current VLM-based test-time adaptation methods fail in open-set scenarios where test streams contain both covariate-shifted in-distribution and out-of-distribution data, leading to interference and overconfident predictions. Existing methods use brittle thresholding and computationally expensive parameter updates.

Method: Proposes Prototype-based Double-Check Separation (ProtoDCS) with: (1) double-check separation using probabilistic Gaussian Mixture Model verification instead of hard thresholds, and (2) evidence-driven adaptation with uncertainty-aware loss and efficient prototype-level updates.
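
The "double-check" idea replaces a hard threshold with a posterior under a two-component mixture. A 1-D illustration with fixed component parameters (the paper fits the GMM on the test stream; the parameters here are assumptions):

```python
import numpy as np

def posterior_id(score, mu_id, sd_id, mu_ood, sd_ood, prior_id=0.5):
    """P(csID | score) under a two-component 1-D Gaussian mixture --
    a soft, calibrated alternative to thresholding the raw score."""
    def pdf(x, mu, sd):
        return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

    p_id = prior_id * pdf(score, mu_id, sd_id)
    p_ood = (1.0 - prior_id) * pdf(score, mu_ood, sd_ood)
    return p_id / (p_id + p_ood)
```

Samples with an ambiguous posterior (near 0.5) can then be withheld from adaptation rather than forced to one side, which is what makes the separation robust.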

Result: Extensive experiments on CIFAR-10/100-C and Tiny-ImageNet-C show state-of-the-art performance, significantly improving both known-class accuracy and OOD detection metrics.

Conclusion: ProtoDCS effectively addresses open-set test-time adaptation challenges for VLMs by providing robust separation of ID/OOD samples and efficient adaptation, overcoming limitations of existing methods.

Abstract: Large-scale Vision-Language Models (VLMs) exhibit strong zero-shot recognition, yet their real-world deployment is challenged by distribution shifts. While Test-Time Adaptation (TTA) can mitigate this, existing VLM-based TTA methods operate under a closed-set assumption, failing in open-set scenarios where test streams contain both covariate-shifted in-distribution (csID) and out-of-distribution (csOOD) data. This leads to a critical difficulty: the model must discriminate unknown csOOD samples to avoid interference while simultaneously adapting to known csID classes for accuracy. Current open-set TTA (OSTTA) methods rely on hard thresholds for separation and entropy minimization for adaptation. These strategies are brittle, often misclassifying ambiguous csOOD samples and inducing overconfident predictions, and their parameter-update mechanism is computationally prohibitive for VLMs. To address these limitations, we propose Prototype-based Double-Check Separation (ProtoDCS), a robust framework for OSTTA that effectively separates csID and csOOD samples, enabling safe and efficient adaptation of VLMs to csID data. Our main contributions are: (1) a novel double-check separation mechanism employing probabilistic Gaussian Mixture Model (GMM) verification to replace brittle thresholding; and (2) an evidence-driven adaptation strategy utilizing uncertainty-aware loss and efficient prototype-level updates, mitigating overconfidence and reducing computational overhead. Extensive experiments on CIFAR-10/100-C and Tiny-ImageNet-C demonstrate that ProtoDCS achieves state-of-the-art performance, significantly boosting both known-class accuracy and OOD detection metrics. Code will be available at https://github.com/O-YangF/ProtoDCS.

[100] Suppressing Prior-Comparison Hallucinations in Radiology Report Generation via Semantically Decoupled Latent Steering

Ao Li, Rui Liu, Mingjie Li, Sheng Liu, Lei Wang, Xiaodan Liang, Lina Yao, Xiaojun Chang, Lei Xing

Main category: cs.CV

TL;DR: A training-free inference-time control framework called Semantically Decoupled Latent Steering (SDLS) that reduces prior-comparison hallucinations in radiology report generation by using LLM-driven semantic decomposition and QR-based orthogonalization to create semantic-free intervention vectors.

DetailsMotivation: Automated radiology report generation using vision-language models suffers from prior-comparison hallucination, where models generate historical findings not supported by current studies. Existing methods face a trade-off between hallucination suppression and clinical accuracy.

Method: SDLS constructs semantic-free intervention vectors via LLM-driven semantic decomposition followed by QR-based orthogonalization. This filters out clinical semantics entangled in PCA directions, ensuring steering targets only the “historical comparison” axis without retraining.
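
The QR step amounts to projecting the raw steering direction onto the orthogonal complement of the known semantic directions; a minimal numpy sketch (the direction matrix and names are illustrative):

```python
import numpy as np

def orthogonalize_steering(v, semantic_dirs):
    """Remove the components of steering vector `v` that lie in the span
    of the clinical-semantic directions, keeping only the residual
    'historical comparison' axis."""
    # QR gives an orthonormal basis Q for the semantic subspace
    Q, _ = np.linalg.qr(np.asarray(semantic_dirs, dtype=float).T)
    return v - Q @ (Q.T @ v)
```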

Result: On BiomedGPT, SDLS reduces historical hallucinations (FilBERT score from 0.2373 to 0.1889) and improves clinical label fidelity (CheXpert macro-F1 from 0.2242 to 0.3208) on MIMIC-CXR. Zero-shot transfer to CheXpert Plus and IU-Xray shows robustness.

Conclusion: SDLS effectively addresses prior-comparison hallucination in radiology report generation without retraining, overcoming the trade-off between hallucination suppression and clinical accuracy through geometric constraints and semantic decoupling.

Abstract: Automated radiology report generation using vision-language models (VLMs) is limited by the risk of prior-comparison hallucination, where the model generates historical findings unsupported by the current study. We address this challenge with a training-free, inference-time control framework termed Semantically Decoupled Latent Steering (SDLS). Unlike generic activation steering, which often suffers from semantic entanglement, our approach constructs a semantic-free intervention vector via large language model (LLM)-driven semantic decomposition followed by QR-based orthogonalization. This orthogonalization step is critical. It leverages geometric constraints to filter out the clinical semantics often entangled in standard principal component analysis (PCA) directions, ensuring that the steering vector targets only the "historical comparison" axis. We validate our method on the BiomedGPT foundation model, demonstrating that it overcomes the trade-off between hallucination suppression and clinical accuracy. Extensive experiments on MIMIC-CXR and zero-shot transfer evaluation on CheXpert Plus and IU-Xray demonstrate the robustness of our approach. Quantitative evaluations on MIMIC-CXR show that our approach significantly reduces the probability of historical hallucinations (FilBERT score decreases from 0.2373 to 0.1889) and improves clinical label fidelity (CheXpert macro-F1 increases from 0.2242 to 0.3208). Supplementary evaluations confirm that the structural integrity of the clinical narrative is maintained.

[101] Multi-illuminant Color Constancy via Multi-scale Illuminant Estimation and Fusion

Hang Luo, Rongwei Li, Jinxing Liang

Main category: cs.CV

TL;DR: A multi-scale approach for multi-illuminant color constancy using a tri-branch CNN with attention-based fusion of multi-grained illuminant maps.

DetailsMotivation: Existing deep learning methods for multi-illuminant color constancy directly map images to illumination maps but neglect the impact of image scales, which is important for accurate pixel-wise illuminant estimation.

Method: Represents the illuminant map as a linear combination of components estimated from multi-scale images. Uses a tri-branch convolutional network to estimate multi-grained illuminant distribution maps from the multi-scale images, then merges them adaptively with an attentional illuminant fusion module.
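
The "linear combination with attentional weights" reduces to a per-pixel softmax over scales; an illustrative sketch (in the paper the attention scores are learned by the fusion module, here they are inputs):

```python
import numpy as np

def fuse_illuminant_maps(maps, scores):
    """Blend per-scale illuminant maps with softmax attention weights.

    maps:   (S, H, W, 3) illuminant estimates from S image scales
    scores: (S, H, W, 1) per-pixel attention logits
    """
    w = np.exp(scores - scores.max(axis=0, keepdims=True))
    w = w / w.sum(axis=0, keepdims=True)  # softmax over the scale axis
    return (w * maps).sum(axis=0)
```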

Result: Comprehensive experimental analysis demonstrates the method’s effectiveness and achieves state-of-the-art performance in multi-illuminant color constancy.

Conclusion: The multi-scale approach with attention-based fusion of multi-grained illuminant maps significantly improves multi-illuminant color constancy performance by better handling scale variations in images.

Abstract: Multi-illuminant color constancy methods aim to eliminate local color casts within an image through pixel-wise illuminant estimation. Existing methods mainly employ deep learning to establish a direct mapping between an image and its illumination map, which neglects the impact of image scales. To alleviate this problem, we represent an illuminant map as the linear combination of components estimated from multi-scale images. Furthermore, we propose a tri-branch convolutional network to estimate multi-grained illuminant distribution maps from multi-scale images. These multi-grained illuminant maps are merged adaptively with an attentional illuminant fusion module. Through comprehensive experimental analysis and evaluation, the results demonstrate the effectiveness of our method, and it has achieved state-of-the-art performance.

[102] Vision-Language Semantic Grounding for Multi-Domain Crop-Weed Segmentation

Nazia Hossain, Xintong Jiang, Yu Tian, Philippe Seguin, O. Grant Clark, Shangpeng Sun

Main category: cs.CV

TL;DR: VL-WS is a vision-language framework for fine-grained crop-weed segmentation that uses CLIP embeddings and natural language conditioning to improve generalization across diverse agricultural environments.

DetailsMotivation: Existing deep learning models for crop-weed segmentation struggle to generalize across heterogeneous agricultural environments due to reliance on dataset-specific visual features, limiting practical deployment in precision agriculture.

Method: Proposes VL-WS with dual-encoder design: frozen CLIP embeddings fused with task-specific spatial features, modulated via FiLM layers conditioned on natural language captions. Trained on unified corpus including ground and UAV imagery across diverse conditions.
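
FiLM conditioning itself is just a two-parameter affine transform per channel; sketched in numpy (in VL-WS the gamma/beta would be predicted from the CLIP caption embedding, which is omitted here):

```python
import numpy as np

def film(features, gamma, beta):
    """Feature-wise Linear Modulation: scale and shift every channel of a
    (C, H, W) feature map with caption-conditioned parameters gamma/beta,
    each of shape (C,)."""
    return gamma[:, None, None] * features + beta[:, None, None]
```

Because the modulation is channel-wise and spatially uniform, the text can steer which features matter without disturbing the fine-grained spatial localization the summary emphasizes.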

Result: Achieves a mean Dice score of 91.64%, outperforming the CNN baseline by 4.98%. Largest gains on the most challenging weed class: 80.45% vs 65.03% (a 15.42% improvement). Maintains stable performance under limited target-domain supervision.

Conclusion: Vision-language alignment enables scalable, label-efficient segmentation models deployable across diverse real-world agricultural domains, demonstrating improved generalization and data efficiency.

Abstract: Fine-grained crop-weed segmentation is essential for enabling targeted herbicide application in precision agriculture. However, existing deep learning models struggle to generalize across heterogeneous agricultural environments due to reliance on dataset-specific visual features. We propose Vision-Language Weed Segmentation (VL-WS), a novel framework that addresses this limitation by grounding pixel-level segmentation in semantically aligned, domain-invariant representations. Our architecture employs a dual-encoder design, where frozen Contrastive Language-Image Pretraining (CLIP) embeddings and task-specific spatial features are fused and modulated via Feature-wise Linear Modulation (FiLM) layers conditioned on natural language captions. This design enables image-level textual descriptions to guide channel-wise feature refinement while preserving fine-grained spatial localization. Unlike prior works restricted to training and evaluation on single-source datasets, VL-WS is trained on a unified corpus that includes close-range ground imagery (robotic platforms) and high-altitude UAV imagery, covering diverse crop types, weed species, growth stages, and sensing conditions. Experimental results across four benchmark datasets demonstrate the effectiveness of our framework, with VL-WS achieving a mean Dice score of 91.64% and outperforming the CNN baseline by 4.98%. The largest gains occur on the most challenging weed class, where VL-WS attains 80.45% Dice score compared to 65.03% for the best baseline, representing a 15.42% improvement. VL-WS further maintains stable weed segmentation performance under limited target-domain supervision, indicating improved generalization and data efficiency. These findings highlight the potential of vision-language alignment to enable scalable, label-efficient segmentation models deployable across diverse real-world agricultural domains.

[103] Any Model, Any Place, Any Time: Get Remote Sensing Foundation Model Embeddings On Demand

Dingqi Ye, Daniel Kiv, Wei Hu, Jimeng Shi, Shaowen Wang

Main category: cs.CV

TL;DR: rs-embed is a Python library that provides a unified interface for accessing and comparing remote sensing foundation model embeddings across different formats and platforms.

DetailsMotivation: The remote sensing community faces challenges in adopting and fairly comparing foundation models due to heterogeneity in model release formats, platforms, interfaces, and input data specifications, which increases costs for obtaining, using, and benchmarking embeddings.

Method: Developed a Python library with a unified, region-of-interest (ROI) centric interface that allows users to retrieve embeddings from any supported model for any location and time range with a single line of code, plus efficient batch processing for large-scale operations.

Result: Created an open-source library (rs-embed) that simplifies access to remote sensing foundation model embeddings, enabling easier comparison and adoption across different models and platforms.

Conclusion: rs-embed addresses practical adoption barriers in remote sensing foundation models by providing a standardized interface that reduces complexity and cost for embedding retrieval and comparison.

Abstract: The remote sensing community is witnessing a rapid growth of foundation models, which provide powerful embeddings for a wide range of downstream tasks. However, practical adoption and fair comparison remain challenging due to substantial heterogeneity in model release formats, platforms and interfaces, and input data specifications. These inconsistencies significantly increase the cost of obtaining, using, and benchmarking embeddings across models. To address this issue, we propose rs-embed, a Python library that offers a unified, region of interest (ROI) centric interface: with a single line of code, users can retrieve embeddings from any supported model for any location and any time range. The library also provides efficient batch processing to enable large-scale embedding generation and evaluation. The code is available at: https://github.com/cybergis/rs-embed

[104] Beyond Ground: Map-Free LiDAR Relocalization for UAVs

Hengyu Mu, Jianshi Wu, Yuxin Guo, XianLian Lin, Qingyong Hu, Sheng Ao, Chenglu Wen, Cheng Wang

Main category: cs.CV

TL;DR: MAILS is a map-free LiDAR relocalization framework for UAVs that addresses challenges of sparse point clouds, yaw rotations, and altitude variations using novel attention mechanisms and feature encoding, with a new UAV-specific dataset.

DetailsMotivation: Existing LiDAR relocalization methods are designed for autonomous driving and perform poorly in UAV scenarios due to different flight characteristics like irregular trajectories, altitude variations, and sparse point clouds. There's also a lack of appropriate datasets capturing real UAV flight patterns.

Method: Proposes MAILS framework with: 1) Locality-Preserving Sliding Window Attention for extracting discriminative geometric features from sparse point clouds, 2) Coordinate-independent feature initialization module, 3) Locally invariant positional encoding mechanism to handle yaw rotations and altitude variations, and 4) A new large-scale UAV LiDAR localization dataset with four scenes and various flight trajectories.

Result: Extensive experiments show the method achieves satisfactory localization precision and consistently outperforms existing techniques by a significant margin. The new dataset enables proper evaluation of UAV relocalization under realistic conditions.

Conclusion: MAILS provides an effective map-free LiDAR relocalization solution for UAVs that addresses the unique challenges of UAV flight scenarios, with both methodological innovations and a valuable new dataset for the research community.

Abstract: Localization is a fundamental capability in unmanned aerial vehicle (UAV) systems. Map-free LiDAR relocalization offers an effective solution for achieving high-precision positioning in environments with weak or unavailable GNSS signals. However, existing LiDAR relocalization methods are primarily tailored to autonomous driving, exhibiting significantly degraded accuracy in UAV scenarios. In this paper, we propose MAILS, a novel map-free LiDAR relocalization framework for UAVs. A Locality-Preserving Sliding Window Attention module is first introduced to extract locally discriminative geometric features from sparse point clouds. To handle substantial yaw rotations and altitude variations encountered during UAV flight, we then design a coordinate-independent feature initialization module and a locally invariant positional encoding mechanism, which together significantly enhance the robustness of feature extraction. Furthermore, existing LiDAR-based relocalization datasets fail to capture real-world UAV flight characteristics, such as irregular trajectories and varying altitudes. To address this gap, we construct a large-scale LiDAR localization dataset for UAVs, which comprises four scenes and various flight trajectories, designed to evaluate UAV relocalization performance under realistic conditions. Extensive experiments demonstrate that our method achieves satisfactory localization precision and consistently outperforms existing techniques by a significant margin. Our code and dataset will be released soon.

[105] Towards Source-Aware Object Swapping with Initial Noise Perturbation

Jiahui Zhan, Xianbing Sun, Xiangnan Zhu, Yikun Ji, Ruitong Liu, Liqing Zhang, Jianfu Zhang

Main category: cs.CV

TL;DR: SourceSwap is a self-supervised framework for object swapping that learns cross-object alignment without requiring per-object finetuning, paired data, or videos.

DetailsMotivation: Existing object swapping methods either require slow per-object finetuning or rely on paired data showing the same object in different contexts, which forces models to use background cues rather than learning true cross-object alignment.

Method: Uses frequency-separated perturbation in initial-noise space to synthesize pseudo pairs from any image, altering appearance while preserving pose, shape, and scene layout. Trains a dual U-Net with full-source conditioning and noise-free reference encoder for direct inter-object alignment.
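
The frequency-separated perturbation can be sketched as resampling only the high-frequency band of the initial noise, which leaves the layout-bearing low frequencies intact (a hedged illustration; the circular cutoff and hard band split are assumptions, not the paper's exact scheme):

```python
import numpy as np

def frequency_separated_perturb(noise, cutoff, seed=0):
    """Keep the low-frequency band of a 2-D noise map (global layout/pose)
    and replace the high-frequency band with fresh noise (appearance
    detail), yielding one half of a pseudo pair."""
    rng = np.random.default_rng(seed)
    h, w = noise.shape
    F = np.fft.fftshift(np.fft.fft2(noise))
    F_new = np.fft.fftshift(np.fft.fft2(rng.standard_normal((h, w))))
    # radial low-pass mask in centered frequency coordinates
    yy, xx = np.mgrid[-(h // 2):h - h // 2, -(w // 2):w - w // 2]
    low = (yy ** 2 + xx ** 2) <= cutoff ** 2
    mixed = np.where(low, F, F_new)
    return np.real(np.fft.ifft2(np.fft.ifftshift(mixed)))
```

Decoding both the original and the perturbed noise through the same diffusion model would then give two images that share pose and layout but differ in appearance, which is exactly the supervision signal the pseudo pairs provide.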

Result: Achieves superior fidelity, stronger scene preservation, and more natural harmony compared to existing methods. Enables zero-shot inference without per-object finetuning and transfers well to related editing tasks like subject-driven refinement and face swapping.

Conclusion: SourceSwap provides an effective self-supervised framework for object swapping that learns cross-object alignment through synthesized pseudo pairs, enabling high-quality, zero-shot object replacement with better scene preservation and harmony.

Abstract: Object swapping aims to replace a source object in a scene with a reference object while preserving object fidelity, scene fidelity, and object-scene harmony. Existing methods either require per-object finetuning and slow inference or rely on extra paired data that mostly depict the same object across contexts, forcing models to rely on background cues rather than learning cross-object alignment. We propose SourceSwap, a self-supervised and source-aware framework that learns cross-object alignment. Our key insight is to synthesize high-quality pseudo pairs from any image via a frequency-separated perturbation in the initial-noise space, which alters appearance while preserving pose, coarse shape, and scene layout, requiring no videos, multi-view data, or additional images. We then train a dual U-Net with full-source conditioning and a noise-free reference encoder, enabling direct inter-object alignment, zero-shot inference without per-object finetuning, and lightweight iterative refinement. We further introduce SourceBench, a high-quality benchmark with higher resolution, more categories, and richer interactions. Experiments demonstrate that SourceSwap achieves superior fidelity, stronger scene preservation, and more natural harmony, and it transfers well to edits such as subject-driven refinement and face swapping.

[106] HiDrop: Hierarchical Vision Token Reduction in MLLMs via Late Injection, Concave Pyramid Pruning, and Early Exit

Hao Wu, Yingqi Fan, Jinyang Dai, Junlong Tong, Yunpu Ma, Xiaoyu Shen

Main category: cs.CV

TL;DR: HiDrop prunes about 90% of vision tokens in MLLMs, cutting their quadratic computational cost while matching original performance, by aligning hierarchical token pruning with multimodal fusion patterns.

DetailsMotivation: The quadratic computational cost of processing vision tokens in MLLMs limits their adoption. Current progressive token pruning methods misinterpret shallow layer functions and use rigid schedules, failing to unlock full efficiency potential.

Method: HiDrop aligns token pruning with MLLM’s hierarchical function: 1) Late Injection bypasses passive shallow layers to introduce visual tokens where active fusion begins; 2) Concave Pyramid Pruning with Early Exit dynamically adjusts pruning rates across middle/deep layers using inter-layer similarity and differentiable top-k operator. Includes persistent positional encoding, FlashAttention-compatible token selection, and parallel decoupling of vision computation.
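
The Early Exit criterion can be illustrated with a simple inter-layer similarity check: once consecutive layers barely change the hidden states, further vision-token refinement is skipped. The function and threshold below are a hypothetical sketch, not HiDrop's actual rule.

```python
import numpy as np

def should_early_exit(h_prev, h_curr, tau=0.95):
    """Exit once consecutive layers' hidden states are nearly identical.

    `tau` is an illustrative threshold; HiDrop's actual mechanism also
    involves a differentiable top-k operator not shown here."""
    a, b = h_prev.ravel(), h_curr.ravel()
    cos = float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    return cos >= tau
```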

Result: Compresses about 90% of visual tokens while matching the original performance, accelerates training by 1.72×, and sets a new SOTA for efficient MLLM training and inference.

Conclusion: HiDrop provides an efficient MLLM framework that maintains performance while significantly reducing computational cost, offering insights into hierarchical nature of multimodal fusion.

Abstract: The quadratic computational cost of processing vision tokens in Multimodal Large Language Models (MLLMs) hinders their widespread adoption. While progressive vision token pruning offers a promising solution, current methods misinterpret shallow layer functions and use rigid schedules, which fail to unlock the full efficiency potential. To address these issues, we propose HiDrop, a framework that aligns token pruning with the true hierarchical function of MLLM layers. HiDrop features two key innovations: (1) Late Injection, which bypasses passive shallow layers to introduce visual tokens exactly where active fusion begins; and (2) Concave Pyramid Pruning with an Early Exit mechanism to dynamically adjust pruning rates across middle and deep layers. This process is optimized via an inter-layer similarity measure and a differentiable top-k operator. To ensure practical efficiency, HiDrop further incorporates persistent positional encoding, FlashAttention-compatible token selection, and parallel decoupling of vision computation to eliminate hidden overhead associated with dynamic token reduction. Extensive experiments show that HiDrop compresses about 90% of visual tokens while matching the original performance and accelerating training by 1.72 times. Our work not only sets a new state-of-the-art for efficient MLLM training and inference but also provides valuable insights into the hierarchical nature of multimodal fusion. The code is released at https://github.com/EIT-NLP/HiDrop.

[107] EgoGraph: Temporal Knowledge Graph for Egocentric Video Understanding

Shitong Sun, Ke Han, Yukai Huang, Weitong Cai, Jifei Song

Main category: cs.CV

TL;DR: EgoGraph is a training-free framework that constructs dynamic knowledge graphs from ultra-long egocentric videos to capture long-term, cross-entity dependencies for better video understanding.

DetailsMotivation: Existing video understanding approaches struggle with ultra-long egocentric videos spanning multiple days due to fragmented local processing and limited temporal modeling, restricting their ability to reason over extended sequences.

Method: EgoGraph uses a novel egocentric schema to unify extraction and abstraction of core entities (people, objects, locations, events) and structurally reasons about their attributes and interactions. It employs temporal relational modeling to capture dependencies across entities and accumulate stable long-term memory over multiple days.
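
A toy version of such a temporal knowledge graph (entities plus time-stamped interaction edges that can be queried over a window) might look like the following; all names here are illustrative, not the paper's API.

```python
from collections import defaultdict

class TemporalKG:
    """Toy temporal knowledge graph: typed entities and interval-stamped edges."""

    def __init__(self):
        self.entities = {}              # id -> {"type": ..., "attrs": {...}}
        self.edges = defaultdict(list)  # (src, dst) -> [(relation, t_start, t_end)]

    def add_entity(self, eid, etype, **attrs):
        self.entities[eid] = {"type": etype, "attrs": attrs}

    def add_interaction(self, src, dst, relation, t_start, t_end):
        self.edges[(src, dst)].append((relation, t_start, t_end))

    def interactions_between(self, src, dst, t0, t1):
        # Return edges whose time interval overlaps the query window [t0, t1].
        return [e for e in self.edges[(src, dst)] if e[1] < t1 and e[2] > t0]
```

Long-term question answering would then reduce to structured queries over such accumulated edges rather than re-reading raw video.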

Result: EgoGraph achieves state-of-the-art performance on long-term video question answering benchmarks (EgoLifeQA and EgoR1-bench), demonstrating effectiveness for ultra-long egocentric video understanding.

Conclusion: EgoGraph presents a new paradigm for ultra-long egocentric video understanding through training-free dynamic knowledge graph construction that captures rich semantic representations and enables complex temporal reasoning.

Abstract: Ultra-long egocentric videos spanning multiple days present significant challenges for video understanding. Existing approaches still rely on fragmented local processing and limited temporal modeling, restricting their ability to reason over such extended sequences. To address these limitations, we introduce EgoGraph, a training-free and dynamic knowledge-graph construction framework that explicitly encodes long-term, cross-entity dependencies in egocentric video streams. EgoGraph employs a novel egocentric schema that unifies the extraction and abstraction of core entities, such as people, objects, locations, and events, and structurally reasons about their attributes and interactions, yielding a significantly richer and more coherent semantic representation than traditional clip-based video models. Crucially, we develop a temporal relational modeling strategy that captures temporal dependencies across entities and accumulates stable long-term memory over multiple days, enabling complex temporal reasoning. Extensive experiments on the EgoLifeQA and EgoR1-bench benchmarks demonstrate that EgoGraph achieves state-of-the-art performance on long-term video question answering, validating its effectiveness as a new paradigm for ultra-long egocentric video understanding.

[108] Can Unified Generation and Understanding Models Maintain Semantic Equivalence Across Different Output Modalities?

Hongbo Jiang, Jie Li, Yunhang Shen, Pingyang Dai, Xing Sun, Haoyu Cao, Liujuan Cao

Main category: cs.CV

TL;DR: U-MLLMs fail to maintain semantic equivalence across modalities despite strong textual reasoning and visual rendering capabilities, revealing a breakdown in cross-modal semantic alignment rather than generation fidelity issues.

DetailsMotivation: Current evaluations of Unified Multimodal Large Language Models (U-MLLMs) assess understanding and generation separately, overlooking semantic equivalence - the ability to produce consistent reasoning results regardless of output modality. The authors investigate whether current U-MLLMs truly satisfy this fundamental premise of unified multimodal models.

Method: Introduces VGUBench, a diagnostic framework with three tasks: 1) Textual Generative Understanding (baseline textual reasoning), 2) Visual Generative Understanding (generating visual answers), and 3) Visual Rendering control task (rendering explicit descriptions without reasoning). This framework decouples reasoning logic from generation fidelity to diagnose cross-modal alignment issues.

Result: U-MLLMs show strong performance in textual understanding and visual rendering, but exhibit significant performance collapse when required to generate visual answers to questions. There’s negligible correlation between visual answering performance and basic rendering quality, indicating the failure stems from cross-modal semantic alignment breakdown rather than generation fidelity issues.

Conclusion: The semantic equivalence premise of U-MLLMs is not satisfied - models fail to maintain consistent reasoning across modalities despite strong individual capabilities. The breakdown occurs in cross-modal semantic alignment, not generation quality. The findings provide diagnostic insights for improving future unified multimodal models.

Abstract: Unified Multimodal Large Language Models (U-MLLMs) integrate understanding and generation within a single architecture. However, existing evaluations typically assess these capabilities separately, overlooking semantic equivalence, i.e., the ability to manifest consistent reasoning results regardless of the output modality. In this work, we investigate whether current U-MLLMs satisfy this premise. We observe that while models demonstrate robust textual reasoning, they fail to maintain semantic equivalence when required to render the same results in the image modality. To rigorously diagnose this discrepancy, we introduce VGUBench, a framework to decouple reasoning logic from generation fidelity. VGUBench comprises three diagnostic tasks: (1)Textual Generative Understanding, establishing a baseline for reasoning accuracy in textual response; (2)Visual Generative Understanding, evaluating the ability to generate visual responses that represent the correct answer; and (3)a Visual Rendering control task, which assesses the ability to directly render explicit visual descriptions into images without complex reasoning. Our evaluation reveals a significant disparity: despite strong performance in textual understanding and visual rendering, U-MLLMs exhibit a marked performance collapse when required to generate visual answers to questions. Furthermore, we find a negligible correlation between visual answering performance and basic rendering quality. These results suggest that the failure stems not from insufficient generation fidelity, but from a breakdown in cross-modal semantic alignment. We provide diagnostic insights to address this challenge in future Unified Generation and Understanding Models.

[109] A Difference-in-Difference Approach to Detecting AI-Generated Images

Xinyi Qi, Kai Ye, Chengchun Shi, Ying Yang, Hongyi Zhou, Jin Zhu

Main category: cs.CV

TL;DR: A novel difference-in-difference method for detecting AI-generated images by computing second-order differences in reconstruction error rather than first-order differences, improving detection accuracy as AI-generated images become more realistic.

DetailsMotivation: As diffusion models produce AI-generated images nearly indistinguishable from real ones, existing detectors based on reconstruction error become less effective. There's a need for more robust detection methods to address potential misuse of generative AI.

Method: Proposes a difference-in-difference approach that computes the difference in reconstruction error (second-order difference) rather than directly using reconstruction error (first-order difference). This variance reduction technique improves detection accuracy by focusing on more subtle patterns.
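
One plausible reading of the second-order statistic (hedged, since the paper's exact construction is not given in this summary) is to reconstruct twice and compare the two reconstruction errors; here `reconstruct` stands in for a diffusion inversion-and-regeneration pass.

```python
import numpy as np

def did_statistic(image, reconstruct):
    """Difference-in-difference detector statistic (illustrative sketch)."""
    r1 = reconstruct(image)            # first reconstruction
    r2 = reconstruct(r1)               # reconstruction of the reconstruction
    first = image - r1                 # first-order difference (usual error)
    second = first - (r1 - r2)         # difference in reconstruction error
    return float(np.mean(np.abs(second)))
```

A detector would threshold this score; the second-order form is intended to cancel shared variance between the two error terms.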

Result: Extensive experiments demonstrate strong generalization performance, enabling reliable detection of AI-generated images even as they become increasingly similar to real images.

Conclusion: The proposed second-order difference method provides a more effective approach for detecting AI-generated images in the era of advanced generative AI, addressing limitations of traditional reconstruction error-based detectors.

Abstract: Diffusion models are able to produce AI-generated images that are almost indistinguishable from real ones. This raises concerns about their potential misuse and poses substantial challenges for detecting them. Many existing detectors rely on reconstruction error – the difference between the input image and its reconstructed version – as the basis for distinguishing real from fake images. However, these detectors become less effective as modern AI-generated images become increasingly similar to real ones. To address this challenge, we propose a novel difference-in-difference method. Instead of directly using the reconstruction error (a first-order difference), we compute the difference in reconstruction error – a second-order difference – for variance reduction and improving detection accuracy. Extensive experiments demonstrate that our method achieves strong generalization performance, enabling reliable detection of AI-generated images in the era of generative AI.

[110] UTPTrack: Towards Simple and Unified Token Pruning for Visual Tracking

Hao Wu, Xudong Wang, Jialiang Zhang, Junlong Tong, Xinghao Chen, Junyan Lin, Yunpu Ma, Xiaoyu Shen

Main category: cs.CV

TL;DR: UTPTrack introduces a unified token pruning framework for Transformer-based visual object trackers that jointly compresses search-region, dynamic-template, and static-template tokens using an attention-guided, token-type-aware strategy to improve efficiency while maintaining accuracy.

DetailsMotivation: Existing token pruning methods for Transformer-based trackers are fragmented, pruning different components (search region, dynamic template, static template) in isolation without considering inter-component dependencies, leading to suboptimal pruning and degraded accuracy.

Method: UTPTrack employs an attention-guided, token type-aware strategy to holistically model redundancy across all three components, enabling unified token pruning that supports both RGB-based tracking and multimodal/language-guided tracking within a single model.
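
A minimal sketch of attention-guided, token-type-aware pruning: score search-region tokens by their attention to the template and keep the top-k. This simplification keeps all template tokens, whereas UTPTrack jointly compresses all three components; names and the keep ratio are illustrative.

```python
import numpy as np

def prune_tokens(tokens, attn_to_template, token_type, keep_ratio=0.35):
    """Keep the top-k search-region tokens by template attention (sketch).

    token_type: 0 = search region (prunable here), nonzero = template (kept)."""
    search = np.where(token_type == 0)[0]
    k = max(1, int(keep_ratio * len(search)))
    keep_search = search[np.argsort(attn_to_template[search])[-k:]]
    keep = np.sort(np.concatenate([keep_search, np.where(token_type != 0)[0]]))
    return tokens[keep], keep
```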

Result: UTPTrack achieves state-of-the-art accuracy-efficiency trade-off, pruning 65.4% of vision tokens in RGB tracking and 67.5% in unified tracking while preserving 99.7% and 100.5% of baseline performance respectively, as demonstrated on 10 benchmarks.

Conclusion: UTPTrack provides a robust foundation for efficient visual tracking research with strong performance across both RGB and multimodal scenarios, demonstrating the effectiveness of unified token pruning for Transformer-based trackers.

Abstract: One-stream Transformer-based trackers achieve advanced performance in visual object tracking but suffer from significant computational overhead that hinders real-time deployment. While token pruning offers a path to efficiency, existing methods are fragmented. They typically prune the search region, dynamic template, and static template in isolation, overlooking critical inter-component dependencies, which yields suboptimal pruning and degraded accuracy. To address this, we introduce UTPTrack, a simple and Unified Token Pruning framework that, for the first time, jointly compresses all three components. UTPTrack employs an attention-guided, token type-aware strategy to holistically model redundancy, a design that seamlessly supports unified tracking across multimodal and language-guided tasks within a single model. Extensive evaluations on 10 benchmarks demonstrate that UTPTrack achieves a new state-of-the-art in the accuracy-efficiency trade-off for pruning-based trackers, pruning 65.4% of vision tokens in RGB-based tracking and 67.5% in unified tracking while preserving 99.7% and 100.5% of baseline performance, respectively. This strong performance across both RGB and multimodal scenarios underlines its potential as a robust foundation for future research in efficient visual tracking. Code will be released at https://github.com/EIT-NLP/UTPTrack.

[111] U-Mind: A Unified Framework for Real-Time Multimodal Interaction with Audiovisual Generation

Xiang Deng, Feng Gao, Yong Zhang, Youxin Pang, Xu Xiaoming, Zhuoliang Kang, Xiaoming Wei, Yebin Liu

Main category: cs.CV

TL;DR: U-Mind is a unified multimodal dialogue system that jointly models language, speech, motion, and video synthesis in real-time using a unified alignment and reasoning framework with segment-wise alignment and rehearsal-driven learning.

DetailsMotivation: Existing multimodal systems are limited to unimodal generation or suffer from degraded reasoning and poor cross-modal alignment, preventing coherent and perceptually grounded interactions needed for intelligent embodied agents.

Method: U-Mind implements a Unified Alignment and Reasoning Framework with segment-wise alignment strategy for cross-modal synchronization and Rehearsal-Driven Learning to preserve reasoning abilities. It uses text-first decoding with internal chain-of-thought planning followed by temporally synchronized generation across modalities, plus a real-time video rendering framework conditioned on pose and speech.

Result: U-Mind achieves state-of-the-art performance on a range of multimodal interaction tasks including question answering, instruction following, and motion generation, demonstrating effective real-time multimodal synthesis.

Conclusion: U-Mind represents a significant step toward intelligent, immersive conversational agents by enabling high-intelligence multimodal dialogue with real-time generation across language, speech, motion, and video within a single interactive loop.

Abstract: Full-stack multimodal interaction in real-time is a central goal in building intelligent embodied agents capable of natural, dynamic communication. However, existing systems are either limited to unimodal generation or suffer from degraded reasoning and poor cross-modal alignment, preventing coherent and perceptually grounded interactions. In this work, we introduce U-Mind, the first unified system for high-intelligence multimodal dialogue that supports real-time generation and jointly models language, speech, motion, and video synthesis within a single interactive loop. At its core, U-Mind implements a Unified Alignment and Reasoning Framework that addresses two key challenges: enhancing cross-modal synchronization via a segment-wise alignment strategy, and preserving reasoning abilities through Rehearsal-Driven Learning. During inference, U-Mind adopts a text-first decoding pipeline that performs internal chain-of-thought planning followed by temporally synchronized generation across modalities. To close the loop, we implement a real-time video rendering framework conditioned on pose and speech, enabling expressive and synchronized visual feedback. Extensive experiments demonstrate that U-Mind achieves state-of-the-art performance on a range of multimodal interaction tasks, including question answering, instruction following, and motion generation, paving the way toward intelligent, immersive conversational agents.

[112] Learning Accurate Segmentation Purely from Self-Supervision

Zuyao You, Zuxuan Wu, Yu-Gang Jiang

Main category: cs.CV

TL;DR: Selfment: A fully self-supervised framework for object segmentation without any manual annotations, pretrained models, or post-processing, achieving state-of-the-art results on multiple benchmarks.

DetailsMotivation: The core challenge in computer vision is accurately segmenting objects without manual annotations. Current methods often rely on human labels, pretrained models, or post-processing, which limits their applicability and generalization.

Method: Selfment constructs patch-level affinity graphs from self-supervised features and applies Normalized Cut (NCut) for initial foreground-background separation. It then introduces Iterative Patch Optimization (IPO) for feature-space refinement through iterative patch clustering to enforce spatial coherence and semantic consistency. Finally, refined masks train a lightweight segmentation head with contrastive and region-consistency objectives.
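
The initial NCut step can be sketched as a spectral bipartition of a cosine-affinity graph over patch features; the mean-thresholding rule below is a simplification of the actual procedure.

```python
import numpy as np

def ncut_bipartition(features):
    """Coarse foreground/background split via Normalized Cut (sketch).

    features: (num_patches, dim) self-supervised patch descriptors."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    W = np.clip(f @ f.T, 0, None)                 # non-negative affinity graph
    d = W.sum(1) + 1e-8                           # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt  # normalized Laplacian
    vals, vecs = np.linalg.eigh(L)                # eigenvalues in ascending order
    fiedler = vecs[:, 1]                          # second-smallest eigenvector
    return fiedler > fiedler.mean()               # coarse binary mask
```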

Result: Selfment achieves substantial improvements over previous unsupervised methods: +4.0% F_max on ECSSD, +4.6% on HKUIS, and +5.7% on PASCAL-S. It also demonstrates remarkable zero-shot generalization to camouflaged object detection (0.910 S_m on CHAMELEON and 0.792 F_β^ω on CAMO), outperforming all unsupervised approaches and rivaling state-of-the-art supervised methods.

Conclusion: Selfment presents a simple yet effective fully self-supervised framework for object segmentation that achieves state-of-the-art performance across multiple benchmarks without any manual supervision, demonstrating strong generalization capabilities even to challenging tasks like camouflaged object detection.

Abstract: Accurately segmenting objects without any manual annotations remains one of the core challenges in computer vision. In this work, we introduce Selfment, a fully self-supervised framework that segments foreground objects directly from raw images without human labels, pretrained segmentation models, or any post-processing. Selfment first constructs patch-level affinity graphs from self-supervised features and applies NCut to obtain an initial coarse foreground–background separation. We then introduce Iterative Patch Optimization (IPO), a feature-space refinement procedure that progressively enforces spatial coherence and semantic consistency through iterative patch clustering. The refined masks are subsequently used as supervisory signals to train a lightweight segmentation head with contrastive and region-consistency objectives, allowing the model to learn stable and transferable object representations. Despite its simplicity and complete absence of manual supervision, Selfment sets new state-of-the-art (SoTA) results across multiple benchmarks. It achieves substantial improvements on $F_{\max}$ over previous unsupervised saliency detection methods on ECSSD ($+4.0\%$), HKUIS ($+4.6\%$), and PASCAL-S ($+5.7\%$). Moreover, without any additional fine-tuning, Selfment demonstrates remarkable zero-shot generalization to camouflaged object detection tasks (e.g., $0.910$ $S_m$ on CHAMELEON and $0.792$ $F_\beta^\omega$ on CAMO), outperforming all existing unsupervised approaches and even rivaling the SoTA fully supervised methods.

[113] Diffusion Probe: Generated Image Result Prediction Using CNN Probes

Benlei Cui, Bukun Huang, Zhizeng Ye, Xuemei Dong, Tuo Chen, Hui Xue, Dingkang Yang, Longtao Huang, Jingqun Tang, Haiwen Hong

Main category: cs.CV

TL;DR: Diffusion Probe: A framework that predicts final image quality in text-to-image diffusion models using early cross-attention statistics, enabling efficient early quality assessment to reduce computational waste.

DetailsMotivation: Text-to-image diffusion models lack efficient early quality assessment mechanisms, leading to costly trial-and-error in multi-generation scenarios like prompt iteration and agent-based generation. Current approaches require full image synthesis before quality evaluation, wasting computational resources on low-quality outputs.

Method: The authors discovered a strong correlation between early diffusion cross-attention distributions and final image quality. They developed Diffusion Probe, a lightweight predictor that maps statistical properties of early-stage cross-attention maps (extracted from the initial denoising steps) to a final-image-quality prediction across diverse metrics.
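
A hedged sketch of the pipeline: summarize an early-step cross-attention tensor into a few scalar statistics and feed them to a lightweight predictor (here a linear probe; the paper's exact feature set and predictor architecture are not specified in this summary).

```python
import numpy as np

def cross_attention_features(attn):
    """Summary statistics of a cross-attention tensor (illustrative).

    attn: (heads, text_tokens, image_patches), non-negative weights."""
    p = attn / (attn.sum(axis=-1, keepdims=True) + 1e-8)
    entropy = -(p * np.log(p + 1e-8)).sum(axis=-1)  # per-(head, token) dispersion
    return np.array([attn.mean(), attn.std(), attn.max(),
                     entropy.mean(), entropy.std()])

def predict_quality(attn, w, b):
    """Lightweight linear probe mapping attention statistics to a quality score."""
    return float(cross_attention_features(attn) @ w + b)
```

The weights `w`, `b` would be fit against a quality metric on generated images; at inference the probe runs after only a few denoising steps.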

Result: Diffusion Probe achieves strong correlation (PCC > 0.7) and high classification performance (AUC-ROC > 0.9) across multiple T2I models, early denoising windows, resolutions, and quality metrics. It enables practical gains in workflows like prompt optimization, seed selection, and accelerated RL training.

Conclusion: Diffusion Probe provides a model-agnostic, efficient solution for early quality prediction in T2I generation, reducing computational overhead while improving final output quality through targeted sampling and avoidance of low-potential generations.

Abstract: Text-to-image (T2I) diffusion models lack an efficient mechanism for early quality assessment, leading to costly trial-and-error in multi-generation scenarios such as prompt iteration, agent-based generation, and flow-grpo. We reveal a strong correlation between early diffusion cross-attention distributions and final image quality. Based on this finding, we introduce Diffusion Probe, a framework that leverages internal cross-attention maps as predictive signals. We design a lightweight predictor that maps statistical properties of early-stage cross-attention extracted from initial denoising steps to the final image’s overall quality. This enables accurate forecasting of image quality across diverse evaluation metrics long before full synthesis is complete. We validate Diffusion Probe across a wide range of settings. On multiple T2I models, across early denoising windows, resolutions, and quality metrics, it achieves strong correlation (PCC > 0.7) and high classification performance (AUC-ROC > 0.9). Its reliability translates into practical gains. By enabling early quality-aware decisions in workflows such as prompt optimization, seed selection, and accelerated RL training, the probe supports more targeted sampling and avoids computation on low-potential generations. This reduces computational overhead while improving final output quality. Diffusion Probe is model-agnostic, efficient, and broadly applicable, offering a practical solution for improving T2I generation efficiency through early quality prediction.

[114] Fourier Angle Alignment for Oriented Object Detection in Remote Sensing

Changyu Gu, Linwei Chen, Lin Gu, Ying Fu

Main category: cs.CV

TL;DR: Fourier Angle Alignment method for remote sensing rotated object detection using Fourier rotation equivariance to address directional incoherence and task conflict issues.

DetailsMotivation: Mainstream methods in remote sensing rotated object detection suffer from two bottlenecks, directional incoherence at the detector neck and task conflict at the detection head, which limit performance.

Method: Proposes Fourier Angle Alignment, which analyzes angle information through the frequency spectrum and aligns the main direction to a fixed orientation. Introduces two plug-and-play modules: FAAFusion (works at the detector neck to align higher-level features to lower-level features) and FAA Head (pre-aligns RoI features to a canonical angle before classification/regression).
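
The core operation, estimating a main direction from the frequency spectrum so features can be aligned to it, can be sketched as picking the dominant non-DC Fourier peak; this is an illustrative reading, not the paper's exact algorithm.

```python
import numpy as np

def dominant_angle(feat):
    """Estimate a 2-D map's main direction from its Fourier magnitude (sketch)."""
    F = np.abs(np.fft.fftshift(np.fft.fft2(feat)))
    h, w = feat.shape
    cy, cx = h // 2, w // 2
    F[cy, cx] = 0                      # ignore the DC component
    i = np.unravel_index(np.argmax(F), F.shape)
    return np.arctan2(i[0] - cy, i[1] - cx)
```

Alignment would then rotate the feature map (or its RoI features) by the negative of this angle before fusion or detection.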

Result: Achieves new state-of-the-art results of 78.72% mAP on DOTA-v1.0 and 72.28% mAP on DOTA-v1.5 datasets with single scale training/testing. Greatly improves previous work on DOTA-v1.0, DOTA-v1.5 and HRSC2016 datasets.

Conclusion: Fourier Angle Alignment effectively addresses directional incoherence and task conflict in rotated object detection, validating efficacy in remote sensing object detection through superior performance on benchmark datasets.

Abstract: In remote sensing rotated object detection, mainstream methods suffer from two bottlenecks: directional incoherence at the detector neck and task conflict at the detection head. Utilising Fourier rotation equivariance, we introduce Fourier Angle Alignment, which analyses angle information through the frequency spectrum and aligns the main direction to a fixed orientation. We then propose two plug-and-play modules: FAAFusion and FAA Head. FAAFusion works at the detector neck, aligning the main direction of higher-level features to that of the lower-level features and then fusing them. FAA Head serves as a new detection head, which pre-aligns RoI features to a canonical angle and adds them to the original features before classification and regression. Experiments on DOTA-v1.0, DOTA-v1.5 and HRSC2016 show that our method greatly improves upon previous work. In particular, it achieves new state-of-the-art results of 78.72% mAP on DOTA-v1.0 and 72.28% mAP on DOTA-v1.5 with single-scale training and testing, validating the efficacy of our approach in remote sensing object detection. The code is made publicly available at https://github.com/gcy0423/Fourier-Angle-Alignment.

[115] See, Act, Adapt: Active Perception for Unsupervised Cross-Domain Visual Adaptation via Personalized VLM-Guided Agent

Tianci Tang, Tielong Cai, Hongwei Wang, Gaoang Wang

Main category: cs.CV

TL;DR: Sea² adapts frozen pre-trained perception models to novel environments through intelligent viewpoint control, using a VLM-based pose controller trained with rule-based exploration and unsupervised RL, achieving significant performance gains without downstream labels or model retraining.

DetailsMotivation: Pre-trained perception models degrade in novel environments like indoor scenes, and conventional fine-tuning causes catastrophic forgetting and requires costly annotations. There's a need for methods that can adapt to new environments without retraining models or needing labeled data.

Method: Sea² keeps perception modules frozen and adapts their deployment through an intelligent pose-control agent. It transforms a vision-language model into a low-level pose controller using a two-stage pipeline: 1) fine-tuning on rule-based exploration trajectories that systematically probe scenes, and 2) refining via unsupervised reinforcement learning that constructs rewards from perception module outputs and confidence.

Result: Experiments on three visual perception tasks (visual grounding, segmentation, and 3D box estimation) show performance improvements of 13.54%, 15.92%, and 27.68% respectively on the ReplicaCAD dataset, demonstrating effective adaptation without downstream labels or model retraining.

Conclusion: Sea² provides a paradigm shift by adapting how perception models are deployed rather than adapting the models themselves, enabling effective use of off-the-shelf models in novel environments without catastrophic forgetting or costly annotations.

Abstract: Pre-trained perception models excel in generic image domains but degrade significantly in novel environments like indoor scenes. The conventional remedy is fine-tuning on downstream data, which incurs catastrophic forgetting of prior knowledge and demands costly, scene-specific annotations. We propose a paradigm shift through Sea² (See, Act, Adapt): rather than adapting the perception modules themselves, we adapt how they are deployed through an intelligent pose-control agent. Sea² keeps all perception modules frozen, requiring no downstream labels during training, and uses only scalar perceptual feedback to navigate the agent toward informative viewpoints. Specifically, we transform a vision-language model (VLM) into a low-level pose controller through a two-stage training pipeline: first fine-tuning it on rule-based exploration trajectories that systematically probe indoor scenes, and then refining the policy via unsupervised reinforcement learning that constructs rewards from the perception module's outputs and confidence. Unlike prior active perception methods that couple exploration with specific models or collect data for retraining them, Sea² directly leverages off-the-shelf perception models for various tasks without the need for retraining. We conducted experiments on three visual perception tasks, including visual grounding, segmentation, and 3D box estimation, with performance improvements of 13.54%, 15.92%, and 27.68%, respectively, on the ReplicaCAD dataset.

[116] Action-Geometry Prediction with 3D Geometric Prior for Bimanual Manipulation

Chongyang Xu, Haipeng Li, Shen Cheng, Jingyu Hu, Haoqiang Fan, Ziliang Feng, Shuaicheng Liu

Main category: cs.CV

TL;DR: A bimanual manipulation framework that uses 3D geometric foundation models to predict both actions and future 3D scene geometry from RGB images, outperforming 2D and point-cloud baselines.

DetailsMotivation: Existing bimanual manipulation methods rely on limited 2D features or require explicit point clouds that are difficult to obtain reliably in real-world settings. Recent 3D geometric foundation models enable accurate 3D reconstruction from RGB images, creating an opportunity to build more robust manipulation policies.

Method: The framework leverages a pre-trained 3D geometric foundation model. It fuses geometry-aware latents, 2D semantic features, and proprioception into a unified state representation, then uses a diffusion model to jointly predict future action chunks and future 3D latents that decode into dense pointmaps.

Result: The method outperforms 2D-based and point-cloud-based baselines in simulation on the RoboTwin benchmark and in real-world robot executions, achieving state-of-the-art performance in manipulation success, inter-arm coordination, and 3D spatial prediction accuracy.

Conclusion: By explicitly predicting how 3D scenes evolve together with action sequences, the policy gains strong spatial understanding and predictive capability using only RGB observations, enabling more effective bimanual manipulation.

Abstract: Bimanual manipulation requires policies that can reason about 3D geometry, anticipate how it evolves under action, and generate smooth, coordinated motions. However, existing methods typically rely on 2D features with limited spatial awareness, or require explicit point clouds that are difficult to obtain reliably in real-world settings. At the same time, recent 3D geometric foundation models show that accurate and diverse 3D structure can be reconstructed directly from RGB images in a fast and robust manner. We leverage this opportunity and propose a framework that builds bimanual manipulation directly on a pre-trained 3D geometric foundation model. Our policy fuses geometry-aware latents, 2D semantic features, and proprioception into a unified state representation, and uses a diffusion model to jointly predict a future action chunk and a future 3D latent that decodes into a dense pointmap. By explicitly predicting how the 3D scene will evolve together with the action sequence, the policy gains strong spatial understanding and predictive capability using only RGB observations. We evaluate our method both in simulation on the RoboTwin benchmark and in real-world robot executions. Our approach consistently outperforms 2D-based and point-cloud-based baselines, achieving state-of-the-art performance in manipulation success, inter-arm coordination, and 3D spatial prediction accuracy. Code is available at https://github.com/Chongyang-99/GAP.git.

[117] Footprint-Guided Exemplar-Free Continual Histopathology Report Generation

Pratibha Kumari, Daniel Reisenbüchler, Afshin Bozorgpour, Yousef Sadegheih, Priyankar Choudhary, Dorit Merhof

Main category: cs.CV

TL;DR: Continual learning framework for pathology report generation from whole-slide images using domain footprints for generative replay without storing exemplars.

DetailsMotivation: Clinical deployment of pathology report generation systems faces challenges with evolving data (new organs, institutions, reporting conventions) over time, and sequential fine-tuning causes catastrophic forgetting. Need for exemplar-free continual learning to adapt to evolving clinical settings.

Method: Uses compact domain footprints in frozen patch-embedding space: codebook of representative morphology tokens, slide-level co-occurrence summaries, and patch-count priors. These support generative replay by synthesizing pseudo-WSI representations. Distills domain-specific linguistic characteristics into style descriptors for steering generation. At inference, identifies compatible descriptor from slide signal without explicit domain IDs.
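
A slide-level footprint of the kind described can be pictured as a histogram of nearest-codebook assignments plus a patch count. This is a toy sketch: the codebook is assumed given (the paper builds it in a frozen patch-embedding space), and the co-occurrence summaries are omitted.

```python
def footprint(patches, codebook):
    """Assign each patch embedding to its nearest codebook token and return
    (token histogram, patch count) -- a toy slide-level domain footprint.
    How the codebook is learned (e.g., by clustering) is out of scope here."""
    def dist2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    hist = [0] * len(codebook)
    for p in patches:
        k = min(range(len(codebook)), key=lambda i: dist2(p, codebook[i]))
        hist[k] += 1
    return hist, len(patches)
```

Such a summary is tiny compared to the slide itself, which is what makes exemplar-free replay feasible.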

Result: Outperforms exemplar-free and limited-buffer rehearsal baselines across multiple public continual learning benchmarks. Demonstrates practical solution for deployment in evolving clinical settings.

Conclusion: Footprint-based generative replay enables effective continual learning for pathology report generation without storing past data, addressing catastrophic forgetting while adapting to evolving clinical domains and reporting conventions.

Abstract: Rapid progress in vision-language modeling has enabled pathology report generation from gigapixel whole-slide images, but most approaches assume static training with simultaneous access to all data. In clinical deployment, however, new organs, institutions, and reporting conventions emerge over time, and sequential fine-tuning can cause catastrophic forgetting. We introduce an exemplar-free continual learning framework for WSI-to-report generation that avoids storing raw slides or patch exemplars. The core idea is a compact domain footprint built in a frozen patch-embedding space: a small codebook of representative morphology tokens together with slide-level co-occurrence summaries and lightweight patch-count priors. These footprints support generative replay by synthesizing pseudo-WSI representations that reflect domain-specific morphological mixtures, while a teacher snapshot provides pseudo-reports to supervise the updated model without retaining past data. To address shifting reporting conventions, we distill domain-specific linguistic characteristics into a compact style descriptor and use it to steer generation. At inference, the model identifies the most compatible descriptor directly from the slide signal, enabling domain-agnostic setup without requiring explicit domain identifiers. Evaluated across multiple public continual learning benchmarks, our approach outperforms exemplar-free and limited-buffer rehearsal baselines, highlighting footprint-based generative replay as a practical solution for deployment in evolving clinical settings.

[118] Denoising-Enhanced YOLO for Robust SAR Ship Detection

Xiaojing Zhao, Shiyang Li, Zena Chu, Ying Zhang, Peinan Hao, Tianzi Yan, Jiajia Chen, Huicong Ning

Main category: cs.CV

TL;DR: CPN-YOLO improves ship detection in SAR imagery using denoising, attention mechanisms, and Gaussian similarity loss to address clutter, noise, and small target challenges.

DetailsMotivation: SAR imagery is crucial for ship detection but suffers from clutter and speckle noise causing false alarms, and small targets being easily missed in complex scenes.

Method: Three improvements to YOLOv8: 1) learnable large-kernel denoising module for input preprocessing, 2) PPA attention mechanism for multi-scale feature enhancement, 3) Gaussian similarity loss based on normalized Wasserstein distance for better bounding box similarity measurement.
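
The NWD-based similarity can be sketched by modeling each box (cx, cy, w, h) as a 2D Gaussian and exponentiating the negative 2-Wasserstein distance between the Gaussians; the normalization constant `c` below is an illustrative placeholder, not the paper's setting.

```python
import math

def nwd(box_a, box_b, c=12.8):
    """Normalized Wasserstein distance between two boxes (cx, cy, w, h),
    each modeled as a 2D Gaussian N([cx, cy], diag(w/2, h/2)^2).
    c is a dataset-dependent constant (hypothetical value here)."""
    cxa, cya, wa, ha = box_a
    cxb, cyb, wb, hb = box_b
    # Squared 2-Wasserstein distance between the two Gaussians
    w2_sq = ((cxa - cxb) ** 2 + (cya - cyb) ** 2
             + ((wa - wb) / 2) ** 2 + ((ha - hb) / 2) ** 2)
    return math.exp(-math.sqrt(w2_sq) / c)

def nwd_loss(pred, target, c=12.8):
    """Similarity turned into a loss: identical boxes give zero loss."""
    return 1.0 - nwd(pred, target, c)
```

Unlike IoU, this stays smooth and informative even when small boxes barely overlap, which is the motivation for using it on small SAR ships.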

Result: Achieves 97.0% precision, 95.1% recall, and 98.9% mAP on the SSDD dataset (with experiments also on HRSID), outperforming the YOLOv8 baseline and other deep learning detectors.

Conclusion: CPN-YOLO effectively addresses SAR ship detection challenges in complex scenes through targeted architectural improvements.

Abstract: With the rapid advancement of deep learning, synthetic aperture radar (SAR) imagery has become a key modality for ship detection. However, robust performance remains challenging in complex scenes, where clutter and speckle noise can induce false alarms and small targets are easily missed. To address these issues, we propose CPN-YOLO, a high-precision ship detection framework built upon YOLOv8 with three targeted improvements. First, we introduce a learnable large-kernel denoising module for input pre-processing, producing cleaner representations and more discriminative features across diverse ship types. Second, we design a feature extraction enhancement strategy based on the PPA attention mechanism to strengthen multi-scale modeling and improve sensitivity to small ships. Third, we incorporate a Gaussian similarity loss derived from the normalized Wasserstein distance (NWD) to better measure similarity under complex bounding-box distributions and improve generalization. Extensive experiments on HRSID and SSDD demonstrate the effectiveness of our method. On SSDD, CPN-YOLO surpasses the YOLOv8 baseline, achieving 97.0% precision, 95.1% recall, and 98.9% mAP, and consistently outperforms other representative deep-learning detectors in overall performance.

[119] APPO: Attention-guided Perception Policy Optimization for Video Reasoning

Henghui Du, Chang Zhou, Xi Chen, Di Hu

Main category: cs.CV

TL;DR: APPO is an attention-guided perception policy optimization algorithm that uses token-level dense rewards to enhance fine-grained perception in video reasoning models, showing perception improvements are more critical than reasoning enhancements for complex video tasks.

DetailsMotivation: The paper identifies that complex video reasoning relies more on fine-grained perception than expert-level reasoning. Empirical evidence shows that perception improvements (scaling the model from 7B to 32B) yield a 1.4% performance boost, while reasoning enhancements (Qwen3-8B to OpenAI-o3) provide only a 0.7% improvement. The goal is to enhance perception ability through reasoning without expensive fine-grained annotations.

Method: Proposes APPO (Attention-guided Perception Policy Optimization) algorithm that leverages token-level dense rewards to improve fine-grained perception. The core idea optimizes intra-group perception tokens - tokens from different responses that focus on the same crucial video frame. Uses attention mechanisms to guide perception policy optimization.

Result: Experimental results on diverse video benchmarks with different model scales (3B/7B) show APPO consistently outperforms GRPO and DAPO by 0.5% to 4%. Demonstrates effectiveness in enhancing perception abilities through reasoning in a low-cost manner.

Conclusion: APPO provides a promising approach to effectively enhance model’s perception abilities through reasoning without expensive fine-grained annotations, serving diverse scenarios and demands in video understanding tasks.

Abstract: Complex video reasoning in fact relies more on fine-grained perception than on expert-level (e.g., Ph.D.-level science) reasoning. Through extensive empirical observation, we have recognized the critical impact of perception. In particular, when perception ability is held almost fixed, enhancing reasoning from Qwen3-8B to OpenAI-o3 yields only a 0.7% performance improvement. Conversely, even a minimal change in perception model scale (from 7B to 32B) boosts performance by 1.4%, indicating that enhancing perception, rather than reasoning, is more critical to improving performance. Therefore, it is worthwhile to explore how to enhance perception ability through reasoning without the need for expensive fine-grained annotation information. To achieve this goal, we propose APPO, the Attention-guided Perception Policy Optimization algorithm that leverages token-level dense rewards to improve the model’s fine-grained perception. The core idea behind APPO is to optimize those tokens from different responses that primarily focus on the same crucial video frame (called intra-group perception tokens). Experimental results on diverse video benchmarks and models at different scales (3B/7B) demonstrate that APPO consistently outperforms GRPO and DAPO (by 0.5% to 4%). We hope our work provides a promising approach to effectively enhancing a model’s perception abilities through reasoning in a low-cost manner, serving diverse scenarios and demands.

[120] NAU-QMUL: Utilizing BERT and CLIP for Multi-modal AI-Generated Image Detection

Xiaoyu Guo, Arkaitz Zubiaga

Main category: cs.CV

TL;DR: Multi-modal multi-task model for detecting AI-generated images and identifying their source models using BERT and CLIP encoders with cross-modal fusion and pseudo-labeling data augmentation.

DetailsMotivation: To address the growing challenge of detecting AI-generated images and identifying the specific generative models that created them, which is important for content verification and authenticity assessment in real-world scenarios.

Method: Uses pre-trained BERT for text feature extraction and CLIP Vision encoder for image features, with cross-modal feature fusion and a tailored multi-task loss function. Implements pseudo-labeling-based data augmentation to expand training data with high-confidence samples.
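
The pseudo-labeling step can be sketched as a confidence filter over the model's predictions on unlabeled data; the 0.95 threshold below is an assumed value, not the paper's.

```python
def pseudo_label(probs, threshold=0.95):
    """Given per-sample class probabilities for unlabeled data, keep only
    high-confidence predictions as pseudo-labels (threshold is illustrative).
    Returns (sample_index, predicted_label) pairs to add to training data."""
    selected = []
    for i, p in enumerate(probs):
        label = max(range(len(p)), key=p.__getitem__)  # argmax class
        if p[label] >= threshold:
            selected.append((i, label))
    return selected
```

The threshold trades label noise against the amount of extra training data: raising it keeps fewer but cleaner pseudo-labels.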

Result: Achieved 5th place in both Tasks A (detection) and B (model identification) of the CT2 competition, with F1 scores of 83.16% and 48.88% respectively, demonstrating the effectiveness of the proposed architecture.

Conclusion: The proposed multi-modal multi-task approach shows promise for advancing AI-generated content detection in practical applications, with the architecture effectively leveraging both visual and textual information for improved performance.

Abstract: With the aim of detecting AI-generated images and identifying the specific models responsible for their generation, we propose a multi-modal multi-task model. The model leverages pre-trained BERT and CLIP Vision encoders for text and image feature extraction, respectively, and employs cross-modal feature fusion with a tailored multi-task loss function. Additionally, a pseudo-labeling-based data augmentation strategy was utilized to expand the training dataset with high-confidence samples. The model achieved fifth place in both Tasks A and B of the ‘CT2: AI-Generated Image Detection’ competition, with F1 scores of 83.16% and 48.88%, respectively. These findings highlight the effectiveness of the proposed architecture and its potential for advancing AI-generated content detection in real-world scenarios. The source code for our method is published on https://github.com/xxxxxxxxy/AIGeneratedImageDetection.

[121] Open-Vocabulary Semantic Segmentation in Remote Sensing via Hierarchical Attention Masking and Model Composition

Mohammadreza Heidarianbaei, Mareike Dorozynski, Hubert Kanyamahanga, Max Mehltretter, Franz Rottensteiner

Main category: cs.CV

TL;DR: ReSeg-CLIP: training-free open-vocabulary semantic segmentation for remote sensing using SAM masks and model composition.

DetailsMotivation: Address limitations of vision-language models like CLIP in semantic segmentation for remote sensing data, particularly inappropriate interactions within self-attention layers

Method: Hierarchical scheme using SAM-generated masks to constrain attention interactions at multiple scales; model composition averaging parameters of multiple RS-specific CLIP variants with weighting scheme using varying text prompts
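
Parameter-space model composition reduces to a weighted average of aligned weights. A minimal sketch, using plain lists for parameters and placeholder weights (the paper derives its weights from representational quality under varying text prompts):

```python
def compose_models(state_dicts, weights):
    """Weighted average of several models' parameters (a 'model soup').
    state_dicts: list of {param_name: list_of_floats}, all with identical
    shapes; weights are normalized before averaging."""
    total = sum(weights)
    weights = [w / total for w in weights]
    composed = {}
    for name in state_dicts[0]:
        composed[name] = [
            sum(w * sd[name][i] for w, sd in zip(weights, state_dicts))
            for i in range(len(state_dicts[0][name]))
        ]
    return composed
```

This only makes sense when the variants share an architecture and initialization lineage, as the RS-specific CLIP variants here do.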

Result: Achieves state-of-the-art results across three remote sensing benchmarks without additional training

Conclusion: Proposed training-free method effectively addresses CLIP limitations for remote sensing segmentation through hierarchical mask constraints and model composition

Abstract: In this paper, we propose ReSeg-CLIP, a new training-free Open-Vocabulary Semantic Segmentation method for remote sensing data. To compensate for the problems of vision-language models such as CLIP in semantic segmentation, caused by inappropriate interactions within the self-attention layers, we introduce a hierarchical scheme utilizing masks generated by SAM to constrain the interactions at multiple scales. We also present a model composition approach that averages the parameters of multiple RS-specific CLIP variants, taking advantage of a new weighting scheme that evaluates representational quality using varying text prompts. Our method achieves state-of-the-art results across three RS benchmarks without additional training.

[122] Bandwidth-adaptive Cloud-Assisted 360-Degree 3D Perception for Autonomous Vehicles

Faisal Hawladera, Rui Meireles, Gamal Elghazaly, Ana Aguiar, Raphaël Frank

Main category: cs.CV

TL;DR: Hybrid cloud-vehicle processing system for autonomous driving that uses V2X communication to offload transformer-based BEV 3D object detection computation, achieving 72% latency reduction and 20% accuracy improvement through dynamic optimization.

DetailsMotivation: Autonomous driving requires real-time situational awareness but faces latency issues due to high processing demands and limited onboard compute resources, especially in complex urban environments.

Method: Leverages V2X communication to offload processing to cloud; uses transformer models to fuse multi-camera data into BEV representation; dynamically splits computation between vehicle and cloud based on layers processed locally and feature quantization; applies feature vector clipping and compression to reduce network load.
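
The clipping-plus-quantization step before transmission can be sketched as uniform quantization of clipped feature values; the bit width and clip range below are illustrative choices, not the paper's operating points.

```python
def quantize(features, num_bits=8, clip=3.0):
    """Clip and uniformly quantize a feature vector before transmission.
    Returns integer codes plus the (scale, clip) needed to dequantize."""
    levels = 2 ** num_bits - 1
    scale = (2 * clip) / levels
    codes = []
    for x in features:
        x = max(-clip, min(clip, x))          # feature clipping
        codes.append(round((x + clip) / scale))  # integer code in [0, levels]
    return codes, scale, clip

def dequantize(codes, scale, clip):
    """Cloud-side reconstruction of the clipped features."""
    return [c * scale - clip for c in codes]
```

Lowering `num_bits` shrinks the payload linearly at the cost of reconstruction error, which is exactly the accuracy/latency knob the dynamic optimizer tunes.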

Result: 72% reduction in end-to-end latency compared to traditional onboard solution; dynamic optimization algorithm improves accuracy by up to 20% over static parameterization under realistic bandwidth variability.

Conclusion: Hybrid cloud-vehicle processing with V2X communication effectively addresses latency constraints in autonomous driving while maintaining high detection accuracy through adaptive optimization.

Abstract: A key challenge for autonomous driving lies in maintaining real-time situational awareness regarding surrounding obstacles under strict latency constraints. The high processing requirements coupled with limited onboard computational resources can cause delay issues, particularly in complex urban settings. To address this, we propose leveraging Vehicle-to-Everything (V2X) communication to partially offload processing to the cloud, where compute resources are abundant, thus reducing overall latency. Our approach utilizes transformer-based models to fuse multi-camera sensor data into a comprehensive Bird’s-Eye View (BEV) representation, enabling accurate 360-degree 3D object detection. The computation is dynamically split between the vehicle and the cloud based on the number of layers processed locally and the quantization level of the features. To further reduce network load, we apply feature vector clipping and compression prior to transmission. In a real-world experimental evaluation, our hybrid strategy achieved a 72% reduction in end-to-end latency compared to a traditional onboard solution. To adapt to fluctuating network conditions, we introduce a dynamic optimization algorithm that selects the split point and quantization level to maximize detection accuracy while satisfying real-time latency constraints. Trace-based evaluation under realistic bandwidth variability shows that this adaptive approach improves accuracy by up to 20% over static parameterization with the same latency performance.

[123] Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks

Qihua Dong, Kuo Yang, Lin Ju, Handong Zhao, Yitian Zhang, Yizhou Wang, Huimin Zeng, Jianglin Lu, Yun Fu

Main category: cs.CV

TL;DR: Ref-Adv is a new Referring Expression Comprehension benchmark designed to suppress shortcuts and test genuine visual reasoning in multimodal LLMs by using linguistically complex expressions with minimal necessary information and hard distractors.

DetailsMotivation: Standard REC benchmarks (RefCOCO, RefCOCO+, RefCOCOg) have limitations: expressions are too short with little reasoning demand, images have few distractors making targets easy to find, and redundant descriptors enable shortcut solutions that bypass genuine text understanding and visual reasoning.

Method: Created Ref-Adv dataset with linguistically nontrivial expressions paired with only necessary information to uniquely identify targets, using real images with hard distractors, annotated with reasoning facets including negation. Conducted comprehensive ablations (word order perturbations and descriptor deletion sufficiency) to validate reasoning requirements.

Result: Despite strong performance on standard REC benchmarks, contemporary multimodal LLMs show marked performance drops on Ref-Adv, revealing reliance on shortcuts and gaps in visual reasoning and grounding. Failure analysis provides insights into model limitations.

Conclusion: Ref-Adv exposes weaknesses in current multimodal LLMs’ visual reasoning capabilities and aims to guide future work on improving visual reasoning and grounding in MLLMs by suppressing shortcut solutions.

Abstract: Referring Expression Comprehension (REC) links language to region-level visual perception. Standard benchmarks (RefCOCO, RefCOCO+, RefCOCOg) have progressed rapidly with multimodal LLMs but remain weak tests of visual reasoning and grounding: (i) many expressions are very short, leaving little reasoning demand; (ii) images often contain few distractors, making the target easy to find; and (iii) redundant descriptors enable shortcut solutions that bypass genuine text understanding and visual reasoning. We introduce Ref-Adv, a modern REC benchmark that suppresses shortcuts by pairing linguistically nontrivial expressions with only the information necessary to uniquely identify the target. The dataset contains referring expressions on real images, curated with hard distractors and annotated with reasoning facets including negation. We conduct comprehensive ablations (word order perturbations and descriptor deletion sufficiency) to show that solving Ref-Adv requires reasoning beyond simple cues, and we evaluate a broad suite of contemporary multimodal LLMs on Ref-Adv. Despite strong results on RefCOCO, RefCOCO+, and RefCOCOg, models drop markedly on Ref-Adv, revealing reliance on shortcuts and gaps in visual reasoning and grounding. We provide an in-depth failure analysis and aim for Ref-Adv to guide future work on visual reasoning and grounding in MLLMs.

[124] Altitude-Aware Visual Place Recognition in Top-Down View

Xingyu Shao, Mengfan He, Chunyu Li, Liangzheng Sun, Ziyang Meng

Main category: cs.CV

TL;DR: Altitude-adaptive visual place recognition method for aerial platforms that uses ground feature density analysis to estimate relative altitude and improve localization accuracy under significant altitude variations.

DetailsMotivation: Address the challenge of aerial visual place recognition under significant altitude variations, where conventional methods relying on additional sensors (barometric altimeters, ToF sensors) are not suitable for small- and medium-sized airborne platforms.

Method: Proposes an altitude-adaptive VPR approach that: 1) estimates relative altitude by analyzing ground feature density in images, 2) applies altitude-based cropping to generate canonical query images, and 3) uses classification-based VPR strategy for localization.
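
One way to picture the density-to-altitude idea: under the toy assumption that the number of detected ground features in a frame grows with the square of altitude (wider ground coverage per image), a relative altitude and a canonical crop fraction follow directly. This illustrates the geometry only; it is not the paper's estimator.

```python
import math

def relative_altitude(feature_count, ref_count):
    """Toy estimate of altitude relative to a reference view, assuming the
    detected ground-feature count scales with the square of altitude
    (an assumption made for illustration)."""
    return math.sqrt(feature_count / ref_count)

def crop_fraction(rel_altitude):
    """Side fraction of the central crop that restores the reference
    altitude's ground coverage; no crop at or below the reference."""
    return min(1.0, 1.0 / rel_altitude)
```

The cropped, scale-normalized query is what makes a classification-based retrieval viable across altitudes.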

Result: Method boosts average R@1 by 29.85% and R@5 by 60.20% compared to VPR alone. Reduces mean error by 202.1m compared to monocular metric depth estimation methods, with additional improvements of 31.4% in R@1 and 44% in R@5.

Conclusion: Establishes a robust, vision-only framework for 3D visual place recognition, offering a practical and scalable solution for accurate airborne platform localization under large altitude variations with limited sensor availability.

Abstract: To address the challenge of the aerial visual place recognition (VPR) problem under significant altitude variations, this study proposes an altitude-adaptive VPR approach that integrates ground feature density analysis with image classification techniques. The proposed method estimates airborne platforms’ relative altitude by analyzing the density of ground features in images, then applies relative altitude-based cropping to generate canonical query images, which are subsequently used in a classification-based VPR strategy for localization. Extensive experiments across diverse terrains and altitude conditions demonstrate that the proposed approach achieves high accuracy and robustness in both altitude estimation and VPR under significant altitude changes. Compared to conventional methods relying on barometric altimeters or Time-of-Flight (ToF) sensors, this solution requires no additional hardware and offers a plug-and-play solution for downstream applications, making it suitable for small- and medium-sized airborne platforms operating in diverse environments, including rural and urban areas. Under significant altitude variations, incorporating our relative altitude estimation module into the VPR retrieval pipeline boosts average R@1 and R@5 by 29.85% and 60.20%, respectively, compared with applying VPR retrieval alone. Furthermore, compared to traditional Monocular Metric Depth Estimation (MMDE) methods, the proposed method reduces the mean error by 202.1 m, yielding average additional improvements of 31.4% in R@1 and 44% in R@5. These results demonstrate that our method establishes a robust, vision-only framework for three-dimensional visual place recognition, offering a practical and scalable solution for accurate airborne platform localization under large altitude variations and limited sensor availability.

[125] DACESR: Degradation-Aware Conditional Embedding for Real-World Image Super-Resolution

Xiaoyan Lei, Wenlong Zhang, Biao Luo, Hui Liang, Weifeng Cao, Qiuting Lin

Main category: cs.CV

TL;DR: A novel approach for real-world image super-resolution using a Real Embedding Extractor (REE) with degradation selection and a Mamba-based network with Conditional Feature Modulator (CFM) to improve recognition and restoration of degraded images.

DetailsMotivation: Multimodal large models have shown good performance in image super-resolution using language conditions, but their abilities remain limited for degraded images. The paper aims to address this limitation by improving recognition and restoration of degraded image content.

Method: 1) Analyze Recognize Anything Model (RAM) capabilities on degraded images via text similarity; 2) Propose Real Embedding Extractor (REE) with degradation selection strategy using contrastive learning; 3) Develop Conditional Feature Modulator (CFM) to incorporate REE’s high-level information into a Mamba-based network for texture restoration.
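
The contrastive fine-tuning mentioned here is typically an InfoNCE-style objective. A minimal single-anchor sketch, where the candidate list includes the positive and the 0.07 temperature is a common default assumed here, not the paper's value:

```python
import math

def info_nce(sim_pos, sims_all, temperature=0.07):
    """InfoNCE loss for one anchor: negative log-softmax of the positive
    similarity over all candidate similarities (sims_all must include the
    positive). Computed with a max-shift for numerical stability."""
    logits = [s / temperature for s in sims_all]
    m = max(logits)
    denom = sum(math.exp(l - m) for l in logits)
    return -((sim_pos / temperature - m) - math.log(denom))
```

Lowering the temperature sharpens the softmax, penalizing hard negatives more strongly.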

Result: Extensive experiments show REE effectively helps image super-resolution networks balance fidelity and perceptual quality, demonstrating significant recognition performance gain on degraded images and producing visually pleasing restoration results.

Conclusion: The proposed approach successfully addresses limitations of multimodal models on degraded images, highlighting the potential of Mamba-based networks in real-world applications for improved image super-resolution.

Abstract: Multimodal large models have shown excellent ability in addressing image super-resolution in real-world scenarios by leveraging language class as condition information, yet their abilities in degraded images remain limited. In this paper, we first revisit the capabilities of the Recognize Anything Model (RAM) for degraded images by calculating text similarity. We find that directly using contrastive learning to fine-tune RAM in the degraded space is difficult to achieve acceptable results. To address this issue, we employ a degradation selection strategy to propose a Real Embedding Extractor (REE), which achieves significant recognition performance gain on degraded image content through contrastive learning. Furthermore, we use a Conditional Feature Modulator (CFM) to incorporate the high-level information of REE for a powerful Mamba-based network, which can leverage effective pixel information to restore image textures and produce visually pleasing results. Extensive experiments demonstrate that the REE can effectively help image super-resolution networks balance fidelity and perceptual quality, highlighting the great potential of Mamba in real-world applications. The source code of this work will be made publicly available at: https://github.com/nathan66666/DACESR.git

[126] AoE: Always-on Egocentric Human Video Collection for Embodied AI

Bowen Yang, Zishuo Li, Yang Sun, Changtao Miao, Yifan Yang, Man Luo, Xiaotong Yan, Feng Jiang, Jinchuan Shi, Yankai Fu, Ning Chen, Junkai Zhao, Pengwei Wang, Guocai Yao, Shanghang Zhang, Hao Chen, Zhe Li, Kai Zhu

Main category: cs.CV

TL;DR: AoE system enables scalable egocentric data collection using smartphones and neck-mounted holders for embodied AI training, with cloud-edge processing for automated labeling.

DetailsMotivation: Current embodied foundation models lack scalable real-world interaction data due to high infrastructure costs and hardware dependencies. Human agents with smartphones offer low-cost, sustainable solution for global data collection.

Method: 1) Neck-mounted smartphone holder for egocentric capture, 2) Cloud-edge architecture with mobile app for real-time processing, 3) Automated cloud pipelines for labeling/filtering raw videos into training data, 4) Distributed collection system accessible to anyone.

Result: System enables high-quality egocentric data collection at scale. Evaluation shows data preprocessing quality and downstream task performance improvements, with egocentric data boosting real-world generalization.

Conclusion: AoE system provides scalable, low-cost solution for embodied AI data collection by leveraging human agents and smartphones, addressing data scarcity for foundation models through distributed egocentric video collection.

Abstract: Embodied foundation models require large-scale, high-quality real-world interaction data for pre-training and scaling. However, existing data collection methods suffer from high infrastructure costs, complex hardware dependencies, and limited interaction scope, making scalable expansion challenging. In fact, humans themselves are ideal physically embodied agents. Therefore, obtaining egocentric real-world interaction data from globally distributed “human agents” offers advantages of low cost and sustainability. To this end, we propose the Always-on Egocentric (AoE) data collection system, which aims to simplify hardware dependencies by leveraging humans themselves and their smartphones, enabling low-cost, highly efficient, and scene-agnostic real-world interaction data collection to address the challenge of data scarcity. Specifically, we first employ an ergonomic neck-mounted smartphone holder to enable low-barrier, large-scale egocentric data collection through a cloud-edge collaborative architecture. Second, we develop a cross-platform mobile APP that leverages on-device compute for real-time processing, while the cloud hosts automated labeling and filtering pipelines that transform raw videos into high-quality training data. Finally, the AoE system supports distributed Ego video data collection by anyone, anytime, and anywhere. We evaluate AoE on data preprocessing quality and downstream tasks, demonstrating that high-quality egocentric data significantly boosts real-world generalization.

[127] SelfOccFlow: Towards end-to-end self-supervised 3D Occupancy Flow prediction

Xavier Timoneda, Markus Herb, Fabian Duerr, Daniel Goehring

Main category: cs.CV

TL;DR: Self-supervised 3D occupancy flow estimation for autonomous driving without human annotations or external flow supervision.

DetailsMotivation: Existing methods for 3D occupancy and motion estimation rely on expensive annotations (3D occupancy/flow labels, velocity labels, pretrained optical flow models), which limits scalability and practical deployment.

Method: Proposes a self-supervised method that disentangles the scene into static and dynamic signed distance fields, learns motion implicitly through temporal aggregation, and introduces a strong self-supervised flow cue based on feature cosine similarities.
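
The cosine-similarity flow cue can be pictured in one dimension: match each feature at time t to its most similar location at t+1 and read off the index offset. A toy sketch of the idea, not the paper's 3D formulation:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (epsilon guards zero norms)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-8)

def flow_cue(feats_t, feats_t1):
    """For each feature at time t, find the best-matching position at t+1
    by cosine similarity; the index offset serves as a crude flow signal."""
    flow = []
    for i, f in enumerate(feats_t):
        sims = [cosine(f, g) for g in feats_t1]
        j = max(range(len(sims)), key=sims.__getitem__)
        flow.append(j - i)
    return flow
```

Because the matching uses the network's own features, the cue needs no external optical-flow model, which is the point of the self-supervised design.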

Result: Demonstrates efficacy on SemanticKITTI, KITTI-MOT, and nuScenes datasets, showing competitive performance without requiring expensive annotations.

Conclusion: The method provides a practical solution for 3D occupancy flow estimation in autonomous driving by eliminating dependency on human annotations while maintaining competitive performance.

Abstract: Estimating 3D occupancy and motion at the vehicle’s surroundings is essential for autonomous driving, enabling situational awareness in dynamic environments. Existing approaches jointly learn geometry and motion but rely on expensive 3D occupancy and flow annotations, velocity labels from bounding boxes, or pretrained optical flow models. We propose a self-supervised method for 3D occupancy flow estimation that eliminates the need for human-produced annotations or external flow supervision. Our method disentangles the scene into separate static and dynamic signed distance fields and learns motion implicitly through temporal aggregation. Additionally, we introduce a strong self-supervised flow cue derived from features’ cosine similarities. We demonstrate the efficacy of our 3D occupancy flow method on SemanticKITTI, KITTI-MOT, and nuScenes.
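The paper's self-supervised flow cue is built from feature cosine similarities across time. A minimal sketch of that idea (on a 2D feature grid rather than a 3D occupancy field; the function and array names are hypothetical, not the authors' code): for each cell, score a set of candidate displacements by the cosine similarity between its feature and the displaced feature in the next frame, and keep the best-scoring displacement.

```python
import numpy as np

def cosine_flow_cue(feat_t, feat_t1, candidates):
    """For each cell of feat_t (H, W, C), score candidate displacements
    against feat_t1 by cosine similarity and return the best one."""
    H, W, _ = feat_t.shape
    flow = np.zeros((H, W, 2), dtype=int)
    best = np.full((H, W), -np.inf)
    for dy, dx in candidates:
        # shifted[p] = feat_t1[p + d], zero-padded at the borders
        shifted = np.zeros_like(feat_t1)
        yd = slice(max(-dy, 0), H - max(dy, 0))
        xd = slice(max(-dx, 0), W - max(dx, 0))
        ys = slice(max(dy, 0), H - max(-dy, 0))
        xs = slice(max(dx, 0), W - max(-dx, 0))
        shifted[yd, xd] = feat_t1[ys, xs]
        num = (feat_t * shifted).sum(-1)
        den = np.linalg.norm(feat_t, axis=-1) * np.linalg.norm(shifted, axis=-1) + 1e-8
        sim = num / den
        update = sim > best
        best = np.where(update, sim, best)
        flow[update] = (dy, dx)
    return flow
```

In the paper this cue supervises a learned flow field; the exhaustive candidate search above is only the simplest way to show why matching cosine similarities recovers motion.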

[128] Toward Guarantees for Clinical Reasoning in Vision Language Models via Formal Verification

Vikash Singh, Debargha Ganguly, Haotian Yu, Chengwei Zhou, Prerna Singh, Brandon Lee, Vipin Chaudhary, Gourav Datta

Main category: cs.CV

TL;DR: A neurosymbolic verification framework that audits vision-language models for logical consistency in radiology report generation, using SMT solvers to detect hallucinations and missing conclusions.

DetailsMotivation: Vision-language models for radiology report generation often produce logical inconsistencies where diagnostic impressions aren't supported by perceptual findings or miss logically entailed conclusions. Traditional lexical metrics fail to capture these deductive failures in reference-free settings.

Method: Introduced a neurosymbolic verification framework that autoformalizes free-text radiographic findings into structured propositional evidence, then uses an SMT solver (Z3) with a clinical knowledge base to verify whether each diagnostic claim is mathematically entailed, hallucinated, or omitted.

Result: Evaluated seven VLMs across five chest X-ray benchmarks, exposing distinct reasoning failure modes like conservative observation and stochastic hallucination that traditional metrics miss. Enforcing solver-backed entailment systematically eliminated unsupported hallucinations, significantly increasing diagnostic soundness and precision.

Conclusion: The neurosymbolic verification framework provides rigorous post-hoc guarantees for clinical reasoning in VLM-generated reports, addressing logical consistency issues that lexical metrics cannot detect, thereby improving reliability of generative clinical assistants.

Abstract: Vision-language models (VLMs) show promise in drafting radiology reports, yet they frequently suffer from logical inconsistencies, generating diagnostic impressions unsupported by their own perceptual findings or missing logically entailed conclusions. Standard lexical metrics heavily penalize clinical paraphrasing and fail to capture these deductive failures in reference-free settings. Toward guarantees for clinical reasoning, we introduce a neurosymbolic verification framework that deterministically audits the internal consistency of VLM-generated reports. Our pipeline autoformalizes free-text radiographic findings into structured propositional evidence, utilizing an SMT solver (Z3) and a clinical knowledge base to verify whether each diagnostic claim is mathematically entailed, hallucinated, or omitted. Evaluating seven VLMs across five chest X-ray benchmarks, our verifier exposes distinct reasoning failure modes, such as conservative observation and stochastic hallucination, that remain invisible to traditional metrics. On labeled datasets, enforcing solver-backed entailment acts as a rigorous post-hoc guarantee, systematically eliminating unsupported hallucinations to significantly increase diagnostic soundness and precision in generative clinical assistants.
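The paper performs the entailment check with the Z3 SMT solver; as a dependency-free stand-in, the same three-way audit (supported / hallucinated / omitted) can be sketched with forward chaining over Horn-style rules. The rule contents and names below are hypothetical illustrations, not the paper's clinical knowledge base.

```python
def closure(findings, rules):
    """Forward-chain Horn rules (antecedent set -> consequent) to a fixed
    point: everything deducible from the findings under the knowledge base."""
    known = set(findings)
    changed = True
    while changed:
        changed = False
        for ante, cons in rules:
            if cons not in known and ante <= known:
                known.add(cons)
                changed = True
    return known

def audit_report(findings, reported_claims, rules, diagnoses):
    """Classify each reported claim as supported or hallucinated, and flag
    entailed diagnoses the report omitted."""
    entailed = closure(findings, rules)
    return {
        "supported":    [c for c in reported_claims if c in entailed],
        "hallucinated": [c for c in reported_claims if c not in entailed],
        "omitted":      [d for d in diagnoses
                         if d in entailed and d not in reported_claims],
    }

# Hypothetical mini knowledge base (illustrative only, not clinical advice)
rules = [
    ({"air_bronchograms", "lobar_opacity"}, "consolidation"),
    ({"consolidation", "fever_reported"}, "pneumonia_likely"),
]
```

With Z3, "entailed" would instead be checked by asserting the knowledge base, the findings, and the negated claim, and testing for unsatisfiability; the Horn-clause closure here is the propositional special case.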

[129] Experience-Guided Self-Adaptive Cascaded Agents for Breast Cancer Screening and Diagnosis with Reduced Biopsy Referrals

Pramit Saha, Mohammad Alsharid, Joshua Strong, J. Alison Noble

Main category: cs.CV

TL;DR: BUSD-Agent: A cascaded multi-agent framework for breast ultrasound screening and diagnosis that uses experience-guided retrieval to reduce unnecessary diagnostic escalations and biopsy referrals through selective decision-making.

DetailsMotivation: To reduce diagnostic escalation and unnecessary biopsy referrals in breast ultrasound screening by creating an intelligent system that can selectively filter cases based on risk assessment, similar to how experienced clinicians work.

Method: Two-stage cascaded multi-agent framework: 1) Lightweight screening clinic agent filters benign/normal cases using classification models, 2) Diagnostic clinic agent handles escalated cases with richer perception tools. Uses memory bank of past pathology-confirmed cases with decision trajectories for retrieval-conditioned in-context adaptation.

Result: Reduced diagnostic escalation from 84.95% to 58.72% and overall biopsy referrals from 59.50% to 37.08% compared to same architecture without trajectory conditioning. Improved average screening specificity by 68.48% and diagnostic specificity by 6.33% across 10 breast ultrasound datasets.

Conclusion: Experience-guided retrieval enables dynamic adjustment of model trust and escalation thresholds without parameter updates, significantly reducing unnecessary procedures while maintaining diagnostic accuracy in breast ultrasound screening.

Abstract: We propose an experience-guided cascaded multi-agent framework for Breast Ultrasound Screening and Diagnosis, called BUSD-Agent, that aims to reduce diagnostic escalation and unnecessary biopsy referrals. Our framework models screening and diagnosis as a two-stage, selective decision-making process. A lightweight ‘screening clinic’ agent, restricted to classification models as tools, selectively filters out benign and normal cases from further diagnostic escalation when malignancy risk and uncertainty are estimated as low. Cases that have higher risks are escalated to the ‘diagnostic clinic’ agent, which integrates richer perception and radiological description tools to make a secondary decision on biopsy referral. To improve agent performance, past records of pathology-confirmed outcomes along with image embeddings, model predictions, and historical agent actions are stored in a memory bank as structured decision trajectories. For each new case, BUSD-Agent retrieves similar past cases based on image, model response and confidence similarity to condition the agent’s current decision policy. This enables retrieval-conditioned in-context adaptation that dynamically adjusts model trust and escalation thresholds from prior experiences without parameter updates. Evaluation across 10 breast ultrasound datasets shows that the proposed experience-guided workflow reduces diagnostic escalation in BUSD-Agent from 84.95% to 58.72% and overall biopsy referrals from 59.50% to 37.08%, compared to the same architecture without trajectory conditioning, while improving average screening specificity by 68.48% and diagnostic specificity by 6.33%.
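The retrieval step of the memory bank can be sketched as a nearest-neighbor lookup over case embeddings, returning past decision trajectories with their similarity scores. This is a minimal illustration under the assumption of cosine similarity over a single embedding per case; the real system also matches on model responses and confidence.

```python
import numpy as np

def retrieve_similar_cases(query_emb, memory_embs, memory_records, k=3):
    """Rank past pathology-confirmed cases by cosine similarity to the
    query embedding and return the top-k decision trajectories."""
    q = query_emb / (np.linalg.norm(query_emb) + 1e-8)
    M = memory_embs / (np.linalg.norm(memory_embs, axis=1, keepdims=True) + 1e-8)
    sims = M @ q
    top = np.argsort(-sims)[:k]
    return [(memory_records[i], float(sims[i])) for i in top]
```

The retrieved trajectories then condition the agent's prompt in context, which is what lets escalation thresholds shift without any parameter updates.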

[130] AgenticOCR: Parsing Only What You Need for Efficient Retrieval-Augmented Generation

Zhengren Wang, Dongsheng Ma, Huaping Zhong, Jiayu Li, Wentao Zhang, Bin Wang, Conghui He

Main category: cs.CV

TL;DR: AgenticOCR introduces a query-driven OCR system that dynamically extracts only relevant document regions instead of processing entire pages, improving visual document RAG efficiency and accuracy.

DetailsMotivation: Current multimodal RAG systems face bottlenecks with page-level chunking and retrieval, which overload generators with excessive context and dilute salient evidence. Compressing information-rich pages into limited visual tokens increases hallucination risks.

Method: Transforms OCR from static full-text processing to query-driven, on-demand extraction. Uses autonomous document layout analysis in a “thinking with images” manner to identify and selectively recognize regions of interest, performing on-demand decompression of visual tokens where needed.

Result: AgenticOCR improves both efficiency and accuracy of visual RAG systems, achieving expert-level performance in long document understanding. It effectively decouples retrieval granularity from rigid page-level chunking.

Conclusion: AgenticOCR serves as a “third building block” for visual document RAG stacks alongside standard Embedding and Reranking modules, enabling dynamic parsing that enhances multimodal document understanding.

Abstract: The expansion of retrieval-augmented generation (RAG) into multimodal domains has intensified the challenge for processing complex visual documents, such as financial reports. While page-level chunking and retrieval is a natural starting point, it creates a critical bottleneck: delivering entire pages to the generator introduces excessive extraneous context. This not only overloads the generator’s attention mechanism but also dilutes the most salient evidence. Moreover, compressing these information-rich pages into a limited visual token budget further increases the risk of hallucinations. To address this, we introduce AgenticOCR, a dynamic parsing paradigm that transforms optical character recognition (OCR) from a static, full-text process into a query-driven, on-demand extraction system. By autonomously analyzing document layout in a “thinking with images” manner, AgenticOCR identifies and selectively recognizes regions of interest. This approach performs on-demand decompression of visual tokens precisely where needed, effectively decoupling retrieval granularity from rigid page-level chunking. AgenticOCR has the potential to serve as the “third building block” of the visual document RAG stack, operating alongside and enhancing standard Embedding and Reranking modules. Experimental results demonstrate that AgenticOCR improves both the efficiency and accuracy of visual RAG systems, achieving expert-level performance in long document understanding. Code and models are available at https://github.com/OpenDataLab/AgenticOCR.

[131] SegMate: Asymmetric Attention-Based Lightweight Architecture for Efficient Multi-Organ Segmentation

Andrei-Alexandru Bunea, Dan-Matei Popovici, Radu Tudor Ionescu

Main category: cs.CV

TL;DR: SegMate is an efficient 2.5D medical image segmentation framework that achieves SOTA accuracy while significantly reducing computational requirements through asymmetric architectures, attention mechanisms, and multi-scale feature fusion.

DetailsMotivation: Current medical image segmentation models achieve high accuracy but require substantial computational resources, limiting deployment in resource-constrained clinical settings. There's a need for efficient models that maintain accuracy while reducing computational demands.

Method: A 2.5D framework integrating asymmetric architectures, attention mechanisms, multi-scale feature fusion, slice-based positional conditioning, and multi-task optimization. Tested with three backbones (EfficientNetV2-M, MambaOut-Tiny, FastViT-T12) on three medical datasets.

Result: Reduces computation by up to 2.5x and memory footprint by up to 2.1x while achieving ~1% performance gains. Achieves 93.51% Dice score on TotalSegmentator with only 295MB peak GPU memory. Strong generalization in zero-shot cross-dataset evaluations.

Conclusion: SegMate provides an efficient solution for medical image segmentation that balances accuracy and computational efficiency, enabling deployment in resource-constrained clinical environments.

Abstract: State-of-the-art models for medical image segmentation achieve excellent accuracy but require substantial computational resources, limiting deployment in resource-constrained clinical settings. We present SegMate, an efficient 2.5D framework that achieves state-of-the-art accuracy, while considerably reducing computational requirements. Our efficient design is the result of meticulously integrating asymmetric architectures, attention mechanisms, multi-scale feature fusion, slice-based positional conditioning, and multi-task optimization. We demonstrate the efficiency-accuracy trade-off of our framework across three modern backbones (EfficientNetV2-M, MambaOut-Tiny, FastViT-T12). We perform experiments on three datasets: TotalSegmentator, SegTHOR and AMOS22. Compared with the vanilla models, SegMate reduces computation (GFLOPs) by up to 2.5x and memory footprint (VRAM) by up to 2.1x, while generally registering performance gains of around 1%. On TotalSegmentator, we achieve a Dice score of 93.51% with only 295MB peak GPU memory. Zero-shot cross-dataset evaluations on SegTHOR and AMOS22 demonstrate strong generalization, with Dice scores of up to 86.85% and 89.35%, respectively. We release our open-source code at https://github.com/andreibunea99/SegMate.

[132] Half-Truths Break Similarity-Based Retrieval

Bora Kargi, Arnas Uselis, Seong Joon Oh

Main category: cs.CV

TL;DR: CS-CLIP addresses CLIP’s vulnerability to “half-truths” where adding plausible but incorrect details to descriptions increases similarity scores, by introducing component-level supervision during fine-tuning.

DetailsMotivation: CLIP-style models often violate intuitive similarity scoring: appending incorrect but plausible details to correct descriptions can increase similarity scores rather than decrease them, revealing weak supervision on caption components.

Method: Proposes CS-CLIP which decomposes captions into entity and relation units, constructs minimally edited foils for each unit, and fine-tunes CLIP to score correct units above foils while preserving standard dual-encoder inference.

Result: CS-CLIP raises half-truth accuracy from 40.6% to 69.3% on COCO and improves average performance on established compositional benchmarks by 5.7 points.

Conclusion: Reducing half-truth errors through component-level supervision improves compositional understanding in vision-language models while maintaining standard inference efficiency.

Abstract: When a text description is extended with an additional detail, image-text similarity should drop if that detail is wrong. We show that CLIP-style dual encoders often violate this intuition: appending a plausible but incorrect object or relation to an otherwise correct description can increase the similarity score. We call such cases half-truths. On COCO, CLIP prefers the correct shorter description only 40.6% of the time, and performance drops to 32.9% when the added detail is a relation. We trace this vulnerability to weak supervision on caption parts: contrastive training aligns full sentences but does not explicitly enforce that individual entities and relations are grounded. We propose CS-CLIP (Component-Supervised CLIP), which decomposes captions into entity and relation units, constructs a minimally edited foil for each unit, and fine-tunes the model to score the correct unit above its foil while preserving standard dual-encoder inference. CS-CLIP raises half-truth accuracy to 69.3% and improves average performance on established compositional benchmarks by 5.7 points, suggesting that reducing half-truth errors aligns with broader gains in compositional understanding. Code is publicly available at: https://github.com/kargibora/CS-CLIP
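The component-level supervision amounts to a margin objective per caption unit: the image should score higher with the correct unit than with its minimally edited foil. A hedged sketch of such a hinge loss over precomputed embeddings (names hypothetical; the paper fine-tunes the CLIP encoders end-to-end rather than operating on frozen vectors):

```python
import numpy as np

def component_margin_loss(img_emb, unit_embs, foil_embs, margin=0.2):
    """Hinge loss pushing each correct caption unit's image similarity
    above its minimally edited foil by at least `margin`."""
    def cos(a, b):
        return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    losses = [max(0.0, margin - cos(img_emb, u) + cos(img_emb, f))
              for u, f in zip(unit_embs, foil_embs)]
    return float(np.mean(losses))
```

Because only the training objective changes, inference stays the standard dual-encoder dot product, which is what keeps CS-CLIP drop-in compatible with retrieval pipelines.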

[133] The Geometry of Transfer: Unlocking Medical Vision Manifolds for Training-Free Model Ranking

Jiaqi Tang, Shaoyang Zhang, Xiaoqi Wang, Jiaying Zhou, Yang Liu, Qingchao Chen

Main category: cs.CV

TL;DR: A topology-driven framework for evaluating transferability of medical foundation models to segmentation tasks, outperforming existing metrics by 31% without requiring fine-tuning.

DetailsMotivation: Existing transferability estimation metrics for foundation models are designed for classification tasks and fail to capture the topological complexity needed for medical segmentation tasks, creating a computational bottleneck in model selection.

Method: Proposes a topology-driven framework with three components: 1) Global Representation Topology Divergence using Minimum Spanning Trees, 2) Local Boundary-Aware Topological Consistency for anatomical boundaries, and 3) Task-Adaptive Fusion that dynamically combines global and local metrics.

Result: Validated on the OpenMind benchmark across diverse anatomical targets and SSL foundation models, achieving around 31% relative improvement in weighted Kendall correlation compared to state-of-the-art baselines.

Conclusion: Provides a robust, training-free proxy for efficient model selection in medical segmentation without the computational cost of fine-tuning, addressing the bottleneck in selecting optimal medical foundation models.

Abstract: The advent of large-scale self-supervised learning (SSL) has produced a vast zoo of medical foundation models. However, selecting optimal medical foundation models for specific segmentation tasks remains a computational bottleneck. Existing Transferability Estimation (TE) metrics, primarily designed for classification, rely on global statistical assumptions and fail to capture the topological complexity essential for dense prediction. We propose a novel Topology-Driven Transferability Estimation framework that evaluates manifold tractability rather than statistical overlap. Our approach introduces three components: (1) Global Representation Topology Divergence (GRTD), utilizing Minimum Spanning Trees to quantify feature-label structural isomorphism; (2) Local Boundary-Aware Topological Consistency (LBTC), which assesses manifold separability specifically at critical anatomical boundaries; and (3) Task-Adaptive Fusion, which dynamically integrates global and local metrics based on the semantic cardinality of the target task. Validated on the large-scale OpenMind benchmark across diverse anatomical targets and SSL foundation models, our approach significantly outperforms state-of-the-art baselines by around 31% relative improvement in weighted Kendall correlation, providing a robust, training-free proxy for efficient model selection without the cost of fine-tuning. The code will be made publicly available upon acceptance.
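The MST-based global component can be illustrated in a few lines: build the minimum spanning tree over each space's pairwise distances and compare the resulting edge-weight profiles. This is a simplified stand-in for GRTD (the paper's divergence is more refined than an L1 gap between sorted edge weights); all names are illustrative.

```python
import numpy as np

def mst_edge_weights(dist):
    """Prim's algorithm on a dense distance matrix; returns the sorted
    MST edge weights, a compact summary of the point cloud's topology."""
    n = len(dist)
    in_tree = [0]
    best = dist[0].astype(float).copy()
    weights = []
    for _ in range(n - 1):
        best[in_tree] = np.inf          # never re-pick tree vertices
        j = int(np.argmin(best))        # cheapest edge into the tree
        weights.append(float(best[j]))
        in_tree.append(j)
        best = np.minimum(best, dist[j])
    return sorted(weights)

def topology_divergence(feat_dist, label_dist):
    """Crude stand-in for representation-topology divergence: L1 gap
    between the sorted MST edge-weight profiles of two spaces."""
    a, b = mst_edge_weights(feat_dist), mst_edge_weights(label_dist)
    return float(np.abs(np.array(a) - np.array(b)).sum())
```

A low divergence means a backbone's feature geometry already mirrors the label structure, which is the signal the framework uses to rank models without fine-tuning.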

[134] Leveraging Geometric Prior Uncertainty and Complementary Constraints for High-Fidelity Neural Indoor Surface Reconstruction

Qiyu Feng, Jiwei Shan, Shing Shin Cheng, Hesheng Wang

Main category: cs.CV

TL;DR: GPU-SDF improves neural implicit surface reconstruction by explicitly estimating geometric prior uncertainty and using it to modulate prior influence, with complementary constraints for fine detail recovery.

DetailsMotivation: Existing neural implicit surface reconstruction methods struggle with fine details like thin structures due to unreliable geometric priors. Current approaches use implicit uncertainty filtering which is indirect and inefficient, and masking supervision in high-uncertainty regions leads to under-constrained optimization.

Method: Proposes GPU-SDF with: 1) self-supervised module that explicitly estimates prior uncertainty without auxiliary networks, 2) uncertainty-guided loss that modulates prior influence rather than discarding it, 3) edge distance field for boundary supervision, and 4) multi-view consistency regularization for geometric coherence.

Result: Extensive experiments confirm GPU-SDF improves reconstruction of fine details and serves as a plug-and-play enhancement for existing frameworks.

Conclusion: GPU-SDF addresses limitations in neural implicit surface reconstruction by leveraging geometric prior uncertainty and complementary constraints to better recover fine details and complex geometries.

Abstract: Neural implicit surface reconstruction with signed distance function has made significant progress, but recovering fine details such as thin structures and complex geometries remains challenging due to unreliable or noisy geometric priors. Existing approaches rely on implicit uncertainty that arises during optimization to filter these priors, which is indirect and inefficient, and masking supervision in high-uncertainty regions further leads to under-constrained optimization. To address these issues, we propose GPU-SDF, a neural implicit framework for indoor surface reconstruction that leverages geometric prior uncertainty and complementary constraints. We introduce a self-supervised module that explicitly estimates prior uncertainty without auxiliary networks. Based on this estimation, we design an uncertainty-guided loss that modulates prior influence rather than discarding it, thereby retaining weak but informative cues. To address regions with high prior uncertainty, GPU-SDF further incorporates two complementary constraints: an edge distance field that strengthens boundary supervision and a multi-view consistency regularization that enforces geometric coherence. Extensive experiments confirm that GPU-SDF improves the reconstruction of fine details and serves as a plug-and-play enhancement for existing frameworks. Source code will be available at https://github.com/IRMVLab/GPU-SDF
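The "modulate rather than discard" idea behind the uncertainty-guided loss can be sketched with an aleatoric-style weighting: the prior residual is down-weighted where estimated uncertainty is high, and a log-variance term prevents the network from inflating uncertainty everywhere. This is a common formulation assumed for illustration, not necessarily the paper's exact loss.

```python
import numpy as np

def uncertainty_guided_prior_loss(pred, prior, log_var):
    """Down-weight the prior residual where estimated uncertainty is high
    instead of hard-masking it; the log-variance term keeps the predicted
    uncertainty from growing without bound."""
    residual = (pred - prior) ** 2
    return float(np.mean(np.exp(-log_var) * residual + log_var))
```

Under this weighting, weak but informative priors still contribute a small gradient in high-uncertainty regions, which is exactly what hard masking loses.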

[135] Micro-expression Recognition Based on Dual-branch Feature Extraction and Fusion

Mingjie Zhang, Bo Li, Wanting Liu, Hongyan Cui, Yue Li, Qingwen Li, Hong Li, Ge Gao

Main category: cs.CV

TL;DR: A dual-branch network with parallel attention for micro-expression recognition, combining residual and Inception networks with adaptive feature fusion to handle subtle facial movements.

DetailsMotivation: Micro-expressions are transient and subtle facial movements that challenge existing optical flow-based recognition methods, requiring more robust feature extraction approaches.

Method: Proposes a dual-branch micro-expression feature extraction network with parallel attention: 1) residual network to address gradient vanishing and network degradation, 2) Inception network to enhance representation and suppress irrelevant regions, 3) adaptive feature fusion module to integrate dual-branch features.

Result: Achieves 74.67% accuracy on CASME II dataset, outperforming LBP-TOP by 11.26% and MSMMT by 3.36%.

Conclusion: The proposed dual-branch network with parallel attention effectively addresses micro-expression recognition challenges and demonstrates superior performance compared to existing methods.

Abstract: Micro-expressions, characterized by transience and subtlety, pose challenges to existing optical flow-based recognition methods. To address this, the paper proposes a dual-branch micro-expression feature extraction network integrated with parallel attention. Key contributions include: 1) a residual network designed to alleviate gradient vanishing and network degradation; 2) an Inception network constructed to enhance model representation and suppress interference from irrelevant regions; 3) an adaptive feature fusion module developed to integrate dual-branch features. Experiments on the CASME II dataset demonstrate that the proposed method achieves 74.67% accuracy, outperforming LBP-TOP (by 11.26%), MSMMT (by 3.36%), and other comparative methods.
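The adaptive fusion module can be sketched as a learned gate that scores each branch from the concatenated features and mixes them with softmax weights. This is one plausible reading of "adaptive feature fusion" for illustration; the gate shape and names are assumptions, not the paper's architecture.

```python
import numpy as np

def adaptive_fuse(f_res, f_inc, gate_w):
    """Score each branch from the concatenated features, turn the scores
    into softmax mixing weights, and return the weighted sum."""
    scores = gate_w @ np.concatenate([f_res, f_inc])   # one score per branch
    e = np.exp(scores - scores.max())                  # stable softmax
    a = e / e.sum()
    return a[0] * f_res + a[1] * f_inc, a
```

The gate lets the network lean on the residual branch or the Inception branch per sample, rather than committing to a fixed blend.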

[136] AHAP: Reconstructing Arbitrary Humans from Arbitrary Perspectives with Geometric Priors

Xiaozhen Qiao, Wenjia Wang, Zhiyuan Zhao, Jiacheng Sun, Ping Luo, Hongyuan Zhang, Xuelong Li

Main category: cs.CV

TL;DR: AHAP is a feed-forward framework for reconstructing 3D humans from arbitrary camera perspectives without requiring camera calibration, using multi-view geometry fusion for human association, reconstruction, and localization.

DetailsMotivation: Traditional multi-view 3D human reconstruction requires pre-calibration (checkerboards or MVS algorithms), limiting scalability and applicability in diverse real-world scenarios. The authors aim to develop a calibration-free approach.

Method: Uses Cross-View Identity Association module with learnable person queries and soft assignment supervised by contrastive learning. Human Head fuses cross-view features and scene context for SMPL prediction with cross-view reprojection losses. Multi-view geometry eliminates depth ambiguity through triangulation.

Result: Achieves competitive performance on world-space human reconstruction and camera pose estimation on EgoHumans and EgoExo4D datasets, while being 180× faster than optimization-based approaches.

Conclusion: AHAP demonstrates effective calibration-free 3D human reconstruction from arbitrary perspectives by leveraging multi-view geometry fusion, offering practical advantages in speed and applicability.

Abstract: Reconstructing 3D humans from images captured at multiple perspectives typically requires pre-calibration, like using checkerboards or MVS algorithms, which limits scalability and applicability in diverse real-world scenarios. In this work, we present AHAP (Reconstructing Arbitrary Humans from Arbitrary Perspectives), a feed-forward framework for reconstructing arbitrary humans from arbitrary camera perspectives without requiring camera calibration. Our core contribution lies in the effective fusion of multi-view geometry to assist human association, reconstruction, and localization. Specifically, we use a Cross-View Identity Association module through learnable person queries and soft assignment, supervised by contrastive learning to resolve cross-view human identity association. A Human Head fuses cross-view features and scene context for SMPL prediction, guided by cross-view reprojection losses to enforce body pose consistency. Additionally, multi-view geometry eliminates the depth ambiguity inherent in monocular methods, providing more precise 3D human localization through multi-view triangulation. Experiments on EgoHumans and EgoExo4D demonstrate that AHAP achieves competitive performance on both world-space human reconstruction and camera pose estimation, while being 180× faster than optimization-based approaches.
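The triangulation step that resolves monocular depth ambiguity is standard multi-view geometry; a minimal two-view DLT sketch (the paper generalizes to many views and estimates the camera poses itself):

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) two-view triangulation: stack the constraints
    x_i × (P_i X) = 0 from both views and solve for X by SVD."""
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                 # null vector of A (homogeneous point)
    return X[:3] / X[3]
```

With two or more rays, the intersection fixes the point's depth, which a single view cannot; that is the ambiguity the abstract refers to.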

[137] CC-VQA: Conflict- and Correlation-Aware Method for Mitigating Knowledge Conflict in Knowledge-Based Visual Question Answering

Yuyang Hong, Jiaqi Gu, Yujin Lou, Lubin Fan, Qi Yang, Ying Wang, Kun Ding, Yue Wu, Shiming Xiang, Jieping Ye

Main category: cs.CV

TL;DR: CC-VQA is a training-free method for knowledge-based visual question answering that addresses conflicts between static model knowledge and dynamically retrieved information through vision-centric conflict reasoning and correlation-guided encoding/decoding.

DetailsMotivation: Current KB-VQA methods suffer from conflicts between static parametric knowledge in VLMs and dynamically retrieved information, leading to outputs that either ignore retrieved contexts or exhibit inconsistent integration. Existing conflict mitigation methods are adapted from language-based approaches and neglect visual information while suffering from redundant retrieved contexts.

Method: Two core components: (1) Vision-Centric Contextual Conflict Reasoning performs visual-semantic conflict analysis across internal and external knowledge contexts; (2) Correlation-Guided Encoding and Decoding features positional encoding compression for low-correlation statements and adaptive decoding using correlation-weighted conflict scoring.

Result: Achieves state-of-the-art performance on E-VQA, InfoSeek, and OK-VQA benchmarks with absolute accuracy improvements of 3.3% to 6.4% compared to existing methods.

Conclusion: CC-VQA effectively addresses knowledge conflicts in KB-VQA by incorporating visual information into conflict analysis and using correlation-guided mechanisms, demonstrating significant improvements over existing approaches.

Abstract: Knowledge-based visual question answering (KB-VQA) demonstrates significant potential for handling knowledge-intensive tasks. However, conflicts arise between static parametric knowledge in vision language models (VLMs) and dynamically retrieved information due to the static model knowledge from pre-training. The outputs either ignore retrieved contexts or exhibit inconsistent integration with parametric knowledge, posing substantial challenges for KB-VQA. Current knowledge conflict mitigation methods are primarily adapted from language-based approaches, focusing on context-level conflicts through engineered prompting strategies or context-aware decoding mechanisms. However, these methods neglect the critical role of visual information in conflicts and suffer from redundant retrieved contexts, which impair accurate conflict identification and effective mitigation. To address these limitations, we propose CC-VQA: a novel training-free, conflict- and correlation-aware method for KB-VQA. Our method comprises two core components: (1) Vision-Centric Contextual Conflict Reasoning, which performs visual-semantic conflict analysis across internal and external knowledge contexts; and (2) Correlation-Guided Encoding and Decoding, featuring positional encoding compression for low-correlation statements and adaptive decoding using correlation-weighted conflict scoring. Extensive evaluations on E-VQA, InfoSeek, and OK-VQA benchmarks demonstrate that CC-VQA achieves state-of-the-art performance, yielding absolute accuracy improvements of 3.3% to 6.4% compared to existing methods. Code is available at https://github.com/cqu-student/CC-VQA.

[138] GDA-YOLO11: Amodal Instance Segmentation for Occlusion-Robust Robotic Fruit Harvesting

Caner Beldek, Emre Sariyildiz, Son Lam Phung, Gursel Alici

Main category: cs.CV

TL;DR: GDA-YOLO11 amodal segmentation model improves robotic fruit harvesting by detecting occluded fruits and estimating picking points for 3D robotic execution.

DetailsMotivation: Occlusion in robotic fruit harvesting causes significant crop losses due to undetected or poorly localized fruits. Current methods struggle with occlusion handling.

Method: Proposed GDA-YOLO11 with architectural improvements and asymmetric mask loss for amodal instance segmentation. Uses Euclidean distance transform for picking point estimation and 3D projection for robotic execution.

Result: GDA-YOLO11 achieves precision 0.844, recall 0.846, mAP@50 0.914, outperforming YOLO11n. Harvesting success rates: 92.59% (zero occlusion) to 22.22% (high occlusion), with 3.5% improvement under medium/high occlusion.

Conclusion: GDA-YOLO11 enhances occlusion-robust segmentation and streamlines perception-to-action integration for more reliable autonomous agricultural systems.

Abstract: Occlusion remains a critical challenge in robotic fruit harvesting, as undetected or inaccurately localised fruits often result in substantial crop losses. To mitigate this issue, we propose a harvesting framework using a new amodal segmentation model, GDA-YOLO11, which incorporates architectural improvements and an updated asymmetric mask loss. The proposed model is trained on a modified version of a public citrus dataset and evaluated on both the base dataset and occlusion-sensitive subsets with varying occlusion levels. Within the framework, full fruit masks, including invisible regions, are inferred by GDA-YOLO11, and picking points are subsequently estimated using the Euclidean distance transform. These points are then projected into 3D coordinates for robotic harvesting execution. Experiments were conducted using real citrus fruits in a controlled environment simulating occlusion scenarios. Notably, to the best of our knowledge, this study provides the first practical demonstration of amodal instance segmentation in robotic fruit harvesting. GDA-YOLO11 achieves a precision of 0.844, recall of 0.846, mAP@50 of 0.914, and mAP@50:95 of 0.636, outperforming YOLO11n by 5.1%, 1.3%, and 1.0% in precision, mAP@50, and mAP@50:95, respectively. The framework attains harvesting success rates of 92.59%, 85.18%, 48.14%, and 22.22% at zero to high occlusion levels, improving success by 3.5% under medium and high occlusion. These findings demonstrate that GDA-YOLO11 enhances occlusion robust segmentation and streamlines perception-to-action integration, paving the way for more reliable autonomous systems in agriculture.
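The picking-point step (Euclidean distance transform over the amodal mask) can be sketched as choosing the mask pixel farthest from any background pixel, i.e. the center of the largest inscribed circle. In practice one would use `scipy.ndimage.distance_transform_edt`; the brute-force version below keeps the sketch dependency-light and is only suitable for small masks.

```python
import numpy as np

def picking_point(mask):
    """Pick the mask pixel farthest from any background pixel via a
    brute-force Euclidean distance transform (O(inside * outside))."""
    ys, xs = np.nonzero(mask)
    bys, bxs = np.nonzero(~mask.astype(bool))
    if len(bys) == 0:            # no background at all: fall back to centroid
        return int(ys.mean()), int(xs.mean())
    d2 = (ys[:, None] - bys[None, :]) ** 2 + (xs[:, None] - bxs[None, :]) ** 2
    i = int(np.argmax(d2.min(axis=1)))   # max over mask of min dist to background
    return int(ys[i]), int(xs[i])
```

Because the mask is amodal (it covers occluded fruit regions too), the chosen point can lie behind an occluder; the 3D projection step then decides whether it is reachable.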

[139] SwitchCraft: Training-Free Multi-Event Video Generation with Attention Controls

Qianxun Xu, Chenxi Song, Yujun Cai, Chi Zhang

Main category: cs.CV

TL;DR: SwitchCraft is a training-free framework for multi-event video generation that uses Event-Aligned Query Steering and Auto-Balance Strength Solver to improve temporal alignment and scene consistency in text-to-video diffusion models.

DetailsMotivation: Current text-to-video diffusion models are optimized for single-event generation and struggle with multi-event prompts, producing blended or collapsed scenes that break narrative coherence. There's a need for better temporal grounding in multi-event video generation.

Method: SwitchCraft introduces two main components: 1) Event-Aligned Query Steering (EAQS) that steers frame-level attention to align with relevant event prompts, and 2) Auto-Balance Strength Solver (ABSS) that adaptively balances steering strength to preserve temporal consistency and visual fidelity. The framework is training-free.
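
The EAQS idea in the method summary can be sketched as a per-frame logit boost: frames assigned to an event attend more strongly to that event's prompt tokens before the softmax. Everything here (the frame-to-event assignment, token sets, and additive strength) is a hypothetical simplification of the paper's mechanism, with the strength playing the role ABSS would balance adaptively:

```python
def steer_attention(attn_logits, event_of_frame, prompt_tokens_of_event, strength):
    """Boost cross-attention logits of the prompt tokens belonging to
    each frame's active event; other tokens are left untouched."""
    out = []
    for f, row in enumerate(attn_logits):
        active = prompt_tokens_of_event[event_of_frame[f]]
        out.append([l + (strength if t in active else 0.0)
                    for t, l in enumerate(row)])
    return out

# 4 frames, two events; tokens 0-1 describe event 0, tokens 2-3 event 1.
logits = [[0.1, 0.1, 0.1, 0.1]] * 4
steered = steer_attention(logits, [0, 0, 1, 1], {0: {0, 1}, 1: {2, 3}}, 1.0)
print(steered[0], steered[3])
```

After the boost, early frames attend to the first event's tokens and later frames to the second's, which is the uniform-injection failure mode the paper targets.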

Result: Extensive experiments show SwitchCraft substantially improves prompt alignment, event clarity, and scene consistency compared to existing baselines for multi-event video generation.

Conclusion: SwitchCraft offers a simple yet effective training-free solution for multi-event video generation by addressing temporal alignment challenges in text-to-video diffusion models.

Abstract: Recent advances in text-to-video diffusion models have enabled high-fidelity and temporally coherent video synthesis. However, current models are predominantly optimized for single-event generation. When handling multi-event prompts, without explicit temporal grounding, such models often produce blended or collapsed scenes that break the intended narrative. To address this limitation, we present SwitchCraft, a training-free framework for multi-event video generation. Our key insight is that uniform prompt injection across time ignores the correspondence between events and frames. To this end, we introduce Event-Aligned Query Steering (EAQS), which steers frame-level attention to align with relevant event prompts. Furthermore, we propose Auto-Balance Strength Solver (ABSS), which adaptively balances steering strength to preserve temporal consistency and visual fidelity. Extensive experiments demonstrate that SwitchCraft substantially improves prompt alignment, event clarity, and scene consistency compared with existing baselines, offering a simple yet effective solution for multi-event video generation.

[140] Thinking with Images as Continuous Actions: Numerical Visual Chain-of-Thought

Kesen Zhao, Beier Zhu, Junbao Zhou, Xingyu Zhu, Zhongqi Yue, Hanwang Zhang

Main category: cs.CV

TL;DR: NV-CoT enables multimodal LLMs to perform visual reasoning using continuous numerical coordinates instead of text coordinates or fixed patches, improving localization precision and answer accuracy.

DetailsMotivation: Existing MLLMs for region-grounded reasoning suffer either from modality mismatch when using textified coordinates or from limited precision when using fixed-granularity patches, which often require architectural changes.

Method: Proposes Numerical Visual Chain-of-Thought (NV-CoT) that expands MLLM action space to continuous Euclidean space, allowing direct bounding-box coordinate generation with minimal architectural changes. Uses Gaussian/Laplace policies with reparameterized sampling for GRPO-style optimization.
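
A minimal sketch of the continuous-action idea, assuming a diagonal Gaussian policy over normalized box coordinates. The function names and shapes are hypothetical; in the paper the policy head sits inside the MLLM, and the log-probability below is the quantity a GRPO-style update would weight by a group-relative advantage:

```python
import math
import random

def sample_box(mu, log_sigma, eps=None, seed=None):
    """Reparameterized draw from a diagonal Gaussian policy over box
    coordinates (x1, y1, x2, y2): box = mu + exp(log_sigma) * eps with
    eps ~ N(0, I). Writing the sample this way keeps it differentiable
    w.r.t. mu and log_sigma in an autodiff framework."""
    if eps is None:
        rng = random.Random(seed)
        eps = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + math.exp(s) * e for m, s, e in zip(mu, log_sigma, eps)]

def log_prob(box, mu, log_sigma):
    """Diagonal-Gaussian log-density of a sampled box."""
    lp = 0.0
    for b, m, s in zip(box, mu, log_sigma):
        var = math.exp(2.0 * s)
        lp += -0.5 * math.log(2.0 * math.pi * var) - (b - m) ** 2 / (2.0 * var)
    return lp

mu = [0.2, 0.3, 0.7, 0.8]    # predicted box mean, normalized coordinates
log_sigma = [-2.0] * 4       # small exploration noise
box = sample_box(mu, log_sigma, seed=0)
print(box, log_prob(box, mu, log_sigma))
```

With eps fixed at zero the sample collapses to the mean, which is the deterministic behaviour one would expect at evaluation time.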

Result: NV-CoT significantly improves localization precision and final answer accuracy across three benchmarks compared to eight baselines, while accelerating training convergence.

Conclusion: Continuous-action visual reasoning via numerical coordinates is effective for MLLMs, addressing limitations of existing region-grounded reasoning approaches.

Abstract: Recent multimodal large language models (MLLMs) increasingly rely on visual chain-of-thought to perform region-grounded reasoning over images. However, existing approaches ground regions either via textified coordinates, which cause modality mismatch and semantic fragmentation, or via fixed-granularity patches, which limit precise region selection and often require non-trivial architectural changes. In this paper, we propose Numerical Visual Chain-of-Thought (NV-CoT), a framework that enables MLLMs to reason over images using continuous numerical coordinates. NV-CoT expands the MLLM action space from discrete vocabulary tokens to a continuous Euclidean space, allowing models to directly generate bounding-box coordinates as actions with only minimal architectural modification. The framework supports both supervised fine-tuning and reinforcement learning. In particular, we replace categorical token policies with a Gaussian (or Laplace) policy over coordinates and introduce stochasticity via reparameterized sampling, making NV-CoT fully compatible with GRPO-style policy optimization. Extensive experiments on three benchmarks against eight representative visual reasoning baselines demonstrate that NV-CoT significantly improves localization precision and final answer accuracy, while also accelerating training convergence, validating the effectiveness of continuous-action visual reasoning in MLLMs. The code is available at https://github.com/kesenzhao/NV-CoT.

[141] SpikeTrack: A Spike-driven Framework for Efficient Visual Tracking

Qiuyang Zhang, Jiujun Cheng, Qichao Mao, Cong Liu, Yu Fang, Yuhong Li, Mengying Ge, Shangce Gao

Main category: cs.CV

TL;DR: SpikeTrack: A spike-driven framework for energy-efficient RGB object tracking using SNNs with asymmetric design and memory-retrieval modules

DetailsMotivation: Existing SNN tracking frameworks don't fully align with spike-driven computation or leverage spatiotemporal dynamics, creating a trade-off between efficiency and accuracy in RGB visual tracking

Method: Uses asymmetric timestep expansion and unidirectional information flow with a memory-retrieval module that recurrently queries compact memory initialized by templates to retrieve target cues
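
The memory-retrieval step can be sketched as softmax attention of the current query over template-initialized memory slots; the two-slot memory and vector sizes below are illustrative, not the paper's actual module:

```python
import math

def retrieve(query, memory):
    """Recurrent memory retrieval: softmax-attend the query over compact
    memory slots (initialized from the template) to pull target cues."""
    scores = [sum(q * m for q, m in zip(query, slot)) for slot in memory]
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    return [sum(w * slot[i] for w, slot in zip(weights, memory))
            for i in range(len(memory[0]))]

memory = [[1.0, 0.0], [0.0, 1.0]]   # two template-initialized slots
cue = retrieve([2.0, 0.0], memory)  # pulled toward the matching slot
print(cue)
```

Repeating this query at each timestep, with the search-region features as queries, is the "recurrently queries a compact memory" behaviour the summary describes.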

Result: Achieves state-of-the-art performance among SNN-based trackers and remains competitive with advanced ANN trackers; surpasses TransT on the LaSOT dataset while consuming only 1/26 of its energy

Conclusion: SpikeTrack is the first spike-driven framework making RGB tracking both accurate and energy efficient, demonstrating practical viability of SNNs for vision tasks

Abstract: Spiking Neural Networks (SNNs) promise energy-efficient vision, but applying them to RGB visual tracking remains difficult: existing SNN tracking frameworks either do not fully align with spike-driven computation or do not fully leverage neurons’ spatiotemporal dynamics, leading to a trade-off between efficiency and accuracy. To address this, we introduce SpikeTrack, a spike-driven framework for energy-efficient RGB object tracking. SpikeTrack employs a novel asymmetric design that uses asymmetric timestep expansion and unidirectional information flow, harnessing spatiotemporal dynamics while cutting computation. To ensure effective unidirectional information transfer between branches, we design a memory-retrieval module inspired by neural inference mechanisms. This module recurrently queries a compact memory initialized by the template to retrieve target cues and sharpen target perception over time. Extensive experiments demonstrate that SpikeTrack achieves state-of-the-art performance among SNN-based trackers and remains competitive with advanced ANN trackers. Notably, it surpasses TransT on the LaSOT dataset while consuming only 1/26 of its energy. To our knowledge, SpikeTrack is the first spike-driven framework to make RGB tracking both accurate and energy efficient. The code and models are available at https://github.com/faicaiwawa/SpikeTrack.

[142] Venus: Benchmarking and Empowering Multimodal Large Language Models for Aesthetic Guidance and Cropping

Tianxiang Du, Hulingxiao He, Yuxin Peng

Main category: cs.CV

TL;DR: AesGuide introduces a large-scale aesthetic guidance dataset and Venus framework to enhance MLLMs’ ability to provide actionable aesthetic feedback and cropping guidance for photography.

DetailsMotivation: There's a gap between ordinary users and professional photographers who can identify aesthetic issues and provide actionable shooting guidance. Existing MLLMs offer overly positive feedback without identifying issues or providing actionable guidance, lacking aesthetic guidance capability.

Method: Introduces AesGuide dataset with 10,748 photos annotated with aesthetic scores, analyses, and guidance. Proposes Venus, a two-stage framework: 1) empowers MLLMs with aesthetic guidance through progressively complex aesthetic questions, 2) activates aesthetic cropping power via CoT-based rationales.

Result: Venus substantially improves aesthetic guidance capability and achieves state-of-the-art performance in aesthetic cropping, enabling interpretable and interactive aesthetic refinement across photo creation stages.

Conclusion: The work addresses the underexplored domain of aesthetic guidance in computational aesthetics, providing both dataset and framework to enhance MLLMs’ ability to provide professional-level photographic feedback and cropping guidance.

Abstract: The widespread use of smartphones has made photography ubiquitous, yet a clear gap remains between ordinary users and professional photographers, who can identify aesthetic issues and provide actionable shooting guidance during capture. We define this capability as aesthetic guidance (AG) – an essential but largely underexplored domain in computational aesthetics. Existing multimodal large language models (MLLMs) primarily offer overly positive feedback, failing to identify issues or provide actionable guidance. Without AG capability, they cannot effectively identify distracting regions or optimize compositional balance, thus also struggling in aesthetic cropping, which aims to refine photo composition through reframing after capture. To address this, we introduce AesGuide, the first large-scale AG dataset and benchmark with 10,748 photos annotated with aesthetic scores, analyses, and guidance. Building upon it, we propose Venus, a two-stage framework that first empowers MLLMs with AG capability through progressively complex aesthetic questions and then activates their aesthetic cropping power via CoT-based rationales. Extensive experiments show that Venus substantially improves AG capability and achieves state-of-the-art (SOTA) performance in aesthetic cropping, enabling interpretable and interactive aesthetic refinement across both stages of photo creation. Code is available at https://github.com/PKU-ICST-MIPL/Venus_CVPR2026.

[143] Accelerating Masked Image Generation by Learning Latent Controlled Dynamics

Kaiwen Zhu, Quansheng Zeng, Yuandong Pu, Shuo Cao, Xiaohui Li, Yi Xin, Qi Qin, Jiayang Li, Yu Qiao, Jinjin Gu, Yihao Liu

Main category: cs.CV

TL;DR: MIGM-Shortcut accelerates masked image generation models by learning a lightweight model that predicts feature evolution velocity, achieving 4x speedup while maintaining quality.

DetailsMotivation: Masked Image Generation Models (MIGMs) suffer from computational inefficiency due to multiple bi-directional attention steps, with existing acceleration methods showing significant approximation errors under aggressive acceleration rates.

Method: Proposes a lightweight model that incorporates both previous features and sampled tokens to regress the average velocity field of feature evolution, balancing expressivity with computational efficiency compared to the original base model.

Result: Applied to state-of-the-art Lumina-DiMOO, achieves over 4x acceleration of text-to-image generation while maintaining quality, significantly improving the Pareto frontier of masked image generation.

Conclusion: MIGM-Shortcut effectively addresses computational redundancy in MIGMs through a lightweight feature evolution predictor, enabling significant speedups without quality degradation.

Abstract: Masked Image Generation Models (MIGMs) have achieved great success, yet their efficiency is hampered by the multiple steps of bi-directional attention. In fact, there exists notable redundancy in their computation: when sampling discrete tokens, the rich semantics contained in the continuous features are lost. Some existing works attempt to cache the features to approximate future features. However, they exhibit considerable approximation error under aggressive acceleration rates. We attribute this to their limited expressivity and their failure to account for sampling information. To fill this gap, we propose to learn a lightweight model that incorporates both previous features and sampled tokens, and regresses the average velocity field of feature evolution. The model has moderate complexity that suffices to capture the subtle dynamics while remaining lightweight compared to the original base model. We apply our method, MIGM-Shortcut, to two representative MIGM architectures and tasks. In particular, on the state-of-the-art Lumina-DiMOO, it achieves over 4x acceleration of text-to-image generation while maintaining quality, significantly pushing the Pareto frontier of masked image generation. The code and model weights are available at https://github.com/Kaiwen-Zhu/MIGM-Shortcut.
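
The "average velocity" idea can be sketched as a consistency-style jump: instead of stepping through every intermediate feature state, a predictor of the mean velocity over an interval lets the sampler cross it in one update. The names and the linear toy trajectory are illustrative; in the paper a lightweight network regresses this velocity from previous features and sampled tokens:

```python
def average_velocity(feat_start, feat_end, t_start, t_end):
    """Ground-truth average velocity of the feature trajectory
    over [t_start, t_end]: (f_end - f_start) / (t_end - t_start)."""
    dt = t_end - t_start
    return [(fe - fs) / dt for fs, fe in zip(feat_start, feat_end)]

def shortcut_jump(feat, avg_vel, dt):
    """One big step along the trajectory, skipping the intermediate
    full forward passes of the base model."""
    return [f + v * dt for f, v in zip(feat, avg_vel)]

f0, f4 = [0.0, 1.0], [2.0, 0.0]         # features at steps 0 and 4
v = average_velocity(f0, f4, 0.0, 4.0)  # [0.5, -0.25]
print(shortcut_jump(f0, v, 4.0))        # lands on f4: [2.0, 0.0]
```

A perfect velocity predictor reaches the target features exactly; the paper's contribution is making the learned predictor accurate enough that a 4x jump loses little quality.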

[144] Interpretable Debiasing of Vision-Language Models for Social Fairness

Na Min An, Yoonna Jang, Yusuke Hirota, Ryo Hachiuma, Isabelle Augenstein, Hyunjung Shim

Main category: cs.CV

TL;DR: DeBiasLens: An interpretable, model-agnostic framework that uses sparse autoencoders to localize social attribute neurons in Vision-Language Models and mitigate bias by selectively deactivating them.

DetailsMotivation: Vision-Language models have black-box reasoning processes that could lead to unintended social bias. Current debiasing approaches only address surface-level bias signals through post-hoc methods without exploring internal model dynamics.

Method: Uses sparse autoencoders (SAEs) applied to multimodal encoders to localize social attribute neurons. SAEs are trained on facial image or caption datasets without social attribute labels to uncover neurons responsive to specific demographics. Selectively deactivates the social neurons most strongly tied to bias for each group.
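
A toy sketch of the deactivation step, assuming a tied-weight linear "SAE" with an orthonormal dictionary. The paper's SAEs are learned and overcomplete, so this is only the shape of the operation, not its implementation:

```python
def encode(x, D):
    """Project the input onto each dictionary atom."""
    return [sum(xi * di for xi, di in zip(x, atom)) for atom in D]

def decode(z, D):
    """Reconstruct the input as a weighted sum of dictionary atoms."""
    return [sum(z[k] * D[k][i] for k in range(len(D))) for i in range(len(D[0]))]

def debias(x, D, bias_neurons):
    """Encode, zero the latent units flagged as social-attribute
    neurons, and decode; all other features pass through unchanged."""
    z = encode(x, D)
    z = [0.0 if k in bias_neurons else zk for k, zk in enumerate(z)]
    return decode(z, D)

# Orthonormal toy dictionary: atom 1 plays the role of a "social
# attribute" direction surfaced by the SAE analysis.
D = [[1.0, 0.0], [0.0, 1.0]]
print(debias([0.8, 0.3], D, bias_neurons={1}))  # [0.8, 0.0]
```

With an orthonormal dictionary this removes exactly the component along the flagged atom, which mirrors the claim that semantic knowledge outside the deactivated neurons is preserved.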

Result: Effectively mitigates socially biased behaviors of VLMs without degrading their semantic knowledge. The framework is interpretable and model-agnostic.

Conclusion: Lays groundwork for future auditing tools to prioritize social fairness in emerging real-world AI systems by providing an interpretable approach to bias mitigation in multimodal models.

Abstract: The rapid advancement of Vision-Language models (VLMs) has raised growing concerns that their black-box reasoning processes could lead to unintended forms of social bias. Current debiasing approaches focus on mitigating surface-level bias signals through post-hoc learning or test-time algorithms, while leaving the internal dynamics of the model largely unexplored. In this work, we introduce an interpretable, model-agnostic bias mitigation framework, DeBiasLens, that localizes social attribute neurons in VLMs through sparse autoencoders (SAEs) applied to multimodal encoders. Building upon the disentanglement ability of SAEs, we train them on facial image or caption datasets without corresponding social attribute labels to uncover neurons highly responsive to specific demographics, including those that are underrepresented. By selectively deactivating the social neurons most strongly tied to bias for each group, we effectively mitigate socially biased behaviors of VLMs without degrading their semantic knowledge. Our research lays the groundwork for future auditing tools, prioritizing social fairness in emerging real-world AI systems.

[145] Ordinal Diffusion Models for Color Fundus Images

Gustav Schmidt, Philipp Berens, Sarah Müller

Main category: cs.CV

TL;DR: Ordinal latent diffusion model for generating diabetic retinopathy fundus images that incorporates ordered disease progression structure instead of treating stages as independent classes.

DetailsMotivation: Current generative image models treat disease stages as independent classes, ignoring the continuous nature of disease progression, which is problematic for medical imaging where pathological processes are continuous but observed through coarse, ordered labels.

Method: Proposed an ordinal latent diffusion model using scalar disease representation instead of categorical conditioning, enabling smooth transitions between adjacent DR severity stages in color fundus image generation.

Result: Model reduced Fréchet inception distance for 4 of 5 DR stages and increased quadratic weighted κ from 0.79 to 0.87 compared to standard conditional diffusion model, showing better visual realism and clinical consistency.

Conclusion: The ordinal latent diffusion model successfully captures continuous disease progression from ordered coarse labels, improving medical image generation quality and clinical relevance for diabetic retinopathy.

Abstract: It has been suggested that generative image models such as diffusion models can improve performance on clinically relevant tasks by offering deep learning models supplementary training data. However, most conditional diffusion models treat disease stages as independent classes, ignoring the continuous nature of disease progression. This mismatch is problematic in medical imaging because continuous pathological processes are typically only observed through coarse, discrete but ordered labels as in ophthalmology for diabetic retinopathy (DR). We propose an ordinal latent diffusion model for generating color fundus images that explicitly incorporates the ordered structure of DR severity into the generation process. Instead of categorical conditioning, we used a scalar disease representation, enabling a smooth transition between adjacent stages. We evaluated our approach using visual realism metrics and classification-based clinical consistency analysis on the EyePACS dataset. Compared to a standard conditional diffusion model, our model reduced the Fréchet inception distance for four of the five DR stages and increased the quadratic weighted $\kappa$ from 0.79 to 0.87. Furthermore, interpolation experiments showed that the model captured a continuous spectrum of disease progression learned from ordered, coarse class labels.
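
The quadratic weighted $\kappa$ reported above (0.79 to 0.87) is a standard agreement metric for ordinal labels, penalizing distant grade confusions more than adjacent ones. A self-contained sketch of its computation on toy DR-grade predictions (the grade values are made up for illustration):

```python
from collections import Counter

def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """kappa_w = 1 - sum(w * O) / sum(w * E), with quadratic weights
    w_ij = (i - j)^2 / (n - 1)^2, observed confusion counts O, and
    expected counts E under independence of the two raters."""
    n = len(y_true)
    O = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        O[t][p] += 1
    hist_t, hist_p = Counter(y_true), Counter(y_pred)
    num = den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            w = (i - j) ** 2 / (n_classes - 1) ** 2
            num += w * O[i][j]
            den += w * hist_t[i] * hist_p[j] / n
    return 1.0 - num / den

# Five DR grades (0 = no DR ... 4 = proliferative DR).
y_true = [0, 1, 2, 3, 4, 4, 3, 2]
print(quadratic_weighted_kappa(y_true, y_true, 5))              # 1.0
print(quadratic_weighted_kappa(y_true, [0, 1, 2, 3, 3, 4, 2, 2], 5))
```

In practice this is what `sklearn.metrics.cohen_kappa_score(..., weights="quadratic")` computes.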

[146] Quant Experts: Token-aware Adaptive Error Reconstruction with Mixture of Experts for Large Vision-Language Models Quantization

Chenwei Jia, Baoting Li, Xuchong Zhang, Mingzhuo Wei, Bochen Lin, Hongbin Sun

Main category: cs.CV

TL;DR: QE proposes token-aware adaptive error compensation with mixture-of-experts for vision-language models quantization, addressing distribution differences of important channels across modalities and tokens.

DetailsMotivation: Existing PTQ methods for VLMs rely on static identification and global compensation of sensitive channels, but overlook distributional differences of these important channels across inputs and modalities, leading to unsatisfactory quantization performance.

Method: Quant Experts (QE) divides important channels into token-independent and token-dependent groups. For token-independent channels, a shared expert uses low-rank adapter for global error compensation. For token-dependent channels, routed experts with multiple low-rank adapters compensate for local quantization errors specific to certain tokens.
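
The compensation pattern shared by both expert types can be illustrated with a toy case where the quantization residual W − W_q happens to be exactly rank-1, so a single low-rank adapter (A, B) recovers the full-precision output. All matrices here are hand-picked for illustration, not from the paper:

```python
def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def compensated_forward(x, W_q, A, B):
    """y = x @ W_q + (x @ A) @ B: the quantized weight does the bulk
    of the work and the low-rank adapter absorbs the residual."""
    y_q = matmul(x, W_q)
    y_lr = matmul(matmul(x, A), B)
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(y_q, y_lr)]

W_q = [[0.5, 0.0], [0.0, 0.5]]      # quantized weight
A = [[0.2], [-0.1]]                 # rank-1 adapter, d x r
B = [[1.0, -0.5]]                   # rank-1 adapter, r x d
W = [[0.7, -0.1], [-0.1, 0.55]]     # full precision: W = W_q + A @ B
x = [[1.0, 0.0]]
print(compensated_forward(x, W_q, A, B), matmul(x, W))
```

QE's distinction is in which adapter handles which channels: a single shared (A, B) for token-independent channels, and a router selecting among several such adapters for token-dependent ones.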

Result: Extensive experiments show QE consistently enhances task accuracy across various quantization settings and model scales (2B to 70B parameters) while maintaining performance comparable to full-precision models.

Conclusion: QE effectively addresses the distributional differences of important channels in VLMs through token-aware adaptive error compensation, achieving superior quantization performance across different model scales and quantization settings.

Abstract: Post-Training Quantization (PTQ) has emerged as an effective technique for alleviating the substantial computational and memory overheads of Vision-Language Models (VLMs) by compressing both weights and activations without retraining the full model. Existing PTQ methods primarily rely on static identification and global compensation of sensitive or outlier channels, yet they often overlook the distributional differences of these important channels across inputs, leading to unsatisfactory quantization performance. In this work, we observe that the distributions and occurrence frequencies of important channels vary significantly both across modalities and among tokens, even within the same modality. Accordingly, we propose \textbf{Quant Experts (QE)}, a token-aware adaptive error compensation with mixture-of-experts for VLM quantization. QE divides the important channels into token-independent and token-dependent groups. For the former, a shared expert is designed for most tokens to compensate for global quantization error using a low-rank adapter. For the latter, routed experts comprising multiple routed low-rank adapters are designed to compensate for local quantization error related to specific tokens. Extensive experiments demonstrate that QE consistently enhances task accuracy across various quantization settings and model scales, ranging from 2B to 70B parameters, while maintaining performance comparable to full-precision models.

[147] SR3R: Rethinking Super-Resolution 3D Reconstruction With Feed-Forward Gaussian Splatting

Xiang Feng, Xiangbo Wang, Tieshi Zhong, Chengkai Wang, Yiting Zhao, Tianxiang Xu, Zhenzhong Kuang, Feiwei Qin, Xuefei Yin, Yanming Zhu

Main category: cs.CV

TL;DR: SR3R is a feed-forward framework for 3D super-resolution that directly maps sparse low-resolution views to high-resolution 3D Gaussian Splatting representations, enabling robust generalization and real-time performance.

DetailsMotivation: Existing 3D super-resolution methods rely on dense inputs and per-scene optimization, limiting reconstruction fidelity, cross-scene generalization, and real-time usability by restricting high-frequency priors to those from pretrained 2D super-resolution models.

Method: SR3R reformulates 3DSR as a direct feed-forward mapping from sparse LR views to HR 3DGS representations. It uses a learned mapping network with Gaussian offset learning and feature refinement to stabilize reconstruction and sharpen high-frequency details. The framework is plug-and-play with any feed-forward 3DGS reconstruction backbone.

Result: Extensive experiments across three 3D benchmarks show SR3R surpasses state-of-the-art 3DSR methods and achieves strong zero-shot generalization, even outperforming SOTA per-scene optimization methods on unseen scenes.

Conclusion: SR3R fundamentally changes how 3D super-resolution acquires high-frequency knowledge by enabling autonomous learning of 3D-specific geometry and appearance from large-scale multi-scene data, leading to robust generalization and improved reconstruction fidelity.

Abstract: 3D super-resolution (3DSR) aims to reconstruct high-resolution (HR) 3D scenes from low-resolution (LR) multi-view images. Existing methods rely on dense LR inputs and per-scene optimization, which restricts the high-frequency priors for constructing HR 3D Gaussian Splatting (3DGS) to those inherited from pretrained 2D super-resolution (2DSR) models. This severely limits reconstruction fidelity, cross-scene generalization, and real-time usability. We propose to reformulate 3DSR as a direct feed-forward mapping from sparse LR views to HR 3DGS representations, enabling the model to autonomously learn 3D-specific high-frequency geometry and appearance from large-scale, multi-scene data. This fundamentally changes how 3DSR acquires high-frequency knowledge and enables robust generalization to unseen scenes. Specifically, we introduce SR3R, a feed-forward framework that directly predicts HR 3DGS representations from sparse LR views via the learned mapping network. To further enhance reconstruction fidelity, we introduce Gaussian offset learning and feature refinement, which stabilize reconstruction and sharpen high-frequency details. SR3R is plug-and-play and can be paired with any feed-forward 3DGS reconstruction backbone: the backbone provides an LR 3DGS scaffold, and SR3R upscales it to an HR 3DGS. Extensive experiments across three 3D benchmarks demonstrate that SR3R surpasses state-of-the-art (SOTA) 3DSR methods and achieves strong zero-shot generalization, even outperforming SOTA per-scene optimization methods on unseen scenes.

[148] DiffusionHarmonizer: Bridging Neural Reconstruction and Photorealistic Simulation with Online Diffusion Enhancer

Yuxuan Zhang, Katarína Tóthová, Zian Wang, Kangxue Yin, Haithem Turki, Riccardo de Lutio, Yen-Yu Chang, Or Litany, Sanja Fidler, Zan Gojcic

Main category: cs.CV

TL;DR: DiffusionHarmonizer: Online generative enhancement framework that transforms imperfect neural scene reconstructions into temporally consistent, realistic outputs for autonomous robot simulation.

DetailsMotivation: Neural reconstruction methods like NeRF and 3D Gaussian Splatting produce visually compelling results but suffer from artifacts in novel views and fail to realistically integrate inserted dynamic objects from different scenes, limiting their effectiveness for autonomous robot simulation.

Method: Introduces DiffusionHarmonizer, an online generative enhancement framework with a single-step temporally-conditioned enhancer converted from a pretrained multi-step image diffusion model. Uses custom data curation pipeline to construct synthetic-real pairs emphasizing appearance harmonization, artifact correction, and lighting realism.

Result: Creates a scalable system that significantly elevates simulation fidelity in both research and production environments, capable of running in online simulators on a single GPU.

Conclusion: DiffusionHarmonizer overcomes limitations of current neural reconstruction methods by providing temporally consistent, realistic outputs for autonomous robot simulation through generative enhancement.

Abstract: Simulation is essential to the development and evaluation of autonomous robots such as self-driving vehicles. Neural reconstruction is emerging as a promising solution as it enables simulating a wide variety of scenarios from real-world data alone in an automated and scalable way. However, while methods such as NeRF and 3D Gaussian Splatting can produce visually compelling results, they often exhibit artifacts particularly when rendering novel views, and fail to realistically integrate inserted dynamic objects, especially when they were captured from different scenes. To overcome these limitations, we introduce DiffusionHarmonizer, an online generative enhancement framework that transforms renderings from such imperfect scenes into temporally consistent outputs while improving their realism. At its core is a single-step temporally-conditioned enhancer that is converted from a pretrained multi-step image diffusion model, capable of running in online simulators on a single GPU. The key to training it effectively is a custom data curation pipeline that constructs synthetic-real pairs emphasizing appearance harmonization, artifact correction, and lighting realism. The result is a scalable system that significantly elevates simulation fidelity in both research and production environments.

[149] Steering and Rectifying Latent Representation Manifolds in Frozen Multi-modal LLMs for Video Anomaly Detection

Zhaolin Cai, Fan Li, Huiyu Duan, Lijun He, Guangtao Zhai

Main category: cs.CV

TL;DR: SteerVAD: A novel intervention framework for video anomaly detection that actively steers and rectifies internal representations of frozen MLLMs using gradient-free representational separability analysis and hierarchical meta-controller for dynamic rectification signals.

DetailsMotivation: Traditional VAD methods suffer from high labeling costs and full training requirements. Existing MLLM-based VAD approaches inherit pre-training biases and cannot adapt to specific video contexts, struggling with subtle or ambiguous anomalies.

Method: 1) Uses gradient-free representational separability analysis (RSA) to identify top attention heads as latent anomaly experts (LAEs); 2) Employs hierarchical meta-controller (HMC) to generate dynamic rectification signals conditioned on global context and LAE outputs; 3) Executes targeted anisotropic scaling on LAE representation manifolds to amplify anomaly-relevant dimensions while suppressing biases.
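
A toy sketch of the two key operations, with a simple mean-separation score standing in for the paper's RSA criterion (which the abstract does not fully specify) and hand-set gains standing in for the HMC's dynamic rectification signals:

```python
def separability(normal_acts, anomal_acts):
    """Gradient-free separability score for one attention head: distance
    between class means relative to the within-class spread."""
    def mean(xs):
        return sum(xs) / len(xs)
    def std(xs):
        m = mean(xs)
        return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5
    return abs(mean(normal_acts) - mean(anomal_acts)) / (
        std(normal_acts) + std(anomal_acts) + 1e-8)

def rectify(head_output, gains):
    """Targeted anisotropic scaling of a latent-anomaly-expert head:
    gain > 1 amplifies an anomaly-relevant dimension, < 1 suppresses
    a bias direction."""
    return [h * g for h, g in zip(head_output, gains)]

# Head A separates normal/anomalous activations better than head B,
# so it would be selected as a latent anomaly expert (LAE).
head_a = separability([0.1, 0.2, 0.15], [0.9, 1.0, 0.95])
head_b = separability([0.4, 0.6, 0.5], [0.5, 0.7, 0.6])
print(head_a > head_b)
```

In the full method the gains are produced per-video by the meta-controller, which is what lets a frozen MLLM adapt its internal representations to the specific context.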

Result: Achieves state-of-the-art performance among tuning-free approaches on mainstream benchmarks, requiring only 1% of training data.

Conclusion: SteerVAD establishes a powerful new direction for video anomaly detection by shifting from passive reading to active steering of MLLM representations, effectively handling subtle anomalies while maintaining efficiency.

Abstract: Video anomaly detection (VAD) aims to identify abnormal events in videos. Traditional VAD methods generally suffer from the high costs of labeled data and full training, so some recent works have explored leveraging frozen multi-modal large language models (MLLMs) in a tuning-free manner to perform VAD. However, their performance is limited as they directly inherit pre-training biases and cannot adapt internal representations to specific video contexts, leading to difficulties in handling subtle or ambiguous anomalies. To address these limitations, we propose a novel intervention framework, termed SteerVAD, which advances MLLM-based VAD by shifting from passively reading to actively steering and rectifying internal representations. Our approach first leverages gradient-free representational separability analysis (RSA) to identify the top attention heads as latent anomaly experts (LAEs), which are most discriminative for VAD. Then a hierarchical meta-controller (HMC) generates dynamic rectification signals by jointly conditioning on global context and these LAE outputs. The signals execute targeted, anisotropic scaling directly upon the LAE representation manifolds, amplifying anomaly-relevant dimensions while suppressing inherent biases. Extensive experiments on mainstream benchmarks demonstrate that our method achieves state-of-the-art performance among tuning-free approaches while requiring only 1% of the training data, establishing it as a powerful new direction for video anomaly detection. The code will be released upon publication.

[150] Multimodal Optimal Transport for Unsupervised Temporal Segmentation in Surgical Robotics

Omar Mohamed, Edoardo Fazzari, Ayah Al-Naji, Hamdan Alhadhrami, Khalfan Hableel, Saif Alkindi, Cesare Stefanini

Main category: cs.CV

TL;DR: TASOT is an unsupervised method for surgical phase/step recognition that uses multimodal optimal transport with visual and text-based costs, achieving strong zero-shot performance without surgical-specific pre-training.

DetailsMotivation: Current surgical video analysis methods rely on heavy pre-training on thousands of labeled videos, which incurs substantial computational and data collection costs. The authors question whether such heavy pre-training is truly necessary and propose an unsupervised alternative.

Method: TASOT extends Action Segmentation Optimal Transport (ASOT) by incorporating textual information generated from videos. It formulates temporal action segmentation as a multimodal optimal transport problem with visual and text-based costs, regularized through a temporally consistent unbalanced Gromov-Wasserstein formulation.
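
The cost-combination idea can be sketched with plain entropic OT (Sinkhorn) standing in for the paper's temporally-regularized unbalanced Gromov-Wasserstein solver; all numbers are illustrative:

```python
import math

def combined_cost(visual, text, alpha=0.5):
    """Weighted sum of visual and text-based matching costs."""
    return [[alpha * vr + (1 - alpha) * tr for vr, tr in zip(v_row, t_row)]
            for v_row, t_row in zip(visual, text)]

def sinkhorn(cost, reg=0.1, iters=200):
    """Entropic OT between uniform marginals over frames and actions."""
    n, m = len(cost), len(cost[0])
    K = [[math.exp(-c / reg) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        u = [(1.0 / n) / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

visual = [[0.1, 0.9], [0.8, 0.2]]   # frame-vs-action appearance distances
text = [[0.2, 0.8], [0.9, 0.1]]     # caption-based semantic distances
P = sinkhorn(combined_cost(visual, text))
hard = [max(range(len(row)), key=lambda j: row[j]) for row in P]
print(hard)  # frame 0 -> action 0, frame 1 -> action 1
```

Reading off the argmax of each transport-plan row gives the frame-to-action segmentation; TASOT's extra unbalanced and temporal-consistency terms then keep segments contiguous and allow unmatched mass.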

Result: TASOT achieves substantial improvements over existing zero-shot methods on multiple surgical datasets: StrasBypass70 (+23.7), BernBypass70 (+4.5), Cholec80 (+16.5), and AutoLaparo (+19.6).

Conclusion: Fine-grained surgical understanding can be achieved by exploiting information already present in standard visual and textual representations without resorting to complex pre-training pipelines, demonstrating the effectiveness of multimodal optimal transport approaches.

Abstract: Recognizing surgical phases and steps from video is a fundamental problem in computer-assisted interventions. Recent approaches increasingly rely on large-scale pre-training on thousands of labeled surgical videos, followed by zero-shot transfer to specific procedures. While effective, this strategy incurs substantial computational and data collection costs. In this work, we question whether such heavy pre-training is truly necessary. We propose Text-Augmented Action Segmentation Optimal Transport (TASOT), an unsupervised method for surgical phase and step recognition that extends Action Segmentation Optimal Transport (ASOT) by incorporating textual information generated directly from the videos. TASOT formulates temporal action segmentation as a multimodal optimal transport problem, where the matching cost is defined as a weighted combination of visual and text-based costs. The visual term captures frame-level appearance similarity, while the text term provides complementary semantic cues, and both are jointly regularized through a temporally consistent unbalanced Gromov-Wasserstein formulation. This design enables effective alignment between video frames and surgical actions without surgical-specific pretraining or external web-scale supervision. We evaluate TASOT on multiple benchmark surgical datasets and observe consistent and substantial improvements over existing zero-shot methods, including StrasBypass70 (+23.7), BernBypass70 (+4.5), Cholec80 (+16.5), and AutoLaparo (+19.6). These results demonstrate that fine-grained surgical understanding can be achieved by exploiting information already present in standard visual and textual representations, without resorting to increasingly complex pre-training pipelines. The code will be available at https://github.com/omar8ahmed9/TASOT.
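The weighted visual-plus-text matching cost can be illustrated with plain entropic optimal transport. This is only a minimal sketch: the paper's actual formulation is a temporally consistent unbalanced Gromov-Wasserstein problem, and the mixing weight `alpha` here is hypothetical:

```python
import numpy as np

def sinkhorn(cost, reg=0.1, n_iter=300):
    # Entropic OT between uniform marginals over frames (rows) and actions (cols).
    n, m = cost.shape
    r, c = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-cost / reg)
    v = np.ones(m)
    for _ in range(n_iter):
        u = r / (K @ v)
        v = c / (K.T @ u)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
C_vis = rng.random((6, 3))   # toy frame-to-action appearance costs
C_txt = rng.random((6, 3))   # toy frame-to-action text/semantic costs
alpha = 0.5                  # hypothetical visual/text mixing weight
P = sinkhorn(alpha * C_vis + (1 - alpha) * C_txt)
segments = P.argmax(axis=1)  # hard frame-to-action assignment
```

The transport plan `P` softly assigns each frame to an action; TASOT additionally regularizes the plan for temporal consistency, which this sketch omits.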

[151] Look Carefully: Adaptive Visual Reinforcements in Multimodal Large Language Models for Hallucination Mitigation

Xingyu Zhu, Kesen Zhao, Liang Yi, Shuo Wang, Zhicai Wang, Beier Zhu, Hanwang Zhang

Main category: cs.CV

TL;DR: AIR is a training-free framework that reduces hallucination in multimodal LLMs by adaptively reinforcing visual tokens through prototype-based token reduction and OT-guided patch reinforcement.

DetailsMotivation: MLLMs suffer from hallucination where generated content deviates from visual evidence. Existing solutions require costly training supervision or add inference latency, while current vision enhancement methods inject all visual tokens indiscriminately, causing interference from background regions.

Method: AIR has two components: 1) Prototype-based token reduction condenses visual tokens into a compact subset to suppress redundancy, and 2) Optimal Transport-guided patch reinforcement quantifies alignment between hidden states and patch embeddings to selectively integrate the most consistent patches into feed-forward layers.

Result: Extensive experiments across representative MLLMs show AIR substantially reduces hallucination while preserving general capabilities, establishing it as an effective solution for building reliable MLLMs.

Conclusion: AIR provides a training-free framework that enhances MLLMs’ reliance on salient visual information and effectively mitigates hallucination without requiring additional supervision or inference latency.

Abstract: Multimodal large language models (MLLMs) have achieved remarkable progress in vision-language reasoning, yet they remain vulnerable to hallucination, where generated content deviates from visual evidence. Existing mitigation strategies either require costly supervision during training or introduce additional latency at inference time. Recent vision enhancement methods attempt to address this issue by reinforcing visual tokens during decoding, but they typically inject all tokens indiscriminately, which causes interference from background regions and distracts the model from critical cues. To overcome this challenge, we propose Adaptive Visual Reinforcement (AIR), a training-free framework for MLLMs. AIR consists of two components. Prototype-based token reduction condenses the large pool of visual tokens into a compact subset to suppress redundancy. OT-guided patch reinforcement quantifies the alignment between hidden states and patch embeddings to selectively integrate the most consistent patches into feed-forward layers. As a result, AIR enhances the model’s reliance on salient visual information and effectively mitigates hallucination. Extensive experiments across representative MLLMs demonstrate that AIR substantially reduces hallucination while preserving general capabilities, establishing it as an effective solution for building reliable MLLMs.
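A minimal sketch of the first component: plain k-means is used below as a stand-in for the paper's prototype-based token reduction (the actual condensation procedure may differ):

```python
import numpy as np

def reduce_tokens(tokens, k=4, n_iter=20, seed=0):
    # Condense N visual tokens into k prototypes via k-means,
    # suppressing redundancy before any tokens are reinjected.
    rng = np.random.default_rng(seed)
    protos = tokens[rng.choice(len(tokens), size=k, replace=False)]
    for _ in range(n_iter):
        d = np.linalg.norm(tokens[:, None] - protos[None], axis=-1)
        assign = d.argmin(axis=1)
        for j in range(k):
            members = tokens[assign == j]
            if len(members):
                protos[j] = members.mean(axis=0)
    return protos

rng = np.random.default_rng(1)
tokens = rng.normal(size=(64, 8))  # toy ViT patch tokens
protos = reduce_tokens(tokens)
```

In AIR, only patches from this compact subset whose embeddings best align with the hidden states (via the OT-guided score) would then be integrated into the feed-forward layers.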

[152] A Mixed Diet Makes DINO An Omnivorous Vision Encoder

Rishabh Kabra, Maks Ovsjanikov, Drew A. Hudson, Ye Xia, Skanda Koppula, Andre Araujo, Joao Carreira, Niloy J. Mitra

Main category: cs.CV

TL;DR: Omnivorous Vision Encoder learns modality-agnostic feature space by aligning different modality representations of the same scene while distilling knowledge from frozen teacher models like DINOv2.

DetailsMotivation: Pre-trained vision encoders like DINOv2 perform well on unimodal tasks but have poor feature alignment across different modalities (e.g., RGB images and corresponding depth maps have similar cosine similarity to random unrelated images). This limits cross-modal understanding capabilities.

Method: Proposes Omnivorous Vision Encoder framework with dual objective: 1) maximize feature alignment between different modalities of the same scene, and 2) distillation objective that anchors learned representations to output of frozen teacher model (DINOv2). The student encoder becomes “omnivorous” by producing consistent embeddings across modalities.

Result: The resulting encoder produces consistent, powerful embeddings for scenes regardless of input modality (RGB, Depth, Segmentation, etc.), enabling robust cross-modal understanding while retaining discriminative semantics of original foundation model.

Conclusion: The Omnivorous Vision Encoder successfully addresses modality misalignment in pre-trained vision encoders, creating a modality-agnostic feature space that maintains strong semantic understanding while enabling cross-modal consistency.

Abstract: Pre-trained vision encoders like DINOv2 have demonstrated exceptional performance on unimodal tasks. However, we observe that their feature representations are poorly aligned across different modalities. For instance, the feature embedding for an RGB image and its corresponding depth map of the same scene exhibit a cosine similarity that is nearly identical to that of two random, unrelated images. To address this, we propose the Omnivorous Vision Encoder, a novel framework that learns a modality-agnostic feature space. We train the encoder with a dual objective: first, to maximize the feature alignment between different modalities of the same scene; and second, a distillation objective that anchors the learned representations to the output of a fully frozen teacher such as DINOv2. The resulting student encoder becomes “omnivorous” by producing a consistent, powerful embedding for a given scene, regardless of the input modality (RGB, Depth, Segmentation, etc.). This approach enables robust cross-modal understanding while retaining the discriminative semantics of the original foundation model.
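The dual objective can be sketched as two cosine terms: one aligning the student's embeddings of two modalities of the same scene, and one anchoring them to the frozen teacher's output. The weighting `lam` is a hypothetical hyperparameter:

```python
import numpy as np

def cosine(a, b):
    return (a * b).sum(-1) / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1))

def dual_objective(z_rgb, z_depth, z_teacher, lam=1.0):
    # Term 1: align the student's embeddings of two modalities of one scene.
    align = (1.0 - cosine(z_rgb, z_depth)).mean()
    # Term 2: anchor the RGB embedding to the frozen teacher (e.g. DINOv2).
    distill = (1.0 - cosine(z_rgb, z_teacher)).mean()
    return align + lam * distill

rng = np.random.default_rng(0)
z_rgb = rng.normal(size=(5, 16))
loss_aligned = dual_objective(z_rgb, z_rgb, z_rgb)  # perfect alignment -> ~0
loss_random = dual_objective(z_rgb, rng.normal(size=(5, 16)),
                             rng.normal(size=(5, 16)))
```

The distillation term is what keeps the omnivorous student from collapsing: alignment alone could be satisfied by trivial embeddings, while anchoring to DINOv2 preserves its discriminative semantics.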

[153] Spatio-Temporal Garment Reconstruction Using Diffusion Mapping via Pattern Coordinates

Yingxuan You, Ren Li, Corentin Dumery, Cong Cao, Hao Li, Pascal Fua

Main category: cs.CV

TL;DR: A unified framework for high-fidelity 3D garment reconstruction from monocular images/videos using Implicit Sewing Patterns with diffusion models for shape priors and spatio-temporal consistency.

DetailsMotivation: Accurate reconstruction of 3D clothed humans, especially loose-fitting garments, remains challenging despite progress in human body recovery. Applications include virtual try-on, avatar creation, and mixed reality.

Method: Combines Implicit Sewing Patterns (ISP) with generative diffusion models to learn garment shape priors in 2D UV space. Introduces mapping model connecting image pixels, UV coordinates, and 3D geometry. Extends to dynamic reconstruction with spatio-temporal diffusion and test-time guidance for temporal consistency, plus analytic projection-based constraints.

Result: Method generalizes well to real-world imagery despite synthetic training, outperforms existing approaches on both tight- and loose-fitting garments. Preserves fine geometric detail with realistic dynamic motion, supporting texture editing, garment retargeting, and animation.

Conclusion: Proposed framework enables high-fidelity 3D garment reconstruction from monocular inputs, addressing challenges in loose clothing reconstruction while maintaining temporal consistency for video applications.

Abstract: Reconstructing 3D clothed humans from monocular images and videos is a fundamental problem with applications in virtual try-on, avatar creation, and mixed reality. Despite significant progress in human body recovery, accurately reconstructing garment geometry, particularly for loose-fitting clothing, remains an open challenge. We propose a unified framework for high-fidelity 3D garment reconstruction from both single images and video sequences. Our approach combines Implicit Sewing Patterns (ISP) with a generative diffusion model to learn expressive garment shape priors in 2D UV space. Leveraging these priors, we introduce a mapping model that establishes correspondences between image pixels, UV pattern coordinates, and 3D geometry, enabling accurate and detailed garment reconstruction from single images. We further extend this formulation to dynamic reconstruction by introducing a spatio-temporal diffusion scheme with test-time guidance to enforce long-range temporal consistency. We also develop analytic projection-based constraints that preserve image-aligned geometry in visible regions while enforcing coherent completion in occluded areas over time. Although trained exclusively on synthetically simulated cloth data, our method generalizes well to real-world imagery and consistently outperforms existing approaches on both tight- and loose-fitting garments. The reconstructed garments preserve fine geometric detail while exhibiting realistic dynamic motion, supporting downstream applications such as texture editing, garment retargeting, and animation.

[154] EvalMVX: A Unified Benchmarking for Neural 3D Reconstruction under Diverse Multiview Setups

Zaiyan Yang, Jieji Ren, Xiangyi Wang, Zonglin Li, Xu Cao, Heng Guo, Zhanyu Ma, Boxin Shi

Main category: cs.CV

TL;DR: EvalMVX: A real-world dataset for benchmarking multiview 3D reconstruction techniques (MVS, MVPS, MVSfP) with 25 objects, 8,500 images under varying views and lighting, enabling quantitative comparison of different multiview methods.

DetailsMotivation: Current real-world datasets focus mainly on RGB-based multiview stereo (MVS), leaving multiview photometric stereo (MVPS) and multiview shape from polarization (MVSfP) unquantified despite their importance for high-fidelity surface reconstruction and sparse inputs.

Method: Created EvalMVX dataset containing 25 objects captured with polarized camera under 20 views and 17 lighting conditions (OLAT and natural illumination), totaling 8,500 images with aligned ground-truth 3D meshes for quantitative benchmarking.

Result: Evaluated 13 recent MVX methods, identified best-performing techniques, and revealed open problems under diverse geometric details and reflectance types, providing comprehensive benchmarking results.

Conclusion: EvalMVX enables simultaneous quantitative assessment of different multiview reconstruction techniques, inspiring future research in multiview 3D reconstruction by identifying current limitations and performance boundaries.

Abstract: Recent advancements in neural surface reconstruction have significantly enhanced 3D reconstruction. However, current real-world datasets mainly focus on benchmarking multiview stereo (MVS) based on RGB inputs. Multiview photometric stereo (MVPS) and multiview shape from polarization (MVSfP), though indispensable for high-fidelity surface reconstruction and sparse inputs, have not been quantitatively assessed together with MVS. To determine the working range of different MVX (MVS, MVSfP, and MVPS) techniques, we propose EvalMVX, a real-world dataset containing 25 objects, each captured with a polarized camera under 20 varying views and 17 light conditions including OLAT and natural illumination, yielding 8,500 images in total. Each object includes an aligned ground-truth 3D mesh, facilitating simultaneous quantitative benchmarking of MVX methods. Based on EvalMVX, we evaluate 13 recently published MVX methods, identify the best-performing ones, and expose open problems across diverse geometric details and reflectance types. We hope EvalMVX and the benchmarking results will inspire future research on multiview 3D reconstruction.

[155] FoV-Net: Rotation-Invariant CAD B-rep Learning via Field-of-View Ray Casting

Matteo Ballegeer, Dries F. Benoit

Main category: cs.CV

TL;DR: FoV-Net is a rotation-invariant B-rep learning framework that uses Local Reference Frame UV-grids for local geometry and Field-of-View grids for global context, achieving SOTA performance on 3D CAD analysis tasks.

DetailsMotivation: Current B-rep learning methods rely on absolute coordinates and normals, making them highly sensitive to rotations (accuracy can drop from 95% to 10% under arbitrary SO(3) rotations). There's a need for rotation-invariant 3D CAD analysis methods.

Method: FoV-Net represents each face with: 1) Local Reference Frame (LRF) UV-grid encoding local surface geometry, and 2) Field-of-View (FoV) grids capturing surrounding 3D context via ray casting to neighboring faces. Lightweight CNNs extract per-face features, which are propagated using a graph attention network over the B-rep graph.

Result: Achieves state-of-the-art performance on B-rep classification and segmentation benchmarks, demonstrates robustness to arbitrary rotations, and requires less training data to achieve strong results.

Conclusion: FoV-Net successfully addresses rotation sensitivity in B-rep learning by combining local geometry encoding with global context capture in a rotation-invariant manner, advancing 3D CAD analysis capabilities.

Abstract: Learning directly from boundary representations (B-reps) has significantly advanced 3D CAD analysis. However, state-of-the-art B-rep learning methods rely on absolute coordinates and normals to encode global context, making them highly sensitive to rotations. Our experiments reveal that models achieving over 95% accuracy on aligned benchmarks can collapse to as low as 10% under arbitrary $\mathbf{SO}(3)$ rotations. To address this, we introduce FoV-Net, the first B-rep learning framework that captures both local surface geometry and global structural context in a rotation-invariant manner. Each face is represented by a Local Reference Frame (LRF) UV-grid that encodes its local surface geometry, and by Field-of-View (FoV) grids that capture the surrounding 3D context by casting rays and recording intersections with neighboring faces. Lightweight CNNs extract per-face features, which are propagated over the B-rep graph using a graph attention network. FoV-Net achieves state-of-the-art performance on B-rep classification and segmentation benchmarks, demonstrating robustness to arbitrary rotations while also requiring less training data to achieve strong results.
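A 2D toy version of the FoV idea: cast rays from a face center and record the distance to the first intersection with neighboring geometry. Circles stand in for B-rep faces here; the paper works on 3D B-reps with rays expressed in each face's local reference frame, which is what makes the descriptor rotation-invariant:

```python
import numpy as np

def fov_grid(center, obstacles, n_rays=8, max_dist=10.0):
    # Cast n_rays from a face center; record distance to the first hit
    # among neighbouring geometry (circles given as (cx, cy, radius)).
    angles = np.linspace(0.0, 2.0 * np.pi, n_rays, endpoint=False)
    dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    hits = np.full(n_rays, max_dist)
    for cx, cy, r in obstacles:
        oc = np.array([cx, cy]) - center
        b = dirs @ oc                       # projection of offset onto each ray
        disc = b**2 - (oc @ oc - r**2)      # ray-circle discriminant
        ok = disc >= 0.0
        t = b - np.sqrt(np.where(ok, disc, 0.0))  # nearest intersection distance
        valid = ok & (t > 0.0)
        hits = np.where(valid, np.minimum(hits, t), hits)
    return hits

# One circle of radius 1 centered 3 units along +x: only ray 0 hits it.
grid = fov_grid(np.zeros(2), obstacles=[(3.0, 0.0, 1.0)])
```

Because only distances along rays are stored (not absolute coordinates), the descriptor is unchanged when the whole configuration is rotated together with the local frame.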

[156] FocusTrack: One-Stage Focus-and-Suppress Framework for 3D Point Cloud Object Tracking

Sifan Zhou, Jiahao Nie, Ziyu Zhao, Yichao Cao, Xiaobo Lu

Main category: cs.CV

TL;DR: FocusTrack: A one-stage 3D point cloud object tracking framework that unifies motion-semantics co-modeling through inter-frame motion modeling and focus-and-suppress attention, achieving state-of-the-art performance at 105 FPS.

DetailsMotivation: Existing two-stage motion-based 3D point cloud tracking methods suffer from error accumulation due to decoupled optimization (foreground segmentation before motion estimation) and computational bottlenecks from sequential processing.

Method: Proposes FocusTrack with two core innovations: 1) Inter-frame Motion Modeling (IMM) using a temporal-difference siamese encoder to capture global motion patterns, and 2) Focus-and-Suppress Attention that enhances foreground semantics via motion-salient feature gating and suppresses background noise using temporal-aware motion context without explicit segmentation.

Result: Achieves state-of-the-art performance on prominent 3D tracking benchmarks (KITTI, nuScenes, Waymo) while running at 105 FPS with end-to-end training.

Conclusion: FocusTrack demonstrates that unified motion-semantics co-modeling in a one-stage framework can overcome limitations of two-stage approaches, achieving both superior accuracy and high efficiency in 3D point cloud object tracking.

Abstract: In 3D point cloud object tracking, motion-centric methods have emerged as a promising avenue due to their superior performance in modeling inter-frame motion. However, existing two-stage motion-based approaches suffer from fundamental limitations: (1) error accumulation due to decoupled optimization caused by explicit foreground segmentation prior to motion estimation, and (2) computational bottlenecks from sequential processing. To address these challenges, we propose FocusTrack, a novel one-stage tracking framework that unifies motion-semantics co-modeling through two core innovations: Inter-frame Motion Modeling (IMM) and Focus-and-Suppress Attention. The IMM module employs a temporal-difference siamese encoder to capture global motion patterns between adjacent frames. The Focus-and-Suppress attention enhances foreground semantics via motion-salient feature gating and suppresses background noise based on the temporal-aware motion context from IMM, without explicit segmentation. Building on these two designs, FocusTrack enables end-to-end training with a compact one-stage pipeline. Extensive experiments on prominent 3D tracking benchmarks, such as KITTI, nuScenes, and Waymo, demonstrate that FocusTrack achieves new SOTA performance while running at 105 FPS.
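A toy sketch of the two components, assuming per-point features and using a simple sigmoid gate on motion magnitude as a stand-in for the paper's attention mechanism:

```python
import numpy as np

def temporal_difference(feat_prev, feat_curr):
    # The siamese encoder shares weights across frames; the motion cue is
    # the difference of the two frames' features (toy stand-in for IMM).
    return feat_curr - feat_prev

def focus_and_suppress(feats, motion, temperature=1.0):
    # Gate per-point features by motion salience: moving (foreground) points
    # are amplified, static background is suppressed, with no explicit mask.
    salience = np.linalg.norm(motion, axis=-1)
    gate = 1.0 / (1.0 + np.exp(-(salience - salience.mean()) / temperature))
    return feats * gate[:, None]

rng = np.random.default_rng(0)
f_prev = rng.normal(size=(100, 8))
f_curr = f_prev.copy()
f_curr[:10] += 1.0  # only the first 10 points moved between frames
gated = focus_and_suppress(f_curr, temporal_difference(f_prev, f_curr))
```

The gate is computed from motion context alone, so foreground emphasis never requires a separate segmentation stage, which is the point of the one-stage design.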

[157] Prune Wisely, Reconstruct Sharply: Compact 3D Gaussian Splatting via Adaptive Pruning and Difference-of-Gaussian Primitives

Haoran Wang, Guoxi Huang, Fan Zhang, David Bull, Nantheera Anantrasirichai

Main category: cs.CV

TL;DR: Efficient 3D Gaussian Splatting method with reconstruction-aware pruning and novel 3D Difference-of-Gaussians primitives that reduce model size by up to 90% while maintaining or improving visual quality.

DetailsMotivation: 3D Gaussian Splatting (3DGS) requires many primitives for high fidelity, leading to redundant representations and high resource consumption, limiting scalability for complex/large-scale scenes. Need efficient pruning strategies and more expressive primitives to reduce redundancy while preserving quality.

Method: 1) Reconstruction-aware pruning strategy that adaptively determines pruning timing and refining intervals based on reconstruction quality. 2) Novel 3D Difference-of-Gaussians primitive that jointly models both positive and negative densities in a single primitive, improving expressiveness under compact configurations.

Result: Significantly improves model compactness with up to 90% reduction in Gaussian count while delivering visual quality similar to or better than state-of-the-art methods.

Conclusion: The proposed method enables more efficient 3D scene representation with reduced resource consumption while maintaining high visual quality, making 3DGS more practical for complex and large-scale scenes.

Abstract: Recent significant advances in 3D scene representation have been driven by 3D Gaussian Splatting (3DGS), which has enabled real-time rendering with photorealistic quality. However, 3DGS often requires a large number of primitives to achieve high fidelity, leading to redundant representations and high resource consumption, thereby limiting its scalability for complex or large-scale scenes. Consequently, effective pruning strategies and more expressive primitives that can reduce redundancy while preserving visual quality are crucial for practical deployment. We propose an efficient, integrated reconstruction-aware pruning strategy that adaptively determines pruning timing and refining intervals based on reconstruction quality, thus reducing model size while enhancing rendering quality. Moreover, we introduce a 3D Difference-of-Gaussians primitive that jointly models both positive and negative densities in a single primitive, improving the expressiveness of Gaussians under compact configurations. Our method significantly improves model compactness, achieving up to a 90% reduction in Gaussian count while delivering visual quality that is similar to, or in some cases better than, that produced by state-of-the-art methods. Code will be made publicly available.
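The Difference-of-Gaussians idea can be seen in 1D: subtracting a narrower negative lobe from a positive one yields shapes a single Gaussian cannot represent. The weights and widths below are illustrative, not the paper's parameterization:

```python
import numpy as np

def dog_density(x, mu, sigma_pos, sigma_neg, w_pos=1.0, w_neg=0.5):
    # A 1-D Difference-of-Gaussians primitive: a positive lobe minus a
    # narrower, weaker negative lobe (toy analogue of the paper's 3D version).
    g = lambda s: np.exp(-0.5 * ((x - mu) / s) ** 2)
    return w_pos * g(sigma_pos) - w_neg * g(sigma_neg)

x = np.linspace(-3.0, 3.0, 601)
d = dog_density(x, mu=0.0, sigma_pos=1.0, sigma_neg=0.3)
```

The resulting profile dips at the center and peaks on a ring around it, a shape that would otherwise require several plain Gaussians, which is why fewer DoG primitives can match the same scene.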

[158] Fixed Anchors Are Not Enough: Dynamic Retrieval and Persistent Homology for Dataset Distillation

Muquan Li, Hang Gou, Yingyi Ma, Rongzheng Wang, Ke Qin, Tao He

Main category: cs.CV

TL;DR: RETA improves decoupled dataset distillation by using dynamic retrieval to select optimal real patches and persistent topology alignment to maintain feature diversity, achieving state-of-the-art performance on multiple image datasets.

DetailsMotivation: Current decoupled dataset distillation methods suffer from fit-complexity gap and pull-to-anchor effect due to static real patches, which reduces intra-class diversity and hurts generalization performance.

Method: Proposes RETA with two components: 1) Dynamic Retrieval Connection (DRC) that selects optimal real patches from a prebuilt pool by minimizing fit-complexity score in teacher feature space, and 2) Persistent Topology Alignment (PTA) that regularizes synthesis using persistent homology to compute topology discrepancies between real and synthetic feature graphs.

Result: Achieves 64.3% top-1 accuracy on ImageNet-1K with ResNet-18 at 50 images per class, +3.1% over prior best methods. Consistently outperforms baselines across CIFAR-100, Tiny-ImageNet, ImageNet-1K, and multiple ImageNet subsets with comparable time and memory.

Conclusion: RETA effectively addresses the fit-complexity gap and pull-to-anchor effect in decoupled dataset distillation through dynamic patch retrieval and topology alignment, leading to improved generalization and state-of-the-art performance.

Abstract: Decoupled dataset distillation (DD) compresses large corpora into a few synthetic images by matching a frozen teacher’s statistics. However, current residual-matching pipelines rely on static real patches, creating a fit-complexity gap and a pull-to-anchor effect that reduce intra-class diversity and hurt generalization. To address these issues, we introduce RETA – a Retrieval and Topology Alignment framework for decoupled DD. First, Dynamic Retrieval Connection (DRC) selects a real patch from a prebuilt pool by minimizing a fit-complexity score in teacher feature space; the chosen patch is injected via a residual connection to tighten feature fit while controlling injected complexity. Second, Persistent Topology Alignment (PTA) regularizes synthesis with persistent homology: we build a mutual k-NN feature graph, compute persistence images of components and loops, and penalize topology discrepancies between real and synthetic sets, mitigating the pull-to-anchor effect. Across CIFAR-100, Tiny-ImageNet, ImageNet-1K, and multiple ImageNet subsets, RETA consistently outperforms various baselines under comparable time and memory, notably reaching 64.3% top-1 accuracy on ImageNet-1K with ResNet-18 at 50 images per class, +3.1% over the best prior method.
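The mutual k-NN feature graph that PTA builds before computing persistence can be sketched directly (the persistent-homology step itself, which turns this graph into persistence images, is omitted):

```python
import numpy as np

def mutual_knn_graph(feats, k=3):
    # Adjacency of the mutual k-NN graph: i and j are connected only if
    # each is among the other's k nearest neighbours in feature space.
    d = np.linalg.norm(feats[:, None] - feats[None], axis=-1)
    np.fill_diagonal(d, np.inf)              # no self-edges
    nn = np.argsort(d, axis=1)[:, :k]        # indices of each point's k-NN
    knn = np.zeros_like(d, dtype=bool)
    rows = np.repeat(np.arange(len(feats)), k)
    knn[rows, nn.ravel()] = True
    return knn & knn.T                       # keep only mutual edges

rng = np.random.default_rng(0)
feats = rng.normal(size=(10, 4))             # toy teacher-space features
A = mutual_knn_graph(feats)
```

Mutuality makes the graph symmetric and prunes one-sided links, so its connected components and loops reflect genuine cluster structure rather than hub artifacts.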

[159] HumanOrbit: 3D Human Reconstruction as 360° Orbit Generation

Keito Suzuki, Kunyao Chen, Lei Wang, Bang Du, Runfa Blark Li, Peng Liu, Ning Bi, Truong Nguyen

Main category: cs.CV

TL;DR: HumanOrbit generates 360° orbit videos around a person from a single image using video diffusion models for consistent multi-view synthesis and 3D mesh reconstruction.

DetailsMotivation: Existing image-based diffusion models for multi-view synthesis produce inconsistent results across views and fail to preserve original identity. Video diffusion models show promise for photorealistic generation aligned with prompts, inspiring a video-based approach for consistent human orbit generation.

Method: Proposes HumanOrbit, a video diffusion model for multi-view human image generation that synthesizes continuous camera rotations around subjects. Uses generated multi-view frames in a reconstruction pipeline to recover textured 3D meshes of subjects.

Result: Experimental results validate HumanOrbit’s effectiveness for multi-view image generation. Reconstructed 3D models exhibit superior completeness and fidelity compared to state-of-the-art baselines.

Conclusion: HumanOrbit successfully generates geometrically consistent novel views while preserving appearance and identity, enabling high-quality 3D reconstruction from single images through video diffusion modeling.

Abstract: We present a method for generating a full 360° orbit video around a person from a single input image. Existing methods typically adapt image-based diffusion models for multi-view synthesis, but yield results that are inconsistent across views and with the original identity. In contrast, recent video diffusion models have demonstrated their ability to generate photorealistic results that align well with the given prompts. Inspired by these results, we propose HumanOrbit, a video diffusion model for multi-view human image generation. Our approach enables the model to synthesize continuous camera rotations around the subject, producing geometrically consistent novel views while preserving the appearance and identity of the person. Using the generated multi-view frames, we further propose a reconstruction pipeline that recovers a textured mesh of the subject. Experimental results validate the effectiveness of HumanOrbit for multi-view image generation and show that the reconstructed 3D models exhibit superior completeness and fidelity compared to those from state-of-the-art baselines.

[160] RAViT: Resolution-Adaptive Vision Transformer

Martial Guidez, Stefan Duffner, Christophe Garcia

Main category: cs.CV

TL;DR: RAViT is a multi-branch vision transformer framework that processes images at different resolutions to reduce computational cost while maintaining accuracy, with an early exit mechanism for adaptive accuracy-computation trade-offs.

DetailsMotivation: Vision transformers achieve excellent performance but have very high computational costs compared to alternatives like CNNs. The authors aim to reduce this computational burden while preserving accuracy.

Method: Proposes RAViT: a multi-branch network that processes multiple copies of the same image at different resolutions. Uses an early exit mechanism where predictions from lower-resolution branches inform higher-resolution branches, reducing computation. In a two-branch setup, first processes a downsampled image with one transformer, then uses that prediction together with the original image for final prediction with a second transformer.

Result: Evaluated on CIFAR-10, Tiny ImageNet, and ImageNet. Achieved equivalent accuracy to classical vision transformer models with only around 70% of FLOPs.

Conclusion: RAViT successfully reduces computational costs of vision transformers while maintaining accuracy through multi-resolution processing and early exit mechanisms, making vision transformers more practical for real-world applications.

Abstract: Vision transformers have recently made a breakthrough in computer vision, showing excellent precision in numerous applications. However, their computational cost is very high compared to alternative approaches such as Convolutional Neural Networks. To address this problem, we propose a novel framework for image classification called RAViT, based on a multi-branch network that operates on several copies of the same image at different resolutions to reduce the computational cost while preserving overall accuracy. Furthermore, our framework includes an early exit mechanism that makes our model adaptive and allows the appropriate trade-off between accuracy and computational cost to be chosen at run-time. For example, in a two-branch architecture, the original image is first resized to a lower resolution and a prediction is made on it by a first transformer; this prediction is then reused, together with the original-size image, to produce a final prediction with a second transformer at a lower cost than a classical Vision transformer architecture. The early-exit process allows the model to make a final prediction at intermediate branches, saving even more computation. We evaluated our approach on CIFAR-10, Tiny ImageNet, and ImageNet, obtaining accuracy equivalent to the classical Vision transformer model with only around 70% of the FLOPs.
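The two-branch early-exit logic can be sketched as follows; the confidence threshold and the `refine` stand-in for the second transformer (which in RAViT also sees the full-resolution image) are hypothetical:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def two_branch_predict(low_res_logits, high_res_fn, threshold=0.9):
    # Branch 1 runs on the downsampled image; if its confidence clears the
    # threshold we exit early, otherwise the second branch refines the
    # coarse prediction (in RAViT, using the full-resolution image too).
    p = softmax(low_res_logits)
    if p.max() >= threshold:
        return p.argmax(), "early_exit"
    return high_res_fn(p).argmax(), "full"

# Hypothetical second branch: here it simply sharpens the coarse prediction.
refine = lambda p: p + np.array([0.0, 1.0, 0.0])

pred_easy, route_easy = two_branch_predict(np.array([5.0, 0.0, 0.0]), refine)
pred_hard, route_hard = two_branch_predict(np.array([0.2, 0.1, 0.0]), refine)
```

Confident "easy" inputs never pay for the expensive branch, which is where the FLOP savings come from; the threshold is the run-time knob for the accuracy-compute trade-off.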

[161] Manifold-Preserving Superpixel Hierarchies and Embeddings for the Exploration of High-Dimensional Images

Alexander Vieth, Boudewijn Lelieveldt, Elmar Eisemann, Anna Vilanova, Thomas Höllt

Main category: cs.CV

TL;DR: A superpixel hierarchy method for high-dimensional images that incorporates both attribute information and spatial layout to enable consistent exploration in both image and attribute spaces.

DetailsMotivation: Current hierarchical embedding techniques for high-dimensional images ignore spatial layout, making it difficult to explore regions of interest consistently across image and attribute spaces.

Method: Develops a superpixel hierarchy that considers both high-dimensional attribute manifold and spatial layout during construction, enabling congruent exploration in both spaces.

Result: Shows effectiveness through comparison with classical hierarchical embedding-based image exploration in two use cases, demonstrating improved consistency.

Conclusion: The proposed image-guided hierarchy enables more effective exploration of high-dimensional images by maintaining congruence between image regions and attribute abstractions.

Abstract: High-dimensional images, or images with a high-dimensional attribute vector per pixel, are commonly explored with coordinated views of a low-dimensional embedding of the attribute space and a conventional image representation. Nowadays, such images can easily contain several million pixels. For such large datasets, hierarchical embedding techniques are better suited to represent the high-dimensional attribute space than flat dimensionality reduction methods. However, available hierarchical dimensionality reduction methods construct the hierarchy purely based on the attribute information and ignore the spatial layout of pixels in the images. This impedes the exploration of regions of interest in the image space, since there is no congruence between a region of interest in image space and the associated attribute abstractions in the hierarchy. In this paper, we present a superpixel hierarchy for high-dimensional images that takes the high-dimensional attribute manifold into account during construction. Through this, our method enables consistent exploration of high-dimensional images in both image and attribute space. We show the effectiveness of this new image-guided hierarchy in the context of embedding exploration by comparing it with classical hierarchical embedding-based image exploration in two use cases.
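A minimal sketch of the core idea, a merge cost that mixes attribute-space and image-space distances so hierarchy construction respects both the attribute manifold and the spatial layout (the weighting `beta` is hypothetical):

```python
import numpy as np

def joint_distance(p, q, beta=0.5):
    # Merge cost between two superpixels: a weighted mix of attribute-space
    # distance and image-space (centroid) distance, so spatially adjacent,
    # attribute-similar regions are merged first when building the hierarchy.
    d_attr = np.linalg.norm(p["attr"] - q["attr"])
    d_spat = np.linalg.norm(p["xy"] - q["xy"])
    return beta * d_attr + (1 - beta) * d_spat

a = {"attr": np.array([1.0, 0.0, 0.0]), "xy": np.array([0.0, 0.0])}
b = {"attr": np.array([1.0, 0.1, 0.0]), "xy": np.array([1.0, 0.0])}
c = {"attr": np.array([0.0, 5.0, 0.0]), "xy": np.array([1.0, 1.0])}
d_ab = joint_distance(a, b)  # similar attributes, adjacent -> small merge cost
d_ac = joint_distance(a, c)  # dissimilar attributes -> large merge cost
```

A purely attribute-based hierarchy would use only `d_attr`; including `d_spat` is what keeps image-space regions of interest congruent with their abstractions in the hierarchy.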

[162] GeoDiff4D: Geometry-Aware Diffusion for 4D Head Avatar Reconstruction

Chao Xu, Xiaochen Zhao, Xiang Deng, Jingxiang Sun, Zhuo Su, Donglin Di, Yebin Liu

Main category: cs.CV

TL;DR: A novel framework using geometry-aware diffusion to create photorealistic 4D head avatars from single portrait images with accurate 3D geometry and real-time rendering.

DetailsMotivation: Existing methods for 4D head avatar reconstruction from single images rely on 2D priors and struggle with consistent 3D geometry, creating a need for better geometry-aware approaches.

Method: Uses geometry-aware diffusion to jointly synthesize portrait images and surface normals, with a pose-free expression encoder for implicit expression representations, integrated into 3D Gaussian-based avatars.

Result: Substantially outperforms state-of-the-art in visual quality, expression fidelity, and cross-identity generalization while supporting real-time rendering.

Conclusion: The proposed framework successfully addresses 3D geometry consistency in head avatar reconstruction through geometry-aware diffusion priors.

Abstract: Reconstructing photorealistic and animatable 4D head avatars from a single portrait image remains a fundamental challenge in computer vision. While diffusion models have enabled remarkable progress in image and video generation for avatar reconstruction, existing methods primarily rely on 2D priors and struggle to achieve consistent 3D geometry. We propose a novel framework that leverages geometry-aware diffusion to learn strong geometry priors for high-fidelity head avatar reconstruction. Our approach jointly synthesizes portrait images and corresponding surface normals, while a pose-free expression encoder captures implicit expression representations. Both synthesized images and expression latents are incorporated into 3D Gaussian-based avatars, enabling photorealistic rendering with accurate geometry. Extensive experiments demonstrate that our method substantially outperforms state-of-the-art approaches in visual quality, expression fidelity, and cross-identity generalization, while supporting real-time rendering.

[163] A multimodal slice discovery framework for systematic failure detection and explanation in medical image classification

Yixuan Liu, Kanwal K. Bhatia, Ahmed E. Fetit

Main category: cs.CV

TL;DR: Multimodal auditing framework for medical image classifiers that uses slice discovery methods with multimodal representations to identify systematic failures more effectively than unimodal approaches.

DetailsMotivation: Current medical image classifiers lack safety and reliability in practical settings. Existing auditing methods using unimodal features or metadata-based subgroup analyses have limited interpretability and often miss hidden systematic failures.

Method: Introduces an automated auditing framework that extends slice discovery methods to multimodal representations for medical applications. Uses MIMIC-CXR-JPG dataset and tests under common failure scenarios.

Result: Framework demonstrates strong capability in both failure discovery and explanation generation. Multimodal information allows more comprehensive and effective auditing, while unimodal variants show potential in resource-constrained scenarios.

Conclusion: Multimodal auditing framework improves safety and reliability assessment of medical image classifiers by capturing hidden systematic failures more effectively than existing approaches.

Abstract: Despite advances in machine learning-based medical image classifiers, the safety and reliability of these systems remain major concerns in practical settings. Existing auditing approaches mainly rely on unimodal features or metadata-based subgroup analyses, which are limited in interpretability and often fail to capture hidden systematic failures. To address these limitations, we introduce the first automated auditing framework that extends slice discovery methods to multimodal representations specifically for medical applications. Comprehensive experiments were conducted under common failure scenarios using the MIMIC-CXR-JPG dataset, demonstrating the framework’s strong capability in both failure discovery and explanation generation. Our results also show that multimodal information generally allows more comprehensive and effective auditing of classifiers, while unimodal variants beyond image-only inputs exhibit strong potential in scenarios where resources are constrained.

[164] SenCache: Accelerating Diffusion Model Inference via Sensitivity-Aware Caching

Yasaman Haghighi, Alexandre Alahi

Main category: cs.CV

TL;DR: SenCache: A sensitivity-aware caching framework for accelerating diffusion-based video generation by dynamically selecting caching timesteps based on model output sensitivity analysis.

DetailsMotivation: Diffusion models achieve state-of-the-art video generation quality but suffer from expensive inference due to many sequential denoising steps. Existing caching methods rely on heuristic criteria and require extensive tuning, lacking a principled approach.

Method: Proposes Sensitivity-Aware Caching (SenCache) that formalizes caching error through analysis of model output sensitivity to perturbations in denoising inputs (noisy latent and timestep). Uses sensitivity as predictor of caching error to create dynamic caching policy that adaptively selects caching timesteps per sample.

Result: Experiments on Wan 2.1, CogVideoX, and LTX-Video show SenCache achieves better visual quality than existing caching methods under similar computational budgets.

Conclusion: SenCache provides a theoretical basis for adaptive caching, explains why prior empirical heuristics work partially, and extends them to a dynamic, sample-specific approach for accelerating diffusion video generation.

Abstract: Diffusion models achieve state-of-the-art video generation quality, but their inference remains expensive due to the large number of sequential denoising steps. This has motivated a growing line of research on accelerating diffusion inference. Among training-free acceleration methods, caching reduces computation by reusing previously computed model outputs across timesteps. Existing caching methods rely on heuristic criteria to choose cache/reuse timesteps and require extensive tuning. We address this limitation with a principled sensitivity-aware caching framework. Specifically, we formalize the caching error through an analysis of the model output sensitivity to perturbations in the denoising inputs, i.e., the noisy latent and the timestep, and show that this sensitivity is a key predictor of caching error. Based on this analysis, we propose Sensitivity-Aware Caching (SenCache), a dynamic caching policy that adaptively selects caching timesteps on a per-sample basis. Our framework provides a theoretical basis for adaptive caching, explains why prior empirical heuristics can be partially effective, and extends them to a dynamic, sample-specific approach. Experiments on Wan 2.1, CogVideoX, and LTX-Video show that SenCache achieves better visual quality than existing caching methods under similar computational budgets.
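A minimal sketch of the caching idea: skip the network at the steps judged least sensitive and reuse the last output there. The sensitivity scores, the fixed skip budget, and the toy update rule are all placeholders; the paper derives its dynamic policy from an actual sensitivity analysis of the model output:

```python
import numpy as np

def sample_with_cache(denoise, z, timesteps, sensitivity, reuse_frac=0.5):
    """Toy sensitivity-aware caching loop: skip the network at the
    timesteps whose precomputed sensitivity is lowest, reusing the
    previous output there. `denoise(z, t)` stands in for one model
    evaluation; `sensitivity[i]` for a per-step caching-error proxy.
    """
    k = int(len(timesteps) * reuse_frac)          # number of steps to skip
    skip = set(np.argsort(sensitivity)[:k])       # least-sensitive steps
    cached = None
    for i, t in enumerate(timesteps):
        if i in skip and cached is not None:
            eps = cached                          # reuse cached output
        else:
            eps = denoise(z, t)                   # full model call
            cached = eps
        z = z - eps / len(timesteps)              # toy update rule
    return z
```

With `reuse_frac=0.5` this halves the number of model calls; the paper's contribution is choosing *which* steps to skip, per sample, from a principled error predictor rather than a fixed heuristic.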

[165] MuViT: Multi-Resolution Vision Transformers for Learning Across Scales in Microscopy

Albert Dominguez Mantes, Gioele La Manno, Martin Weigert

Main category: cs.CV

TL;DR: MuViT is a transformer architecture that fuses multi-resolution observations from microscopy images by embedding patches in a shared world-coordinate system with extended rotary positional embeddings.

DetailsMotivation: Modern microscopy produces gigapixel images with structures across multiple spatial scales, but most vision models operate at single resolutions or derive multi-scale features from one view, limiting their ability to exploit the inherently multi-resolution nature of microscopy data.

Method: MuViT embeds all patches into a shared world-coordinate system and extends rotary positional embeddings to these coordinates, enabling attention to integrate wide-field context with high-resolution detail within a single encoder. It uses multi-resolution MAE pretraining for scale-consistent representations.

Result: Across synthetic benchmarks, kidney histopathology, and high-resolution mouse-brain microscopy, MuViT delivers consistent improvements over strong ViT and CNN baselines. Multi-resolution MAE pretraining further produces scale-consistent representations that enhance downstream tasks.

Conclusion: Explicit world-coordinate modelling provides a simple yet powerful mechanism for leveraging multi-resolution information in large-scale microscopy analysis.

Abstract: Modern microscopy routinely produces gigapixel images that contain structures across multiple spatial scales, from fine cellular morphology to broader tissue organization. Many analysis tasks require combining these scales, yet most vision models operate at a single resolution or derive multi-scale features from one view, limiting their ability to exploit the inherently multi-resolution nature of microscopy data. We introduce MuViT, a transformer architecture built to fuse true multi-resolution observations from the same underlying image. MuViT embeds all patches into a shared world-coordinate system and extends rotary positional embeddings to these coordinates, enabling attention to integrate wide-field context with high-resolution detail within a single encoder. Across synthetic benchmarks, kidney histopathology, and high-resolution mouse-brain microscopy, MuViT delivers consistent improvements over strong ViT and CNN baselines. Multi-resolution MAE pretraining further produces scale-consistent representations that enhance downstream tasks. These results demonstrate that explicit world-coordinate modelling provides a simple yet powerful mechanism for leveraging multi-resolution information in large-scale microscopy analysis.
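The world-coordinate rotary embedding can be sketched by replacing RoPE's integer token index with a continuous patch coordinate per spatial axis. This is a generic construction under assumed conventions (splitting dimensions evenly between the two axes, standard frequency base), not the paper's exact parameterization:

```python
import numpy as np

def rope_2d(x, coords, base=10000.0):
    """Rotary positional embedding driven by continuous 2D world
    coordinates instead of integer token indices.
    x:      (n_tokens, dim) patch embeddings, dim divisible by 4
    coords: (n_tokens, 2) patch centers in a shared world frame
    """
    d = x.shape[-1] // 2                     # half the dims per spatial axis
    out = np.empty_like(x)
    for axis in range(2):                    # rotate one half per axis
        seg = x[:, axis * d:(axis + 1) * d]
        half = d // 2
        inv_freq = base ** (-np.arange(half) / half)
        ang = coords[:, axis:axis + 1] * inv_freq        # (n, half) angles
        cos, sin = np.cos(ang), np.sin(ang)
        a, b = seg[:, :half], seg[:, half:]
        out[:, axis * d:axis * d + half] = a * cos - b * sin
        out[:, axis * d + half:(axis + 1) * d] = a * sin + b * cos
    return out
```

Because the per-pair rotations depend only on coordinate differences, attention scores become translation-invariant in world space, which is what lets patches sampled at different resolutions share one coordinate frame.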

[166] Enhancing Spatial Understanding in Image Generation via Reward Modeling

Zhenyu Tang, Chaoran Feng, Yufan Deng, Jie Wu, Xiaojie Li, Rui Wang, Yunpeng Chen, Daquan Zhou

Main category: cs.CV

TL;DR: A method to improve spatial understanding in text-to-image generation using a specialized reward model trained on spatial relationship preference pairs.

DetailsMotivation: Current text-to-image models struggle with complex spatial relationships, requiring multiple sampling attempts to achieve satisfactory results with intricate spatial prompts.

Method: Constructed SpatialReward-Dataset with 80k+ preference pairs, built SpatialScore reward model to evaluate spatial relationship accuracy, and used it for online reinforcement learning in image generation

Result: SpatialScore outperforms leading proprietary models on spatial evaluation and enables significant gains in spatial understanding for image generation across multiple benchmarks

Conclusion: Specialized reward models for spatial relationships can effectively enhance spatial understanding in text-to-image generation systems

Abstract: Recent progress in text-to-image generation has greatly advanced visual fidelity and creativity, but it has also imposed higher demands on prompt complexity, particularly in encoding intricate spatial relationships. In such cases, achieving satisfactory results often requires multiple sampling attempts. To address this challenge, we introduce a novel method that strengthens the spatial understanding of current image generation models. We first construct the SpatialReward-Dataset with over 80k preference pairs. Building on this dataset, we build SpatialScore, a reward model designed to evaluate the accuracy of spatial relationships in text-to-image generation, achieving performance that even surpasses leading proprietary models on spatial evaluation. We further demonstrate that this reward model effectively enables online reinforcement learning for complex spatial generation. Extensive experiments across multiple benchmarks show that our specialized reward model yields significant and consistent gains in spatial understanding for image generation.
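Reward models trained on preference pairs typically use a Bradley-Terry style objective; the sketch below shows that standard recipe. The summary does not specify SpatialScore's exact loss, so treat this as the generic form:

```python
import numpy as np

def preference_loss(r_chosen, r_rejected):
    """Bradley-Terry preference loss, -log(sigmoid(r_chosen - r_rejected))
    averaged over a batch: lower when the reward model scores the
    preferred (spatially correct) image higher than the rejected one.
    """
    margin = np.asarray(r_chosen, dtype=float) - np.asarray(r_rejected, dtype=float)
    return float(np.mean(np.log1p(np.exp(-margin))))
```

The trained scalar reward can then serve directly as the signal for online reinforcement learning over the generator, as the paper describes.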

[167] Joint Geometric and Trajectory Consistency Learning for One-Step Real-World Super-Resolution

Chengyan Deng, Zhangquan Chen, Li Yu, Kai Zhang, Xue Zhou, Wang Zhang

Main category: cs.CV

TL;DR: GTASR is a consistency training method for real-world image super-resolution that addresses consistency drift and geometric decoupling issues through trajectory alignment and dual-reference structural rectification.

DetailsMotivation: Diffusion-based Real-ISR achieves good perceptual quality but has high computational costs. Consistency models offer efficient inference but suffer from consistency drift accumulation and "Geometric Decoupling" - where pixel alignment doesn't preserve structural coherence.

Method: Proposes GTASR with Trajectory Alignment (TA) strategy to rectify tangent vector field via full-path projection, and Dual-Reference Structural Rectification (DRSR) mechanism to enforce strict structural constraints.

Result: Extensive experiments verify GTASR delivers superior performance over representative baselines while maintaining minimal latency.

Conclusion: GTASR provides an effective consistency training paradigm for Real-ISR that addresses key limitations of existing approaches while maintaining efficiency.

Abstract: Diffusion-based Real-World Image Super-Resolution (Real-ISR) achieves impressive perceptual quality but suffers from high computational costs due to iterative sampling. While recent distillation approaches leveraging large-scale Text-to-Image (T2I) priors have enabled one-step generation, they are typically hindered by prohibitive parameter counts and the inherent capability bounds imposed by teacher models. As a lightweight alternative, Consistency Models offer efficient inference but struggle with two critical limitations: the accumulation of consistency drift inherent to transitive training, and a phenomenon we term “Geometric Decoupling”, where the generative trajectory achieves pixel-wise alignment yet fails to preserve structural coherence. To address these challenges, we propose GTASR (Geometric Trajectory Alignment Super-Resolution), a simple yet effective consistency training paradigm for Real-ISR. Specifically, we introduce a Trajectory Alignment (TA) strategy to rectify the tangent vector field via full-path projection, and a Dual-Reference Structural Rectification (DRSR) mechanism to enforce strict structural constraints. Extensive experiments verify that GTASR delivers superior performance over representative baselines while maintaining minimal latency. The code and model will be released at https://github.com/Blazedengcy/GTASR.

[168] Compositional Generalization Requires Linear, Orthogonal Representations in Vision Embedding Models

Arnas Uselis, Andrea Dittadi, Seong Joon Oh

Main category: cs.CV

TL;DR: The paper establishes theoretical geometric constraints for compositional generalization, showing representations must decompose linearly into orthogonal per-concept components, and validates this with vision models.

DetailsMotivation: Modern models trained on massive datasets still cover only a tiny fraction of combinatorial input space, raising questions about what representational structure enables generalization to unseen combinations of familiar concepts.

Method: Formalizes three desiderata for compositional generalization (divisibility, transferability, stability), derives necessary geometric constraints, and empirically evaluates predictions across vision models (CLIP, SigLIP, DINO).

Result: Representations exhibit partial linear factorization with low-rank, near-orthogonal per-concept factors, and the degree of this structure correlates with compositional generalization on unseen combinations.

Conclusion: Linear structure in neural representations is a necessary consequence of compositional generalization, providing theoretical grounding for the Linear Representation Hypothesis and predicting representational geometry as models scale.

Abstract: Compositional generalization, the ability to recognize familiar parts in novel contexts, is a defining property of intelligent systems. Although modern models are trained on massive datasets, they still cover only a tiny fraction of the combinatorial space of possible inputs, raising the question of what structure representations must have to support generalization to unseen combinations. We formalize three desiderata for compositional generalization under standard training (divisibility, transferability, stability) and show they impose necessary geometric constraints: representations must decompose linearly into per-concept components, and these components must be orthogonal across concepts. This provides theoretical grounding for the Linear Representation Hypothesis: the linear structure widely observed in neural representations is a necessary consequence of compositional generalization. We further derive dimension bounds linking the number of composable concepts to the embedding geometry. Empirically, we evaluate these predictions across modern vision models (CLIP, SigLIP, DINO) and find that representations exhibit partial linear factorization with low-rank, near-orthogonal per-concept factors, and that the degree of this structure correlates with compositional generalization on unseen combinations. As models continue to scale, these conditions predict the representational geometry they may converge to. Code is available at https://github.com/oshapio/necessary-compositionality.
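The predicted geometry (linear additivity plus cross-concept orthogonality) is easy to state concretely. The synthetic check below builds embeddings that are compositional by construction and verifies both conditions numerically; the concept names, dimensions, and marginal-averaging recovery are illustrative, not the paper's evaluation protocol:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
# Synthetic embeddings of (color, shape) pairs built to be exactly
# compositional: z = u_color + v_shape with mutually orthogonal factors.
u = np.linalg.qr(rng.normal(size=(d, 6)))[0].T   # 6 orthonormal directions
colors, shapes = u[:3], u[3:]
emb = {(c, s): colors[c] + shapes[s] for c in range(3) for s in range(3)}

# Recover per-concept components by marginal averaging.
mu = np.mean(list(emb.values()), axis=0)
c_hat = [np.mean([emb[c, s] for s in range(3)], axis=0) - mu for c in range(3)]
s_hat = [np.mean([emb[c, s] for c in range(3)], axis=0) - mu for s in range(3)]

# The two geometric conditions, checked numerically:
# 1) orthogonality across concepts, 2) linear additive recomposition.
cross = max(abs(float(np.dot(ci, sj))) for ci in c_hat for sj in s_hat)
err = max(float(np.linalg.norm(mu + c_hat[c] + s_hat[s] - emb[c, s]))
          for c in range(3) for s in range(3))
```

On real encoders such as CLIP or DINO the paper reports that these quantities are only approximately zero, and that the degree of deviation tracks compositional generalization performance.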

[169] Hierarchical Action Learning for Weakly-Supervised Action Segmentation

Junxian Huang, Ruichu Cai, Hao Zhu, Juntao Fang, Boyan Xu, Weilin Chen, Zijian Li, Shenghua Gao

Main category: cs.CV

TL;DR: HAL model uses hierarchical causal modeling with varying timescales for weakly-supervised action segmentation, proving identifiability of latent action variables and outperforming existing methods.

DetailsMotivation: Humans perceive actions through hierarchical key transitions while machines over-segment based on visual features. The observation that low-level visual and high-level action latent variables evolve at different rates (visual changes rapidly, actions change slowly) provides an opportunity for better hierarchical reasoning in video understanding.

Method: Proposes Hierarchical Action Learning (HAL) model with hierarchical causal data generation process where high-level latent actions govern low-level visual feature dynamics. Uses deterministic processes to align latent variables over time, hierarchical pyramid transformer to capture features and latent variables, and sparse transition constraint to enforce slower dynamics of high-level actions.

Result: The model proves strict identifiability of latent action variables under mild assumptions. Experimental results on several benchmarks show HAL significantly outperforms existing methods for weakly-supervised action segmentation.

Conclusion: HAL effectively addresses hierarchical reasoning in video understanding by modeling varying timescales between visual and action variables, demonstrating practical effectiveness in real-world applications through superior weakly-supervised action segmentation performance.

Abstract: Humans perceive actions through key transitions that structure actions across multiple abstraction levels, whereas machines, relying on visual features, tend to over-segment. This highlights the difficulty of enabling hierarchical reasoning in video understanding. Interestingly, we observe that lower-level visual and high-level action latent variables evolve at different rates, with low-level visual variables changing rapidly, while high-level action variables evolve more slowly, making them easier to identify. Building on this insight, we propose the Hierarchical Action Learning (HAL) model for weakly-supervised action segmentation. Our approach introduces a hierarchical causal data generation process, where high-level latent action governs the dynamics of low-level visual features. To model these varying timescales effectively, we introduce deterministic processes to align these latent variables over time. The HAL model employs a hierarchical pyramid transformer to capture both visual features and latent variables, and a sparse transition constraint is applied to enforce the slower dynamics of high-level action variables. This mechanism enhances the identification of these latent variables over time. Under mild assumptions, we prove that these latent action variables are strictly identifiable. Experimental results on several benchmarks show that the HAL model significantly outperforms existing methods for weakly-supervised action segmentation, confirming its practical effectiveness in real-world applications.
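The sparse transition constraint can be illustrated with a total-variation style L1 penalty on the high-level latent sequence; this is one common form of such a constraint, not necessarily the paper's exact term:

```python
import numpy as np

def sparse_transition_penalty(z):
    """L1 (total-variation) penalty on frame-to-frame changes of the
    high-level latent sequence z of shape (T, d): it is zero while the
    latent stays constant and pays only at the few action boundaries.
    """
    return float(np.abs(np.diff(z, axis=0)).sum())
```

Minimizing such a penalty pushes the high-level latents to change only at a few boundaries, encoding the paper's observation that action variables evolve more slowly than visual ones.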

[170] Mode Seeking meets Mean Seeking for Fast Long Video Generation

Shengqu Cai, Weili Nie, Chao Liu, Julius Berner, Lvmin Zhang, Nanye Ma, Hansheng Chen, Maneesh Agrawala, Leonidas Guibas, Gordon Wetzstein, Arash Vahdat

Main category: cs.CV

TL;DR: A training paradigm called “Mode Seeking meets Mean Seeking” that decouples local fidelity from long-term coherence for minute-scale video generation using a Decoupled Diffusion Transformer with global Flow Matching and local Distribution Matching heads.

DetailsMotivation: Scaling video generation from seconds to minutes faces a bottleneck: short-video data is abundant and high-fidelity, but coherent long-form data is scarce and limited to narrow domains.

Method: Proposes a Decoupled Diffusion Transformer with two heads: 1) global Flow Matching head trained via supervised learning on long videos for narrative structure, and 2) local Distribution Matching head that aligns sliding windows to a frozen short-video teacher via mode-seeking reverse-KL divergence.

Result: Enables synthesis of minute-scale videos that learn long-range coherence from limited long videos while inheriting local realism from a short-video teacher, resulting in a few-step fast long video generator that closes the fidelity-horizon gap.

Conclusion: The method effectively improves local sharpness, motion, and long-range consistency in minute-scale video generation by decoupling local fidelity from long-term coherence.

Abstract: Scaling video generation from seconds to minutes faces a critical bottleneck: while short-video data is abundant and high-fidelity, coherent long-form data is scarce and limited to narrow domains. To address this, we propose a training paradigm where Mode Seeking meets Mean Seeking, decoupling local fidelity from long-term coherence based on a unified representation via a Decoupled Diffusion Transformer. Our approach utilizes a global Flow Matching head trained via supervised learning on long videos to capture narrative structure, while simultaneously employing a local Distribution Matching head that aligns sliding windows to a frozen short-video teacher via a mode-seeking reverse-KL divergence. This strategy enables the synthesis of minute-scale videos that learn long-range coherence and motion from limited long videos via supervised flow matching, while inheriting local realism by aligning every sliding-window segment of the student to a frozen short-video teacher, resulting in a few-step fast long video generator. Evaluations show that our method effectively closes the fidelity-horizon gap by jointly improving local sharpness, motion and long-range consistency. Project website: https://primecai.github.io/mmm/.
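The mode-seeking vs mean-seeking distinction behind the method's name can be seen with a generic one-dimensional toy (unrelated to the paper's training code): fit a unit-variance Gaussian to a bimodal target by minimizing each divergence over the Gaussian's mean. Forward KL averages over the target's modes; reverse KL commits to one:

```python
import numpy as np

xs = np.linspace(-10.0, 10.0, 4001)
dx = xs[1] - xs[0]

def g(m):
    """Unit-variance Gaussian density with mean m, on the grid xs."""
    return np.exp(-0.5 * (xs - m) ** 2) / np.sqrt(2.0 * np.pi)

p = 0.5 * g(-3.0) + 0.5 * g(3.0)       # bimodal target distribution

def kl(a, b, eps=1e-12):
    """Grid approximation of KL(a || b)."""
    return float(np.sum(a * (np.log(a + eps) - np.log(b + eps))) * dx)

means = np.linspace(-6.0, 6.0, 241)
fwd = [kl(p, g(m)) for m in means]     # forward KL: mean seeking
rev = [kl(g(m), p) for m in means]     # reverse KL: mode seeking
m_fwd = float(means[int(np.argmin(fwd))])   # lands between the modes
m_rev = float(means[int(np.argmin(rev))])   # lands on one of the modes
```

This is why the reverse-KL distribution-matching head preserves sharp local realism from the short-video teacher, while the supervised flow-matching head can average over the scarce long-video data for global structure.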

[171] UFO-4D: Unposed Feedforward 4D Reconstruction from Two Images

Junhwa Hur, Charles Herrmann, Songyou Peng, Philipp Henzler, Zeyu Ma, Todd Zickler, Deqing Sun

Main category: cs.CV

TL;DR: UFO-4D is a unified feedforward framework that reconstructs dense 4D representations from unposed image pairs using dynamic 3D Gaussian splats, enabling joint estimation of geometry, motion, and camera pose.

DetailsMotivation: Current methods for dense 4D reconstruction from unposed images rely on slow test-time optimization or fragmented task-specific models, lacking a unified feedforward approach.

Method: Uses dynamic 3D Gaussian splats to jointly estimate 3D geometry, 3D motion, and camera pose in a feedforward manner. Differentiably renders multiple signals from a single representation, enabling a self-supervised image synthesis loss that couples appearance, depth, and motion.

Result: Outperforms prior work by up to 3 times in joint geometry, motion, and camera pose estimation. Enables high-fidelity 4D interpolation across novel views and time.

Conclusion: UFO-4D provides a unified feedforward framework for dense 4D reconstruction that overcomes data scarcity through synergistic multi-modal supervision and enables novel view and temporal interpolation.

Abstract: Dense 4D reconstruction from unposed images remains a critical challenge, with current methods relying on slow test-time optimization or fragmented, task-specific feedforward models. We introduce UFO-4D, a unified feedforward framework to reconstruct a dense, explicit 4D representation from just a pair of unposed images. UFO-4D directly estimates dynamic 3D Gaussian Splats, enabling the joint and consistent estimation of 3D geometry, 3D motion, and camera pose in a feedforward manner. Our core insight is that differentiably rendering multiple signals from a single Dynamic 3D Gaussian representation offers major training advantages. This approach enables a self-supervised image synthesis loss while tightly coupling appearance, depth, and motion. Since all modalities share the same geometric primitives, supervising one inherently regularizes and improves the others. This synergy overcomes data scarcity, allowing UFO-4D to outperform prior work by up to 3 times in joint geometry, motion, and camera pose estimation. Our representation also enables high-fidelity 4D interpolation across novel views and time. Please visit our project page for visual results: https://ufo-4d.github.io/

[172] R2GenCSR: Mining Contextual and Residual Information for LLMs-based Radiology Report Generation

Xiao Wang, Yuehang Li, Fuling Wang, Shiao Wang, Chuanfu Li, Bo Jiang

Main category: cs.CV

TL;DR: A novel radiology report generation framework using Mamba as vision backbone with linear complexity, enhanced by context retrieval from training samples to improve LLM-based report generation.

DetailsMotivation: Existing radiology report generation methods using Transformers have high computational complexity and struggle to extract effective visual information for LLMs to generate high-quality reports.

Method: Proposes a context-guided efficient framework: 1) Uses Mamba as vision backbone for linear complexity, 2) Performs context retrieval from training set during training using both positive and negative samples, 3) Feeds vision tokens, context information, and prompts to LLM for report generation.

Result: Extensive experiments on three X-ray datasets (IU X-Ray, MIMIC-CXR, CheXpert Plus) validate effectiveness, with Mamba achieving comparable performance to Transformer models but with linear complexity.

Conclusion: The proposed framework addresses computational efficiency and feature representation issues in radiology report generation, offering an effective solution for medical image analysis with LLMs.

Abstract: Inspired by the tremendous success of Large Language Models (LLMs), existing Radiology report generation methods attempt to leverage large models to achieve better performance. They usually adopt a Transformer to extract the visual features of a given X-ray image, and then feed them into the LLM for text generation. How to extract more effective information for the LLMs to help them improve final results is an urgent problem that needs to be solved. Additionally, the use of visual Transformer models also brings high computational complexity. To address these issues, this paper proposes a novel context-guided efficient radiology report generation framework. Specifically, we introduce the Mamba as the vision backbone with linear complexity, and the performance obtained is comparable to that of the strong Transformer model. More importantly, we perform context retrieval from the training set for samples within each mini-batch during the training phase, utilizing both positively and negatively related samples to enhance feature representation and discriminative learning. Subsequently, we feed the vision tokens, context information, and prompt statements to invoke the LLM for generating high-quality medical reports. Extensive experiments on three X-ray report generation datasets (i.e., IU X-Ray, MIMIC-CXR, CheXpert Plus) fully validated the effectiveness of our proposed model. The source code is available at https://github.com/Event-AHU/Medical_Image_Analysis.
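The batch-time context retrieval can be sketched as a cosine-similarity lookup that returns the most similar (positive) and least similar (negative) training samples per batch element. The shapes and `k` are illustrative; the summary does not specify the exact retrieval rule:

```python
import numpy as np

def retrieve_context(batch_feats, train_feats, k=2):
    """Cosine-similarity context lookup: for each sample in a
    mini-batch, return the indices of the k most similar (positive)
    and k least similar (negative) training features.
    """
    def unit(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    sim = unit(batch_feats) @ unit(train_feats).T    # (B, N) similarities
    order = np.argsort(-sim, axis=1)                 # most similar first
    return order[:, :k], order[:, -k:]               # positives, negatives
```

The retrieved features would then be concatenated with the vision tokens and the prompt before invoking the LLM, in line with the framework's pipeline.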

[173] Small Drafts, Big Verdict: Information-Intensive Visual Reasoning via Speculation

Yuhan Liu, Lianhui Qin, Shengjie Wang

Main category: cs.CV

TL;DR: SV is a training-free framework that uses multiple lightweight draft VLMs to generate diverse reasoning paths, then a strong verdict VLM synthesizes them for accurate answers on information-intensive images.

DetailsMotivation: Large VLMs struggle with information-intensive images that densely interleave text and graphics, facing challenges in precise localization of critical cues and multi-hop reasoning to integrate dispersed evidence.

Method: Training-free framework with two stages: 1) Draft stage where small VLMs act as draft experts to generate diverse reasoning paths with localization candidates, 2) Verdict stage where a strong VLM synthesizes these paths to produce final answers, plus consensus expert selection to forward only high-agreement paths.

Result: Achieves consistent gains on challenging information-intensive and high-resolution VQA benchmarks including InfographicVQA, ChartMuseum, ChartQAPro, and HR-Bench 4K, with both error correction and cost-efficiency compared to large proprietary models.

Conclusion: SV effectively addresses VLMs’ limitations on dense visual information by combining multiple draft experts with a verdict model, achieving improved performance on complex multimodal reasoning tasks without training.

Abstract: Large Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet they struggle when reasoning over information-intensive images that densely interleave textual annotations with fine-grained graphical elements. The main challenges lie in precisely localizing critical cues in dense layouts and multi-hop reasoning to integrate dispersed evidence. We propose Speculative Verdict (SV), a training-free framework inspired by speculative decoding that combines multiple lightweight draft experts with a large verdict model. In the draft stage, small VLMs act as draft experts to generate reasoning paths that provide diverse localization candidates; in the verdict stage, a strong VLM synthesizes these paths to produce the final answer, minimizing computational cost while recovering correct answers. To further improve efficiency and accuracy, SV introduces a consensus expert selection mechanism that forwards only high-agreement reasoning paths to the verdict. Empirically, SV achieves consistent gains on challenging information-intensive and high-resolution visual question answering benchmarks, including InfographicVQA, ChartMuseum, ChartQAPro, and HR-Bench 4K. By synthesizing correct insights from multiple partially accurate reasoning paths, SV achieves both error correction and cost-efficiency compared to large proprietary models or training pipelines. Code is available at https://github.com/Tinaliu0123/speculative-verdict.
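The consensus expert selection step can be sketched as simple agreement filtering over the drafts' final answers. The data shapes and `min_agree` threshold are illustrative assumptions, not the paper's exact mechanism:

```python
from collections import Counter

def consensus_select(draft_answers, min_agree=2):
    """Consensus filtering sketch: keep only the reasoning paths whose
    final answer is shared by at least `min_agree` drafts; only those
    are forwarded to the verdict model. `draft_answers` maps a draft
    id to a (reasoning_path, answer) pair.
    """
    counts = Counter(ans for _, ans in draft_answers.values())
    return {k: (path, ans)
            for k, (path, ans) in draft_answers.items()
            if counts[ans] >= min_agree}
```

Forwarding only high-agreement paths is what keeps the expensive verdict model's input short while still letting it synthesize across partially correct drafts.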

[174] CO^3: Cooperative Unsupervised 3D Representation Learning for Autonomous Driving

Runjian Chen, Yao Mu, Runsen Xu, Wenqi Shao, Chenhan Jiang, Hang Xu, Zhenguo Li, Ping Luo

Main category: cs.CV

TL;DR: Summary unavailable: the arXiv API request for paper 2206.04028 was rate-limited (HTTP 429).

[175] Uni-ISP: Toward Unifying the Learning of ISPs from Multiple Mobile Cameras

Lingen Li, Mingde Yao, Xingyu Meng, Muquan Yu, Tianfan Xue, Jinwei Gu

Main category: cs.CV

Summary unavailable: the arXiv API request for 2406.01003 was rate-limited (HTTP 429).

[176] SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild?

Azmine Toushik Wasi, Wahid Faisal, Abdur Rahman, Mahfuz Ahmed Anik, Munem Shahriar, Mohsin Mahmud Topu, Sadia Tasnim Meem, Rahatun Nesa Priti, Sabrina Afroz Mitu, Md. Iqramul Hoque, Shahriyar Zaman Ridoy, Mohammed Eunus Ali, Majd Hawasly, Mohammad Raza, Md Rizwan Parvez

Main category: cs.CV

TL;DR: SpatiaLab is a comprehensive benchmark for evaluating vision-language models’ spatial reasoning capabilities in realistic, unconstrained contexts with 1,400 visual QA pairs across 6 categories and 30 task types.

Motivation: Spatial reasoning is fundamental to human cognition but remains a major challenge for VLMs. Prior work used synthetic/LLM-generated environments with limited task designs that fail to capture real-world complexity, visual noise, and diverse spatial relationships.

Method: Created SpatiaLab benchmark with 1,400 visual question-answer pairs across six categories (Relative Positioning, Depth & Occlusion, Orientation, Size & Scale, Spatial Navigation, 3D Geometry) with five subcategories each, yielding 30 distinct task types. Supports both multiple-choice and open-ended evaluation.

Result: Experiments show a substantial gap between VLMs and humans: InternVL3.5-72B achieves 54.93% accuracy (multiple-choice) versus 87.57% for humans. In the open-ended setting, GPT-5-mini scores highest at 40.93% versus 64.93% for humans, and all models drop roughly 10-25% relative to the multiple-choice setup.

Conclusion: SpatiaLab exposes critical limitations in VLMs’ spatial reasoning capabilities and provides a diverse, real-world evaluation framework to guide future research toward robust, human-aligned spatial understanding.

Abstract: Spatial reasoning is a fundamental aspect of human cognition, yet it remains a major challenge for contemporary vision-language models (VLMs). Prior work largely relied on synthetic or LLM-generated environments with limited task designs and puzzle-like setups, failing to capture the real-world complexity, visual noise, and diverse spatial relationships that VLMs encounter. To address this, we introduce SpatiaLab, a comprehensive benchmark for evaluating VLMs’ spatial reasoning in realistic, unconstrained contexts. SpatiaLab comprises 1,400 visual question-answer pairs across six major categories: Relative Positioning, Depth & Occlusion, Orientation, Size & Scale, Spatial Navigation, and 3D Geometry, each with five subcategories, yielding 30 distinct task types. Each subcategory contains at least 25 questions, and each main category includes at least 200 questions, supporting both multiple-choice and open-ended evaluation. Experiments across diverse state-of-the-art VLMs, including open- and closed-source models, reasoning-focused, and specialized spatial reasoning models, reveal a substantial gap in spatial reasoning capabilities compared with humans. In the multiple-choice setup, InternVL3.5-72B achieves 54.93% accuracy versus 87.57% for humans. In the open-ended setting, all models show a performance drop of around 10-25%, with GPT-5-mini scoring highest at 40.93% versus 64.93% for humans. These results highlight key limitations in handling complex spatial relationships, depth perception, navigation, and 3D geometry. By providing a diverse, real-world evaluation framework, SpatiaLab exposes critical challenges and opportunities for advancing VLMs’ spatial reasoning, offering a benchmark to guide future research toward robust, human-aligned spatial understanding. SpatiaLab is available at: https://spatialab-reasoning.github.io/.
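Given the 6-category × 5-subcategory layout, headline numbers like the 54.93% accuracy reduce to a simple per-category aggregation. A minimal sketch, assuming a hypothetical `(category, is_correct)` result format rather than SpatiaLab's actual output schema:

```python
from collections import defaultdict

def per_category_accuracy(records):
    """Aggregate benchmark results into a per-category accuracy table.

    `records` is an iterable of (category, is_correct) pairs -- an assumed
    format for illustration, not SpatiaLab's real evaluation output.
    """
    totals, hits = defaultdict(int), defaultdict(int)
    for category, correct in records:
        totals[category] += 1
        hits[category] += int(correct)
    return {c: hits[c] / totals[c] for c in totals}
```

The overall score is then the hit count over all 1,400 pairs, while the per-category view exposes which of the six axes (e.g. Depth & Occlusion vs. 3D Geometry) drives a model's failures.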

[177] Shuffle Mamba: State Space Models with Random Shuffle for Multi-Modal Image Fusion

Ke Cao, Xuanhua He, Tao Hu, Chengjun Xie, Man Zhou, Jie Zhang

Main category: cs.CV

Summary unavailable: the arXiv API request for 2409.01728 was rate-limited (HTTP 429).

[178] TREND: Unsupervised 3D Representation Learning via Temporal Forecasting for LiDAR Perception

Runjian Chen, Hyoungseob Park, Bo Zhang, Wenqi Shao, Ping Luo, Alex Wong

Main category: cs.CV

Summary unavailable: the arXiv API request for 2412.03054 was rate-limited (HTTP 429).

[179] CLAP: Unsupervised 3D Representation Learning for Fusion 3D Perception via Curvature Sampling and Prototype Learning

Runjian Chen, Hang Zhang, Avinash Ravichandran, Hyoungseob Park, Wenqi Shao, Alex Wong, Ping Luo

Main category: cs.CV

Summary unavailable: the arXiv API request for 2412.03059 was rate-limited (HTTP 429).

[180] GenVidBench: A 6-Million Benchmark for AI-Generated Video Detection

Zhenliang Ni, Qiangyu Yan, Mouxiao Huang, Tianning Yuan, Yehui Tang, Hailin Hu, Xinghao Chen, Yunhe Wang

Main category: cs.CV

Summary unavailable: the arXiv API request for 2501.11340 was rate-limited (HTTP 429).

[181] Spread them Apart: Towards Robust Watermarking of Generated Content

Mikhail Pautov, Danil Ivanov, Andrey V. Galichin, Oleg Rogov, Ivan Oseledets

Main category: cs.CV

Summary unavailable: the arXiv API request for 2502.07845 was rate-limited (HTTP 429).

[182] JiSAM: Alleviate Labeling Burden and Corner Case Problems in Autonomous Driving via Minimal Real-World Data

Runjian Chen, Wenqi Shao, Bo Zhang, Shaoshuai Shi, Li Jiang, Ping Luo

Main category: cs.CV

Summary unavailable: the arXiv API request for 2503.08422 was rate-limited (HTTP 429).

[183] Autoregressive Image Generation with Randomized Parallel Decoding

Haopeng Li, Jinyue Yang, Guoqi Li, Huan Wang

Main category: cs.CV

Summary unavailable: the arXiv API request for 2503.10568 was rate-limited (HTTP 429).

[184] Towards Generating Realistic 3D Semantic Training Data for Autonomous Driving

Lucas Nunes, Rodrigo Marcuzzi, Jens Behley, Cyrill Stachniss

Main category: cs.CV

Summary unavailable: the arXiv API request for 2503.21449 was rate-limited (HTTP 429).

[185] Multimodal Knowledge Distillation for Egocentric Action Recognition Robust to Missing Modalities

Maria Santos-Villafranca, Dustin Carrión-Ojeda, Alejandro Perez-Yus, Jesus Bermudez-Cameo, Jose J. Guerrero, Simone Schaub-Meyer

Main category: cs.CV

Summary unavailable: the arXiv API request for 2504.08578 was rate-limited (HTTP 429).

[186] What Makes Good Synthetic Training Data for Zero-Shot Stereo Matching?

David Yan, Alexander Raistrick, Jia Deng

Main category: cs.CV

Summary unavailable: the arXiv API request for 2504.16930 was rate-limited (HTTP 429).

[187] On the use of Graphs for Satellite Image Time Series

Corentin Dufourg, Charlotte Pelletier, Stéphane May, Sébastien Lefèvre

Main category: cs.CV

Summary unavailable: the arXiv API request for 2505.16685 was rate-limited (HTTP 429).

[188] Efficient Degradation-agnostic Image Restoration via Channel-Wise Functional Decomposition and Manifold Regularization

Bin Ren, Yawei Li, Xu Zheng, Yuqian Fu, Danda Pani Paudel, Hong Liu, Ming-Hsuan Yang, Luc Van Gool, Nicu Sebe

Main category: cs.CV

Summary unavailable: the arXiv API request for 2505.18679 was rate-limited (HTTP 429).

[189] OmniFall: From Staged Through Synthetic to Wild, A Unified Multi-Domain Dataset for Robust Fall Detection

David Schneider, Zdravko Marinov, Zeyun Zhong, Alexander Jaus, Rodi Düger, Rafael Baur, M. Saquib Sarfraz, Rainer Stiefelhagen

Main category: cs.CV

Summary unavailable: the arXiv API request for 2505.19889 was rate-limited (HTTP 429).

[190] Distilling Balanced Knowledge from a Biased Teacher

Seonghak Kim

Main category: cs.CV

Summary unavailable: the arXiv API request for 2506.18496 was rate-limited (HTTP 429).

[191] LiteReality: Graphics-Ready 3D Scene Reconstruction from RGB-D Scans

Zhening Huang, Xiaoyang Wu, Fangcheng Zhong, Hengshuang Zhao, Matthias Nießner, Joan Lasenby

Main category: cs.CV

TL;DR: LiteReality converts RGB-D scans into compact, realistic, interactive 3D virtual replicas with object individuality, articulation, PBR materials, and physical interactions for AR/VR, gaming, robotics, and digital twins.

Motivation: Need to create realistic, interactive 3D virtual replicas from real-world scans that support graphics pipeline features like object individuality, articulation, high-quality materials, and physical interactions for applications in AR/VR, gaming, robotics, and digital twins.

Method: Four-stage pipeline: 1) Scene understanding and parsing into 3D layout and objects via structured scene graph, 2) Retrieving visually similar 3D artist-crafted models from curated database, 3) Material Painting module for high-quality spatially varying materials, 4) Integration into simulation engine with physical properties for interactive behavior.

Result: Achieves state-of-the-art similarity performance on the Scan2CAD benchmark with training-free object retrieval; the robust material painting module transfers appearances from images of any style to 3D assets even under severe misalignment, occlusion, and poor lighting; produces compact, editable scenes compatible with standard graphics pipelines.

Conclusion: LiteReality effectively converts RGB-D scans into realistic, interactive 3D virtual replicas with graphics pipeline compatibility, demonstrating strong performance on real-life scans and public datasets for AR/VR, gaming, robotics, and digital twin applications.

Abstract: We propose LiteReality, a novel pipeline that converts RGB-D scans of indoor environments into compact, realistic, and interactive 3D virtual replicas. LiteReality not only reconstructs scenes that visually resemble reality but also supports key features essential for graphics pipelines – such as object individuality, articulation, high-quality physically based rendering materials, and physically based interaction. At its core, LiteReality first performs scene understanding and parses the results into a coherent 3D layout and objects with the help of a structured scene graph. It then reconstructs the scene by retrieving the most visually similar 3D artist-crafted models from a curated asset database. Next, the Material Painting module enhances realism by recovering high-quality, spatially varying materials. Finally, the reconstructed scene is integrated into a simulation engine with basic physical properties to enable interactive behavior. The resulting scenes are compact, editable, and fully compatible with standard graphics pipelines, making them suitable for applications in AR/VR, gaming, robotics, and digital twins. In addition, LiteReality introduces a training-free object retrieval module that achieves state-of-the-art similarity performance on the Scan2CAD benchmark, along with a robust material painting module capable of transferring appearances from images of any style to 3D assets – even under severe misalignment, occlusion, and poor lighting. We demonstrate the effectiveness of LiteReality on both real-life scans and public datasets. Project page: https://litereality.github.io; Video: https://www.youtube.com/watch?v=ecK9m3LXg2c
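The four stages read naturally as a function composition. The sketch below is an assumption-laden outline only: `parse_scene`, `retrieve_model`, `paint_materials`, and `simulate` are hypothetical callables standing in for the paper's scene-graph parser, asset retrieval, Material Painting module, and simulation engine, none of whose real interfaces are given in this summary.

```python
def lite_reality_pipeline(rgbd_scan, asset_db, parse_scene, retrieve_model,
                          paint_materials, simulate):
    """Sketch of the four LiteReality stages as a pipeline of hypothetical callables."""
    # 1) Scene understanding: parse the scan into a 3D layout plus detected objects.
    layout, objects = parse_scene(rgbd_scan)

    # 2) Retrieval: swap each detected object for its closest artist-crafted asset.
    assets = [retrieve_model(obj, asset_db) for obj in objects]

    # 3) Material painting: recover spatially varying PBR materials per asset.
    painted = [paint_materials(asset, rgbd_scan) for asset in assets]

    # 4) Simulation: register layout and assets with basic physical properties.
    return simulate(layout, painted)
```

Keeping each stage a pure function over the previous stage's output is what makes the resulting scene compact and editable: any single object's asset or material can be replaced without re-running the scan.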

[192] Empowering Small VLMs to Think with Dynamic Memorization and Exploration

Jiazhen Liu, Yuchuan Deng, Long Chen

Main category: cs.CV

Summary unavailable: the arXiv API request for 2506.23061 was rate-limited (HTTP 429).

[193] Concept-based Adversarial Attack: a Probabilistic Perspective

Andi Zhang, Xuan Ding, Steven McDonagh, Samuel Kaski

Main category: cs.CV

Summary unavailable: the arXiv API request for 2507.02965 was rate-limited (HTTP 429).

[194] SelvaBox: A high-resolution dataset for tropical tree crown detection

Hugo Baudchon, Arthur Ouaknine, Martin Weiss, Mélisande Teng, Thomas R. Walla, Antoine Caron-Guay, Christopher Pal, Etienne Laliberté

Main category: cs.CV

Summary unavailable: the arXiv API request for 2507.00170 was rate-limited (HTTP 429).

[195] Knowledge-Guided Machine Learning: Illustrating the use of Explainable Boosting Machines to Identify Overshooting Tops in Satellite Imagery

Nathan Mitchell, Lander Ver Hoef, Imme Ebert-Uphoff, Kristina Moen, Kyle Hilburn, Yoonjin Lee, Emily J. King

Main category: cs.CV

Summary unavailable: the arXiv API request for 2507.03183 was rate-limited (HTTP 429).

[196] pFedMMA: Personalized Federated Fine-Tuning with Multi-Modal Adapter for Vision-Language Models

Sajjad Ghiasvand, Mahnoosh Alizadeh, Ramtin Pedarsani

Main category: cs.CV

Summary unavailable: the arXiv API request for 2507.05394 was rate-limited (HTTP 429).

[197] DA-Occ: Direction-Aware 2D Convolution for Efficient and Geometry-Preserving 3D Occupancy Prediction in Autonomous Driving

Yuchen Zhou, Yan Luo, Xiaogang Wang, Xingjian Gu, Mingzhou Lu, Xiangbo Shu

Main category: cs.CV

Summary unavailable: the arXiv API request for 2507.23599 was rate-limited (HTTP 429).

[198] AutoDebias: Automated Framework for Debiasing Text-to-Image Models

Hongyi Cai, Mohammad Mahdinur Rahman, Mingkang Dong, Muxin Pu, Moqyad Alqaily, Jie Li, Xinfeng Li, Jialie Shen, Meikang Qiu, Qingsong Wen

Main category: cs.CV

Summary unavailable: the arXiv API request for 2508.00445 was rate-limited (HTTP 429).

[199] AnimateScene: Camera-controllable Animation in Any Scene

Qingyang Liu, Bingjie Gao, Weiheng Huang, Jun Zhang, Zhongqian Sun, Yang Wei, Fengrui Liu, Zelin Peng, Qianli Ma, Shuai Yang, Zhaohe Liao, Haonan Zhao, Li Niu

Main category: cs.CV

Summary unavailable: the arXiv API request for 2508.05982 was rate-limited (HTTP 429).

[200] CLIFF: Continual Learning for Incremental Flake Features in 2D Material Identification

Sankalp Pandey, Xuan Bac Nguyen, Nicholas Borys, Hugh Churchill, Khoa Luu

Main category: cs.CV

Summary unavailable: the arXiv API request for 2508.17261 was rate-limited (HTTP 429).

[201] Veritas: Generalizable Deepfake Detection via Pattern-Aware Reasoning

Hao Tan, Jun Lan, Zichang Tan, Ajian Liu, Chuanbiao Song, Senyuan Shi, Huijia Zhu, Weiqiang Wang, Jun Wan, Zhen Lei

Main category: cs.CV

Summary unavailable: the arXiv API request for 2508.21048 was rate-limited (HTTP 429).

[202] Draw-In-Mind: Rebalancing Designer-Painter Roles in Unified Multimodal Models Benefits Image Editing

Ziyun Zeng, David Junhao Zhang, Wei Li, Mike Zheng Shou

Main category: cs.CV

Summary unavailable: the arXiv API request for 2509.01986 was rate-limited (HTTP 429).

[203] MEGS$^{2}$: Memory-Efficient Gaussian Splatting via Spherical Gaussians and Unified Pruning

Jiarui Chen, Yikeng Chen, Yingshuang Zou, Ye Huang, Peng Wang, Yuan Liu, Yujing Sun, Wenping Wang

Main category: cs.CV

Summary unavailable: the arXiv API request for 2509.07021 was rate-limited (HTTP 429).

[204] ZOO-Prune: Training-Free Token Pruning via Zeroth-Order Gradient Estimation in Vision-Language Models

Youngeun Kim, Youjia Zhang, Huiling Liu, Aecheon Jung, Sunwoo Lee, Sungeun Hong

Main category: cs.CV

Summary unavailable: the arXiv API request for 2509.24837 was rate-limited (HTTP 429).

[205] Less is More: Lean yet Powerful Vision-Language Model for Autonomous Driving

Sheng Yang, Tong Zhan, Guancheng Chen, Yanfeng Lu, Jian Wang

Main category: cs.CV

Summary unavailable: the arXiv API request for 2510.00060 was rate-limited (HTTP 429).

[206] Activation Function Design Sustains Plasticity in Continual Learning

Lute Lillo, Nick Cheney

Main category: cs.CV

Summary unavailable: the arXiv API request for 2509.22562 was rate-limited (HTTP 429).

[207] Into the Rabbit Hull: From Task-Relevant Concepts in DINO to Minkowski Geometry

Thomas Fel, Binxu Wang, Michael A. Lepori, Matthew Kowal, Andrew Lee, Randall Balestriero, Sonia Joseph, Ekdeep S. Lubana, Talia Konkle, Demba Ba, Martin Wattenberg

Main category: cs.CV

Summary unavailable: the arXiv API request for 2510.08638 was rate-limited (HTTP 429).

[208] Uncertainty Matters in Dynamic Gaussian Splatting for Monocular 4D Reconstruction

Fengzhi Guo, Chih-Chuan Hsu, Sihao Ding, Cheng Zhang

Main category: cs.CV

TL;DR: USplat4D introduces uncertainty-aware dynamic Gaussian Splatting for 4D scene reconstruction, using per-Gaussian uncertainty estimation to propagate reliable motion cues and improve reconstruction quality under occlusion and extreme viewpoints.

Motivation: Dynamic 3D scene reconstruction from monocular input is under-constrained, with ambiguities from occlusion and extreme novel views. Current dynamic Gaussian Splatting models optimize all Gaussian primitives uniformly, ignoring observation reliability, leading to motion drifts and degraded synthesis quality.

Method: Proposes USplat4D framework that estimates time-varying per-Gaussian uncertainty and leverages it to construct a spatio-temporal graph for uncertainty-aware optimization. Reliably observed Gaussians act as anchors to guide motion propagation to less reliable ones.

Result: Experiments on diverse real and synthetic datasets show consistent improvements over baseline dynamic Gaussian Splatting models, with more stable geometry under occlusion and higher-quality synthesis at extreme viewpoints.

Conclusion: Explicitly modeling uncertainty in dynamic Gaussian Splatting significantly enhances 4D reconstruction quality by propagating reliable motion cues and addressing observation reliability variations across time and viewpoints.

Abstract: Reconstructing dynamic 3D scenes from monocular input is fundamentally under-constrained, with ambiguities arising from occlusion and extreme novel views. While dynamic Gaussian Splatting offers an efficient representation, vanilla models optimize all Gaussian primitives uniformly, ignoring whether they are well or poorly observed. This limitation leads to motion drifts under occlusion and degraded synthesis when extrapolating to unseen views. We argue that uncertainty matters: Gaussians with recurring observations across views and time act as reliable anchors to guide motion, whereas those with limited visibility are treated as less reliable. To this end, we introduce USplat4D, a novel Uncertainty-aware dynamic Gaussian Splatting framework that propagates reliable motion cues to enhance 4D reconstruction. Our approach estimates time-varying per-Gaussian uncertainty and leverages it to construct a spatio-temporal graph for uncertainty-aware optimization. Experiments on diverse real and synthetic datasets show that explicitly modeling uncertainty consistently improves dynamic Gaussian Splatting models, yielding more stable geometry under occlusion and high-quality synthesis at extreme viewpoints.
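The anchor-based propagation idea above can be sketched in a toy form. Everything here is an illustrative assumption rather than the paper's actual USplat4D implementation: the confidence formula `count / (1 + count)`, the 0.5 anchor threshold, and the k-nearest-neighbour graph are all hypothetical stand-ins for the paper's time-varying uncertainty estimates and spatio-temporal graph.

```python
import numpy as np

def propagate_motion(positions, motions, obs_counts, k=2):
    """Toy uncertainty-aware motion propagation.

    Gaussians with many observations (high confidence) keep their own
    motion and act as anchors; poorly observed Gaussians inherit a
    confidence-weighted average of their k nearest neighbours' motion.
    """
    n = len(positions)
    conf = obs_counts / (1.0 + obs_counts)   # confidence in [0, 1)
    out = motions.astype(float).copy()
    for i in range(n):
        if conf[i] >= 0.5:                   # reliable anchor: keep own motion
            continue
        # k nearest neighbours by Euclidean distance (excluding self)
        d = np.linalg.norm(positions - positions[i], axis=1)
        d[i] = np.inf
        nbrs = np.argsort(d)[:k]
        w = conf[nbrs]
        if w.sum() > 0:
            out[i] = (w[:, None] * out[nbrs]).sum(axis=0) / w.sum()
    return out
```

In this sketch a Gaussian seen zero times simply adopts the motion of its well-observed neighbours, which is the qualitative behaviour the abstract describes for occluded primitives.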

[209] Leveraging Multimodal LLM Descriptions of Activity for Explainable Semi-Supervised Video Anomaly Detection

Furkan Mumcu, Michael J. Jones, Anoop Cherian, Yasin Yilmaz

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2510.14896 was rate-limited (HTTP 429).

[210] TokenCLIP: Token-wise Prompt Learning for Zero-shot Anomaly Detection

Qihang Zhou, Binbin Gao, Guansong Pang, Xin Wang, Jiming Chen, Shibo He

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2510.21171 was rate-limited (HTTP 429).

[211] DeepEyesV2: Toward Agentic Multimodal Model

Jack Hong, Chenxiao Zhao, ChengLin Zhu, Weiheng Lu, Guohai Xu, Xing Yu

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2511.05271 was rate-limited (HTTP 429).

[212] SWITCH: Benchmarking Modeling and Handling of Tangible Interfaces in Long-horizon Embodied Scenarios

Jieru Lin, Zhiwei Yu, Börje F. Karlsson

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2511.17649 was rate-limited (HTTP 429).

[213] The False Promise of Zero-Shot Super-Resolution in Machine-Learned Operators

Mansi Sakarvadia, Kareem Hegazy, Amin Totounferoush, Kyle Chard, Yaoqing Yang, Ian Foster, Michael W. Mahoney

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2510.06646 was rate-limited (HTTP 429).

[214] Score-Regularized Joint Sampling with Importance Weights for Flow Matching

Xinshuang Liu, Runfa Blark Li, Shaoxiu Wei, Truong Nguyen

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2511.17812 was rate-limited (HTTP 429).

[215] General vs Domain-Specific CNNs: Understanding Pretraining Effects on Brain MRI Tumor Classification

Helia Abedini, Saba Rahimi, Reza Vaziri

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2511.18326 was rate-limited (HTTP 429).

[216] Q-Save: Towards Scoring and Attribution for Generated Video Evaluation

Xiele Wu, Zicheng Zhang, Mingtao Chen, Yixian Liu, Yiming Liu, Shushi Wang, Zhichao Hu, Yuhong Liu, Guangtao Zhai, Xiaohong Liu

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2511.18825 was rate-limited (HTTP 429).

[217] Thinking with Drafts: Speculative Temporal Reasoning for Efficient Long Video Understanding

Pengfei Hu, Meng Cao, Yingyao Wang, Yi Wang, Jiahua Dong, Jun Song, Yu Cheng, Bo Zheng, Xiaodan Liang

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2512.00805 was rate-limited (HTTP 429).

[218] TARDis: Time Attenuated Representation Disentanglement for Incomplete Multi-Modal Tumor Segmentation and Classification

Zishuo Wan, Qinqin Kang, Na Li, Yi Huang, Qianru Zhang, Le Lu, Yun Bian, Dawei Ding, Ke Yan

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2512.04576 was rate-limited (HTTP 429).

[219] Self-Supervised AI-Generated Image Detection: A Camera Metadata Perspective

Nan Zhong, Mian Zou, Yiran Xu, Zhenxing Qian, Xinpeng Zhang, Baoyuan Wu, Kede Ma

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2512.05651 was rate-limited (HTTP 429).

[220] FRIEDA: Benchmarking Multi-Step Cartographic Reasoning in Vision-Language Models

Jiyoon Pyo, Yuankun Jiao, Dongwon Jung, Zekun Li, Leeje Jang, Sofia Kirsanova, Jina Kim, Yijun Lin, Qin Liu, Junyi Xie, Hadi Askari, Nan Xu, Muhao Chen, Yao-Yi Chiang

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2512.08016 was rate-limited (HTTP 429).

[221] SocialNav: Training Human-Inspired Foundation Model for Socially-Aware Embodied Navigation

Ziyi Chen, Yingnan Guo, Zedong Chu, Minghua Luo, Yanfen Shen, Mingchao Sun, Junjun Hu, Shichao Xie, Kuan Yang, Pei Shi, Zhining Gu, Lu Liu, Honglin Han, Xiaolong Wu, Mu Xu, Yu Zhang, Ning Guo

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2511.21135 was rate-limited (HTTP 429).

[222] Sharp Monocular View Synthesis in Less Than a Second

Lars Mescheder, Wei Dong, Shiwei Li, Xuyang Bai, Marcel Santos, Peiyun Hu, Bruno Lecouat, Mingmin Zhen, Amaël Delaunoy, Tian Fang, Yanghai Tsin, Stephan R. Richter, Vladlen Koltun

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2512.10685 was rate-limited (HTTP 429).

[223] ColaVLA: Leveraging Cognitive Latent Reasoning for Hierarchical Parallel Trajectory Planning in Autonomous Driving

Qihang Peng, Xuesong Chen, Chenye Yang, Shaoshuai Shi, Hongsheng Li

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2512.22939 was rate-limited (HTTP 429).

[224] Inference-time Physics Alignment of Video Generative Models with Latent World Models

Jianhao Yuan, Xiaofeng Zhang, Felix Friedrich, Nicolas Beltran-Velez, Melissa Hall, Reyhane Askari-Hemmat, Xiaochuang Han, Nicolas Ballas, Michal Drozdzal, Adriana Romero-Soriano

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2601.10553 was rate-limited (HTTP 429).

[225] CPiRi: Channel Permutation-Invariant Relational Interaction for Multivariate Time Series Forecasting

Jiyuan Xu, Wenyu Zhang, Xin Jing, Shuai Chen, Shuai Zhang, Jiahao Nie

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2601.20318 was rate-limited (HTTP 429).

[226] Imagine a City: CityGenAgent for Procedural 3D City Generation

Zishan Liu, Zecong Tang, RuoCheng Wu, Xinzhe Zheng, Jingyu Hu, Ka-Hei Hui, Haoran Xie, Bo Dai, Zhengzhe Liu

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2602.05362 was rate-limited (HTTP 429).

[227] PixelRush: Ultra-Fast, Training-Free High-Resolution Image Generation via One-step Diffusion

Hong-Phuc Lai, Phong Nguyen, Anh Tran

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2602.12769 was rate-limited (HTTP 429).

[228] COOPERTRIM: Adaptive Data Selection for Uncertainty-Aware Cooperative Perception

Shilpa Mukhopadhyay, Amit Roy-Chowdhury, Hang Qiu

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2602.13287 was rate-limited (HTTP 429).

[229] Diff-Aid: Inference-time Adaptive Interaction Denoising for Rectified Text-to-Image Generation

Binglei Li, Mengping Yang, Zhiyu Tan, Junping Zhang, Hao Li

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2602.13585 was rate-limited (HTTP 429).

[230] SceneTok: A Compressed, Diffusable Token Space for 3D Scenes

Mohammad Asim, Christopher Wewer, Jan Eric Lenssen

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2602.18882 was rate-limited (HTTP 429).

[231] CRAFT-LoRA: Content-Style Personalization via Rank-Constrained Adaptation and Training-Free Fusion

Yu Li, Yujun Cai, Chi Zhang

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2602.18936 was rate-limited (HTTP 429).

[232] Erase at the Core: Representation Unlearning for Machine Unlearning

Jaewon Lee, Yongwoo Kim, Donghyun Kim

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2602.05375 was rate-limited (HTTP 429).

[233] Multimodal Dataset Distillation Made Simple by Prototype-Guided Data Synthesis

Junhyeok Choi, Sangwoo Mo, Minwoo Chae

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2602.19756 was rate-limited (HTTP 429).

[234] One2Scene: Geometric Consistent Explorable 3D Scene Generation from a Single Image

Pengfei Wang, Liyi Chen, Zhiyuan Ma, Yanjun Guo, Guowen Zhang, Lei Zhang

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2602.19766 was rate-limited (HTTP 429).

[235] FlowFixer: Towards Detail-Preserving Subject-Driven Generation

Jinyoung Jun, Won-Dong Jang, Wenbin Ouyang, Raghudeep Gadde, Jungbeom Lee

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2602.21402 was rate-limited (HTTP 429).

[236] From Statics to Dynamics: Physics-Aware Image Editing with Latent Transition Priors

Liangbing Zhao, Le Zhuo, Sayak Paul, Hongsheng Li, Mohamed Elhoseiny

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2602.21778 was rate-limited (HTTP 429).

[237] SemVideo: Reconstructs What You Watch from Brain Activity via Hierarchical Semantic Guidance

Minghan Yang, Lan Yang, Ke Li, Honggang Zhang, Kaiyue Pang, Yizhe Song

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2602.21819 was rate-limited (HTTP 429).

[238] Don’t let the information slip away

Taozhe Li, Guansu Wang, Bo Yu, Yiming Liu, Wei Sun

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2602.22595 was rate-limited (HTTP 429).

[239] Test-Time Training with KV Binding Is Secretly Linear Attention

Junchen Liu, Sven Elflein, Or Litany, Zan Gojcic, Ruilong Li

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2602.21204 was rate-limited (HTTP 429).

[240] GFRRN: Explore the Gaps in Single Image Reflection Removal

Yu Chen, Zewei He, Xingyu Liu, Zixuan Chen, Zheming Lu

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2602.22695 was rate-limited (HTTP 429).

[241] FedVG: Gradient-Guided Aggregation for Enhanced Federated Learning

Alina Devkota, Jacob Thrasher, Donald Adjeroh, Binod Bhattarai, Prashnna K. Gyawali

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2602.21399 was rate-limited (HTTP 429).

[242] SPATIALALIGN: Aligning Dynamic Spatial Relationships in Video Generation

Fengming Liu, Tat-Jen Cham, Chuanxia Zheng

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2602.22745 was rate-limited (HTTP 429).

[243] WARM-CAT: Warm-Started Test-Time Comprehensive Knowledge Accumulation for Compositional Zero-Shot Learning

Xudong Yan, Songhe Feng, Jiaxin Wang, Xin Su, Yi Jin

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2602.23114 was rate-limited (HTTP 429).

[244] Motion-aware Event Suppression for Event Cameras

Roberto Pellerito, Nico Messikommer, Giovanni Cioffi, Marco Cannici, Davide Scaramuzza

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2602.23204 was rate-limited (HTTP 429).

[245] LLM-Enhanced Multimodal Fusion for Cross-Domain Sequential Recommendation

Wangyu Wu, Zhenhong Chen, Wenqiao Zhang, Xianglin Qiu, Siqi Song, Xiaowei Huang, Fei Ma, Jimin Xiao

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API request for 2506.17966 was rate-limited (HTTP 429).

[246] Conformal Prediction for Long-Tailed Classification

Tiffany Ding, Jean-Baptiste Fermanian, Joseph Salmon

Main category: cs.CV

Summary unavailable; the arXiv API request for 2507.06867 returned HTTP 429 (rate limited).

[247] Animal behavioral analysis and neural encoding with transformer-based self-supervised pretraining

Yanchen Wang, Han Yu, Ari Blau, Yizi Zhang, International Brain Laboratory, Liam Paninski, Cole Hurwitz, Matt Whiteway

Main category: cs.CV

Summary unavailable; the arXiv API request for 2507.09513 returned HTTP 429 (rate limited).

[248] BeeNet: Reconstructing Flower Shapes from Electric Fields using Deep Learning

Jake Turley, Ryan A. Palmer, Isaac V. Chenchiah, Daniel Robert

Main category: cs.CV

Summary unavailable; the arXiv API request for 2508.11724 returned HTTP 429 (rate limited).

[249] CLEAR-IR: Clarity-Enhanced Active Reconstruction of Infrared Imagery

Nathan Shankar, Pawel Ladosz, Hujun Yin

Main category: cs.CV

Summary unavailable; the arXiv API request for 2510.04883 returned HTTP 429 (rate limited).

[250] Attentive Feature Aggregation or: How Policies Learn to Stop Worrying about Robustness and Attend to Task-Relevant Visual Cues

Nikolaos Tsagkas, Andreas Sochopoulos, Duolikun Danier, Sethu Vijayakumar, Alexandros Kouris, Oisin Mac Aodha, Chris Xiaoxuan Lu

Main category: cs.CV

Summary unavailable; the arXiv API request for 2511.10762 returned HTTP 429 (rate limited).

cs.AI

[251] HumanMCP: A Human-Like Query Dataset for Evaluating MCP Tool Retrieval Performance

Shubh Laddha, Lucas Changbencharoen, Win Kuptivej, Surya Shringla, Archana Vaidheeswaran, Yash Bhaskar

Main category: cs.AI

TL;DR: First large-scale MCP dataset with diverse, realistic user queries for 2800 tools across 308 MCP servers, addressing the gap in evaluating tool usage ecosystems.

DetailsMotivation: Existing datasets lack realistic, human-like user queries for evaluating Model Context Protocol (MCP) servers, leading to poor generalization and inflated benchmark reliability. Current datasets contain tool descriptions but fail to represent how different users portray their requests.

Method: Developed a large-scale MCP dataset featuring diverse, high-quality user queries specifically matched to 2800 tools across 308 MCP servers, building on the MCP Zero dataset. Each tool is paired with multiple unique user personas to capture varying levels of user intent, from precise task requests to ambiguous, exploratory commands.

Result: Created the first comprehensive MCP dataset that reflects real-world interaction patterns with diverse user personas and varying intent complexity, addressing the critical gap in evaluating MCP server tool usage ecosystems.

Conclusion: This dataset provides a more realistic foundation for evaluating MCP server performance and tool usage capabilities, enabling better assessment of how LLMs interact with external systems through standardized tools.

Abstract: Model Context Protocol (MCP) servers contain a collection of thousands of open-source standardized tools, linking LLMs to external systems; however, existing datasets and benchmarks lack realistic, human-like user queries, leaving a critical gap in evaluating the tool usage and ecosystems of MCP servers. Existing datasets often contain tool descriptions but fail to represent how different users phrase their requests, leading to poor generalization and inflated reliability of certain benchmarks. This paper introduces the first large-scale MCP dataset featuring diverse, high-quality user queries generated specifically to match 2800 tools across 308 MCP servers, building on the MCP Zero dataset. Each tool is paired with multiple unique user personas that we have generated, to capture varying levels of user intent ranging from precise task requests to ambiguous, exploratory commands, reflecting the complexity of real-world interaction patterns.

[252] An Agentic LLM Framework for Adverse Media Screening in AML Compliance

Pavel Chernakov, Sasan Jafarnejad, Raphaël Frank

Main category: cs.AI

TL;DR: An agentic system using LLMs with RAG for automated adverse media screening in AML/KYC compliance, reducing false positives and manual review.

DetailsMotivation: Traditional adverse media screening for AML/KYC compliance relies on keyword-based searches that generate high false-positive rates and require extensive manual review, creating inefficiencies in financial institutions' compliance processes.

Method: Multi-step agentic system using LLMs with Retrieval-Augmented Generation (RAG) where an LLM agent searches the web, retrieves and processes relevant documents, and computes an Adverse Media Index (AMI) score for each subject.

Result: The system is evaluated using multiple LLM backends on a dataset comprising PEPs, persons on regulatory watchlists, and sanctioned persons from OpenSanctions, plus clean names from academic sources, demonstrating its ability to distinguish between high-risk and low-risk individuals.

Conclusion: The LLM-based agentic system with RAG provides an effective automated solution for adverse media screening that improves accuracy and reduces manual workload in financial compliance processes.

Abstract: Adverse media screening is a critical component of anti-money laundering (AML) and know-your-customer (KYC) compliance processes in financial institutions. Traditional approaches rely on keyword-based searches that generate high false-positive rates or require extensive manual review. We present an agentic system that leverages Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG) to automate adverse media screening. Our system implements a multi-step approach where an LLM agent searches the web, retrieves and processes relevant documents, and computes an Adverse Media Index (AMI) score for each subject. We evaluate our approach using multiple LLM backends on a dataset comprising Politically Exposed Persons (PEPs), persons from regulatory watchlists, and sanctioned persons from OpenSanctions and clean names from academic sources, demonstrating the system’s ability to distinguish between high-risk and low-risk individuals.
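The abstract does not specify how the Adverse Media Index is computed from the retrieved documents, so the following is only an illustrative sketch: it assumes the agent assigns each document a retrieval-relevance score and an LLM-rated adversity score, and aggregates them into a relevance-weighted mean. The `Document` fields and the weighting scheme are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Document:
    relevance: float   # retrieval relevance in [0, 1] (assumed)
    adversity: float   # LLM-assigned adverse-content score in [0, 1] (assumed)


def adverse_media_index(docs: list[Document]) -> float:
    """Relevance-weighted mean of per-document adversity scores.

    The paper does not publish the AMI formula; this aggregation is a
    placeholder showing how per-document judgments could roll up into a
    single per-subject score.
    """
    total_weight = sum(d.relevance for d in docs)
    if total_weight == 0:
        return 0.0
    return sum(d.relevance * d.adversity for d in docs) / total_weight


# A subject with one highly relevant adverse article scores much higher
# than one mentioned only in marginally relevant, benign coverage:
docs = [Document(relevance=0.9, adversity=0.8),
        Document(relevance=0.2, adversity=0.1)]
score = adverse_media_index(docs)
```

A weighted mean keeps the score bounded in [0, 1] regardless of how many documents are retrieved, which makes thresholds comparable across subjects.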

[253] Causal Identification from Counterfactual Data: Completeness and Bounding Results

Arvind Raghavan, Elias Bareinboim

Main category: cs.AI

TL;DR: The paper presents CTFIDU+, a complete algorithm for identifying counterfactual queries from Layer 3 distributions, establishing fundamental limits of causal inference and deriving bounds for non-identifiable quantities.

DetailsMotivation: Recent work showed that some counterfactual distributions (Layer 3) can be directly estimated via experiments, raising the question of what additional counterfactual quantities become identifiable given access to such data.

Method: Developed the CTFIDU+ algorithm for identifying counterfactual queries from arbitrary Layer 3 distributions, proved its completeness, established theoretical limits of counterfactual identification from physically realizable distributions, and derived analytic bounds for non-identifiable quantities.

Result: CTFIDU+ is complete for counterfactual identification from Layer 3 data, establishes fundamental limits to exact causal inference in non-parametric settings, and shows counterfactual data helps tighten bounds for non-identifiable quantities in practice.

Conclusion: The work provides a complete solution for counterfactual identification from realizable Layer 3 data, establishes theoretical limits of causal inference, and offers practical bounds for quantities that cannot be exactly identified.

Abstract: Previous work establishing completeness results for $\textit{counterfactual identification}$ has been circumscribed to the setting where the input data belongs to observational or interventional distributions (Layers 1 and 2 of Pearl’s Causal Hierarchy), since it was generally presumed impossible to obtain data from counterfactual distributions, which belong to Layer 3. However, recent work (Raghavan & Bareinboim, 2025) has formally characterized a family of counterfactual distributions which can be directly estimated via experimental methods - a notion they call $\textit{counterfactual realizability}$. This leaves open the question of what $\textit{additional}$ counterfactual quantities now become identifiable, given this new access to (some) Layer 3 data. To answer this question, we develop the CTFIDU+ algorithm for identifying counterfactual queries from an arbitrary set of Layer 3 distributions, and prove that it is complete for this task. Building on this, we establish the theoretical limit of which counterfactuals can be identified from physically realizable distributions, thus implying the $\textit{fundamental limit to exact causal inference in the non-parametric setting}$. Finally, given the impossibility of identifying certain critical types of counterfactuals, we derive novel analytic bounds for such quantities using realizable counterfactual data, and corroborate using simulations that counterfactual data helps tighten the bounds for non-identifiable quantities in practice.

[254] Planning under Distribution Shifts with Causal POMDPs

Matteo Ceriscioli, Karthika Mohan

Main category: cs.AI

TL;DR: A theoretical framework for planning under distribution shifts using causal POMDPs, enabling evaluation of plans under environmental changes and maintaining tractable planning via α-vector methods.

DetailsMotivation: Real-world planning often fails due to distribution shifts where environment models become invalid as conditions change, causing previously learned strategies to fail.

Method: Proposes using Partially Observable Markov Decision Processes (POMDPs) formulated with causal knowledge, representing shifts as interventions on causal POMDPs, maintaining beliefs over latent states and domain changes, and proving value function remains piecewise linear and convex.

Result: The framework enables evaluating plans under hypothesized environmental changes and actively identifying altered components, while preserving tractability of planning via α-vector-based POMDP methods.

Conclusion: Causal POMDPs provide a principled approach for planning under distribution shifts while maintaining computational tractability through preservation of PWLC properties.

Abstract: In the real world, planning is often challenged by distribution shifts. As such, a model of the environment obtained under one set of conditions may no longer remain valid as the distribution of states or the environment dynamics change, which in turn causes previously learned strategies to fail. In this work, we propose a theoretical framework for planning under partial observability using Partially Observable Markov Decision Processes (POMDPs) formulated using causal knowledge. By representing shifts in the environment as interventions on this causal POMDP, the framework enables evaluating plans under hypothesized changes and actively identifying which components of the environment have been altered. We show how to maintain and update a belief over both the latent state and the underlying domain, and we prove that the value function remains piecewise linear and convex (PWLC) in this augmented belief space. Preservation of PWLC under distribution shifts has the advantage of maintaining the tractability of planning via $α$-vector-based POMDP methods.
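The PWLC property the paper preserves is what makes α-vector planning tractable: a POMDP value function is represented by a finite set of α-vectors, and the value of a belief is the maximum inner product over that set. The toy 2-state example below only illustrates that representation; the paper's contribution is showing it survives in the augmented belief space over latent states and candidate domain shifts.

```python
import numpy as np

# Each alpha-vector gives the expected value of committing to one
# conditional plan as a linear function of the belief; taking the max
# over a finite set yields a piecewise-linear convex (PWLC) function.
alphas = np.array([
    [1.0, 0.0],   # plan optimized for state 0
    [0.0, 1.0],   # plan optimized for state 1
    [0.6, 0.6],   # hedging / information-gathering plan
])


def value(belief: np.ndarray) -> float:
    """V(b) = max_alpha <alpha, b>, convex in b by construction."""
    return float(np.max(alphas @ belief))


b = np.array([0.5, 0.5])
v = value(b)   # at a uniform belief, the hedging plan dominates
```

When the belief space is augmented with a discrete variable over hypothesized interventions, the same max-over-linear-functions representation applies, which is why standard α-vector solvers remain usable.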

[255] Construct, Merge, Solve & Adapt with Reinforcement Learning for the min-max Multiple Traveling Salesman Problem

Guillem Rodríguez-Corominas, Maria J. Blesa, Christian Blum

Main category: cs.AI

TL;DR: A hybrid approach combining reinforcement learning with exact optimization for solving the min-max multiple traveling salesman problem, focusing on workload balance through minimizing the longest tour.

DetailsMotivation: The min-max mTSP addresses workload balance in routing problems where multiple agents need to visit all customers while minimizing the maximum tour length. Existing methods need improvement for larger instances and more salesmen.

Method: RL-CMSA (Construct, Merge, Solve & Adapt with Reinforcement Learning) uses probabilistic clustering guided by learned q-values to construct diverse solutions, merges routes into a pool, solves a restricted set-covering MILP, and refines solutions with local search moves. Q-values are updated based on city-pair co-occurrences in high-quality solutions.

Result: The method consistently finds (near-)best solutions and outperforms a state-of-the-art hybrid genetic algorithm, especially as instance size and number of salesmen increase, on both random and TSPLIB instances.

Conclusion: The hybrid approach combining exact optimization and reinforcement-guided construction effectively balances exploration and exploitation for solving min-max mTSP problems, demonstrating superior performance over existing methods.

Abstract: The Multiple Traveling Salesman Problem (mTSP) extends the Traveling Salesman Problem to m tours that start and end at a common depot and jointly visit all customers exactly once. In the min-max variant, the objective is to minimize the longest tour, reflecting workload balance. We propose a hybrid approach, Construct, Merge, Solve & Adapt with Reinforcement Learning (RL-CMSA), for the symmetric single-depot min-max mTSP. The method iteratively constructs diverse solutions using probabilistic clustering guided by learned pairwise q-values, merges routes into a compact pool, solves a restricted set-covering MILP, and refines solutions via inter-route remove, shift, and swap moves. The q-values are updated by reinforcing city-pair co-occurrences in high-quality solutions, while the pool is adapted through ageing and pruning. This combination of exact optimization and reinforcement-guided construction balances exploration and exploitation. Computational results on random and TSPLIB instances show that RL-CMSA consistently finds (near-)best solutions and outperforms a state-of-the-art hybrid genetic algorithm under comparable time limits, especially as instance size and the number of salesmen increase.
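The abstract states only that q-values are updated by reinforcing city-pair co-occurrences in high-quality solutions; the exponential-moving-average update and learning rate below are assumptions used to make the idea concrete. Pairs that share a route in a good solution are nudged toward 1, all others decay toward 0, biasing the next round of probabilistic clustering.

```python
from itertools import combinations


def update_q(q: dict, solution_routes: list[list[int]], lr: float = 0.1) -> None:
    """Reinforce q-values for city pairs that co-occur on a route.

    The update rule is an illustrative assumption; the paper specifies
    only that co-occurrences in high-quality solutions are reinforced.
    """
    reinforced = set()
    for route in solution_routes:
        for i, j in combinations(sorted(route), 2):
            reinforced.add((i, j))
    for pair in q:
        target = 1.0 if pair in reinforced else 0.0
        q[pair] += lr * (target - q[pair])


# Cities 0 and 1 share a route in this (toy) high-quality solution,
# so their pairwise q-value rises while the others decay.
q = {(0, 1): 0.5, (0, 2): 0.5, (1, 2): 0.5}
update_q(q, [[0, 1], [2]])
```

Decaying non-reinforced pairs keeps exploration alive: pairings that stop appearing in good solutions gradually lose their construction bias.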

[256] SleepLM: Natural-Language Intelligence for Human Sleep

Zongzhe Xu, Zitao Shuai, Eideen Mozaffari, Ravi S. Aysola, Rajesh Kumar, Yuzhe Yang

Main category: cs.AI

TL;DR: SleepLM is a multimodal foundation model that aligns sleep physiology data (polysomnography) with natural language, enabling language-based sleep analysis, interpretation, and interaction.

DetailsMotivation: Current sleep analysis systems operate in closed label spaces with predefined stages/events, failing to describe, query, or generalize to novel sleep phenomena. There's a need for more flexible, language-grounded sleep understanding.

Method: Developed a multilevel sleep caption generation pipeline to create a large-scale sleep-text dataset (100K+ hours from 10K+ individuals). Used unified pretraining combining contrastive alignment, caption generation, and signal reconstruction to capture physiological fidelity and cross-modal interactions.

Result: SleepLM outperforms state-of-the-art in zero-shot/few-shot learning, cross-modal retrieval, and sleep captioning. Exhibits novel capabilities including language-guided event localization, targeted insight generation, and zero-shot generalization to unseen tasks.

Conclusion: SleepLM bridges natural language and multimodal sleep physiology data, enabling more flexible and interpretable sleep analysis with language-based interaction capabilities.

Abstract: We present SleepLM, a family of sleep-language foundation models that enable human sleep alignment, interpretation, and interaction with natural language. Despite the critical role of sleep, learning-based sleep analysis systems operate in closed label spaces (e.g., predefined stages or events) and fail to describe, query, or generalize to novel sleep phenomena. SleepLM bridges natural language and multimodal polysomnography, enabling language-grounded representations of sleep physiology. To support this alignment, we introduce a multilevel sleep caption generation pipeline that enables the curation of the first large-scale sleep-text dataset, comprising over 100K hours of data from more than 10,000 individuals. Furthermore, we present a unified pretraining objective that combines contrastive alignment, caption generation, and signal reconstruction to better capture physiological fidelity and cross-modal interactions. Extensive experiments on real-world sleep understanding tasks verify that SleepLM outperforms state-of-the-art in zero-shot and few-shot learning, cross-modal retrieval, and sleep captioning. Importantly, SleepLM also exhibits intriguing capabilities including language-guided event localization, targeted insight generation, and zero-shot generalization to unseen tasks. All code and data will be open-sourced.

[257] Human or Machine? A Preliminary Turing Test for Speech-to-Speech Interaction

Xiang Li, Jiabao Gao, Sipei Lin, Xuan Zhou, Chi Zhang, Bo Cheng, Jiale Han, Benyou Wang

Main category: cs.AI

TL;DR: First Turing test for speech-to-speech systems reveals no existing system passes, with failure due to paralinguistic features, emotional expressivity, and conversational persona rather than semantic understanding.

DetailsMotivation: To determine whether modern speech-to-speech systems can converse like humans by conducting the first Turing test for S2S systems, moving beyond binary outcomes to provide diagnostic insights for improving conversational AI.

Method: Collected 2,968 human judgments on dialogues between 9 state-of-the-art S2S systems and 28 human participants. Developed fine-grained taxonomy of 18 human-likeness dimensions and crowd-annotated dialogues accordingly. Proposed interpretable model leveraging fine-grained ratings for accurate human-vs-machine discrimination.

Result: No existing evaluated S2S system passes the Turing test. Bottleneck is not semantic understanding but paralinguistic features, emotional expressivity, and conversational persona. Off-the-shelf AI models perform unreliably as Turing test judges.

Conclusion: Establishes first human-likeness evaluation for S2S systems, providing detailed diagnostic insights rather than binary outcomes. Offers interpretable model for automatic human-likeness evaluation, paving way for human-like improvements in conversational AI.

Abstract: The pursuit of human-like conversational agents has long been guided by the Turing test. For modern speech-to-speech (S2S) systems, a critical yet unanswered question is whether they can converse like humans. To tackle this, we conduct the first Turing test for S2S systems, collecting 2,968 human judgments on dialogues between 9 state-of-the-art S2S systems and 28 human participants. Our results deliver a clear finding: no existing evaluated S2S system passes the test, revealing a significant gap in human-likeness. To diagnose this failure, we develop a fine-grained taxonomy of 18 human-likeness dimensions and crowd-annotate our collected dialogues accordingly. Our analysis shows that the bottleneck is not semantic understanding but stems from paralinguistic features, emotional expressivity, and conversational persona. Furthermore, we find that off-the-shelf AI models perform unreliably as Turing test judges. In response, we propose an interpretable model that leverages the fine-grained human-likeness ratings and delivers accurate and transparent human-vs-machine discrimination, offering a powerful tool for automatic human-likeness evaluation. Our work establishes the first human-likeness evaluation for S2S systems and moves beyond binary outcomes to enable detailed diagnostic insights, paving the way for human-like improvements in conversational AI systems.

[258] MMKG-RDS: Reasoning Data Synthesis via Deep Mining of Multimodal Knowledge Graphs

Lun Zhan, Feng Xiong, Huanyong Liu, Feng Zhang, Yuhui Yin

Main category: cs.AI

TL;DR: MMKG-RDS is a flexible framework for synthesizing high-quality reasoning training data using multimodal knowledge graphs, addressing limitations in knowledge coverage, verification, and interpretability.

DetailsMotivation: Existing methods for synthesizing training data have limitations in long-tail knowledge coverage, effectiveness verification, and interpretability. Knowledge-graph-based approaches lack functionality, granularity, customizability, and proper evaluation.

Method: Proposes MMKG-RDS framework leveraging multimodal knowledge graphs with fine-grained knowledge extraction, customizable path sampling, and multidimensional data quality scoring. Validated with MMKG-RDS-Bench dataset covering 5 domains, 17 task types, and 14,950 samples.

Result: Fine-tuning Qwen3 models (0.6B/8B/32B) on synthesized samples improves reasoning accuracy by 9.2%. The framework generates challenging data for tasks involving tables and formulas, useful for complex benchmark construction.

Conclusion: MMKG-RDS provides an effective solution for reasoning data synthesis with multimodal knowledge graphs, enabling better training data generation and benchmark construction for complex reasoning tasks.

Abstract: Synthesizing high-quality training data is crucial for enhancing domain models’ reasoning abilities. Existing methods face limitations in long-tail knowledge coverage, effectiveness verification, and interpretability. Knowledge-graph-based approaches still fall short in functionality, granularity, customizability, and evaluation. To address these issues, we propose MMKG-RDS, a flexible framework for reasoning data synthesis that leverages multimodal knowledge graphs. It supports fine-grained knowledge extraction, customizable path sampling, and multidimensional data quality scoring. We validate MMKG-RDS with the MMKG-RDS-Bench dataset, covering five domains, 17 task types, and 14,950 samples. Experimental results show fine-tuning Qwen3 models (0.6B/8B/32B) on a small number of synthesized samples improves reasoning accuracy by 9.2%. The framework also generates distinct data, challenging existing models on tasks involving tables and formulas, useful for complex benchmark construction. The dataset and code are available at https://github.com/360AILAB-NLP/MMKG-RDS

[259] AI Must Embrace Specialization via Superhuman Adaptable Intelligence

Judah Goldfeder, Philippe Wyder, Yann LeCun, Ravid Shwartz Ziv

Main category: cs.AI

TL;DR: The paper critiques the concept of Artificial General Intelligence (AGI) and proposes Superhuman Adaptable Intelligence (SAI) as a more useful framework, arguing AI should embrace specialization with superhuman performance rather than striving for human-like generality.

DetailsMotivation: The authors argue that current discussions about AGI are confused by overloaded definitions and flawed concepts. They question whether humans are truly "general" and whether striving for human-like generality is the right goal for AI development.

Method: The paper analyzes existing definitions of AGI, critiques their coherence and usefulness, and proposes an alternative framework called Superhuman Adaptable Intelligence (SAI). It explores the implications of shifting from AGI to SAI as a guiding concept for AI development.

Result: The authors conclude that SAI provides a clearer framework than AGI, focusing on specialized superhuman performance rather than human-like generality. This reframing helps clarify discussions about AI’s future direction and goals.

Conclusion: AI should embrace specialization with superhuman capabilities rather than striving for human-like generality. SAI offers a more useful framework for guiding AI development and discussing its future implications.

Abstract: Everyone from AI executives and researchers to doomsayers, politicians, and activists is talking about Artificial General Intelligence (AGI). Yet, they often don’t seem to agree on its exact definition. One common definition of AGI is an AI that can do everything a human can do, but are humans truly general? In this paper, we address what’s wrong with our conception of AGI, and why, even in its most coherent formulation, it is a flawed concept to describe the future of AI. We explore whether the most widely accepted definitions are plausible, useful, and truly general. We argue that AI must embrace specialization, rather than strive for generality, and in its specialization strive for superhuman performance, and introduce Superhuman Adaptable Intelligence (SAI). SAI is defined as intelligence that can learn to exceed humans at anything important that we can do, and that can fill in the skill gaps where humans are incapable. We then lay out how SAI can help hone a discussion around AI that was blurred by an overloaded definition of AGI, and extrapolate the implications of using it as a guide for the future.

[260] Integrating LLM in Agent-Based Social Simulation: Opportunities and Challenges

Patrick Taillandier, Jean Daniel Zucker, Arnaud Grignard, Benoit Gaudou, Nghi Quang Huynh, Alexis Drogoul

Main category: cs.AI

TL;DR: LLMs in social simulation: potential and limitations from computational social science perspective, with focus on hybrid approaches combining LLMs with traditional agent-based modeling.

DetailsMotivation: To examine LLMs' capabilities and limitations in social simulation contexts, particularly their ability to replicate human cognition and social behaviors, and to address challenges in behavioral fidelity and validation.

Method: Position paper methodology: review of LLM capabilities in human cognition replication, analysis of multi-agent simulation frameworks (Generative Agents, AgentSociety), and proposal of hybrid approaches integrating LLMs with established agent-based modeling platforms.

Result: Identifies LLMs’ strengths in Theory of Mind reasoning and social inference, but notes limitations including cognitive biases and behavioral inconsistencies. Proposes Hybrid Constitutional Architectures combining classical ABMs, SLMs, and LLMs.

Conclusion: LLMs show promise for operational applications like interactive simulations but raise epistemic concerns for explanatory modeling. Hybrid approaches in platforms like GAMA and NetLogo offer better balance between flexibility and transparency.

Abstract: This position paper examines the use of Large Language Models (LLMs) in social simulation, analyzing their potential and limitations from a computational social science perspective. We first review recent findings on LLMs’ ability to replicate key aspects of human cognition, including Theory of Mind reasoning and social inference, while identifying persistent limitations such as cognitive biases, lack of grounded understanding, and behavioral inconsistencies. We then survey emerging applications of LLMs in multi-agent simulation frameworks, examining system architectures, scalability, and validation strategies. Projects such as Generative Agents (Smallville) and AgentSociety are analyzed with respect to their empirical grounding and methodological design. Particular attention is given to the challenges of behavioral fidelity, calibration, and reproducibility in large-scale LLM-driven simulations. Finally, we distinguish between contexts where LLM-based agents provide operational value-such as interactive simulations and serious games-and contexts where their use raises epistemic concerns, particularly in explanatory or predictive modeling. We argue that hybrid approaches integrating LLMs into established agent-based modeling platforms such as GAMA and NetLogo may offer a promising compromise between expressive flexibility and analytical transparency. Building on this analysis, we outline a conceptual research direction termed Hybrid Constitutional Architectures, which proposes a stratified integration of classical agent-based models (ABMs), small language models (SLMs), and LLMs within established platforms such as GAMA and NetLogo.

[261] PseudoAct: Leveraging Pseudocode Synthesis for Flexible Planning and Action Control in Large Language Model Agents

Yihan Wen, Xin Chen

Main category: cs.AI

TL;DR: PseudoAct: A framework using pseudocode synthesis for LLM agents to create structured plans with explicit control flow for efficient long-horizon tasks

DetailsMotivation: Reactive decision-making paradigms like ReAct in LLM agents lead to redundant tool usage, unstable reasoning, and high token consumption in complex long-horizon tasks involving branching, iteration, or multi-tool coordination

Method: Synthesizes structured pseudocode plans that decompose tasks into subtasks with explicit control flow (sequencing, conditionals, loops, parallel composition), then executes actions by following this global plan

Result: Significantly outperforms existing reactive agent approaches, achieving 20.93% absolute gain in success rate on FEVER and setting new state-of-the-art on HotpotQA

Conclusion: PseudoAct enables consistent and efficient long-horizon decision-making by making decision logic explicit and temporally coherent, reducing redundant actions and preventing infinite loops

Abstract: Large language model (LLM) agents typically rely on reactive decision-making paradigms such as ReAct, selecting actions conditioned on growing execution histories. While effective for short tasks, these approaches often lead to redundant tool usage, unstable reasoning, and high token consumption in complex long-horizon tasks involving branching, iteration, or multi-tool coordination. To address these limitations, this paper introduces PseudoAct, a novel framework for flexible planning and action control in LLM agents through pseudocode synthesis. Leveraging the ability of LLMs to express task-solving strategies as code, PseudoAct synthesizes a structured pseudocode plan that decomposes a task into subtasks and explicitly encodes control flow, including sequencing, conditionals, loops, parallel composition, and combinations of these logic primitives. Actions are then executed by following this global plan, making the decision logic explicit and temporally coherent. This design reduces redundant actions, prevents infinite loops, and avoids uninformative alternative exploration, enabling consistent and efficient long-horizon decision-making. Experiments on benchmark datasets show that our method significantly outperforms existing reactive agent approaches, achieving a 20.93% absolute gain in success rate on FEVER and setting a new state-of-the-art on HotpotQA.
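The paper does not publish its plan schema, so the toy interpreter below is only a sketch of the core idea: a global pseudocode plan with explicit control flow (sequence, conditional, loop) drives tool calls, instead of a ReAct-style loop that re-decides after every observation. The tuple-based plan encoding and the `search`/`answer` tools are hypothetical.

```python
def execute(plan, tools, state):
    """Interpret a pseudocode plan encoded as nested tuples.

    Supported primitives (an assumed encoding, not the paper's):
      ("seq", step, ...)           - run steps in order
      ("if", cond, then, else)     - branch on a predicate over state
      ("while", cond, body)        - loop while the predicate holds
      ("call", tool_name, *args)   - invoke a tool, mutating state
    """
    kind = plan[0]
    if kind == "seq":
        for step in plan[1:]:
            execute(step, tools, state)
    elif kind == "if":
        _, cond, then_branch, else_branch = plan
        execute(then_branch if cond(state) else else_branch, tools, state)
    elif kind == "while":
        _, cond, body = plan
        while cond(state):
            execute(body, tools, state)
    elif kind == "call":
        _, tool, *args = plan
        tools[tool](state, *args)


tools = {
    "search": lambda s, q: s.setdefault("hits", []).append(q),
    "answer": lambda s: s.__setitem__("done", True),
}

# "Search until two results are gathered, then answer" as a global plan:
plan = ("seq",
        ("while", lambda s: len(s.get("hits", [])) < 2,
            ("call", "search", "query")),
        ("call", "answer"))

state = {}
execute(plan, tools, state)
```

Because the loop bound and the final `answer` step are fixed in the plan up front, the agent cannot wander into redundant searches or skip termination, which is exactly the failure mode the paper attributes to reactive histories.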

[262] ODAR: Principled Adaptive Routing for LLM Reasoning via Active Inference

Siyuan Ma, Bo Gao, Xiaojun Jia, Simeng Qin, Tianlin Li, Ke Ma, Xiaoshuang Jia, Wenqi Ren, Yang Liu

Main category: cs.AI

TL;DR: ODAR-Expert: Adaptive routing framework for LLMs that dynamically allocates compute between fast and slow agents based on query difficulty, using free-energy-based fusion for answer selection.

DetailsMotivation: Current LLM reasoning approaches rely on uniform brute-force sampling (like best-of-N or self-consistency) which is computationally expensive, hard to attribute, and suffers from diminishing returns due to overthinking. There's a need for adaptive resource allocation that optimizes the accuracy-efficiency trade-off.

Method: ODAR uses a difficulty estimator based on amortized active inference to dynamically route queries between a heuristic Fast Agent and a deliberative Slow Agent. It introduces a free-energy-principled, risk-sensitive fusion mechanism that selects answers by minimizing a variational free energy objective, balancing log-likelihood with epistemic uncertainty (varentropy).

Result: Achieved 98.2% accuracy on MATH and 54.8% on Humanity’s Last Exam (HLE), while improving the compute-accuracy frontier under compute-matched settings. On an open-source stack (Llama 4 + DeepSeek), ODAR surpassed homogeneous sampling strategies while reducing computational costs by 82%.

Conclusion: Thinking-optimal scaling requires adaptive resource allocation with free-energy-based decision-making rather than simply increasing test-time compute. The framework provides principled alternative to ad hoc voting over heterogeneous candidates.

Abstract: The paradigm of large language model (LLM) reasoning is shifting from parameter scaling to test-time compute scaling, yet many existing approaches still rely on uniform brute-force sampling (for example, fixed best-of-N or self-consistency) that is costly, hard to attribute, and can trigger overthinking with diminishing returns. We propose ODAR-Expert, an adaptive routing framework that optimizes the accuracy-efficiency trade-off via principled resource allocation. ODAR uses a difficulty estimator grounded in amortized active inference to dynamically route queries between a heuristic Fast Agent and a deliberative Slow Agent. We further introduce a free-energy-principled, risk-sensitive fusion mechanism that selects answers by minimizing a variational free energy objective, balancing log-likelihood with epistemic uncertainty (varentropy) as a principled alternative to ad hoc voting over heterogeneous candidates. Extensive evaluation across 23 benchmarks shows strong and consistent gains, including 98.2% accuracy on MATH and 54.8% on Humanity’s Last Exam (HLE), while improving the compute-accuracy frontier under compute-matched settings. We also validate reproducibility on a fully open-source stack (Llama 4 + DeepSeek), where ODAR surpasses homogeneous sampling strategies while reducing computational costs by 82%. Overall, our results suggest that thinking-optimal scaling requires adaptive resource allocation with free-energy-based decision-making rather than simply increasing test-time compute.
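
The fusion idea can be illustrated with a small sketch (the exact variational free energy objective is defined in the paper; the scoring rule and the weight `beta` below are hypothetical stand-ins): score each candidate answer by its mean token surprisal plus a varentropy penalty, and pick the minimizer.

```python
# Illustrative sketch of risk-sensitive answer fusion: each candidate answer
# carries per-token log-probabilities; we pick the candidate minimizing a
# free-energy-style score that trades off likelihood against varentropy
# (the variance of token surprisal, a proxy for epistemic uncertainty).
# The weight beta and the scoring rule are hypothetical simplifications.

def free_energy(token_logps, beta=0.5):
    surprisals = [-lp for lp in token_logps]
    mean_s = sum(surprisals) / len(surprisals)        # ~ per-token negative log-likelihood
    varentropy = sum((s - mean_s) ** 2 for s in surprisals) / len(surprisals)
    return mean_s + beta * varentropy                 # lower is better

def fuse(candidates, beta=0.5):
    return min(candidates, key=lambda c: free_energy(c["token_logps"], beta))

# Toy usage: a uniformly confident answer beats one with a single very
# uncertain token, even before comparing raw likelihoods.
candidates = [
    {"answer": "42", "token_logps": [-0.1, -0.1, -0.1]},    # confident, low variance
    {"answer": "41", "token_logps": [-0.01, -3.0, -0.01]},  # one highly uncertain token
]
best = fuse(candidates)
```

Unlike majority voting, this selection works over heterogeneous candidates (fast and slow agents need not produce comparable vote counts), which is the role the fusion mechanism plays in ODAR.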

[263] From Flat Logs to Causal Graphs: Hierarchical Failure Attribution for LLM-based Multi-Agent Systems

Yawen Wang, Wenjie Wu, Junjie Wang, Qing Wang

Main category: cs.AI

TL;DR: CHIEF is a framework for hierarchical causal attribution of failures in LLM-powered multi-agent systems, transforming chaotic trajectories into structured causal graphs for precise root cause analysis.

DetailsMotivation: LLM-powered multi-agent systems show fragility with opaque failure mechanisms; existing methods treat execution logs as flat sequences, failing to capture intricate causal links and leading to weak observability and ambiguous responsibility boundaries.

Method: Proposes CHIEF framework that: 1) transforms chaotic trajectories into structured hierarchical causal graphs, 2) uses hierarchical oracle-guided backtracking to prune search space via synthesized virtual oracles, and 3) implements counterfactual attribution via progressive causal screening to distinguish root causes from symptoms.

Result: Experiments on the Who&When benchmark show CHIEF outperforms eight strong baselines on both agent- and step-level accuracy. Ablation studies confirm the critical role of each proposed module.

Conclusion: CHIEF provides a robust framework for failure attribution in LLM-powered multi-agent systems by capturing hierarchical causal structures, enabling precise root cause analysis beyond flat sequence approaches.

Abstract: LLM-powered Multi-Agent Systems (MAS) have demonstrated remarkable capabilities in complex domains but suffer from inherent fragility and opaque failure mechanisms. Existing failure attribution methods, whether relying on direct prompting, costly replays, or supervised fine-tuning, typically treat execution logs as flat sequences. This linear perspective fails to disentangle the intricate causal links inherent to MAS, leading to weak observability and ambiguous responsibility boundaries. To address these challenges, we propose CHIEF, a novel framework that transforms chaotic trajectories into a structured hierarchical causal graph. It then employs hierarchical oracle-guided backtracking to efficiently prune the search space via synthesized virtual oracles. Finally, it implements counterfactual attribution via a progressive causal screening strategy to rigorously distinguish true root causes from propagated symptoms. Experiments on the Who&When benchmark show that CHIEF outperforms eight strong, state-of-the-art baselines on both agent- and step-level accuracy. Ablation studies further confirm the critical role of each proposed module.
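
The root-cause-versus-symptom distinction can be sketched with a toy screening rule over a causal graph (an illustrative simplification; CHIEF's actual screening is counterfactual and oracle-guided, and the graph and rule below are hypothetical): a faulty step counts as a root cause only if none of its own causal ancestors is also faulty.

```python
# Sketch of causal screening over a graph of agent steps, where the graph
# maps each step to its direct causal parents. A faulty step is kept as a
# root cause only if no ancestor of it is also faulty; faulty steps
# downstream of another faulty step are treated as propagated symptoms.
# Graph and rule are illustrative simplifications of CHIEF's procedure.

def ancestors(graph, node):
    seen, stack = set(), [node]
    while stack:
        for parent in graph.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def root_causes(graph, failure, faulty):
    candidates = (ancestors(graph, failure) | {failure}) & set(faulty)
    return {c for c in candidates if not (ancestors(graph, c) & set(faulty))}

# Toy usage: parse -> plan -> draft -> verify; the failure surfaces at
# "verify", but "plan" is the earliest faulty step on the causal chain.
graph = {"verify": ["draft"], "draft": ["plan"], "plan": ["parse"], "parse": []}
roots = root_causes(graph, "verify", faulty={"plan", "draft", "verify"})
```

The point of the graph view is visible even in this toy: a flat-sequence reader would flag all three faulty steps, while the causal screen isolates the single upstream cause.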

[264] ProductResearch: Training E-Commerce Deep Research Agents via Multi-Agent Synthetic Trajectory Distillation

Jiangyuan Wang, Kejun Xiao, Huaipeng Zhao, Tao Luo, Xiaoyi Zeng

Main category: cs.AI

TL;DR: ProductResearch is a multi-agent framework that generates synthetic training trajectories for e-commerce shopping agents, enabling LLMs to perform complex product research through collaborative agent interactions.

DetailsMotivation: Existing LLM-based e-commerce agents lack depth for complex product research, while deep research paradigms have domain gaps when applied to e-commerce, creating a need for specialized training data and frameworks.

Method: Multi-agent framework with User Agent (infers shopping intents), Supervisor Agent (orchestrates collaboration), and Research Agent (generates trajectories). Uses reflective internalization to consolidate multi-agent interactions into single-role training examples for fine-tuning.

Result: Compact MoE model fine-tuned on synthetic data shows substantial improvements in response comprehensiveness, research depth, and user-perceived utility, approaching performance of proprietary deep research systems.

Conclusion: Multi-agent synthetic trajectory training is an effective and scalable paradigm for enhancing LLM-based shopping assistance, enabling robust e-commerce conversational agents.

Abstract: Large Language Model (LLM)-based agents show promise for e-commerce conversational shopping, yet existing implementations lack the interaction depth and contextual breadth required for complex product research. Meanwhile, the Deep Research paradigm, despite advancing information synthesis in web search, suffers from domain gaps when transferred to e-commerce. We propose ProductResearch, a multi-agent framework that synthesizes high-fidelity, long-horizon tool-use trajectories for training robust e-commerce shopping agents. The framework employs a User Agent to infer nuanced shopping intents from behavioral histories, and a Supervisor Agent that orchestrates iterative collaboration with a Research Agent to generate synthetic trajectories culminating in comprehensive, insightful product research reports. These trajectories are rigorously filtered and distilled through a reflective internalization process that consolidates multi-agent supervisory interactions into coherent single-role training examples, enabling effective fine-tuning of LLM agents for complex shopping inquiries. Extensive experiments show that a compact MoE model fine-tuned on our synthetic data achieves substantial improvements over its base model in response comprehensiveness, research depth, and user-perceived utility, approaching the performance of frontier proprietary deep research systems and establishing multi-agent synthetic trajectory training as an effective and scalable paradigm for enhancing LLM-based shopping assistance.

[265] The Auton Agentic AI Framework

Sheng Cao, Zhao Chang, Chang Li, Hannan Li, Liyao Fu, Ji Tang

Main category: cs.AI

TL;DR: Auton Agentic AI Framework provides a principled architecture for autonomous agent systems with strict separation between declarative Cognitive Blueprint specifications and Runtime Engine execution, enabling cross-language portability, formal auditability, and modular tool integration.

DetailsMotivation: The transition from Generative AI to Agentic AI exposes a fundamental architectural mismatch: LLMs produce stochastic, unstructured outputs while backend infrastructure requires deterministic, schema-conformant inputs. Current systems lack standardization for creating, executing, and governing autonomous agents.

Method: The framework separates Cognitive Blueprint (declarative, language-agnostic specification) from Runtime Engine (platform-specific execution). It formalizes agent execution as augmented POMDP with latent reasoning space, introduces hierarchical memory consolidation, defines constraint manifold formalism for safety, presents three-level self-evolution framework, and implements runtime optimizations like parallel graph execution and speculative inference.

Result: The framework enables cross-language portability, formal auditability, and modular tool integration via Model Context Protocol (MCP). Runtime optimizations reduce end-to-end latency for multi-step agent workflows.

Conclusion: The Auton Agentic AI Framework provides a standardized architecture for autonomous agent systems that addresses the fundamental mismatch between LLM outputs and deterministic infrastructure requirements, enabling more robust, portable, and efficient agentic AI systems.

Abstract: The field of Artificial Intelligence is undergoing a transition from Generative AI – probabilistic generation of text and images – to Agentic AI, in which autonomous systems execute actions within external environments on behalf of users. This transition exposes a fundamental architectural mismatch: Large Language Models (LLMs) produce stochastic, unstructured outputs, whereas the backend infrastructure they must control – databases, APIs, cloud services – requires deterministic, schema-conformant inputs. The present paper describes the Auton Agentic AI Framework, a principled architecture for standardizing the creation, execution, and governance of autonomous agent systems. The framework is organized around a strict separation between the Cognitive Blueprint, a declarative, language-agnostic specification of agent identity and capabilities, and the Runtime Engine, the platform-specific execution substrate that instantiates and runs the agent. This separation enables cross-language portability, formal auditability, and modular tool integration via the Model Context Protocol (MCP). The paper formalizes the agent execution model as an augmented Partially Observable Markov Decision Process (POMDP) with a latent reasoning space, introduces a hierarchical memory consolidation architecture inspired by biological episodic memory systems, defines a constraint manifold formalism for safety enforcement via policy projection rather than post-hoc filtering, presents a three-level self-evolution framework spanning in-context adaptation through reinforcement learning, and describes runtime optimizations – including parallel graph execution, speculative inference, and dynamic context pruning – that reduce end-to-end latency for multi-step agent workflows.
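
The augmented execution model can be sketched in standard notation (a notational sketch only; the paper gives its own formalization, and the latent reasoning space is the augmentation over the classical tuple):

```latex
% Standard POMDP tuple augmented with a latent reasoning space Z
% (symbols follow common RL conventions; H denotes interaction histories).
\begin{align*}
  \mathcal{M} &= (\mathcal{S}, \mathcal{A}, T, R, \Omega, O, \gamma, \mathcal{Z})\\
  T &: \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})
      && \text{(state transition)}\\
  O &: \mathcal{S} \times \mathcal{A} \to \Delta(\Omega)
      && \text{(observation model)}\\
  \pi &: \mathcal{H} \times \mathcal{Z} \to \Delta(\mathcal{A})
      && \text{(policy over history and latent reasoning trace)}
\end{align*}
```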

[266] Unlocking Cognitive Capabilities and Analyzing the Perception-Logic Trade-off

Longyin Zhang, Shuo Sun, Yingxu He, Won Cheng Yi Lewis, Muhammad Huzaifah Bin Md Shahrin, Hardik Bhupendra Sailor, Heng Meng Jeremy Wong, Tarun Kumar Vangani, Yi Ma, Qiongqiong Wang, Minh Duc Pham, Ridong Jiang, Jingtao Li, Jingyi Liao, Zhuohan Liu, Yanfeng Lu, Manas Gupta, Ai Ti Aw

Main category: cs.AI

TL;DR: MERaLiON2-Omni (Alpha) is a 10B-parameter multilingual multimodal LLM for Southeast Asia with a progressive training pipeline that decouples perception and reasoning, revealing an efficiency-stability paradox where reasoning boosts abstract tasks but destabilizes low-level sensory processing.

DetailsMotivation: Current MLLMs lack robust sensory grounding with complex reasoning, especially for underrepresented regions like Southeast Asia. The paper aims to integrate perception and reasoning capabilities while addressing region-specific challenges like local languages and cultural contexts.

Method: Two-stage progressive training: 1) Build Perception Backbone by aligning region-specific audio-visual cues with multilingual LLM via orthogonal modality adaptation; 2) Inject reasoning via Generate-Judge-Refine pipeline using Super-LLM to filter hallucinations and synthesize silver data for multimodal Chain-of-Thought reasoning.

Result: Evaluation on SEA-Omni Benchmark reveals Efficiency-Stability Paradox: reasoning amplifies abstract task performance (math, instruction-following) but introduces instability in sensory processing - Temporal Drift in long-context audio and Visual Over-interpretation where logic overrides pixel reality.

Conclusion: The paper presents a novel approach to integrating perception and reasoning in MLLMs for underrepresented regions, identifies critical trade-offs between these capabilities, and provides diagnostic analysis of the perception-reasoning balance in multimodal systems.

Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) pursue omni-perception capabilities, yet integrating robust sensory grounding with complex reasoning remains a challenge, particularly for underrepresented regions. In this report, we introduce the research preview of MERaLiON2-Omni (Alpha), a 10B-parameter multilingual omni-perception model tailored for Southeast Asia (SEA). We present a progressive training pipeline that explicitly decouples and then integrates “System 1” (Perception) and “System 2” (Reasoning) capabilities. First, we establish a robust Perception Backbone by aligning region-specific audio-visual cues (e.g., Singlish code-switching, local cultural landmarks) with a multilingual LLM through orthogonal modality adaptation. Second, to inject cognitive capabilities without large-scale supervision, we propose a cost-effective Generate-Judge-Refine pipeline. By utilizing a Super-LLM to filter hallucinations and resolve conflicts via a consensus mechanism, we synthesize high-quality silver data that transfers textual Chain-of-Thought reasoning to multimodal scenarios. Comprehensive evaluation on our newly introduced SEA-Omni Benchmark Suite reveals an Efficiency-Stability Paradox: while reasoning acts as a non-linear amplifier for abstract tasks (boosting mathematical and instruction-following performance significantly), it introduces instability in low-level sensory processing. Specifically, we identify Temporal Drift in long-context audio, where extended reasoning desynchronizes the model from acoustic timestamps, and Visual Over-interpretation, where logic overrides pixel-level reality. This report details the architecture, the data-efficient training recipe, and a diagnostic analysis of the trade-offs between robust perception and structured reasoning.
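
The Generate-Judge-Refine loop can be sketched as a simple consensus filter (hypothetical stand-ins throughout: the generator, judges, refiner, and the 2/3 consensus threshold are invented for illustration; in the paper the judging is done by a Super-LLM):

```python
# Sketch of a Generate-Judge-Refine pipeline for synthesizing "silver"
# reasoning data: a generator proposes a rationale, several judges vote,
# and items lacking consensus get one refinement pass with the votes as
# feedback. All callables and the threshold are hypothetical illustrations.

def generate_judge_refine(items, generate, judges, refine,
                          threshold=2 / 3, max_rounds=2):
    silver = []
    for item in items:
        rationale = generate(item)
        for _ in range(max_rounds):
            votes = [judge(item, rationale) for judge in judges]
            if sum(votes) / len(votes) >= threshold:   # consensus reached: keep
                silver.append((item, rationale))
                break
            rationale = refine(item, rationale, votes)  # retry with feedback
    return silver

# Toy usage with deterministic stand-in "models": the first draft lacks any
# reasoning marker, the refinement pass adds it, and the judges then agree.
judges = [lambda i, r: "because" in r] * 3
out = generate_judge_refine(
    items=["q1"],
    generate=lambda i: "answer",
    judges=judges,
    refine=lambda i, r, v: r + " because ...",
)
```

Items that never reach consensus are simply dropped, which is the filtering behavior that keeps the synthesized data "silver" rather than noisy.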

[267] Reasoning-Driven Multimodal LLM for Domain Generalization

Zhipeng Xu, Zilong Wang, Xinyang Jiang, Dongsheng Li, De Cheng, Nannan Wang

Main category: cs.AI

TL;DR: RD-MLDG uses multimodal LLMs with reasoning chains for domain generalization, addressing challenges in optimizing complex reasoning sequences and reasoning-pattern mismatches through multi-task cross-training and self-aligned reasoning regularization.

DetailsMotivation: Most domain generalization methods focus on visual feature invariance, but this paper explores leveraging multimodal LLMs' reasoning capabilities to construct reasoning chains for more robust predictions under domain shift.

Method: Proposes RD-MLDG with two components: MTCT (Multi-Task Cross-Training) adds a direct classification pathway to guide reasoning supervision, and SARR (Self-Aligned Reasoning Regularization) preserves semantic richness while mitigating reasoning-pattern mismatches via iterative self-labeling.

Result: Achieves state-of-the-art performance on standard DomainBed datasets (PACS, VLCS, OfficeHome, TerraInc), demonstrating reasoning as a promising complementary signal for robust out-of-domain generalization.

Conclusion: Reasoning chains from multimodal LLMs provide valuable complementary signals for domain generalization, overcoming challenges through careful optimization and alignment techniques.

Abstract: This paper addresses the domain generalization (DG) problem in deep learning. While most DG methods focus on enforcing visual feature invariance, we leverage the reasoning capability of multimodal large language models (MLLMs) and explore the potential of constructing reasoning chains that derive image categories to achieve more robust predictions under domain shift. To this end, we systematically study the role of reasoning in DG using DomainBed-Reasoning, a newly constructed extension of the DomainBed dataset, in which each sample is paired with class-relevant reasoning chains. Our analysis reveals two key challenges: (i) fine-tuning MLLMs with reasoning chains for classification is more challenging than direct label supervision, since the model must optimize complex reasoning sequences before label prediction; and (ii) mismatches in reasoning patterns between supervision signals and fine-tuned MLLMs lead to a trade-off between semantic richness (informative but harder to optimize) and optimization efficiency (easier to optimize but less informative). To address these issues, we propose RD-MLDG (Reasoning-Driven Multimodal LLM for Domain Generalization), a framework with two components: (i) MTCT (Multi-Task Cross-Training), which introduces an additional direct classification pathway to guide reasoning supervision; and (ii) SARR (Self-Aligned Reasoning Regularization), which preserves the semantic richness of reasoning chains while mitigating reasoning-pattern mismatches via iterative self-labeling. Experiments on standard DomainBed datasets (PACS, VLCS, OfficeHome, TerraInc) demonstrate that RD-MLDG achieves state-of-the-art performance, highlighting reasoning as a promising complementary signal for robust out-of-domain generalization.
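
The multi-task cross-training component reduces, at its core, to mixing two objectives over the same input; a minimal sketch (losses as plain floats, with the mixing weight `alpha` and the two-pathway setup as hypothetical illustrations of MTCT's direct-classification anchor):

```python
import math

# Sketch of multi-task cross-training: the total loss mixes the (harder to
# optimize) reasoning-chain token loss with a direct classification
# cross-entropy that anchors the label prediction. alpha is a hypothetical
# mixing hyperparameter, not a value from the paper.

def cross_entropy(logits, target):
    # numerically stable softmax cross-entropy over a list of logits
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

def mtct_loss(reasoning_token_losses, class_logits, label, alpha=0.5):
    # reasoning pathway: mean token-level NLL over the reasoning chain
    reasoning_nll = sum(reasoning_token_losses) / len(reasoning_token_losses)
    # direct pathway: classification cross-entropy on the same input
    direct_nll = cross_entropy(class_logits, label)
    return alpha * reasoning_nll + (1.0 - alpha) * direct_nll

# Toy usage: two reasoning-token losses plus a two-class direct head.
loss = mtct_loss([0.4, 0.6], class_logits=[2.0, 0.0], label=0)
```

The direct pathway keeps gradient signal flowing to the label even when the long reasoning sequence is poorly optimized, which is the guidance role MTCT plays.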

[268] EMO-R3: Reflective Reinforcement Learning for Emotional Reasoning in Multimodal Large Language Models

Yiyang Fang, Wenke Huang, Pei Fu, Yihao Yang, Kehua Su, Zhenbo Luo, Jian Luan, Mang Ye

Main category: cs.AI

TL;DR: EMO-R3 framework enhances multimodal LLMs’ emotional reasoning using structured thinking and reflective reinforcement learning

DetailsMotivation: Current MLLMs struggle with complex human emotions, suffering from limited generalization and poor interpretability in emotional understanding tasks

Method: Proposes Reflective Reinforcement Learning for Emotional Reasoning (EMO-R3) with Structured Emotional Thinking for step-by-step reasoning and Reflective Emotional Reward for re-evaluation based on visual-text consistency

Result: Significantly improves interpretability and emotional intelligence of MLLMs, achieving superior performance across multiple visual emotional understanding benchmarks

Conclusion: EMO-R3 effectively addresses emotional reasoning challenges in MLLMs through structured thinking and reflective reinforcement learning

Abstract: Multimodal Large Language Models (MLLMs) have shown remarkable progress in visual reasoning and understanding tasks but still struggle to capture the complexity and subjectivity of human emotions. Existing approaches based on supervised fine-tuning often suffer from limited generalization and poor interpretability, while reinforcement learning methods such as Group Relative Policy Optimization fail to align with the intrinsic characteristics of emotional cognition. To address these challenges, we propose Reflective Reinforcement Learning for Emotional Reasoning (EMO-R3), a framework designed to enhance the emotional reasoning ability of MLLMs. Specifically, we introduce Structured Emotional Thinking to guide the model to perform step-by-step emotional reasoning in a structured and interpretable manner, and design a Reflective Emotional Reward that enables the model to re-evaluate its reasoning based on visual-text consistency and emotional coherence. Extensive experiments demonstrate that EMO-R3 significantly improves both the interpretability and emotional intelligence of MLLMs, achieving superior performance across multiple visual emotional understanding benchmarks.

[269] RUMAD: Reinforcement-Unifying Multi-Agent Debate

Chao Wang, Han Lin, Huaze Tang, Huijing Lin, Wenbo Ding

Main category: cs.AI

TL;DR: RUMAD is a reinforcement learning framework for dynamic communication topology control in multi-agent debate systems that optimizes accuracy, consensus, and efficiency without accessing agent reasoning content.

DetailsMotivation: Existing multi-agent debate systems struggle to balance accuracy, consensus formation, and computational efficiency. Static topologies lack adaptability, while external LLM-based coordination risks introducing privileged knowledge that compromises debate neutrality.

Method: RUMAD formulates dynamic communication topology control as a reinforcement learning problem. It uses a content-agnostic observation scheme capturing high-level debate dynamics, a multi-objective reward for solution quality, cohesion and efficiency, PPO-trained controller for dynamic edge weight adjustment, and a dual-threshold mechanism for agent activation and information visibility control.

Result: RUMAD achieves over 80% reduction in token costs while improving reasoning accuracy compared to single LLM models and multiple MAD baselines across MMLU, GSM8K, and GPQA benchmarks. It shows robust zero-shot generalization to out-of-domain tasks when trained only on MMLU.

Conclusion: RUMAD establishes an efficient and robust approach for deploying multi-agent reasoning applications with practical resource constraints, demonstrating that learned communication strategies capture task-independent principles of effective multi-agent coordination.

Abstract: Multi-agent debate (MAD) systems leverage collective intelligence to enhance reasoning capabilities, yet existing approaches struggle to simultaneously optimize accuracy, consensus formation, and computational efficiency. Static topology methods lack adaptability to task complexity variations, while external LLM-based coordination risks introducing privileged knowledge that compromises debate neutrality. This work presents RUMAD (Reinforcement-Unifying Multi-Agent Debate), a novel framework that formulates dynamic communication topology control in MAD as a reinforcement learning (RL) problem. RUMAD employs a content-agnostic observation scheme that captures high-level debate dynamics while avoiding access to raw agent reasoning content. RUMAD uses a multi-objective reward to model solution quality, cohesion, and efficiency. A PPO-trained controller dynamically adjusts edge weights in the communication graph, while a dual-threshold mechanism enables fine-grained control over both agent activation and information visibility. Experimental evaluation across MMLU, GSM8K, and GPQA benchmarks demonstrates that RUMAD achieves substantial efficiency gains, reducing token costs by over 80%, while still improving reasoning accuracy compared to a single LLM model and multiple MAD baselines. Notably, RUMAD trained exclusively on MMLU exhibits robust zero-shot generalization to out-of-domain (OOD) tasks, indicating that the learned communication strategies capture task-independent principles of effective multi-agent coordination. These results establish RUMAD as an efficient and robust approach for deploying multi-agent reasoning applications under practical resource constraints.
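
The dual-threshold mechanism can be illustrated with a small sketch (the weight matrix and both thresholds are hypothetical; in RUMAD the edge weights come from the PPO-trained controller): one threshold gates which edges carry messages, and a second deactivates agents whose total incoming weight is too low.

```python
# Sketch of dual-threshold topology control over an agent communication
# graph. weights[i][j] is the learned weight of the edge from agent i to
# agent j; edge_thr gates information visibility (which messages flow) and
# node_thr gates agent activation (who participates at all this round).
# Values and thresholds are illustrative.

def apply_dual_threshold(weights, edge_thr=0.5, node_thr=0.5):
    n = len(weights)
    # visibility: keep only edges at or above the edge threshold
    visible = [[weights[i][j] if weights[i][j] >= edge_thr else 0.0
                for j in range(n)] for i in range(n)]
    # activation: agent j stays active only if its total incoming
    # visible weight reaches the node threshold
    active = [sum(visible[i][j] for i in range(n)) >= node_thr
              for j in range(n)]
    return visible, active

# Toy usage: 3 agents; agent 2 receives too little weight and is pruned,
# which is where the token savings come from.
w = [[0.0, 0.9, 0.2],
     [0.8, 0.0, 0.1],
     [0.6, 0.4, 0.0]]
visible, active = apply_dual_threshold(w)
```

Note the scheme needs only the weight matrix, never the message contents, matching the content-agnostic observation design.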

Ning Gao, Xiuhui Zhang, Xingyu Jiang, Mukang You, Mohan Zhang, Yue Deng

Main category: cs.AI

TL;DR: RF-Agent uses LLMs as language agents with Monte Carlo Tree Search to design reward functions for low-level control tasks, improving on previous methods by better utilizing historical feedback and search efficiency.

DetailsMotivation: Designing efficient reward functions for low-level control tasks is challenging and typically requires expert experience. Existing LLM-based methods have poor utilization of historical feedback and inefficient search, limiting improvements in complex control tasks.

Method: Proposes RF-Agent framework that treats LLMs as language agents and frames reward function design as a sequential decision-making process. Integrates Monte Carlo Tree Search (MCTS) to manage reward design and optimization, leveraging LLMs’ multi-stage contextual reasoning ability.

Result: Outstanding experimental results in 17 diverse low-level control tasks demonstrate the effectiveness of the method. The approach shows improved utilization of historical information and search efficiency for identifying promising reward functions.

Conclusion: RF-Agent effectively addresses limitations of previous LLM-based reward function design methods by better leveraging historical feedback and improving search efficiency through MCTS integration and sequential decision-making framing.

Abstract: Designing efficient reward functions for low-level control tasks is a challenging problem. Recent research aims to reduce reliance on expert experience by using Large Language Models (LLMs) with task information to generate dense reward functions. These methods typically rely on training results as feedback, iteratively generating new reward functions with greedy or evolutionary algorithms. However, they suffer from poor utilization of historical feedback and inefficient search, resulting in limited improvements in complex control tasks. To address this challenge, we propose RF-Agent, a framework that treats LLMs as language agents and frames reward function design as a sequential decision-making process, enhancing optimization through better contextual reasoning. RF-Agent integrates Monte Carlo Tree Search (MCTS) to manage the reward design and optimization process, leveraging the multi-stage contextual reasoning ability of LLMs. This approach better utilizes historical information and improves search efficiency to identify promising reward functions. Outstanding experimental results in 17 diverse low-level control tasks demonstrate the effectiveness of our method. The source code is available at https://github.com/deng-ai-lab/RF-Agent.
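
The MCTS-over-reward-functions loop can be sketched as a compact skeleton (illustrative only: `llm_refine` stands in for the LLM proposing a refined reward function and `evaluate` for a full RL training-and-scoring run; the UCT constant and branching width are hypothetical):

```python
import math

# Skeleton of MCTS over reward-function candidates: each node holds one
# candidate; selection uses UCT, expansion asks the "LLM" for a refinement
# of the selected candidate, the "simulation" is a training run that scores
# it, and the score is backed up along the path. This is how tree search
# can reuse historical feedback that greedy regeneration discards.

class Node:
    def __init__(self, reward_fn, parent=None):
        self.reward_fn, self.parent = reward_fn, parent
        self.children, self.visits, self.value = [], 0, 0.0

    def uct(self, c=1.4):
        if self.visits == 0:
            return float("inf")
        return (self.value / self.visits
                + c * math.sqrt(math.log(self.parent.visits) / self.visits))

def mcts(root_fn, llm_refine, evaluate, iters=20, width=3):
    root = Node(root_fn)
    for _ in range(iters):
        node = root
        while len(node.children) >= width:              # selection
            node = max(node.children, key=Node.uct)
        child = Node(llm_refine(node.reward_fn), parent=node)  # expansion
        node.children.append(child)
        score = evaluate(child.reward_fn)               # simulation: train & score
        while child:                                    # backpropagation
            child.visits += 1
            child.value += score
            child = child.parent
    return max(root.children, key=lambda n: n.value / n.visits)

# Toy usage: reward functions reduced to a scalar "quality" that each
# refinement deterministically improves.
best = mcts(0.0, llm_refine=lambda f: f + 0.1, evaluate=lambda f: f, iters=10)
```

The tree keeps every evaluated candidate in play, so a promising earlier branch can be revisited when later refinements stall.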

[271] Pessimistic Auxiliary Policy for Offline Reinforcement Learning

Fan Zhang, Baoru Huang, Xin Zhang

Main category: cs.AI

TL;DR: Proposes a pessimistic auxiliary policy for offline RL that samples reliable actions by maximizing Q-function’s lower confidence bound to reduce approximation errors and overestimation.

DetailsMotivation: Offline RL faces challenges with out-of-distribution actions causing approximation errors, error accumulation, and overestimation during learning. Need to sample more reliable actions to mitigate these issues.

Method: Constructs a pessimistic auxiliary policy that maximizes the lower confidence bound of the Q-function. This policy exhibits high value and low uncertainty near the learned policy, avoiding sampling high-value actions with potentially high errors.

Result: Extensive experiments on offline RL benchmarks show the pessimistic auxiliary strategy effectively improves the efficacy of other offline RL approaches by reducing approximation errors.

Conclusion: The proposed pessimistic auxiliary policy successfully addresses error accumulation in offline RL by sampling more reliable actions, leading to improved performance across various benchmarks.

Abstract: Offline reinforcement learning aims to learn an agent from pre-collected datasets, avoiding unsafe and inefficient real-time interaction. However, inevitable access to out-of-distribution actions during the learning process introduces approximation errors, causing error accumulation and considerable overestimation. In this paper, we construct a new pessimistic auxiliary policy for sampling reliable actions. Specifically, we develop a pessimistic auxiliary strategy by maximizing the lower confidence bound of the Q-function. The pessimistic auxiliary strategy exhibits relatively high value and low uncertainty in the vicinity of the learned policy, preventing the learned policy from sampling high-value actions with potentially high errors during learning. The smaller approximation error introduced by actions sampled from the pessimistic auxiliary strategy alleviates error accumulation. Extensive experiments on offline reinforcement learning benchmarks reveal that the pessimistic auxiliary strategy can effectively improve the efficacy of other offline RL approaches.
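
The lower-confidence-bound selection rule is simple to sketch with a Q-ensemble (the ensemble values and the pessimism weight `beta` are illustrative; the paper's construction operates on learned Q-functions):

```python
import math

# Sketch of the pessimistic action rule: given an ensemble of Q estimates
# per action, score each action by the lower confidence bound
#   LCB(a) = mean_k Q_k(s, a) - beta * std_k Q_k(s, a)
# and pick the maximizer, so high-value but high-disagreement (likely
# out-of-distribution) actions are avoided. Values and beta are illustrative.

def lcb(q_values, beta=1.0):
    mean = sum(q_values) / len(q_values)
    var = sum((q - mean) ** 2 for q in q_values) / len(q_values)
    return mean - beta * math.sqrt(var)

def pessimistic_action(q_ensemble_per_action, beta=1.0):
    # q_ensemble_per_action: {action: [Q_1(s,a), ..., Q_K(s,a)]}
    return max(q_ensemble_per_action,
               key=lambda a: lcb(q_ensemble_per_action[a], beta))

# Toy usage: action "b" has the higher mean Q but far higher ensemble
# disagreement, so the pessimistic rule prefers the reliable "a".
q = {"a": [1.0, 1.1, 0.9], "b": [2.5, 0.1, 3.4]}
act = pessimistic_action(q)
```

A greedy rule on mean Q would pick "b" here; the LCB rule is exactly what keeps the sampled actions near the well-estimated region.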

[272] Portfolio Reinforcement Learning with Scenario-Context Rollout

Vanya Priscillia Bendatu, Yao Lu

Main category: cs.AI

TL;DR: Proposes macro-conditioned scenario-context rollout (SCR) for generating plausible next-day multivariate return scenarios under stress events to improve portfolio rebalancing policies, addressing reward-transition mismatch in RL critic training.

DetailsMotivation: Market regime shifts cause distribution shifts that degrade portfolio rebalancing policy performance. Need to generate plausible next-day return scenarios under stress events, but history doesn't show what could have happened differently, creating challenges for RL training.

Method: Macro-conditioned scenario-context rollout (SCR) generates plausible next-day multivariate return scenarios. Analyzes reward-transition mismatch in temporal-difference learning, constructs counterfactual next states using rollout-implied continuations, and augments critic agent’s bootstrap target to stabilize learning.

Result: In out-of-sample evaluations across 31 distinct universes of U.S. equity and ETF portfolios, the method improves Sharpe ratio by up to 76% and reduces maximum drawdown by up to 53% compared to classic and RL-based portfolio rebalancing baselines.

Conclusion: The proposed approach successfully addresses reward-transition mismatch in RL training for portfolio rebalancing, providing stable learning and significant performance improvements in real-world financial applications.

Abstract: Market regime shifts induce distribution shifts that can degrade the performance of portfolio rebalancing policies. We propose macro-conditioned scenario-context rollout (SCR) that generates plausible next-day multivariate return scenarios under stress events. However, doing so faces new challenges, as history will never tell what would have happened differently. As a result, incorporating scenario-based rewards from rollouts introduces a reward–transition mismatch in temporal-difference learning, destabilizing RL critic training. We analyze this inconsistency and show it leads to a mixed evaluation target. Guided by this analysis, we construct a counterfactual next state using the rollout-implied continuations and augment the critic agent’s bootstrap target. Doing so stabilizes the learning and provides a viable bias-variance tradeoff. In out-of-sample evaluations across 31 distinct universes of U.S. equity and ETF portfolios, our method improves Sharpe ratio by up to 76% and reduces maximum drawdown by up to 53% compared with classic and RL-based portfolio rebalancing baselines.
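
The mismatch fix can be sketched in miniature (an illustrative simplification; `value_fn` and the scenario tuples are hypothetical): each sampled scenario supplies both a reward and its own implied next state, so the TD target never pairs a scenario reward with the single historical transition.

```python
# Sketch of a scenario-consistent bootstrap target for the critic: each
# scenario k contributes r_k + gamma * V(s'_k), where s'_k is the
# counterfactual next state implied by that rollout. Pairing rewards with
# the historical next state instead is the reward-transition mismatch the
# paper analyzes. value_fn and the scenario data are illustrative.

def scenario_td_target(scenarios, value_fn, gamma=0.99):
    # scenarios: list of (scenario_reward, counterfactual_next_state)
    targets = [r + gamma * value_fn(s_next) for r, s_next in scenarios]
    return sum(targets) / len(targets)

# Toy usage: a trivial value function and two stress scenarios, one mild
# gain and one drawdown, each with its own implied continuation.
value_fn = lambda s: s["wealth"]
y = scenario_td_target(
    [(0.02, {"wealth": 1.02}), (-0.05, {"wealth": 0.95})], value_fn)
```

Averaging over matched (reward, next-state) pairs keeps the evaluation target unmixed, which is what stabilizes critic training in the paper's analysis.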

[273] CIRCLE: A Framework for Evaluating AI from a Real-World Lens

Reva Schwartz, Carina Westling, Morgan Briggs, Marzieh Fadaee, Isar Nejadgholi, Matthew Holmes, Fariza Rashid, Maya Carlyle, Afaf Taïk, Kyra Wilson, Peter Douglas, Theodora Skeadas, Gabriella Waters, Rumman Chowdhury, Thiago Lacerda

Main category: cs.AI

TL;DR: CIRCLE is a six-stage lifecycle framework that bridges the gap between AI model performance metrics and real-world outcomes by formalizing stakeholder concerns into measurable signals through field testing, red teaming, and longitudinal studies.

DetailsMotivation: Current AI evaluation frameworks focus on system stability (MLOps) or abstract capabilities (benchmarks), but fail to provide systematic evidence about AI behavior under real-world user variability and constraints. Decision-makers lack evidence about materialized outcomes in deployment.

Method: CIRCLE operationalizes the Validation phase of TEVV by formalizing stakeholder concerns into measurable signals. It integrates field testing, red teaming, and longitudinal studies into a six-stage coordinated pipeline that links context-sensitive qualitative insights to scalable quantitative metrics.

Result: The framework produces systematic knowledge that is comparable across sites yet sensitive to local context, enabling governance based on materialized downstream effects rather than theoretical capabilities.

Conclusion: CIRCLE provides a structured, prospective protocol for bridging the reality gap between model-centric metrics and AI’s actual performance in real-world deployment, offering a more comprehensive approach to AI validation and governance.

Abstract: This paper proposes CIRCLE, a six-stage, lifecycle-based framework to bridge the reality gap between model-centric performance metrics and AI’s materialized outcomes in deployment. While existing frameworks like MLOps focus on system stability and benchmarks measure abstract capabilities, decision-makers outside the AI stack lack systematic evidence about the behavior of AI technologies under real-world user variability and constraints. CIRCLE operationalizes the Validation phase of TEVV (Test, Evaluation, Verification, and Validation) by formalizing the translation of stakeholder concerns outside the stack into measurable signals. Unlike participatory design, which often remains localized, or algorithmic audits, which are often retrospective, CIRCLE provides a structured, prospective protocol for linking context-sensitive qualitative insights to scalable quantitative metrics. By integrating methods such as field testing, red teaming, and longitudinal studies into a coordinated pipeline, CIRCLE produces systematic knowledge: evidence that is comparable across sites yet sensitive to local context. This can enable governance based on materialized downstream effects rather than theoretical capabilities.

[274] Bi-level RL-Heuristic Optimization for Real-world Winter Road Maintenance

Yue Xie, Zizhen Xu, William Beazley, Fumiya Iida

Main category: cs.AI

TL;DR: A bi-level optimization framework using RL for network partitioning and multi-objective VRP for winter road maintenance routing, validated on UK road networks.

DetailsMotivation: Existing winter road maintenance methods struggle with large-scale routing problems and rely heavily on human decision-making, lacking efficiency and scalability for real-world transportation networks.

Method: Bi-level optimization framework: upper level uses RL agent to partition road networks into clusters and allocate resources from multiple depots; lower level solves multi-objective VRP within each cluster minimizing maximum travel time and carbon emissions.

Result: Significant improvements including balanced workloads, reduced maximum travel times below 2-hour threshold, lower emissions, and substantial cost savings on UK strategic road networks (M25, M6, A1).

Conclusion: Advanced AI-driven bi-level optimization can enhance operational decision-making in real-world transportation and logistics for winter road maintenance.

Abstract: Winter road maintenance is critical for ensuring public safety and reducing environmental impacts, yet existing methods struggle to manage large-scale routing problems effectively and largely rely on human decision-making. This study presents a novel, scalable bi-level optimization framework, validated on real operational data from UK strategic road networks (M25, M6, A1), including interconnected local road networks in surrounding areas for vehicle traversal, as part of the highway operator’s efforts to solve existing planning challenges. At the upper level, a reinforcement learning (RL) agent strategically partitions the road network into manageable clusters and optimally allocates resources from multiple depots. At the lower level, a multi-objective vehicle routing problem (VRP) is solved within each cluster, minimizing the maximum vehicle travel time and total carbon emissions. Unlike existing approaches, our method handles large-scale, real-world networks efficiently, explicitly incorporating vehicle-specific constraints, depot capacities, and road segment requirements. Results demonstrate significant improvements, including balanced workloads, reduced maximum travel times below the targeted two-hour threshold, lower emissions, and substantial cost savings. This study illustrates how advanced AI-driven bi-level optimization can directly enhance operational decision-making in real-world transportation and logistics.
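The bi-level structure described above can be sketched as a simple loop: an upper-level policy partitions segments into clusters, a lower-level solver routes each cluster, and the worst cluster makespan is the upper level's feedback. This is an illustrative skeleton, not the paper's implementation; `partition_policy` and `solve_vrp` are hypothetical caller-supplied components:

```python
def bilevel_schedule(segments, partition_policy, solve_vrp):
    """Upper level: assign road segments to clusters (e.g. via an RL policy).
    Lower level: route each cluster; solve_vrp returns (max_travel_time, routes).
    The worst cluster makespan is the signal the upper level tries to minimize."""
    clusters = partition_policy(segments)
    solutions = {c: solve_vrp(members) for c, members in clusters.items()}
    makespan = max(t for t, _ in solutions.values())
    return solutions, makespan
```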

[275] Artificial Agency Program: Curiosity, compression, and communication in agents

Richard Csaky

Main category: cs.AI

TL;DR: AAP proposes building AI as resource-bounded agents driven by curiosity-as-learning-progress, treating AI as part of extended human-tool systems to enhance sensing, understanding, and actuation while reducing interface friction.

DetailsMotivation: To develop AI systems as reality-embedded, resource-bounded agents that function as part of extended human-tool systems, increasing sensing, understanding, and actuation capabilities while reducing friction at human-tool-environment interfaces.

Method: Unifies predictive compression, intrinsic motivation, empowerment/control, interface quality, and language/self-communication as selective information bottlenecks. Formulates as falsifiable program with explicit costs, staged experiments, and concrete multimodal tokenized testbed where agent allocates limited budget among observation, action, and deliberation.

Result: Provides conceptual and experimental framework connecting intrinsic motivation, information theory, thermodynamics, bounded rationality, and modern reasoning systems.

Conclusion: AAP offers a research agenda for building AI as resource-bounded agents driven by curiosity-as-learning-progress, with potential to advance AI systems that better integrate with human-tool environments.

Abstract: This paper presents the Artificial Agency Program (AAP), a position and research agenda for building AI systems as reality-embedded, resource-bounded agents whose development is driven by curiosity-as-learning-progress under physical and computational constraints. The central thesis is that AI is most useful when treated as part of an extended human–tool system that increases sensing, understanding, and actuation capability while reducing friction at the interface between people, tools, and environments. The agenda unifies predictive compression, intrinsic motivation, empowerment and control, interface quality (unification), and language/self-communication as selective information bottlenecks. We formulate these ideas as a falsifiable program with explicit costs, staged experiments, and a concrete multimodal tokenized testbed in which an agent allocates limited budget among observation, action, and deliberation. The aim is to provide a conceptual and experimental framework that connects intrinsic motivation, information theory, thermodynamics, bounded rationality, and modern reasoning systems.

[276] Recycling Failures: Salvaging Exploration in RLVR via Fine-Grained Off-Policy Guidance

Yanwei Ren, Haotian Zhang, Likang Xiao, Xikai Zhang, Jiaxing Huang, Jiayan Qiu, Baosheng Yu, Quan Chen, Liu Liu

Main category: cs.AI

TL;DR: SCOPE framework uses Process Reward Models to identify first errors in reasoning trajectories and applies step-wise corrections to salvage partially correct rollouts, improving exploration diversity and reasoning performance.

DetailsMotivation: Standard RLVR with outcome-based supervision penalizes largely correct trajectories as heavily as completely wrong ones, causing models to discard valuable partially correct rollouts and prematurely narrow exploration space, despite Process Reward Models showing promise for step-wise verification.

Method: SCOPE utilizes Process Reward Models to pinpoint the first erroneous step in suboptimal rollouts and applies fine-grained, step-wise off-policy rectification to salvage partially correct trajectories, increasing rollout diversity.

Result: Achieves 46.6% average accuracy on math reasoning tasks and 53.4% accuracy on out-of-distribution reasoning tasks, with 13.5% increase in diversity score, establishing new state-of-the-art results.

Conclusion: SCOPE effectively addresses the exploration space narrowing problem in RLVR by salvaging partially correct trajectories through step-wise correction, leading to improved reasoning performance and generalization.

Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing the complex reasoning capabilities of Large Reasoning Models. However, standard outcome-based supervision suffers from a critical limitation: it penalizes trajectories that are largely correct but fail due to a few missteps as heavily as completely erroneous ones. This coarse feedback signal causes the model to discard valuable largely correct rollouts, leading to a degradation in rollout diversity that prematurely narrows the exploration space. While Process Reward Models have demonstrated efficacy in providing reliable step-wise verification for test-time scaling, naively integrating these signals into RLVR as dense rewards proves ineffective. Prior methods introduce off-policy guided whole-trajectory replacements that often fall outside the policy model’s distribution, and they still fail to utilize the largely correct rollouts generated by the model itself, so they do not effectively mitigate the narrowing of the exploration space. To address these issues, we propose SCOPE (Step-wise Correction for On-Policy Exploration), a novel framework that utilizes Process Reward Models to pinpoint the first erroneous step in suboptimal rollouts and applies fine-grained, step-wise off-policy rectification. By applying precise refinement to partially correct rollouts, our method effectively salvages partially correct trajectories and increases diversity score by 13.5%, thereby sustaining a broad exploration space. Extensive experiments demonstrate that our approach establishes new state-of-the-art results, achieving an average accuracy of 46.6% on math reasoning and exhibiting robust generalization with 53.4% accuracy on out-of-distribution reasoning tasks.
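The splice operation at the heart of this idea is easy to illustrate. A minimal sketch under assumed interfaces (PRM step scores in [0, 1], a fixed score threshold, and an externally supplied off-policy correction; none of these specifics come from the paper):

```python
def first_error_index(step_scores, threshold=0.5):
    """Return the index of the first step whose PRM score falls below the
    threshold, or None if every step passes verification."""
    for i, score in enumerate(step_scores):
        if score < threshold:
            return i
    return None

def splice_correction(steps, step_scores, corrected_suffix, threshold=0.5):
    """Keep the verified prefix of a rollout and replace everything from
    the first erroneous step onward with an off-policy correction."""
    i = first_error_index(step_scores, threshold)
    if i is None:
        return list(steps)  # rollout already correct: keep as-is
    return list(steps[:i]) + list(corrected_suffix)
```

Because only the suffix after the first error is replaced, the salvaged trajectory stays close to the policy's own distribution, unlike whole-trajectory replacement.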

[277] LemmaBench: A Live, Research-Level Benchmark to Evaluate LLM Capabilities in Mathematics

Antoine Peyronnet, Fabian Gloeckle, Amaury Hayat

Main category: cs.AI

TL;DR: A new benchmark for evaluating LLMs on research-level mathematics using automatically extracted lemmas from arXiv, enabling regular updates and preventing data contamination.

DetailsMotivation: Existing benchmarks use static, hand-curated contest or textbook problems as proxies for mathematical research, which doesn't reflect real research-level mathematics and can lead to data contamination when models are trained on benchmark data.

Method: An automatic pipeline extracts lemmas from arXiv papers and rewrites them into self-contained statements by making all assumptions and definitions explicit. This creates an updatable benchmark where new problems come directly from ongoing mathematical research.

Result: Current state-of-the-art LLMs achieve around 10-15% accuracy (pass@1) in theorem proving, showing significant room for improvement to reach human-level proving capabilities in research contexts.

Conclusion: The benchmark provides a more realistic evaluation of LLMs on research-level mathematics and can be regularly updated with new problems while preventing data contamination, revealing a large gap between current LLM capabilities and human-level mathematical research proficiency.

Abstract: We present a new approach for benchmarking Large Language Model (LLM) capabilities on research-level mathematics. Existing benchmarks largely rely on static, hand-curated sets of contest or textbook-style problems as proxies for mathematical research. Instead, we establish an updatable benchmark evaluating models directly on the latest research results in mathematics. This consists of an automatic pipeline that extracts lemmas from arXiv and rewrites them into self-contained statements by making all assumptions and required definitions explicit. It results in a benchmark that can be updated regularly with new problems taken directly from human mathematical research, while previous instances can be used for training without compromising future evaluations. We benchmark current state-of-the-art LLMs, which obtain around 10-15% accuracy in theorem proving (pass@1) depending on the model, showing that there is currently a large margin of progression for LLMs to reach human-level proving capabilities in a research context.

[278] Learning Flexible Job Shop Scheduling under Limited Buffers and Material Kitting Constraints

Shishun Zhang, Juzhan Xu, Yidan Fan, Chenyang Zhu, Ruizhen Hu, Yongjun Wang, Kai Xu

Main category: cs.AI

TL;DR: Proposes a deep reinforcement learning approach with heterogeneous graph networks for Flexible Job Shop Scheduling with Limited Buffers and Material Kitting, outperforming traditional methods on makespan and pallet change metrics.

DetailsMotivation: Current Flexible Job Shop Scheduling Problem (FJSP) studies often ignore practical constraints like limited buffers, which significantly impact production efficiency. The paper addresses this gap by studying an extended problem that includes limited buffers and material kitting constraints to better match real-world scenarios.

Method: Uses deep reinforcement learning (DRL) with a heterogeneous graph network to model global state. The network constructs efficient message passing among machines, operations, and buffers, focusing on avoiding decisions that cause frequent pallet changes during long-sequence scheduling to improve buffer utilization and decision quality.

Result: Experimental results on synthetic and real production line datasets show the proposed method outperforms traditional heuristics and advanced DRL methods in terms of makespan and pallet changes. It also achieves a good balance between solution quality and computational cost.

Conclusion: The heterogeneous graph network within DRL framework effectively handles complex dependencies and long-term constraints in scheduling problems with limited buffers, providing a practical solution that bridges the gap between theoretical FJSP studies and real-world production scenarios.

Abstract: The Flexible Job Shop Scheduling Problem (FJSP) originates from real production lines, while some practical constraints are often ignored or idealized in current FJSP studies, among which the limited buffer problem has a particular impact on production efficiency. To this end, we study an extended problem that is closer to practical scenarios: the Flexible Job Shop Scheduling Problem with Limited Buffers and Material Kitting. In recent years, deep reinforcement learning (DRL) has demonstrated considerable potential in scheduling tasks. However, its capacity for state modeling remains limited when handling complex dependencies and long-term constraints. To address this, we leverage a heterogeneous graph network within the DRL framework to model the global state. By constructing efficient message passing among machines, operations, and buffers, the network focuses on avoiding decisions that may cause frequent pallet changes during long-sequence scheduling, thereby helping improve buffer utilization and overall decision quality. Experimental results on both synthetic and real production line datasets show that the proposed method outperforms traditional heuristics and advanced DRL methods in terms of makespan and pallet changes, and also achieves a good balance between solution quality and computational cost. Furthermore, a supplementary video is provided to showcase a simulation system that effectively visualizes the progression of the production line.

[279] Uncertainty Quantification for Multimodal Large Language Models with Incoherence-adjusted Semantic Volume

Gregory Kang Ruey Lau, Hieu Dao, Nicole Kan Hui Lin, Bryan Kian Hsiang Low

Main category: cs.AI

TL;DR: UMPIRE is a training-free uncertainty quantification framework for Multimodal Large Language Models that measures uncertainty by computing the incoherence-adjusted semantic volume of sampled responses across various modalities without external tools.

DetailsMotivation: MLLMs often produce plausible but erroneous outputs, hindering reliable deployment. Existing uncertainty metrics have limitations: they're modality-specific, require external tools, or are computationally expensive. There's a need for a general, efficient uncertainty quantification method for MLLMs.

Method: UMPIRE computes the incoherence-adjusted semantic volume of sampled MLLM responses using only the models’ internal modality features. It captures both global semantic diversity of samples and local incoherence of responses based on internal model confidence, without requiring training or external tools.

Result: UMPIRE consistently outperforms baseline metrics in error detection and uncertainty calibration across image, audio, and video-text benchmarks, including adversarial and out-of-distribution settings. It also generalizes to non-text output tasks like image and audio generation.

Conclusion: UMPIRE provides an effective, training-free uncertainty quantification framework for MLLMs that works across various modalities and output types, addressing practical constraints of existing methods and enabling more reliable deployment of MLLMs.

Abstract: Despite their capabilities, Multimodal Large Language Models (MLLMs) may produce plausible but erroneous outputs, hindering reliable deployment. Accurate uncertainty metrics could enable escalation of unreliable queries to human experts or larger models for improved performance. However, existing uncertainty metrics have practical constraints, such as being designed only for specific modalities, reliant on external tools, or computationally expensive. We introduce UMPIRE, a training-free uncertainty quantification framework for MLLMs that works efficiently across various input and output modalities without external tools, relying only on the models’ own internal modality features. UMPIRE computes the incoherence-adjusted semantic volume of sampled MLLM responses for a given task instance, effectively capturing both the global semantic diversity of samples and the local incoherence of responses based on internal model confidence. We propose uncertainty desiderata for MLLMs and provide theoretical analysis motivating UMPIRE’s design. Extensive experiments show that UMPIRE consistently outperforms baseline metrics in error detection and uncertainty calibration across image, audio, and video-text benchmarks, including adversarial and out-of-distribution settings. We also demonstrate UMPIRE’s generalization to non-text output tasks, including image and audio generation.
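The paper does not spell out its formula in this summary, but the notion of an incoherence-adjusted semantic volume can be illustrated with a standard construction: the log-volume spanned by response embeddings (via the Gram determinant), with each embedding scaled by a per-response confidence. Everything here, including the function name and the confidence weighting, is an assumption for illustration only:

```python
import numpy as np

def incoherence_adjusted_log_volume(embeddings, confidences, eps=1e-6):
    """Measure the spread (log-volume) spanned by sampled-response
    embeddings, down-weighting low-confidence (incoherent) responses.
    embeddings: (n, d) array; confidences: (n,) values in (0, 1]."""
    X = np.asarray(embeddings, dtype=float)
    w = np.asarray(confidences, dtype=float)
    X = X * w[:, None]                      # incoherence adjustment
    G = X @ X.T + eps * np.eye(len(X))      # regularized Gram matrix
    _, logdet = np.linalg.slogdet(G)
    return 0.5 * logdet                     # log of sqrt(det G)
```

Semantically diverse samples span a larger volume (higher uncertainty); near-duplicate or low-confidence samples collapse it.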

[280] A Minimal Agent for Automated Theorem Proving

Borja Requena Pozo, Austin Letson, Krystian Nowakowski, Izan Beltran Ferreiro, Leopoldo Sarra

Main category: cs.AI

TL;DR: A minimal baseline agentic theorem prover that enables systematic comparison of AI-based theorem prover architectures, featuring iterative proof refinement, library search, and context management.

DetailsMotivation: To create a standardized baseline for comparing different AI-based theorem prover architectures, enabling systematic evaluation of design choices and model performance in a controlled setting.

Method: Designs a minimal agentic baseline implementing core features shared by state-of-the-art systems: iterative proof refinement, library search, and context management. Evaluates using diverse benchmarks and compares various models and design choices.

Result: Achieves competitive performance compared to state-of-the-art approaches with significantly simpler architecture. Demonstrates consistent advantages of iterative approach over single-shot generations in sample efficiency and cost effectiveness.

Conclusion: The minimal baseline provides a valuable reference for future research and an accessible theorem prover for the community, showing that simpler architectures can achieve competitive performance through systematic design.

Abstract: We propose a minimal agentic baseline that enables systematic comparison across different AI-based theorem prover architectures. This design implements the core features shared among state-of-the-art systems: iterative proof refinement, library search, and context management. We evaluate our baseline on qualitatively different benchmarks, compare various popular models and design choices, and demonstrate competitive performance relative to state-of-the-art approaches, while using a significantly simpler architecture. Our results demonstrate consistent advantages of an iterative approach over multiple single-shot generations, especially in terms of sample efficiency and cost effectiveness. The implementation is released open-source as a candidate reference for future research and as an accessible prover for the community.
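The three core features named above compose into a short loop. This is a generic sketch of such an agent, not the released implementation; `generate`, `check`, and `search_library` stand in for an LLM call, a proof checker, and lemma retrieval:

```python
def prove(statement, generate, check, search_library, max_iters=8):
    """Iterative proof refinement: generate an attempt, check it, and feed
    the checker's error plus retrieved lemmas back into the next attempt."""
    context = []
    for _ in range(max_iters):
        attempt = generate(statement, context)
        ok, error = check(attempt)
        if ok:
            return attempt
        context.append(error)                  # context management
        context.extend(search_library(error))  # library search
    return None  # give up after the iteration budget
```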

[281] DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science

Fan Shu, Yite Wang, Ruofan Wu, Boyi Liu, Zhewei Yao, Yuxiong He, Feng Yan

Main category: cs.AI

TL;DR: DARE-bench is a benchmark for evaluating LLMs on data science tasks with verifiable ground truth, addressing gaps in standardized process-aware evaluation and training data scarcity.

DetailsMotivation: Existing benchmarks lack standardized process-aware evaluation that captures instruction adherence and process fidelity, and suffer from scarcity of accurately labeled training data for data science tasks.

Method: Created DARE-bench with 6,300 Kaggle-derived tasks covering a broad range of data science operations, providing verifiable ground truth for objective evaluation and large-scale training/evaluation sets.

Result: Even highly capable models like gpt-o4-mini struggle, especially on ML modeling tasks. Fine-tuning with DARE-bench data substantially improves performance: Qwen3-32B accuracy increased 1.83x with supervised fine-tuning, and Qwen3-4B accuracy improved over 8x with reinforcement learning.

Conclusion: DARE-bench serves as both an accurate evaluation benchmark and critical training data source for improving LLM performance on data science instruction following tasks.

Abstract: The fast-growing demands in using Large Language Models (LLMs) to tackle complex multi-step data science tasks create an emergent need for accurate benchmarking. There are two major gaps in existing benchmarks: (i) the lack of standardized, process-aware evaluation that captures instruction adherence and process fidelity, and (ii) the scarcity of accurately labeled training data. To bridge these gaps, we introduce DARE-bench, a benchmark designed for machine learning modeling and data science instruction following. Unlike many existing benchmarks that rely on human- or model-based judges, all tasks in DARE-bench have verifiable ground truth, ensuring objective and reproducible evaluation. To cover a broad range of tasks and support agentic tools, DARE-bench consists of 6,300 Kaggle-derived tasks and provides both large-scale training data and evaluation sets. Extensive evaluations show that even highly capable models such as gpt-o4-mini struggle to achieve good performance, especially in machine learning modeling tasks. Using DARE-bench training tasks for fine-tuning can substantially improve model performance. For example, supervised fine-tuning boosts Qwen3-32B’s accuracy by 1.83x and reinforcement learning boosts Qwen3-4B’s accuracy by more than 8x. These significant improvements verify the importance of DARE-bench both as an accurate evaluation benchmark and critical training data.

[282] Offline-to-Online Multi-Agent Reinforcement Learning with Offline Value Function Memory and Sequential Exploration

Hai Zhong, Xun Wang, Zhuoran Li, Longbo Huang

Main category: cs.AI

TL;DR: Proposes OVMSE framework for Offline-to-Online Multi-Agent Reinforcement Learning with value function memory and sequential exploration to address distributional shift and exploration challenges in multi-agent settings.

DetailsMotivation: Existing Offline-to-Online RL research focuses on single-agent settings, with limited exploration of multi-agent extension (O2O MARL). Two critical challenges emerge in multi-agent settings: (1) risk of unlearning pre-trained Q-values due to distributional shifts during offline-to-online transition, and (2) difficulty of efficient exploration in large joint state-action spaces.

Method: Proposes OVMSE framework with two key components: (1) Offline Value Function Memory (OVM) mechanism to compute target Q-values, preserving offline knowledge and enabling smooth transitions, and (2) decentralized Sequential Exploration (SE) strategy that utilizes pre-trained offline policy for exploration, reducing joint state-action space.

Result: Extensive experiments on StarCraft Multi-Agent Challenge (SMAC) demonstrate OVMSE significantly outperforms existing baselines, achieving superior sample efficiency and overall performance.

Conclusion: OVMSE effectively addresses key challenges in O2O MARL through value function memory preservation and efficient sequential exploration, showing promising results in complex multi-agent environments.

Abstract: Offline-to-Online Reinforcement Learning has emerged as a powerful paradigm, leveraging offline data for initialization and online fine-tuning to enhance both sample efficiency and performance. However, most existing research has focused on single-agent settings, with limited exploration of the multi-agent extension, i.e., Offline-to-Online Multi-Agent Reinforcement Learning (O2O MARL). In O2O MARL, two critical challenges become more prominent as the number of agents increases: (i) the risk of unlearning pre-trained Q-values due to distributional shifts during the transition from offline-to-online phases, and (ii) the difficulty of efficient exploration in the large joint state-action space. To tackle these challenges, we propose a novel O2O MARL framework called Offline Value Function Memory with Sequential Exploration (OVMSE). First, we introduce the Offline Value Function Memory (OVM) mechanism to compute target Q-values, preserving knowledge gained during offline training, ensuring smoother transitions, and enabling efficient fine-tuning. Second, we propose a decentralized Sequential Exploration (SE) strategy tailored for O2O MARL, which effectively utilizes the pre-trained offline policy for exploration, thereby significantly reducing the joint state-action space to be explored. Extensive experiments on the StarCraft Multi-Agent Challenge (SMAC) demonstrate that OVMSE significantly outperforms existing baselines, achieving superior sample efficiency and overall performance.
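The OVM mechanism's precise target computation is not given in this summary; one plausible reading, sketched below with a hypothetical blending coefficient, is that the bootstrap target mixes the online estimate with a frozen offline one so that early online updates cannot unlearn pre-trained values:

```python
def ovm_target(reward, q_online_next, q_offline_next, gamma=0.99, beta=0.5):
    """Blend a frozen offline Q-estimate into the bootstrap target.
    beta controls how much offline knowledge anchors the online critic
    (beta and the linear blend are illustrative assumptions)."""
    q_next = beta * q_offline_next + (1.0 - beta) * q_online_next
    return reward + gamma * q_next
```

With beta = 0 this reduces to the standard online TD target; larger beta preserves more of the offline value function during the transition phase.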

[283] CowPilot: A Framework for Autonomous and Human-Agent Collaborative Web Navigation

Faria Huq, Zora Zhiruo Wang, Frank F. Xu, Tianyue Ou, Shuyan Zhou, Jeffrey P. Bigham, Graham Neubig

Main category: cs.AI

TL;DR: CowPilot is a human-agent collaborative framework for web navigation that allows users to interject, override, or resume agent control during task execution, achieving 95% success rate with humans performing only 15.2% of steps.

DetailsMotivation: Current web agents often fail on complex real-world tasks and struggle with user preference modeling, creating a need for human-agent collaboration frameworks that leverage both human judgment and agent capabilities.

Method: CowPilot enables autonomous and collaborative web navigation where agents propose next steps while users can pause, reject, or take alternative actions. Users can interleave their actions by overriding suggestions or resuming agent control as needed.

Result: Case studies on five common websites show the human-agent collaborative mode achieves 95% success rate with humans performing only 15.2% of total steps. Even with interventions, agents drive up to half of task success independently.

Conclusion: CowPilot serves as a useful tool for data collection and agent evaluation across websites, enabling research into effective human-agent collaboration in web navigation tasks.

Abstract: While much work on web agents emphasizes the promise of autonomously performing tasks on behalf of users, in reality, agents often fall short on complex tasks in real-world contexts and modeling user preference. This presents an opportunity for humans to collaborate with the agent and leverage the agent’s capabilities effectively. We propose CowPilot, a framework supporting autonomous as well as human-agent collaborative web navigation, and evaluation across task success and task efficiency. CowPilot reduces the number of steps humans need to perform by allowing agents to propose next steps, while users are able to pause, reject, or take alternative actions. During execution, users can interleave their actions with the agent by overriding suggestions or resuming agent control when needed. We conducted case studies on five common websites and found that the human-agent collaborative mode achieves the highest success rate of 95% while requiring humans to perform only 15.2% of the total steps. Even with human interventions during task execution, the agent successfully drives up to half of task success on its own. CowPilot can serve as a useful tool for data collection and agent evaluation across websites, which we believe will enable research in how users and agents can work together. Video demonstrations are available at https://oaishi.github.io/cowpilot.html
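The pause/reject/override control flow described above is the crux of the framework. A minimal sketch of one collaborative step (interfaces and decision labels are hypothetical, not CowPilot's actual API):

```python
def collaborative_step(agent_propose, get_user_decision, execute, state):
    """One step of human-agent collaboration: the agent proposes an action;
    the user may accept it, override it with their own action, or pause."""
    proposal = agent_propose(state)
    decision, user_action = get_user_decision(proposal)
    if decision == "accept":
        return execute(state, proposal)
    if decision == "override":
        return execute(state, user_action)
    return state  # "pause": no action is taken this step
```

Counting how often the branch taken is "accept" versus "override" is one natural way to arrive at statistics like "humans performed 15.2% of total steps."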

[284] Language Models as Messengers: Enhancing Message Passing in Heterophilic Graph Learning

Dawei Cheng, Wenjun Wang, Mingjian Guang

Main category: cs.AI

TL;DR: LEMP4HG is a language model-enhanced message passing approach for heterophilic graph learning that uses LMs to model semantic relationships from node texts and selectively enhances messages for critical node pairs.

DetailsMotivation: Standard GNN message passing assumes messages can be represented by source node embeddings, which fails in heterophilic graphs where connected nodes have different classes. Existing methods overlook the semantic potential of node text and compromise performance on homophilic graphs.

Method: Proposes LEMP4HG: 1) Uses language models to explicitly model inter-node semantic relationships from paired node texts, synthesizing semantically informed messages for propagation, 2) Introduces active learning-inspired strategy with MVRD heuristic to selectively enhance messages for node pairs most affected by message passing, ensuring practical efficiency.

Result: Extensive experiments show LEMP4HG consistently outperforms state-of-the-art methods on heterophilic graphs while maintaining robust performance on homophilic graphs under practical computational budget.

Conclusion: LEMP4HG effectively addresses heterophily in graph learning by leveraging language models for semantic message representation and selective enhancement, achieving superior performance across both heterophilic and homophilic graphs.

Abstract: Graph neural networks (GNNs) have become a standard paradigm for graph representation learning, yet their message passing mechanism implicitly assumes that messages can be represented by source node embeddings, an assumption that fails in heterophilic graphs. While existing methods attempt to address heterophily through graph structure refinement or adaptation of neighbor aggregation, they often overlook the semantic potential of node text, relying on suboptimal message representations for propagation and compromising performance on homophilic graphs. To address these limitations, we propose LEMP4HG, a novel language model (LM)-enhanced message passing approach for heterophilic graph learning. Specifically, for text-attributed graphs (TAG), we leverage an LM to explicitly model inter-node semantic relationships from paired node texts, synthesizing semantically informed messages for propagation. To ensure practical efficiency, we further introduce an active learning-inspired strategy guided by a tailored heuristic, MVRD, which selectively enhances messages for node pairs most affected by message passing. Extensive experiments demonstrate that LEMP4HG consistently outperforms state-of-the-art methods on heterophilic graphs while maintaining robust performance on homophilic graphs under a practical computational budget.
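The selective enhancement idea can be illustrated with a minimal sketch. This is an illustration only, not the paper's implementation: the `pair_score` lambda stands in for the MVRD heuristic, and `lm_message` stands in for the LM that synthesizes a message from paired node texts.

```python
import numpy as np

def enhanced_messages(edges, node_emb, pair_score, lm_message, k):
    """Default message = source-node embedding; for the k highest-scoring
    edges (score standing in for the MVRD heuristic), replace the message
    with an LM-synthesized embedding of the paired node texts."""
    scores = [pair_score(u, v) for u, v in edges]
    top = set(np.argsort(scores)[-k:])  # indices of the k top-scoring edges
    return [lm_message(u, v) if i in top else node_emb[u]
            for i, (u, v) in enumerate(edges)]

# Toy graph: strings stand in for embeddings and LM outputs.
edges = [(0, 1), (1, 2), (2, 0)]
node_emb = {0: "e0", 1: "e1", 2: "e2"}
msgs = enhanced_messages(edges, node_emb,
                         pair_score=lambda u, v: u + v,      # placeholder score
                         lm_message=lambda u, v: f"lm{u}{v}",
                         k=1)
# Only the top-scoring edge (1, 2) receives the LM-synthesized message.
```

Restricting LM calls to the top-k pairs is what keeps the approach within a practical computational budget, since invoking an LM per edge would be prohibitive on large graphs.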

[285] CoMind: Towards Community-Driven Agents for Machine Learning Engineering

Sijie Li, Weiwei Sun, Shanda Li, Ameet Talwalkar, Yiming Yang

Main category: cs.AI

TL;DR: CoMind is a multi-agent system that leverages collective knowledge from simulated Kaggle communities to automate ML engineering, achieving top performance in competitions.

DetailsMotivation: Current LLM agents operate in isolation without engaging with research communities, missing the collaborative knowledge-sharing that human researchers benefit from.

Method: Introduces MLE-Live evaluation framework and CoMind multi-agent system with iterative parallel exploration to develop multiple solutions simultaneously, balancing breadth and depth.

Result: Achieved 36% medal rate on 75 past Kaggle competitions, and in 8 live competitions outperformed 92.6% of human competitors on average, with top 5% placement in three competitions and top 1% in one.

Conclusion: CoMind demonstrates that systematic knowledge leveraging through multi-agent collaboration can significantly advance automated ML engineering, bridging the gap between isolated agents and community-driven research.

Abstract: Large language model (LLM) agents show promise in automating machine learning (ML) engineering. However, existing agents typically operate in isolation on a given research problem, without engaging with the broader research community, where human researchers often gain insights and contribute by sharing knowledge. To bridge this gap, we introduce MLE-Live, a live evaluation framework designed to assess an agent’s ability to communicate with and leverage collective knowledge from a simulated Kaggle research community. Building on this framework, we propose CoMind, a multi-agent system designed to systematically leverage external knowledge. CoMind employs an iterative parallel exploration mechanism, developing multiple solutions simultaneously to balance exploratory breadth with implementation depth. On 75 past Kaggle competitions within our MLE-Live framework, CoMind achieves a 36% medal rate, establishing a new state of the art. Critically, when deployed in eight live, ongoing competitions, CoMind outperforms 92.6% of human competitors on average, placing in the top 5% on three official leaderboards and the top 1% on one.

[286] MACD: Multi-Agent Clinical Diagnosis with Self-Learned Knowledge for LLM

Wenliang Li, Rui Yan, Xu Zhang, Li Chen, Hongji Zhu, Jing Zhao, Junjun Li, Mengru Li, Wei Cao, Zihang Jiang, Wei Wei, Kun Zhang, Shaohua Kevin Zhou

Main category: cs.AI

TL;DR: Proposes MACD, a multi-agent framework for clinical diagnosis where LLMs self-learn medical knowledge through experience accumulation, achieving significant accuracy improvements over clinical guidelines and comparable/superior performance to physicians.

DetailsMotivation: Current LLMs struggle with complex real-world clinical diagnoses using conventional prompting methods, as they optimize isolated inferences without accumulating reusable clinical experience like physicians do through practice.

Method: MACD framework uses a multi-agent pipeline where LLM agents summarize, refine, and apply diagnostic insights through iterative consultations. Includes diagnostician agents, evaluator agent, and human oversight for unresolved cases, enabling self-learning of clinical knowledge.

Result: Evaluated on 4,390 real-world cases across 7 diseases using Llama-3.1 and DeepSeek models. MACD improved primary diagnostic accuracy up to 22.3% over clinical guidelines, achieved comparable/superior performance to physicians (up to 16% improvement), and MACD-human workflow yielded 18.6% improvement over physician-only diagnosis.

Conclusion: MACD presents a scalable self-learning paradigm that bridges LLMs’ intrinsic knowledge with clinical expertise accumulation, demonstrating strong cross-model stability, transferability, and personalization potential for human-AI collaboration in medical diagnosis.

Abstract: Large language models (LLMs) have demonstrated notable potential in medical applications, yet they face substantial challenges in handling complex real-world clinical diagnoses using conventional prompting methods. Current prompt engineering and multi-agent approaches typically optimize isolated inferences, neglecting the accumulation of reusable clinical experience. To address this, this study proposes a novel Multi-Agent Clinical Diagnosis (MACD) framework, which allows LLMs to self-learn clinical knowledge via a multi-agent pipeline that summarizes, refines, and applies diagnostic insights. It mirrors how physicians develop expertise through experience, enabling more focused and accurate diagnosis on key disease-specific cues. We further extend it to a MACD-human collaborative workflow, where multiple LLM-based diagnostician agents engage in iterative consultations, supported by an evaluator agent and human oversight for cases where agreement is not reached. Evaluated on 4,390 real-world patient cases across seven diseases using diverse open-source LLMs (Llama-3.1 8B/70B, DeepSeek-R1-Distill-Llama 70B), MACD significantly improves primary diagnostic accuracy, outperforming established clinical guidelines with gains up to 22.3% (MACD). In direct comparison with physician-only diagnosis under the same evaluation protocol, MACD achieves comparable or superior performance, with improvements up to 16%. Furthermore, the MACD-human workflow yields an 18.6% improvement over physician-only diagnosis, demonstrating the synergistic potential of human-AI collaboration. Notably, the self-learned clinical knowledge exhibits strong cross-model stability, transferability across LLMs, and capacity for model-specific personalization. This work thus presents a scalable self-learning paradigm that bridges the gap between the intrinsic knowledge of LLMs and the clinical expertise accumulated through practice.
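The consultation loop with human fallback can be sketched as follows. This is a minimal illustration of the control flow only; the agent, evaluator, and case representations are placeholders, not the paper's pipeline (which also maintains a self-learned knowledge store).

```python
def consult(case, diagnosticians, evaluator, max_rounds=3):
    """Iterative multi-agent consultation: diagnostician agents propose
    diagnoses each round, seeing prior rounds; an evaluator checks for
    agreement; unresolved cases are escalated to human oversight
    (the MACD-human workflow)."""
    history = []
    for _ in range(max_rounds):
        proposals = [d(case, history) for d in diagnosticians]
        history.append(proposals)
        if evaluator(proposals):             # agreement reached
            return proposals[0], "agent"
    return None, "escalate_to_human"         # no consensus: human decides

# Toy diagnosticians: each maps (case, history) to a diagnosis string.
agents = [lambda c, h: "influenza", lambda c, h: "influenza"]
agree = lambda ps: len(set(ps)) == 1         # placeholder evaluator
diagnosis, route = consult("fever, cough, myalgia", agents, agree)
```

Routing only disagreement cases to humans is what lets the hybrid workflow outperform physician-only diagnosis while keeping human effort bounded.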

[287] Demystifying the Lifecycle of Failures in Platform-Orchestrated Agentic Workflows

Xuyan Ma, Xiaofei Xie, Yawen Wang, Junjie Wang, Boyu Wu, Mingyang Li, Qing Wang

Main category: cs.AI

TL;DR: Empirical study of failure modes in low-code orchestrated agentic workflows, analyzing 307 real-world failure cases to characterize patterns, root causes, and repair strategies.

DetailsMotivation: Agentic workflows on low-code platforms enable rapid multi-agent system development but introduce poorly understood failure modes that propagate across heterogeneous nodes through natural-language interactions, tool invocations, and dynamic control logic, making failure attribution and repair challenging.

Method: Presents AgentFail dataset of 307 real-world failure cases from two representative agentic workflow platforms. Analyzes failure patterns, root causes, and repair difficulty for various failure root causes and workflow nodes from a failure lifecycle perspective.

Result: Reveals key failure mechanisms in agentic workflows and provides actionable guidelines for reliable failure repair and real-world agentic workflow design based on empirical analysis of failure patterns.

Conclusion: Provides empirical understanding of agentic workflow failures, offering practical insights for improving reliability and maintainability of multi-agent systems built on low-code orchestration platforms.

Abstract: Agentic workflows built on low-code orchestration platforms enable rapid development of multi-agent systems, but they also introduce new and poorly understood failure modes that hinder reliability and maintainability. Unlike traditional software systems, failures in agentic workflows often propagate across heterogeneous nodes through natural-language interactions, tool invocations, and dynamic control logic, making failure attribution and repair particularly challenging. In this paper, we present an empirical study of platform-orchestrated agentic workflows from a failure lifecycle perspective, with the goal of characterizing failure manifestations, identifying underlying root causes, and examining corresponding repair strategies. We present AgentFail, a dataset of 307 real-world failure cases collected from two representative agentic workflow platforms. Based on this dataset, we analyze failure patterns, root causes, and repair difficulty for various failure root causes and nodes in the workflow. Our findings reveal key failure mechanisms in agentic workflows and provide actionable guidelines for reliable failure repair and real-world agentic workflow design.

[288] RE-PO: Robust Enhanced Policy Optimization as a General Framework for LLM Alignment

Xiaoyang Cao, Zelai Xu, Mo Guang, Kaiwen Long, Michiel A. Bakker, Yu Wang, Chao Yu

Main category: cs.AI

TL;DR: RE-PO is a robust preference alignment framework that addresses label noise in human feedback data through expectation-maximization and adaptive reweighting, improving existing alignment methods.

DetailsMotivation: Standard preference alignment methods like RLHF assume clean preference data, but real-world datasets contain substantial noise from annotator mistakes, inconsistent instructions, varying expertise, and adversarial feedback, which can degrade model performance.

Method: RE-PO uses an expectation-maximization procedure to infer posterior correctness of each label and adaptively reweights data points in the training loss to mitigate label noise. It establishes a theoretical link between preference losses and probabilistic models to transform existing alignment algorithms into robust counterparts.

Result: RE-PO consistently improves four state-of-the-art alignment methods (DPO, IPO, SimPO, and CPO), increasing AlpacaEval 2 win rates by up to 7.0% over baselines when applied to Mistral and Llama 3 models.

Conclusion: RE-PO provides a general framework for robust preference alignment that addresses label noise issues in real-world datasets, theoretically recovering true noise levels and empirically improving model performance across multiple alignment methods.

Abstract: Standard human preference-based alignment methods, such as Reinforcement Learning from Human Feedback (RLHF), are a cornerstone for aligning large language models (LLMs) with human values. However, these methods typically assume that preference data is clean and that all labels are equally reliable. In practice, large-scale preference datasets contain substantial noise due to annotator mistakes, inconsistent instructions, varying expertise, and even adversarial or low-effort feedback. This mismatch between recorded labels and ground-truth preferences can misguide training and degrade model performance. To address this issue, we introduce Robust Enhanced Policy Optimization (RE-PO), which uses an expectation-maximization procedure to infer the posterior correctness of each label and then adaptively reweight data points in the training loss to mitigate label noise. We further generalize this idea by establishing a theoretical link between arbitrary preference losses and their underlying probabilistic models, enabling a systematic transformation of existing alignment algorithms into robust counterparts and elevating RE-PO from a single method to a general framework for robust preference alignment. Theoretically, we prove that, under a perfectly calibrated model, RE-PO recovers the true noise level of the dataset. Empirically, we show that RE-PO consistently improves four state-of-the-art alignment methods (DPO, IPO, SimPO, and CPO); when applied to Mistral and Llama 3 models, the RE-PO-enhanced variants increase AlpacaEval 2 win rates by up to 7.0 percent over their respective baselines.
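The expectation-maximization reweighting idea can be illustrated with a minimal sketch. This shows generic posterior label-correctness weighting under a symmetric label-flip model with a Bradley-Terry likelihood; it is not the paper's exact formulation, and the function names and toy margins are assumptions.

```python
import math

def label_posteriors(margins, eps):
    """E-step: posterior probability that each recorded preference label is
    correct, assuming labels are flipped independently with probability eps
    and the model assigns the recorded label probability sigmoid(margin)."""
    post = []
    for m in margins:
        p = 1.0 / (1.0 + math.exp(-m))       # model prob. of recorded label
        post.append((1.0 - eps) * p / ((1.0 - eps) * p + eps * (1.0 - p)))
    return post

def em_reweight(margins, n_iters=20, eps=0.1):
    """Alternate E- and M-steps: re-estimate the flip rate from the current
    posteriors, then recompute posteriors; the final posteriors act as
    per-example weights on the preference loss."""
    for _ in range(n_iters):
        post = label_posteriors(margins, eps)
        eps = 1.0 - sum(post) / len(post)    # expected fraction of flips
    return eps, label_posteriors(margins, eps)

# Toy reward margins: four labels the model agrees with, two it contradicts.
margins = [2.0, 1.5, 3.0, 2.5, -2.0, -1.8]
eps, weights = em_reweight(margins)
# `weights` would then scale each example's term in a DPO/IPO-style loss,
# down-weighting labels the calibrated model believes are flipped.
```

The same reweighting can wrap any of the preference losses the paper covers (DPO, IPO, SimPO, CPO), which is what elevates the method from a single algorithm to a general framework.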

[289] MITS: Enhanced Tree Search Reasoning for LLMs via Pointwise Mutual Information

Jiaxi Li, Yucheng Shi, Xiao Huang, Jin Lu, Ninghao Liu

Main category: cs.AI

Summary unavailable: the arXiv API returned HTTP 429 (rate limited) when fetching 2510.03632.

[290] Reallocating Attention Across Layers to Reduce Multimodal Hallucination

Haolang Lu, Bolun Chu, WeiYe Fu, Guoshun Nan, Junning Liu, Minghui Pan, Qiankun Li, Yi Yu, Hua Wang, Kun Wang

Main category: cs.AI

Summary unavailable: the arXiv API returned HTTP 429 (rate limited) when fetching 2510.10285.

[291] Automating the Refinement of Reinforcement Learning Specifications

Tanmay Ambadkar, Đorđe Žikelić, Abhinav Verma

Main category: cs.AI

Summary unavailable: the arXiv API returned HTTP 429 (rate limited) when fetching 2512.01047.

[292] Radiologist Copilot: An Agentic Framework Orchestrating Specialized Tools for Reliable Radiology Reporting

Yongrui Yu, Zhongzhen Huang, Linjie Mu, Shaoting Zhang, Xiaofan Zhang

Main category: cs.AI

Summary unavailable: the arXiv API returned HTTP 429 (rate limited) when fetching 2512.02814.

[293] From Moderation to Mediation: Can LLMs Serve as Mediators in Online Flame Wars?

Dawei Li, Abdullah Alnaibari, Arslan Bisharat, Manny Sandoval, Deborah Hall, Yasin Silva, Huan Liu

Main category: cs.AI

Summary unavailable: the arXiv API returned HTTP 429 (rate limited) when fetching 2512.03005.

[294] How do Visual Attributes Influence Web Agents? A Comprehensive Evaluation of User Interface Design Factors

Kuai Yu, Naicheng Yu, Han Wang, Rui Yang, Huan Zhang

Main category: cs.AI

Summary unavailable: the arXiv API returned HTTP 429 (rate limited) when fetching 2601.21961.

[295] Real-Time Aligned Reward Model beyond Semantics

Zixuan Huang, Xin Xia, Yuxi Ren, Jianbin Zheng, Xuefeng Xiao, Hongyan Xie, Li Huaqiu, Songshi Liang, Zhongxiang Dai, Fuzhen Zhuang, Jianxin Li, Yikun Ban, Deqing Wang

Main category: cs.AI

Summary unavailable: the arXiv API returned HTTP 429 (rate limited) when fetching 2601.22664.

[296] Does Your Reasoning Model Implicitly Know When to Stop Thinking?

Zixuan Huang, Xin Xia, Yuxi Ren, Jianbin Zheng, Xuanda Wang, Zhixia Zhang, Hongyan Xie, Songshi Liang, Zehao Chen, Xuefeng Xiao, Fuzhen Zhuang, Jianxin Li, Yikun Ban, Deqing Wang

Main category: cs.AI

Summary unavailable: the arXiv API returned HTTP 429 (rate limited) when fetching 2602.08354.

[297] ForesightSafety Bench: A Frontier Risk Evaluation and Governance Framework towards Safe AI

Haibo Tong, Feifei Zhao, Linghao Feng, Ruoyu Wu, Ruolin Chen, Lu Jia, Zhou Zhao, Jindong Li, Tenglong Li, Erliang Lin, Shuai Yang, Enmeng Lu, Yinqian Sun, Qian Zhang, Zizhe Ruan, Jinyu Fan, Zeyang Yue, Ping Wu, Huangrui Li, Chengyi Sun, Yi Zeng

Main category: cs.AI

Summary unavailable: the arXiv API returned HTTP 429 (rate limited) when fetching 2602.14135.

[298] IntentCUA: Learning Intent-level Representations for Skill Abstraction and Multi-Agent Planning in Computer-Use Agents

Seoyoung Lee, Seobin Yoon, Seongbeen Lee, Yoojung Chun, Dayoung Park, Doyeon Kim, Joo Yong Sim

Main category: cs.AI

Summary unavailable: the arXiv API returned HTTP 429 (rate limited) when fetching 2602.17049.

[299] Robust and Efficient Tool Orchestration via Layered Execution Structures with Reflective Correction

Tao Zhe, Haoyu Wang, Bo Luo, Min Wu, Wei Fan, Xiao Luo, Zijun Yao, Haifeng Chen, Dongjie Wang

Main category: cs.AI

Summary unavailable: the arXiv API returned HTTP 429 (rate limited) when fetching 2602.18968.

[300] fEDM+: A Risk-Based Fuzzy Ethical Decision Making Framework with Principle-Level Explainability and Pluralistic Validation

Abeer Dyoub, Francesca A. Lisi

Main category: cs.AI

Summary unavailable: the arXiv API returned HTTP 429 (rate limited) when fetching 2602.21746.

[301] Multi-Level Causal Embeddings

Willem Schooltink, Fabio Massimo Zennaro

Main category: cs.AI

Summary unavailable: the arXiv API returned HTTP 429 (rate limited) when fetching 2602.22287.

[302] Vibe Researching as Wolf Coming: Can AI Agents with Skills Replace or Augment Social Scientists?

Yongjun Zhang

Main category: cs.AI

Summary unavailable: the arXiv API returned HTTP 429 (rate limited) when fetching 2602.22401.

[303] ConstraintBench: Benchmarking LLM Constraint Reasoning on Direct Optimization

Joseph Tso, Preston Schmittou, Quan Huynh, Jibran Hutchins

Main category: cs.AI

Summary unavailable: the arXiv API returned HTTP 429 (rate limited) when fetching 2602.22465.

Xun Huang, Simeng Qin, Xiaoshuang Jia, Ranjie Duan, Huanqian Yan, Zhitao Zeng, Fei Yang, Yang Liu, Xiaojun Jia

Main category: cs.AI

Summary unavailable: the arXiv API returned HTTP 429 (rate limited) when fetching 2602.22983.

[305] ODEBrain: Continuous-Time EEG Graph for Modeling Dynamic Brain Networks

Haohui Jia, Zheng Chen, Lingwei Zhu, Rikuto Kotoge, Jathurshan Pradeepkumar, Yasuko Matsubara, Jimeng Sun, Yasushi Sakurai, Takashi Matsubara

Main category: cs.AI

Summary unavailable: the arXiv API returned HTTP 429 (rate limited) when fetching 2602.23285.

[306] A blockchain-based intelligent recommender system framework for enhancing supply chain resilience

Yang Hu

Main category: cs.AI

Summary unavailable: the arXiv API returned HTTP 429 (rate limited) when fetching 2404.00306.

[307] LLM-hRIC: LLM-empowered Hierarchical RAN Intelligent Control for O-RAN

Lingyan Bao, Sinwoong Yun, Jemin Lee, Tony Q.S. Quek

Main category: cs.AI

Summary unavailable: the arXiv API returned HTTP 429 (rate limited) when fetching 2504.18062.

Rui Liu, Rui Xie, Zijun Yao, Yanjie Fu, Dongjie Wang

Main category: cs.AI

Summary unavailable: the arXiv API returned HTTP 429 (rate limited) when fetching 2505.11601.

[309] Fairness-in-the-Workflow: How Machine Learning Practitioners at Big Tech Companies Approach Fairness in Recommender Systems

Jing Nathan Yan, Emma Harvey, Junxiong Wang, Jeffrey M. Rzeszotarski, Allison Koenecke

Main category: cs.AI

Summary unavailable: the arXiv API returned HTTP 429 (rate limited) when fetching 2505.19441.

[310] Multi-View Encoders for Performance Prediction in LLM-Based Agentic Workflows

Patara Trirat, Wonyong Jeong, Sung Ju Hwang

Main category: cs.AI

Summary unavailable: the arXiv API returned HTTP 429 (rate limited) when fetching 2505.19764.

[311] Representing local protein environments with atomistic foundation models

Meital Bojan, Sanketh Vedula, Advaith Maddipatla, Nadav Bojan Sellam, Federico Napoli, Paul Schanda, Alex M. Bronstein

Main category: cs.AI

Summary unavailable: the arXiv API returned HTTP 429 (rate limited) when fetching 2505.23354.

[312] Bridging the Performance Gap Between Target-Free and Target-Based Reinforcement Learning

Théo Vincent, Yogesh Tripathi, Tim Faust, Abdullah Akgül, Yaniv Oren, Melih Kandemir, Jan Peters, Carlo D’Eramo

Main category: cs.AI

Summary unavailable: the arXiv API returned HTTP 429 (rate limited) when fetching 2506.04398.

[313] Estimating Treatment Effects with Independent Component Analysis

Patrik Reizinger, Lester Mackey, Wieland Brendel, Rahul Krishnan

Main category: cs.AI

Abstract: Summary unavailable; the arXiv API request for 2507.16467 returned HTTP 429 (rate limited).

[314] Approximate SMT Counting Beyond Discrete Domains

Arijit Shaw, Kuldeep S. Meel

Main category: cs.AI

Abstract: Summary unavailable; the arXiv API request for 2507.18612 returned HTTP 429 (rate limited).

[315] From Generator to Embedder: Harnessing Innate Abilities of Multimodal LLMs via Building Zero-Shot Discriminative Embedding Model

Yeong-Joon Ju, Seong-Whan Lee

Main category: cs.AI

Abstract: Summary unavailable; the arXiv API request for 2508.00955 returned HTTP 429 (rate limited).

[316] OM2P: Offline Multi-Agent Mean-Flow Policy

Zhuoran Li, Xun Wang, Hai Zhong, Qingxin Xia, Lihua Zhang, Longbo Huang

Main category: cs.AI

Abstract: Summary unavailable; the arXiv API request for 2508.06269 returned HTTP 429 (rate limited).

[317] Beyond Naïve Prompting: Strategies for Improved Context-aided Forecasting with LLMs

Arjun Ashok, Andrew Robert Williams, Vincent Zhihao Zheng, Irina Rish, Nicolas Chapados, Étienne Marcotte, Valentina Zantedeschi, Alexandre Drouin

Main category: cs.AI

Abstract: Summary unavailable; the arXiv API request for 2508.09904 returned HTTP 429 (rate limited).

[318] Actor-Critic for Continuous Action Chunks: A Reinforcement Learning Framework for Long-Horizon Robotic Manipulation with Sparse Reward

Jiarui Yang, Bin Zhu, Jingjing Chen, Yu-Gang Jiang

Main category: cs.AI

Abstract: Summary unavailable; the arXiv API request for 2508.11143 returned HTTP 429 (rate limited).

[319] LumiMAS: A Comprehensive Framework for Real-Time Monitoring and Enhanced Observability in Multi-Agent Systems

Ron Solomon, Yarin Yerushalmi Levi, Lior Vaknin, Eran Aizikovich, Amit Baras, Etai Ohana, Amit Giloni, Shamik Bose, Chiara Picardi, Yuval Elovici, Asaf Shabtai

Main category: cs.AI

Abstract: Summary unavailable; the arXiv API request for 2508.12412 returned HTTP 429 (rate limited).

[320] A Reduction of Input/Output Logics to SAT

Alexander Steen

Main category: cs.AI

Abstract: Summary unavailable; the arXiv API request for 2508.16242 returned HTTP 429 (rate limited).

[321] Once4All: Skeleton-Guided SMT Solver Fuzzing with LLM-Synthesized Generators

Maolin Sun, Yibiao Yang, Yuming Zhou

Main category: cs.AI

Abstract: Summary unavailable; the arXiv API request for 2508.20340 returned HTTP 429 (rate limited).

[322] Efficient Ensemble Conditional Independence Test Framework for Causal Discovery

Zhengkang Guan, Kun Kuang

Main category: cs.AI

Abstract: Summary unavailable; the arXiv API request for 2509.21021 returned HTTP 429 (rate limited).

[323] Context and Diversity Matter: The Emergence of In-Context Learning in World Models

Fan Wang, Zhiyuan Chen, Yuxuan Zhong, Sunjian Zheng, Pengtao Shao, Bo Yu, Shaoshan Liu, Jianan Wang, Ning Ding, Yang Cao, Yu Kang

Main category: cs.AI

Abstract: Summary unavailable; the arXiv API request for 2509.22353 returned HTTP 429 (rate limited).

[324] BEV-VLM: Trajectory Planning via Unified BEV Abstraction

Guancheng Chen, Sheng Yang, Tong Zhan, Jian Wang

Main category: cs.AI

Abstract: Summary unavailable; the arXiv API request for 2509.25249 returned HTTP 429 (rate limited).

[325] Embracing Discrete Search: A Reasonable Approach to Causal Structure Learning

Marcel Wienöbst, Leonard Henckel, Sebastian Weichwald

Main category: cs.AI

Abstract: Summary unavailable; the arXiv API request for 2510.04970 returned HTTP 429 (rate limited).

[326] CMT-Benchmark: A Benchmark for Condensed Matter Theory Built by Expert Researchers

Haining Pan, James V. Roggeveen, Erez Berg, Juan Carrasquilla, Debanjan Chowdhury, Surya Ganguli, Federico Ghimenti, Juraj Hasik, Henry Hunt, Hong-Chen Jiang, Mason Kamb, Ying-Jer Kao, Ehsan Khatami, Michael J. Lawler, Di Luo, Titus Neupert, Xiaoliang Qi, Michael P. Brenner, Eun-Ah Kim

Main category: cs.AI

Abstract: Summary unavailable; the arXiv API request for 2510.05228 returned HTTP 429 (rate limited).

[327] Permutation-Invariant Representation Learning for Robust and Privacy-Preserving Feature Selection

Rui Liu, Tao Zhe, Yanjie Fu, Feng Xia, Ted Senator, Dongjie Wang

Main category: cs.AI

Abstract: Summary unavailable; the arXiv API request for 2510.05535 returned HTTP 429 (rate limited).

[328] Carré du champ flow matching: better quality-generalisation tradeoff in generative models

Jacob Bamberger, Iolo Jones, Dennis Duncan, Michael M. Bronstein, Pierre Vandergheynst, Adam Gosztolai

Main category: cs.AI

Abstract: Summary unavailable; the arXiv API request for 2510.05930 returned HTTP 429 (rate limited).

[329] DropVLA: An Action-Level Backdoor Attack on Vision–Language–Action Models

Zonghuan Xu, Xiang Zheng, Xingjun Ma, Yu-Gang Jiang

Main category: cs.AI

Abstract: Summary unavailable; the arXiv API request for 2510.10932 returned HTTP 429 (rate limited).

[330] Thompson Sampling via Fine-Tuning of LLMs

Nicolas Menet, Aleksandar Terzić, Michael Hersche, Andreas Krause, Abbas Rahimi

Main category: cs.AI

Abstract: Summary unavailable; the arXiv API request for 2510.13328 returned HTTP 429 (rate limited).

[331] Adversarial Fine-tuning in Offline-to-Online Reinforcement Learning for Robust Robot Control

Shingo Ayabe, Hiroshi Kera, Kazuhiko Kawamoto

Main category: cs.AI

Abstract: Summary unavailable; the arXiv API request for 2510.13358 returned HTTP 429 (rate limited).

[332] Asymptotically Stable Quaternion-valued Hopfield-structured Neural Network with Periodic Projection-based Supervised Learning Rules

Tianwei Wang, Xinhui Ma, Wei Pang

Main category: cs.AI

Abstract: Summary unavailable; the arXiv API request for 2510.16607 returned HTTP 429 (rate limited).

[333] User Misconceptions of LLM-Based Conversational Programming Assistants

Gabrielle O’Brien, Antonio Pedro Santos Alves, Sebastian Baltes, Grischa Liebel, Mircea Lungu, Marcos Kalinowski

Main category: cs.AI

Abstract: Summary unavailable; the arXiv API request for 2510.25662 returned HTTP 429 (rate limited).

[334] NuBench: An Open Benchmark for Deep Learning-Based Event Reconstruction in Neutrino Telescopes

Rasmus F. Orsoe, Stephan Meighen-Berger, Jeffrey Lazar, Jorge Prado, Ivan Mozun-Mateo, Aske Rosted, Philip Weigel, Arturo Llorente Anaya

Main category: cs.AI

Abstract: Summary unavailable; the arXiv API request for 2511.13111 returned HTTP 429 (rate limited).

[335] DiffuMamba: High-Throughput Diffusion LMs with Mamba Backbone

Vaibhav Singh, Oleksiy Ostapenko, Pierre-André Noël, Eugene Belilovsky, Torsten Scholak

Main category: cs.AI

Abstract: Summary unavailable; the arXiv API request for 2511.15927 returned HTTP 429 (rate limited).

[336] MEDIC: a network for monitoring data quality in collider experiments

Juvenal Bassa, Arghya Chattopadhyay, Sudhir Malik, Mario Escabi Rivera

Main category: cs.AI

Abstract: Summary unavailable; the arXiv API request for 2511.18172 returned HTTP 429 (rate limited).

[337] Heterogeneous Multi-Agent Reinforcement Learning with Attention for Cooperative and Scalable Feature Transformation

Tao Zhe, Huazhen Fang, Kunpeng Liu, Qian Lou, Tamzidul Hoque, Dongjie Wang

Main category: cs.AI

Abstract: Summary unavailable; the arXiv API request for 2511.21934 returned HTTP 429 (rate limited).

[338] Joint Estimation of Sea State and Vessel Parameters Using a Mass-Spring-Damper Equivalence Model

Ranjeet K. Tiwari, Daniel Sgarioto, Peter Graham, Alexei Skvortsov, Sanjeev Arulampalam, Damith C. Ranasinghe

Main category: cs.AI

Abstract: Summary unavailable; the arXiv API request for 2511.21997 returned HTTP 429 (rate limited).

[339] VCWorld: A Biological World Model for Virtual Cell Simulation

Zhijian Wei, Runze Ma, Zichen Wang, Zhongmin Li, Shuotong Song, Shuangjia Zheng

Main category: cs.AI

Abstract: Summary unavailable; the arXiv API request for 2512.00306 returned HTTP 429 (rate limited).

[340] Rough Sets for Explainability of Spectral Graph Clustering

Bartłomiej Starosta, Sławomir T. Wierzchoń, Piotr Borkowski, Dariusz Czerski, Marcin Sydow, Eryk Laskowski, Mieczysław A. Kłopotek

Main category: cs.AI

Abstract: Summary unavailable; the arXiv API request for 2512.12436 returned HTTP 429 (rate limited).

[341] QKAN-LSTM: Quantum-inspired Kolmogorov-Arnold Long Short-term Memory

Yu-Chao Hsu, Jiun-Cheng Jiang, Chun-Hua Lin, Kuo-Chung Peng, Nan-Yow Chen, Samuel Yen-Chi Chen, En-Jui Kuo, Hsi-Sheng Goan

Main category: cs.AI

Abstract: Summary unavailable; the arXiv API request for 2512.05049 returned HTTP 429 (rate limited).

[342] Smoothing DiLoCo with Primal Averaging for Faster Training of LLMs

Aaron Defazio, Konstantin Mishchenko, Parameswaran Raman, Hao-Jun Michael Shi, Lin Xiao

Main category: cs.AI

Abstract: Summary unavailable; the arXiv API request for 2512.17131 returned HTTP 429 (rate limited).

[343] WisPaper: Your AI Scholar Search Engine

Li Ju, Jun Zhao, Mingxu Chai, Ziyu Shen, Xiangyang Wang, Yage Geng, Chunchun Ma, Hao Peng, Guangbin Li, Tao Li, Chengyong Liao, Fu Wang, Xiaolong Wang, Junshen Chen, Rui Gong, Shijia Liang, Feiyan Li, Ming Zhang, Kexin Tan, Junjie Ye, Zhiheng Xi, Shihan Dou, Tao Gui, Yuankai Ying, Yang Shi, Yue Zhang, Qi Zhang

Main category: cs.AI

Abstract: Summary unavailable; the arXiv API request for 2512.06879 returned HTTP 429 (rate limited).

[344] Trust Region Masking for Long-Horizon LLM Reinforcement Learning

Yingru Li, Jiacai Liu, Jiawei Xu, Yuxuan Tong, Ziniu Li, Qian Liu, Baoxiang Wang

Main category: cs.AI

Abstract: Summary unavailable; the arXiv API request for 2512.23075 returned HTTP 429 (rate limited).

[345] LIA: Supervised Fine-Tuning of Large Language Models for Automatic Issue Assignment

Arsham Khosravani, Alireza Hoseinpour, Arshia Akhavan, Mehdi Keshani, Abbas Heydarnoori

Main category: cs.AI

Abstract: Summary unavailable; the arXiv API request for 2601.01780 returned HTTP 429 (rate limited).

[346] DUET: Distilled LLM Unlearning from an Efficiently Contextualized Teacher

Yisheng Zhong, Zhengbang Yang, Zhuangdi Zhu

Main category: cs.AI

Abstract: Summary unavailable; the arXiv API request for 2601.21283 returned HTTP 429 (rate limited).

[347] VISTA: Knowledge-Driven Vessel Trajectory Imputation with Repair Provenance

Hengyu Liu, Tianyi Li, Haoyu Wang, Kristian Torp, Tiancheng Zhang, Yushuai Li, Christian S. Jensen

Main category: cs.AI

Abstract: Summary unavailable; the arXiv API request for 2601.06940 returned HTTP 429 (rate limited).

[348] GenAI-Net: A Generative AI Framework for Automated Biomolecular Network Design

Maurice Filo, Nicolò Rossi, Zhou Fang, Mustafa Khammash

Main category: cs.AI

Abstract: Summary unavailable; the arXiv API request for 2601.17582 returned HTTP 429 (rate limited).

[349] Embodiment-Aware Generalist Specialist Distillation for Unified Humanoid Whole-Body Control

Quanquan Peng, Yunfeng Lin, Yufei Xue, Jiangmiao Pang, Weinan Zhang

Main category: cs.AI

Abstract: Summary unavailable; the arXiv API request for 2602.02960 returned HTTP 429 (rate limited).

[350] An Empirical Study of Collective Behaviors and Social Dynamics in Large Language Model Agents

Farnoosh Hashemi, Michael W. Macy

Main category: cs.AI

Abstract: Summary unavailable; the arXiv API request for 2602.03775 returned HTTP 429 (rate limited).

[351] DECO: Decoupled Multimodal Diffusion Transformer for Bimanual Dexterous Manipulation with a Plugin Tactile Adapter

Xukun Li, Yu Sun, Lei Zhang, Bosheng Huang, Yibo Peng, Yuan Meng, Haojun Jiang, Shaoxuan Xie, Guocai Yao, Alois Knoll, Zhenshan Bing, Xinlong Wang, Zhenguo Sun

Main category: cs.AI

Abstract: Summary unavailable; the arXiv API request for 2602.05513 returned HTTP 429 (rate limited).

[352] Biases in the Blind Spot: Detecting What LLMs Fail to Mention

Iván Arcuschin, David Chanin, Adrià Garriga-Alonso, Oana-Maria Camburu

Main category: cs.AI

Abstract: Summary unavailable; the arXiv API request for 2602.10117 returned HTTP 429 (rate limited).

[353] RooflineBench: A Benchmarking Framework for On-Device LLMs via Roofline Analysis

Zhen Bi, Xueshu Chen, Luoyang Sun, Yuhang Yao, Qing Shen, Jungang Lou, Cheng Deng

Main category: cs.AI

Abstract: Summary unavailable; the arXiv API request for 2602.11506 returned HTTP 429 (rate limited).

[354] SMAC: Score-Matched Actor-Critics for Robust Offline-to-Online Transfer

Nathan Samuel de Lara, Florian Shkurti

Main category: cs.AI

Summary unavailable: the arXiv API request for 2602.17632 returned HTTP 429 (rate limited).

[355] Capabilities Ain’t All You Need: Measuring Propensities in AI

Daniel Romero-Alvarado, Fernando Martínez-Plumed, Lorenzo Pacchiardi, Hugo Save, Siddhesh Milind Pawar, Behzad Mehrbakhsh, Pablo Antonio Moreno Casares, Ben Slater, Paolo Bova, Peter Romero, Zachary R. Tyler, Jonathan Prunty, Luning Sun, Jose Hernandez-Orallo

Main category: cs.AI

Summary unavailable: the arXiv API request for 2602.18182 returned HTTP 429 (rate limited).

[356] Detecting High-Potential SMEs with Heterogeneous Graph Neural Networks

Yijiashun Qi, Hanzhe Guo, Yijiazhen Qi

Main category: cs.AI

Summary unavailable: the arXiv API request for 2602.19591 returned HTTP 429 (rate limited).

[357] Provably Safe Generative Sampling with Constricting Barrier Functions

Darshan Gadginmath, Ahmed Allibhoy, Fabio Pasqualetti

Main category: cs.AI

Summary unavailable: the arXiv API request for 2602.21429 returned HTTP 429 (rate limited).

[358] To Deceive is to Teach? Forging Perceptual Robustness via Adversarial Reinforcement Learning

Yicheng Bao, Xuhong Wang, Qiaosheng Zhang, Chaochao Lu, Xia Hu, Xin Tan

Main category: cs.AI

Summary unavailable: the arXiv API request for 2602.22227 returned HTTP 429 (rate limited).

[359] Manifold of Failure: Behavioral Attraction Basins in Language Models

Sarthak Munshi, Manish Bhatt, Vineeth Sai Narajala, Idan Habler, Ammar Al-Kahfah, Ken Huang, Blake Gatto

Main category: cs.AI

Summary unavailable: the arXiv API request for 2602.22291 returned HTTP 429 (rate limited).

[360] veScale-FSDP: Flexible and High-Performance FSDP at Scale

Zezhou Wang, Youjie Li, Zhiqi Lin, Jiacheng Yang, Cong Xie, Guanyu Feng, Zheng Zhong, Ziyue Huang, Hongyu Zhu, Zhi Zhang, Yanghua Peng, Xin Liu

Main category: cs.AI

Summary unavailable: the arXiv API request for 2602.22437 returned HTTP 429 (rate limited).

[361] Conformalized Neural Networks for Federated Uncertainty Quantification under Dual Heterogeneity

Quang-Huy Nguyen, Jiaqi Wang, Wei-Shinn Ku

Main category: cs.AI

Summary unavailable: the arXiv API request for 2602.23296 returned HTTP 429 (rate limited).

cs.SD

[362] Hello-Chat: Towards Realistic Social Audio Interactions

Yueran Hou, Peilei Jia, Zihan Sun, Qihang Lu, Wenbing Yang, Yingming Gao, Ya Li, Jun Gao

Main category: cs.SD

TL;DR: Hello-Chat is an end-to-end audio language model that addresses the robotic “read-speech” style in existing LALMs by using real-life conversation data and modality-interleaved training to achieve more natural, emotionally aligned audio generation.

DetailsMotivation: Existing Large Audio Language Models (LALMs) suffer from a disconnect between perception and expression, resulting in robotic "read-speech" that lacks the spontaneity and emotional resonance of real human interaction. The authors aim to create a more realistic, anthropomorphic audio generation model for social scenarios.

Method: The authors introduce Hello-Chat, an end-to-end audio language model that leverages a massive dataset of real-life conversations and employs a modality-interleaved training strategy to bridge the gap between perception and expression.

Result: Hello-Chat achieves state-of-the-art performance on specific audio understanding tasks and significantly outperforms existing baselines in prosodic naturalness and emotional alignment, demonstrating breakthrough anthropomorphic generation capabilities.

Conclusion: Hello-Chat represents a significant advancement toward more realistic and empathetic AI agents by addressing the perception-expression disconnect in audio language models, paving the way for next-generation conversational AI.

Abstract: Recent advancements in Large Audio Language Models (LALMs) have demonstrated exceptional performance in speech recognition and translation. However, existing models often suffer from a disconnect between perception and expression, resulting in a robotic “read-speech” style that lacks the spontaneity and emotional resonance of real human interaction. In this report, we introduce Hello-Chat, an end-to-end audio language model designed for realistic social scenarios. By leveraging a massive dataset of real-life conversations and employing a modality-interleaved training strategy, Hello-Chat achieves a breakthrough in anthropomorphic generation. Experimental results show that our model not only reaches state-of-the-art (SOTA) performance on specific audio understanding tasks but also significantly outperforms existing baselines in prosodic naturalness and emotional alignment, paving the way for the next generation of empathetic AI agents.

[363] Sonic4D: Spatial Audio Generation for Immersive 4D Scene Exploration

Siyi Xie, Hanxin Zhu, Xinyi Chen, Tianyu He, Xin Li, Zhibo Chen

Main category: cs.SD

TL;DR: Sonic4D is a framework for generating spatial audio synchronized with 4D dynamic scenes, enabling immersive audiovisual experiences by localizing sound sources in 4D scenes and synthesizing physics-based spatial audio.

DetailsMotivation: Existing 4D generation methods focus only on visual synthesis while overlooking spatial audio generation, creating a limitation for truly immersive audiovisual experiences. There's a need to bridge this gap between visual and auditory modalities.

Method: Three-stage framework: 1) Generate 4D scene and monaural audio from monocular video using pre-trained models, 2) Localize and track sound sources in 4D scene via pixel-level visual grounding to estimate 3D coordinates, 3) Synthesize spatial audio using physics-based simulation based on estimated sound source locations.
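
The third stage above is, at its core, a direction-dependent encoding of a mono source into the four FOA channels. A minimal sketch of that encoding step (standard first-order ambisonics panning in ACN channel order with SN3D normalization; the static single-source setup and function name are illustrative, not from the paper):

```python
import numpy as np

def encode_foa(mono: np.ndarray, azimuth: float, elevation: float) -> np.ndarray:
    """Pan a mono signal to first-order ambisonics (ACN order W, Y, Z, X; SN3D).

    azimuth/elevation are in radians, from the listener's viewpoint.
    Returns an array of shape (4, n_samples).
    """
    w = mono                                            # omnidirectional
    y = mono * np.sin(azimuth) * np.cos(elevation)      # left-right
    z = mono * np.sin(elevation)                        # up-down
    x = mono * np.cos(azimuth) * np.cos(elevation)      # front-back
    return np.stack([w, y, z, x])

# A 1 kHz tone arriving from the listener's left (azimuth = +90 degrees).
t = np.linspace(0, 0.01, 480, endpoint=False)
foa = encode_foa(np.sin(2 * np.pi * 1000 * t), azimuth=np.pi / 2, elevation=0.0)
assert foa.shape == (4, 480)
```

In the full pipeline, the azimuth/elevation per frame would come from the tracked 4D source trajectory, and room effects (occlusion, reflections, reverberation) would be applied on top of this direct-path encoding.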

Result: The method generates realistic spatial audio consistent with synthesized 4D scenes in a training-free manner, significantly enhancing immersive experience. Extensive experiments demonstrate effectiveness.

Conclusion: Sonic4D successfully bridges the gap between visual and auditory modalities in 4D generation, enabling truly immersive audiovisual experiences by synchronizing spatial audio with dynamic 3D scenes.

Abstract: Recent advancements in 4D generation have demonstrated its remarkable capability in synthesizing photorealistic renderings of dynamic 3D scenes. However, despite achieving impressive visual performance, almost all existing methods overlook the generation of spatial audio aligned with the corresponding 4D scenes, posing a significant limitation to truly immersive audiovisual experiences. To mitigate this issue, we propose Sonic4D, a novel framework that enables spatial audio generation for immersive exploration of 4D scenes. Specifically, our method is composed of three stages: 1) To capture both the dynamic visual content and raw auditory information from a monocular video, we first employ pre-trained expert models to generate the 4D scene and its corresponding monaural audio. 2) Subsequently, to transform the monaural audio into spatial audio, we localize and track the sound sources within the 4D scene, where their 3D spatial coordinates at different timestamps are estimated via a pixel-level visual grounding strategy. 3) Based on the estimated sound source locations, we further synthesize plausible spatial audio that varies across different viewpoints and timestamps using physics-based simulation. Extensive experiments have demonstrated that our proposed method generates realistic spatial audio consistent with the synthesized 4D scene in a training-free manner, significantly enhancing the immersive experience for users. Generated audio and video examples are available at https://x-drunker.github.io/Sonic4D-project-page.

[364] DashengTokenizer: One layer is enough for unified audio understanding and generation

Heinrich Dinkel, Xingwei Sun, Gang Li, Jiahao Mei, Yadong Niu, Jizhong Liu, Xiyang Li, Yifan Liao, Jiahao Zhou, Junbo Zhang, Jian Luan

Main category: cs.SD

TL;DR: DashengTokenizer is a continuous audio tokenizer that uses frozen semantic features with injected acoustic information for joint audio understanding and generation tasks, outperforming previous methods across diverse tasks.

DetailsMotivation: Current audio tokenizers typically train acoustic tokenizers first and then integrate semantic knowledge. This paper proposes inverting this paradigm by leveraging frozen semantic features and injecting acoustic information to create a more effective joint understanding and generation tokenizer.

Method: Builds a continuous audio tokenizer on a frozen semantic feature backbone and injects acoustic information into it, inverting the conventional pipeline of training an acoustic tokenizer first and integrating semantic knowledge afterward, and contrasting with VAE-based architectures for generation.

Result: Outperforms previous audio codec and encoder baselines across 22 diverse tasks in linear evaluation, maintains competitive audio reconstruction quality, and shows improved performance on speech emotion recognition, music understanding, and acoustic scene classification. Also surpasses VAE-based methods on text-to-audio and text-to-music tasks while being effective on speech enhancement.

Conclusion: DashengTokenizer demonstrates that acoustic injection into frozen semantic features creates an effective joint audio understanding and generation tokenizer, challenging the assumption that VAE-based architectures are necessary for audio synthesis.

Abstract: This paper introduces DashengTokenizer, a continuous audio tokenizer engineered for joint use in both understanding and generation tasks. Unlike conventional approaches, which train acoustic tokenizers and subsequently integrate frozen semantic knowledge, our method inverts this paradigm: we leverage frozen semantic features and inject acoustic information. In linear evaluation across 22 diverse tasks, our method outperforms previous audio codec and audio encoder baselines by a significant margin while maintaining competitive audio reconstruction quality. Notably, we demonstrate that this acoustic injection improves performance for tasks such as speech emotion recognition, music understanding, and acoustic scene classification. We further evaluate the tokenizer’s generative performance on text-to-audio (TTA), text-to-music (TTM), and speech enhancement (SE). Our approach surpasses standard variational autoencoder (VAE)-based methods on TTA and TTM tasks, while its effectiveness on SE underscores its capabilities as a general-purpose audio encoder. Finally, our results challenge the prevailing assumption that VAE-based architectures are a prerequisite for audio synthesis. Checkpoints are available at https://huggingface.co/mispeech/dashengtokenizer.

[365] Leveraging large multimodal models for audio-video deepfake detection: a pilot study

Songjun Cao, Yuqi Li, Yunpeng Luo, Jianjun Yin, Long Ma

Main category: cs.SD

TL;DR: AV-LMMDetect: A large multimodal model fine-tuned for audio-visual deepfake detection using Qwen 2.5 Omni, achieving state-of-the-art performance on benchmark datasets.

DetailsMotivation: Current multimodal deepfake detectors are small, task-specific models that work well on curated tests but scale poorly and generalize weakly across domains. There's a need for more robust, scalable solutions for audio-visual deepfake detection.

Method: Fine-tunes Qwen 2.5 Omni as a large multimodal model for audio-visual deepfake detection, casting it as a prompted yes/no classification (“Is this video real or fake?”). Uses two-stage training: lightweight LoRA alignment followed by full fine-tuning of audio-visual encoders.
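
Casting detection as a prompted yes/no question means the final label is parsed from the model's text output. A minimal sketch of that parsing step (the paper's actual prompt template and answer handling may differ; this parser is our own illustration):

```python
def parse_verdict(model_output: str) -> str:
    """Map a free-form answer to the prompt "Is this video real or fake?"
    onto a real/fake label. Unrecognized outputs raise so that upstream
    code can retry or flag the sample instead of silently guessing.
    """
    text = model_output.strip().lower()
    if "fake" in text:
        return "fake"
    if "real" in text:
        return "real"
    raise ValueError(f"unparseable verdict: {model_output!r}")

assert parse_verdict("This video is fake.") == "fake"
assert parse_verdict("Real") == "real"
```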

Result: Matches or surpasses prior methods on FakeAVCeleb and Mavos-DD, setting a new state of the art on Mavos-DD.

Conclusion: AV-LMMDetect demonstrates that large multimodal models can be effectively adapted for audio-visual deepfake detection, offering better scalability and generalization than small task-specific models.

Abstract: Audio-visual deepfake detection (AVD) is increasingly important as modern generators can fabricate convincing speech and video. Most current multimodal detectors are small, task-specific models: they work well on curated tests but scale poorly and generalize weakly across domains. We introduce AV-LMMDetect, a supervised fine-tuned (SFT) large multimodal model that casts AVD as a prompted yes/no classification - “Is this video real or fake?”. Built on Qwen 2.5 Omni, it jointly analyzes audio and visual streams for deepfake detection and is trained in two stages: lightweight LoRA alignment followed by audio-visual encoder full fine-tuning. On FakeAVCeleb and Mavos-DD, AV-LMMDetect matches or surpasses prior methods and sets a new state of the art on Mavos-DD datasets.

[366] AudioCapBench: Quick Evaluation on Audio Captioning across Sound, Music, and Speech

Jielin Qiu, Jianguo Zhang, Zixiang Chen, Liangwei Yang, Ming Zhu, Juntao Tan, Haolin Chen, Wenting Zhao, Rithesh Murthy, Roshan Ram, Akshara Prabhakar, Shelby Heinecke, Caiming Xiong, Silvio Savarese, Huan Wang

Main category: cs.SD

TL;DR: AudioCapBench is a benchmark for evaluating audio captioning capabilities of large multimodal models across environmental sound, music, and speech domains with 1,000 curated samples.

DetailsMotivation: There's a need for standardized evaluation of audio captioning capabilities in large multimodal models to assess their understanding and generation of textual descriptions from audio inputs across different domains.

Method: Created a benchmark with 1,000 curated samples from established datasets covering three audio domains (environmental sound, music, speech). Evaluated 13 models from OpenAI and Google Gemini using both traditional reference-based metrics (METEOR, BLEU, ROUGE-L) and an LLM-as-Judge framework that scores predictions on accuracy, completeness, and hallucination dimensions.
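
The summary does not state how the three judge dimensions combine into the reported overall score (e.g. 6.00/10); one plausible aggregation, shown purely for illustration:

```python
def overall_score(accuracy: float, completeness: float, hallucination: float) -> float:
    """Combine per-dimension LLM-judge scores (each on a 0-10 scale).

    Hallucination is scored as the *absence* of fabricated content, so
    higher is better on all three axes; here we simply average them.
    This equal weighting is an assumption, not the benchmark's formula.
    """
    return round((accuracy + completeness + hallucination) / 3, 2)

assert overall_score(7.0, 6.0, 5.0) == 6.0
```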

Result: Gemini models generally outperform OpenAI models on overall captioning quality, with Gemini 3 Pro achieving the highest overall score (6.00/10). OpenAI models exhibit lower hallucination rates. All models perform best on speech captioning and worst on music captioning.

Conclusion: AudioCapBench provides a standardized evaluation framework for audio captioning in multimodal models, revealing performance differences between model families and highlighting music captioning as the most challenging domain. The benchmark and evaluation code are released to facilitate reproducible audio understanding research.

Abstract: We introduce AudioCapBench, a benchmark for evaluating audio captioning capabilities of large multimodal models. AudioCapBench covers three distinct audio domains, including environmental sound, music, and speech, with 1,000 curated evaluation samples drawn from established datasets. We evaluate 13 models across two providers (OpenAI, Google Gemini) using both reference-based metrics (METEOR, BLEU, ROUGE-L) and an LLM-as-Judge framework that scores predictions on three orthogonal dimensions: accuracy (semantic correctness), completeness (coverage of reference content), and hallucination (absence of fabricated content). Our results reveal that Gemini models generally outperform OpenAI models on overall captioning quality, with Gemini 3 Pro achieving the highest overall score (6.00/10), while OpenAI models exhibit lower hallucination rates. All models perform best on speech captioning and worst on music captioning. We release the benchmark as well as evaluation code to facilitate reproducible audio understanding research.

[367] Online Register for Dual-Mode Self-Supervised Speech Models: Mitigating The Lack of Future Context

Keita Goto, Takashi Maekaku, Jin Sakuma, Jinchuan Tian, Yusuke Shinohara, Shinji Watanabe

Main category: cs.SD

TL;DR: Online registers improve streaming speech recognition by acting as virtual placeholders for future context, reducing performance gap between offline and online modes.

DetailsMotivation: Dual-mode self-supervised speech models suffer from attention mismatch in streaming scenarios due to missing future context, creating performance gaps between offline and online modes.

Method: Proposed online registers - learnable tokens appended to each chunk in online mode that act as virtual placeholders for unseen future frames. Also introduced future prediction loss to guide registers to capture predictive cues.
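
The register mechanism can be sketched in a few lines: fixed-size learnable tokens are concatenated to each chunk before attention and discarded afterwards (shapes and sizes below are illustrative, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, chunk_len, n_registers = 64, 20, 4

# Learnable register embeddings (trained parameters; randomly initialized here).
registers = rng.normal(size=(n_registers, d_model))

def add_online_registers(chunk: np.ndarray) -> np.ndarray:
    """Append register tokens to one chunk of frame features.

    In online mode the model attends over [frames; registers], so the
    registers stand in for the future frames the chunk cannot see.
    """
    return np.concatenate([chunk, registers], axis=0)

chunk = rng.normal(size=(chunk_len, d_model))
extended = add_online_registers(chunk)
assert extended.shape == (chunk_len + n_registers, d_model)

# After self-attention, only the original frame positions are kept,
# so no extra output latency is introduced.
output = extended[:chunk_len]
assert output.shape == chunk.shape
```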

Result: Online registers consistently reduce performance gap between offline and online modes, achieving 3.4% relative improvement on LibriSpeech with 160 ms chunks, especially effective in low-latency settings.

Conclusion: Online registers effectively address attention mismatch in streaming speech recognition without introducing additional latency, improving performance in low-latency scenarios.

Abstract: Dual-mode self-supervised speech models (S3Ms), which are jointly pre-trained in offline and online modes, suffer from attention mismatch in streaming scenarios due to missing future context. To address this challenge, we propose online registers, learnable tokens appended to each chunk in online mode. These tokens act as virtual placeholders for unseen future frames, enabling the model to compensate for missing context without introducing additional latency. Furthermore, we introduce a future prediction loss that explicitly guides the registers to capture predictive cues, thereby enriching their ability to retain future information. Experiments on LibriSpeech and out-of-domain benchmarks demonstrate that online registers consistently reduce the performance gap between offline and online modes, achieving a 3.4% relative improvement on LibriSpeech with 160 ms chunks, especially in low-latency settings.

[368] SHINE: Sequential Hierarchical Integration Network for EEG and MEG

Xiran Xu, Yujie Yan, Xihong Wu, Jing Chen

Main category: cs.SD

TL;DR: SHINE network for MEG-based speech detection achieves state-of-the-art performance in LibriBrain Competition 2025 using hierarchical neural architecture and ensemble methods.

DetailsMotivation: Understanding how natural speech is represented in the brain is a major neuroscience challenge, with cortical envelope-following responses being crucial for speech decoding. The LibriBrain Competition provides a platform to advance MEG-based speech detection methods.

Method: Proposed Sequential Hierarchical Integration Network (SHINE) for EEG/MEG to reconstruct binary speech-silence sequences from MEG signals. In Extended Track, incorporated auxiliary reconstructions of speech envelopes and Mel spectrograms. Used ensemble methods combining SHINE with baselines (BrainMagic, AWavNet, ConvConcatNet).
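
The competition metric is frame-level F1-macro over binary speech/silence predictions, and the ensemble combines several models' frame decisions. A minimal sketch, assuming simple majority voting (the paper's actual ensembling may be weighted differently):

```python
def f1_macro_binary(pred, true):
    """Macro-averaged F1 over the two classes (speech=1, silence=0)."""
    scores = []
    for cls in (0, 1):
        tp = sum(p == cls and t == cls for p, t in zip(pred, true))
        fp = sum(p == cls and t != cls for p, t in zip(pred, true))
        fn = sum(p != cls and t == cls for p, t in zip(pred, true))
        scores.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(scores) / 2

def majority_vote(*predictions):
    """Frame-wise majority vote across model outputs (the ensembling step)."""
    return [int(sum(frames) * 2 > len(frames)) for frames in zip(*predictions)]

true = [1, 1, 0, 0, 1, 0]
ens = majority_vote([1, 1, 0, 1, 1, 0], [1, 0, 0, 0, 1, 0], [1, 1, 0, 0, 0, 0])
assert ens == [1, 1, 0, 0, 1, 0]
assert f1_macro_binary(ens, true) == 1.0
```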

Result: Achieved F1-macro scores of 0.9155 (Standard Track) and 0.9184 (Extended Track) on leaderboard test set, demonstrating state-of-the-art performance in MEG-based speech detection.

Conclusion: SHINE network with ensemble methods effectively decodes speech presence from MEG signals, advancing brain-computer interface applications for speech understanding and neural decoding.

Abstract: How natural speech is represented in the brain constitutes a major challenge for cognitive neuroscience, with cortical envelope-following responses playing a central role in speech decoding. This paper presents our approach to the Speech Detection task in the LibriBrain Competition 2025, utilizing over 50 hours of magnetoencephalography (MEG) signals from a single participant listening to LibriVox audiobooks. We introduce the proposed Sequential Hierarchical Integration Network for EEG and MEG (SHINE) to reconstruct the binary speech-silence sequences from MEG signals. In the Extended Track, we further incorporated auxiliary reconstructions of speech envelopes and Mel spectrograms to enhance training. Ensemble methods combining SHINE with baselines (BrainMagic, AWavNet, ConvConcatNet) achieved F1-macro scores of 0.9155 (Standard Track) and 0.9184 (Extended Track) on the leaderboard test set.

[369] SongSong: A Time Phonograph for Chinese SongCi Music from Thousand of Years Away

Jiajia Li, Jiliang Hu, Ziyi Pan, Chong Chen, Zuchao Li, Ping Wang, Lefei Zhang

Main category: cs.SD

TL;DR: SongSong is the first music generation model for restoring ancient Chinese SongCi music, using a pipeline approach to predict melody from text, then generate singing voice and accompaniment separately.

DetailsMotivation: Existing music generation models focus on modern pop songs but struggle with ancient music like Chinese SongCi which has distinct rhythms and styles. There's also a lack of ancient music datasets.

Method: Three-stage pipeline: 1) Predict melody from input SongCi text, 2) Generate singing voice based on predicted melody, 3) Generate accompaniment based on melody, then combine all elements. Created OpenSongSong dataset (29.9 hours of ancient SongCi music).

Result: Evaluated on 85 unseen SongCi sentences against Suno and SkyMusic. Both subjective and objective results show SongSong achieves leading performance in generating high-quality SongCi music.

Conclusion: SongSong successfully addresses the gap in ancient music generation, particularly for Chinese SongCi, demonstrating superior performance compared to existing music generation platforms.

Abstract: Recently, there have been significant advancements in music generation. However, existing models primarily focus on creating modern pop songs, making it challenging to produce ancient music with distinct rhythms and styles, such as ancient Chinese SongCi. In this paper, we introduce SongSong, the first music generation model capable of restoring Chinese SongCi to our knowledge. Our model first predicts the melody from the input SongCi, then separately generates the singing voice and accompaniment based on that melody, and finally combines all elements to create the final piece of music. Additionally, to address the lack of ancient music datasets, we create OpenSongSong, a comprehensive dataset of ancient Chinese SongCi music, featuring 29.9 hours of compositions by various renowned SongCi music masters. To assess SongSong’s proficiency in performing SongCi, we randomly select 85 SongCi sentences that were not part of the training set for evaluation against SongSong and music generation platforms such as Suno and SkyMusic. The subjective and objective outcomes indicate that our proposed model achieves leading performance in generating high-quality SongCi music.

[370] TIGER: Time-frequency Interleaved Gain Extraction and Reconstruction for Efficient Speech Separation

Mohan Xu, Kai Li, Guo Chen, Xiaolin Hu

Main category: cs.SD

TL;DR: TIGER is an efficient speech separation model with 94.3% fewer parameters and 95.3% lower computational cost than SOTA models, plus EchoSet dataset with realistic acoustic conditions for better generalization.

DetailsMotivation: Current speech separation research focuses too much on performance improvement while neglecting efficiency, which is crucial for low-latency systems. There's also a need for more realistic evaluation datasets that include complex acoustic environments like noise, reverberation, and object occlusions.

Method: Proposes TIGER (Time-frequency Interleaved Gain Extraction and Reconstruction network) that leverages prior knowledge for frequency band division and compression. Uses multi-scale selective attention for contextual features and full-frequency-frame attention to capture temporal and frequency context. Also introduces EchoSet dataset with realistic reverberation (considering object occlusions/material properties) and random speaker overlap proportions.
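
The prior-knowledge band division can be sketched as slicing the STFT frequency axis into unequal sub-bands, narrow where speech energy concentrates and wide elsewhere (these particular widths are illustrative, not TIGER's configuration):

```python
import numpy as np

def split_bands(spec: np.ndarray, band_widths) -> list:
    """Split an STFT magnitude spectrogram (freq_bins, frames) into sub-bands.

    band_widths encodes the prior: narrow bands at low frequencies where
    speech energy concentrates, wider bands above.
    """
    assert sum(band_widths) == spec.shape[0]
    edges = np.cumsum([0, *band_widths])
    return [spec[lo:hi] for lo, hi in zip(edges[:-1], edges[1:])]

# 257 bins from a 512-point STFT, split into 36 bands of increasing width.
spec = np.abs(np.random.default_rng(0).normal(size=(257, 100)))
bands = split_bands(spec, band_widths=[2] * 16 + [8] * 16 + [16, 16, 32, 33])
assert len(bands) == 36
```

Each sub-band can then be compressed to a fixed feature size before the interleaved attention modules, which is where the parameter and MAC savings come from.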

Result: TIGER achieves 94.3% parameter reduction and 95.3% MACs reduction while surpassing SOTA TF-GridNet performance on EchoSet and real-world data. Models trained on EchoSet show better generalization to physical world data than other datasets.

Conclusion: TIGER demonstrates that speech separation models can be made highly efficient without sacrificing performance. EchoSet provides more realistic evaluation conditions that improve model generalization to real-world acoustic environments.

Abstract: In recent years, much speech separation research has focused primarily on improving model performance. However, for low-latency speech processing systems, high efficiency is equally important. Therefore, we propose a speech separation model with significantly reduced parameters and computational costs: Time-frequency Interleaved Gain Extraction and Reconstruction network (TIGER). TIGER leverages prior knowledge to divide frequency bands and compresses frequency information. We employ a multi-scale selective attention module to extract contextual features while introducing a full-frequency-frame attention module to capture both temporal and frequency contextual information. Additionally, to more realistically evaluate the performance of speech separation models in complex acoustic environments, we introduce a dataset called EchoSet. This dataset includes noise and more realistic reverberation (e.g., considering object occlusions and material properties), with speech from two speakers overlapping at random proportions. Experimental results showed that models trained on EchoSet had better generalization ability than those trained on other datasets compared to the data collected in the physical world, which validated the practical value of the EchoSet. On EchoSet and real-world data, TIGER significantly reduces the number of parameters by 94.3% and the MACs by 95.3% while achieving performance surpassing the state-of-the-art (SOTA) model TF-GridNet.

[371] VoiceBridge: General Speech Restoration with One-step Latent Bridge Models

Chi Zhang, Kaiwen Zheng, Zehua Chen, Jun Zhu

Main category: cs.SD

TL;DR: VoiceBridge is a one-step latent bridge model for general speech restoration that reconstructs 48kHz fullband speech from diverse distortions using a single latent-to-latent generative process.

DetailsMotivation: Existing speech enhancement bridge models are mostly single-task with limited general speech restoration capability. There's a need for a unified model that can handle various speech restoration tasks efficiently without task-specific distillation.

Method: Proposes VoiceBridge with: 1) Energy-preserving variational autoencoder for better waveform-latent alignment, 2) Single latent-to-latent generative process using scalable transformer for multiple GSR tasks, 3) Joint neural prior to reduce burden from different low-quality priors, 4) Joint training of LBM, decoder and discriminator without distillation.

Result: Superior performance across in-domain tasks (denoising, super-resolution) and out-of-domain tasks (refining synthesized speech) on various datasets, demonstrating effective one-step general speech restoration.

Conclusion: VoiceBridge enables efficient one-step general speech restoration for 48kHz fullband speech from diverse distortions using a unified latent bridge model approach without requiring task-specific distillation.

Abstract: Bridge models have been investigated in speech enhancement but are mostly single-task, with constrained general speech restoration (GSR) capability. In this work, we propose VoiceBridge, a one-step latent bridge model (LBM) for GSR, capable of efficiently reconstructing 48 kHz fullband speech from diverse distortions. To inherit the advantages of data-domain bridge models, we design an energy-preserving variational autoencoder, enhancing the waveform-latent space alignment over varying energy levels. By compressing waveform into continuous latent representations, VoiceBridge models various GSR tasks with a single latent-to-latent generative process backed by a scalable transformer. To alleviate the challenge of reconstructing the high-quality target from distinctively different low-quality priors, we propose a joint neural prior for GSR, uniformly reducing the burden of the LBM in diverse tasks. Building upon these designs, we further investigate the bridge training objective by jointly tuning the LBM, decoder and discriminator together, transforming the model from a denoiser to a generator and enabling one-step GSR without distillation. Extensive validation across in-domain (e.g., denoising and super-resolution) and out-of-domain tasks (e.g., refining synthesized speech) and datasets demonstrates the superior performance of VoiceBridge. Demos: https://VoiceBridgedemo.github.io/.

[372] DynFOA: Generating First-Order Ambisonics with Conditional Diffusion for Dynamic and Acoustically Complex 360-Degree Videos

Ziyu Luo, Lin Chen, Qiang Qu, Xiaoming Chen, Yiran Shen

Main category: cs.SD

TL;DR: DynFOA generates realistic first-order ambisonics (FOA) spatial audio from 360-degree videos using dynamic acoustic perception and conditional diffusion, addressing complex acoustic effects like occlusion, reflections, and reverberation.

DetailsMotivation: Current methods for generating spatial audio from 360-degree videos fail to handle dynamic sound sources and complex environmental acoustic effects (occlusion, reflections, reverberation) influenced by scene geometries and materials, limiting immersive VR experiences.

Method: Uses video encoder to detect/localize dynamic sound sources, estimate depth/semantics, and reconstruct 3D scene geometry/materials via Gaussian Splatting. Audio encoder captures spatial motion and 4D sound trajectories to fine-tune diffusion-based FOA generator that adjusts spatial cues in real-time.

Result: Extensive evaluations show DynFOA outperforms existing methods in spatial accuracy, acoustic fidelity, and distribution matching, while improving user experience for VR/immersive media applications.

Conclusion: DynFOA provides robust, scalable approach for rendering realistic dynamic spatial audio from 360-degree videos, addressing key limitations in current spatial audio generation for immersive media.

Abstract: Spatial audio is crucial for creating compelling immersive 360-degree video experiences. However, generating realistic spatial audio, such as first-order ambisonics (FOA), from 360-degree videos in complex acoustic scenes remains challenging. Existing methods often overlook the dynamic nature and acoustic complexity of 360-degree scenes, fail to fully account for dynamic sound sources, and neglect complex environmental effects such as occlusion, reflections, and reverberation, which are influenced by scene geometries and materials. We propose DynFOA, a framework based on dynamic acoustic perception and conditional diffusion, for generating high-fidelity FOA from 360-degree videos. DynFOA first performs visual processing via a video encoder, which detects and localizes multiple dynamic sound sources, estimates their depth and semantics, and reconstructs the scene geometry and materials using 3D Gaussian Splatting. This reconstruction technique accurately models occlusion, reflections, and reverberation based on the geometries and materials of the reconstructed 3D scene and the listener’s viewpoint. The audio encoder then captures the spatial motion and temporal 4D sound source trajectories to fine-tune the diffusion-based FOA generator. The fine-tuned FOA generator adjusts spatial cues in real time, ensuring consistent directional fidelity during listener head rotation and complex environmental changes. Extensive evaluations demonstrate that DynFOA consistently outperforms existing methods across metrics such as spatial accuracy, acoustic fidelity, and distribution matching, while also improving the user experience. Therefore, DynFOA provides a robust and scalable approach to rendering realistic dynamic spatial audio for VR and immersive media applications.
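For background, FOA itself is a fixed four-channel format. A minimal mono-to-FOA panner under the AmbiX convention (ACN channel order W, Y, Z, X; SN3D normalization) can be sketched as follows; this is generic ambisonics math, not DynFOA's learned pipeline:

```python
import numpy as np

def encode_foa(mono, azimuth, elevation):
    """Encode a mono signal into first-order ambisonics
    (AmbiX convention: ACN channel order W, Y, Z, X; SN3D normalization)."""
    w = mono                                          # omnidirectional component
    y = mono * np.sin(azimuth) * np.cos(elevation)    # left-right
    z = mono * np.sin(elevation)                      # up-down
    x = mono * np.cos(azimuth) * np.cos(elevation)    # front-back
    return np.stack([w, y, z, x])

# 10 ms of a 440 Hz tone at 48 kHz, panned straight ahead
s = np.sin(2 * np.pi * 440 * np.arange(480) / 48000)
foa = encode_foa(s, azimuth=0.0, elevation=0.0)
```

For a source straight ahead, all energy lands in the W and X channels; DynFOA's generator effectively has to produce such directional cues jointly with the environmental effects.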

cs.LG

[373] Detoxifying LLMs via Representation Erasure-Based Preference Optimization

Nazanin Mohammadi Sepahvand, Eleni Triantafillou, Hugo Larochelle, Doina Precup, Daniel M. Roy, Gintare Karolina Dziugaite

Main category: cs.LG

TL;DR: REPO is a new method for detoxifying LLMs that uses representation erasure at the token level to make toxic continuations converge toward benign ones, achieving superior robustness against adversarial attacks.

DetailsMotivation: Current LLM detoxification methods (DPO, NPO) are superficial and vulnerable to adversarial prompting and fine-tuning attacks. They don't remove harmful directions in representations, making them ineffective against sophisticated threats.

Method: REPO reformulates detoxification as token-level preference optimization. It uses a novel objective with preference data to force representations of toxic continuations to converge toward their benign counterparts, inducing deep, localized edits to toxicity-encoding neurons.

Result: REPO achieves state-of-the-art robustness, stopping sophisticated threats including relearning attacks and enhanced GCG jailbreaks where existing representation- and output-based methods fail. It preserves general model utility while making deep edits.

Conclusion: REPO provides a more robust approach to LLM detoxification by working at the representation level rather than just output probabilities, making it resistant to adversarial attacks that bypass surface-level defenses.

Abstract: Large language models (LLMs) trained on web-scale data can produce toxic outputs, raising concerns for safe deployment. Prior defenses, based on applications of DPO, NPO, and similar algorithms, reduce the likelihood of harmful continuations, but not robustly so: they are vulnerable to adversarial prompting and easily undone by fine-tuning-based relearning attacks. Indeed, research has shown that these edits to the model are superficial: linear probing reveals that harmful “directions” remain present in representations. To address this, we propose Representation Erasure-based Preference Optimization (REPO), reformulating detoxification as a token-level preference problem. Using a novel objective with preference data, we force the representations of toxic continuations to converge toward their benign counterparts. Our mechanistic analysis reveals that this granular approach is critical: unlike baselines, REPO induces deep, localized edits to toxicity-encoding neurons while preserving general model utility. Exhaustive evaluations show that REPO achieves state-of-the-art robustness, stopping sophisticated threats, including relearning attacks and enhanced GCG jailbreaks, where existing representation- and output-based methods fail.
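The paper's exact objective is not reproduced in this summary; an illustrative token-level loss combining a DPO-style preference margin with a representation-erasure term (pulling toxic-token hidden states toward benign counterparts) might look like the following sketch, where the function name and weighting are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def repo_style_loss(h_toxic, h_benign, logp_toxic, logp_benign, beta=0.1, lam=1.0):
    """Illustrative only: the preference term prefers the benign continuation,
    the erasure term drives toxic-token representations toward benign ones."""
    pref = -np.log(sigmoid(beta * (logp_benign - logp_toxic)))   # DPO-style margin
    erase = np.mean((h_toxic - h_benign) ** 2)                   # per-token rep matching
    return pref + lam * erase

h = np.ones((5, 8))  # 5 tokens, 8-dim hidden states; identical reps => zero erasure
loss_aligned = repo_style_loss(h, h, logp_toxic=-4.0, logp_benign=-1.0)
```

When the toxic representations already match the benign ones, only the preference term remains; diverging representations add a strictly positive erasure penalty.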

[374] U-CAN: Utility-Aware Contrastive Attenuation for Efficient Unlearning in Generative Recommendation

Zezheng Wu, Rui Wang, Xinghe Cheng, Yang Shao, Qing Yang, Jiapu Wang, Jingwei Zhang

Main category: cs.LG

TL;DR: U-CAN is a precision unlearning framework for generative recommendation LLMs that selectively attenuates sensitive data in low-rank adapters while preserving utility, addressing the polysemy dilemma in machine unlearning.

DetailsMotivation: Fine-tuning LLMs for generative recommendation encodes sensitive user attributes into model parameters, creating privacy risks. Existing machine unlearning methods struggle with the polysemy dilemma where neurons contain both sensitive data and general reasoning patterns, leading to catastrophic utility loss when using traditional gradient or pruning approaches.

Method: U-CAN operates on low-rank adapters (LoRA) and uses utility-aware contrastive attenuation. It quantifies risk by contrasting activations between forgetting and retention sets, identifies neurons with asymmetric responses, and applies adaptive soft attenuation with a differentiable decay function to selectively down-scale high-risk parameters while preserving network connectivity.

Result: Experiments on two public datasets across seven metrics show U-CAN achieves strong privacy forgetting, utility retention, and computational efficiency compared to existing methods.

Conclusion: U-CAN provides an effective solution for privacy-preserving generative recommendation by enabling precise unlearning of sensitive data while maintaining model utility, addressing the fundamental polysemy dilemma in machine unlearning for LLMs.

Abstract: Generative Recommendation (GenRec) typically leverages Large Language Models (LLMs) to redefine personalization as an instruction-driven sequence generation task. However, fine-tuning on user logs inadvertently encodes sensitive attributes into model parameters, raising critical privacy concerns. Existing Machine Unlearning (MU) techniques struggle to navigate this tension due to the Polysemy Dilemma, where neurons superimpose sensitive data with general reasoning patterns, leading to catastrophic utility loss under traditional gradient or pruning methods. To address this, we propose Utility-aware Contrastive AttenuatioN (U-CAN), a precision unlearning framework that operates on low-rank adapters. U-CAN quantifies risk by contrasting activations and focuses on neurons with asymmetric responses that are highly sensitive to the forgetting set but suppressed on the retention set. To safeguard performance, we introduce a utility-aware calibration mechanism that combines weight magnitudes with retention-set activation norms, assigning higher utility scores to dimensions that contribute strongly to retention performance. Unlike binary pruning, which often fragments network structure, U-CAN develops adaptive soft attenuation with a differentiable decay function to selectively down-scale high-risk parameters on LoRA adapters, suppressing sensitive retrieval pathways and preserving the topological connectivity of reasoning circuits. Experiments on two public datasets across seven metrics demonstrate that U-CAN achieves strong privacy forgetting, utility retention, and computational efficiency.
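A toy version of the contrast-then-attenuate step can be sketched as follows. The sigmoid decay, the calibration weighting, and all constants are assumptions standing in for the paper's actual functions:

```python
import numpy as np

def soft_attenuation(W, acts_forget, acts_retain, k=4.0, tau=0.5):
    """Toy U-CAN-style step: contrast activations to score per-dimension risk,
    offset by a retention-based utility score, then softly down-scale columns
    of a (LoRA) weight matrix with a differentiable sigmoid decay."""
    risk = acts_forget.mean(axis=0) - acts_retain.mean(axis=0)       # asymmetric response
    utility = np.abs(W).mean(axis=0) * np.linalg.norm(acts_retain, axis=0)
    score = risk - 0.1 * utility                                     # utility-aware calibration
    scale = 1.0 / (1.0 + np.exp(k * (score - tau)))                  # smooth decay, not pruning
    return W * scale                                                 # per-dimension attenuation

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))
forget = np.abs(rng.normal(size=(32, 4))) + np.array([3.0, 0, 0, 0])  # dim 0 fires on forget set
retain = np.abs(rng.normal(size=(32, 4)))
W_new = soft_attenuation(W, forget, retain)
```

The dimension that responds asymmetrically to the forget set is scaled toward zero, while dimensions useful for retention keep most of their magnitude, which is the intended contrast with hard binary pruning.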

[375] Long Range Frequency Tuning for QML

Michael Poppel, Jonas Stein, Sebastian Wölckert, Markus Baumann, Claudia Linnhoff-Popien

Main category: cs.LG

TL;DR: Quantum machine learning with angle encoding can approximate functions via Fourier series, but trainable-frequency methods fail when target frequencies exceed reachable range; ternary grid initialization solves this with dense integer spectra.

DetailsMotivation: Trainable-frequency quantum encoding theoretically offers efficiency by matching the target spectrum size, but gradient-based optimization frequently fails in practice when target frequencies lie outside the limited reachable range of roughly +/-1 units.

Method: Proposes grid-based initialization using ternary encodings that generate dense integer frequency spectra, requiring O(log_3(omega_max)) encoding gates and ensuring target frequencies lie within the locally reachable range for gradient-based optimization.

Result: On synthetic targets with three shifted high frequencies, ternary grid initialization achieves median R^2 of 0.9969 vs 0.1841 for trainable-frequency baseline. For Flight Passengers dataset, ternary grid achieves median R^2 of 0.9671 (22.8% improvement over 0.7876).

Conclusion: Ternary grid initialization overcomes frequency reachability limitations in quantum machine learning, providing practical effectiveness where theoretical trainable-frequency approaches fail due to gradient optimization constraints.

Abstract: Quantum machine learning models using angle encoding naturally represent truncated Fourier series, providing universal function approximation capabilities with sufficient circuit depth. For unary fixed-frequency encodings, circuit depth scales as O(omega_max * (omega_max + epsilon^{-2})) with target frequency magnitude omega_max and precision epsilon. Trainable-frequency approaches theoretically reduce this to match the target spectrum size, requiring only as many encoding gates as frequencies in the target spectrum. Despite this compelling efficiency, their practical effectiveness hinges on a key assumption: that gradient-based optimization can drive prefactors to arbitrary target values. We demonstrate through systematic experiments that frequency prefactors exhibit limited trainability: movement is constrained to approximately +/-1 units with typical learning rates. When target frequencies lie outside this reachable range, optimization frequently fails. To overcome this frequency reachability limitation, we propose grid-based initialization using ternary encodings, which generate dense integer frequency spectra. While this approach requires O(log_3(omega_max)) encoding gates – more than the theoretical optimum but exponentially fewer than fixed-frequency methods – it ensures target frequencies lie within the locally reachable range. On synthetic targets with three shifted high frequencies, ternary grid initialization achieves a median R^2 score of 0.9969, compared to 0.1841 for the trainable-frequency baseline. For the real-world Flight Passengers dataset, ternary grid initialization achieves a median R^2 score of 0.9671, representing a 22.8% improvement over trainable-frequency initialization (median R^2 = 0.7876).
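The density claim is easy to verify on paper: with encoding-gate prefactors 3^0, 3^1, ..., every integer frequency up to (3^n - 1)/2 is reachable as a signed ternary sum (coefficients in {-1, 0, +1}), which a few lines of enumeration confirm:

```python
from itertools import product

def reachable_frequencies(num_gates):
    """All integer frequencies generated by ternary encoding gates with
    prefactors 3^0, 3^1, ..., combined with coefficients in {-1, 0, +1}."""
    prefactors = [3 ** k for k in range(num_gates)]
    return sorted({sum(c * p for c, p in zip(coeffs, prefactors))
                   for coeffs in product((-1, 0, 1), repeat=num_gates)})

freqs = reachable_frequencies(3)   # prefactors 1, 3, 9
```

With three gates the 27 coefficient combinations cover every integer in [-13, 13] with no gaps, which is why a target frequency always sits within +/-1 of an initialized one, inside the locally reachable range the paper identifies.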

[376] Brain-OF: An Omnifunctional Foundation Model for fMRI, EEG and MEG

Hanning Guo, Farah Abdellatif, Hanwen Bi, Andrei Galbenus, Jon. N. Shah, Abigail Morrison, Jürgen Dammers

Main category: cs.LG

TL;DR: Brain-OF is the first omnifunctional brain foundation model jointly pretrained on fMRI, EEG, and MEG data, enabling unified multimodal brain signal analysis through innovative resolution handling and dual-domain pretraining.

DetailsMotivation: Existing brain foundation models are limited to single functional modalities, missing opportunities to leverage complementary spatiotemporal dynamics and collective data scale across different brain imaging techniques.

Method: Proposes Brain-OF with: 1) Any-Resolution Neural Signal Sampler to project diverse brain signals into shared semantic space, 2) DINT attention with Sparse Mixture of Experts for modality-invariant and modality-specific representations, and 3) Masked Temporal-Frequency Modeling for dual-domain pretraining in time and frequency domains.

Result: Pretrained on ~40 datasets, Brain-OF demonstrates superior performance across diverse downstream tasks, highlighting benefits of joint multimodal integration and dual-domain pretraining.

Conclusion: Brain-OF successfully addresses multimodal brain signal integration limitations and shows the value of unified multimodal foundation models for neuroscience applications.

Abstract: Brain foundation models have achieved remarkable advances across a wide range of neuroscience tasks. However, most existing models are limited to a single functional modality, restricting their ability to exploit complementary spatiotemporal dynamics and the collective data scale across imaging techniques. To address this limitation, we propose Brain-OF, the first omnifunctional brain foundation model jointly pretrained on fMRI, EEG and MEG, capable of handling both unimodal and multimodal inputs within a unified framework. To reconcile heterogeneous spatiotemporal resolutions, we introduce the Any-Resolution Neural Signal Sampler, which projects diverse brain signals into a shared semantic space. To further manage semantic shifts, the Brain-OF backbone integrates DINT attention with a Sparse Mixture of Experts, where shared experts capture modality-invariant representations and routed experts specialize in modality-specific semantics. Furthermore, we propose Masked Temporal-Frequency Modeling, a dual-domain pretraining objective that jointly reconstructs brain signals in both the time and frequency domains. Brain-OF is pretrained on a large-scale corpus comprising around 40 datasets and demonstrates superior performance across diverse downstream tasks, highlighting the benefits of joint multimodal integration and dual-domain pretraining.
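The dual-domain objective can be illustrated with a minimal stand-in: reconstruct masked samples in the time domain and simultaneously match the spectrum via an FFT. The masking scheme and loss weighting here are assumptions, not the paper's exact formulation:

```python
import numpy as np

def masked_dual_domain_loss(pred, target, mask):
    """Illustrative masked temporal-frequency objective: penalize reconstruction
    error at masked time positions plus spectral mismatch over the whole signal
    (a stand-in for the paper's Masked Temporal-Frequency Modeling)."""
    time_loss = np.mean(((pred - target) * mask) ** 2)                # masked samples only
    freq_loss = np.mean(np.abs(np.fft.rfft(pred) - np.fft.rfft(target)) ** 2)
    return time_loss + freq_loss

t = np.linspace(0, 1, 256, endpoint=False)
target = np.sin(2 * np.pi * 10 * t)                                  # toy 10 Hz brain rhythm
mask = (np.arange(256) % 4 == 0).astype(float)                       # mask 25% of samples
zero_loss = masked_dual_domain_loss(target, target, mask)
```

A perfect reconstruction incurs zero loss in both domains; the frequency term additionally penalizes predictions that fit masked samples pointwise but distort oscillatory structure.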

[377] Rudder: Steering Prefetching in Distributed GNN Training using LLM Agents

Aishwarya Sarkar, Sayan Ghosh, Nathan Tallent, Aman Chadha, Tanya Roosta, Ali Jannesari

Main category: cs.LG

TL;DR: Rudder uses LLMs for adaptive prefetching in distributed GNN training, improving end-to-end performance by up to 91% over baseline and 82% over static prefetching

DetailsMotivation: Distributed GNN training suffers from frequent irregular communication stalls due to neighbor sampling. Static prefetching methods fail to adapt to dynamic conditions like graph structure, distribution, sampling parameters, and caching policies.

Method: Rudder embeds an LLM with emergent In-Context Learning capabilities to autonomously prefetch remote nodes in the AWS DistDGL framework, exploiting generative AI’s logical multi-step reasoning for adaptive control even under substantial undertraining.

Result: Up to 91% improvement in end-to-end training performance over baseline DistDGL (no prefetching) and an 82% improvement over static prefetching, reducing communication by over 50% on the NERSC Perlmutter supercomputer with standard datasets.

Conclusion: LLMs’ emergent reasoning capabilities are well-suited for adaptive control in distributed systems, enabling significant performance improvements in GNN training through intelligent prefetching.

Abstract: Large-scale Graph Neural Networks (GNNs) are typically trained by sampling a vertex’s neighbors to a fixed distance. Because large input graphs are distributed, training requires frequent irregular communication that stalls forward progress. Moreover, fetched data changes with graph, graph distribution, sample and batch parameters, and caching policies. Consequently, any static prefetching method will miss crucial opportunities to adapt to different dynamic conditions. In this paper, we introduce Rudder, a software module embedded in the state-of-the-art AWS DistDGL framework, to autonomously prefetch remote nodes and minimize communication. Rudder’s adaptation contrasts with both standard heuristics and traditional ML classifiers. We observe that the generative AI found in contemporary Large Language Models (LLMs) exhibits emergent properties like In-Context Learning (ICL) for zero-shot tasks, with logical multi-step reasoning. We find this behavior well-suited for adaptive control even with substantial undertraining. Evaluations using standard datasets and unseen configurations on the NERSC Perlmutter supercomputer show up to 91% improvement in end-to-end training performance over baseline DistDGL (no prefetching), and an 82% improvement over static prefetching, reducing communication by over 50%. Our code is available at https://github.com/aishwaryyasarkar/rudder-llm-agent.

[378] EvoX: Meta-Evolution for Automated Discovery

Shu Liu, Shubham Agarwal, Monishwaran Maheswaran, Mert Cemri, Zhifei Li, Qiuyang Mang, Ashwin Naren, Ethan Boneh, Audrey Cheng, Melissa Z. Pan, Alexander Du, Kurt Keutzer, Alexandros G. Dimakis, Koushik Sen, Matei Zaharia, Ion Stoica

Main category: cs.LG

TL;DR: EvoX introduces an adaptive evolutionary optimization method that jointly evolves candidate solutions and the search strategies themselves, enabling dynamic adaptation of exploration-exploitation trade-offs during optimization.

DetailsMotivation: Existing LLM-driven evolutionary methods use fixed search strategies with static parameters that don't adapt to changing search spaces or task requirements, limiting their effectiveness across diverse optimization problems.

Method: EvoX jointly evolves both candidate solutions and the search strategies used to generate them, continuously updating how prior solutions are selected and varied based on optimization progress, enabling dynamic adaptation of search strategies.

Result: EvoX outperforms existing AI-driven evolutionary methods (AlphaEvolve, OpenEvolve, GEPA, ShinkaEvolve) on the majority of nearly 200 real-world optimization tasks.

Conclusion: Jointly evolving solutions and search strategies enables more effective adaptive optimization that can outperform fixed-strategy evolutionary methods across diverse real-world tasks.

Abstract: Recent work such as AlphaEvolve has shown that combining LLM-driven optimization with evolutionary search can effectively improve programs, prompts, and algorithms across domains. In this paradigm, previously evaluated solutions are reused to guide the model toward new candidate solutions. Crucially, the effectiveness of this evolution process depends on the search strategy: how prior solutions are selected and varied to generate new candidates. However, most existing methods rely on fixed search strategies with predefined knobs (e.g., explore-exploit ratios) that remain static throughout execution. While effective in some settings, these approaches often fail to adapt across tasks, or even within the same task as the search space changes over time. We introduce EvoX, an adaptive evolution method that optimizes its own evolution process. EvoX jointly evolves candidate solutions and the search strategies used to generate them, continuously updating how prior solutions are selected and varied based on progress. This enables the system to dynamically shift between different search strategies during the optimization process. Across nearly 200 real-world optimization tasks, EvoX outperforms existing AI-driven evolutionary methods including AlphaEvolve, OpenEvolve, GEPA, and ShinkaEvolve on the majority of tasks.
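EvoX's internals are not reproduced here, but its core idea, evolving the search strategy alongside the solutions, has a classical miniature in self-adaptive evolution strategies, where each candidate carries its own mutation step size and a winning child passes its strategy on:

```python
import random

def self_adaptive_es(f, x0, sigma0=1.0, steps=200, seed=0):
    """(1+1)-ES in which the mutation step size (the 'search strategy')
    is itself mutated and inherited whenever the child wins."""
    rng = random.Random(seed)
    x, fx, sigma = x0, f(x0), sigma0
    for _ in range(steps):
        sigma_child = sigma * (2.0 ** rng.uniform(-1.0, 1.0))  # evolve the strategy
        child = x + sigma_child * rng.gauss(0.0, 1.0)          # evolve the solution
        fc = f(child)
        if fc <= fx:                                           # child inherits its own sigma
            x, fx, sigma = child, fc, sigma_child
    return x, fx

best, val = self_adaptive_es(lambda x: (x - 3.0) ** 2, x0=-10.0)
```

The step size grows while the optimum is far away and shrinks as the search closes in, an explore-exploit schedule that is discovered rather than fixed in advance; EvoX applies the same principle to LLM-driven selection and variation strategies.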

[379] Dynamics of Learning under User Choice: Overspecialization and Peer-Model Probing

Adhyyan Narang, Sarah Dean, Lillian J Ratliff, Maryam Fazel

Main category: cs.LG

TL;DR: Paper proposes knowledge distillation-based algorithm to prevent overspecialization trap in multi-platform ML where learners only see data from users who prefer them, enabling convergence to models with good global performance.

DetailsMotivation: In multi-platform ML deployments, each platform only sees data from users who prefer it, causing overspecialization trap where platforms optimize for their existing users and become less attractive to others, leading to poor global performance despite good local optimization.

Method: Proposes an algorithm in which learners “probe” the predictions of peer models using knowledge distillation techniques, allowing them to learn about users who don’t select them. Requires probing sources to be sufficiently informative (e.g., a market leader or a majority of peers with good global performance).

Result: The algorithm converges almost surely to stationary points with bounded full-population risk when probing sources are sufficiently informative. Verified with semi-synthetic experiments on MovieLens, Census, and Amazon Sentiment datasets.

Conclusion: Knowledge distillation-based probing enables learners to escape overspecialization trap and achieve good global performance in multi-platform ML settings by learning from peer models about users outside their immediate data distribution.

Abstract: In many economically relevant contexts where machine learning is deployed, multiple platforms obtain data from the same pool of users, each of whom selects the platform that best serves them. Prior work in this setting focuses exclusively on the “local” losses of learners on the distribution of data that they observe. We find that there exist instances where learners who use existing algorithms almost surely converge to models with arbitrarily poor global performance, even when models with low full-population loss exist. This happens through a feedback-induced mechanism, which we call the overspecialization trap: as learners optimize for users who already prefer them, they become less attractive to users outside this base, which further restricts the data they observe. Inspired by the recent use of knowledge distillation in modern ML, we propose an algorithm that allows learners to “probe” the predictions of peer models, enabling them to learn about users who do not select them. Our analysis characterizes when probing succeeds: this procedure converges almost surely to a stationary point with bounded full-population risk when probing sources are sufficiently informative, e.g., a known market leader or a majority of peers with good global performance. We verify our findings with semi-synthetic experiments on the MovieLens, Census, and Amazon Sentiment datasets.
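Probing a peer is essentially knowledge distillation on users the learner never observes. A minimal soft-label cross-entropy loss (illustrative, not the paper's exact objective) looks like:

```python
import numpy as np

def probe_loss(student_probs, peer_probs, eps=1e-12):
    """Cross-entropy of the student against a peer model's soft predictions,
    computed on users who selected the peer rather than the student."""
    return -np.mean(np.sum(peer_probs * np.log(student_probs + eps), axis=1))

peer = np.array([[0.8, 0.2], [0.3, 0.7]])        # peer's predictions on its own users
matched = probe_loss(peer, peer)                 # student already agrees with the peer
mismatched = probe_loss(peer[:, ::-1], peer)     # student's probabilities flipped
```

By Gibbs' inequality the loss is minimized exactly when the student matches the peer, so minimizing it pulls the student toward the peer's behavior on the user segment it cannot observe directly, which is what lets learners escape the overspecialization trap.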

[380] Human Supervision as an Information Bottleneck: A Unified Theory of Error Floors in Human-Guided Learning

Alejandro Rodriguez Dominguez

Main category: cs.LG

TL;DR: Theoretical paper showing human supervision creates inherent error floors in AI systems due to annotation noise, preference distortion, and semantic compression, which scaling alone cannot overcome.

DetailsMotivation: LLMs trained on human data exhibit persistent errors despite scaling, suggesting limitations in the human supervision channel itself rather than model capacity or optimization.

Method: Develops unified theory across six frameworks (operator theory, PAC-Bayes, information theory, causal inference, category theory, game theory) showing human supervision acts as information-reducing channel, creating strictly positive excess-risk floors.

Result: Theoretical predictions confirmed: human-only supervision shows persistent error floors, while auxiliary non-human signals (retrieval, program execution, tools) can collapse these floors by restoring information about latent targets.

Conclusion: Scaling alone cannot eliminate human-aligned errors; auxiliary non-human supervision channels are needed to overcome inherent limitations of human supervision.

Abstract: Large language models are trained primarily on human-generated data and feedback, yet they exhibit persistent errors arising from annotation noise, subjective preferences, and the limited expressive bandwidth of natural language. We argue that these limitations reflect structural properties of the supervision channel rather than model scale or optimization. We develop a unified theory showing that whenever the human supervision channel is not sufficient for a latent evaluation target, it acts as an information-reducing channel that induces a strictly positive excess-risk floor for any learner dominated by it. We formalize this Human-Bounded Intelligence limit and show that across six complementary frameworks (operator theory, PAC-Bayes, information theory, causal inference, category theory, and game-theoretic analyses of reinforcement learning from human feedback), non-sufficiency yields strictly positive lower bounds arising from the same structural decomposition into annotation noise, preference distortion, and semantic compression. The theory explains why scaling alone cannot eliminate persistent human-aligned errors and characterizes conditions under which auxiliary non-human signals (e.g., retrieval, program execution, tools) increase effective supervision capacity and collapse the floor by restoring information about the latent target. Experiments on real preference data, synthetic known-target tasks, and externally verifiable benchmarks confirm the predicted structural signatures: human-only supervision exhibits a persistent floor, while sufficiently informative auxiliary channels strictly reduce or eliminate excess error.
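The structural argument compresses into two standard inequalities (a generic sketch of this style of bound, not the paper's exact statement). If the learner's prediction $\hat{Y}$ depends on the latent target $Y$ only through the supervision signal $S$, the Markov chain $Y \to S \to \hat{Y}$ combined with the data-processing and Fano inequalities gives:

```latex
I(Y;\hat{Y}) \;\le\; I(Y;S),
\qquad
P(\hat{Y} \ne Y) \;\ge\; \frac{H(Y) - I(Y;S) - 1}{\log |\mathcal{Y}|}.
```

Whenever the channel is not sufficient, i.e. $I(Y;S)$ falls short of $H(Y)$ by a fixed margin, the right-hand side is strictly positive regardless of model scale; an auxiliary signal that raises the mutual information with the latent target raises the ceiling and can collapse the floor.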

[381] Global Interpretability via Automated Preprocessing: A Framework Inspired by Psychiatric Questionnaires

Eric V. Strobl

Main category: cs.LG

TL;DR: REFINE is a two-stage method that decouples preprocessing from prediction: uses nonlinear preprocessing to estimate stable item values from psychiatric questionnaires, then learns linear mapping to future symptom severity for interpretability.

DetailsMotivation: Psychiatric questionnaires are context-sensitive and weakly predictive of future symptoms. While nonlinear models can improve accuracy, they lack interpretability which erodes clinical trust. Need a method that balances predictive power with transparency for clinical adoption.

Method: Two-stage approach: 1) Nonlinear preprocessing module extracts stable signal from questionnaire data by estimating stable item values, 2) Linear mapping from stabilized baseline items to future severity. Concentrates nonlinearity in preprocessing while keeping prognostic relationship linear and interpretable through coefficient matrix.

Result: REFINE outperforms other interpretable approaches while preserving clear global attribution of prognostic factors across psychiatric and non-psychiatric longitudinal prediction tasks.

Conclusion: REFINE successfully balances predictive accuracy with interpretability by decoupling nonlinear preprocessing from linear prediction, providing transparent global attribution of prognostic factors for clinical trust.

Abstract: Psychiatric questionnaires are highly context sensitive and often only weakly predict subsequent symptom severity, which makes the prognostic relationship difficult to learn. Although flexible nonlinear models can improve predictive accuracy, their limited interpretability can erode clinical trust. In fields such as imaging and omics, investigators commonly address visit- and instrument-specific artifacts by extracting stable signal through preprocessing and then fitting an interpretable linear model. We adopt the same strategy for questionnaire data by decoupling preprocessing from prediction: we restrict nonlinear capacity to a baseline preprocessing module that estimates stable item values, and then learn a linear mapping from these stabilized baseline items to future severity. We refer to this two-stage method as REFINE (Redundancy-Exploiting Follow-up-Informed Nonlinear Enhancement), which concentrates nonlinearity in preprocessing while keeping the prognostic relationship transparently linear and therefore globally interpretable through a coefficient matrix, rather than through post hoc local attributions. In experiments, REFINE outperforms other interpretable approaches while preserving clear global attribution of prognostic factors across psychiatric and non-psychiatric longitudinal prediction tasks.
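The decoupling can be sketched in a few lines. Here a simple shrinkage toward item means stands in for REFINE's learned nonlinear stabilizer (an assumption for illustration); the second stage is an ordinary least-squares map whose coefficient matrix is the global explanation:

```python
import numpy as np

def refine_style_fit(X_base, y, shrink=0.3):
    """Two-stage sketch: (1) a stabilization step standing in for the paper's
    nonlinear preprocessing module (here, shrinkage of each item toward its
    population mean); (2) a globally interpretable linear map to severity."""
    X_stable = (1 - shrink) * X_base + shrink * X_base.mean(axis=0)
    A = np.c_[np.ones(len(X_stable)), X_stable]       # intercept + stabilized items
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return X_stable, coef

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))                         # baseline questionnaire items
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + 0.5               # known linear ground truth
X_stable, coef = refine_style_fit(X, y)
```

All nonlinearity (here trivially affine) is confined to the first stage, so the fitted coefficients remain a transparent per-item attribution of future severity rather than a post hoc local explanation.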

[382] Uncertainty-aware Language Guidance for Concept Bottleneck Models

Yangyi Li, Mengdi Huai

Main category: cs.LG

TL;DR: Uncertainty-aware Concept Bottleneck Models that leverage LLMs for concept annotation while quantifying and incorporating uncertainty to address LLM hallucination risks.

DetailsMotivation: Traditional CBMs require extensive expert annotation, while existing LLM-based CBMs overlook uncertainty in LLM-generated concepts, increasing error risks from hallucinations.

Method: Proposes uncertainty-aware CBM that quantifies uncertainty of LLM-annotated concepts with distribution-free guarantees and incorporates this uncertainty into CBM training to account for varying reliability across concepts.

Result: Extensive experiments on real-world datasets validate the desired properties of the proposed method, with theoretical analysis provided.

Conclusion: The method addresses limitations of existing LLM-based CBMs by properly handling uncertainty in LLM annotations, making CBMs more practical without extensive expert annotation.

Abstract: Concept Bottleneck Models (CBMs) provide inherent interpretability by first mapping input samples to high-level semantic concepts, followed by a combination of these concepts for the final classification. However, the annotation of human-understandable concepts requires extensive expert knowledge and labor, constraining the broad adoption of CBMs. On the other hand, there are a few works that leverage the knowledge of large language models (LLMs) to construct concept bottlenecks. Nevertheless, they face two essential limitations: First, they overlook the uncertainty associated with the concepts annotated by LLMs and lack a valid mechanism to quantify uncertainty about the annotated concepts, increasing the risk of errors due to hallucinations from LLMs. Additionally, they fail to incorporate the uncertainty associated with these annotations into the learning process for concept bottleneck models. To address these limitations, we propose a novel uncertainty-aware CBM method, which not only rigorously quantifies the uncertainty of LLM-annotated concept labels with valid and distribution-free guarantees, but also incorporates quantified concept uncertainty into the CBM training procedure to account for varying levels of reliability across LLM-annotated concepts. We also provide the theoretical analysis for our proposed method. Extensive experiments on the real-world datasets validate the desired properties of our proposed method.
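The summary does not specify the paper's guarantee mechanism; split conformal prediction is the standard distribution-free tool for this kind of coverage statement, sketched here on toy concept scores (the calibration setup is an assumption for illustration):

```python
import numpy as np

def conformal_qhat(cal_scores, cal_labels, alpha=0.1):
    """Split conformal: nonconformity = 1 - score of the true concept label;
    qhat is the ceil((n+1)(1-alpha))-th smallest calibration nonconformity."""
    nonconf = 1.0 - cal_scores[np.arange(len(cal_labels)), cal_labels]
    k = int(np.ceil((len(nonconf) + 1) * (1.0 - alpha)))
    return np.sort(nonconf)[k - 1]

def concept_set(scores, qhat):
    """Concept indices kept; contains the true annotation with prob >= 1 - alpha."""
    return set(np.flatnonzero(1.0 - scores <= qhat))

rng = np.random.default_rng(0)
cal_scores = rng.dirichlet(np.ones(5), size=200)   # toy LLM concept probabilities
cal_labels = cal_scores.argmax(axis=1)             # toy 'true' annotations
qhat = conformal_qhat(cal_scores, cal_labels)
```

The resulting per-sample concept sets carry a distribution-free coverage guarantee, and their sizes give exactly the kind of reliability signal that can be fed back into CBM training.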

[383] CSyMR: Benchmarking Compositional Music Information Retrieval in Symbolic Music Reasoning

Boyang Wang, Yash Vishe, Xin Xu, Zachary Novack, Xunyi Jiang, Julian McAuley, Junda Wu

Main category: cs.LG

TL;DR: CSyMR-Bench: A benchmark for compositional music information retrieval requiring multi-step reasoning over symbolic music scores, with tool-augmented approaches outperforming LLM-only methods.

DetailsMotivation: Natural language queries about symbolic music scores often require complex, multi-step reasoning that combines multiple pieces of evidence from structured notation. Current LLMs struggle with this due to mismatches between natural language and symbolic representations, and existing benchmarks don't adequately capture these compositional retrieval demands.

Method: Introduces CSyMR-Bench with 126 multiple choice questions from real user scenarios, categorized by query intent and analytical dimensions. Proposes a tool-augmented retrieval framework combining ReAct-style controller with deterministic symbolic analysis operators built with music21.
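
As a toy illustration of chaining atomic symbolic analyses over a score (our own stand-in using a minimal note list rather than the paper's music21-backed operators):

```python
# Toy score: (pitch as MIDI number, duration in quarter notes, measure index).
SCORE = [(60, 1.0, 1), (64, 1.0, 1), (67, 2.0, 1),
         (65, 1.0, 2), (62, 1.0, 2), (60, 2.0, 2)]

# Atomic, deterministic analysis operators.
def notes_in_measure(score, m):
    return [n for n in score if n[2] == m]

def highest_pitch(notes):
    return max(n[0] for n in notes)

def total_duration(notes):
    return sum(n[1] for n in notes)

# A compositional query chains several atomic analyses:
# "Is the melodic peak of measure 1 higher than that of measure 2,
#  and do both measures span the same total duration?"
def answer_query(score):
    m1, m2 = notes_in_measure(score, 1), notes_in_measure(score, 2)
    peak_higher = highest_pitch(m1) > highest_pitch(m2)
    same_length = total_duration(m1) == total_duration(m2)
    return peak_higher and same_length
```

In the benchmark's framework a ReAct-style controller would decide which such operators to call and how to aggregate their outputs.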

Result: Tool-grounded compositional retrieval consistently outperforms LLM-only approaches by 5-7% absolute accuracy gains, with largest improvements on analysis-heavy categories.

Conclusion: Compositional MIR requires specialized tool integration beyond pure LLM capabilities, and the benchmark provides a foundation for evaluating multi-step reasoning over symbolic music representations.

Abstract: Natural language information needs over symbolic music scores rarely reduce to a single step lookup. Many queries require compositional Music Information Retrieval (MIR) that extracts multiple pieces of evidence from structured notation and aggregates them to answer the question. This setting remains challenging for Large Language Models due to the mismatch between natural language intents and symbolic representations, as well as the difficulty of reliably handling long structured contexts. Existing benchmarks only partially capture these retrieval demands, often emphasizing isolated theoretical knowledge or simplified settings. We introduce CSyMR-Bench, a benchmark for compositional MIR in symbolic music reasoning grounded in authentic user scenarios. It contains 126 multiple choice questions curated from community discussions and professional examinations, where each item requires chaining multiple atomic analyses over a score to derive implicit musical evidence. To support diagnosis, we provide a taxonomy with six query intent categories and six analytical dimension tags. We further propose a tool-augmented retrieval and reasoning framework that integrates a ReAct-style controller with deterministic symbolic analysis operators built with music21. Experiments across prompting baselines and agent variants show that tool-grounded compositional retrieval consistently outperforms Large Language Model-only approaches, yielding 5-7% absolute accuracy gains, with the largest improvements on analysis-heavy categories.

[384] FedDAG: Clustered Federated Learning via Global Data and Gradient Integration for Heterogeneous Environments

Anik Pramanik, Murat Kantarcioglu, Vincent Oria, Shantanu Sharma

Main category: cs.LG

TL;DR: FedDAG introduces a clustered federated learning framework that uses weighted class-wise similarity combining data and gradient information for better client clustering, and employs dual-encoder architecture for cross-cluster knowledge transfer while maintaining cluster specialization.

DetailsMotivation: Current clustered FL approaches have two main limitations: 1) they rely on either data similarity or gradient similarity alone, providing incomplete client similarity assessment, and 2) they restrict knowledge sharing to within clusters only, preventing models from benefiting from diverse client populations across clusters.

Method: FedDAG uses a weighted, class-wise similarity metric that integrates both data distribution and gradient information for more holistic client clustering. It employs a dual-encoder architecture where each cluster model has a primary encoder trained on its own clients’ data and a secondary encoder refined using gradients from complementary clusters, enabling cross-cluster feature transfer.
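
A minimal sketch of what a weighted, class-wise similarity could look like (our illustration; the mixing `weight`, the per-class histograms, and the class names are assumptions, not the paper's definitions):

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def classwise_similarity(data_i, data_j, grad_i, grad_j, weight=0.5):
    """Per class, blend data-distribution similarity with gradient
    similarity, then average over classes."""
    sims = []
    for c in data_i:  # class label -> histogram / gradient vector
        s_data = cosine(data_i[c], data_j[c])
        s_grad = cosine(grad_i[c], grad_j[c])
        sims.append(weight * s_data + (1 - weight) * s_grad)
    return sum(sims) / len(sims)

client_a = {"class_0": [8, 2], "class_1": [1, 9]}     # per-class label histograms
grads_a = {"class_0": [0.3, -0.1], "class_1": [0.2, 0.4]}
sim_self = classwise_similarity(client_a, client_a, grads_a, grads_a)
```

A client compared against itself scores 1.0; clustering would then group clients whose pairwise scores exceed a threshold.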

Result: Experiments on diverse benchmarks and data heterogeneity settings show that FedDAG consistently outperforms state-of-the-art clustered FL baselines in accuracy.

Conclusion: FedDAG addresses limitations of existing clustered FL approaches by providing more comprehensive client similarity assessment and enabling beneficial cross-cluster knowledge transfer while maintaining cluster specialization, leading to improved performance in heterogeneous federated learning settings.

Abstract: Federated Learning (FL) enables a group of clients to collaboratively train a model without sharing individual data, but its performance drops when client data are heterogeneous. Clustered FL tackles this by grouping similar clients. However, existing clustered FL approaches rely solely on either data similarity or gradient similarity, resulting in an incomplete assessment of client similarities. Prior clustered FL approaches also restrict knowledge and representation sharing to clients within the same cluster. This prevents cluster models from benefiting from the diverse client population across clusters. To address these limitations, we introduce FedDAG, a clustered FL framework that employs a weighted, class-wise similarity metric that integrates both data and gradient information, providing a more holistic measure of similarity during clustering. In addition, FedDAG adopts a dual-encoder architecture for cluster models, comprising a primary encoder trained on its own clients’ data and a secondary encoder refined using gradients from complementary clusters. This enables cross-cluster feature transfer while preserving cluster-specific specialization. Experiments on diverse benchmarks and data heterogeneity settings show that FedDAG consistently outperforms state-of-the-art clustered FL baselines in accuracy.

[385] Sample Size Calculations for Developing Clinical Prediction Models: Overview and pmsims R package

Diana Shamsutdinova, Felix Zimmer, Oyebayo Ridwan Olaniran, Sarah Markham, Daniel Stahl, Gordon Forbes, Ewan Carr

Main category: cs.LG

TL;DR: A framework and R package (pmsims) for sample size estimation in clinical prediction models using simulation-based methods with learning curves and Gaussian Process optimization.

DetailsMotivation: Current sample size estimation methods for clinical prediction models are inadequate: heuristic rules, closed-form formulas, and simulation-based methods vary in flexibility and accuracy, especially for complex data structures and machine learning models. Inadequate sample sizes lead to overfitting, poor generalizability, and biased predictions.

Method: Proposes a simulation-based approach integrating learning curves, Gaussian Process optimization, and assurance principles to identify sample sizes achieving target performance with high probability. Implemented in pmsims, an open-source, model-agnostic R package.
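
The assurance idea can be sketched as follows, assuming a hypothetical power-law learning curve with normal sampling noise (illustrative only; not pmsims' actual implementation, and all curve constants are made up):

```python
from math import sqrt
from statistics import NormalDist

def assurance(n, target_auc=0.85, a=0.92, b=0.9, sd0=0.15):
    """P(performance >= target) at sample size n, under an assumed
    learning-curve mean a - b/sqrt(n) and sampling sd sd0/sqrt(n)."""
    mu, sd = a - b / sqrt(n), sd0 / sqrt(n)
    return 1.0 - NormalDist(mu, sd).cdf(target_auc)

def min_sample_size(candidates, prob=0.9):
    """Smallest candidate n whose assurance meets the requested probability
    (an assurance-based criterion, as opposed to a mean-based one)."""
    for n in sorted(candidates):
        if assurance(n) >= prob:
            return n
    return None
```

In the real framework the learning curve is fitted from simulations and the search over n is driven by Gaussian Process optimisation rather than a fixed grid.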

Result: Sample size estimates vary substantially across methods, performance metrics, and modeling strategies. pmsims provides flexible, efficient, and interpretable solutions accommodating diverse models and user-defined metrics while accounting for performance variability.

Conclusion: The framework and software advance sample size methodology for clinical prediction modeling by combining flexibility with computational efficiency. Future work should extend to hierarchical/multimodal data, incorporate fairness/stability metrics, and address missing data/complex dependencies.

Abstract: Background: Clinical prediction models are increasingly used to inform healthcare decisions, but determining the minimum sample size for their development remains a critical and unresolved challenge. Inadequate sample sizes can lead to overfitting, poor generalisability, and biased predictions. Existing approaches, such as heuristic rules, closed-form formulas, and simulation-based methods, vary in flexibility and accuracy, particularly for complex data structures and machine learning models. Methods: We review current methodologies for sample size estimation in prediction modelling and introduce a conceptual framework that distinguishes between mean-based and assurance-based criteria. Building on this, we propose a novel simulation-based approach that integrates learning curves, Gaussian Process optimisation, and assurance principles to identify sample sizes that achieve target performance with high probability. This approach is implemented in pmsims, an open-source, model-agnostic R package. Results: Through case studies, we demonstrate that sample size estimates vary substantially across methods, performance metrics, and modelling strategies. Compared to existing tools, pmsims provides flexible, efficient, and interpretable solutions that accommodate diverse models and user-defined metrics while explicitly accounting for variability in model performance. Conclusions: Our framework and software advance sample size methodology for clinical prediction modelling by combining flexibility with computational efficiency. Future work should extend these methods to hierarchical and multimodal data, incorporate fairness and stability metrics, and address challenges such as missing data and complex dependency structures.

[386] Training Generalizable Collaborative Agents via Strategic Risk Aversion

Chengrui Qu, Yizhou Zhang, Nicolas Lanzetti, Eric Mazumdar

Main category: cs.LG

TL;DR: The paper proposes strategic risk aversion as an inductive bias for learning robust collaborative policies in multi-agent systems, developing a MARL algorithm that achieves reliable cooperation with unseen partners.

DetailsMotivation: Existing approaches for learning collaborative policies produce brittle solutions that fail when paired with new partners, due to free-riding during training and lack of strategic robustness.

Method: The paper studies strategic risk aversion as a principled inductive bias, develops a multi-agent reinforcement learning algorithm that integrates strategic risk aversion into standard policy optimization methods, and validates it across collaborative benchmarks including an LLM collaboration task.
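
One common way to instantiate risk aversion is a CVaR objective over partner-induced returns; a sketch under that assumption (the paper's exact formulation may differ):

```python
def cvar(returns, alpha):
    """Conditional value-at-risk: the mean of the worst alpha-fraction of
    returns. Optimizing this instead of the plain mean yields policies
    that hedge against unfavourable partner behaviour."""
    k = max(1, round(len(returns) * alpha))
    worst = sorted(returns)[:k]
    return sum(worst) / k

# Returns of one policy when paired with many sampled partners:
partner_returns = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
mean_return = sum(partner_returns) / len(partner_returns)
risk_averse_return = cvar(partner_returns, alpha=0.3)  # mean of the 3 worst pairings
```

A risk-neutral learner sees 5.5 here, while the CVaR objective scores the same policy by its worst pairings (2.0), pushing optimization toward partner-robust behaviour.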

Result: Empirical results validate the theory and demonstrate that the approach consistently achieves reliable collaboration with heterogeneous and previously unseen partners across collaborative tasks.

Conclusion: Strategic risk aversion provides an effective framework for learning generalizable collaborative policies that are robust to partner variations, addressing key limitations of existing multi-agent learning approaches.

Abstract: Many emerging agentic paradigms require agents to collaborate with one another (or people) to achieve shared goals. Unfortunately, existing approaches to learning policies for such collaborative problems produce brittle solutions that fail when paired with new partners. We attribute these failures to a combination of free-riding during training and a lack of strategic robustness. To address these problems, we study the concept of strategic risk aversion and interpret it as a principled inductive bias for generalizable cooperation with unseen partners. While strategically risk-averse players are robust to deviations in their partner’s behavior by design, we show that, in collaborative games, they also (1) can have better equilibrium outcomes than those at classical game-theoretic solution concepts like Nash equilibrium, and (2) exhibit less or no free-riding. Inspired by these insights, we develop a multi-agent reinforcement learning (MARL) algorithm that integrates strategic risk aversion into standard policy optimization methods. Our empirical results across collaborative benchmarks (including an LLM collaboration task) validate our theory and demonstrate that our approach consistently achieves reliable collaboration with heterogeneous and previously unseen partners.

[387] Neural Operators Can Discover Functional Clusters

Yicen Li, Jose Antonio Lara Benitez, Ruiyang Hong, Anastasis Kratsios, Paul David McNicholas, Maarten Valentijn de Hoop

Main category: cs.LG

TL;DR: Neural operators can learn to cluster infinite-dimensional functional data, enabling classification of complex ODE trajectories where traditional methods fail.

DetailsMotivation: While neural operators are well-understood for regression tasks, their capabilities for classification and clustering of functional data remain largely unexplored. The paper aims to establish theoretical foundations for neural operator-based clustering and develop practical applications for analyzing unlabeled families of ODE trajectories.

Method: Proves universal clustering theorem showing neural operators can approximate any finite collection of classes in infinite-dimensional RKHS. Develops practical pipeline: discretized ODE trajectories are lifted by pre-trained encoder into continuous feature maps, then mapped to soft assignments by lightweight trainable head (SNO - Sample-based Neural Operator).
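
A minimal stand-in for the trainable head, mapping an encoded trajectory feature to soft cluster assignments (the distance-based logits and the centroids are our assumptions, not the paper's architecture):

```python
import math

def soft_assign(feature, centroids, temperature=1.0):
    """Score each cluster by negative squared distance to a centroid,
    then softmax into soft assignments."""
    logits = [-sum((f - c) ** 2 for f, c in zip(feature, cen)) / temperature
              for cen in centroids]
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Two cluster centroids in the encoder's 2-D feature space (toy values).
centroids = [[0.0, 0.0], [5.0, 5.0]]
probs = soft_assign([0.2, -0.1], centroids)  # a feature near the first centroid
```

In the full pipeline the feature would come from the fixed pre-trained encoder applied to a discretized ODE trajectory, and the head's parameters would be trained.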

Result: Theoretical proof that neural operators can cluster any finite collection of classes in infinite-dimensional spaces, even when classes are non-convex/disconnected. Experimental results on synthetic ODE benchmarks show SNO successfully recovers latent dynamical structure where classical clustering methods fail.

Conclusion: Neural operators provide powerful framework for functional data clustering with theoretical guarantees. The SNO pipeline enables practical application to dynamical systems analysis, bridging theory and practice for operator-based classification tasks.

Abstract: Operator learning is reshaping scientific computing by amortizing inference across infinite families of problems. While neural operators (NOs) are increasingly well understood for regression, far less is known for classification and its unsupervised analogue: clustering. We prove that sample-based neural operators can learn any finite collection of classes in an infinite-dimensional reproducing kernel Hilbert space, even when the classes are neither convex nor connected, under mild kernel sampling assumptions. Our universal clustering theorem shows that any $K$ closed classes can be approximated to arbitrary precision by NO-parameterized classes in the upper Kuratowski topology on closed sets, a notion that can be interpreted as disallowing false-positive misclassifications. Building on this, we develop an NO-powered clustering pipeline for functional data and apply it to unlabeled families of ordinary differential equation (ODE) trajectories. Discretized trajectories are lifted by a fixed pre-trained encoder into a continuous feature map and mapped to soft assignments by a lightweight trainable head. Experiments on diverse synthetic ODE benchmarks show that the resulting practical SNO recovers latent dynamical structure in regimes where classical methods fail, providing evidence consistent with our universal clustering theory.

[388] ParamMem: Augmenting Language Agents with Parametric Reflective Memory

Tianjun Yao, Yongqiang Chen, Yujia Zheng, Pan Li, Zhiqiang Shen, Kun Zhang

Main category: cs.LG

TL;DR: ParamAgent framework with ParamMem module enhances language agents by encoding cross-sample reflection patterns into model parameters for diverse reflection generation, improving performance on reasoning tasks.

DetailsMotivation: Current self-reflection in language agents often produces repetitive outputs that limit reasoning performance. There's a strong correlation between reflective diversity and task success, motivating the need for diverse reflection signals to improve agent capabilities.

Method: Introduces ParamMem, a parametric memory module that encodes cross-sample reflection patterns into model parameters, enabling diverse reflection generation through temperature-controlled sampling. Builds ParamAgent framework integrating parametric memory with episodic and cross-sample memory.
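
Temperature-controlled sampling can be sketched as follows (toy logits over three stored reflection patterns; not ParamMem's actual parameterization):

```python
import math

def reflection_distribution(logits, temperature):
    """Softmax over candidate reflection patterns; a higher temperature
    flattens the distribution, yielding more diverse reflections."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.5]  # scores of three stored reflection patterns
sharp = reflection_distribution(logits, temperature=0.5)  # near-greedy
flat = reflection_distribution(logits, temperature=5.0)   # diverse
```

Sampling from the high-temperature distribution spreads probability mass across patterns, which is the mechanism the paper ties to reflective diversity.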

Result: Extensive experiments on code generation, mathematical reasoning, and multi-hop question answering demonstrate consistent improvements over state-of-the-art baselines. ParamMem is sample-efficient, enables weak-to-strong transfer across model scales, and supports self-improvement without stronger external models.

Conclusion: ParamMem shows potential as an effective component for enhancing language agents through diverse reflection generation, enabling better reasoning performance across various tasks without reliance on external stronger models.

Abstract: Self-reflection enables language agents to iteratively refine solutions, yet often produces repetitive outputs that limit reasoning performance. Recent studies have attempted to address this limitation through various approaches, among which increasing reflective diversity has shown promise. Our empirical analysis reveals a strong positive correlation between reflective diversity and task success, further motivating the need for diverse reflection signals. We introduce ParamMem, a parametric memory module that encodes cross-sample reflection patterns into model parameters, enabling diverse reflection generation through temperature-controlled sampling. Building on this module, we propose ParamAgent, a reflection-based agent framework that integrates parametric memory with episodic and cross-sample memory. Extensive experiments on code generation, mathematical reasoning, and multi-hop question answering demonstrate consistent improvements over state-of-the-art baselines. Further analysis reveals that ParamMem is sample-efficient, enables weak-to-strong transfer across model scales, and supports self-improvement without reliance on a stronger external model, highlighting the potential of ParamMem as an effective component for enhancing language agents.

[389] Active Value Querying to Minimize Additive Error in Subadditive Set Function Learning

Martin Černý, David Sychrovský, Filip Úradník, Jakub Černý

Main category: cs.LG

TL;DR: Paper studies approximation of subadditive set functions with missing values using additive error bounds, focusing on minimal/maximal completions and algorithms to minimize distance between them.

DetailsMotivation: Subadditive set functions are important in computational economics, combinatorial optimization, and interpretable ML, but specifying them requires exponentially many values. Missing values create ambiguity, especially when optimizing incomplete functions. Prior work shows inapproximability with multiplicative error, so this work focuses on additive error approximation.

Method: Threefold approach: (1) Analyze minimal and maximal completions of set functions with missing values across different function classes; (2) Develop methods to minimize distance between completions by disclosing additional subset values in offline and online settings; (3) Empirical evaluation of algorithms in practical scenarios.
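
The maximal completion implied by subadditivity can be sketched by minimizing over splits into subsets with known values (a toy instance of the upper bound only; the paper's completions account for further constraints):

```python
from functools import lru_cache
from itertools import combinations

# Known values of a subadditive set function on some subsets (toy data).
KNOWN = {
    frozenset("a"): 2.0, frozenset("b"): 3.0, frozenset("c"): 4.0,
    frozenset("ab"): 4.0,
}

@lru_cache(maxsize=None)
def upper_bound(s):
    """Tightest upper bound on f(s) implied by subadditivity
    (f(A | B) <= f(A) + f(B)): minimise recursively over two-way splits."""
    if s in KNOWN:
        return KNOWN[s]
    best = float("inf")
    elems = sorted(s)
    for r in range(1, len(elems)):
        for part in combinations(elems, r):
            a = frozenset(part)
            best = min(best, upper_bound(a) + upper_bound(s - a))
    return best
```

Here f({a,b,c}) is bounded by f({a,b}) + f({c}) = 8, beating the singleton cover 2 + 3 + 4 = 9; disclosing more subset values tightens such bounds, which is exactly the distance the paper's query strategies shrink.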

Result: The paper provides theoretical analysis of completion distances for different set function classes and develops algorithms to efficiently reduce these distances by strategically querying additional values.

Conclusion: The work addresses the practical challenge of working with incomplete subadditive set functions by providing approximation methods with additive error bounds and efficient query strategies for completion.

Abstract: Subadditive set functions play a pivotal role in computational economics (especially in combinatorial auctions), combinatorial optimization, and artificial intelligence applications such as interpretable machine learning. However, specifying a set function requires assigning values to an exponentially large number of subsets in general, a task that is often resource-intensive in practice, particularly when the values derive from external sources such as retraining of machine learning models. A simple omission of certain values introduces ambiguity that becomes even more significant when the incomplete set function has to be further optimized over. Motivated by the well-known result on the inapproximability of subadditive functions using deterministic value queries with respect to a multiplicative error, we study the problem of approximating an unknown subadditive set function (or a subclass thereof) with respect to an additive error, i.e., we aim to efficiently close the distance between minimal and maximal completions. Our contributions are threefold: (i) a thorough exploration of minimal and maximal completions of different classes of set functions with missing values and an analysis of their resulting distance; (ii) the development of methods to minimize this distance over classes of set functions with a known prior, achieved by disclosing values of additional subsets in both offline and online settings; and (iii) empirical demonstrations of the algorithms’ performance in practical scenarios.

[390] Flowette: Flow Matching with Graphette Priors for Graph Generation

Asiri Wijesinghe, Sevvandi Kandanaarachchi, Daniel M. Steinberg, Cheng Soon Ong

Main category: cs.LG

TL;DR: Flowette: A flow matching framework for graph generation that incorporates structural priors via “graphettes” to model recurring subgraph motifs like rings, stars, and trees.

DetailsMotivation: The paper addresses the challenge of generative modeling of graphs with recurring subgraph motifs, aiming to capture complex structural patterns that are common in real-world graphs like molecular structures.

Method: Proposes Flowette, a continuous flow matching framework using graph neural network transformers to learn velocity fields over graph representations. Introduces “graphettes” - a probabilistic family of graph structure models that generalize graphons via controlled structural edits for motifs. Uses optimal transport for topology preservation and regularization for long-range dependencies.
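
The flow-matching backbone can be sketched for the linear probability path (this is generic conditional flow matching, not Flowette's full graph-transformer or graphette machinery):

```python
import random

def interpolate(x0, x1, t):
    """Linear path between noise x0 and a data point x1 at time t."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """For the linear path the conditional target velocity is constant:
    d/dt x_t = x1 - x0; the network regresses this at sampled t."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(predicted, x0, x1):
    """Squared error between a predicted velocity and the target."""
    target = target_velocity(x0, x1)
    return sum((p - u) ** 2 for p, u in zip(predicted, target))

x0, x1 = [0.0, 0.0], [1.0, 2.0]  # e.g. flattened node/edge attributes (toy)
t = random.random()
xt = interpolate(x0, x1, t)       # the training input at a sampled time
```

Flowette additionally couples x0 and x1 via optimal transport and draws structural priors from graphette motifs rather than pure noise.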

Result: Flowette demonstrates consistent improvements on synthetic and small-molecule graph generation tasks, showing effectiveness of combining structural priors with flow-based training for modeling complex graph distributions.

Conclusion: The combination of structural priors (graphettes) with flow-based training is effective for modeling complex graph distributions with recurring motifs, providing theoretical guarantees and empirical improvements.

Abstract: We study generative modeling of graphs with recurring subgraph motifs. We propose Flowette, a continuous flow matching framework that employs a graph neural network-based transformer to learn a velocity field defined over graph representations with node and edge attributes. Our model preserves topology through optimal transport based coupling, and long-range structural dependencies through regularisation. To incorporate domain-driven structural priors, we introduce graphettes, a new probabilistic family of graph structure models that generalize graphons via controlled structural edits for motifs like rings, stars and trees. We theoretically analyze the coupling, invariance, and structural properties of the proposed framework, and empirically evaluate it on synthetic and small-molecule graph generation tasks. Flowette demonstrates consistent improvements, highlighting the effectiveness of combining structural priors with flow-based training for modeling complex graph distributions.

[391] Hybrid Quantum Temporal Convolutional Networks

Junghoon Justin Park, Maria Pak, Sebin Lee, Samuel Yen-Chi Chen, Shinjae Yoo, Huan-Hsin Tseng, Jiook Cha

Main category: cs.LG

TL;DR: Hybrid Quantum Temporal Convolutional Network (HQTCN) combines classical temporal windowing with quantum CNN for efficient multivariate time-series analysis with parameter reduction.

DetailsMotivation: Quantum machine learning models for sequential data face scalability challenges with complex multivariate signals, needing parameter-efficient approaches for time-series analysis.

Method: HQTCN combines classical temporal windowing with a quantum convolutional neural network core, applying shared quantum circuits across temporal windows to capture long-range dependencies.
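
The parameter-sharing idea can be sketched classically (the shared function below stands in for the quantum circuit; no quantum simulation is attempted, and all numbers are toy values):

```python
def windows(seq, size, stride):
    """Slice a sequence into overlapping temporal windows."""
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, stride)]

def shared_circuit(window, params):
    """Stand-in for the shared quantum circuit: one parameter vector is
    reused for every window, so the parameter count is independent of
    sequence length."""
    return sum(p * x for p, x in zip(params, window))

def hqtcn_features(seq, size, stride, params):
    """Apply the shared circuit across windows, like a temporal convolution."""
    return [shared_circuit(w, params) for w in windows(seq, size, stride)]

seq = [0.1, 0.4, -0.2, 0.3, 0.5, -0.1, 0.2, 0.0, 0.6, -0.3]
feats = hqtcn_features(seq, size=4, stride=2, params=[0.5, -0.5, 0.5, -0.5])
```

With 4 shared parameters the model covers any sequence length; overlapping windows are what let downstream layers aggregate long-range dependencies.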

Result: HQTCN performs competitively with classical baselines on univariate data and outperforms all baselines on multivariate tasks, especially under data-limited conditions with fewer parameters.

Conclusion: HQTCN establishes a parameter-efficient approach for multivariate time-series analysis, demonstrating quantum advantages for sequential data processing.

Abstract: Quantum machine learning models for sequential data face scalability challenges with complex multivariate signals. We introduce the Hybrid Quantum Temporal Convolutional Network (HQTCN), which combines classical temporal windowing with a quantum convolutional neural network core. By applying a shared quantum circuit across temporal windows, HQTCN captures long-range dependencies while achieving significant parameter reduction. Evaluated on synthetic NARMA sequences and high-dimensional EEG time-series, HQTCN performs competitively with classical baselines on univariate data and outperforms all baselines on multivariate tasks. The model demonstrates particular strength under data-limited conditions, maintaining high performance with substantially fewer parameters than conventional approaches. These results establish HQTCN as a parameter-efficient approach for multivariate time-series analysis.

[392] SDMixer: Sparse Dual-Mixer for Time Series Forecasting

Xiang Ao

Main category: cs.LG

TL;DR: Dual-stream sparse Mixer framework for multivariate time series forecasting that extracts global trends and local dynamics in frequency/time domains using sparsity to filter noise.

DetailsMotivation: Multivariate time series forecasting faces challenges with multi-scale characteristics, weak correlations, and noise interference that limit existing models' predictive performance.

Method: Proposes a dual-stream sparse Mixer framework that extracts global trends and local dynamic features from sequences in both frequency and time domains, using sparsity mechanisms to filter invalid information for better cross-variable dependency modeling.
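
A rough sketch of the two streams and the sparsity filter (a moving average stands in for the frequency-domain trend branch; the window size and `k` are arbitrary choices, not the paper's):

```python
def moving_average(x, k):
    """Global-trend stream: smooth the series with a window of size k,
    edge-padding so the output matches the input length."""
    pad = k // 2
    padded = [x[0]] * pad + x + [x[-1]] * pad
    return [sum(padded[i:i + k]) / k for i in range(len(x))]

def topk_sparsify(x, k):
    """Sparsity filter: keep only the k largest-magnitude entries,
    zeroing the rest as presumed noise."""
    keep = set(sorted(range(len(x)), key=lambda i: -abs(x[i]))[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(x)]

series = [1.0, 1.2, 5.0, 1.1, 0.9, 1.0, 1.3, 1.1]
trend = moving_average(series, 3)                 # global-trend stream
local = [v - t for v, t in zip(series, trend)]    # local-dynamics stream
sparse_local = topk_sparsify(local, 2)            # keep only salient deviations
```

The spike at index 2 survives sparsification while small fluctuations are zeroed, which is the intended noise-filtering effect before cross-variable mixing.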

Result: Achieves leading performance on multiple real-world scenario datasets, validating effectiveness and generality.

Conclusion: The proposed framework effectively addresses multi-scale, weak correlation, and noise issues in multivariate time series forecasting through dual-stream frequency/time domain analysis with sparsity filtering.

Abstract: Multivariate time series forecasting is widely applied in fields such as transportation, energy, and finance. However, the data commonly suffers from issues of multi-scale characteristics, weak correlations, and noise interference, which limit the predictive performance of existing models. This paper proposes a dual-stream sparse Mixer prediction framework that extracts global trends and local dynamic features from sequences in both the frequency and time domains, respectively. It employs a sparsity mechanism to filter out invalid information, thereby enhancing the accuracy of cross-variable dependency modeling. Experimental results demonstrate that this method achieves leading performance on multiple real-world scenario datasets, validating its effectiveness and generality. The code is available at https://github.com/SDMixer/SDMixer

[393] Normalisation and Initialisation Strategies for Graph Neural Networks in Blockchain Anomaly Detection

Dang Sy Duy, Nguyen Duy Chien, Kapil Dev, Jeff Nijsse

Main category: cs.LG

TL;DR: Systematic study of initialization and normalization strategies for GNNs in financial fraud detection, showing architecture-dependent effects on the Elliptic Bitcoin dataset.

DetailsMotivation: GNNs show promise for financial fraud detection but their effectiveness depends on training practices like weight initialization and normalization that are underexplored, especially for real-world anti-money laundering applications with severe class imbalance.

Method: Conducted systematic ablation of initialization (Xavier, Kaiming, etc.) and normalization (GraphNorm, BatchNorm, LayerNorm) strategies across three GNN architectures (GCN, GAT, GraphSAGE) on the Elliptic Bitcoin dataset with temporal splits and seeded runs.
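
Xavier initialisation, the strategy that works best for GraphSAGE here, draws weights as sketched below (the input dimension matches Elliptic's 166 node features; the hidden size and seed are arbitrary):

```python
import math
import random

def xavier_uniform(fan_in, fan_out, seed=0):
    """Xavier/Glorot uniform initialisation: draw from U(-limit, limit)
    with limit = sqrt(6 / (fan_in + fan_out)), which keeps activation
    variance roughly constant across layers."""
    rng = random.Random(seed)
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]

w = xavier_uniform(166, 64)  # first-layer weights: 166 features -> 64 hidden units
limit = math.sqrt(6.0 / (166 + 64))
```

GraphNorm and the other normalisation layers studied in the paper would then sit on top of layers initialised this way.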

Result: Found architecture-dependent effects: GraphSAGE performs best with Xavier initialization alone, GAT benefits most from GraphNorm + Xavier initialization, while GCN shows limited sensitivity to these modifications.

Conclusion: Provides practical, architecture-specific guidance for deploying GNNs in AML pipelines, especially for imbalanced datasets, and releases a reproducible experimental framework.

Abstract: Graph neural networks (GNNs) offer a principled approach to financial fraud detection by jointly learning from node features and transaction graph topology. However, their effectiveness on real-world anti-money laundering (AML) benchmarks depends critically on training practices, specifically weight initialisation and normalisation, that remain underexplored. We present a systematic ablation of initialisation and normalisation strategies across three GNN architectures (GCN, GAT, and GraphSAGE) on the Elliptic Bitcoin dataset. Our experiments reveal that initialisation and normalisation are architecture-dependent: GraphSAGE achieves the strongest performance with Xavier initialisation alone, GAT benefits most from combining GraphNorm with Xavier initialisation, while GCN shows limited sensitivity to these modifications. These findings offer practical, architecture-specific guidance for deploying GNNs in AML pipelines for datasets with severe class imbalance. We release a reproducible experimental framework with temporal data splits, seeded runs, and full ablation results.

[394] When Does Multimodal Learning Help in Healthcare? A Benchmark on EHR and Chest X-Ray Fusion

Kejing Yin, Haizhou Xu, Wenfang Yao, Chen Liu, Zijie Chen, Yui Haang Cheung, William K. Cheung, Jing Qin

Main category: cs.LG

TL;DR: Systematic benchmark of multimodal fusion between EHR and chest X-rays for clinical prediction, evaluating performance, robustness to missing modalities, and fairness.

DetailsMotivation: To understand when multimodal learning truly helps in clinical practice, particularly under modality missingness and fairness constraints, by systematically benchmarking EHR and CXR fusion.

Method: Conducted systematic benchmark on standardized cohorts from MIMIC-IV and MIMIC-CXR, evaluating multimodal fusion strategies, robustness to missing modalities, and algorithmic fairness across demographic groups.

Result: Multimodal fusion improves performance when modalities are complete, especially for diseases needing complementary EHR/CXR information. These benefits degrade rapidly under missing modalities unless models are explicitly designed to handle incomplete inputs. Cross-modal learning captures meaningful dependencies, but the rich temporal structure of EHR introduces modality imbalance. Multimodal fusion does not inherently improve fairness, with subgroup disparities arising mainly from unequal sensitivity across demographic groups.

Conclusion: Provides actionable guidance on when multimodal learning helps/fails in clinical settings, with open-source benchmarking toolkit for reproducible evaluation of clinically deployable multimodal systems.

Abstract: Machine learning holds promise for advancing clinical decision support, yet it remains unclear when multimodal learning truly helps in practice, particularly under modality missingness and fairness constraints. In this work, we conduct a systematic benchmark of multimodal fusion between Electronic Health Records (EHR) and chest X-rays (CXR) on standardized cohorts from MIMIC-IV and MIMIC-CXR, aiming to answer four fundamental questions: when multimodal fusion improves clinical prediction, how different fusion strategies compare, how robust existing methods are to missing modalities, and whether multimodal models achieve algorithmic fairness. Our study reveals several key insights. Multimodal fusion improves performance when modalities are complete, with gains concentrating in diseases that require complementary information from both EHR and CXR. While cross-modal learning mechanisms capture clinically meaningful dependencies beyond simple concatenation, the rich temporal structure of EHR introduces strong modality imbalance that architectural complexity alone cannot overcome. Under realistic missingness, multimodal benefits rapidly degrade unless models are explicitly designed to handle incomplete inputs. Moreover, multimodal fusion does not inherently improve fairness, with subgroup disparities mainly arising from unequal sensitivity across demographic groups. To support reproducible and extensible evaluation, we further release a flexible benchmarking toolkit that enables plug-and-play integration of new models and datasets. Together, this work provides actionable guidance on when multimodal learning helps, when it fails, and why, laying the foundation for developing clinically deployable multimodal systems that are both effective and reliable. The open-source toolkit can be found at https://github.com/jakeykj/CareBench.

[395] BTTackler: A Diagnosis-based Framework for Efficient Deep Learning Hyperparameter Optimization

Zhongyi Pei, Zhiyao Cen, Yipeng Huang, Chen Wang, Lin Liu, Philip Yu, Mingsheng Long

Main category: cs.LG

TL;DR: BTTackler is a hyperparameter optimization framework that uses training diagnosis to identify and terminate bad trials early, improving efficiency by reducing wasted computation on problematic configurations.

Motivation: Current automated HPO methods waste computation on trials with training problems (vanishing gradients, insufficient convergence) that aren't reflected in early accuracy metrics, leading to inefficient optimization.

Method: Proposes BTTackler framework that diagnoses trials using quantified indicators to detect training problems, triggering early termination of bad trials to free up resources for better configurations.
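The diagnose-then-terminate loop can be sketched as follows. This is a toy illustration, not the paper's implementation: the specific indicators, thresholds (`vanish_tol`, the stall window), and trial interface are illustrative assumptions.

```python
def diagnose(grad_norms, losses, vanish_tol=1e-6, window=3):
    """Quantified indicators: flag vanishing gradients or a stalled loss."""
    if len(grad_norms) >= window and all(g < vanish_tol for g in grad_norms[-window:]):
        return "vanishing_gradient"
    if len(losses) >= 2 * window:
        recent = sum(losses[-window:]) / window
        earlier = sum(losses[-2 * window:-window]) / window
        if earlier - recent < 1e-4:  # negligible improvement over the window
            return "insufficient_convergence"
    return None

def run_trial(config, train_step, max_epochs=50):
    """Run one HPO trial; terminate early when a training problem is detected."""
    grad_norms, losses = [], []
    for epoch in range(max_epochs):
        loss, grad_norm = train_step(config, epoch)
        losses.append(loss)
        grad_norms.append(grad_norm)
        problem = diagnose(grad_norms, losses)
        if problem is not None:
            return {"terminated": True, "reason": problem, "epochs": epoch + 1}
    return {"terminated": False, "reason": None, "epochs": max_epochs}
```

A trial whose gradients collapse is stopped after a few epochs, freeing its budget for other configurations.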

Result: Reduces time consumption by 40.33% to achieve comparable accuracy, conducts 44.5% more top-10 trials within given time budget, and outperforms baseline HPO methods on various tasks.

Conclusion: BTTackler effectively tackles bad trials in HPO through training diagnosis, significantly improving optimization efficiency with minimal code changes required.

Abstract: Hyperparameter optimization (HPO) is known to be costly in deep learning, especially when leveraging automated approaches. Most of the existing automated HPO methods are accuracy-based, i.e., accuracy metrics are used to guide the trials of different hyperparameter configurations amongst a specific search space. However, many trials may encounter severe training problems, such as vanishing gradients and insufficient convergence, which can hardly be reflected by accuracy metrics in the early stages of the training and often result in poor performance. This leads to an inefficient optimization trajectory because the bad trials occupy considerable computation resources and reduce the probability of finding excellent hyperparameter configurations within a time limitation. In this paper, we propose \textbf{Bad Trial Tackler (BTTackler)}, a novel HPO framework that introduces training diagnosis to identify training problems automatically and hence tackles bad trials. BTTackler diagnoses each trial by calculating a set of carefully designed quantified indicators and triggers early termination if any training problems are detected. Evaluations are performed on representative HPO tasks consisting of three classical deep neural networks (DNN) and four widely used HPO methods. To better quantify the effectiveness of an automated HPO method, we propose two new measurements based on accuracy and time consumption. Results show the advantage of BTTackler is two-fold: (1) it reduces time consumption by 40.33% on average to achieve accuracy comparable to baseline methods and (2) it conducts 44.5% more top-10 trials than baseline methods on average within a given time budget. We also released an open-source Python library that allows users to easily apply BTTackler to automated HPO processes with minimal code changes.

[396] On the Convergence of Single-Loop Stochastic Bilevel Optimization with Approximate Implicit Differentiation

Yubo Zhou, Luo Luo, Guang Dai, Haishan Ye

Main category: cs.LG

TL;DR: Single-loop stochastic bilevel optimization algorithm (SSAID) achieves optimal convergence rate matching multi-loop methods while maintaining computational efficiency, with explicit characterization of condition number dependence.

Motivation: Single-loop algorithms for stochastic bilevel optimization are widely used in practice (meta-learning, hyperparameter optimization) but lack rigorous theoretical understanding compared to multi-loop methods, with existing analyses showing suboptimal rates and unclear dependence on lower-level condition number.

Method: Single-loop Stochastic Approximate Implicit Differentiation (SSAID) algorithm that concurrently updates lower and upper variables, analyzed with refined convergence analysis to establish theoretical guarantees.
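The concurrent update structure can be illustrated on a one-dimensional toy bilevel problem with deterministic gradients (the stochastic oracles and general hypergradient solve of the actual algorithm are omitted); the quadratics here are chosen for transparency and are my own example, not the paper's:

```python
def ssaid(x0, y0, alpha=0.1, beta=0.5, iters=200):
    """Single-loop bilevel optimization on toy quadratics:
    lower level  g(x, y) = 0.5 * (y - x)^2   (so y*(x) = x),
    upper level  f(x, y) = 0.5 * (y - 1)^2   (so the solution is x = 1).
    Lower and upper variables are updated concurrently; the hypergradient
    uses approximate implicit differentiation (AID)."""
    x, y = x0, y0
    for _ in range(iters):
        grad_y_g = y - x                  # lower-level gradient step
        y_next = y - beta * grad_y_g
        # AID: solve (d2g/dy2) v = df/dy; here d2g/dy2 = 1, so v = y - 1
        v = y - 1.0
        # hypergradient = df/dx - (d2g/dxdy) v = 0 - (-1) * v
        hypergrad = v
        x = x - alpha * hypergrad
        y = y_next
    return x, y
```

Even though `y` only tracks `y*(x)` approximately at each step, the coupled iteration converges to the bilevel optimum (x, y) = (1, 1), which is the point of the single-loop analysis.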

Result: SSAID achieves ε-stationary point with oracle complexity O(κ⁷ε⁻²), matching optimal rate of state-of-the-art multi-loop methods while maintaining single-loop computational efficiency, with first explicit characterization of κ-dependence for stochastic AID-based single-loop methods.

Conclusion: SSAID provides a rigorous theoretical foundation for single-loop stochastic bilevel optimization, demonstrating it is not merely a heuristic but carries convergence guarantees competitive with multi-loop frameworks, bridging the theoretical gap between practical single-loop methods and their better-understood multi-loop counterparts.

Abstract: Stochastic Bilevel Optimization has emerged as a fundamental framework for meta-learning and hyperparameter optimization. Despite the practical prevalence of single-loop algorithms–which update lower and upper variables concurrently–their theoretical understanding, particularly in the stochastic regime, remains significantly underdeveloped compared to their multi-loop counterparts. Existing analyses often yield suboptimal convergence rates or obscure the critical dependence on the lower-level condition number $κ$, frequently burying it within generic Lipschitz constants. In this paper, we bridge this gap by providing a refined convergence analysis of the Single-loop Stochastic Approximate Implicit Differentiation (SSAID) algorithm. We prove that SSAID achieves an $ε$-stationary point with an oracle complexity of $\mathcal{O}(κ^7 ε^{-2})$. Our result is noteworthy in two aspects: (i) it matches the optimal $\mathcal{O}(ε^{-2})$ rate of state-of-the-art multi-loop methods (e.g., stocBiO) while maintaining the computational efficiency of a single-loop update; and (ii) it provides the first explicit, fine-grained characterization of the $κ$-dependence for stochastic AID-based single-loop methods. This work demonstrates that SSAID is not merely a heuristic approach, but admits a rigorous theoretical foundation with convergence guarantees competitive with mainstream multi-loop frameworks.

[397] FlexGuard: Continuous Risk Scoring for Strictness-Adaptive LLM Content Moderation

Zhihao Ding, Jinming Li, Ze Lu, Jieming Shi

Main category: cs.LG

TL;DR: FlexGuard: An LLM-based moderation system with continuous risk scoring and strictness-adaptive thresholding to handle varying enforcement requirements across platforms and time.

Motivation: Existing LLM content moderation models use fixed binary classification, assuming constant harmfulness definitions. In reality, enforcement strictness varies across platforms and evolves over time, making binary moderators brittle under shifting requirements.

Method: Proposes FlexGuard, an LLM-based moderator that outputs calibrated continuous risk scores reflecting severity. Uses risk-alignment optimization for score-severity consistency and provides threshold selection strategies to adapt to target strictness at deployment. Introduces FlexBench benchmark for controlled evaluation under multiple strictness regimes.
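The core deployment idea (one calibrated continuous scorer, with only the threshold varying per strictness regime) can be sketched as below; the validation-based threshold rule is an illustrative assumption, not necessarily one of FlexGuard's actual selection strategies:

```python
def select_threshold(scores, labels, target_fnr=0.05):
    """Choose the lowest score threshold whose false-negative rate on
    known-harmful validation examples stays within the budget target_fnr."""
    harmful = sorted(s for s, y in zip(scores, labels) if y == 1)
    if not harmful:
        return 1.0
    k = int(target_fnr * len(harmful))  # harmful items we may let through
    return harmful[k]

def moderate(score, threshold):
    """One continuous risk scorer serves every regime; only the threshold moves."""
    return "block" if score >= threshold else "allow"
```

A stricter regime (smaller false-negative budget) yields a lower threshold and blocks more borderline content, without retraining the scorer.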

Result: Experiments on FlexBench and public benchmarks show FlexGuard achieves higher moderation accuracy and substantially improved robustness under varying strictness compared to existing moderators.

Conclusion: FlexGuard addresses the practical need for adaptable content moderation by providing continuous risk scoring and strictness-adaptive thresholding, making LLM safety systems more robust to evolving enforcement requirements.

Abstract: Ensuring the safety of LLM-generated content is essential for real-world deployment. Most existing guardrail models formulate moderation as a fixed binary classification task, implicitly assuming a fixed definition of harmfulness. In practice, enforcement strictness - how conservatively harmfulness is defined and enforced - varies across platforms and evolves over time, making binary moderators brittle under shifting requirements. We first introduce FlexBench, a strictness-adaptive LLM moderation benchmark that enables controlled evaluation under multiple strictness regimes. Experiments on FlexBench reveal substantial cross-strictness inconsistency in existing moderators: models that perform well under one regime can degrade substantially under others, limiting their practical usability. To address this, we propose FlexGuard, an LLM-based moderator that outputs a calibrated continuous risk score reflecting risk severity and supports strictness-specific decisions via thresholding. We train FlexGuard via risk-alignment optimization to improve score-severity consistency and provide practical threshold selection strategies to adapt to target strictness at deployment. Experiments on FlexBench and public benchmarks demonstrate that FlexGuard achieves higher moderation accuracy and substantially improved robustness under varying strictness. We release the source code and data to support reproducibility.

[398] FedRot-LoRA: Mitigating Rotational Misalignment in Federated LoRA

Haoran Zhang, Dongjun Kim, Seohyeon Cha, Haris Vikalo

Main category: cs.LG

TL;DR: FedRot-LoRA addresses rotational misalignment in federated LoRA by aligning client updates via orthogonal transformations before aggregation, improving training stability and performance.

Motivation: Federated LoRA suffers from rotational misalignment where semantically equivalent updates are represented in different latent subspaces across clients, causing destructive interference during factor-wise averaging and unstable training.

Method: Proposes FedRot-LoRA framework that aligns client updates via orthogonal transformations prior to aggregation, preserving semantic updates while reducing cross-client subspace mismatch without increasing communication cost.
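The rotational invariance exploited here, (B_i R)(R^T A_i) = B_i A_i, means a client's factors can be rotated into a common basis before averaging without changing its update. A minimal sketch using orthogonal Procrustes against a reference factor (the paper's exact alignment reference and procedure may differ):

```python
import numpy as np

def align_factors(B_ref, B_i, A_i):
    """Rotate client i's LoRA factors toward a reference basis.
    The product B_i A_i is unchanged, since (B_i R)(R^T A_i) = B_i A_i."""
    # Orthogonal Procrustes: R = argmin ||B_i R - B_ref||_F over R^T R = I
    U, _, Vt = np.linalg.svd(B_i.T @ B_ref)
    R = U @ Vt
    return B_i @ R, R.T @ A_i
```

If a client's factors are a hidden rotation of the reference, alignment recovers the reference basis exactly, so factor-wise averaging no longer mixes mismatched subspaces.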

Result: Extensive experiments on natural language understanding and generative tasks show FedRot-LoRA consistently outperforms existing federated LoRA baselines across various heterogeneity levels and LoRA ranks.

Conclusion: Rotational alignment is crucial for stable federated LoRA training, and FedRot-LoRA provides an effective solution that improves performance without additional communication overhead.

Abstract: Federated LoRA provides a communication-efficient mechanism for fine-tuning large language models on decentralized data. In practice, however, a discrepancy between the factor-wise averaging used to preserve low rank and the mathematically correct aggregation of local updates can cause significant aggregation error and unstable training. We argue that a major source of this problem is rotational misalignment, arising from the rotational invariance of low-rank factorizations – semantically equivalent updates can be represented in different latent subspaces across clients since $(B_i R_i)(R_i^\top A_i) = B_i A_i$. When such misaligned factors are averaged directly, they interfere destructively and degrade the global update. To address this issue, we propose FedRot-LoRA, a federated LoRA framework that aligns client updates via orthogonal transformations prior to aggregation. This alignment preserves the semantic update while reducing cross-client subspace mismatch, without increasing communication cost or restricting model expressivity. We provide a convergence analysis that examines the aggregation error induced by factor-wise averaging and shows how rotational alignment yields a tighter upper bound on this error. Extensive experiments on natural language understanding and generative tasks demonstrate that FedRot-LoRA consistently outperforms existing federated LoRA baselines across a range of heterogeneity levels and LoRA ranks.

[399] Selective Denoising Diffusion Model for Time Series Anomaly Detection

Kohei Obata, Zheng Chen, Yasuko Matsubara, Lingwei Zhu, Yasushi Sakurai

Main category: cs.LG

TL;DR: AnomalyFilter is a novel diffusion-based method for time series anomaly detection that selectively filters anomaly parts while retaining normal parts through masked noise training and noise-free denoising.

Motivation: Existing diffusion-based methods for time series anomaly detection use conditional strategies that reconstruct input instances from white noise, but this approach struggles to accurately reconstruct normal parts, leading to suboptimal detection performance. The authors aim to create a more effective diffusion-based method specifically tailored for anomaly detection.

Method: Proposes AnomalyFilter, which acts as a selective filter that only denoises anomaly parts while retaining normal parts. The method masks Gaussian noise during training and conducts denoising without adding noise to instances. This noise design approach is specifically tailored for time series anomaly detection.
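The masked-noise corruption at the heart of this design can be sketched as follows (the mask distribution and noise scale here are illustrative choices, not the paper's hyperparameters):

```python
import numpy as np

def mask_noise(x, mask_ratio=0.3, sigma=1.0, rng=None):
    """Build a selective-filter training target: Gaussian noise is added
    only at masked positions, while the remaining (normal) parts stay
    untouched, so the denoiser learns to filter corrupted segments while
    passing clean signal through unchanged."""
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(x.shape) < mask_ratio
    noisy = x + mask * rng.normal(0.0, sigma, size=x.shape)
    return noisy, mask
```

Because the unmasked portion is bitwise identical to the input, a model trained on such pairs has no incentive to distort normal parts, which is what keeps their reconstruction error low at detection time.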

Result: Extensive experiments on five datasets demonstrate that AnomalyFilter achieves notably low reconstruction error on normal parts, providing empirical support for its effectiveness in anomaly detection. The synergy of the two simple components greatly enhances the performance of naive diffusion models.

Conclusion: AnomalyFilter represents a pioneering approach that focuses on the noise design of diffusion models specifically tailored for time series anomaly detection, offering improved performance over existing conditional diffusion methods.

Abstract: Time series anomaly detection (TSAD) has been an important area of research for decades, with reconstruction-based methods, mostly based on generative models, gaining popularity and demonstrating success. Diffusion models have recently attracted attention due to their advanced generative capabilities. Existing diffusion-based methods for TSAD rely on a conditional strategy, which reconstructs input instances from white noise with the aid of the conditioner. However, this poses challenges in accurately reconstructing the normal parts, resulting in suboptimal detection performance. In response, we propose a novel diffusion-based method, named AnomalyFilter, which acts as a selective filter that only denoises anomaly parts in the instance while retaining normal parts. To build such a filter, we mask Gaussian noise during the training phase and conduct the denoising process without adding noise to the instances. The synergy of the two simple components greatly enhances the performance of naive diffusion models. Extensive experiments on five datasets demonstrate that AnomalyFilter achieves notably low reconstruction error on normal parts, providing empirical support for its effectiveness in anomaly detection. AnomalyFilter represents a pioneering approach that focuses on the noise design of diffusion models specifically tailored for TSAD.

[400] Disentangled Mode-Specific Representations for Tensor Time Series via Contrastive Learning

Kohei Obata, Taichi Murayama, Zheng Chen, Yasuko Matsubara, Yasushi Sakurai

Main category: cs.LG

TL;DR: MoST is a novel representation learning method for multi-mode tensor time series that uses tensor slicing and contrastive learning to learn disentangled mode-specific and mode-invariant features.

Motivation: Multi-mode tensor time series (TTS) appear in many domains but are challenging to represent due to their complex tensor structure. Existing methods struggle to capture rich representations that disentangle different modes' features.

Method: MoST uses tensor slicing to reduce TTS complexity and learns disentangled representations for individual non-temporal modes. It employs contrastive learning with a two-part loss function: one for mode-specific features (relationships within same mode) and one for mode-invariant features (common across different modes), using disentangled representations as augmentations.
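The two-part contrastive objective can be sketched with a standard InfoNCE loss; the augmentation (a small perturbation) and the loss weighting are illustrative stand-ins for MoST's actual construction:

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """Contrastive loss: each anchor row should match its own positive row."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def most_style_loss(mode_reprs, weight=0.5):
    """Two-part objective: mode-specific terms contrast each mode's
    representation with its own augmentation; the mode-invariant term
    pulls representations of different modes toward each other."""
    specific = np.mean([info_nce(z, z + 0.01) for z in mode_reprs])
    pairs = [(i, j) for i in range(len(mode_reprs))
             for j in range(len(mode_reprs)) if i != j]
    invariant = np.mean([info_nce(mode_reprs[i], mode_reprs[j])
                         for i, j in pairs])
    return specific + weight * invariant
```

The loss is low when each anchor is aligned with its own positive and high when positives are shuffled, which is what drives the disentanglement.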

Result: Extensive experiments on real-world datasets show MoST consistently outperforms state-of-the-art methods in both classification and forecasting accuracy.

Conclusion: MoST effectively learns rich representations for TTS by disentangling mode-specific and mode-invariant features through tensor slicing and contrastive learning, demonstrating superior performance in downstream tasks.

Abstract: Multi-mode tensor time series (TTS) can be found in many domains, such as search engines and environmental monitoring systems. Learning representations of a TTS benefits various applications, but it is also challenging since the complexities inherent in the tensor hinder the realization of rich representations. In this paper, we propose a novel representation learning method designed specifically for TTS, namely MoST. Specifically, MoST uses a tensor slicing approach to reduce the complexity of the TTS structure and learns representations that can be disentangled into individual non-temporal modes. Each representation captures mode-specific features, which are the relationship between variables within the same mode, and mode-invariant features, which are in common in representations of different modes. We employ a contrastive learning framework to learn parameters; the loss function comprises two parts intended to learn representation in a mode-specific way and mode-invariant way, effectively exploiting disentangled representations as augmentations. Extensive experiments on real-world datasets show that MoST consistently outperforms the state-of-the-art methods in terms of classification and forecasting accuracy. Code is available at https://github.com/KoheiObata/MoST.

[401] Optimizer-Induced Low-Dimensional Drift and Transverse Dynamics in Transformer Training

Yongzhong Xu

Main category: cs.LG

TL;DR: Training trajectories in small transformers show dominant drift direction with residual oscillatory dynamics; optimizer choice (AdamW vs SGD) significantly affects trajectory geometry and effective dimensionality beyond loss values.

Motivation: To understand how different optimizers shape the geometry of training trajectories in transformer models beyond just loss values, examining the effective dimensionality and structure of parameter updates during training.

Method: Use uncentered, row-normalized trajectory PCA on small transformer models to analyze parameter updates, identify dominant drift directions, compare AdamW with SGD variants at matched loss levels, and study reheating effects on trajectory components.
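Uncentered, row-normalized trajectory PCA amounts to unit-normalizing each flattened update vector and taking an SVD without mean removal, so a persistent drift direction shows up as the dominant right singular vector. A minimal sketch (the exact preprocessing in the paper may differ):

```python
import numpy as np

def trajectory_pca(deltas, k=3):
    """Uncentered, row-normalized trajectory PCA: each row of `deltas` is a
    flattened cumulative parameter update; rows are unit-normalized and no
    mean is subtracted, so a shared drift direction dominates the spectrum."""
    X = deltas / np.linalg.norm(deltas, axis=1, keepdims=True)
    # without centering, the right singular vectors are the PCA directions
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    explained = S**2 / np.sum(S**2)
    return Vt[:k], explained[:k]
```

On a synthetic trajectory that grows along a single direction plus small noise, the first component captures nearly all of the movement, mirroring the paper's observation on real training runs.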

Result: A single dominant direction captures most cumulative parameter movement early in training, with residual components encoding oscillatory probe performance. AdamW develops multi-dimensional drift structure while SGD produces nearly colinear evolution. Reheating selectively perturbs transverse components without affecting dominant drift.

Conclusion: Optimizer choice fundamentally shapes training trajectory geometry and effective dimensionality in transformers, with implications for understanding optimization dynamics beyond loss metrics alone.

Abstract: We study the geometry of training trajectories in small transformer models and find that parameter updates organize into a dominant drift direction with transverse residual dynamics. Using uncentered, row-normalized trajectory PCA, we show that a single direction captures a large fraction of cumulative parameter movement early in training, while remaining components encode oscillatory behavior in auxiliary probe performance. Instantaneous gradients exhibit little alignment with this dominant direction, indicating that it arises from accumulated optimizer updates rather than per-batch gradient structure. Comparing AdamW with SGD variants at matched loss levels reveals substantial differences in trajectory geometry: AdamW develops multi-dimensional drift structure, whereas SGD-family optimizers produce nearly colinear parameter evolution and weaker probe dynamics. Reheating selectively perturbs transverse components with minimal effect on the dominant drift coordinate. These findings suggest that optimizer choice shapes the effective dimensionality and structure of learning trajectories beyond what is apparent from loss values alone.

[402] Bridging Dynamics Gaps via Diffusion Schrödinger Bridge for Cross-Domain Reinforcement Learning

Hanping Zhang, Yuhong Guo

Main category: cs.LG

TL;DR: BDGxRL uses Diffusion Schrödinger Bridge to align source domain transitions with target domain dynamics from offline demonstrations, enabling cross-domain RL without target environment interaction.

Motivation: Cross-domain RL faces challenges due to dynamics shifts between source and target domains, lack of target environment interaction, and absence of reward supervision, preventing direct policy learning.

Method: Proposes BDGxRL framework that leverages Diffusion Schrödinger Bridge to align source transitions with target-domain dynamics from offline demonstrations, plus reward modulation mechanism to estimate rewards based on state transitions for DSB-aligned samples.

Result: BDGxRL outperforms state-of-the-art baselines on MuJoCo cross-domain benchmarks and shows strong adaptability under transition dynamics shifts.

Conclusion: The framework enables target-oriented policy learning entirely within source domain without target environment access, effectively addressing cross-domain RL challenges.

Abstract: Cross-domain reinforcement learning (RL) aims to learn transferable policies under dynamics shifts between source and target domains. A key challenge lies in the lack of target-domain environment interaction and reward supervision, which prevents direct policy learning. To address this challenge, we propose Bridging Dynamics Gaps for Cross-Domain Reinforcement Learning (BDGxRL), a novel framework that leverages Diffusion Schrödinger Bridge (DSB) to align source transitions with target-domain dynamics encoded in offline demonstrations. Moreover, we introduce a reward modulation mechanism that estimates rewards based on state transitions, applying to DSB-aligned samples to ensure consistency between rewards and target-domain dynamics. BDGxRL performs target-oriented policy learning entirely within the source domain, without access to the target environment or its rewards. Experiments on MuJoCo cross-domain benchmarks demonstrate that BDGxRL outperforms state-of-the-art baselines and shows strong adaptability under transition dynamics shifts.

[403] OPTIAGENT: A Physics-Driven Agentic Framework for Automated Optical Design

Yuyu Geng, Lei Sun, Yao Gao, Xinxin Hu, Zhonghua Yi, Xiaolong Qian, Weijian Hu, Jian Bai, Kaiwei Wang

Main category: cs.LG

TL;DR: LLM-powered optical design system that enables non-experts to create functional lens systems through domain-specific training and physics-guided optimization.

Motivation: Optical design is complex, non-convex, and requires expert knowledge. LLMs have optical knowledge but can't effectively apply it to lens design. The goal is to bridge the expertise gap for non-experts.

Method: Created OptiDesignQA dataset with classical and novel lens systems. Used hybrid objective of full-system synthesis and lens completion. Employed DrGRPO with Optical Lexicographic Reward (structural format, physical feasibility, light-manipulation accuracy, LLM-based heuristics). Integrated with optical optimization routines for end-to-end fine-tuning.
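The lexicographic structure of the reward (lower-priority terms only count once every higher-priority criterion is met) can be sketched as below; the specific gate values and weights are illustrative assumptions, not the paper's:

```python
def lexicographic_reward(format_ok, feasible, accuracy, heuristic):
    """Lexicographic gating: lower-priority reward terms contribute only
    after every higher-priority criterion is satisfied."""
    if not format_ok:        # structural format comes first
        return 0.0
    if not feasible:         # then physical feasibility
        return 0.25
    # then light-manipulation accuracy, then the LLM-based heuristic bonus
    return 0.5 + 0.4 * accuracy + 0.1 * heuristic
```

This ordering prevents the policy from trading away a hard constraint (a malformed or physically infeasible lens) for a soft one (a slightly better heuristic score).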

Result: Superior performance compared to traditional optimization-based automated design algorithms and other LLM counterparts.

Conclusion: A first attempt at applying LLMs to optical design, enabling non-experts to develop functional lens systems through domain-specific training and physics-guided alignment.

Abstract: Optical design is the process of configuring optical elements to precisely manipulate light for high-fidelity imaging. It is inherently a highly non-convex optimization problem that relies heavily on human heuristic expertise and domain-specific knowledge. While Large Language Models (LLMs) possess extensive optical knowledge, their capabilities in leveraging that knowledge to design lens systems remain significantly constrained. This work represents the first attempt to employ LLMs in the field of optical design. We bridge the expertise gap by enabling users without formal optical training to successfully develop functional lens systems. Concretely, we curate a comprehensive dataset, named OptiDesignQA, which encompasses both classical lens systems sourced from standard optical textbooks and novel configurations generated by automated design algorithms for training and evaluation. Furthermore, we inject domain-specific optical expertise into the LLM through a hybrid objective of full-system synthesis and lens completion. To align the model with optical principles, we employ Group Relative Policy Optimization Done Right (DrGRPO) guided by Optical Lexicographic Reward for physics-driven policy alignment. This reward system incorporates structural format rewards, physical feasibility rewards, light-manipulation accuracy, and LLM-based heuristics. Finally, our model integrates with specialized optical optimization routines for end-to-end fine-tuning and precision refinement. We benchmark our proposed method against both traditional optimization-based automated design algorithms and LLM counterparts, and experimental results show the superiority of our method.

[404] MAGE: Multi-scale Autoregressive Generation for Offline Reinforcement Learning

Chenxing Lin, Xinhui Gao, Haipeng Zhang, Xinran Li, Haitao Wang, Songzhu Mei, Chenglu Wen, Weiquan Liu, Siqi Shen, Cheng Wang

Main category: cs.LG

TL;DR: MAGE is a multi-scale autoregressive generation method for offline RL that uses hierarchical trajectory representations and conditional guidance to handle long-horizon sparse-reward tasks.

Motivation: Existing generative approaches in offline RL struggle with long-horizon tasks with sparse rewards. While hierarchical methods help by decomposing problems, they often overlook the multi-scale temporal structure of trajectories, leading to suboptimal performance.

Method: MAGE incorporates: 1) a condition-guided multi-scale autoencoder to learn hierarchical trajectory representations, 2) a multi-scale transformer that autoregressively generates trajectory representations from coarse to fine temporal scales, and 3) a condition-guided decoder for precise control over short-term behaviors.
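The coarse-to-fine generation loop can be sketched schematically; here `refine` is a hypothetical stand-in for MAGE's multi-scale transformer, and nearest-neighbor upsampling stands in for its conditioning mechanism:

```python
import numpy as np

def coarse_to_fine(scales, refine, rng):
    """Generate a trajectory scale by scale: each temporal resolution is
    predicted conditioned on an upsampled copy of the coarser one."""
    traj = np.zeros(scales[0])
    for length in scales:
        context = np.repeat(traj, length // len(traj))  # upsample coarser scale
        traj = refine(context, rng)
    return traj
```

The coarse scales pin down long-horizon structure before any fine-grained actions are committed, which is what makes the generated trajectories coherent over long horizons.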

Result: Extensive experiments on five offline RL benchmarks against fifteen baseline algorithms show that MAGE successfully integrates multi-scale trajectory modeling with conditional guidance, generating coherent and controllable trajectories in long-horizon sparse-reward settings.

Conclusion: MAGE effectively captures temporal dependencies at multiple resolutions and demonstrates superior performance in handling complex long-horizon tasks with sparse rewards through its multi-scale autoregressive generation approach.

Abstract: Generative models have gained significant traction in offline reinforcement learning (RL) due to their ability to model complex trajectory distributions. However, existing generation-based approaches still struggle with long-horizon tasks characterized by sparse rewards. Some hierarchical generation methods have been developed to mitigate this issue by decomposing the original problem into shorter-horizon subproblems using one policy and generating detailed actions with another. While effective, these methods often overlook the multi-scale temporal structure inherent in trajectories, resulting in suboptimal performance. To overcome these limitations, we propose MAGE, a Multi-scale Autoregressive GEneration-based offline RL method. MAGE incorporates a condition-guided multi-scale autoencoder to learn hierarchical trajectory representations, along with a multi-scale transformer that autoregressively generates trajectory representations from coarse to fine temporal scales. MAGE effectively captures temporal dependencies of trajectories at multiple resolutions. Additionally, a condition-guided decoder is employed to exert precise control over short-term behaviors. Extensive experiments on five offline RL benchmarks against fifteen baseline algorithms show that MAGE successfully integrates multi-scale trajectory modeling with conditional guidance, generating coherent and controllable trajectories in long-horizon sparse-reward settings.

[405] TradeFM: A Generative Foundation Model for Trade-flow and Market Microstructure

Maxime Kawawa-Beaudan, Srijan Sood, Kassiani Papasotiriou, Daniel Borrajo, Manuela Veloso

Main category: cs.LG

TL;DR: TradeFM is a 524M-parameter generative Transformer for market microstructure that learns from billions of trade events across thousands of equities using scale-invariant features and universal tokenization for cross-asset generalization.

Motivation: To bring the foundation model paradigm to market microstructure by learning general-purpose representations from large-scale, heterogeneous trade data, enabling cross-asset generalization without asset-specific calibration.

Method: Develops scale-invariant features and a universal tokenization scheme to map heterogeneous, multi-modal order flow event streams into unified discrete sequences. Uses a 524M-parameter generative Transformer trained on billions of trade events across >9,000 equities, integrated with a deterministic market simulator.
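The role of scale-invariant features in the tokenization can be illustrated as below; the particular features, clipping ranges, and bin count are my own illustrative choices, not TradeFM's actual scheme:

```python
import math

def tokenize_trade(price, prev_price, size, typical_volume, n_bins=16):
    """Map one trade event to discrete tokens via scale-invariant features:
    log-return (invariant to price level) and size relative to typical
    volume (invariant to liquidity), so a single token vocabulary covers
    a $10 small-cap and a $1000 mega-cap alike."""
    log_ret = math.log(price / prev_price)
    rel_size = size / typical_volume

    def bucket(x, lo, hi):
        x = min(max(x, lo), hi)          # clip, then quantize into n_bins
        return int((x - lo) / (hi - lo) * (n_bins - 1))

    return bucket(log_ret, -0.01, 0.01), bucket(rel_size, 0.0, 0.001)
```

Two trades with the same relative price move and the same volume share map to identical tokens regardless of absolute price or liquidity, which is what removes the need for asset-specific calibration.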

Result: TradeFM-generated rollouts reproduce key financial stylized facts (heavy tails, volatility clustering, no return autocorrelation). Achieves 2-3x lower distributional error than Compound Hawkes baselines and generalizes zero-shot to geographically out-of-distribution APAC markets with moderate perplexity degradation.

Conclusion: Scale-invariant trade representations capture transferable structure in market microstructure, opening paths for synthetic data generation, stress testing, and learning-based trading agents.

Abstract: Foundation models have transformed domains from language to genomics by learning general-purpose representations from large-scale, heterogeneous data. We introduce TradeFM, a 524M-parameter generative Transformer that brings this paradigm to market microstructure, learning directly from billions of trade events across >9K equities. To enable cross-asset generalization, we develop scale-invariant features and a universal tokenization scheme that map the heterogeneous, multi-modal event stream of order flow into a unified discrete sequence – eliminating asset-specific calibration. Integrated with a deterministic market simulator, TradeFM-generated rollouts reproduce key stylized facts of financial returns, including heavy tails, volatility clustering, and absence of return autocorrelation. Quantitatively, TradeFM achieves 2-3x lower distributional error than Compound Hawkes baselines and generalizes zero-shot to geographically out-of-distribution APAC markets with moderate perplexity degradation. Together, these results suggest that scale-invariant trade representations capture transferable structure in market microstructure, opening a path toward synthetic data generation, stress testing, and learning-based trading agents.

[406] Provable Subspace Identification of Nonlinear Multi-view CCA

Zhiwei Han, Stefan Matthes, Hao Shen

Main category: cs.LG

TL;DR: Nonlinear CCA identifiability analysis showing multi-view CCA recovers correlated signal subspaces up to orthogonal ambiguity, with theoretical guarantees and experimental validation.

Motivation: To understand the identifiability of nonlinear Canonical Correlation Analysis in multi-view setups where each view is generated by unknown nonlinear maps applied to linear mixtures of shared latents and view-private noise.

Method: Reframe multi-view CCA as a basis-invariant subspace identification problem rather than exact unmixing. Prove that under suitable latent priors and spectral separation conditions, multi-view CCA recovers pairwise correlated signal subspaces up to view-wise orthogonal ambiguity. For N≥3 views, the objective isolates jointly correlated subspaces while eliminating view-private variations.

Result: Established finite-sample consistency guarantees by translating concentration of empirical cross-covariances into explicit subspace error bounds via spectral perturbation theory. Experiments on synthetic and rendered image datasets validate theoretical findings and confirm necessity of assumed conditions.

Conclusion: Multi-view CCA can provably recover correlated signal subspaces despite nonlinear transformations, providing theoretical foundation for subspace identification in nonlinear multi-view learning.

Abstract: We investigate the identifiability of nonlinear Canonical Correlation Analysis (CCA) in a multi-view setup, where each view is generated by an unknown nonlinear map applied to a linear mixture of shared latents and view-private noise. Rather than attempting exact unmixing, a problem proven to be ill-posed, we instead reframe multi-view CCA as a basis-invariant subspace identification problem. We prove that, under suitable latent priors and spectral separation conditions, multi-view CCA recovers the pairwise correlated signal subspaces up to view-wise orthogonal ambiguity. For $N \geq 3$ views, the objective provably isolates the jointly correlated subspaces shared across all views while eliminating view-private variations. We further establish finite-sample consistency guarantees by translating the concentration of empirical cross-covariances into explicit subspace error bounds via spectral perturbation theory. Experiments on synthetic and rendered image datasets validate our theoretical findings and confirm the necessity of the assumed conditions.

[407] UPath: Universal Planner Across Topological Heterogeneity For Grid-Based Pathfinding

Aleksandr Ananikian, Daniil Drozdov, Konstantin Yakovlev

Main category: cs.LG

TL;DR: A universal heuristic predictor for grid-based pathfinding that generalizes across unseen tasks, trained once but capable of handling diverse problem instances, reducing A* computational effort by up to 2.2x while maintaining near-optimal solutions.

DetailsMotivation: Existing learning-based heuristic approaches for pathfinding (like A*) assume training and test maps come from the same distribution, limiting practical application where universal solvers are needed for diverse problem instances.

Method: Designs a universal heuristic predictor model trained once but capable of generalizing across a full spectrum of unseen tasks, using deep neural networks to approximate informed heuristics that consider obstacle positions/shapes.
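
The role such a predictor plays is easy to see in plain A*: the heuristic is a pluggable function, and a trained model would be called in place of the hand-crafted one below (a generic sketch, not the paper's architecture):

```python
import heapq

def astar(grid, start, goal, heuristic):
    """Plain grid A*; `heuristic(node, goal)` is pluggable, so a learned
    predictor would simply replace the hand-crafted function passed in."""
    frontier = [(heuristic(start, goal), 0, start)]
    best_g = {start: 0}
    expanded = 0
    while frontier:
        f, g, node = heapq.heappop(frontier)
        if node == goal:
            return g, expanded
        if g > best_g.get(node, float("inf")):
            continue                                  # stale heap entry
        expanded += 1
        r, c = node
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]) and not grid[nr][nc]:
                ng = g + 1
                if ng < best_g.get((nr, nc), float("inf")):
                    best_g[(nr, nc)] = ng
                    heapq.heappush(frontier, (ng + heuristic((nr, nc), goal), ng, (nr, nc)))
    return None, expanded

manhattan = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])
grid = [[0, 0, 0],
        [1, 1, 0],
        [0, 0, 0]]                                    # 1 = obstacle
cost, expanded = astar(grid, (0, 0), (2, 0), manhattan)  # cost == 6
```

A better-informed heuristic shrinks `expanded` while (if admissible) leaving `cost` optimal, which is exactly the trade-off the paper's 2.2x / 3%-of-optimal numbers quantify.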

Result: The approach reduces A*'s computational effort by up to a factor of 2.2 while providing solutions within 3% of the optimal cost on average, even on tasks completely different from the training data.

Conclusion: Achieves a milestone as the first learnable solver to generalize effectively across diverse pathfinding tasks, enabling practical universal pathfinding solutions.

Abstract: The performance of search algorithms for grid-based pathfinding, e.g. A*, critically depends on the heuristic function that is used to focus the search. Recent studies have shown that informed heuristics that take the positions/shapes of the obstacles into account can be approximated with deep neural networks. Unfortunately, the existing learning-based approaches mostly rely on the assumption that training and test grid maps are drawn from the same distribution (e.g., city maps, indoor maps, etc.) and perform poorly on out-of-distribution tasks. This naturally limits their application in practice, where often a universal solver is needed that is capable of efficiently handling any problem instance. In this work, we close this gap by designing a universal heuristic predictor: a model trained once, but capable of generalizing across a full spectrum of unseen tasks. Our extensive empirical evaluation shows that the suggested approach reduces the computational effort of A* by up to a factor of 2.2, while still providing solutions within 3% of the optimal cost on average, on tasks that are completely different from the ones used for training – a milestone reached for the first time by a learnable solver.

[408] GRAIL: Post-hoc Compensation by Linear Reconstruction for Compressed Networks

Wenwu Tang, Dong Wang, Lothar Thiele, Olga Saukh

Main category: cs.LG

TL;DR: GRAIL is a zero-finetuning post-compression method that restores model accuracy using calibration data and ridge regression to reconstruct hidden representations.

DetailsMotivation: Deep model compression often requires post-compression finetuning which can be impractical due to missing labeled data or high training costs. There's a need for methods that restore accuracy without expensive finetuning.

Method: Post-hoc blockwise compensation using Gram matrices to summarize hidden activations, applying ridge regression to linearly reconstruct original hidden representations from compressed ones, then absorbing the reconstruction map into downstream weights.
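
The compensation step is concrete enough to sketch: given calibration activations before and after compression, the ridge map is a single linear solve against the Gram matrix of the compressed activations (a minimal sketch following the method description, not the released code):

```python
import numpy as np

def grail_compensation(h_orig, h_comp, lam=1e-3):
    """Ridge map W such that h_comp @ W approximates h_orig, computed from
    the Gram matrix of the compressed calibration activations. W would then
    be absorbed into the downstream projection weights."""
    gram = h_comp.T @ h_comp                      # (k, k) Gram summary
    cross = h_comp.T @ h_orig                     # (k, d)
    return np.linalg.solve(gram + lam * np.eye(gram.shape[0]), cross)

# toy calibration set: the last 4 channels are linear in the first 4,
# so pruning them is recoverable by the linear compensation map
rng = np.random.default_rng(0)
h = rng.normal(size=(256, 8))
h[:, 4:] = h[:, :4] @ rng.normal(size=(4, 4))
h_comp = h[:, :4]                                 # channel-pruned block output
W = grail_compensation(h, h_comp)
rel_err = np.linalg.norm(h_comp @ W - h) / np.linalg.norm(h)
```

When inter-channel correlations are strong (Gram matrix far from identity), the map recovers most of the lost representation; when it is near identity, W degenerates toward plain pruning, matching the paper's observation.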

Result: Consistently improves accuracy/perplexity over data-free and data-aware pruning/folding baselines across ResNets, ViTs, and decoder-only LLMs in practical compression regimes with manageable overhead and no backpropagation.

Conclusion: GRAIL provides an effective zero-finetuning approach for model compression that works across various architectures and compression methods, requiring only a small calibration set without gradients or labels.

Abstract: Structured deep model compression methods are hardware-friendly and substantially reduce memory and inference costs. However, under aggressive compression, the resulting accuracy degradation often necessitates post-compression finetuning, which can be impractical due to missing labeled data or high training cost. We propose post-hoc blockwise compensation, called GRAIL, a simple zero-finetuning step applied after model compression that restores each block’s input-output behavior using a small calibration set. The method summarizes hidden activations via a Gram matrix and applies ridge regression to linearly reconstruct the original hidden representation from the reduced one. The resulting reconstruction map is absorbed into the downstream projection weights, while the upstream layer is compressed. The approach is selector-agnostic (Magnitude, Wanda, Gram-based selection, or folding), data-aware (requiring only a few forward passes without gradients or labels), and recovers classic pruning or folding when the Gram matrix is near identity, indicating weak inter-channel correlations. Across ResNets, ViTs, and decoder-only LLMs, GRAIL consistently improves accuracy or perplexity over data-free and data-aware pruning or folding baselines in practical compression regimes, with manageable overhead and no backpropagation. The code is available at https://github.com/TWWinde/GRAIL.

[409] MPU: Towards Secure and Privacy-Preserving Knowledge Unlearning for Large Language Models

Tiantong Wang, Xinyu Yan, Tiantong Wu, Yurong Hao, Yong Jiang, Fei Huang, Wei Yang Bryan Lim

Main category: cs.LG

TL;DR: MPU: A privacy-preserving framework for machine unlearning in LLMs that addresses dual non-disclosure constraints by distributing perturbed model copies to clients for local unlearning, then aggregating updates with denoising.

DetailsMotivation: Address the privacy dilemma in machine unlearning where neither server parameters nor client forget sets can be shared due to strict constraints, requiring a solution that protects both parties' privacy.

Method: Proposes MPU with two server-side modules: Pre-Process generates multiple perturbed and reparameterized model instances for clients to perform local unlearning; Post-Process inverts reparameterization and aggregates updates with harmonic denoising to mitigate perturbation effects.
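
A toy version of the perturbed-copies idea conveys the privacy mechanism; note the interface and the denoising step here are illustrative assumptions, not the paper's exact Pre-/Post-Process modules:

```python
import numpy as np

def mpu_round(w_server, local_unlearn, k=8, sigma=0.1, seed=0):
    """Toy sketch: the server ships k noise-perturbed copies (the client
    never sees the exact server parameters), the client unlearns each copy
    locally on its private forget set, and averaging the returned updates
    attenuates the injected noise."""
    rng = np.random.default_rng(seed)
    updates = []
    for _ in range(k):
        w_copy = w_server + rng.normal(0.0, sigma, size=w_server.shape)
        updates.append(local_unlearn(w_copy) - w_copy)  # client-side step
    return w_server + np.mean(updates, axis=0)          # server aggregation

w = np.ones(4)
w_new = mpu_round(w, lambda m: 0.5 * m)  # toy "unlearning": shrink weights
```

With the toy shrink step, the aggregated result lands near 0.5·w: the noise enters each update but averages out across copies, which is the intuition behind the sub-1% degradation the paper reports.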

Result: MPU achieves comparable unlearning performance to noise-free baselines, with most algorithms showing average degradation below 1% under 10% noise, and can even outperform noise-free baselines for some algorithms under 1% noise.

Conclusion: MPU effectively addresses dual non-disclosure constraints in machine unlearning while maintaining performance comparable to traditional approaches, offering a practical privacy-preserving solution.

Abstract: Machine unlearning for large language models often faces a privacy dilemma in which strict constraints prohibit sharing either the server’s parameters or the client’s forget set. To address this dual non-disclosure constraint, we propose MPU, an algorithm-agnostic privacy-preserving Multiple Perturbed Copies Unlearning framework that primarily introduces two server-side modules: Pre-Process for randomized copy generation and Post-Process for update aggregation. In Pre-Process, the server distributes multiple perturbed and reparameterized model instances, allowing the client to execute unlearning locally on its private forget set without accessing the server’s exact original parameters. After local unlearning, the server performs Post-Process by inverting the reparameterization and aggregating updates with a harmonic denoising procedure to alleviate the impact of perturbation. Experiments with seven unlearning algorithms show that MPU achieves comparable unlearning performance to noise-free baselines, with most algorithms’ average degradation well below 1% under 10% noise, and can even outperform the noise-free baseline for some algorithms under 1% noise. Code is available at https://github.com/Tristan-SHU/MPU.

[410] Actor-Critic Pretraining for Proximal Policy Optimization

Andreas Kernbach, Amr Elsheikh, Nicolas Grupp, René Nagel, Marco F. Huber

Main category: cs.LG

TL;DR: Actor-critic pretraining using expert demonstrations improves RL sample efficiency for robotics tasks by initializing both actor and critic networks, outperforming actor-only pretraining.

DetailsMotivation: RL actor-critic algorithms require many environment interactions, limiting robotics applications. While actor pretraining via behavioral cloning is common, critic initialization is often neglected despite its importance in policy optimization.

Method: Proposes actor-critic pretraining for algorithms like PPO: actor pretrained via behavioral cloning on expert demonstrations, critic pretrained using returns from rollouts of the pretrained policy.
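
The critic's pretraining targets are ordinary Monte-Carlo returns computed from rollouts of the behavior-cloned policy; a minimal version of that computation (the critic itself would then be fit by regression onto these targets):

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Monte-Carlo returns G_t = sum_k gamma^k * r_{t+k}, used as
    regression targets when pretraining the critic on rollouts of the
    behavior-cloned policy."""
    g, out = 0.0, np.zeros(len(rewards))
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        out[t] = g
    return out

returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)  # [1.75, 1.5, 1.0]
```

Initializing the value head this way means PPO's early advantage estimates are not dominated by an untrained critic, which is where the sample-efficiency gain over actor-only pretraining plausibly comes from.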

Result: Evaluated on 15 simulated robotic manipulation and locomotion tasks. Actor-critic pretraining improves sample efficiency by 86.1% vs no pretraining and 30.9% vs actor-only pretraining.

Conclusion: Pretraining both actor and critic networks with expert data significantly enhances RL sample efficiency for robotics, demonstrating the importance of proper critic initialization.

Abstract: Reinforcement learning (RL) actor-critic algorithms enable autonomous learning but often require a large number of environment interactions, which limits their applicability in robotics. Leveraging expert data can reduce the number of required environment interactions. A common approach is actor pretraining, where the actor network is initialized via behavioral cloning on expert demonstrations and subsequently fine-tuned with RL. In contrast, the initialization of the critic network has received little attention, despite its central role in policy optimization. This paper proposes a pretraining approach for actor-critic algorithms like Proximal Policy Optimization (PPO) that uses expert demonstrations to initialize both networks. The actor is pretrained via behavioral cloning, while the critic is pretrained using returns obtained from rollouts of the pretrained policy. The approach is evaluated on 15 simulated robotic manipulation and locomotion tasks. Experimental results show that actor-critic pretraining improves sample efficiency by 86.1% on average compared to no pretraining and by 30.9% compared to actor-only pretraining.

[411] Beyond State-Wise Mirror Descent: Offline Policy Optimization with Parametric Policies

Xiang Li, Nan Jiang, Yuheng Zhang

Main category: cs.LG

TL;DR: Theoretical analysis of offline RL with general function approximation, extending mirror descent to parameterized policies over large/continuous action spaces and connecting it to natural policy gradient.

DetailsMotivation: Existing offline RL algorithms with theoretical guarantees are computationally tractable only for finite/small action spaces and rely on state-wise mirror descent that requires actors to be implicitly induced from critics, failing to accommodate standalone policy parameterization which is common in practice.

Method: Extends mirror descent to parameterized policy classes over large/continuous action spaces by addressing contextual coupling difficulty. Connects mirror descent to natural policy gradient, leading to novel analyses, guarantees, and algorithmic insights.
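
For context, the state-wise mirror descent step that PSPI-style analyses rely on is the standard KL-regularized update (textbook form, not this paper's notation):

$$\pi_{t+1}(\cdot \mid s) \;=\; \arg\max_{\pi}\ \mathbb{E}_{a \sim \pi}\big[Q_t(s,a)\big] \;-\; \frac{1}{\eta}\,\mathrm{KL}\big(\pi(\cdot \mid s)\,\big\|\,\pi_t(\cdot \mid s)\big),$$

whose closed form $\pi_{t+1}(a \mid s) \propto \pi_t(a \mid s)\, e^{\eta Q_t(s,a)}$ induces the actor implicitly from the critic, one state at a time. With a standalone parameterized policy no such per-state closed form exists, and a single parameter vector couples the update across states, which is the contextual coupling the paper resolves through the natural policy gradient connection.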

Result: Provides theoretical guarantees for offline RL with parameterized policy classes over large/continuous action spaces, showing how connecting mirror descent to natural policy gradient enables novel analyses and algorithmic insights, including unification between offline RL and imitation learning.

Conclusion: The work addresses limitations of existing offline RL theory by extending guarantees to practical policy parameterizations, revealing connections between mirror descent and natural policy gradient, and providing theoretical foundations for more practical offline RL algorithms.

Abstract: We investigate the theoretical aspects of offline reinforcement learning (RL) under general function approximation. While prior works (e.g., Xie et al., 2021) have established the theoretical foundations of learning a good policy from offline data via pessimism, existing algorithms that are computationally tractable (often in an oracle-efficient sense), such as PSPI, only apply to finite and small action spaces. Moreover, these algorithms rely on state-wise mirror descent and require actors to be implicitly induced from the critic functions, failing to accommodate standalone policy parameterization which is ubiquitous in practice. In this work, we address these limitations and extend the theoretical guarantees to parameterized policy classes over large or continuous action spaces. When extending mirror descent to parameterized policies, we identify contextual coupling as the core difficulty, and show how connecting mirror descent to natural policy gradient leads to novel analyses, guarantees, and algorithmic insights, including a surprising unification between offline RL and imitation learning.

[412] Learning to maintain safety through expert demonstrations in settings with unknown constraints: A Q-learning perspective

George Papadopoulos, George A. Vouros

Main category: cs.LG

TL;DR: SafeQIL: A safe Q-learning approach for inverse constrained reinforcement learning that learns policies from demonstrations in constrained MDPs with unknown constraints, balancing reward maximization with safety.

DetailsMotivation: The paper addresses the challenge of learning from demonstrations in constrained Markov Decision Processes (MDPs) where constraints are unknown and costs are non-observable. The goal is to find policies that maximize the likelihood of demonstrated trajectories while being conservative enough to avoid unsafe steps, balancing between being safe and achieving high rewards.

Method: The authors formulate the “promise” of state-action pairs using Q-values that incorporate both task-specific rewards and safety assessments. They develop SafeQIL (Safe Q Inverse Constrained Reinforcement Learning), which takes a safe Q-learning perspective on the inverse learning problem under constraints. The method learns from demonstrations to infer policies that trade off between reward maximization and safety.

Result: SafeQIL is compared to state-of-the-art inverse constraint reinforcement learning algorithms on challenging benchmark tasks, demonstrating its merits in learning safe policies from demonstrations in constrained environments with unknown constraints.

Conclusion: The paper presents a novel safe Q-learning approach for inverse constrained reinforcement learning that effectively learns policies from demonstrations while balancing safety and reward objectives, showing advantages over existing methods on benchmark tasks.

Abstract: Given a set of trajectories demonstrating the execution of a task safely in a constrained MDP with observable rewards but with unknown constraints and non-observable costs, we aim to find a policy that maximizes the likelihood of demonstrated trajectories, trading the balance between being conservative and significantly increasing the likelihood of high-rewarding trajectories with potentially unsafe steps. Having these objectives, we aim towards learning a policy that maximizes the probability of the most promising trajectories with respect to the demonstrations. In so doing, we formulate the "promise" of individual state-action pairs in terms of $Q$ values, which depend on task-specific rewards as well as on the assessment of states' safety, mixing expectations in terms of rewards and safety. This entails a safe Q-learning perspective of the inverse learning problem under constraints: the devised Safe $Q$ Inverse Constrained Reinforcement Learning (SafeQIL) algorithm is compared to state-of-the-art inverse constrained reinforcement learning algorithms on a set of challenging benchmark tasks, showing its merits.

[413] Inferring Chronic Treatment Onset from ePrescription Data: A Renewal Process Approach

Pavlin G. Poličar, Dalibor Stanimirović, Blaž Zupan

Main category: cs.LG

TL;DR: Probabilistic framework uses prescription renewal dynamics to infer chronic disease onset, outperforming rule-based methods on left-censored EHR data.

DetailsMotivation: Longitudinal EHR data suffers from left-censoring, making diagnosis records unreliable for determining disease onset. Prescription trajectories provide continuous signals of disease management that can be leveraged for more accurate onset inference.

Method: Models prescription dynamics as a renewal process with change-point detection between baseline Poisson (sporadic prescribing) and regime-specific Weibull (sustained therapy) renewal models to detect transitions to chronic treatment.
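
The change-point scan at the heart of the method can be sketched with a few lines of likelihood arithmetic; for simplicity both regimes below are exponential (i.e. Poisson-process gaps), whereas the paper fits a regime-specific Weibull renewal model to the sustained-therapy segment:

```python
import numpy as np

def exp_loglik(gaps):
    """Exponential (Poisson-process) log-likelihood at the MLE rate n/S."""
    n, s = len(gaps), gaps.sum()
    return n * np.log(n / s) - n

def detect_onset(gaps, min_seg=3):
    """Scan candidate change points in a sequence of inter-prescription
    gaps and keep the split with the largest two-regime likelihood gain
    over the single-regime null."""
    n, null = len(gaps), exp_loglik(gaps)
    best_t, best_gain = None, 0.0
    for t in range(min_seg, n - min_seg + 1):
        gain = exp_loglik(gaps[:t]) + exp_loglik(gaps[t:]) - null
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# sporadic prescribing (long, irregular gaps) then sustained ~monthly therapy
gaps = np.array([210., 150., 95., 180., 130.] + [28., 31., 25., 33., 29.] * 4)
onset, gain = detect_onset(gaps)  # onset == 5, the true regime boundary
```

Anchoring onset at the detected regime transition, rather than at the first prescription or first diagnosis code, is what makes the estimates robust to left-censored records.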

Result: Applied to nationwide ePrescription dataset of 2.4M individuals, yields more temporally plausible onset estimates than naive rule-based triggering, reducing implausible early detections under strong left censoring. Performance varies by disease and correlates with prescription density.

Conclusion: Treatment-based onset inference using prescription renewal dynamics provides more reliable estimates than diagnosis records for left-censored EHR data, though effectiveness depends on disease-specific prescription patterns.

Abstract: Longitudinal electronic health record (EHR) data are often left-censored, making diagnosis records incomplete and unreliable for determining disease onset. In contrast, outpatient prescriptions form renewal-based trajectories that provide a continuous signal of disease management. We propose a probabilistic framework to infer chronic treatment onset by modeling prescription dynamics as a renewal process and detecting transitions from sporadic to sustained therapy via change-point detection between a baseline Poisson (sporadic prescribing) regime and a regime-specific Weibull (sustained therapy) renewal model. Using a nationwide ePrescription dataset of 2.4 million individuals, we show that the approach yields more temporally plausible onset estimates than naive rule-based triggering, substantially reducing implausible early detections under strong left censoring. Detection performance varies across diseases and is strongly associated with prescription density, highlighting both the strengths and limits of treatment-based onset inference.

[414] FedNSAM: Consistency of Local and Global Flatness for Federated Learning

Junkang Liu, Fanhua Shang, Yuxuan Tian, Hongying Liu, Yuanyuan Liu

Main category: cs.LG

TL;DR: FedNSAM improves federated learning by using global Nesterov momentum to align local and global model flatness, addressing data heterogeneity issues that degrade generalization.

DetailsMotivation: In federated learning, multi-step local updates and data heterogeneity lead to sharper global minima, degrading model performance. While SAM helps with local flatness, it doesn't guarantee global flatness due to data heterogeneity, creating a "flatness distance" problem.

Method: FedNSAM introduces global Nesterov momentum into local updates to harmonize global and local flatness consistency. It uses global Nesterov momentum as the direction for local estimation of client global perturbations and extrapolation.
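
The two ingredients compose in a single local step; the sketch below is an illustration under assumed notation, not the paper's exact FedNSAM recursion:

```python
import numpy as np

def local_step_nesterov_sam(w, grad_fn, m_global, lr=0.1, rho=0.05, beta=0.9):
    """One illustrative local update: extrapolate along the *global*
    Nesterov momentum, then take a SAM-style ascent-then-descent step at
    the extrapolated point, so the local perturbation estimates a global
    flatness direction rather than a purely local one."""
    w_look = w - beta * lr * m_global                 # Nesterov extrapolation
    g = grad_fn(w_look)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)       # SAM perturbation
    return w - lr * grad_fn(w_look + eps)

grad = lambda w: w                                    # gradient of 0.5 * ||w||^2
w0 = np.array([1.0, 1.0])
w1 = local_step_nesterov_sam(w0, grad, m_global=np.zeros(2))
```

With zero momentum this reduces to plain local SAM; the point of the global momentum term is to shrink the "flatness distance" between what clients flatten locally and what is flat for the aggregated model.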

Result: Theoretically proves tighter convergence bound than FedSAM via Nesterov extrapolation. Empirically shows superior performance and efficiency on CNN and Transformer models through comprehensive experiments.

Conclusion: FedNSAM effectively addresses the flatness distance problem in federated learning by aligning local and global model flatness through global Nesterov momentum, improving generalization and convergence.

Abstract: In federated learning (FL), multi-step local updates and data heterogeneity usually lead to sharper global minima, which degrades the performance of the global model. Popular FL algorithms integrate sharpness-aware minimization (SAM) into local training to address this issue. However, in the high data heterogeneity setting, flatness in local training does not imply flatness of the global model. Therefore, minimizing the sharpness of the local loss surfaces on the client data does not allow SAM to improve the generalization ability of the global model in FL. We define the flatness distance to explain this phenomenon. By rethinking SAM in FL and theoretically analyzing the flatness distance, we propose a novel FedNSAM algorithm that accelerates SAM by introducing global Nesterov momentum into the local update to harmonize the consistency of global and local flatness. FedNSAM uses the global Nesterov momentum as the direction of local estimation of client global perturbations and extrapolation. Theoretically, we prove a tighter convergence bound than FedSAM via Nesterov extrapolation. Empirically, we conduct comprehensive experiments on CNN and Transformer models to verify the superior performance and efficiency of FedNSAM. The code is available at https://github.com/junkangLiu0/FedNSAM.

[415] LK Losses: Direct Acceptance Rate Optimization for Speculative Decoding

Alexander Samarin, Sergei Krutikov, Anton Shevtsov, Sergei Skvortsov, Filipp Fisin, Alexander Golubev

Main category: cs.LG

TL;DR: Proposes LK losses for training draft models in speculative decoding that directly optimize acceptance rate instead of KL divergence, achieving up to 8-10% improvements in average acceptance length across various model sizes and domains.

DetailsMotivation: Standard speculative decoding uses KL divergence as a proxy objective for maximizing acceptance rate, but small draft models with limited capacity converge to suboptimal solutions where minimizing KL doesn't guarantee maximizing acceptance rate.

Method: Introduces LK losses (special training objectives) that directly target acceptance rate rather than using KL divergence as a proxy. These losses are easy to implement, introduce no computational overhead, and can be integrated into existing speculator training frameworks.
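
The quantity being targeted has a closed form per token: under standard speculative sampling the acceptance probability is $\sum_x \min(p(x), q(x))$, and it can disagree with KL about which draft is better. The losses themselves are the paper's contribution; the snippet below only illustrates the proxy gap:

```python
import numpy as np

def acceptance_rate(p, q):
    """Per-token acceptance probability of standard speculative sampling:
    sum_x min(p(x), q(x)) for target p and draft q."""
    return float(np.minimum(p, q).sum())

def kl(p, q):
    """KL(p || q), the usual proxy objective for draft training."""
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

p  = np.array([0.45, 0.45, 0.10])    # target distribution
q1 = np.array([0.495, 0.495, 0.01])  # higher acceptance, but higher KL
q2 = np.array([0.40, 0.30, 0.30])    # lower acceptance, but lower KL
```

Here `q2` wins under KL while `q1` is clearly the better draft for acceptance (0.91 vs 0.80), which is exactly the capacity-limited regime where minimizing KL fails to maximize acceptance.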

Result: Comprehensive experiments across 4 draft architectures and 6 target models (8B to 685B parameters) show consistent improvements in acceptance metrics. Gains of up to 8-10% in average acceptance length across general, coding, and math domains.

Conclusion: LK losses provide a compelling alternative to standard KL-based training for draft models in speculative decoding, directly optimizing for acceptance rate and achieving significant speedup improvements across diverse model configurations and domains.

Abstract: Speculative decoding accelerates autoregressive large language model (LLM) inference by using a lightweight draft model to propose candidate tokens that are then verified in parallel by the target model. The speedup is significantly determined by the acceptance rate, yet standard training minimizes Kullback-Leibler (KL) divergence as a proxy objective. While KL divergence and acceptance rate share the same global optimum, small draft models, having limited capacity, typically converge to suboptimal solutions where minimizing KL does not guarantee maximizing acceptance rate. To address this issue, we propose LK losses, special training objectives that directly target acceptance rate. Comprehensive experiments across four draft architectures and six target models, ranging from 8B to 685B parameters, demonstrate consistent improvements in acceptance metrics across all configurations compared to the standard KL-based training. We evaluate our approach on general, coding and math domains and report gains of up to 8-10% in average acceptance length. LK losses are easy to implement, introduce no computational overhead and can be directly integrated into any existing speculator training framework, making them a compelling alternative to the existing draft training objectives.

[416] ULW-SleepNet: An Ultra-Lightweight Network for Multimodal Sleep Stage Scoring

Zhaowen Wang, Dongdong Zhou, Qi Xu, Fengyu Cong, Mohammad Al-Sa’d, Jenni Raitoharju

Main category: cs.LG

TL;DR: ULW-SleepNet: Ultra-lightweight multimodal sleep stage scoring framework using novel Dual-Stream Separable Convolution blocks for efficient integration of multiple physiological signals with minimal computational overhead.

DetailsMotivation: Existing deep learning models for sleep stage scoring are computationally demanding and often limited to single-channel EEG, making them impractical for multimodal polysomnography data in real-time wearable/IoT applications.

Method: Proposes ULW-SleepNet with Dual-Stream Separable Convolution (DSSC) blocks, depthwise separable convolutions, channel-wise parameter sharing, and global average pooling to dramatically reduce parameters and FLOPs while maintaining accuracy.
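
The parameter savings of the depthwise-separable design are easy to quantify: a standard 1-D convolution uses $k \cdot C_{in} \cdot C_{out}$ weights, while the separable factorization uses $k \cdot C_{in} + C_{in} \cdot C_{out}$. This is illustrative arithmetic only; the DSSC block's exact layout is in the paper and released code:

```python
def conv1d_params(k, c_in, c_out):
    """Weights of a standard 1-D convolution (bias ignored)."""
    return k * c_in * c_out

def ds_conv1d_params(k, c_in, c_out):
    """Depthwise (k * c_in) plus pointwise (c_in * c_out) weights."""
    return k * c_in + c_in * c_out

dense = conv1d_params(7, 64, 64)         # 28672 weights
separable = ds_conv1d_params(7, 64, 64)  # 4544 weights, ~6.3x fewer
```

Stacking such factorized blocks, with channel-wise parameter sharing on top, is how the model stays at 13.3K parameters while still fusing multiple signal modalities.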

Result: Achieves 86.9% accuracy on Sleep-EDF-20 and 81.4% on Sleep-EDF-78 with only 13.3K parameters and 7.89M FLOPs, reducing parameters by up to 98.6% compared to state-of-the-art methods with marginal performance loss.

Conclusion: ULW-SleepNet demonstrates strong potential for real-time sleep monitoring on wearable and IoT devices by efficiently integrating multimodal physiological signals with ultra-lightweight architecture.

Abstract: Automatic sleep stage scoring is crucial for the diagnosis and treatment of sleep disorders. Although deep learning models have advanced the field, many existing models are computationally demanding and designed for single-channel electroencephalography (EEG), limiting their practicality for multimodal polysomnography (PSG) data. To overcome this, we propose ULW-SleepNet, an ultra-lightweight multimodal sleep stage scoring framework that efficiently integrates information from multiple physiological signals. ULW-SleepNet incorporates a novel Dual-Stream Separable Convolution (DSSC) Block, depthwise separable convolutions, channel-wise parameter sharing, and global average pooling to reduce computational overhead while maintaining competitive accuracy. Evaluated on the Sleep-EDF-20 and Sleep-EDF-78 datasets, ULW-SleepNet achieves accuracies of 86.9% and 81.4%, respectively, with only 13.3K parameters and 7.89M FLOPs. Compared to state-of-the-art methods, our model reduces parameters by up to 98.6% with only marginal performance loss, demonstrating its strong potential for real-time sleep monitoring on wearable and IoT devices. The source code for this study is publicly available at https://github.com/wzw999/ULW-SLEEPNET.

[417] A Theory of Random Graph Shift in Truncated-Spectrum vRKHS

Zhang Wan, Tingting Mu, Samuel Kaski

Main category: cs.LG

TL;DR: Theoretical framework for graph classification under domain shift using random graph models, with generalization bounds factoring domain discrepancy, spectral geometry, and amplitude terms.

DetailsMotivation: Existing domain adaptation theories don't adequately handle the structured nature of graph data and specialized graph learning architectures, making fine-grained analysis of graph distribution shifts challenging.

Method: Proposes a theory using random graph models as data generative process, connecting to hypothesis complexity via vector-valued reproducing kernel Hilbert space (vRKHS) formulation to derive generalization bounds.

Result: Derived generalization bound with shift penalty factorized into: (1) domain discrepancy term, (2) spectral-geometry term from truncated spectrum, and (3) amplitude term aggregating convergence and stability effects.

Conclusion: Provides theoretical framework for analyzing graph domain shift with empirical validation on real and simulated data, offering insights into factors affecting generalization in graph classification.

Abstract: This paper develops a theory of graph classification under domain shift through a random-graph generative lens, where we consider intra-class graphs sharing the same random graph model (RGM) and domain shift induced by changes in RGM components. While classic domain adaptation (DA) theories have well underpinned existing techniques for handling graph distribution shift, the information carried by graph samples, which are themselves structured objects, is less explored. The non-Euclidean nature of graphs and specialized architectures for graph learning further complicate a fine-grained analysis of graph distribution shifts. In this paper, we propose a theory that assumes an RGM as the data generative process, exploiting its connection to hypothesis complexity from a function-space perspective to enable such fine-grained analysis. Building on a vector-valued reproducing kernel Hilbert space (vRKHS) formulation, we derive a generalization bound whose shift penalty admits a factorization into (i) a domain discrepancy term, (ii) a spectral-geometry term summarized by the accessible truncated spectrum, and (iii) an amplitude term that aggregates convergence and construction-stability effects. We empirically verify the insights on these terms on both real data and simulations.

[418] Hierarchical Concept-based Interpretable Models

Oscar Hill, Mateo Espinosa Zarlenga, Mateja Jamnik

Main category: cs.LG

TL;DR: HiCEMs introduce hierarchical concept structures to improve interpretability of neural networks, with automatic concept splitting to discover finer-grained sub-concepts without additional annotations.

DetailsMotivation: Current Concept Embedding Models (CEMs) lack representation of inter-concept relationships and require extensive concept annotations at different granularities during training, limiting their practical applicability for model interpretation and debugging.

Method: Proposes Hierarchical Concept Embedding Models (HiCEMs) that explicitly model concept relationships through hierarchical structures, and introduces Concept Splitting to automatically discover finer-grained sub-concepts from pretrained CEM embedding spaces without needing additional annotations.
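
The discovery step can be caricatured as clustering a concept's embedding vectors into candidate sub-concepts; the paper's annotation-free procedure is more involved, but a toy 2-means version conveys the idea:

```python
import numpy as np

def split_concept(emb, iters=20):
    """Toy sub-concept discovery: 2-means over one concept's embedding
    vectors, deterministically initialized at the first and last points."""
    centers = emb[[0, -1]].astype(float).copy()
    labels = np.zeros(len(emb), dtype=int)
    for _ in range(iters):
        d = np.linalg.norm(emb[:, None, :] - centers[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        for j in range(2):
            if (labels == j).any():
                centers[j] = emb[labels == j].mean(axis=0)
    return labels

rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(0, 0.5, (20, 8)),    # hidden sub-concept A
                 rng.normal(5, 0.5, (20, 8))])   # hidden sub-concept B
labels = split_concept(emb)
```

If a pretrained CEM's embedding space for a coarse concept really contains two modes, they fall out as separate clusters, each of which can then seed a child node in the HiCEM hierarchy without extra annotations.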

Result: Concept Splitting successfully discovers human-interpretable sub-concepts absent during training, enabling training of accurate HiCEMs. HiCEMs allow powerful test-time concept interventions at different granularities, leading to improved task accuracy across multiple datasets including the new PseudoKitchens dataset.

Conclusion: HiCEMs with Concept Splitting provide a scalable approach to model interpretability that reduces annotation burden while enabling hierarchical concept understanding and intervention capabilities.

Abstract: Modern deep neural networks remain challenging to interpret due to the opacity of their latent representations, impeding model understanding, debugging, and debiasing. Concept Embedding Models (CEMs) address this by mapping inputs to human-interpretable concept representations from which tasks can be predicted. Yet, CEMs fail to represent inter-concept relationships and require concept annotations at different granularities during training, limiting their applicability. In this paper, we introduce Hierarchical Concept Embedding Models (HiCEMs), a new family of CEMs that explicitly model concept relationships through hierarchical structures. To enable HiCEMs in real-world settings, we propose Concept Splitting, a method for automatically discovering finer-grained sub-concepts from a pretrained CEM’s embedding space without requiring additional annotations. This allows HiCEMs to generate fine-grained explanations from limited concept labels, reducing annotation burdens. Our evaluation across multiple datasets, including a user study and experiments on PseudoKitchens, a newly proposed concept-based dataset of 3D kitchen renders, demonstrates that (1) Concept Splitting discovers human-interpretable sub-concepts absent during training that can be used to train highly accurate HiCEMs, and (2) HiCEMs enable powerful test-time concept interventions at different granularities, leading to improved task accuracy.

[419] RewardUQ: A Unified Framework for Uncertainty-Aware Reward Models

Daniel Yang, Samuel Stante, Florian Redhardt, Lena Libon, Parnian Kassraie, Ido Hakimi, Barna Pásztor, Andreas Krause

Main category: cs.LG

TL;DR: RewardUQ: A unified framework for evaluating uncertainty quantification in reward models for LLM alignment, comparing methods on accuracy and calibration metrics.

DetailsMotivation: Current reward models for LLM alignment rely on pointwise reward estimates that ignore epistemic uncertainty from limited human feedback, and uncertainty-aware models have been adopted without thorough comparison or understanding.

Method: Introduces RewardUQ framework to systematically evaluate uncertainty quantification methods for reward models, comparing common approaches using standard accuracy and calibration metrics, and proposing a new ranking strategy combining both dimensions.

Result: Experimental results show model size and initialization have the most meaningful impact on performance, and most prior work could have benefited from alternative design choices.

Conclusion: Provides an open-source framework (RewardUQ) to foster development and evaluation of uncertainty quantification methods for reward models, aiding deployment in downstream applications like active learning and mitigating reward overoptimization.

Abstract: Reward models are central to aligning large language models (LLMs) with human preferences. Yet most approaches rely on pointwise reward estimates that overlook the epistemic uncertainty in reward models arising from limited human feedback. Recent work suggests that quantifying this uncertainty can reduce the costs of human annotation via uncertainty-guided active learning and mitigate reward overoptimization in LLM post-training. However, uncertainty-aware reward models have so far been adopted without thorough comparison, leaving them poorly understood. This work introduces a unified framework, RewardUQ, to systematically evaluate uncertainty quantification for reward models. We compare common methods along standard metrics measuring accuracy and calibration, and we propose a new ranking strategy incorporating both dimensions for a simplified comparison. Our experimental results suggest that model size and initialization have the most meaningful impact on performance, and most prior work could have benefited from alternative design choices. To foster the development and evaluation of new methods and aid the deployment in downstream applications, we release our open-source framework as a Python package. Our code is available at https://github.com/lasgroup/rewarduq.
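
One standard family of uncertainty-quantification methods for reward models, and a plausible baseline in any such comparison, is ensembling: several independently initialized reward heads score the same response, and their disagreement is read as epistemic uncertainty. A minimal sketch with hypothetical linear heads (not RewardUQ's actual API):

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, dim = 5, 16
heads = rng.normal(size=(n_heads, dim))  # independently initialized reward heads
x = rng.normal(size=dim)                 # feature vector of one response

scores = heads @ x                       # one scalar reward per head
reward = scores.mean()                   # pointwise estimate
uncertainty = scores.std()               # epistemic spread across the ensemble
```

In an active-learning loop, responses with high `uncertainty` would be prioritized for human annotation; during post-training, they could be down-weighted to mitigate reward overoptimization.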

[420] Learning Generation Orders for Masked Discrete Diffusion Models via Variational Inference

David Fox, Sam Bowyer, Song Liu, Laurence Aitchison, Raul Santos-Rodriguez, Mengyue Yang

Main category: cs.LG

TL;DR: A variational inference framework for learning parallel generation orders in masked discrete diffusion models, enabling efficient parallel token generation while maintaining sample quality.

DetailsMotivation: Masked discrete diffusion models offer parallel generation efficiency but struggle with balancing parallelism and sample quality. Current approaches use fixed heuristic methods, and learning-based approaches lack proper variational inference formulation.

Method: Proposes a variational inference framework for learning parallel generation orders, with a parameterization for approximate posterior of generation orders that facilitates parallelism and efficient sampling during training.

Result: On GSM8K dataset, achieves 33.1% accuracy with average of only 4 generation steps, outperforming standard competitor methods (23.7-29.0% accuracy) in same number of steps.

Conclusion: The variational inference approach shows promise for parallel generation in masked discrete diffusion models, with competitive performance against heuristic methods in highly parallel regimes.

Abstract: Masked discrete diffusion models (MDMs) are a promising new approach to generative modelling, offering the ability for parallel token generation and therefore greater efficiency than autoregressive counterparts. However, achieving an optimal balance between parallel generation and sample quality remains an open problem. Current approaches primarily address this issue through fixed, heuristic parallel sampling methods. Some recent learning-based approaches to this problem exist, but their formulation from the perspective of variational inference remains underexplored. In this work, we propose a variational inference framework for learning parallel generation orders for MDMs. As part of our method, we propose a parameterisation for the approximate posterior of generation orders which facilitates parallelism and efficient sampling during training. Using this method, we conduct preliminary experiments on the GSM8K dataset, where our method performs competitively against heuristic sampling strategies in the regime of highly parallel generation. For example, our method achieves 33.1% accuracy with an average of only 4 generation steps, compared to 23.7-29.0% accuracy achieved by standard competitor methods in the same number of steps. We believe further experiments and analysis of the method will yield valuable insights into the problem of parallel generation with MDMs.
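
The heuristic baseline the paper competes against can be sketched as confidence-based parallel unmasking: at each step, unmask the k most confident masked positions at once. The random "model" below is a stand-in for a trained denoiser, and this fixed confidence rule is exactly what the paper replaces with a learned generation order:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, vocab, k = 12, 6, 4
tokens = np.full(seq_len, -1)              # -1 marks a masked position

steps = 0
while (tokens == -1).any():
    # stand-in for the model's per-position token distributions
    probs = rng.dirichlet(np.ones(vocab), size=seq_len)
    conf = probs.max(axis=1)
    conf[tokens != -1] = -np.inf           # never revisit filled slots
    pick = np.argsort(conf)[-k:]           # k most confident masked slots
    tokens[pick] = probs[pick].argmax(axis=1)  # unmask them in parallel
    steps += 1
```

With `seq_len = 12` and `k = 4`, the whole sequence is generated in 3 parallel steps rather than 12 sequential ones, which is the efficiency regime the paper targets.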

[421] Intrinsic Lorentz Neural Network

Xianglong Shi, Ziheng Chen, Yunhan Jiang, Nicu Sebe

Main category: cs.LG

TL;DR: ILNN is a fully intrinsic hyperbolic neural network that conducts all computations within the Lorentz model, achieving state-of-the-art performance on image and genomic datasets.

DetailsMotivation: Real-world data often has latent hierarchical structures that are naturally represented by hyperbolic geometry. Existing hyperbolic neural networks are often partially intrinsic, mixing Euclidean operations with hyperbolic ones or relying on extrinsic parameterizations, which limits their effectiveness.

Method: Proposes Intrinsic Lorentz Neural Network (ILNN) with a novel point-to-hyperplane fully connected layer that uses closed-form hyperbolic distances from features to learned Lorentz hyperplanes. Includes intrinsic modules: GyroLBN (Lorentz batch normalization), gyro-additive bias, Lorentz patch-concatenation operator with digamma-based scale, and Lorentz dropout layer.

Result: ILNN achieves state-of-the-art performance and computational cost among hyperbolic models on CIFAR-10/100 and two genomic benchmarks (TEB and GUE), consistently surpassing strong Euclidean baselines.

Conclusion: ILNN demonstrates that fully intrinsic hyperbolic architectures can effectively capture hierarchical structures in data while being computationally efficient and outperforming both hyperbolic and Euclidean baselines.

Abstract: Real-world data frequently exhibit latent hierarchical structures, which can be naturally represented by hyperbolic geometry. Although recent hyperbolic neural networks have demonstrated promising results, many existing architectures remain partially intrinsic, mixing Euclidean operations with hyperbolic ones or relying on extrinsic parameterizations. To address this, we propose the Intrinsic Lorentz Neural Network (ILNN), a fully intrinsic hyperbolic architecture that conducts all computations within the Lorentz model. At its core, the network introduces a novel point-to-hyperplane fully connected (FC) layer, replacing traditional Euclidean affine logits with closed-form hyperbolic distances from features to learned Lorentz hyperplanes, thereby ensuring that the resulting geometric decision functions respect the inherent curvature. Around this fundamental layer, we design intrinsic modules: GyroLBN, a Lorentz batch normalization that couples gyro-centering with gyro-scaling, consistently outperforming both LBN and GyroBN while reducing training time. We additionally propose a gyro-additive bias for the FC output, a Lorentz patch-concatenation operator that aligns the expected log-radius across feature blocks via a digamma-based scale, and a Lorentz dropout layer. Extensive experiments conducted on CIFAR-10/100 and two genomic benchmarks (TEB and GUE) illustrate that ILNN achieves state-of-the-art performance and computational cost among hyperbolic models and consistently surpasses strong Euclidean baselines. The code is available at https://github.com/Longchentong/ILNN.
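
The Lorentz-model primitives underlying ILNN are concrete: points live on a hyperboloid, the Lorentzian inner product replaces the Euclidean one, and geodesic distances come in closed form. A minimal sketch of those primitives at curvature -1 (the paper's signed point-to-hyperplane logit is not reproduced here):

```python
import numpy as np

def lorentz_inner(x, y):
    # Lorentzian inner product: -x0*y0 + <x_rest, y_rest>
    return -x[0] * y[0] + x[1:] @ y[1:]

def to_hyperboloid(v):
    # lift a Euclidean vector onto the unit hyperboloid, where
    # every point satisfies <x, x>_L = -1
    x0 = np.sqrt(1.0 + v @ v)
    return np.concatenate([[x0], v])

def lorentz_distance(x, y):
    # closed-form geodesic distance on the hyperboloid
    return np.arccosh(np.clip(-lorentz_inner(x, y), 1.0, None))

x = to_hyperboloid(np.array([0.3, -0.1]))
y = to_hyperboloid(np.array([0.3, -0.1]))
```

ILNN's FC layer replaces Euclidean affine logits with distances of this kind, measured from features to learned Lorentz hyperplanes instead of between pairs of points.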

[422] MINT: Multimodal Imaging-to-Speech Knowledge Transfer for Early Alzheimer’s Screening

Vrushank Ahire, Yogesh Kumar, Anouck Girard, M. A. Ganaie

Main category: cs.LG

TL;DR: MINT transfers MRI biomarker knowledge to speech encoder for Alzheimer’s screening without needing MRI at inference

DetailsMotivation: MRI biomarkers for Alzheimer's are expensive and inaccessible; speech analysis is non-invasive but lacks biological grounding. Need to transfer MRI decision boundaries to speech for reliable early detection.

Method: Three-stage cross-modal framework: 1) Train MRI teacher on 1,228 subjects, 2) Use residual projection head to align speech representations to frozen MRI embedding space via geometric loss, 3) Apply frozen MRI classifier to aligned speech embeddings at inference.

Result: Aligned speech achieves AUC 0.720 vs 0.711 for speech-only baselines, comparable performance without imaging at inference. Multimodal fusion improves over MRI alone (0.973 vs 0.958).

Conclusion: First demonstration of MRI-to-speech knowledge transfer for Alzheimer’s screening, enabling biologically grounded population-level cognitive triage without neuroimaging infrastructure.

Abstract: Alzheimer’s disease is a progressive neurodegenerative disorder in which mild cognitive impairment (MCI) marks a critical transition between aging and dementia. Neuroimaging modalities, such as structural MRI, provide biomarkers of this transition; however, their high costs and infrastructure needs limit their deployment at a population scale. Speech analysis offers a non-invasive alternative, but speech-only classifiers are developed independently of neuroimaging, leaving decision boundaries biologically ungrounded and limiting reliability on the subtle CN-versus-MCI distinction. We propose MINT (Multimodal Imaging-to-Speech Knowledge Transfer), a three-stage cross-modal framework that transfers biomarker structure from MRI into a speech encoder at training time. An MRI teacher, trained on 1,228 subjects, defines a compact neuroimaging embedding space for CN-versus-MCI classification. A residual projection head aligns speech representations to this frozen imaging manifold via a combined geometric loss, adapting speech to the learned biomarker space while preserving imaging encoder fidelity. The frozen MRI classifier, which is never exposed to speech, is applied to aligned embeddings at inference and requires no scanner. Evaluation on ADNI-4 shows aligned speech achieves performance comparable to speech-only baselines (AUC 0.720 vs 0.711) while requiring no imaging at inference, demonstrating that MRI-derived decision boundaries can ground speech representations. Multimodal fusion improves over MRI alone (0.973 vs 0.958). Ablation studies identify dropout regularization and self-supervised pretraining as critical design decisions. To our knowledge, this is the first demonstration of MRI-to-speech knowledge transfer for early Alzheimer’s screening, establishing a biologically grounded pathway for population-level cognitive triage without neuroimaging at inference.
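
The alignment stage can be sketched as fitting a residual projection so that speech embeddings land near their frozen MRI-space targets. The single linear projection and plain squared-distance loss below are illustrative stand-ins for MINT's projection head and combined geometric loss:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
speech = rng.normal(size=d)
speech /= np.linalg.norm(speech)       # unit-norm speech embedding
mri = rng.normal(size=d)               # frozen MRI-space target (never updated)
W = np.zeros((d, d))                   # residual projection head (trainable)

lr = 0.05
for _ in range(200):
    aligned = speech + W @ speech      # residual projection into MRI space
    # gradient of ||aligned - mri||^2 with respect to W
    W -= lr * 2.0 * np.outer(aligned - mri, speech)

final_gap = np.linalg.norm(speech + W @ speech - mri)
```

At inference, the frozen MRI classifier would be applied to `aligned` directly, so no scan is needed; only the projection head ever sees speech.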

[423] Foundation World Models for Agents that Learn, Verify, and Adapt Reliably Beyond Static Environments

Florent Delgrange

Main category: cs.LG

TL;DR: Foundation world models for autonomous agents that combine reinforcement learning, program synthesis, and formal verification to enable reliable adaptation in open worlds.

DetailsMotivation: Current autonomous agents struggle with open-world adaptation as they assume fixed tasks and environments, limiting their ability to evolve policies when conditions change. There's a need for persistent, compositional representations that can support reliable adaptation.

Method: Proposes a framework with four components: (1) learnable reward models from specifications, (2) adaptive formal verification integrated throughout learning, (3) online abstraction calibration to quantify prediction reliability, and (4) test-time synthesis and world-model generation guided by verifiers.

Result: Enables agents to synthesize verifiable programs, derive new policies from few interactions, and maintain correctness while adapting to novelty. Provides a substrate for learning, reasoning, and adaptation with explainable behavior.

Conclusion: Foundation world models represent a comprehensive approach to creating autonomous agents that can reliably adapt to changing conditions while maintaining verifiable correctness and explainable behavior.

Abstract: The next generation of autonomous agents must not only learn efficiently but also act reliably and adapt their behavior in open worlds. Standard approaches typically assume fixed tasks and environments with little or no novelty, which limits world models’ ability to support agents that must evolve their policies as conditions change. This paper outlines a vision for foundation world models: persistent, compositional representations that unify reinforcement learning, reactive/program synthesis, and abstraction mechanisms. We propose an agenda built around four components: (i) learnable reward models from specifications to support optimization with clear objectives; (ii) adaptive formal verification integrated throughout learning; (iii) online abstraction calibration to quantify the reliability of the model’s predictions; and (iv) test-time synthesis and world-model generation guided by verifiers. Together, these components enable agents to synthesize verifiable programs, derive new policies from a small number of interactions, and maintain correctness while adapting to novelty. The resulting framework positions foundation world models as a substrate for learning, reasoning, and adaptation, laying the groundwork for agents that not only act well but can explain and justify the behavior they adopt.

[424] Taming Momentum: Rethinking Optimizer States Through Low-Rank Approximation

Zhengbo Wang, Jian Liang, Ran He, Zilei Wang, Tieniu Tan

Main category: cs.LG

TL;DR: LoRA-Pre is a low-rank optimizer that reduces memory overhead in training large language models by reformulating momentum as online linear regression and using low-rank decomposition.

DetailsMotivation: Modern optimizers like Adam and Muon have significant memory overhead due to first- and second-order momenta, which constrains scalability and computational efficiency for large language models.

Method: Reformulates exponential moving average (EMA) as training a linear regressor via online gradient flow, then introduces LoRA-Pre which decomposes the full momentum matrix into a compact low-rank subspace within the online linear learner.

Result: Achieves highest performance across Llama models from 60M to 1B parameters, demonstrates remarkable rank efficiency (1/8 rank of baselines), and outperforms fine-tuning baselines with improvements of 3.14 points on Llama-3.1-8B and 6.17 points on Llama-2-7B.

Conclusion: LoRA-Pre provides an effective low-rank optimizer that maintains optimization performance while significantly improving memory efficiency for both pre-training and fine-tuning of large language models.

Abstract: Modern optimizers like Adam and Muon are central to training large language models, but their reliance on first- and second-order momenta introduces significant memory overhead, which constrains scalability and computational efficiency. In this work, we reframe the exponential moving average (EMA) used in these momenta as the training of a linear regressor via online gradient flow. Building on this equivalence, we introduce LoRA-Pre, a novel low-rank optimizer designed for efficient pre-training. Specifically, LoRA-Pre reduces the optimizer’s memory footprint by decomposing the full momentum matrix into a compact low-rank subspace within the online linear learner, thereby maintaining optimization performance while improving memory efficiency. We empirically validate LoRA-Pre’s efficacy by pre-training models from the Llama architecture family, scaling from 60M to 1B parameters. LoRA-Pre achieves the highest performance across all model sizes. Notably, LoRA-Pre demonstrates remarkable rank efficiency, achieving comparable or superior results using only 1/8 the rank of baseline methods. Beyond pre-training, we evaluate LoRA-Pre’s effectiveness in fine-tuning scenarios. With the same rank, LoRA-Pre consistently outperforms all efficient fine-tuning baselines. Specifically, compared to standard LoRA, LoRA-Pre achieves substantial improvements of 3.14 points on Llama-3.1-8B and 6.17 points on Llama-2-7B, validating our approach’s effectiveness across both pre-training and fine-tuning paradigms. Our code is publicly available at https://github.com/mrflogs/LoRA-Pre.
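
The EMA-as-online-regression reframing rests on an identity that is easy to check numerically: an EMA update m <- beta*m + (1-beta)*g is exactly one gradient step of size (1-beta) on the least-squares loss 0.5*||m - g||^2, whose gradient in m is (m - g). A quick check (LoRA-Pre's low-rank subspace machinery is not reproduced here):

```python
import numpy as np

beta = 0.9
m = np.array([1.0, -2.0, 0.5])   # current momentum state
g = np.array([0.2, 0.3, -0.1])   # incoming gradient

# standard exponential moving average update
ema = beta * m + (1 - beta) * g

# one gradient step of size (1 - beta) on 0.5 * ||m - g||^2
grad_step = m - (1 - beta) * (m - g)
```

Once momentum is viewed as an online linear learner, LoRA-Pre's contribution is to constrain that learner to a compact low-rank subspace, shrinking the optimizer state.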

[425] InfoNCE Induces Gaussian Distribution

Roy Betser, Eyal Gofer, Meir Yossef Levi, Guy Gilboa

Main category: cs.LG

TL;DR: InfoNCE contrastive learning induces Gaussian structure in learned representations, proven theoretically under certain assumptions and supported by experiments.

DetailsMotivation: Contrastive learning is fundamental for modern representation learning, but the statistical properties of representations learned via InfoNCE loss are not well understood. The paper aims to characterize the emergent Gaussian structure in these representations.

Method: Theoretical analysis in two regimes: 1) Under alignment and concentration assumptions, showing projections approach multivariate Gaussian distribution; 2) With less strict assumptions using small regularization promoting low feature norm and high entropy. Experimental validation on synthetic and CIFAR-10 datasets across various encoder architectures.

Result: Both theoretical analysis and experiments demonstrate consistent Gaussian behavior in representations learned via InfoNCE contrastive training. The Gaussian model enables principled analytical treatment of learned representations.

Conclusion: InfoNCE contrastive learning naturally induces Gaussian structure in representations, providing theoretical foundation for observed empirical patterns and enabling analytical approaches to contrastive learning applications.

Abstract: Contrastive learning has become a cornerstone of modern representation learning, allowing training with massive unlabeled data for both task-specific and general (foundation) models. A prototypical loss in contrastive training is InfoNCE and its variants. In this work, we show that the InfoNCE objective induces Gaussian structure in representations that emerge from contrastive training. We establish this result in two complementary regimes. First, we show that under certain alignment and concentration assumptions, projections of the high-dimensional representation asymptotically approach a multivariate Gaussian distribution. Next, under less strict assumptions, we show that adding a small asymptotically vanishing regularization term that promotes low feature norm and high feature entropy leads to similar asymptotic results. We support our analysis with experiments on synthetic and CIFAR-10 datasets across multiple encoder architectures and sizes, demonstrating consistent Gaussian behavior. This perspective provides a principled explanation for commonly observed Gaussianity in contrastive representations. The resulting Gaussian model enables principled analytical treatment of learned representations and is expected to support a wide range of applications in contrastive learning.
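
For reference, the InfoNCE objective under analysis can be computed in a few lines on toy embeddings: matched pairs sit on the diagonal of the similarity matrix and the rest of the batch serves as negatives. A minimal sketch (not the paper's code):

```python
import numpy as np

def info_nce(anchors, positives, tau=0.1):
    # cosine similarities between L2-normalized embeddings
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / tau
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))              # matched pairs on diagonal

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 4))
loss_aligned = info_nce(z, z)                    # positives identical to anchors
loss_random = info_nce(z, rng.normal(size=(8, 4)))  # unrelated positives
```

The paper's claim concerns what minimizing this objective does to the distribution of the learned representations, not any particular implementation.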

[426] pathsig: A GPU-Accelerated Library for Truncated and Projected Path Signatures

Tobias Nygaard

Main category: cs.LG

TL;DR: pathsig is a PyTorch-native library for computing path signatures with high GPU throughput and minimal memory, offering 10-30x speedups for truncated signatures and supporting flexible projections for compact representations.

DetailsMotivation: Existing path signature libraries lack scalability for large-scale, gradient-based learning in machine learning models, creating a need for efficient computation that integrates well with modern deep learning frameworks.

Method: Develops a PyTorch-native library using CUDA kernels to compute path signatures directly in the word basis, updating signature coefficients in parallel over prefix-closed word sets, with support for anisotropic truncation and user-specified projections.

Result: Achieves 10-30x speedups for truncated signature computation and 4-10x speedups in training requiring backpropagation through signatures, with high GPU throughput and near-minimal peak memory usage.

Conclusion: pathsig provides a scalable, efficient solution for integrating path signatures into modern machine learning pipelines, enabling their use in large-scale gradient-based learning with significant performance improvements.

Abstract: Path signatures provide a rich representation of sequential data, with strong theoretical guarantees and good performance in a variety of machine-learning tasks. While signatures have progressed from fixed feature extractors to trainable components of machine-learning models, existing libraries often lack the required scalability for large-scale, gradient-based learning. To address this gap, this paper introduces pathsig, a PyTorch-native library that computes path signatures directly in the word basis. By using CUDA kernels to update signature coefficients in parallel over prefix-closed word sets, pathsig achieves high GPU throughput and near-minimal peak memory. Compared with other libraries, pathsig achieves 10-30x speedups for computation of truncated signatures and up to 4-10x speedups in training runs that require backpropagation through the signature. Beyond regular truncation, pathsig supports projections of the (infinite-dimensional) signature onto user-specified sets of words and anisotropic truncation motivated by inhomogeneous path regularity, enabling more compact representations that can reduce dimensionality, redundancy, and computational cost.
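
A depth-2 truncated signature can be accumulated segment by segment via Chen's identity, which is essentially the recursion a word-basis implementation parallelizes. The toy version below also exhibits the shuffle identity S^(i,j) + S^(j,i) = S^(i) * S^(j) that the level-2 coefficients satisfy. (Illustrative sketch, not pathsig's CUDA implementation.)

```python
import numpy as np

def signature_depth2(path):
    """Depth-2 truncated signature of a piecewise-linear path.
    Level 1 is the total increment; level 2 collects the iterated
    integrals, accumulated one segment at a time (Chen's identity)."""
    dim = path.shape[1]
    s1 = np.zeros(dim)
    s2 = np.zeros((dim, dim))
    for k in range(1, len(path)):
        delta = path[k] - path[k - 1]
        s2 += np.outer(s1, delta) + 0.5 * np.outer(delta, delta)
        s1 += delta
    return s1, s2

path = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, 2.0], [2.0, 1.0]])
s1, s2 = signature_depth2(path)
```

In practice the number of coefficients grows exponentially with truncation depth, which is why projections onto user-specified word sets, as pathsig offers, matter for compactness.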

[427] Leveraging Non-linear Dimension Reduction and Random Walk Co-occurrence for Node Embedding

Ryan DeWolfe

Main category: cs.LG

TL;DR: COVE is a high-dimensional graph embedding method that removes low-dimension constraints, uses co-occurrence on random walks for similarity, and performs similarly to Louvain algorithm when combined with UMAP and HDBSCAN for community detection.

DetailsMotivation: The paper aims to overcome the limitations of low-dimensional node embeddings by proposing a high-dimensional embedding approach that maintains explainability while potentially improving performance on graph analysis tasks like clustering and link prediction.

Method: COVE uses non-linear dimension reduction techniques to create high-dimensional embeddings inspired by neural embedding methods that leverage co-occurrence on random walks as similarity indicators. The method is closely related to diffusion processes and can be reduced to lower dimensions using UMAP for downstream tasks.

Result: When reduced to low dimension with UMAP, COVE slightly increases performance on clustering and link prediction tasks. A COVE-UMAP-HDBSCAN pipeline performs similarly to the popular Louvain algorithm on community detection benchmarks.

Conclusion: High-dimensional graph embeddings like COVE offer an alternative to traditional low-dimensional approaches, maintaining explainability while achieving competitive performance with established community detection methods like Louvain.

Abstract: Leveraging non-linear dimension reduction techniques, we remove the low-dimension constraint from node embedding and propose COVE, an explainable high-dimensional embedding that, when reduced to low dimension with UMAP, slightly increases performance on clustering and link prediction tasks. The embedding is inspired by neural embedding methods that use co-occurrence on a random walk as an indication of similarity, and is closely related to a diffusion process. Extending recent community detection benchmarks, we find that a COVE-UMAP-HDBSCAN pipeline performs similarly to the popular Louvain algorithm.
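
The similarity signal COVE builds on can be sketched directly: run random walks on a graph and count how often node pairs co-occur within a window. The toy graph, walk counts, and window size below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
# small path-like graph: a triangle {0,1,2} with a tail 2-3-4
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
n, walk_len, window = 5, 20, 2

cooc = np.zeros((n, n))
for start in adj:
    for _ in range(50):                       # walks per start node
        walk = [start]
        for _ in range(walk_len - 1):
            walk.append(rng.choice(adj[walk[-1]]))
        # count co-occurrences within the window
        for i, u in enumerate(walk):
            for v in walk[i + 1 : i + 1 + window]:
                cooc[u, v] += 1
                cooc[v, u] += 1
```

The resulting co-occurrence matrix serves as the high-dimensional similarity representation; nodes 0 and 1 (adjacent in the triangle) co-occur constantly, while nodes 0 and 4 are too far apart to land in the same window at all.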

[428] Adaptive Correlation-Weighted Intrinsic Rewards for Reinforcement Learning

Viet Bac Nguyen, Phuong Thai Nguyen

Main category: cs.LG

TL;DR: ACWI is an adaptive intrinsic reward scaling framework that dynamically balances intrinsic and extrinsic rewards for better exploration in sparse reward RL, using a state-dependent Beta Network instead of fixed coefficients.

DetailsMotivation: Conventional RL approaches use manually tuned scalar coefficients to balance intrinsic and extrinsic rewards, which often leads to unstable or suboptimal performance across different tasks. There's a need for an adaptive mechanism that can dynamically adjust this balance based on the agent's state and task requirements.

Method: ACWI introduces a lightweight Beta Network that predicts intrinsic reward weights directly from agent states using an encoder-based architecture. The scaling mechanism is optimized using a correlation-based objective that aligns weighted intrinsic rewards with discounted future extrinsic returns, enabling task-adaptive exploration incentives.

Result: ACWI consistently improves sample efficiency and learning stability compared to fixed intrinsic reward baselines in sparse reward MiniGrid environments, achieving superior performance with minimal computational overhead.

Conclusion: ACWI provides an effective adaptive framework for balancing intrinsic and extrinsic rewards in sparse reward RL, offering better exploration and stability than fixed coefficient approaches while maintaining computational efficiency.

Abstract: We propose ACWI (Adaptive Correlation Weighted Intrinsic), an adaptive intrinsic reward scaling framework designed to dynamically balance intrinsic and extrinsic rewards for improved exploration in sparse reward reinforcement learning. Unlike conventional approaches that rely on manually tuned scalar coefficients, which often result in unstable or suboptimal performance across tasks, ACWI learns a state-dependent scaling coefficient online. Specifically, ACWI introduces a lightweight Beta Network that predicts the intrinsic reward weight directly from the agent state through an encoder-based architecture. The scaling mechanism is optimized using a correlation-based objective that encourages alignment between the weighted intrinsic rewards and discounted future extrinsic returns. This formulation enables task-adaptive exploration incentives while preserving computational efficiency and training stability. We evaluate ACWI on a suite of sparse reward environments in MiniGrid. Experimental results demonstrate that ACWI consistently improves sample efficiency and learning stability compared to fixed intrinsic reward baselines, achieving superior performance with minimal computational overhead.
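
The reward combination and the correlation signal ACWI optimizes can be sketched as follows; the logistic map over a linear feature below is a toy stand-in for the paper's encoder-based Beta Network, on synthetic trajectory data:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, gamma = 100, 4, 0.99
states = rng.normal(size=(T, d))
r_ext = rng.normal(size=T)                 # sparse/extrinsic rewards
r_int = rng.normal(size=T)                 # exploration bonus (e.g. curiosity)

w = rng.normal(size=d) * 0.1               # stand-in "Beta Network" weights
beta = 1 / (1 + np.exp(-(states @ w)))     # state-dependent weight in (0, 1)
shaped = r_ext + beta * r_int              # reward actually handed to the agent

# discounted future extrinsic returns, computed backwards
returns = np.zeros(T)
acc = 0.0
for t in reversed(range(T)):
    acc = r_ext[t] + gamma * acc
    returns[t] = acc

weighted_int = beta * r_int
# the alignment signal: correlation between weighted intrinsic rewards
# and discounted future extrinsic returns
corr = np.corrcoef(weighted_int, returns)[0, 1]
```

Training the Beta Network to increase `corr` pushes intrinsic bonuses toward states whose exploration actually pays off in extrinsic return, replacing a fixed scalar coefficient.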

[429] Neural Diffusion Intensity Models for Point Process Data

Xinlong Du, Harsha Honnappa, Vinayak Rao

Main category: cs.LG

TL;DR: Neural Diffusion Intensity Models: A variational framework using neural SDEs for Cox processes with theoretical guarantees and efficient amortized inference.

DetailsMotivation: Cox processes are useful for modeling overdispersed point process data but suffer from intractable nonparametric estimation and expensive MCMC-based posterior inference, creating a need for more efficient methods.

Method: Proposes Neural Diffusion Intensity Models using neural SDEs with a variational framework. Key theoretical result shows conditioning on point process observations preserves diffusion structure with explicit drift correction. Uses amortized encoder architecture mapping event sequences to posterior intensity paths via drift-corrected SDE simulation.

Result: Demonstrates accurate recovery of latent intensity dynamics and posterior paths on synthetic and real-world data, achieving orders-of-magnitude speedups over MCMC-based methods.

Conclusion: The framework provides efficient variational inference for Cox processes with theoretical guarantees, replacing expensive MCMC with single forward pass inference while maintaining accuracy.

Abstract: Cox processes model overdispersed point process data via a latent stochastic intensity, but both nonparametric estimation of the intensity model and posterior inference over intensity paths are typically intractable, relying on expensive MCMC methods. We introduce Neural Diffusion Intensity Models, a variational framework for Cox processes driven by neural SDEs. Our key theoretical result, based on enlargement of filtrations, shows that conditioning on point process observations preserves the diffusion structure of the latent intensity with an explicit drift correction. This guarantees the variational family contains the true posterior, so that ELBO maximization coincides with maximum likelihood estimation under sufficient model capacity. We design an amortized encoder architecture that maps variable-length event sequences to posterior intensity paths by simulating the drift-corrected SDE, replacing repeated MCMC runs with a single forward pass. Experiments on synthetic and real-world data demonstrate accurate recovery of latent intensity dynamics and posterior paths, with orders-of-magnitude speedups over MCMC-based methods.
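
The generative side of such a model can be sketched by Euler-Maruyama simulation of a latent intensity followed by thinning of a dominating Poisson process. The mean-reverting drift and diffusion coefficients below are illustrative stand-ins for a learned neural SDE:

```python
import numpy as np

rng = np.random.default_rng(0)
T, dt = 10.0, 0.01
n_steps = int(T / dt)

# latent log-intensity follows a simple mean-reverting SDE (Euler-Maruyama)
x = 0.0
lam = np.empty(n_steps)
for i in range(n_steps):
    x += -0.5 * x * dt + 0.3 * np.sqrt(dt) * rng.normal()
    lam[i] = np.exp(x)                  # exponentiate: intensity stays positive

# thinning against an upper bound lam_max yields the Cox process events
lam_max = lam.max()
candidates = rng.uniform(0, T, rng.poisson(lam_max * T))
idx = np.minimum((candidates / dt).astype(int), n_steps - 1)
keep = rng.uniform(0, lam_max, len(candidates)) <= lam[idx]
events = np.sort(candidates[keep])
```

The paper's contribution runs in the other direction: given observed `events`, its drift-corrected SDE and amortized encoder recover the posterior over latent intensity paths in a single forward pass.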

[430] Agentic AI-RAN: Enabling Intent-Driven, Explainable and Self-Evolving Open RAN Intelligence

Zhizhou He, Yang Luo, Xinkai Liu, Mahdi Boloursaz Mashhadi, Mohammad Shojafar, Merouane Debbah, Rahim Tafazolli

Main category: cs.LG

TL;DR: Agentic AI controllers integrated into Open RAN architecture improve network slice lifecycle and radio resource management through planning, tool use, memory, and self-management primitives.

DetailsMotivation: Open RAN's rich control interfaces make multi-tenant, multi-objective RAN operation challenging for safety and auditability, while agentic AI systems offer structured control loops for these complex environments.

Method: Survey of agentic controllers in O-RAN, introducing primitives (Plan-Act-Observe-Reflect, skills as tool use, memory/evidence, self-management gates) and evaluating in multi-cell O-RAN simulation with comparisons to conventional ML/RL xApps.

Result: Agentic controllers achieve 8.83% average reduction in resource usage across three classic network slices compared to conventional baselines, with improvements in slice lifecycle and RRM performance.

Conclusion: Agentic AI controllers provide a promising framework for safe, auditable O-RAN operation, though security, privacy, and compliance remain architectural challenges for standards-aligned deployments.

Abstract: Open RAN (O-RAN) exposes rich control and telemetry interfaces across the Non-RT RIC, Near-RT RIC, and distributed units, but also makes it harder to operate multi-tenant, multi-objective RANs in a safe and auditable manner. In parallel, agentic AI systems with explicit planning, tool use, memory, and self-management offer a natural way to structure long-lived control loops. This article surveys how such agentic controllers can be brought into O-RAN: we review the O-RAN architecture, contrast agentic controllers with conventional ML/RL xApps, and organise the task landscape around three clusters: network slice life-cycle, radio resource management (RRM) closed loops, and cross-cutting security, privacy, and compliance. We then introduce a small set of agentic primitives (Plan-Act-Observe-Reflect, skills as tool use, memory and evidence, and self-management gates) and show, in a multi-cell O-RAN simulation, how they improve slice life-cycle and RRM performance compared to conventional baselines and ablations that remove individual primitives. Security, privacy, and compliance are discussed as architectural constraints and open challenges for standards-aligned deployments. This framework achieves an average 8.83% reduction in resource usage across three classic network slices.
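
The Plan-Act-Observe-Reflect primitive is, at bottom, a control loop. A toy sketch, where the planner, actuator, and telemetry are plain callables standing in for O-RAN interfaces (nothing here is a real RIC API):

```python
def run_agent(env, plan, act, reflect, max_rounds=10):
    history = []
    goal_met = False
    for _ in range(max_rounds):
        step = plan(env, history)        # Plan: choose the next control action
        obs = act(env, step)             # Act + Observe: apply it, read telemetry
        history.append((step, obs))
        goal_met = reflect(history)      # Reflect: check progress against intent
        if goal_met:
            break
    return history, goal_met

# toy intent: drive a scalar "load" metric below a threshold of 0.2
env = {"load": 1.0}
history, ok = run_agent(
    env,
    plan=lambda e, h: -0.3,              # always propose reducing load
    act=lambda e, s: e.__setitem__("load", e["load"] + s) or e["load"],
    reflect=lambda h: h[-1][1] <= 0.2,
)
```

The other primitives slot into this skeleton: skills become tools callable from `act`, memory/evidence accumulates in `history`, and self-management gates veto unsafe `step` proposals before they reach the RAN.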

[431] Learning with a Budget: Identifying the Best Arm with Resource Constraints

Zitian Li, Wang Chi Cheung

Main category: cs.LG

TL;DR: Proposes SH-RR algorithm for best arm identification with resource constraints, unifying stochastic and deterministic consumption settings

DetailsMotivation: Many applications require evaluating alternatives with varying costs/resource usage, creating need for resource-constrained best arm identification

Method: Successive Halving with Resource Rationing (SH-RR) algorithm integrates resource-aware allocation into successive halving framework

Result: Provides theoretical analysis for both stochastic and deterministic consumption settings with new effective consumption measure

Conclusion: SH-RR algorithm effectively handles resource constraints in best arm identification problems

Abstract: In many applications, evaluating the effectiveness of different alternatives comes with varying costs or resource usage. Motivated by such heterogeneity, we study the Best Arm Identification with Resource Constraints (BAIwRC) problem, where an agent seeks to identify the best alternative (aka arm) in the presence of resource constraints. Each arm pull consumes one or more types of limited resources. We make two key contributions. First, we propose the Successive Halving with Resource Rationing (SH-RR) algorithm, which integrates resource-aware allocation into the classical successive halving framework on best arm identification. The SH-RR algorithm unifies the theoretical analysis for both the stochastic and deterministic consumption settings, with a new \textit{effective consumption measure}.
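A minimal sketch of successive halving with a per-round resource ration, assuming pulls that return a (reward, cost) pair; the equal split of the budget across rounds is an illustrative choice, and the paper's actual SH-RR allocation rule is more refined.

```python
import math

def successive_halving_with_budget(arms, pull, total_budget):
    """Generic successive-halving skeleton with resource rationing.

    `arms` is a list of arm ids; `pull(arm)` returns (reward, cost).
    Each round spends an equal share of `total_budget`, then keeps the
    better half of the surviving arms by empirical mean reward.
    """
    active = list(arms)
    rounds = max(1, math.ceil(math.log2(len(arms))))
    ration = total_budget / rounds               # resource per round
    for _ in range(rounds):
        if len(active) == 1:
            break
        spent = 0.0
        rewards = {a: [] for a in active}
        # pull round-robin until the round's ration is exhausted
        while spent < ration:
            for a in active:
                r, cost = pull(a)
                rewards[a].append(r)
                spent += cost
                if spent >= ration:
                    break
        # halve: keep the better half by empirical mean
        ranked = sorted(active, key=lambda a: -sum(rewards[a]) / len(rewards[a]))
        active = ranked[: max(1, len(active) // 2)]
    return active[0]
```

With four arms and unit costs, the skeleton eliminates half the arms per round until one survives.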

[432] What You Read is What You Classify: Highlighting Attributions to Text and Text-Like Inputs

Daniel S. Berman, Brian Merritt, Stanley Ta, Dana Udwin, Amanda Ernlund, Jeremy Ratcliff, Vijay Narayan

Main category: cs.LG

TL;DR: A novel explainable AI method for discrete token inputs that generalizes mask-based image explanation techniques to token sequences, addressing limitations of existing methods that fail with transformers’ global focus.

DetailsMotivation: Current explainable AI methods don't work well with discrete token inputs like text because they fail to handle both local and global features in transformer models, either identifying disparate important tokens or assigning low importance to too many tokens.

Method: Generalizes mask-based explainable AI from images to token sequences. Uses an Explainer neural network trained to create masks that hide irrelevant information for classification. Takes Hadamard product of mask and classifier’s embedding layer values, changing magnitude but keeping orientation unchanged.

Result: The method was trained for a taxonomic classifier for nucleotide sequences, showing that masked segments are less relevant to classification than unmasked ones, producing human-readable explanations.

Conclusion: The approach successfully addresses explainable AI for token-based classifiers by focusing on token segments as a whole, providing interpretable explanations that work with transformer architectures.

Abstract: At present, there are no easily understood explainable artificial intelligence (AI) methods for discrete token inputs, like text. Most explainable AI techniques do not extend well to token sequences, where both local and global features matter, because state-of-the-art models, like transformers, tend to focus on global connections. Therefore, existing explainable AI algorithms fail by (i) identifying disparate tokens of importance, or (ii) assigning a large number of tokens a low value of importance. This method for explainable AI for token-based classifiers generalizes a mask-based explainable AI algorithm for images. It starts with an Explainer neural network that is trained to create masks to hide information not relevant for classification. Then, the Hadamard product of the mask and the continuous values of the classifier’s embedding layer is taken and passed through the classifier, changing the magnitude of the embedding vector but keeping the orientation unchanged. The Explainer is trained for a taxonomic classifier for nucleotide sequences, and it is shown that the masked segments are less relevant to classification than the unmasked ones. This method focuses on the importance of the token as a whole (i.e., a segment of the input sequence), producing a human-readable explanation.
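The masking step (Hadamard product of the mask with the embedding layer, scaling magnitude while preserving orientation) can be sketched in numpy; the Explainer network that produces the mask is stubbed out here as a given vector.

```python
import numpy as np

def apply_token_mask(embeddings, mask):
    """Scale per-token embeddings by a mask in [0, 1].

    embeddings: (seq_len, dim) continuous values from the classifier's
    embedding layer; mask: (seq_len,) values that would come from the
    Explainer network (stubbed here). The Hadamard product changes each
    token vector's magnitude but not its orientation.
    """
    return embeddings * mask[:, None]            # broadcast Hadamard product

emb = np.array([[3.0, 4.0], [1.0, 0.0]])
mask = np.array([0.5, 1.0])
out = apply_token_mask(emb, mask)
# orientation preserved: the masked vector stays parallel to the original
cos = out[0] @ emb[0] / (np.linalg.norm(out[0]) * np.linalg.norm(emb[0]))
```

Because only magnitudes shrink, the downstream classifier sees the same directions with attenuated contributions from masked tokens.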

[433] Sandwiching Polynomials for Geometric Concepts with Low Intrinsic Dimension

Adam R. Klivans, Konstantinos Stavropoulos, Arsen Vasilyan

Main category: cs.LG

TL;DR: New method constructs low-degree sandwiching polynomials for function classes with smooth boundaries, achieving exponential improvements in degree bounds for functions of k halfspaces under Gaussian distribution.

DetailsMotivation: Low-degree sandwiching polynomials have shown power in challenging learning settings like distribution shift, testable learning, and learning with contamination. Prior methods had exponential degree bounds for functions of k halfspaces under Gaussian distribution (2^O(k)), motivating the search for more efficient constructions.

Method: New approach directly uses smoothness of target function’s boundary to construct sandwiching Lipschitz functions, which are amenable to results from high-dimensional approximation theory. Avoids complex methods like FT-mollification used in prior work.

Result: Achieves poly(k) degree sandwiching polynomials for functions of k halfspaces under Gaussian distribution (exponential improvement over prior 2^O(k) bound). For low-dimensional polynomial threshold functions with respect to Gaussians, obtains doubly exponential improvements.

Conclusion: The method provides simpler proofs and significantly better degree bounds for fundamental function classes by leveraging smoothness properties and high-dimensional approximation theory, applicable to low-dimensional function classes with smooth boundaries.

Abstract: Recent work has shown the surprising power of low-degree sandwiching polynomial approximators in the context of challenging learning settings such as learning with distribution shift, testable learning, and learning with contamination. A pair of sandwiching polynomials approximate a target function in expectation while also providing pointwise upper and lower bounds on the function’s values. In this paper, we give a new method for constructing low-degree sandwiching polynomials that yield greatly improved degree bounds for several fundamental function classes and marginal distributions. In particular, we obtain degree $\mathrm{poly}(k)$ sandwiching polynomials for functions of $k$ halfspaces under the Gaussian distribution, improving exponentially over the prior $2^{O(k)}$ bound. More broadly, our approach applies to function classes that are low-dimensional and have smooth boundary. In contrast to prior work, our proof is relatively simple and directly uses the smoothness of the target function’s boundary to construct sandwiching Lipschitz functions, which are amenable to results from high-dimensional approximation theory. For low-dimensional polynomial threshold functions (PTFs) with respect to Gaussians, we obtain doubly exponential improvements without applying the FT-mollification method of Kane used in the best previous result.
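For reference, a pair of degree-$d$ sandwiching polynomials for a function $f$ under a marginal $\mathcal{D}$ satisfies (the standard definition, matching the abstract's description):

```latex
p_{\mathrm{down}}(x) \;\le\; f(x) \;\le\; p_{\mathrm{up}}(x) \quad \text{for all } x,
\qquad
\mathbb{E}_{x \sim \mathcal{D}}\!\left[\, p_{\mathrm{up}}(x) - p_{\mathrm{down}}(x) \,\right] \;\le\; \varepsilon .
```

The paper's headline result is that for functions of $k$ halfspaces under the Gaussian, such pairs exist with degree $d = \mathrm{poly}(k)$ rather than the prior $2^{O(k)}$.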

[434] Multi-Objective Reinforcement Learning for Large-Scale Tote Allocation in Human-Robot Collaborative Fulfillment Centers

Sikata Sengupta, Guangyi Liu, Omer Gottesman, Joseph W Durham, Michael Kearns, Aaron Roth, Michael Caldara

Main category: cs.LG

TL;DR: Multi-Objective Reinforcement Learning approach for optimizing container consolidation in fulfillment centers, balancing processing speed, resource usage, and space utilization while satisfying operational constraints.

DetailsMotivation: Container-based fulfillment centers need to optimize consolidation processes involving human-robotic workstations, requiring trade-offs between competing objectives (speed, resource usage, space utilization) under real-world constraints.

Method: Formulates as large-scale MORL with high-dimensional state spaces, uses constrained RL via best-response and no-regret dynamics in zero-sum games for principled minimax policy learning, includes theoretical framework for error cancellation.

Result: Policy evaluation on realistic warehouse simulations shows effective trade-off balancing, learns single policy satisfying all constraints, theoretical framework handles oscillatory behavior to return solutions close to minimax value.

Conclusion: Demonstrates promise of MORL for complex industrial decision-making, with practical applications in warehouse optimization and theoretical contributions in constrained RL.

Abstract: Optimizing the consolidation process in container-based fulfillment centers requires trading off competing objectives such as processing speed, resource usage, and space utilization while adhering to a range of real-world operational constraints. This process involves moving items between containers via a combination of human and robotic workstations to free up space for inbound inventory and increase container utilization. We formulate this problem as a large-scale Multi-Objective Reinforcement Learning (MORL) task with high-dimensional state spaces and dynamic system behavior. Our method builds on recent theoretical advances in solving constrained RL problems via best-response and no-regret dynamics in zero-sum games, enabling principled minimax policy learning. Policy evaluation on realistic warehouse simulations shows that our approach effectively trades off objectives, and we empirically observe that it learns a single policy that simultaneously satisfies all constraints, even if this is not theoretically guaranteed. We further introduce a theoretical framework to handle the problem of error cancellation, where time-averaged solutions display oscillatory behavior. This method returns a single iterate whose Lagrangian value is close to the minimax value of the game. These results demonstrate the promise of MORL in solving complex, high-impact decision-making problems in large-scale industrial systems.
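The zero-sum-game view of constrained RL referenced in the Method is usually written via the Lagrangian; the objective and constraint symbols below are the standard form, not the paper's notation.

```latex
\min_{\pi}\; \max_{\lambda \ge 0}\; L(\pi, \lambda)
\;=\; \mathbb{E}_{\pi}\!\left[ C_0 \right]
\;+\; \sum_{i} \lambda_i \left( \mathbb{E}_{\pi}\!\left[ C_i \right] - b_i \right),
```

where $C_0$ is the primary objective and each constraint cost $C_i$ must stay below its threshold $b_i$. Best-response and no-regret dynamics alternate updates of $\pi$ and $\lambda$; the paper's error-cancellation framework then returns a single iterate whose Lagrangian value is close to the game's minimax value, rather than an oscillating time-average.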

[435] Flow-Based Density Ratio Estimation for Intractable Distributions with Applications in Genomics

Egor Antipov, Alessandro Palma, Lorenzo Consoli, Stephan Günnemann, Andrea Dittadi, Fabian J. Theis

Main category: cs.LG

TL;DR: Flow matching method for efficient density ratio estimation between intractable distributions using condition-aware generative trajectories

DetailsMotivation: Density ratio estimation is crucial for comparing sample likelihoods under different data-generating processes, but current flow-based methods are computationally expensive due to separate likelihood integrals for each distribution

Method: Leverages condition-aware flow matching to derive a single dynamical formulation for tracking density ratios along generative trajectories, avoiding separate costly likelihood integrals

Result: Competitive performance on simulated benchmarks for closed-form ratio estimation, and versatile applications in single-cell genomics for treatment effect estimation and batch correction evaluation

Conclusion: The proposed flow matching approach enables efficient density ratio estimation with applications in probabilistic modeling and biological data analysis

Abstract: Estimating density ratios between pairs of intractable data distributions is a core problem in probabilistic modeling, enabling principled comparisons of sample likelihoods under different data-generating processes across conditions and covariates. While exact-likelihood models such as normalizing flows offer a promising approach to density ratio estimation, naive flow-based evaluations are computationally expensive, as they require simulating costly likelihood integrals for each distribution separately. In this work, we leverage condition-aware flow matching to derive a single dynamical formulation for tracking density ratios along generative trajectories. We demonstrate competitive performance on simulated benchmarks for closed-form ratio estimation, and show that our method supports versatile tasks in single-cell genomics data analysis, where likelihood-based comparisons of cellular states across experimental conditions enable treatment effect estimation and batch correction evaluation.
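The likelihood bookkeeping behind flow-based ratio estimation is the instantaneous change-of-variables formula; the conditional notation below is assumed, not taken from the paper. For a probability-flow ODE $\dot{x}_t = v_\theta(x_t, t, c)$ conditioned on $c$ and transporting a base density $p_0$ to $p_1(\cdot \mid c)$:

```latex
\frac{d}{dt}\,\log p_t(x_t \mid c) \;=\; -\,\nabla_x \cdot v_\theta(x_t, t, c),
\qquad
\log p_1(x_1 \mid c) \;=\; \log p_0(x_0) \;-\; \int_0^1 \nabla_x \cdot v_\theta(x_t, t, c)\, dt .
```

A naive ratio $\log p_1(x \mid c_a) - \log p_1(x \mid c_b)$ therefore costs two such integrals, one per condition — exactly the expense the paper avoids by tracking the ratio itself along a single dynamical formulation.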

[436] The Stability of Online Algorithms in Performative Prediction

Gabriele Farina, Juan Carlos Perdomo

Main category: cs.LG

TL;DR: The paper presents an unconditional reduction showing that any no-regret algorithm in performative prediction settings converges to performatively stable equilibria, using martingale arguments and randomization to avoid restrictive assumptions about how models influence data distributions.

DetailsMotivation: Algorithmic predictions in decision-making create feedback loops where deployed models influence the data distributions they're later retrained on, formalized as performative prediction. Prior work required strong restrictions on how models influence distributions, limiting practical applicability.

Method: Uses an unconditional reduction approach with martingale arguments and allows randomization to show that any no-regret algorithm converges to (mixed) performatively stable equilibria, avoiding restrictive assumptions about model influence on distributions.

Result: Demonstrates convergence to performatively stable equilibria without restrictive assumptions, sidestepping recent hardness results for finding stable models. Provides insight into why common algorithms like gradient descent are naturally stabilizing.

Conclusion: The work establishes a connection between online optimization and performativity, enabling future technical transfer of ideas between these fields and providing theoretical foundations for understanding feedback loops in algorithmic decision-making.

Abstract: The use of algorithmic predictions in decision-making leads to a feedback loop where the models we deploy actively influence the data distributions we see, and later use to retrain on. This dynamic was formalized by Perdomo et al. 2020 in their work on performative prediction. Our main result is an unconditional reduction showing that any no-regret algorithm deployed in performative settings converges to a (mixed) performatively stable equilibrium: a solution in which models actively shape data distributions in ways that their own predictions look optimal in hindsight. Prior to our work, all positive results in this area made strong restrictions on how models influenced distributions. By using a martingale argument and allowing randomization, we avoid any such assumption and sidestep recent hardness results for finding stable models. Lastly, on a more conceptual note, our connection sheds light on why common algorithms, like gradient descent, are naturally stabilizing and prevent runaway feedback loops. We hope our work enables future technical transfer of ideas between online optimization and performativity.
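The target solution concept, a performatively stable point (Perdomo et al., 2020), is a model that is optimal on the very distribution it induces:

```latex
\theta_{\mathrm{PS}} \;\in\; \arg\min_{\theta}\;
\mathbb{E}_{z \sim \mathcal{D}(\theta_{\mathrm{PS}})}\!\left[\, \ell(\theta; z) \,\right].
```

The paper's reduction shows that any no-regret algorithm converges to a (possibly mixed) equilibrium of this form, with no assumptions on the map $\theta \mapsto \mathcal{D}(\theta)$.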

[437] An Efficient Unsupervised Federated Learning Approach for Anomaly Detection in Heterogeneous IoT Networks

Mohsen Tajgardan, Atena Shiranzaei, Mahdi Rabbani, Reza Khoshkangini, Mahtab Jamali

Main category: cs.LG

TL;DR: Proposes an unsupervised federated learning framework for IoT anomaly detection that leverages shared features from complementary datasets while preserving dataset-specific features, with explainable AI for interpretability.

DetailsMotivation: Federated learning is promising for IoT anomaly detection but faces challenges from heterogeneous data across devices. Need to improve both model performance and privacy while handling feature heterogeneity in unsupervised FL settings.

Method: Unsupervised FL framework that extracts and leverages shared features from two IoT datasets (anomaly detection and device identification) while preserving dataset-specific features. Uses explainable AI techniques like SHAP for transparency.

Result: Experiments on real-world IoT datasets show the proposed method significantly outperforms conventional FL approaches in anomaly detection accuracy.

Conclusion: Using shared features from complementary datasets can optimize unsupervised federated learning for superior anomaly detection in decentralized IoT environments, with explainable AI enhancing transparency.

Abstract: Federated learning (FL) is an effective paradigm for distributed environments such as the Internet of Things (IoT), where data from diverse devices with varying functionalities remains localized while contributing to a shared global model. By eliminating the need to transmit raw data, FL inherently preserves privacy. However, the heterogeneous nature of IoT data, stemming from differences in device capabilities, data formats, and communication constraints, poses significant challenges to maintaining both global model performance and privacy. In the context of IoT-based anomaly detection, unsupervised FL offers a promising means to identify abnormal behavior without centralized data aggregation. Nevertheless, feature heterogeneity across devices complicates model training and optimization, hindering effective implementation. In this study we propose an efficient unsupervised FL framework that enhances anomaly detection by leveraging shared features from two distinct IoT datasets: one focused on anomaly detection and the other on device identification, while preserving dataset-specific features. To improve transparency and interpretability, we employ explainable AI techniques, such as SHAP, to identify key features influencing local model decisions. Experiments conducted on real-world IoT datasets demonstrate that the proposed method significantly outperforms conventional FL approaches in anomaly detection accuracy. This work underscores the potential of using shared features from complementary datasets to optimize unsupervised federated learning and achieve superior anomaly detection results in decentralized IoT environments.
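The aggregation step underlying any such FL framework is federated averaging. The sketch below is the generic FedAvg rule weighted by local sample counts — a baseline building block, not the paper's full unsupervised framework.

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Weighted federated averaging of client model parameters.

    client_weights: one list of numpy arrays (layers) per client;
    client_sizes: local sample counts used as aggregation weights.
    Raw data never leaves the clients; only parameters are shared.
    """
    total = sum(client_sizes)
    n_layers = len(client_weights[0])
    return [
        sum(w[layer] * (n / total)
            for w, n in zip(client_weights, client_sizes))
        for layer in range(n_layers)
    ]
```

With two clients holding 1 and 3 samples, the global parameter is the 1:3 weighted mean of the local ones.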

[438] Comparing Classical and Quantum Variational Classifiers on the XOR Problem

Miras Seilkhan, Adilbek Taizhanov

Main category: cs.LG

TL;DR: Quantum machine learning models (variational quantum classifiers) are compared to classical models (logistic regression and multilayer perceptrons) on XOR classification tasks, showing that deeper quantum circuits can match classical neural network accuracy but don’t show clear advantages in robustness or efficiency.

DetailsMotivation: To investigate how quantum machine learning models compare to classical models on fundamental classification tasks like XOR, examining whether quantum principles (superposition, entanglement) provide advantages in expressivity, robustness, or efficiency for simple pattern recognition problems.

Method: Comparative evaluation of classical models (logistic regression, one-hidden-layer multilayer perceptron) and variational quantum classifiers (two-qubit circuits with depths 1 and 2) on synthetic XOR datasets with varying Gaussian noise and sample sizes, using accuracy and binary cross-entropy metrics.

Result: Model expressivity determines performance: logistic regression and depth-1 quantum circuits fail on XOR, while multilayer perceptrons and depth-2 quantum circuits achieve perfect test accuracy under representative conditions. Despite matching accuracy, classical models achieve lower binary cross-entropy and substantially shorter training times.

Conclusion: Deeper variational quantum classifiers can match classical neural networks in accuracy on low-dimensional XOR benchmarks, but no clear empirical advantage in robustness or efficiency is observed in the examined settings, suggesting quantum models need further development to demonstrate practical advantages over classical approaches.

Abstract: Quantum machine learning applies principles such as superposition and entanglement to data processing and optimization. Variational quantum models operate on qubits in high-dimensional Hilbert spaces and provide an alternative approach to model expressivity. We compare classical models and a variational quantum classifier on the XOR problem. Logistic regression, a one-hidden-layer multilayer perceptron, and a two-qubit variational quantum classifier with circuit depths 1 and 2 are evaluated on synthetic XOR datasets with varying Gaussian noise and sample sizes using accuracy and binary cross-entropy. Performance is determined primarily by model expressivity. Logistic regression and the depth-1 quantum circuit fail to represent XOR reliably, whereas the multilayer perceptron and the depth-2 quantum circuit achieve perfect test accuracy under representative conditions. Robustness analyses across noise levels, dataset sizes, and random seeds confirm that circuit depth is decisive for quantum performance on this task. Despite matching accuracy, the multilayer perceptron achieves lower binary cross-entropy and substantially shorter training time. Hardware execution preserves the global XOR structure but introduces structured deviations in the decision function. Overall, deeper variational quantum classifiers can match classical neural networks in accuracy on low-dimensional XOR benchmarks, but no clear empirical advantage in robustness or efficiency is observed in the examined settings.
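The synthetic setup from the Method can be sketched as four Gaussian clusters at the corners of the unit square, labelled by XOR of the corner coordinates; the cluster centers, noise scale, and seed below are illustrative choices.

```python
import numpy as np

def make_xor(n_per_cluster=50, noise=0.1, seed=0):
    """Noisy XOR dataset: four Gaussian clusters at the unit-square
    corners, with label cx XOR cy. No linear boundary separates the
    two classes, which is why logistic regression and the depth-1
    circuit fail while the MLP and depth-2 circuit succeed.
    """
    rng = np.random.default_rng(seed)
    X, y = [], []
    for cx in (0, 1):
        for cy in (0, 1):
            pts = rng.normal([cx, cy], noise, size=(n_per_cluster, 2))
            X.append(pts)
            y.append(np.full(n_per_cluster, cx ^ cy))
    return np.vstack(X), np.concatenate(y)
```

The dataset is balanced by construction: two corners carry label 0 and two carry label 1.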

[439] Adaptive Combinatorial Experimental Design: Pareto Optimality for Decision-Making and Inference

Hongrui Xie, Junyu Cao, Kan Xu

Main category: cs.LG

TL;DR: First investigation of adaptive combinatorial experimental design in combinatorial multi-armed bandits, focusing on trade-off between regret minimization and statistical power, with Pareto optimal algorithms for full-bandit and semi-bandit feedback.

DetailsMotivation: Address the fundamental tension in combinatorial multi-armed bandits between minimizing regret (exploiting high-reward arms) and achieving statistical power for accurate inference (exploring suboptimal actions). Current approaches typically focus on one objective, but real-world applications often require balancing both.

Method: Formalize the trade-off through Pareto optimality concept. Consider two information structures: full-bandit feedback and semi-bandit feedback. Propose two algorithms: MixCombKL for full-bandit feedback and MixCombUCB for semi-bandit feedback. Provide theoretical guarantees showing both algorithms are Pareto optimal with finite-time guarantees on both regret and estimation error.

Result: Theoretical analysis shows both algorithms achieve Pareto optimality with finite-time guarantees. Richer feedback (semi-bandit vs full-bandit) significantly tightens the attainable Pareto frontier, with primary gains in improved estimation accuracy. Establishes principled framework for adaptive combinatorial experimentation in multi-objective decision-making.

Conclusion: This work provides the first investigation into adaptive combinatorial experimental design in CMAB, establishing a principled framework for balancing regret minimization and statistical power through Pareto optimal algorithms with different feedback structures.

Abstract: In this paper, we provide the first investigation into adaptive combinatorial experimental design, focusing on the trade-off between regret minimization and statistical power in combinatorial multi-armed bandits (CMAB). While minimizing regret requires repeated exploitation of high-reward arms, accurate inference on reward gaps requires sufficient exploration of suboptimal actions. We formalize this trade-off through the concept of Pareto optimality and establish equivalent conditions for Pareto-efficient learning in CMAB. We consider two relevant cases under different information structures, i.e., full-bandit feedback and semi-bandit feedback, and propose two algorithms MixCombKL and MixCombUCB respectively for these two cases. We provide theoretical guarantees showing that both algorithms are Pareto optimal, achieving finite-time guarantees on both regret and estimation error of arm gaps. Our results further reveal that richer feedback significantly tightens the attainable Pareto frontier, with the primary gains arising from improved estimation accuracy under our proposed methods. Taken together, these findings establish a principled framework for adaptive combinatorial experimentation in multi-objective decision-making.
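"Pareto optimal" here means no algorithm can simultaneously improve both axes; informally (notation assumed, with $R$ the cumulative regret and $e$ the estimation error of the arm gaps):

```latex
\nexists\, \mathcal{A}' \;:\;
R_{\mathcal{A}'}(T) \le R_{\mathcal{A}}(T)
\;\;\text{and}\;\;
e_{\mathcal{A}'}(T) \le e_{\mathcal{A}}(T),
\quad \text{with at least one inequality strict.}
```

The paper's feedback result then says the frontier traced out by such pairs is strictly tighter under semi-bandit than under full-bandit feedback, mostly through smaller $e(T)$.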

[440] Time Series Foundation Models as Strong Baselines in Transportation Forecasting: A Large-Scale Benchmark Analysis

Javier Pulido, Filipe Rodrigues

Main category: cs.LG

TL;DR: Chronos-2, a general-purpose time-series foundation model, achieves state-of-the-art or competitive zero-shot performance across diverse transportation forecasting tasks without dataset-specific training.

DetailsMotivation: Current deep learning approaches for transportation forecasting require extensive dataset-specific training, architecture design, and hyperparameter tuning. The paper investigates whether general-purpose time-series foundation models can serve as effective zero-shot forecasters for transportation tasks.

Method: Benchmarks the zero-shot performance of Chronos-2 across ten real-world transportation datasets covering highway traffic volume/flow, urban traffic speed, bike-sharing demand, and EV charging station data. Uses consistent evaluation protocol comparing against classical statistical baselines and specialized deep learning architectures.

Result: Chronos-2 delivers state-of-the-art or competitive accuracy across most datasets without fine-tuning, frequently outperforming specialized models, especially at longer horizons. It also provides useful uncertainty quantification through native probabilistic outputs.

Conclusion: Time-series foundation models like Chronos-2 can serve as effective zero-shot forecasters for transportation tasks, supporting their adoption as key baselines for transportation forecasting research.

Abstract: Accurate forecasting of transportation dynamics is essential for urban mobility and infrastructure planning. Although recent work has achieved strong performance with deep learning models, these methods typically require dataset-specific training, architecture design and hyper-parameter tuning. This paper evaluates whether general-purpose time-series foundation models can serve as forecasters for transportation tasks by benchmarking the zero-shot performance of the state-of-the-art model, Chronos-2, across ten real-world datasets covering highway traffic volume and flow, urban traffic speed, bike-sharing demand, and electric vehicle charging station data. Under a consistent evaluation protocol, we find that, even without any task-specific fine-tuning, Chronos-2 delivers state-of-the-art or competitive accuracy across most datasets, frequently outperforming classical statistical baselines and specialized deep learning architectures, particularly at longer horizons. Beyond point forecasting, we evaluate its native probabilistic outputs using prediction-interval coverage and sharpness, demonstrating that Chronos-2 also provides useful uncertainty quantification without dataset-specific training. In general, this study supports the adoption of time-series foundation models as a key baseline for transportation forecasting research.
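The coverage and sharpness metrics used in the probabilistic evaluation have standard empirical forms, sketched below; the paper's exact protocol (nominal levels, horizons) may differ.

```python
import numpy as np

def interval_metrics(y_true, lower, upper):
    """Empirical coverage and sharpness of prediction intervals.

    Coverage: fraction of true values falling inside [lower, upper];
    sharpness: mean interval width (smaller is sharper). A good
    probabilistic forecaster attains nominal coverage with narrow
    intervals.
    """
    y_true, lower, upper = map(np.asarray, (y_true, lower, upper))
    covered = (y_true >= lower) & (y_true <= upper)
    return covered.mean(), (upper - lower).mean()
```

For a nominal 90% interval, empirical coverage well below 0.9 signals overconfidence; coverage at 0.9 with wide intervals signals poor sharpness.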

[441] Chunk-wise Attention Transducers for Fast and Accurate Streaming Speech-to-Text

Hainan Xu, Vladimir Bataev, Travis M. Bartley, Jagadeesh Balam

Main category: cs.LG

TL;DR: CHAT is a novel RNN-T extension that processes audio in fixed chunks with cross-attention, improving efficiency and accuracy for streaming speech tasks.

DetailsMotivation: To address limitations of RNN-T models in streaming speech processing, particularly the strict monotonic alignment that hurts performance in speech translation, while maintaining real-time constraints.

Method: Extends RNN-T with chunk-wise processing using fixed-size audio chunks and employs cross-attention within each chunk for local alignment modeling, reducing temporal dimension.

Result: Achieves up to 46.2% reduction in peak training memory, 1.36X faster training, 1.69X faster inference, and accuracy improvements: up to 6.3% WER reduction for ASR and 18.0% BLEU improvement for speech translation.

Conclusion: CHAT provides a practical solution for deploying more capable streaming speech models without sacrificing real-time performance, particularly effective for speech translation tasks.

Abstract: We propose Chunk-wise Attention Transducer (CHAT), a novel extension to RNN-T models that processes audio in fixed-size chunks while employing cross-attention within each chunk. This hybrid approach maintains RNN-T’s streaming capability while introducing controlled flexibility for local alignment modeling. CHAT significantly reduces the temporal dimension that RNN-T must handle, yielding substantial efficiency improvements: up to 46.2% reduction in peak training memory, up to 1.36X faster training, and up to 1.69X faster inference. Alongside these efficiency gains, CHAT achieves consistent accuracy improvements over RNN-T across multiple languages and tasks – up to 6.3% relative WER reduction for speech recognition and up to 18.0% BLEU improvement for speech translation. The method proves particularly effective for speech translation, where RNN-T’s strict monotonic alignment hurts performance. Our results demonstrate that the CHAT model offers a practical solution for deploying more capable streaming speech models without sacrificing real-time constraints.
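The chunking idea can be sketched in numpy: split the encoder frames into fixed-size chunks and cross-attend within each, shrinking the temporal dimension the transducer must align over. The single pooled query per chunk below is an illustrative simplification, not CHAT's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def chunk_cross_attention(frames, query, chunk_size):
    """Reduce (T, d) encoder frames to (T // chunk_size, d) by
    cross-attending a query vector to each fixed-size chunk.
    The output sequence is chunk_size times shorter, which is the
    source of the transducer's memory and speed savings.
    """
    T, d = frames.shape
    assert T % chunk_size == 0, "sketch assumes T divisible by chunk_size"
    chunks = frames.reshape(T // chunk_size, chunk_size, d)
    scores = chunks @ query / np.sqrt(d)          # (n_chunks, chunk_size)
    weights = softmax(scores, axis=-1)
    return np.einsum("nc,ncd->nd", weights, chunks)  # one vector per chunk
```

Within a chunk, attention can reorder emphasis freely — the "controlled flexibility for local alignment" — while chunks themselves are still consumed in streaming order.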

[442] Histopathology Image Normalization via Latent Manifold Compaction

Xiaolong Zhang, Jianwei Zhang, Selim Sevim, Emek Demir, Ece Eksi, Xubo Song

Main category: cs.LG

TL;DR: LMC is an unsupervised representation learning framework for histopathology image harmonization that learns batch-invariant embeddings by compacting stain-induced latent manifolds from a single source dataset.

DetailsMotivation: Batch effects from technical variations in histopathology staining protocols, scanners, and acquisition pipelines hinder cross-batch generalization and limit reliable deployment of computational pathology models across clinical sites.

Method: Latent Manifold Compaction (LMC) learns batch-invariant embeddings through explicit compaction of stain-induced latent manifolds from a single source dataset, enabling generalization to unseen target domain data.

Result: LMC substantially reduces batch-induced separations across multiple datasets and consistently outperforms state-of-the-art normalization methods in downstream cross-batch classification and detection tasks.

Conclusion: LMC enables superior generalization for computational pathology models by addressing batch effects through unsupervised representation learning and latent manifold compaction.

Abstract: Batch effects arising from technical variations in histopathology staining protocols, scanners, and acquisition pipelines pose a persistent challenge for computational pathology, hindering cross-batch generalization and limiting reliable deployment of models across clinical sites. In this work, we introduce Latent Manifold Compaction (LMC), an unsupervised representation learning framework that performs image harmonization by learning batch-invariant embeddings from a single source dataset through explicit compaction of stain-induced latent manifolds. This allows LMC to generalize to target domain data unseen during training. Evaluated on three challenging public and in-house benchmarks, LMC substantially reduces batch-induced separations across multiple datasets and consistently outperforms state-of-the-art normalization methods in downstream cross-batch classification and detection tasks, enabling superior generalization.

[443] Coverage-Aware Web Crawling for Domain-Specific Supplier Discovery via a Web–Knowledge–Web Pipeline

Yijiashun Qi, Yijiazhen Qi, Tanmay Wagh

Main category: cs.LG

TL;DR: A web-to-knowledge-to-web pipeline for discovering SMEs in specialized industries using iterative crawling, knowledge graph construction, and coverage estimation.

DetailsMotivation: Existing business databases have substantial coverage gaps for sub-tier suppliers and firms in emerging niche markets, making it difficult to identify the full landscape of SMEs in specialized industry sectors for supply-chain resilience.

Method: Proposes a W→K→W pipeline that iteratively: (1) crawls domain-specific web sources to discover candidate supplier entities, (2) extracts and consolidates structured knowledge into a heterogeneous knowledge graph, and (3) uses the knowledge graph’s topology and coverage signals to guide subsequent crawling toward under-represented regions. Introduces a coverage estimation framework inspired by ecological species-richness estimators.
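
The Chao1 estimator that the coverage framework adapts has a simple closed form: observed richness plus a correction built from singleton and doubleton counts. A minimal sketch for web-entity sightings, using the bias-corrected variant (the paper's exact adaptation may differ):

```python
from collections import Counter

def chao1_estimate(observations):
    """Bias-corrected Chao1 richness estimate from a list of observed
    entity identifiers (one entry per crawl-time sighting)."""
    counts = Counter(observations)
    s_obs = len(counts)                               # distinct entities seen
    f1 = sum(1 for c in counts.values() if c == 1)    # singletons
    f2 = sum(1 for c in counts.values() if c == 2)    # doubletons
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))

def coverage(observations):
    """Estimated fraction of the entity population discovered so far."""
    return len(set(observations)) / chao1_estimate(observations)

sightings = ["a", "a", "b", "c", "c", "d", "e"]  # toy crawl log
est = chao1_estimate(sightings)  # 5 observed + 3*2/(2*(2+1)) = 6.0
```

Many singletons relative to doubletons signal low coverage, which is the cue the pipeline uses to keep crawling under-represented regions.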

Result: Experiments on semiconductor equipment manufacturing sector (NAICS 333242) show the pipeline achieves highest precision (0.138) and F1 (0.118) among all methods using same 213-page crawl budget, building a knowledge graph of 765 entities and 586 relations while reaching peak recall by iteration 3 with only 112 pages.

Conclusion: The W→K→W pipeline effectively discovers SMEs in specialized industries with limited crawl budgets, using knowledge graph-guided crawling and ecological-inspired coverage estimation to improve discovery completeness.

Abstract: Identifying the full landscape of small and medium-sized enterprises (SMEs) in specialized industry sectors is critical for supply-chain resilience, yet existing business databases suffer from substantial coverage gaps – particularly for sub-tier suppliers and firms in emerging niche markets. We propose a \textbf{Web–Knowledge–Web (W$\to$K$\to$W)} pipeline that iteratively (1) crawls domain-specific web sources to discover candidate supplier entities, (2) extracts and consolidates structured knowledge into a heterogeneous knowledge graph, and (3) uses the knowledge graph’s topology and coverage signals to guide subsequent crawling toward under-represented regions of the supplier space. To quantify discovery completeness, we introduce a \textbf{coverage estimation framework} inspired by ecological species-richness estimators (Chao1, ACE) adapted for web-entity populations. Experiments on the semiconductor equipment manufacturing sector (NAICS 333242) demonstrate that the W$\to$K$\to$W pipeline achieves the highest precision (0.138) and F1 (0.118) among all methods using the same 213-page crawl budget, building a knowledge graph of 765 entities and 586 relations while reaching peak recall by iteration 3 with only 112 pages.

[444] Efficient Discovery of Approximate Causal Abstractions via Neural Mechanism Sparsification

Amir Asiaee

Main category: cs.LG

TL;DR: Paper proposes using structured pruning as search for causal abstractions in neural networks, deriving interventional risk objective that yields closed-form pruning criteria based on activation variance.

DetailsMotivation: Neural networks are hypothesized to implement interpretable causal mechanisms, but verifying this requires finding causal abstractions (simpler SCMs faithful under interventions). Current methods for discovering such abstractions are inefficient, requiring brute-force interchange interventions or retraining.

Method: Reframe problem by viewing structured pruning as search over approximate abstractions. Treat trained network as deterministic SCM, derive Interventional Risk objective whose second-order expansion yields closed-form criteria for replacing units with constants or folding them into neighbors. Under uniform curvature, score reduces to activation variance.
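
The "score reduces to activation variance" special case can be sketched directly: rank units by batch variance and fold the dropped units' means into the next layer as constants. Function names and the folding detail are illustrative, not the paper's exact procedure:

```python
import numpy as np

def variance_scores(activations):
    """Per-unit activation variance over a batch; low-variance units are
    candidates for replacement by a constant (their mean)."""
    return activations.var(axis=0)

def prune_to_constants(activations, weight_out, keep_frac=0.5):
    """Replace the lowest-variance units by constants folded into the next
    layer: y = W h ~ W_kept h_kept + W_dropped mean(h_dropped)."""
    scores = variance_scores(activations)
    order = np.argsort(scores)[::-1]          # high variance first
    k = max(1, int(keep_frac * len(scores)))
    keep, drop = order[:k], order[k:]
    bias_shift = weight_out[:, drop] @ activations[:, drop].mean(axis=0)
    return keep, bias_shift

rng = np.random.default_rng(0)
h = rng.normal(size=(256, 8))
h[:, 3] = 1.0                      # a constant unit: zero variance
W = rng.normal(size=(4, 8))
keep, bias = prune_to_constants(h, W, keep_frac=0.75)
```

The constant unit is always pruned first, matching the intuition that units carrying no interventional variation can be abstracted away without changing behavior.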

Result: Method efficiently extracts sparse, intervention-faithful abstractions from pretrained networks, validated via interchange interventions. Recovers variance-based pruning as special case while clarifying when it fails.

Conclusion: Structured pruning can be used as principled approach to discover causal abstractions in neural networks, providing efficient alternative to brute-force intervention methods.

Abstract: Neural networks are hypothesized to implement interpretable causal mechanisms, yet verifying this requires finding a causal abstraction – a simpler, high-level Structural Causal Model (SCM) faithful to the network under interventions. Discovering such abstractions is hard: it typically demands brute-force interchange interventions or retraining. We reframe the problem by viewing structured pruning as a search over approximate abstractions. Treating a trained network as a deterministic SCM, we derive an Interventional Risk objective whose second-order expansion yields closed-form criteria for replacing units with constants or folding them into neighbors. Under uniform curvature, our score reduces to activation variance, recovering variance-based pruning as a special case while clarifying when it fails. The resulting procedure efficiently extracts sparse, intervention-faithful abstractions from pretrained networks, which we validate via interchange interventions.

[445] Who Guards the Guardians? The Challenges of Evaluating Identifiability of Learned Representations

Shruti Joshi, Théo Saulus, Wieland Brendel, Philippe Brouillard, Dhanya Sridhar, Patrik Reizinger

Main category: cs.LG

TL;DR: Paper shows standard identifiability metrics are only valid under specific structural conditions and can produce false results when assumptions are violated, introducing a taxonomy and evaluation suite for proper testing.

DetailsMotivation: Current evaluation of representation learning identifiability relies on standard metrics (MCC, DCI, R²) that assume these metrics reflect recovery up to theoretical equivalence classes, but this assumption may not hold in practice.

Method: Analyzes structural conditions under which metrics are valid, separates DGP assumptions from encoder geometry, creates taxonomy to characterize validity domains, and releases evaluation suite for stress testing.
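
For concreteness, the MCC metric the paper stress-tests can be sketched as matching estimated factors to ground-truth factors by absolute correlation and averaging over the best matching. This toy version implicitly assumes linear, per-factor recovery, which is exactly the kind of hidden assumption the taxonomy makes explicit:

```python
import itertools
import numpy as np

def mcc(z_true, z_hat):
    """Mean Correlation Coefficient: pair each estimated factor with one
    ground-truth factor and average the absolute correlations over the
    best one-to-one matching (brute force; fine for small d)."""
    d = z_true.shape[1]
    corr = np.abs(np.corrcoef(z_true.T, z_hat.T)[:d, d:])  # cross block
    return max(np.mean([corr[i, p[i]] for i in range(d)])
               for p in itertools.permutations(range(d)))

rng = np.random.default_rng(0)
z = rng.normal(size=(1000, 3))
z_hat = z[:, [2, 0, 1]] * np.array([1.0, -2.0, 0.5])  # permuted + rescaled
score = mcc(z, z_hat)   # permutation/scaling is inside the equivalence class
```

Permutation and rescaling score ~1.0 here, but an invertible nonlinear mixing of the true factors would be penalized even though it may lie inside the theory's equivalence class: a false negative of the sort the paper analyzes.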

Result: Shows metrics become misspecified when assumptions are violated, producing systematic false positives/negatives, both within classical identifiability regimes and post-hoc settings where identifiability is most needed.

Conclusion: Standard identifiability metrics have limited validity domains; proper evaluation requires understanding structural assumptions and using comprehensive testing frameworks.

Abstract: Identifiability in representation learning is commonly evaluated using standard metrics (e.g., MCC, DCI, R^2) on synthetic benchmarks with known ground-truth factors. These metrics are assumed to reflect recovery up to the equivalence class guaranteed by identifiability theory. We show that this assumption holds only under specific structural conditions: each metric implicitly encodes assumptions about both the data-generating process (DGP) and the encoder. When these assumptions are violated, metrics become misspecified and can produce systematic false positives and false negatives. Such failures occur both within classical identifiability regimes and in post-hoc settings where identifiability is most needed. We introduce a taxonomy separating DGP assumptions from encoder geometry, use it to characterise the validity domains of existing metrics, and release an evaluation suite for reproducible stress testing and comparison.

[446] Memory Caching: RNNs with Growing Memory

Ali Behrouz, Zeman Li, Yuan Deng, Peilin Zhong, Meisam Razaviyayn, Vahab Mirrokni

Main category: cs.LG

TL;DR: Memory Caching (MC) enhances recurrent models by caching memory state checkpoints, allowing RNNs to achieve growing memory capacity similar to Transformers while maintaining subquadratic complexity.

DetailsMotivation: Transformers have quadratic complexity due to growing memory with context length, while recurrent models have fixed-size memory that underperforms in recall-intensive tasks. There's a need for models that combine the benefits of both approaches.

Method: Proposes Memory Caching (MC) with four variants: gated aggregation and sparse selective mechanisms that cache checkpoints of hidden states. This allows RNNs to have growing effective memory capacity that interpolates between fixed RNN memory and growing Transformer memory.
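
A minimal sketch of the caching idea, assuming checkpoints are taken every fixed number of steps and a final readout attends over them (the paper's gated-aggregation and sparse-selective variants are more elaborate):

```python
import numpy as np

def rnn_with_memory_cache(x, W_h, W_x, stride=4):
    """Plain tanh RNN whose hidden state is checkpointed every `stride`
    steps; the final readout attends over the cached checkpoints, so the
    effective memory grows as O(L / stride) instead of staying O(1)."""
    h = np.zeros(W_h.shape[0])
    cache = []
    for t, x_t in enumerate(x):
        h = np.tanh(W_h @ h + W_x @ x_t)
        if (t + 1) % stride == 0:
            cache.append(h.copy())          # checkpoint the memory state
    cache = np.stack(cache + [h])
    scores = cache @ h                      # final state attends over cache
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ cache                  # cache-augmented summary state

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 3))
W_h = rng.normal(size=(8, 8)) * 0.3
W_x = rng.normal(size=(8, 3)) * 0.3
out = rnn_with_memory_cache(x, W_h, W_x)
```

The stride knob realizes the interpolation the abstract describes: stride → ∞ recovers a fixed-memory RNN, stride = 1 approaches a Transformer-style growing memory.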

Result: MC enhances recurrent model performance on language modeling and long-context understanding tasks. In in-context recall tasks, MC variants show competitive performance, closing the gap with Transformers and outperforming state-of-the-art recurrent models.

Conclusion: Memory Caching is an effective technique that improves recurrent models by providing flexible memory capacity scaling, offering a practical alternative that balances the efficiency of RNNs with the recall capabilities of Transformers.

Abstract: Transformers have been established as the de-facto backbones for most recent advances in sequence modeling, mainly due to their growing memory capacity that scales with the context length. While plausible for retrieval tasks, it causes quadratic complexity and so has motivated recent studies to explore viable subquadratic recurrent alternatives. Despite showing promising preliminary results in diverse domains, such recurrent architectures underperform Transformers in recall-intensive tasks, often attributed to their fixed-size memory. In this paper, we introduce Memory Caching (MC), a simple yet effective technique that enhances recurrent models by caching checkpoints of their memory states (a.k.a. hidden states). Memory Caching allows the effective memory capacity of RNNs to grow with sequence length, offering a flexible trade-off that interpolates between the fixed memory (i.e., $O(L)$ complexity) of RNNs and the growing memory (i.e., $O(L^2)$ complexity) of Transformers. We propose four variants of MC, including gated aggregation and sparse selective mechanisms, and discuss their implications on both linear and deep memory modules. Our experimental results on language modeling and long-context understanding tasks show that MC enhances the performance of recurrent models, supporting its effectiveness. The results of in-context recall tasks indicate that while Transformers achieve the best accuracy, our MC variants show competitive performance, close the gap with Transformers, and perform better than state-of-the-art recurrent models.

[447] CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation

Weinan Dai, Hanlin Wu, Qiying Yu, Huan-ang Gao, Jiahao Li, Chengquan Jiang, Weiqiang Lou, Yufan Song, Hongli Yu, Jiaze Chen, Wei-Ying Ma, Ya-Qin Zhang, Jingjing Liu, Mingxuan Wang, Xin Liu, Hao Zhou

Main category: cs.LG

TL;DR: CUDA Agent: A reinforcement learning system that trains LLMs to generate highly optimized CUDA kernels, outperforming compiler-based systems like torch.compile and proprietary models by significant margins.

DetailsMotivation: Current LLMs are uncompetitive with compiler-based systems for CUDA kernel generation, and existing approaches fail to fundamentally improve models' intrinsic CUDA optimization capabilities, resulting in limited performance gains.

Method: Three-component system: 1) Scalable data synthesis pipeline, 2) Skill-augmented CUDA development environment with automated verification and profiling for reliable rewards, 3) Reinforcement learning techniques enabling stable training.
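
The verification-plus-profiling reward can be sketched as a correctness gate followed by a speedup signal. Everything below (function name, tolerance, reward shaping) is an illustrative assumption, not the paper's exact reward:

```python
def kernel_reward(candidate_out, reference_out, candidate_ms, baseline_ms,
                  atol=1e-4):
    """Hypothetical reward for a generated kernel: zero unless the output
    matches the reference within tolerance, otherwise the measured speedup
    over the baseline (e.g. a torch.compile kernel)."""
    correct = all(abs(a - b) <= atol
                  for a, b in zip(candidate_out, reference_out))
    if not correct:
        return 0.0                      # verification gate: no partial credit
    return baseline_ms / candidate_ms   # profiling signal: >1 means faster

r = kernel_reward([1.0, 2.0], [1.0, 2.0], candidate_ms=2.0, baseline_ms=3.0)
```

Gating on verified correctness before rewarding speed is what makes the reward signal "reliable" in the sense the summary describes: the agent cannot be paid for fast but wrong kernels.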

Result: Achieves state-of-the-art on KernelBench with 100%, 100%, and 92% faster rates over torch.compile on Level-1, Level-2, and Level-3 splits, outperforming proprietary models like Claude Opus 4.5 and Gemini 3 Pro by ~40% on hardest Level-3 setting.

Conclusion: CUDA Agent demonstrates that agentic reinforcement learning can develop deep CUDA kernel expertise in LLMs, fundamentally improving their optimization capabilities beyond existing approaches.

Abstract: GPU kernel optimization is fundamental to modern deep learning but remains a highly specialized task requiring deep hardware expertise. Despite strong performance in general programming, large language models (LLMs) remain uncompetitive with compiler-based systems such as torch.compile for CUDA kernel generation. Existing CUDA code generation approaches either rely on training-free refinement or fine-tune models within fixed multi-turn execution-feedback loops, but both paradigms fail to fundamentally improve the model’s intrinsic CUDA optimization ability, resulting in limited performance gains. We present CUDA Agent, a large-scale agentic reinforcement learning system that develops CUDA kernel expertise through three components: a scalable data synthesis pipeline, a skill-augmented CUDA development environment with automated verification and profiling to provide reliable reward signals, and reinforcement learning algorithmic techniques enabling stable training. CUDA Agent achieves state-of-the-art results on KernelBench, delivering 100%, 100%, and 92% faster rates over torch.compile on KernelBench Level-1, Level-2, and Level-3 splits, outperforming the strongest proprietary models such as Claude Opus 4.5 and Gemini 3 Pro by about 40% on the hardest Level-3 setting.

[448] What Makes a Reward Model a Good Teacher? An Optimization Perspective

Noam Razin, Zixuan Wang, Hubert Strauss, Stanley Wei, Jason D. Lee, Sanjeev Arora

Main category: cs.LG

TL;DR: Reward model accuracy alone doesn’t guarantee efficient RLHF optimization; reward variance is crucial for avoiding flat objective landscapes and enabling faster convergence.

DetailsMotivation: Current RLHF evaluation focuses on reward model accuracy, but it's unclear if accuracy fully captures what makes a reward model an effective teacher for language model optimization.

Method: Theoretical analysis from optimization perspective showing that low reward variance leads to flat objective landscapes regardless of accuracy. Experiments with models up to 8B parameters validate the interplay between reward variance, accuracy, and optimization rate.
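
The flat-landscape claim can be illustrated with a one-step softmax policy, where the exact expected-reward gradient with respect to the logits is p_j(r_j − baseline): a perfectly accurate but low-variance reward model produces a far smaller gradient than a less accurate, higher-variance one (a toy illustration of the theory, not the paper's experiment):

```python
import numpy as np

def policy_gradient_norm(logits, rewards):
    """Exact gradient of E[r] for a one-step softmax policy (with mean
    baseline); its magnitude is governed by the reward's variance under
    the policy."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    baseline = p @ rewards
    grad = p * (rewards - baseline)   # d E[r] / d logits for a softmax policy
    return np.linalg.norm(grad)

logits = np.zeros(4)                                    # uniform policy
accurate_low_var = np.array([1.00, 0.99, 0.98, 0.97])   # right ranking, flat
inaccurate_high_var = np.array([0.0, 1.0, 0.0, 1.0])    # wrong ranking, spread
g_low = policy_gradient_norm(logits, accurate_low_var)
g_high = policy_gradient_norm(logits, inaccurate_high_var)
```

The low-variance rewards are perfectly accurate about the ranking, yet the gradient they induce is tens of times smaller, which is the optimization failure mode the paper formalizes.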

Result: Even perfectly accurate reward models can cause extremely slow optimization if they induce low variance, while less accurate models with higher variance can outperform them. Reward models that work well for one language model may induce low variance for another.

Conclusion: Reward models should be evaluated beyond just accuracy - they need to induce sufficient variance for efficient optimization. This reveals a fundamental limitation of current reward model evaluation practices.

Abstract: The success of Reinforcement Learning from Human Feedback (RLHF) critically depends on the quality of the reward model. However, while this quality is primarily evaluated through accuracy, it remains unclear whether accuracy fully captures what makes a reward model an effective teacher. We address this question from an optimization perspective. First, we prove that regardless of how accurate a reward model is, if it induces low reward variance, then the RLHF objective suffers from a flat landscape. Consequently, even a perfectly accurate reward model can lead to extremely slow optimization, underperforming less accurate models that induce higher reward variance. We additionally show that a reward model that works well for one language model can induce low reward variance, and thus a flat objective landscape, for another. These results establish a fundamental limitation of evaluating reward models solely based on accuracy or independently of the language model they guide. Experiments using models of up to 8B parameters corroborate our theory, demonstrating the interplay between reward variance, accuracy, and reward maximization rate. Overall, our findings highlight that beyond accuracy, a reward model needs to induce sufficient variance for efficient optimization.

[449] On the Effectiveness of Membership Inference in Targeted Data Extraction from Large Language Models

Ali Al Sahili, Ali Chehab, Razane Tajeddine

Main category: cs.LG

TL;DR: This paper investigates the connection between training data extraction and membership inference attacks in LLMs, benchmarking MIA techniques within data extraction pipelines to evaluate their practical utility.

DetailsMotivation: LLMs are prone to memorizing training data, creating privacy risks through training data extraction and membership inference attacks. The research aims to understand how these threats are interconnected and evaluate the practical effectiveness of MIA techniques in real-world extraction scenarios.

Method: The study integrates multiple MIA techniques into a data extraction pipeline to systematically benchmark their effectiveness. Researchers compare performance in this integrated setting against conventional MIA benchmarks to evaluate practical utility.
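
A loss-based MIA, the simplest technique typically benchmarked in such pipelines, can be sketched as thresholding a generation's average negative log-likelihood (the scores and threshold below are toy values, not results from the paper):

```python
def loss_mia_score(token_logprobs):
    """Loss-based membership score: average negative log-likelihood the
    model assigns to a candidate sequence. Lower loss suggests the
    sequence may have been memorized from training data."""
    return -sum(token_logprobs) / len(token_logprobs)

def flag_members(candidates, threshold):
    """Verification step of the extraction pipeline: keep generations whose
    loss falls below a calibrated threshold."""
    return [text for text, logprobs in candidates
            if loss_mia_score(logprobs) < threshold]

candidates = [
    ("seen during training", [-0.1, -0.2, -0.1]),   # low loss
    ("novel generation",     [-2.0, -3.0, -2.5]),   # high loss
]
members = flag_members(candidates, threshold=1.0)
```

Stronger MIAs (e.g. reference-model or calibration-based scores) slot into the same pipeline by replacing `loss_mia_score`, which is precisely the axis along which the study benchmarks.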

Result: The paper provides benchmarking results showing how different MIA techniques perform when integrated into data extraction pipelines, revealing their practical effectiveness compared to conventional evaluation settings.

Conclusion: The research demonstrates the interconnected nature of training data extraction and membership inference attacks, providing insights into the practical utility of MIA techniques in real-world privacy threat scenarios involving LLMs.

Abstract: Large Language Models (LLMs) are prone to memorizing training data, which poses serious privacy risks. Two of the most prominent concerns are training data extraction and Membership Inference Attacks (MIAs). Prior research has shown that these threats are interconnected: adversaries can extract training data from an LLM by querying the model to generate a large volume of text and subsequently applying MIAs to verify whether a particular data point was included in the training set. In this study, we integrate multiple MIA techniques into the data extraction pipeline to systematically benchmark their effectiveness. We then compare their performance in this integrated setting against results from conventional MIA benchmarks, allowing us to evaluate their practical utility in real-world extraction scenarios.

[450] Geometric structure of shallow neural networks and constructive ${\mathcal L}^2$ cost minimization

Thomas Chen, Patrícia Muñoz Ewald

Main category: cs.LG

TL;DR: Theoretical analysis of cost minimization in shallow ReLU networks without gradient descent, focusing on geometric structure of minimizers and explicit construction of upper bounds.

DetailsMotivation: To understand cost minimization in underparametrized shallow ReLU networks through explicit construction methods rather than gradient descent, with focus on geometric structure of minimizers.

Method: Theoretical analysis using explicit construction of upper bounds for L² cost function in ReLU networks. Considers input space ℝᴹ, output space ℝᴽ with Q≤M, and arbitrarily large training samples. Proves O(δₚ) upper bound where δₚ measures signal-to-noise ratio.

Result: Proved upper bound of order O(δₚ) on minimum cost function. For M=Q case, explicitly determined exact degenerate local minimum with relative error O(δₚ²) from upper bound. Constructively trained network metrizes particular Q-dimensional subspace in input space.

Conclusion: The paper provides theoretical understanding of cost minimization in shallow ReLU networks through explicit construction methods, revealing geometric structure of minimizers and relationships between network architecture and data properties.

Abstract: In this paper, we approach the problem of cost (loss) minimization in underparametrized shallow ReLU networks through the explicit construction of upper bounds which appeal to the structure of classification data, without use of gradient descent. A key focus is on elucidating the geometric structure of approximate and precise minimizers. We consider an $L^2$ cost function, input space $\mathbb{R}^M$, output space ${\mathbb R}^Q$ with $Q\leq M$, and training input sample size that can be arbitrarily large. We prove an upper bound on the minimum of the cost function of order $O(\delta_P)$ where $\delta_P$ measures the signal-to-noise ratio of training data. In the special case $M=Q$, we explicitly determine an exact degenerate local minimum of the cost function, and show that the sharp value differs from the upper bound obtained for $Q\leq M$ by a relative error $O(\delta_P^2)$. The proof of the upper bound yields a constructively trained network; we show that it metrizes a particular $Q$-dimensional subspace in the input space ${\mathbb R}^M$. We comment on the characterization of the global minimum of the cost function in the given context.

[451] Less is more – the Dispatcher/Executor principle for multi-task Reinforcement Learning

Martin Riedmiller, Andrea Gesmundo, Tim Hertweck, Roland Hafner

Main category: cs.LG

TL;DR: A position paper proposing a dispatcher/executor architecture for multi-task RL controllers to improve generalization and data efficiency through structural design principles rather than just scaling up models and data.

DetailsMotivation: Humans naturally abstract away unnecessary details when solving complex decision-making problems in varying environments. Current RL trends focus on large neural networks trained on massive datasets, but this paper argues that structural design principles can improve generalization and data efficiency when data is limited.

Method: Proposes a dispatcher/executor principle where the controller is partitioned into two entities: a dispatcher that understands the task and an executor that computes device-specific controls. These are connected by a strongly regularizing communication channel to enforce abstraction.
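
One way to realize a strongly regularizing channel, purely as an illustration (the paper argues the principle and does not prescribe this mechanism), is a small discrete codebook sitting between the two modules:

```python
import numpy as np

def dispatcher(task_embedding, codebook):
    """Task-understanding module: snaps the task to the nearest code in a
    small discrete codebook -- the strongly regularizing channel."""
    i = np.argmin(np.linalg.norm(codebook - task_embedding, axis=1))
    return codebook[i]            # only this low-dim message crosses over

def executor(message, observation, W):
    """Device-specific control module: computes controls from the local
    observation, conditioned only on the dispatcher's message."""
    return W @ np.concatenate([message, observation])

codebook = np.eye(3)                            # 3 abstract task codes
msg = dispatcher(np.array([0.9, 0.2, 0.1]), codebook)
W = np.ones((2, 3 + 4)) * 0.1                   # toy executor weights
u = executor(msg, np.zeros(4), W)
```

Because the executor only ever sees the quantized message, task-irrelevant detail is abstracted away at the interface, which is the generalisation mechanism the position paper advocates.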

Result: The paper provides conceptual evidence that structural design principles (like the dispatcher/executor architecture) can be valuable for improving generalization and data efficiency in RL, particularly when data is scarce rather than abundant.

Conclusion: While acknowledging the power of scaling (Sutton’s ‘bitter lesson’), the authors argue that considering structure and adding design principles is critical when data is a precious resource, offering an alternative to the current trend of massive models and datasets.

Abstract: Humans instinctively know how to neglect details when it comes to solving complex decision making problems in environments with unforeseeable variations. This abstraction process seems to be a vital property for most biological systems and helps to ‘abstract away’ unnecessary details and boost generalisation. In this work we introduce the dispatcher/executor principle for the design of multi-task Reinforcement Learning controllers. It suggests partitioning the controller into two entities, one that understands the task (the dispatcher) and one that computes the controls for the specific device (the executor), and connecting the two by a strongly regularizing communication channel. The core rationale behind this position paper is that changes in structure and design principles can improve generalisation properties and drastically improve data-efficiency. It is in some sense a ‘yes, and …’ response to the current trend of using large neural networks trained on vast amounts of data and betting on emerging generalisation properties. While we agree on the power of scaling - in the sense of Sutton’s ‘bitter lesson’ - we will give some evidence that considering structure and adding design principles can be a valuable and critical component, in particular when data is not abundant and infinite, but is a precious resource.

[452] DirMixE: Harnessing Test Agnostic Long-tail Recognition with Hierarchical Label Variations

Zhiyong Yang, Qianqian Xu, Sicong Li, Zitai Wang, Xiaochun Cao, Qingming Huang

Main category: cs.LG

TL;DR: DirMixE: A Mixture-of-Experts approach for test-agnostic long-tail recognition that addresses both global and local variations in unknown test distributions using Dirichlet meta-distributions, with theoretical analysis and experimental validation.

DetailsMotivation: Traditional long-tail recognition methods assume known test distributions, but real-world scenarios involve unknown and arbitrarily imbalanced test label distributions. Existing Mixture-of-Expert approaches only handle global variations in test distributions, leaving local variations unaddressed.

Method: Proposes DirMixE, a MoE strategy that assigns experts to different Dirichlet meta-distributions of label distributions to target local variations, while diversity among meta-distributions captures global variations. Also develops Latent Skill Finetuning (LSF) framework for parameter-efficient finetuning of foundation models with LoRA and Adapter implementations.
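
The expert-weighting idea can be sketched as scoring an estimated test label distribution under each expert's Dirichlet meta-distribution and normalizing (a simplification; the paper's aggregation and training objective are richer):

```python
import numpy as np
from math import lgamma

def dirichlet_logpdf(p, alpha):
    """Log-density of a label distribution p under Dirichlet(alpha)."""
    return (lgamma(float(alpha.sum())) - sum(lgamma(a) for a in alpha)
            + float(((alpha - 1) * np.log(p)).sum()))

def dirmixe_weights(p_test, expert_alphas):
    """Each expert targets one Dirichlet meta-distribution; weight experts
    by how plausible the (estimated) test label distribution is under each."""
    logs = np.array([dirichlet_logpdf(p_test, a) for a in expert_alphas])
    w = np.exp(logs - logs.max())
    return w / w.sum()

# three experts: head-biased, balanced, and tail-biased meta-distributions
alphas = [np.array([8.0, 2.0, 1.0]),
          np.array([3.0, 3.0, 3.0]),
          np.array([1.0, 2.0, 8.0])]
p_tail_heavy = np.array([0.1, 0.2, 0.7])
w = dirmixe_weights(p_tail_heavy, alphas)   # tail expert dominates
```

Sampling label distributions from each meta-distribution, rather than fixing a few test distributions, is what lets the method cover local variations around each global regime.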

Result: Extensive experiments on CIFAR-10-LT, CIFAR-100-LT, ImageNet-LT, and iNaturalist validate DirMixE’s effectiveness. Theoretical analysis provides generalization error bounds and shows variance-based regularization tightens these bounds.

Conclusion: DirMixE addresses both global and local variations in test-agnostic long-tail recognition through Dirichlet meta-distributions, providing stable optimization and better performance quantification. The LSF framework enables efficient finetuning of foundation models with theoretical guarantees.

Abstract: This paper explores test-agnostic long-tail recognition, a challenging long-tail task where the test label distributions are unknown and arbitrarily imbalanced. We argue that the variation in these distributions can be broken down hierarchically into global and local levels. The global ones reflect a broad range of diversity, while the local ones typically arise from milder changes, often focused on a particular neighbor. Traditional methods predominantly use a Mixture-of-Expert (MoE) approach, targeting a few fixed test label distributions that exhibit substantial global variations. However, the local variations are left unconsidered. To address this issue, we propose a new MoE strategy, DirMixE, which assigns experts to different Dirichlet meta-distributions of the label distribution, each targeting a specific aspect of local variations. Additionally, the diversity among these Dirichlet meta-distributions inherently captures global variations. This dual-level approach also leads to a more stable objective function, allowing us to sample different test distributions better to quantify the mean and variance of performance outcomes. Building on this idea, we develop a general Latent Skill Finetuning (LSF) framework for parameter-efficient finetuning of foundation models. We provide implementations based on LoRA and Adapter. Theoretically, we derive upper bounds on the generalization error for both standard learning and PEFT. Under mild assumptions, we show that the variance-based regularization helps tighten these bounds. Furthermore, we prove that the covering number of the PEFT hypothesis class scales with the number of trainable parameters. Finally, extensive experiments on CIFAR-10-LT, CIFAR-100-LT, ImageNet-LT, and iNaturalist validate the effectiveness of DirMixE.

[453] Joint Distribution-Informed Shapley Values for Sparse Counterfactual Explanations

Lei You, Yijun Bian, Lele Cao

Main category: cs.LG

TL;DR: COLA is a post-hoc framework that refines counterfactual explanations by using optimal transport and Shapley attribution to minimize feature edits while preserving target effects.

DetailsMotivation: Existing counterfactual explanation methods often modify more features than necessary, reducing clarity and actionability for users. There's a need for more minimal and interpretable counterfactuals.

Method: COLA uses optimal transport to compute coupling between factual and counterfactual sets, then applies Shapley-based attribution (p-SHAP) to select minimal feature edits while preserving target effects. It’s model- and generator-agnostic.
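
A toy version of the two stages, with exact OT for equal-size uniform-weight sets via brute-force matching and a trivial stand-in for the Shapley-based edit selection (the real $p$-SHAP attribution is substantially more involved):

```python
import itertools
import numpy as np

def ot_coupling(factuals, counterfactuals):
    """One-to-one optimal-transport coupling for equal-size sets with
    uniform weights: the permutation minimizing total squared distance."""
    n = len(factuals)
    cost = ((factuals[:, None, :] - counterfactuals[None, :, :]) ** 2).sum(-1)
    best = min(itertools.permutations(range(n)),
               key=lambda p: sum(cost[i, p[i]] for i in range(n)))
    return list(enumerate(best))

def minimal_edits(x, x_cf, tol=1e-8):
    """Keep only features that actually change along a coupled pair -- a
    trivial stand-in for COLA's Shapley-based selection of minimal edits."""
    return [j for j in range(len(x)) if abs(x[j] - x_cf[j]) > tol]

F = np.array([[0.0, 0.0], [5.0, 5.0]])
CF = np.array([[5.0, 4.0], [0.5, 0.0]])
pairs = ot_coupling(F, CF)                 # pairs each factual with its
edits = minimal_edits(F[0], CF[pairs[0][1]])  # nearest counterfactual
```

The coupling pairs each factual with the closest counterfactual rather than an arbitrary one, which is why the refined explanation can preserve the target effect with far fewer feature edits.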

Result: Across four datasets, twelve models, and five CE generators, COLA achieves same target effects with only 26-45% of original feature edits. Shows near-optimality on benchmark.

Conclusion: COLA effectively refines counterfactual explanations to be more minimal and actionable while preserving target effects, with theoretical guarantees on distance preservation.

Abstract: Counterfactual explanations (CE) aim to reveal how small input changes flip a model’s prediction, yet many methods modify more features than necessary, reducing clarity and actionability. We introduce \emph{COLA}, a model- and generator-agnostic post-hoc framework that refines any given CE by computing a coupling via optimal transport (OT) between factual and counterfactual sets and using it to drive a Shapley-based attribution (\emph{$p$-SHAP}) that selects a minimal set of edits while preserving the target effect. Theoretically, we show that OT minimizes an upper bound on the $W_1$ divergence between factual and counterfactual outcomes and that, under mild conditions, refined counterfactuals are guaranteed not to move farther from the factuals than the originals. Empirically, across four datasets, twelve models, and five CE generators, COLA achieves the same target effects with only 26–45% of the original feature edits. On a small-scale benchmark, COLA shows near-optimality.

[454] Semantic Parallelism: Redefining Efficient MoE Inference via Model-Data Co-Scheduling

Yan Li, Zhenyu Zhang, Zhengang Wang, Pengfei Chen, Pengfei Zheng

Main category: cs.LG

TL;DR: Semantic Parallelism (Sem-MoE) reduces communication costs in MoE model serving by co-locating experts and their activating tokens through collaborative model-data scheduling.

DetailsMotivation: Current expert parallelism (EP) for MoE models suffers from expensive inter-device communication due to all-to-all collectives when routing tokens to remote experts. Existing approaches treat expert placement and token scheduling separately, leading to excessive communication and compromised inference efficiency.

Method: Semantic Parallelism minimizes communication costs through model-data collaborative scheduling with three techniques: 1) Offline model scheduling that clusters and collocates experts based on co-activation patterns; 2) Online inter-request data scheduling for Attention-DP setups that rebatches requests to devices hosting frequently activated experts; 3) Online intra-request data scheduling for Attention-TP setups that fuses token reshuffling into inference pipeline to reduce remote routing.
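
The offline model-scheduling step can be sketched as building a token-level co-activation matrix and greedily packing mutually affine experts onto the same device. The greedy packer below is an illustrative stand-in for the paper's scheduler:

```python
import numpy as np

def coactivation_matrix(routing):
    """routing: (tokens, experts) 0/1 top-k selection matrix. Entry (i, j)
    counts how often experts i and j fire on the same token."""
    return routing.T @ routing

def greedy_collocate(coact, n_devices):
    """Greedily place experts so frequently co-activated pairs share a
    device, which is what removes all-to-all traffic between them."""
    n = coact.shape[0]
    cap = n // n_devices
    devices = [[] for _ in range(n_devices)]
    for e in np.argsort(-coact.sum(axis=1)):        # busiest experts first
        best = max((d for d in devices if len(d) < cap),
                   key=lambda d: sum(coact[e, f] for f in d))
        best.append(int(e))
    return devices

# toy routing: experts {0, 1} co-fire, experts {2, 3} co-fire
routing = np.array([[1, 1, 0, 0],
                    [1, 1, 0, 0],
                    [0, 0, 1, 1],
                    [0, 0, 1, 1]])
placement = greedy_collocate(coactivation_matrix(routing), n_devices=2)
```

Tokens whose likely experts all live on one device never need the expensive all-to-all dispatch, which is the effect the online data-scheduling stages then amplify.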

Result: Sem-MoE implementation in SGLANG effectively reduces all-to-all communication volume in EP and achieves superior inference throughput compared to existing solutions.

Conclusion: Semantic Parallelism offers a novel paradigm for MoE serving that addresses communication bottlenecks through collaborative scheduling, significantly improving inference efficiency over traditional expert parallelism approaches.

Abstract: Prevailing LLM serving engines employ expert parallelism (EP) to implement multi-device inference of massive MoE models. However, the efficiency of expert parallel inference is largely bounded by inter-device communication, as EP embraces expensive all-to-all collectives to route tokens to the remote experts if not collocating on the same GPU/NPU device. Nevertheless, state-of-the-art schemes treat expert device-placement and request (or token) device-scheduling as separate concerns, triggering excessive communication between them and compromising inference efficiency. This paper proposes Semantic Parallelism, a novel parallelism paradigm that minimizes the steep communication costs in EP-centric MoE serving via model-data collaborative scheduling. We implement Semantic Parallelism in a framework called Sem-MoE. Sem-MoE maximally collocates experts and their activating tokens onto the same device using proactively modeled activation likelihood between them and introduces three key techniques: (1) Offline model scheduling, which preliminarily clusters and collocates experts onto devices based on their co-activation tendencies for certain classes of input. (2) Online inter-request data scheduling for Attention-DP setups, which proactively rebatches incoming requests onto the device that hosts experts most likely and frequently activated by the corresponding requests. (3) Online intra-request data scheduling for Attention-TP setups, which seamlessly fuses a token reshuffling procedure into the original inference pipeline and proactively reschedules tokens to devices to reduce dispersed remote routing. We build Sem-MoE into a prevailing LLM serving engine SGLANG. Experiments show our collaborative scheduling approach can effectively reduce the all-to-all communication volume in EP and achieve superior inference throughput compared to existing solutions.

[455] Operator Learning with Domain Decomposition for Geometry Generalization in PDE Solving

Jianing Huang, Kaixuan Zhang, Youjia Wu, Ze Cheng

Main category: cs.LG

TL;DR: A framework for neural operator learning with domain decomposition (Schwarz Neural Inference) that improves geometry generalization and data efficiency for solving PDEs on arbitrary domains.

DetailsMotivation: Neural operators are powerful for solving PDEs but suffer from data-hungry nature and poor transferability to new geometries, limiting their widespread application.

Method: Proposes a local-to-global framework with domain decomposition called Schwarz Neural Inference (SNI), which partitions domains into subdomains, solves local problems with neural operators, and stitches solutions iteratively.

Result: Achieves remarkable geometry generalization on several representative PDEs with diverse boundary conditions, outperforming alternative methods, with theoretical convergence rate and error bound analysis.

Conclusion: The domain decomposition framework effectively addresses geometry generalization and data efficiency challenges in neural operator learning for PDEs.

Abstract: Neural operators have become increasingly popular in solving \textit{partial differential equations} (PDEs) due to their superior capability to capture intricate mappings between function spaces over complex domains. However, the data-hungry nature of operator learning inevitably poses a bottleneck for their widespread applications. At the core of the challenge lies the absence of transferability of neural operators to new geometries. To tackle this issue, we propose operator learning with domain decomposition, a local-to-global framework to solve PDEs on arbitrary geometries. Under this framework, we devise an iterative scheme \textit{Schwarz Neural Inference} (SNI). This scheme allows for partitioning of the problem domain into smaller subdomains, on which local problems can be solved with neural operators, and stitching local solutions to construct a global solution. Additionally, we provide a theoretical analysis of the convergence rate and error bound. We conduct extensive experiments on several representative PDEs with diverse boundary conditions and achieve remarkable geometry generalization compared to alternative methods. These analyses and experiments demonstrate the proposed framework’s potential in addressing challenges related to geometry generalization and data efficiency.
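
The alternating-Schwarz idea behind SNI can be illustrated on a toy 1-D problem. This is a hypothetical simplification: the paper's local solvers are neural operators, whereas here each local solve is the exact solution of Laplace's equation on a subdomain (linear interpolation between boundary values); only the overlap-and-stitch iteration is the point.

```python
# Toy alternating-Schwarz iteration: solve u'' = 0 on [0, 1] with u(0) = 0,
# u(1) = 1 (exact solution u(x) = x) using two overlapping subdomains.
# The "local solver" stands in for a neural operator.

def local_solve(x_left, x_right, u_left, u_right, x):
    """Exact local solve of u'' = 0 on [x_left, x_right], evaluated at x."""
    t = (x - x_left) / (x_right - x_left)
    return (1.0 - t) * u_left + t * u_right

def schwarz_iterate(n_iters=50):
    """Alternate local solves on overlapping subdomains D1 = [0, 0.6] and
    D2 = [0.4, 1], exchanging the interface values u(0.4) and u(0.6)."""
    u_at_06 = 0.0  # initial guess for the right boundary value of D1
    u_at_04 = 0.0
    for _ in range(n_iters):
        # solve on D1 with u(0) = 0, u(0.6) = u_at_06; read off u(0.4)
        u_at_04 = local_solve(0.0, 0.6, 0.0, u_at_06, 0.4)
        # solve on D2 with u(0.4) = u_at_04, u(1) = 1; read off u(0.6)
        u_at_06 = local_solve(0.4, 1.0, u_at_04, 1.0, 0.6)
    return u_at_04, u_at_06

u04, u06 = schwarz_iterate()  # converges to the exact values 0.4 and 0.6
```

The interface map is a contraction (factor 4/9 here), which is the mechanism behind the convergence-rate analysis the abstract mentions.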

[456] TimeMAE: Self-Supervised Representations of Time Series with Decoupled Masked Autoencoders

Mingyue Cheng, Xiaoyu Tao, Zhiding Liu, Qi Liu, Hao Zhang, Rujiao Zhang, Enhong Chen

Main category: cs.LG

TL;DR: TimeMAE is a self-supervised framework for time series that uses semantic unit elevation and decoupled representation learning to improve masked modeling, achieving better performance in label-scarce scenarios.

DetailsMotivation: Existing self-supervised methods for time series operate at point level with unidirectional encoding, leading to low semantic density and mismatch between pre-training and downstream optimization. There's a need for better transferable representations from unlabeled time series for data-scarce classification tasks.

Method: TimeMAE segments time series into non-overlapping sub-series as semantic units, uses a decoupled masked autoencoder that separately encodes visible and masked regions (avoiding artificial masked tokens), and employs two complementary objectives: masked codeword classification (discretizing sub-series via learned tokenizer) and masked representation regression (aligning continuous representations via momentum-updated target encoder).

Result: Extensive experiments on five datasets demonstrate that TimeMAE outperforms competitive baselines, particularly in label-scarce scenarios and transfer learning scenarios.

Conclusion: TimeMAE effectively addresses limitations of existing self-supervised time series methods by elevating semantic units and decoupling representations, providing a strong framework for learning transferable representations from unlabeled time series.

Abstract: Learning transferable representations from unlabeled time series is crucial for improving performance in data-scarce classification. Existing self-supervised methods often operate at the point level and rely on unidirectional encoding, leading to low semantic density and a mismatch between pre-training and downstream optimization. In this paper, we propose TimeMAE, a self-supervised framework that reformulates masked modeling for time series via semantic unit elevation and decoupled representation learning. Instead of modeling individual time steps, TimeMAE segments time series into non-overlapping sub-series to form semantically enriched units, enabling more informative masked reconstruction while reducing computational cost. To address the representation discrepancy introduced by masking, we design a decoupled masked autoencoder that separately encodes visible and masked regions, avoiding artificial masked tokens in the main encoder. To guide pre-training, we introduce two complementary objectives: masked codeword classification, which discretizes sub-series semantics via a learned tokenizer and masked representation regression, which aligns continuous representations through a momentum-updated target encoder. Extensive experiments on five datasets demonstrate that TimeMAE outperforms competitive baselines, particularly in label-scarce scenarios and transfer learning scenarios.
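
The two preprocessing steps that distinguish TimeMAE, semantic-unit elevation and decoupled visible/masked index sets, can be sketched as follows (a minimal illustration; window size and mask ratio are made-up values, and the separate encoder branches are not shown):

```python
import random

def segment(series, window):
    """Elevate a 1-D series into non-overlapping sub-series (semantic units)."""
    n = len(series) // window
    return [series[i * window:(i + 1) * window] for i in range(n)]

def split_visible_masked(n_units, mask_ratio=0.5, seed=0):
    """Decoupled masking: return disjoint (visible, masked) index sets. The
    two groups are encoded by separate branches, so no artificial [MASK]
    token ever enters the main encoder."""
    rng = random.Random(seed)
    idx = list(range(n_units))
    rng.shuffle(idx)
    k = int(n_units * mask_ratio)
    return sorted(idx[k:]), sorted(idx[:k])

units = segment(list(range(12)), window=3)  # 4 sub-series of length 3
vis, msk = split_visible_masked(len(units))
```

The masked indices then feed both objectives: codeword classification targets come from a learned tokenizer over the masked sub-series, and regression targets from the momentum-updated encoder.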

[457] Gradient is All You Need? How Consensus-Based Optimization can be Interpreted as a Stochastic Relaxation of Gradient Descent

Konstantin Riedl, Timo Klock, Carina Geldhauser, Massimo Fornasier

Main category: cs.LG

TL;DR: CBO (consensus-based optimization) is shown to behave like stochastic gradient descent despite being derivative-free, offering theoretical insights into how stochastic perturbations help overcome nonconvex optimization barriers.

DetailsMotivation: To provide a novel theoretical understanding of gradient-based learning algorithms by linking consensus-based optimization (a derivative-free method) to stochastic gradient descent, explaining how stochastic perturbations can help overcome nonconvex optimization challenges.

Method: Analytical perspective interpreting CBO as a stochastic relaxation of gradient descent, showing that through particle communication, CBO exhibits SGD-like behavior despite only using objective function evaluations.

Result: CBO is provably globally convergent to global minimizers for classes of nonsmooth and nonconvex functions, revealing an intrinsic gradient descent nature in derivative-free heuristics.

Conclusion: The paper provides theoretical insights connecting derivative-free optimization to gradient-based methods, explaining how stochastic perturbations help escape local minima in nonconvex optimization.

Abstract: In this paper, we provide a novel analytical perspective on the theoretical understanding of gradient-based learning algorithms by interpreting consensus-based optimization (CBO), a recently proposed multi-particle derivative-free optimization method, as a stochastic relaxation of gradient descent. Remarkably, we observe that through communication of the particles, CBO exhibits a stochastic gradient descent (SGD)-like behavior despite solely relying on evaluations of the objective function. The fundamental value of such link between CBO and SGD lies in the fact that CBO is provably globally convergent to global minimizers for ample classes of nonsmooth and nonconvex objective functions. Hence, on the one side, we offer a novel explanation for the success of stochastic relaxations of gradient descent by furnishing useful and precise insights that explain how problem-tailored stochastic perturbations of gradient descent (like the ones induced by CBO) overcome energy barriers and reach deep levels of nonconvex functions. On the other side, and contrary to the conventional wisdom for which derivative-free methods ought to be inefficient or not to possess generalization abilities, our results unveil an intrinsic gradient descent nature of heuristics. Instructive numerical illustrations support the provided theoretical insights.
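
The CBO dynamics the paper analyzes are easy to state concretely. Below is a minimal derivative-free sketch (all hyperparameters are illustrative choices, not the paper's): each particle drifts toward a Gibbs-weighted consensus point and is perturbed by noise scaled to its distance from that point, so exploration dies out as the ensemble collapses.

```python
import math
import random

def cbo_minimize(f, dim=2, n_particles=50, steps=400, lam=1.0, sigma=0.7,
                 dt=0.05, alpha=30.0, seed=0):
    """Toy consensus-based optimization: uses only evaluations of f, yet the
    drift toward the consensus point mimics a stochastic gradient step."""
    rng = random.Random(seed)
    X = [[rng.uniform(-3.0, 3.0) for _ in range(dim)] for _ in range(n_particles)]
    m = X[0]
    for _ in range(steps):
        vals = [f(x) for x in X]
        fmin = min(vals)
        # Gibbs weights exp(-alpha * f), stabilized by subtracting the minimum
        w = [math.exp(-alpha * (v - fmin)) for v in vals]
        s = sum(w)
        m = [sum(w[i] * X[i][d] for i in range(n_particles)) / s
             for d in range(dim)]
        for i in range(n_particles):
            for d in range(dim):
                dev = X[i][d] - m[d]
                # drift to consensus + noise proportional to the deviation
                X[i][d] += -lam * dev * dt \
                    + sigma * dev * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    return m

sphere = lambda x: sum(v * v for v in x)
m_star = cbo_minimize(sphere)
```

Because the noise vanishes as particles reach consensus, the scheme anneals itself; the paper's contribution is showing this behaves like a stochastic relaxation of gradient descent with global-convergence guarantees.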

[458] DRL-ORA: Distributional Reinforcement Learning with Online Risk Adaption

Yupeng Wu, Wenyun Li, Wenjie Huang, Chin Pang Ho

Main category: cs.LG

TL;DR: DRL-ORA is a distributional RL framework that dynamically adjusts epistemic risk levels online using total variation minimization, unifying risk adaptation approaches and outperforming fixed-risk methods.

DetailsMotivation: RL agents must make decisions with incomplete environment knowledge. Dynamically adjusting epistemic risk levels can help achieve reliable policies in safety-critical settings more efficiently than fixed-risk approaches.

Method: Proposes Distributional RL with Online Risk Adaptation (DRL-ORA) that quantifies both epistemic and implicit aleatory uncertainties, dynamically adjusts epistemic risk levels by solving total variation minimization online, and uses grid search with Follow-The-Leader-type algorithm for efficient risk level selection.

Result: DRL-ORA outperforms existing methods relying on fixed risk levels or manually designed risk adaptation in multiple classes of tasks, offering better explainability and flexibility.

Conclusion: The framework unifies existing risk adaptation approaches and provides an efficient online method for dynamic risk adjustment in reinforcement learning, particularly valuable for safety-critical applications.

Abstract: One of the main challenges in reinforcement learning (RL) is that the agent has to make decisions that would influence the future performance without having complete knowledge of the environment. Dynamically adjusting the level of epistemic risk during the learning process can help to achieve reliable policies in safety-critical settings with better efficiency. In this work, we propose a new framework, Distributional RL with Online Risk Adaptation (DRL-ORA). This framework quantifies both epistemic and implicit aleatory uncertainties in a unified manner and dynamically adjusts the epistemic risk levels by solving a total variation minimization problem online. The framework unifies the existing variants of risk adaptation approaches and offers better explainability and flexibility. The selection of risk levels is performed efficiently via a grid search using a Follow-The-Leader-type algorithm, where the offline oracle also corresponds to a "satisficing measure" under a specially modified loss function. We show that DRL-ORA outperforms existing methods that rely on fixed risk levels or manually designed risk level adaptation in multiple classes of tasks.
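
The Follow-The-Leader grid search over risk levels is simple to sketch. The loss below is a made-up stationary stand-in for the paper's satisficing-measure loss; only the selection mechanism is illustrated.

```python
def follow_the_leader(risk_grid, loss_fn, rounds):
    """FTL over a grid of candidate risk levels: each round, play the level
    with the smallest cumulative loss so far (ties broken by smaller level),
    then observe this round's loss for every level on the grid."""
    cum = {r: 0.0 for r in risk_grid}
    picks = []
    for t in range(rounds):
        leader = min(risk_grid, key=lambda r: (cum[r], r))
        picks.append(leader)
        for r in risk_grid:
            cum[r] += loss_fn(t, r)
    return picks

# toy loss whose best risk level is 0.5 (hypothetical, for illustration)
loss = lambda t, r: (r - 0.5) ** 2
picks = follow_the_leader([0.1, 0.3, 0.5, 0.7, 0.9], loss, rounds=10)
```

After a single round of feedback the leader locks onto the best level; with the paper's time-varying losses the same mechanism adapts the risk level online.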

[459] On Minimal Depth in Neural Networks

Juan L. Valerdi

Main category: cs.LG

TL;DR: Geometric framework analyzes ReLU network expressivity via depth complexity of convex polytopes, establishing bounds and showing Input Convex Neural Networks cannot represent all convex CPWL functions with fixed depth.

DetailsMotivation: To understand the relationship between neural network depth and representational capacity, particularly for ReLU networks, using geometric analysis of convex polytopes to derive theoretical bounds on expressivity.

Method: Develops a geometric framework analyzing expressivity of ReLU networks through depth complexity for convex polytopes, using alternating convex hull and Minkowski sum operations to construct polytopes and derive depth bounds.

Result: Establishes lower/upper bounds on polytope depth, provides a geometric proof of Arora et al.’s expressivity bound, and proves that convex polytopes lack a universal depth bound (the depth of cyclic polytopes in dimension n≥4 grows unboundedly with the number of vertices).

Conclusion: Input Convex Neural Networks cannot represent all convex CPWL functions with fixed depth, revealing expressivity separation between ICNNs and standard ReLU networks, with geometric framework providing rigorous depth analysis tools.

Abstract: Understanding the relationship between the depth of a neural network and its representational capacity is a central problem in deep learning theory. In this work, we develop a geometric framework to analyze the expressivity of ReLU networks with the notion of depth complexity for convex polytopes. The depth of a polytope recursively quantifies the number of alternating convex hull and Minkowski sum operations required to construct it. This geometric perspective serves as a rigorous tool for deriving depth lower bounds and understanding the structural limits of deep neural architectures. We establish lower and upper bounds on the depth of polytopes, as well as tight bounds for classical families. These results yield two main consequences. First, we provide a purely geometric proof of the expressivity bound by Arora et al. (2018), confirming that $\lceil \log_2(n+1)\rceil$ hidden layers suffice to represent any continuous piecewise linear (CPWL) function. Second, we prove that, unlike general ReLU networks, convex polytopes do not admit a universal depth bound. Specifically, the depth of cyclic polytopes in dimensions $n \geq 4$ grows unboundedly with the number of vertices. This result implies that Input Convex Neural Networks (ICNNs) cannot represent all convex CPWL functions with a fixed depth, revealing a sharp separation in expressivity between ICNNs and standard ReLU networks.
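
The $\lceil \log_2(n+1)\rceil$ bound (with $n$ the input dimension) rests on two standard facts; the compressed argument below is a sketch, not the paper's geometric proof:

```latex
% One hidden ReLU layer computes a pairwise maximum:
\max(a, b) = a + \operatorname{ReLU}(b - a),
% so a balanced binary tree of pairwise maxima evaluates
\max(x_1, \dots, x_{2^k})
% in k hidden layers. Since every CPWL function f : \mathbb{R}^n \to \mathbb{R}
% can be written as a signed sum of maxima of at most n+1 affine functions,
\lceil \log_2(n + 1) \rceil
% hidden layers suffice; e.g. inputs in \mathbb{R}^7 need at most 3.
```

The paper's contribution is replacing this algebraic route with depth complexity of polytopes, which also yields the ICNN separation result.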

[460] Revisiting Matrix Sketching in Linear Bandits: Achieving Sublinear Regret via Dyadic Block Sketching

Dongxie Wen, Hanyan Yin, Xiao Zhang, Peng Zhao, Lijun Zhang, Zhewei Wei

Main category: cs.LG

TL;DR: Proposes Dyadic Block Sketching for linear bandits to overcome spectral error issues in sketch-based methods, achieving sublinear regret without prior knowledge of streaming matrix properties.

DetailsMotivation: Sketch-based linear bandits reduce computational complexity but can suffer from vacuous linear regret when streaming matrices have heavy spectral tails due to inappropriate sketch sizes causing substantial spectral error.

Method: Dyadic Block Sketching - a novel multi-scale matrix sketching approach that dynamically adjusts sketch size during learning. It can be integrated with any matrix sketching method providing covariance guarantees.

Result: The algorithm achieves sublinear regret bounds without requiring prior knowledge of streaming matrix properties, demonstrating superior utility-efficiency trade-off in comprehensive experimental evaluation.

Conclusion: Establishes a general framework for efficient sketch-based linear bandits that overcomes spectral error issues through adaptive sketch sizing, providing robust performance across varying matrix properties.

Abstract: Linear bandits have become a cornerstone of online learning and sequential decision-making, providing solid theoretical foundations for balancing exploration and exploitation. Within this domain, matrix sketching serves as a critical component for achieving computational efficiency, especially when confronting high-dimensional problem instances. The sketch-based approaches reduce per-round complexity from $\Omega(d^2)$ to $O(dl)$, where $d$ is the dimension and $l<d$ is the sketch size. However, this computational efficiency comes with a fundamental pitfall: when the streaming matrix exhibits heavy spectral tails, such algorithms can incur vacuous \textit{linear regret}. In this paper, we revisit the regret bounds and algorithmic design for sketch-based linear bandits. Our analysis reveals that inappropriate sketch sizes can lead to substantial spectral error, severely undermining regret guarantees. To overcome this issue, we propose Dyadic Block Sketching, a novel multi-scale matrix sketching approach that dynamically adjusts the sketch size during the learning process. We apply this technique to linear bandits and demonstrate that the new algorithm achieves \textit{sublinear regret} bounds without requiring prior knowledge of the streaming matrix properties. It establishes a general framework for efficient sketch-based linear bandits, which can be integrated with any matrix sketching method that provides covariance guarantees. Comprehensive experimental evaluation demonstrates the superior utility-efficiency trade-off achieved by our approach.

[461] Towards Privacy-Guaranteed Label Unlearning in Vertical Federated Learning: Few-Shot Forgetting without Disclosure

Hanlin Gu, Hong Xi Tae, Lixin Fan, Chee Seng Chan

Main category: cs.LG

TL;DR: First method for label unlearning in Vertical Federated Learning using representation-level manifold mixup to generate synthetic embeddings for gradient-based forgetting and recovery.

DetailsMotivation: Addresses the critical challenge of unlearning in Vertical Federated Learning (VFL), which has received far less attention than horizontal federated learning, focusing on label unlearning where labels serve dual roles as essential inputs and sensitive information.

Method: Employs representation-level manifold mixup to generate synthetic embeddings for both unlearned and retained samples, followed by gradient-based label forgetting to remove associated label information, and a recovery-phase optimization step to refine remaining embeddings.

Result: Validated through extensive experiments on diverse datasets (MNIST, CIFAR-10, CIFAR-100, ModelNet, Brain Tumor MRI, COVID-19 Radiography, Yahoo Answers) demonstrating strong efficacy and scalability.

Conclusion: Establishes a new direction for unlearning in VFL, showing that re-imagining mixup as an efficient mechanism can unlock practical and utility-preserving unlearning.

Abstract: This paper addresses the critical challenge of unlearning in Vertical Federated Learning (VFL), a setting that has received far less attention than its horizontal counterpart. Specifically, we propose the first method tailored to \textit{label unlearning} in VFL, where labels play a dual role as both essential inputs and sensitive information. To this end, we employ a representation-level manifold mixup mechanism to generate synthetic embeddings for both unlearned and retained samples. This provides richer signals for the subsequent gradient-based label forgetting and recovery steps. These augmented embeddings are then subjected to gradient-based label forgetting, effectively removing the associated label information from the model. To recover performance on the retained data, we introduce a recovery-phase optimization step that refines the remaining embeddings. This design achieves effective label unlearning while maintaining computational efficiency. We validate our method through extensive experiments on diverse datasets, including MNIST, CIFAR-10, CIFAR-100, ModelNet, Brain Tumor MRI, COVID-19 Radiography, and Yahoo Answers, demonstrating strong efficacy and scalability. Overall, this work establishes a new direction for unlearning in VFL, showing that re-imagining mixup as an efficient mechanism can unlock practical and utility-preserving unlearning. The code is publicly available at https://github.com/bryanhx/Towards-Privacy-Guaranteed-Label-Unlearning-in-Vertical-Federated-Learning
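
The representation-level mixup at the core of the method is a one-liner on embedding vectors. A minimal sketch (the Beta concentration `alpha=0.4` is an illustrative value, not taken from the paper):

```python
import random

def manifold_mixup(emb_a, emb_b, alpha=0.4, seed=0):
    """Representation-level mixup: draw lam ~ Beta(alpha, alpha) and blend
    two embeddings, e = lam * a + (1 - lam) * b. In the paper's setting this
    synthesizes embeddings for unlearned and retained samples before the
    gradient-based forgetting step."""
    lam = random.Random(seed).betavariate(alpha, alpha)
    mixed = [lam * a + (1.0 - lam) * b for a, b in zip(emb_a, emb_b)]
    return mixed, lam

mixed, lam = manifold_mixup([1.0, 1.0, 1.0], [0.0, 0.0, 0.0])
```

Mixing at the embedding level rather than the input level is what keeps the mechanism cheap enough for the few-shot forgetting setting.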

[462] Quantifying Climate Change Impacts on Renewable Energy Generation: A Super-Resolution Recurrent Diffusion Model

Xiaochong Dong, Jun Dan, Yingyun Sun, Yang Liu, Xuemin Zhang, Shengwei Mei

Main category: cs.LG

TL;DR: A super-resolution recurrent diffusion model (SRDM) enhances temporal resolution of climate data to better model renewable energy generation under climate change scenarios.

DetailsMotivation: Climate data lacks sufficient hourly resolution needed for accurate renewable energy modeling, creating a gap between climate science requirements and power system planning needs for the energy transition.

Method: Developed SRDM with pre-trained decoder and denoising network using recurrent coupling mechanism to generate long-term, high-resolution climate data from low-resolution inputs, then converted to power values using mechanism models.

Result: SRDM outperforms existing generative models for super-resolution climate data generation and reveals estimation biases when using low-resolution climate data for power conversion in Ejina region case studies.

Conclusion: The SRDM approach effectively bridges the resolution gap in climate data for renewable energy modeling, providing more accurate power generation forecasts under climate change scenarios.

Abstract: Driven by global climate change and the ongoing energy transition, the coupling between power supply capabilities and meteorological factors has become increasingly significant. Over the long term, accurately quantifying the power generation of renewable energy under the influence of climate change is essential for the development of sustainable power systems. However, due to interdisciplinary differences in data requirements, climate data often lacks the necessary hourly resolution to capture the short-term variability and uncertainties of renewable energy resources. To address this limitation, a super-resolution recurrent diffusion model (SRDM) has been developed to enhance the temporal resolution of climate data and model the short-term uncertainty. The SRDM incorporates a pre-trained decoder and a denoising network, that generates long-term, high-resolution climate data through a recurrent coupling mechanism. The high-resolution climate data is then converted into power value using the mechanism model, enabling the simulation of wind and photovoltaic (PV) power generation on future long-term scales. Case studies were conducted in the Ejina region of Inner Mongolia, China, using fifth-generation reanalysis (ERA5) and coupled model intercomparison project (CMIP6) data under two climate pathways: SSP126 and SSP585. The results demonstrate that the SRDM outperforms existing generative models in generating super-resolution climate data. Furthermore, the research highlights the estimation biases introduced when low-resolution climate data is used for power conversion.

[463] Rethinking Uncertainty Estimation in LLMs: A Principled Single-Sequence Measure

Lukas Aichberger, Kajetan Schweighofer, Sepp Hochreiter

Main category: cs.LG

TL;DR: G-NLL: A computationally efficient uncertainty estimation method for LLMs using greedy decoding and negative log-likelihood of the most likely sequence, achieving SOTA performance without multiple sequence generation.

DetailsMotivation: Current uncertainty estimation methods for LLMs require generating multiple output sequences, which is computationally expensive and impractical at scale. There's a need for more efficient approaches to evaluate trustworthiness of generated text.

Method: Proposes G-NLL, which uses a single output sequence from greedy decoding to approximate the negative log-likelihood of the most likely output sequence. This builds on proper scoring rules framework to create a theoretically principled uncertainty measure.

Result: G-NLL achieves state-of-the-art performance across various scenarios while being computationally efficient. It demonstrates that complex, resource-intensive methods may not be necessary for reliable uncertainty estimation.

Conclusion: The work provides theoretical foundation for efficient uncertainty estimation in natural language generation, challenging the necessity of prevalent complex methods. G-NLL offers a streamlined approach that preserves theoretical rigor.

Abstract: Large Language Models (LLMs) are increasingly employed in real-world applications, driving the need to evaluate the trustworthiness of their generated text. To this end, reliable uncertainty estimation is essential. Leading uncertainty estimation methods generate and analyze multiple output sequences, which is computationally expensive and impractical at scale. In this work, we inspect the theoretical foundations of these methods and explore new directions to enhance computational efficiency. Building on the framework of proper scoring rules, we find that the negative log-likelihood of the most likely output sequence constitutes a theoretically principled uncertainty measure. To approximate this alternative measure, we propose G-NLL, obtained using a single output sequence from greedy decoding. This approach streamlines uncertainty estimation while preserving theoretical rigor. Empirical results demonstrate that G-NLL achieves state-of-the-art performance across various scenarios. Our work lays the theoretical foundation for efficient and reliable uncertainty estimation in natural language generation, challenging the necessity of the prevalent methods that are more complex and resource-intensive.
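
Given per-step logits from a single greedy decoding pass, the G-NLL quantity is just the summed negative log-probability of the argmax tokens. A minimal sketch with made-up logits (a real implementation would read these from the model's forward pass):

```python
import math

def g_nll(step_logits):
    """G-NLL sketch: sum, over decoding steps, the negative log-probability
    of the greedy (argmax) token under a numerically stable log-softmax of
    that step's logits. One greedy pass; no sampling of extra sequences."""
    nll = 0.0
    for logits in step_logits:
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        nll += log_z - max(logits)  # = -log p(greedy token)
    return nll

# two decoding steps over a tiny 3-token vocabulary (hypothetical logits)
u = g_nll([[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]])
```

Sharper per-step distributions give lower G-NLL, i.e. lower estimated uncertainty, which is the intuition the proper-scoring-rule analysis makes precise.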

[464] The Sample Complexity of Online Reinforcement Learning: A Multi-model Perspective

Michael Muehlebach, Zhiyu He, Michael I. Jordan

Main category: cs.LG

TL;DR: The paper studies sample complexity of online reinforcement learning in nonlinear dynamical systems with continuous state/action spaces, providing regret bounds for various system classes including neural networks and transformers.

DetailsMotivation: To understand the fundamental sample complexity limits of online RL in continuous nonlinear dynamical systems, which is crucial for practical applications but lacks theoretical guarantees compared to simpler linear systems.

Method: Theoretical analysis of policy regret bounds using packing numbers to measure function class complexity, with algorithms that incorporate prior knowledge and handle various system classes from finite nonlinear models to parameterized systems like neural networks.

Result: Achieved policy regret bounds of O(Nε² + d_u ln(m(ε))/ε²) for general systems and O(√(d_u N p)) for parameterized systems (neural networks/transformers), recovering linear system results as special cases.

Conclusion: The paper provides fundamental sample complexity guarantees for online RL in nonlinear continuous systems, with practical algorithms that are simple, incorporate prior knowledge, and have good transient behavior.

Abstract: We study the sample complexity of online reinforcement learning in the general non-episodic setting of nonlinear dynamical systems with continuous state and action spaces. Our analysis accommodates a large class of dynamical systems ranging from a finite set of nonlinear candidate models to models with bounded and Lipschitz continuous dynamics, to systems that are parametrized by a compact and real-valued set of parameters. In the most general setting, our algorithm achieves a policy regret of $\mathcal{O}(N \varepsilon^2 + d_\mathrm{u}\ln(m(\varepsilon))/\varepsilon^2)$, where $N$ is the time horizon, $\varepsilon$ is a user-specified discretization width, $d_\mathrm{u}$ the input dimension, and $m(\varepsilon)$ measures the complexity of the function class under consideration via its packing number. In the special case where the dynamics are parametrized by a compact and real-valued set of parameters (such as neural networks, transformers, etc.), we prove a policy regret of $\mathcal{O}(\sqrt{d_\mathrm{u}N p})$, where $p$ denotes the number of parameters, recovering earlier sample-complexity results that were derived for linear time-invariant dynamical systems. While this article focuses on characterizing sample complexity, the proposed algorithms are likely to be useful in practice, due to their simplicity, their ability to incorporate prior knowledge, and their benign transient behaviors.

[465] InfoBridge: Mutual Information estimation via Bridge Matching

Sergei Kholkin, Ivan Butakov, Evgeny Burnaev, Nikita Gushchin, Alexander Korotin

Main category: cs.LG

TL;DR: Diffusion bridge models used for mutual information estimation by framing it as a domain transfer problem

DetailsMotivation: To address mutual information estimation problems that are difficult for conventional estimators, leveraging the power of diffusion bridge models for generative modeling

Method: Frames MI estimation as a domain transfer problem using diffusion bridge models to construct an unbiased estimator

Result: Showcases performance on three standard MI estimation benchmarks (low-dimensional, image-based, high MI) and real-world protein language model embeddings

Conclusion: Diffusion bridge models provide an effective approach for mutual information estimation across various challenging scenarios

Abstract: Diffusion bridge models have recently become a powerful tool in the field of generative modeling. In this work, we leverage their power to address another important problem in machine learning and information theory, the estimation of the mutual information (MI) between two random variables. Neatly framing MI estimation as a domain transfer problem, we construct an unbiased estimator for data posing difficulties for conventional MI estimators. We showcase the performance of our estimator on three standard MI estimation benchmarks, i.e., low-dimensional, image-based and high MI, and on real-world data, i.e., protein language model embeddings.

[466] Probabilistic Neural Networks (PNNs) with t-Distributed Outputs: Adaptive Prediction Intervals Beyond Gaussian Assumptions

Farhad Pourkamali-Anaraki

Main category: cs.LG

TL;DR: TDistNNs: Neural networks with t-distributed outputs for better uncertainty quantification in regression, producing narrower prediction intervals than Gaussian-based methods while maintaining coverage.

DetailsMotivation: Traditional neural networks only provide point estimates without uncertainty quantification. Probabilistic neural networks (PNNs) address this but assume Gaussian distributions, leading to overly wide prediction intervals, especially with outliers or non-Gaussian data.

Method: Propose t-Distributed Neural Networks (TDistNNs) that generate t-distributed outputs parameterized by location, scale, and degrees of freedom. Incorporate t-distribution likelihood into neural network training and derive efficient gradient computations for deep learning integration.

Result: Empirical evaluations on synthetic and real-world data show TDistNNs improve balance between coverage and interval width. For identical architectures, TDistNNs consistently produce narrower prediction intervals than Gaussian-based PNNs while maintaining proper coverage.

Conclusion: TDistNNs provide a flexible framework for uncertainty estimation in neural network regression, particularly suited for settings with complex output distributions and heavy-tailed data.

Abstract: Traditional neural network regression models provide only point estimates, failing to capture predictive uncertainty. Probabilistic neural networks (PNNs) address this limitation by producing output distributions, enabling the construction of prediction intervals. However, the common assumption of Gaussian output distributions often results in overly wide intervals, particularly in the presence of outliers or deviations from normality. To enhance the adaptability of PNNs, we propose t-Distributed Neural Networks (TDistNNs), which generate t-distributed outputs, parameterized by location, scale, and degrees of freedom. The degrees of freedom parameter allows TDistNNs to model heavy-tailed predictive distributions, improving robustness to non-Gaussian data and enabling more adaptive uncertainty quantification. We incorporate a likelihood based on the t-distribution into neural network training and derive efficient gradient computations for seamless integration into deep learning frameworks. Empirical evaluations on synthetic and real-world data demonstrate that TDistNNs improve the balance between coverage and interval width. Notably, for identical architectures, TDistNNs consistently produce narrower prediction intervals than Gaussian-based PNNs while maintaining proper coverage. This work contributes a flexible framework for uncertainty estimation in neural networks tasked with regression, particularly suited to settings involving complex output distributions.
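
The training objective a TDistNN-style head minimizes is the Student-t negative log-likelihood of each target under the predicted location, scale, and degrees of freedom. A self-contained sketch of that per-sample loss (the network producing the three parameters is omitted):

```python
import math

def t_nll(y, loc, scale, df):
    """Per-sample negative log-likelihood of y under Student-t(loc, scale, df).
    Small df gives heavy tails; df -> infinity recovers the Gaussian NLL."""
    z = (y - loc) / scale
    log_norm = (math.lgamma((df + 1.0) / 2.0) - math.lgamma(df / 2.0)
                - 0.5 * math.log(df * math.pi) - math.log(scale))
    return -(log_norm - (df + 1.0) / 2.0 * math.log1p(z * z / df))
```

An outlier at z = 5 costs about 5.5 nats under df = 3 versus about 13.4 under a Gaussian, which is why a learnable-df head can afford narrower intervals in the bulk without sacrificing coverage in the tails.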

[467] Sparsity Forcing: Reinforcing Token Sparsity of MLLMs

Feng Chen, Yefei He, Lequan Lin, Chenhui Gou, Jing Liu, Bohan Zhuang, Qi Wu

Main category: cs.LG

TL;DR: Sparsity Forcing: RL-based post-training framework for MLLMs that optimizes token sparsity through multi-rollout contrastive learning with joint efficiency-performance rewards, achieving up to 75% token reduction with minimal accuracy loss.

DetailsMotivation: Existing sparse attention methods either exploit inherent model sparsity (plateauing at ~50% reduction) or use rigid trainable patterns that ignore input/layer dynamics. There's a need for methods that can push token budgets lower while maintaining accuracy through direct control over sparsity.

Method: RL-based post-training framework that runs multiple rollouts with different token budgets, formulating both efficiency (token reduction ratio) and performance (answer correctness) as joint rewards. Contrasts rollouts within groups to reward more efficient/correct answers and penalize less efficient/incorrect ones.

Result: Achieves token reduction from 20% to 75% on Qwen2-VL/Qwen2.5-VL with minimal accuracy decline across 13 image/video benchmarks. Reduces long-context inference memory by up to 3× and speeds up decoding by up to 3.3×.

Conclusion: Sparsity Forcing successfully turns token saving into an end-to-end, inference-consistent optimization objective, enabling significant efficiency gains in MLLMs while maintaining accuracy through direct RL-based sparsity control.

Abstract: Sparse attention mechanisms aim to reduce computational overhead with minimal accuracy loss by selectively processing salient tokens. Despite their effectiveness, most methods merely exploit a model’s inherent sparsity and thus plateau at moderate budgets (about 50% token reduction), with little headroom to push budget lower without hurting accuracy. Other approaches attempt to enforce sparsity through trainable sparse attention or sharpness-inducing regularizers, but these either fix rigid patterns that ignore input and layer dynamics, or optimize proxy objectives without direct control over token budgets. In this paper, we explicitly reinforce token sparsity in well-posed multimodal large language models (MLLMs) through a simple RL-based post-training framework named \textit{Sparsity Forcing}. Our method explores the efficiency-accuracy trade-off by running multiple rollouts with different token budgets, where both efficiency (token reduction ratio) and performance (answer correctness) are formulated as joint rewards. By contrasting rollouts within each group, the more efficient and correct answer is rewarded while less efficient or incorrect ones are penalized, thereby turning token saving into an end-to-end, inference-consistent optimization objective. Across thirteen image and video benchmarks, Sparsity Forcing raises token reduction ratio on Qwen2-VL/Qwen2.5-VL from 20% to 75% with minimal accuracy decline, significantly reducing long-context inference memory by up to 3$\times$ while speeding up decoding by up to 3.3$\times$.
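The group-contrastive reward scheme can be illustrated with a small sketch. The paper formulates efficiency (token reduction ratio) and correctness as joint rewards and contrasts rollouts within each group; the additive reward form and the weight `lam` below are assumptions for illustration, not the paper's exact formulation:

```python
def rollout_rewards(rollouts, lam=0.5):
    """rollouts: list of (correct: bool, kept_tokens: int, total_tokens: int).
    Reward = correctness + lam * token-reduction ratio (illustrative form)."""
    rewards = []
    for correct, kept, total in rollouts:
        reduction = 1.0 - kept / total          # efficiency term
        rewards.append(float(correct) + lam * reduction)
    return rewards

def group_advantages(rewards):
    """Normalize rewards within the rollout group (GRPO-style), so the
    more efficient and correct rollouts get positive advantages and the
    less efficient or incorrect ones get negative advantages."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0
    return [(r - mean) / std for r in rewards]

# Three rollouts of the same prompt under different token budgets:
group = [
    (True, 250, 1000),   # correct and efficient (75% reduction)
    (True, 800, 1000),   # correct but inefficient
    (False, 250, 1000),  # efficient but wrong
]
adv = group_advantages(rollout_rewards(group))
# The correct-and-efficient rollout receives the largest positive advantage.
```

The key design point is that token saving enters the same scalar objective as answer correctness, so sparsity is optimized end-to-end rather than through a proxy regularizer.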

[468] PAPN: Proximity Attention Encoder and Pointer Network Decoder for Parcel Pickup Route Prediction

Hansi Denis, Ali Anwar, Ngoc-Quang Luong, Siegfried Mercelis

Main category: cs.LG

TL;DR: Proximity Attention mechanism combined with Pointer Network decoder for parcel pickup route prediction, outperforming supervised methods on real-world logistics dataset.

DetailsMotivation: Optimizing last-mile delivery and first-mile pickup requires accurate route prediction systems to adapt to different scenarios in advance, which is crucial for cost efficiency and service quality in logistics.

Method: Novel Proximity Attention (PA) mechanism coupled with Pointer Network decoder to leverage connections between visitable pickup positions. Combines local attention with global context via multi-head attention transformer encoder, using PA in decoding to skew predictions toward locations with highest visit likelihood.

Result: Outperforms all state-of-the-art supervised methods on the LaDE (2024) dataset in most metrics, while remaining competitive with the best-performing reinforcement learning framework, DRL4Route (2023).

Conclusion: Proximity Attention with Pointer Network provides effective route prediction for logistics optimization, demonstrating superior performance on real-world parcel pickup/delivery data.

Abstract: Optimization of the last-mile delivery and first-mile pickup of parcels is integral to the logistics optimization pipeline as it entails both cost and resource efficiency and a heightened service quality. Such optimization requires accurate route and time prediction systems to adapt to different scenarios in advance. This work tackles the first building block, namely route prediction. The novel Proximity Attention (PA) mechanism is coupled to a Pointer Network (PN) decoder to leverage the underlying connections between the different visitable pickup positions at each timestep of the parcel pickup process. This local attention is coupled with global context computing via a multi-head attention transformer encoder. Both attentions are then mixed for complete and comprehensive modeling of the problems. PA is also used in the decoding process to skew predictions towards the locations with the highest visit likeliness, thus using inter-connectivity of nodes for next-location prediction. This method is trained, validated and tested on a large industry-level dataset of real-world, last-mile delivery and first-mile pickup named LaDE (2024). This approach outperforms all state-of-the-art supervised methods in terms of most metrics used for benchmarking on this dataset while still being competitive with the best-performing reinforcement learning framework named DRL4Route (2023).
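A toy sketch of how a proximity term can skew pointer-network decoding toward nearby locations, in the spirit of the PA mechanism described above (the additive bias, the distance form, and the temperature are assumptions, not the paper's formulation):

```python
import math

def decode_step(logits, coords, current, visited, temp=1.0):
    """One pointer-decoding step: given attention logits over all pickup
    locations, 2-D coordinates, the current location index, and the set
    of visited indices, return (next_index, probabilities)."""
    scores = []
    cx, cy = coords[current]
    for i, (lx, ly) in enumerate(coords):
        if i in visited:
            scores.append(float("-inf"))        # mask already-visited nodes
            continue
        dist = math.hypot(lx - cx, ly - cy)
        scores.append(logits[i] - dist / temp)  # proximity-skewed score
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    return max(range(len(probs)), key=probs.__getitem__), probs

coords = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
nxt, probs = decode_step([0.0, 0.0, 0.0], coords, current=0, visited={0})
# with equal logits, the nearest unvisited node (index 1) wins
```

In the actual model the logits would come from the multi-head attention encoder; the sketch only shows how the proximity bias shifts the next-location distribution.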

[469] Manifold Learning with Normalizing Flows: Towards Regularity, Expressivity and Iso-Riemannian Geometry

Willem Diepeveen, Deanna Needell

Main category: cs.LG

TL;DR: The paper proposes methods to improve learned Riemannian geometry for multi-modal data by isometrizing the learned structure and balancing diffeomorphism regularity/expressivity to address distortions and modeling errors.

DetailsMotivation: The manifold hypothesis suggests high-dimensional data lie near low-dimensional manifolds, and learning Riemannian geometry can improve performance in tasks like clustering and dimensionality reduction. However, real-world multi-modal data introduces distortions and modeling errors that need to be addressed.

Method: The paper proposes two approaches: 1) isometrizing the learned Riemannian structure to address distortions, and 2) balancing regularity and expressivity of the diffeomorphism parametrization to handle modeling errors in multi-modal settings.

Result: The combination of the two approaches is shown to be effective in numerical experiments on both synthetic and real data, demonstrating improved handling of multi-modal data.

Conclusion: The work addresses key challenges in applying learned Riemannian geometry to real-world multi-modal data through isometrization and balanced diffeomorphism parametrization, enabling more principled non-linear data analysis and interpretable machine learning.

Abstract: Modern machine learning increasingly leverages the insight that high-dimensional data often lie near low-dimensional, non-linear manifolds, an idea known as the manifold hypothesis. By explicitly modeling the geometric structure of data through learning Riemannian geometry algorithms can achieve improved performance and interpretability in tasks like clustering, dimensionality reduction, and interpolation. In particular, learned pullback geometry has recently undergone transformative developments that now make it scalable to learn and scalable to evaluate, which further opens the door for principled non-linear data analysis and interpretable machine learning. However, there are still steps to be taken when considering real-world multi-modal data. This work focuses on addressing distortions and modeling errors that can arise in the multi-modal setting and proposes to alleviate both challenges through isometrizing the learned Riemannian structure and balancing regularity and expressivity of the diffeomorphism parametrization. We showcase the effectiveness of the synergy of the proposed approaches in several numerical experiments with both synthetic and real data.
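A minimal numerical sketch of the learned pullback geometry the paper builds on: a diffeomorphism phi pulls the Euclidean metric back to G(x) = J_phi(x)^T J_phi(x). The map below is a toy stand-in for a trained normalizing flow; "isometrizing" the learned structure amounts to driving G(x) toward the identity, which removes exactly the distortions measured here:

```python
import math

def phi(x):
    """Toy diffeomorphism R^2 -> R^2 standing in for a learned flow."""
    return [x[0] + 0.1 * math.sin(x[1]), x[1]]

def jacobian(f, x, h=1e-6):
    """Forward-difference Jacobian of f at x."""
    fx = f(x)
    J = [[0.0] * len(x) for _ in range(len(fx))]
    for j in range(len(x)):
        xp = list(x)
        xp[j] += h
        fp = f(xp)
        for i in range(len(fx)):
            J[i][j] = (fp[i] - fx[i]) / h
    return J

def pullback_metric(f, x):
    """G(x) = J^T J: the Euclidean metric pulled back through f."""
    J = jacobian(f, x)
    n = len(x)
    return [[sum(J[k][i] * J[k][j] for k in range(len(J)))
             for j in range(n)] for i in range(n)]

G = pullback_metric(phi, [0.3, 0.7])
# G deviates from the identity exactly where phi distorts lengths.
```

For this phi, G is the identity plus an off-diagonal term 0.1*cos(x1); an iso-Riemannian correction would rescale distances so such terms no longer bias interpolation or clustering.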

[470] Apprenticeship learning with prior beliefs using inverse optimization

Mauricio Junca, Esteban Leiva

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2505.21639 returned HTTP 429 (rate limited), so no abstract or analysis could be generated.

[471] On the Lipschitz Continuity of Set Aggregation Functions and Neural Networks for Sets

Giannis Nikolentzos, Konstantinos Skianis

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2505.24403 returned HTTP 429 (rate limited), so no abstract or analysis could be generated.

[472] DisTaC: Conditioning Task Vectors via Distillation for Robust Model Merging

Kotaro Yoshida, Yuji Naraki, Takafumi Horie, Ryotaro Shimizu, Hiroki Naganuma

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2508.01148 returned HTTP 429 (rate limited), so no abstract or analysis could be generated.

[473] Physics-Informed Time-Integrated DeepONet: Temporal Tangent Space Operator Learning for High-Accuracy Inference

Luis Mandl, Dibyajyoti Nayak, Tim Ricken, Somdatta Goswami

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2508.05190 returned HTTP 429 (rate limited), so no abstract or analysis could be generated.

[474] On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification

Yongliang Wu, Yizhou Zhou, Zhou Ziheng, Yingzhe Peng, Xinyu Ye, Xinting Hu, Wenbo Zhu, Lu Qi, Ming-Hsuan Yang, Xu Yang

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2508.05629 returned HTTP 429 (rate limited), so no abstract or analysis could be generated.

[475] Federated Nonlinear System Identification

Omkar Tupe, Max Hartman, Lav R. Varshney, Saurav Prakash

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2508.15025 returned HTTP 429 (rate limited), so no abstract or analysis could be generated.

[476] ProtoTS: Learning Hierarchical Prototypes for Explainable Time Series Forecasting

Ziheng Peng, Shijie Ren, Xinyue Gu, Linxiao Yang, Xiting Wang, Liang Sun

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2509.23159 returned HTTP 429 (rate limited), so no abstract or analysis could be generated.

[477] Deep Learning for Subspace Regression

Vladimir Fanaskov, Vladislav Trifonov, Alexander Rudikov, Ekaterina Muravleva, Ivan Oseledets

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2509.23249 returned HTTP 429 (rate limited), so no abstract or analysis could be generated.

[478] In-Context Learning of Temporal Point Processes with Foundation Inference Models

David Berghaus, Patrick Seifner, Kostadin Cvejoski, César Ojeda, Ramsés J. Sánchez

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2509.24762 returned HTTP 429 (rate limited), so no abstract or analysis could be generated.

[479] Linking Process to Outcome: Conditional Reward Modeling for LLM Reasoning

Zheng Zhang, Ziwei Shan, Kaitao Song, Yexin Li, Kan Ren

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2509.26578 returned HTTP 429 (rate limited), so no abstract or analysis could be generated.

[480] Synthesising Counterfactual Explanations via Label-Conditional Gaussian Mixture Variational Autoencoders

Junqi Jiang, Francesco Leofante, Antonio Rago, Francesca Toni

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2510.04855 returned HTTP 429 (rate limited), so no abstract or analysis could be generated.

[481] Learning Mixtures of Linear Dynamical Systems via Hybrid Tensor-EM Method

Lulu Gong, Shreya Saxena

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2510.06091 returned HTTP 429 (rate limited), so no abstract or analysis could be generated.

[482] Uncertainty-aware data assimilation through variational inference

Anthony Frion, David S Greenberg

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2510.17268 returned HTTP 429 (rate limited), so no abstract or analysis could be generated.

[483] Unified Privacy Guarantees for Decentralized Learning via Matrix Factorization

Aurélien Bellet, Edwige Cyffers, Davide Frey, Romaric Gaudel, Dimitri Lerévérend, François Taïani

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2510.17480 returned HTTP 429 (rate limited), so no abstract or analysis could be generated.

[484] FAPO: Flawed-Aware Policy Optimization for Efficient and Reliable Reasoning

Yuyang Ding, Chi Zhang, Juntao Li, Haibin Lin, Min Zhang

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2510.22543 returned HTTP 429 (rate limited), so no abstract or analysis could be generated.

[485] Descend or Rewind? Stochastic Gradient Descent Unlearning

Siqiao Mu, Diego Klabjan

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2511.15983 returned HTTP 429 (rate limited), so no abstract or analysis could be generated.

[486] Log Probability Tracking of LLM APIs

Timothée Chauvin, Erwan Le Merrer, François Taïani, Gilles Tredan

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2512.03816 returned HTTP 429 (rate limited), so no abstract or analysis could be generated.

[487] Convex Loss Functions for Support Vector Machines (SVMs) and Neural Networks

Filippo Portera

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2601.21331 returned HTTP 429 (rate limited), so no abstract or analysis could be generated.

[488] Federated-inspired Single-cell Batch Integration in Latent Space

Quang-Huy Nguyen, Zongliang Yue, Hao Chen, Wei-Shinn Ku, Jiaqi Wang

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2602.00423 returned HTTP 429 (rate limited), so no abstract or analysis could be generated.

[489] Position: Beyond Model-Centric Prediction – Agentic Time Series Forecasting

Mingyue Cheng, Xiaoyu Tao, Qi Liu, Ze Guo, Enhong Chen

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2602.01776 returned HTTP 429 (rate limited), so no abstract or analysis could be generated.

[490] Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization

Haocheng Xi, Shuo Yang, Yilong Zhao, Muyang Li, Han Cai, Xingyang Li, Yujun Lin, Zhuoyang Zhang, Jintao Zhang, Xiuyu Li, Zhiying Xu, Jun Wu, Chenfeng Xu, Ion Stoica, Song Han, Kurt Keutzer

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2602.02958 returned HTTP 429 (rate limited), so no abstract or analysis could be generated.

[491] Robust Online Learning

Sajad Ashkezari

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2602.06775 returned HTTP 429 (rate limited), so no abstract or analysis could be generated.

[492] Unified Biomolecular Trajectory Generation via Pretrained Variational Bridge

Ziyang Yu, Wenbing Huang, Yang Liu

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2602.07588 returned HTTP 429 (rate limited), so no abstract or analysis could be generated.

[493] It’s TIME: Towards the Next Generation of Time Series Forecasting Benchmarks

Zhongzheng Qiao, Sheng Pan, Anni Wang, Viktoriya Zhukova, Yong Liu, Xudong Jiang, Qingsong Wen, Mingsheng Long, Ming Jin, Chenghao Liu

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2602.12147 returned HTTP 429 (rate limited), so no abstract or analysis could be generated.

[494] CAMEL: An ECG Language Model for Forecasting Cardiac Events

Neelay Velingker, Alaia Solko-Breslin, Mayank Keoliya, Seewon Choi, Jiayi Xin, Anika Marathe, Alireza Oraii, Rajat Deo, Sameed Khatana, Rajeev Alur, Mayur Naik, Eric Wong

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2602.15677 returned HTTP 429 (rate limited), so no abstract or analysis could be generated.

[495] Exponential Convergence of (Stochastic) Gradient Descent for Separable Logistic Regression

Sacchit Kale, Piyushi Manupriya, Pierre Marion, Francis Bach, Anant Raj

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2602.18946 returned HTTP 429 (rate limited), so no abstract or analysis could be generated.

[496] Discrete Diffusion with Sample-Efficient Estimators for Conditionals

Karthik Elamvazhuthi, Abhijith Jayakumar, Andrey Y. Lokhov

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2602.20293 returned HTTP 429 (rate limited), so no abstract or analysis could be generated.

[497] When Should a Model Change Its Mind? An Energy-Based Theory and Regularizer for Concept Drift in Electrocardiogram (ECG) Signals

Timothy Oladunni, Blessing Ojeme, Kyndal Maclin, Clyde Baidoo

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2602.22294 returned HTTP 429 (rate limited), so no abstract or analysis could be generated.

[498] Regularized Online RLHF with Generalized Bilinear Preferences

Junghyun Lee, Minju Hong, Kwang-Sung Jun, Chulhee Yun, Se-Young Yun

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2602.23116 returned HTTP 429 (rate limited), so no abstract or analysis could be generated.

[499] Stationary Kernels and Gaussian Processes on Lie Groups and their Homogeneous Spaces I: the compact case

Iskander Azangulov, Andrei Smolensky, Alexander Terenin, Viacheslav Borovitskiy

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2208.14960 returned HTTP 429 (rate limited), so no abstract or analysis could be generated.

[500] Assessment of Spatio-Temporal Predictors in the Presence of Missing and Heterogeneous Data

Daniele Zambon, Cesare Alippi

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2302.01701 returned HTTP 429 (rate limited).

[501] Kernel spectral joint embeddings for high-dimensional noisy datasets using duo-landmark integral operators

Xiucai Ding, Rong Ma

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2405.12317 returned HTTP 429 (rate limited).

[502] Polynomial Scaling is Possible For Neural Operator Approximations of Structured Families of BSDEs

Takashi Furuya, Anastasis Kratsios

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2410.14788 returned HTTP 429 (rate limited).

[503] Forecasting Local Ionospheric Parameters Using Transformers

Daniel J. Alford-Lago, Christopher W. Curtis, Alexander T. Ihler, Katherine A. Zawdie, Douglas P. Drob

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2502.15093 returned HTTP 429 (rate limited).

[504] Apple: Toward General Active Perception via Reinforcement Learning

Tim Schneider, Cristiana de Farias, Roberto Calandra, Liming Chen, Jan Peters

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2505.06182 returned HTTP 429 (rate limited).

[505] Quantum Learning and Estimation for Coordinated Operation between Distribution Networks and Energy Communities

Yingrui Zhuang, Lin Cheng, Yuji Cao, Tongxin Li, Ning Qi, Yan Xu, Yue Chen

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2506.11730 returned HTTP 429 (rate limited).

[506] Geometric Autoencoder Priors for Bayesian Inversion: Learn First Observe Later

Arnaud Vadeboncoeur, Gregory Duthé, Mark Girolami, Eleni Chatzi

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2509.19929 returned HTTP 429 (rate limited).

[507] Structure tensor Reynolds-averaged Navier-Stokes turbulence models with equivariant neural networks

Aaron Miller, Sahil Kommalapati, Robert Moser, Petros Koumoutsakos

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2511.09769 returned HTTP 429 (rate limited).

[508] An operator splitting analysis of Wasserstein–Fisher–Rao gradient flows

Francesca Romana Crucinio, Sahani Pathiraja

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2511.18060 returned HTTP 429 (rate limited).

[509] Learning to Optimize by Differentiable Programming

Liping Tao, Xindi Tong, Chee Wei Tan

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2601.16510 returned HTTP 429 (rate limited).

[510] GreenServ: Energy-Efficient Context-Aware Dynamic Routing for Multi-Model LLM Inference

Thomas Ziller, Shashikant Ilager, Alessandro Tundo, Ezio Bartocci, Leonardo Mariani, Ivona Brandic

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2601.17551 returned HTTP 429 (rate limited).

[511] Flow-Enabled Generalization to Human Demonstrations in Few-Shot Imitation Learning

Runze Tang, Penny Sweetser

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2602.10594 returned HTTP 429 (rate limited).

[512] Fluids You Can Trust: Property-Preserving Operator Learning for Incompressible Flows

Ramansh Sharma, Matthew Lowery, Houman Owhadi, Varun Shankar

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2602.15472 returned HTTP 429 (rate limited).

[513] Sparse Bayesian Modeling of EEG Channel Interactions Improves P300 Brain-Computer Interface Performance

Guoxuan Ma, Yuan Zhong, Moyan Li, Yuxiao Nie, Jian Kang

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2602.17772 returned HTTP 429 (rate limited).

[514] Implicit Bias and Convergence of Matrix Stochastic Mirror Descent

Danil Akhtiamov, Reza Ghane, Omead Pooladzandi, Babak Hassibi

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2602.18997 returned HTTP 429 (rate limited).

cs.MA

[515] Optimization of Edge Directions and Weights for Mixed Guidance Graphs in Lifelong Multi-Agent Path Finding

Yulun Zhang, Varun Bhatt, Matthew C. Fontaine, Stefanos Nikolaidis, Jiaoyang Li

Main category: cs.MA

TL;DR: This paper introduces Mixed Guidance Graph Optimization (MGGO) for Lifelong Multi-Agent Path Finding, which optimizes both edge weights AND directions (providing strict guidance) rather than just weights (soft guidance).

DetailsMotivation: Current Guidance Graph Optimization (GGO) methods only optimize edge weights, providing soft guidance where high weights discourage but don't prohibit edge usage. The authors identify the need for strict guidance through edge direction optimization to better control agent movement in lifelong MAPF scenarios.

Method: Proposes Mixed Guidance Graph Optimization (MGGO) with two approaches: 1) Two-phase optimization separating edge direction and weight optimization, and 2) Quality Diversity algorithms using neural networks to generate both edge directions and weights. Also incorporates traffic patterns into GGO for edge-direction-aware guidance.
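The soft-versus-strict distinction above can be made concrete with a toy sketch: each directed edge carries a weight (soft guidance: a high cost discourages use) plus an allowed flag (strict guidance: an optimized direction forbids traversal outright). All names here are hypothetical illustrations, not the paper's actual data structures.

```python
# Minimal sketch of a mixed guidance graph: weight = soft guidance,
# allowed flag = strict guidance from edge-direction optimization.
INF = float("inf")

def edge_cost(guidance, u, v):
    """Cost of moving u -> v: the edge weight if the direction is allowed,
    infinity if the optimized direction prohibits it (or the edge is absent)."""
    allowed, weight = guidance.get((u, v), (False, INF))
    return weight if allowed else INF

# A two-vertex corridor that direction optimization makes one-way A -> B.
guidance = {
    ("A", "B"): (True, 1.5),   # allowed, moderate cost
    ("B", "A"): (False, 1.5),  # direction optimized away: strictly forbidden
}

print(edge_cost(guidance, "A", "B"))  # 1.5
print(edge_cost(guidance, "B", "A"))  # inf
```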

Result: The paper presents MGGO methods that can optimize both edge weights and directions, providing stricter guidance than previous GGO methods that only optimized weights. The methods enable better control over agent movement patterns.

Conclusion: Incorporating edge direction optimization into guidance graph methods provides stricter guidance for lifelong MAPF, addressing limitations of weight-only optimization and improving agent movement control.

Abstract: Multi-Agent Path Finding (MAPF) aims to move agents from their start to goal vertices on a graph. Lifelong MAPF (LMAPF) continuously assigns new goals to agents as they complete current ones. To guide agents’ movement in LMAPF, prior works have proposed Guidance Graph Optimization (GGO) methods to optimize a guidance graph, which is a bidirected weighted graph whose directed edges represent moving and waiting actions with edge weights being action costs. Higher edge weights represent higher action costs. However, edge weights only provide soft guidance. An edge with a high weight only discourages agents from using it, instead of prohibiting agents from traversing it. In this paper, we explore the need to incorporate edge direction optimization into GGO, providing strict guidance. We generalize GGO to Mixed Guidance Graph Optimization (MGGO), presenting two MGGO methods capable of optimizing both edge weights and directions. The first optimizes edge directions and edge weights in two separate phases. The second applies Quality Diversity algorithms to optimize a neural network capable of generating edge directions and weights. We also incorporate traffic patterns relevant to edge directions into a GGO method, making it capable of generating edge-direction-aware guidance graphs.

[516] A Novel Hierarchical Multi-Agent System for Payments Using LLMs

Joon Kiat Chua, Donghao Huang, Zhaoxia Wang

Main category: cs.MA

TL;DR: HMASP is a hierarchical multi-agent system using LLMs to enable end-to-end agentic payment workflows, addressing the gap where current LLM agents can’t handle payment tasks.

DetailsMotivation: Current LLM agents (like OpenAI's Operator and Claude's Computer Use) can automate workflows but cannot handle payment tasks. Existing solutions face challenges in implementing end-to-end agentic payment workflows, creating a gap in agentic capabilities for payment operations.

Method: Proposes HMASP with modular architecture using open-weight or proprietary LLMs. Four hierarchical levels: Conversational Payment Agent (entry point), Supervisor agents, Routing agents, and Process summary agent. Uses shared state variables, decoupled message states, and structured handoff protocols for coordination across agents and workflows.
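The four-level handoff described above might look like the following sketch, where a shared state dict is passed down the hierarchy and each handoff appends a decoupled message for the next agent. The level names come from the summary; the field names and routing logic are illustrative assumptions, not the paper's API.

```python
# Hypothetical sketch of a structured handoff between HMASP's agent levels.
LEVELS = ["conversational", "supervisor", "routing", "summary"]

def handoff(state, payload):
    """Advance the shared state one level down the hierarchy, recording
    a message for the receiving agent."""
    i = LEVELS.index(state["level"])
    if i + 1 >= len(LEVELS):
        raise ValueError("already at the final (process summary) level")
    state = dict(state, level=LEVELS[i + 1])  # copy, don't mutate caller's state
    state["messages"] = state["messages"] + [(state["level"], payload)]
    return state

state = {"level": "conversational", "messages": []}
state = handoff(state, "pay the supplier invoice via bank transfer")
print(state["level"])  # supervisor
```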

Result: Experimental results demonstrate feasibility of HMASP. It’s the first LLM-based multi-agent system to implement end-to-end agentic payment workflows.

Conclusion: HMASP lays foundation for extending agentic capabilities into payment domain, enabling LLM agents to handle payment workflows end-to-end.

Abstract: Large language model (LLM) agents, such as OpenAI’s Operator and Claude’s Computer Use, can automate workflows but are unable to handle payment tasks. Existing agentic solutions have gained significant attention; however, even the latest approaches face challenges in implementing end-to-end agentic payment workflows. To address this gap, this research proposes the Hierarchical Multi-Agent System for Payments (HMASP), which provides an end-to-end agentic method for completing payment workflows. The proposed HMASP leverages either open-weight or proprietary LLMs and employs a modular architecture consisting of the Conversational Payment Agent (CPA - first agent level), Supervisor agents (second agent level), Routing agents (third agent level), and the Process summary agent (fourth agent level). The CPA serves as the central entry point, handling all external requests and coordinating subsequent tasks across hierarchical levels. HMASP incorporates architectural patterns that enable modular task execution across agents and levels for payment operations, including shared state variables, decoupled message states, and structured handoff protocols that facilitate coordination across agents and workflows. Experimental results demonstrate the feasibility of the proposed HMASP. To our knowledge, HMASP is the first LLM-based multi-agent system to implement end-to-end agentic payment workflows. This work lays a foundation for extending agentic capabilities into the payment domain.

[517] Sharing is caring: data sharing in multi-agent supply chains

Wan Wang, Haiyan Wang, Adam Sobey

Main category: cs.MA

TL;DR: Multi-agent system for supply networks where factory agent can strategically share information downstream (truth, lies, or mixed) to improve system performance, with cooperative reward shaping enhancing benefits.

DetailsMotivation: Most multi-agent supply network models assume full observability through shared policies, which is unrealistic due to data privacy concerns. Alternative Hidden-Markov approaches are challenging. Need realistic approach where agents can strategically share information without full data disclosure.

Method: Propose multi-agent system where factory agent can choose information sharing strategy: no sharing, lying, telling truth, or mixed strategies. Combine with cooperative reward shaping to align incentives. Evaluate in different demand scenarios.
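The factory agent's action space over sharing strategies can be sketched as below. The strategy names (none, truth, lie, mixed) follow the summary; the specific signal model (an inflated inventory report as the "lie") is an assumption for illustration.

```python
# Toy sketch of the factory agent's information-sharing strategies.
import random

def share_signal(true_inventory, strategy, rng=random.Random(0)):
    """What the downstream agent observes, given the factory's strategy."""
    if strategy == "none":
        return None                      # no observability gained downstream
    if strategy == "truth":
        return true_inventory            # full, honest signal
    if strategy == "lie":
        return true_inventory * 2        # systematically inflated report
    if strategy == "mixed":
        # Mixed strategy: randomize between truth and lying.
        return share_signal(true_inventory, rng.choice(["truth", "lie"]), rng)
    raise ValueError(strategy)

print(share_signal(100, "truth"))  # 100
print(share_signal(100, "lie"))    # 200
```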

Result: Data sharing boosts performance, especially with cooperative reward shaping. High demand: limited strategy flexibility, lying benefits factory slightly but overall system improvement small. Low demand: truth-telling benefits all actors significantly.

Conclusion: Strategic information sharing in multi-agent supply networks can improve system performance without requiring full data disclosure. Truth-telling works best in low demand, while lying provides marginal gains in high demand scenarios.

Abstract: Modern supply networks are complex interconnected systems. Multi-agent models are increasingly explored to optimise their performance. Most research assumes agents will have full observability of the system by having a single policy represent the agents, which seems unrealistic as this requires companies to share their data. The alternative is to develop a Hidden-Markov Process with separate policies, making the problem challenging to solve. In this paper, we propose a multi-agent system where the factory agent can share information downstream, increasing the observability of the environment. It can choose to share no information, lie, tell the truth or combine these in a mixed strategy. The results show that data sharing can boost the performance, especially when combined with a cooperative reward shaping. In the high demand scenario there is limited ability to change the strategy and therefore no data sharing approach benefits both agents. However, lying benefits the factory enough for an overall system improvement, although only by a relatively small amount compared to the overall reward. In the low demand scenario, the most successful data sharing is telling the truth which benefits all actors significantly.

[518] City Editing: Hierarchical Agentic Execution for Dependency-Aware Urban Geospatial Modification

Rui Liu, Steven Jige Quan, Zhong-Ren Peng, Zijun Yao, Han Wang, Zhengzhang Chen, Kunpeng Liu, Yanjie Fu, Dongjie Wang

Main category: cs.MA

TL;DR: A hierarchical agentic framework for automated urban plan editing using GeoJSON representation and multimodal reasoning to decompose natural language instructions into geometric operations.

DetailsMotivation: Urban renewal requires efficient modification of existing plans rather than complete re-planning, but current manual geospatial layout editing is slow and labor-intensive, hindering iterative planning and decision-making.

Method: Formulates urban renewal as machine-executable task using GeoJSON representation. Decomposes natural-language editing instructions into hierarchical geometric intents (polygon-, line-, point-level operations). Proposes hierarchical agentic framework for multi-level planning and execution with explicit spatial constraint propagation. Introduces iterative execution-validation mechanism to mitigate error accumulation and enforce global spatial consistency.
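A polygon-level primitive of the kind the framework decomposes instructions into might look like this sketch, which translates one GeoJSON building footprint. The helper and its behavior are assumptions about what such a primitive could be, not the paper's implementation.

```python
# Illustrative polygon-level edit on a GeoJSON urban layout:
# shift every ring coordinate of a Polygon feature by (dx, dy).
def translate_polygon(feature, dx, dy):
    assert feature["geometry"]["type"] == "Polygon"
    rings = feature["geometry"]["coordinates"]
    moved = [[[x + dx, y + dy] for x, y in ring] for ring in rings]
    out = dict(feature)  # shallow copy; original feature stays untouched
    out["geometry"] = {"type": "Polygon", "coordinates": moved}
    return out

block = {"type": "Feature",
         "geometry": {"type": "Polygon",
                      "coordinates": [[[0, 0], [10, 0], [10, 10], [0, 10], [0, 0]]]}}
moved = translate_polygon(block, 5, -2)
print(moved["geometry"]["coordinates"][0][0])  # [5, -2]
```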

Result: Extensive experiments across diverse urban editing scenarios demonstrate significant improvements in efficiency, robustness, correctness, and spatial validity over existing baselines.

Conclusion: The proposed hierarchical agentic framework enables efficient automated urban plan editing through structured geospatial representation and multimodal reasoning, addressing practical challenges in urban renewal workflows.

Abstract: As cities evolve over time, challenges such as traffic congestion and functional imbalance increasingly necessitate urban renewal through efficient modification of existing plans, rather than complete re-planning. In practice, even minor urban changes require substantial manual effort to redraw geospatial layouts, slowing the iterative planning and decision-making procedure. Motivated by recent advances in agentic systems and multimodal reasoning, we formulate urban renewal as a machine-executable task that iteratively modifies existing urban plans represented in structured geospatial formats. More specifically, we represent urban layouts using GeoJSON and decompose natural-language editing instructions into hierarchical geometric intents spanning polygon-, line-, and point-level operations. To coordinate interdependent edits across spatial elements and abstraction levels, we propose a hierarchical agentic framework that jointly performs multi-level planning and execution with explicit propagation of intermediate spatial constraints. We further introduce an iterative execution-validation mechanism that mitigates error accumulation and enforces global spatial consistency during multi-step editing. Extensive experiments across diverse urban editing scenarios demonstrate significant improvements in efficiency, robustness, correctness, and spatial validity over existing baselines.

[519] QD-MAPPER: A Quality Diversity Framework to Automatically Evaluate Multi-Agent Path Finding Algorithms in Diverse Maps

Cheng Qian, Yulun Zhang, Varun Bhatt, Matthew Christopher Fontaine, Stefanos Nikolaidis, Jiaoyang Li

Main category: cs.MA

TL;DR: QD-MAPPER uses Quality Diversity algorithms with Neural Cellular Automata to automatically generate diverse maps for comprehensive evaluation of Multi-Agent Path Finding algorithms, addressing limitations of fixed human-designed test maps.

DetailsMotivation: Current MAPF algorithm evaluation relies on limited human-designed maps, which may not cover all scenarios and can lead to algorithm overfitting. There's a need for systematic evaluation on diverse maps to better understand algorithm performance and enable fair comparisons.

Method: Proposes QD-MAPPER framework combining Quality Diversity algorithms with Neural Cellular Automata to automatically generate diverse maps with patterns. Uses this to evaluate various MAPF algorithm types including search-based, priority-based, rule-based, and learning-based approaches.
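The Quality Diversity mechanism behind map generation can be sketched as a MAP-Elites style archive: each cell of a discretized behavior-descriptor grid keeps the best-scoring map found for that cell. The descriptors used here (obstacle density, corridor width) and the insertion rule are hypothetical simplifications.

```python
# MAP-Elites style archive sketch: one elite map per descriptor cell.
def insert(archive, descriptor, score, candidate_map, bins=10):
    """Discretize the descriptor and keep the highest-scoring map per cell."""
    cell = tuple(min(int(d * bins), bins - 1) for d in descriptor)
    if cell not in archive or score > archive[cell][0]:
        archive[cell] = (score, candidate_map)
    return cell

archive = {}
insert(archive, (0.31, 0.75), score=0.8, candidate_map="map_a")
insert(archive, (0.33, 0.72), score=0.5, candidate_map="map_b")  # same cell, worse
print(archive[(3, 7)])  # (0.8, 'map_a')
```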

Result: Enables identification of patterns where each MAPF algorithm excels and detection of disparities in runtime or success rates between different algorithms through both single-algorithm experiments and algorithm comparisons.

Conclusion: QD-MAPPER provides a general framework for comprehensive MAPF algorithm evaluation, offering insights for algorithm selection and design improvements by generating diverse test scenarios beyond human-designed maps.

Abstract: We use the Quality Diversity (QD) algorithm with Neural Cellular Automata (NCA) to automatically evaluate Multi-Agent Path Finding (MAPF) algorithms by generating diverse maps. Previously, researchers have typically evaluated MAPF algorithms on a set of specific, human-designed maps at the initial stage of algorithm design. However, such fixed maps may not cover all scenarios, and algorithms may overfit to the small set of maps. To seek further improvements, systematic evaluations on a diverse suite of maps are needed. In this work, we propose Quality-Diversity Multi-Agent Path Finding Performance EvaluatoR (QD-MAPPER), a general framework that takes advantage of the QD algorithm to comprehensively understand the performance of MAPF algorithms by generating maps with patterns, enabling fair comparisons between two MAPF algorithms and providing further information on the selection between algorithms and on their design. Empirically, we employ this technique to evaluate and compare the behavior of different types of MAPF algorithms, including search-based, priority-based, rule-based, and learning-based algorithms. Through both single-algorithm experiments and comparisons between algorithms, researchers can identify patterns in which each MAPF algorithm excels and detect disparities in runtime or success rates between different algorithms.

cs.MM

[520] MSVBench: Towards Human-Level Evaluation of Multi-Shot Video Generation

Haoyuan Shi, Yunxin Li, Nanhao Deng, Zhenran Xu, Xinyu Chen, Longyue Wang, Baotian Hu, Min Zhang

Main category: cs.MM

TL;DR: MSVBench is a comprehensive benchmark for evaluating multi-shot video generation, featuring hierarchical scripts and reference images, with a hybrid evaluation framework combining LMMs and domain-specific models.

DetailsMotivation: Current video generation evaluation methods are inadequate for complex, multi-shot narratives as they remain anchored to single-shot paradigms, lacking comprehensive story assets and cross-shot metrics needed to assess long-form coherence and appeal.

Method: Introduces MSVBench with hierarchical scripts and reference images for multi-shot video generation. Proposes a hybrid evaluation framework that synergizes Large Multimodal Models (LMMs) for high-level semantic reasoning with domain-specific expert models for fine-grained perceptual assessment.

Result: Evaluation of 20 video generation methods reveals current models primarily behave as visual interpolators rather than true world models. MSVBench achieves 94.4% Spearman’s rank correlation with human judgments. Fine-tuning a lightweight model on pipeline-refined reasoning traces yields human-aligned performance comparable to commercial models like Gemini-2.5-Flash.
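The reliability check reported above, rank-correlating benchmark scores against human judgments, uses Spearman's rank correlation; for tie-free data it reduces to the classic formula below. The scores in this sketch are made up for illustration.

```python
# Spearman's rank correlation, no-ties case:
# rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), where d_i is the rank difference.
def spearman(xs, ys):
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

metric_scores = [0.9, 0.4, 0.7, 0.2, 0.6]   # hypothetical benchmark scores
human_scores  = [4.5, 2.0, 4.0, 1.0, 3.0]   # hypothetical human ratings
print(spearman(metric_scores, human_scores))  # 1.0 (identical ranking)
```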

Conclusion: MSVBench addresses critical gaps in multi-shot video generation evaluation and provides a scalable supervisory signal for model improvement, demonstrating that current models lack true world modeling capabilities despite strong visual fidelity.

Abstract: The evolution of video generation toward complex, multi-shot narratives has exposed a critical deficit in current evaluation methods. Existing benchmarks remain anchored to single-shot paradigms, lacking the comprehensive story assets and cross-shot metrics required to assess long-form coherence and appeal. To bridge this gap, we introduce MSVBench, the first comprehensive benchmark featuring hierarchical scripts and reference images tailored for Multi-Shot Video generation. We propose a hybrid evaluation framework that synergizes the high-level semantic reasoning of Large Multimodal Models (LMMs) with the fine-grained perceptual rigor of domain-specific expert models. Evaluating 20 video generation methods across diverse paradigms, we find that current models, despite strong visual fidelity, primarily behave as visual interpolators rather than true world models. We further validate the reliability of our benchmark by demonstrating a state-of-the-art Spearman’s rank correlation of 94.4% with human judgments. Finally, MSVBench extends beyond evaluation by providing a scalable supervisory signal. Fine-tuning a lightweight model on its pipeline-refined reasoning traces yields human-aligned performance comparable to commercial models like Gemini-2.5-Flash.

[521] M3TR: Temporal Retrieval Enhanced Multi-Modal Micro-video Popularity Prediction

Jiacheng Lu, Weijian Wang, Mingyuan Xiao, Yang Hua, Tao Song, Jiaru Zhang, Bo Peng, Cheng Hua, Haibing Guan

Main category: cs.MM

TL;DR: M³TR is a temporal retrieval enhanced multi-modal framework for micro-video popularity prediction that combines fine-grained temporal modeling with temporal-aware retrieval using Mamba-Hawkes Process and multi-modal content analysis.

DetailsMotivation: Existing methods fail to capture complex temporal patterns in micro-video popularity dynamics, particularly overlooking mutually exciting/decaying user feedback interactions and relying on static content similarity rather than temporal popularity evolution patterns.

Method: Proposes M³TR with: 1) Mamba-Hawkes Process module to model user feedback as self-exciting events capturing long-range dependencies, 2) Temporal-aware retrieval engine combining multi-modal content (visual, audio, text) similarity with popularity trajectory similarity, 3) Feature augmentation using retrieved knowledge.
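The self-exciting behavior that the Mamba-Hawkes module models can be sketched with a plain exponential-kernel Hawkes intensity: each past feedback event bumps the arrival rate of new events, and the bump decays over time. The parameters mu, alpha, beta are illustrative, not the paper's.

```python
# Hawkes process intensity with exponential kernel:
# lambda(t) = mu + sum_{t_i < t} alpha * exp(-beta * (t - t_i))
import math

def hawkes_intensity(t, event_times, mu=0.5, alpha=0.8, beta=1.0):
    return mu + sum(alpha * math.exp(-beta * (t - ti))
                    for ti in event_times if ti < t)

events = [1.0, 1.2, 1.3]               # a burst of likes/comments/shares
print(hawkes_intensity(0.5, events))   # 0.5: baseline only, no past events yet
# Right after the burst the intensity is elevated; later it decays back:
print(hawkes_intensity(1.4, events) > hawkes_intensity(3.0, events))  # True
```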

Result: Achieves state-of-the-art performance on two real-world datasets, outperforming previous methods by up to 19.3% in nMSE with significant gains in long-term prediction challenges.

Conclusion: M³TR effectively addresses limitations of existing methods by synergizing temporal modeling with temporal-aware retrieval, providing comprehensive understanding for micro-video popularity prediction.

Abstract: Accurately predicting the popularity of micro-videos is a critical but challenging task, characterized by volatile, ‘rollercoaster-like’ engagement dynamics. Existing methods often fail to capture these complex temporal patterns, leading to inaccurate long-term forecasts. This failure stems from two fundamental limitations: (1) a superficial understanding of user feedback dynamics, which overlooks the mutually exciting and decaying nature of interactions such as likes, comments, and shares; and (2) retrieval mechanisms that rely solely on static content similarity, ignoring the crucial patterns of how a video’s popularity evolves over time. To address these limitations, we propose M³TR, a Temporal Retrieval enhanced Multi-Modal framework that uniquely synergizes fine-grained temporal modeling with a novel temporal-aware retrieval process for Micro-video popularity prediction. At its core, M³TR introduces a Mamba-Hawkes Process (MHP) module to explicitly model user feedback as a sequence of self-exciting events, capturing the intricate, long-range dependencies within user interactions (addressing limitation (1)). This rich temporal representation then powers a temporal-aware retrieval engine that identifies historically relevant videos based on a combined similarity of both their multi-modal content (visual, audio, text) and their popularity trajectories (addressing limitation (2)). By augmenting the target video’s features with this retrieved knowledge, M³TR achieves a comprehensive understanding for popularity prediction. Extensive experiments on two real-world datasets demonstrate the superiority of our framework. M³TR achieves state-of-the-art performance, outperforming previous methods by up to 19.3% in nMSE and showing significant gains in addressing long-term prediction challenges.

eess.AS

[522] An Empirical Analysis of Task-Induced Encoder Bias in Fréchet Audio Distance

Wonwoo Jeong

Main category: eess.AS

TL;DR: Analysis of Fréchet Audio Distance (FAD) reveals systematic biases in text-to-audio evaluation due to encoder training tasks, showing trade-offs between different encoders across recall, precision, and alignment dimensions.

DetailsMotivation: FAD is the standard for text-to-audio evaluation, but its scores depend on encoder embedding spaces. Different encoders are trained on different tasks (reconstruction, ASR, classification), causing systematic biases in what acoustic features they preserve or discard, making FAD scores incomparable across encoders.

Method: Decompose evaluation into Recall, Precision, and Alignment (with semantic and structural dimensions). Use log-scale normalization for fair cross-encoder comparison. Conduct controlled experiments on six encoders across two datasets to analyze trade-offs.

Result: Reveals four-axis trade-off: AudioMAE (reconstruction-based) leads precision sensitivity; Whisper (ASR-trained) dominates structural detection but is blind to signal degradation; VGGish (classification-trained) maximizes semantic detection but penalizes legitimate intra-class variation. No single encoder serves as universal evaluator.

Conclusion: Future audio evaluation metrics must shift toward evaluation-native encoders that are intrinsically aligned with human perception rather than relying on encoders trained for other tasks.

Abstract: Fréchet Audio Distance (FAD) is the de facto standard for evaluating text-to-audio generation, yet its scores depend on the underlying encoder’s embedding space. An encoder’s training task dictates which acoustic features are preserved or discarded, causing FAD to inherit systematic task-induced biases. We decompose evaluation into Recall, Precision, and Alignment (split into semantic and structural dimensions), using log-scale normalization for fair cross-encoder comparison. Controlled experiments on six encoders across two datasets reveal a four-axis trade-off: reconstruction-based AudioMAE leads precision sensitivity; ASR-trained Whisper dominates structural detection but is blind to signal degradation; classification-trained VGGish maximizes semantic detection but penalizes legitimate intra-class variation. Since no single encoder is a universal evaluator, future metrics must shift toward evaluation-native encoders intrinsically aligned with human perception.
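For reference, FAD itself is the Fréchet (2-Wasserstein) distance between Gaussians fitted to reference and generated embedding sets. A minimal sketch with random stand-in embeddings; the encoder choice, which is the paper's whole point, is abstracted away here:

```python
import numpy as np

def frechet_distance(emb_a, emb_b):
    """FAD = ||mu_a - mu_b||^2 + Tr(S_a + S_b - 2 (S_a S_b)^{1/2}),
    with mu, S the mean and covariance of each embedding set."""
    mu_a, mu_b = emb_a.mean(0), emb_b.mean(0)
    s_a = np.cov(emb_a, rowvar=False)
    s_b = np.cov(emb_b, rowvar=False)
    # Tr((S_a S_b)^{1/2}) via eigenvalues: the product of two PSD
    # matrices has non-negative real eigenvalues.
    eigvals = np.linalg.eigvals(s_a @ s_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0, None)).sum()
    diff = mu_a - mu_b
    return diff @ diff + np.trace(s_a) + np.trace(s_b) - 2 * trace_sqrt

rng = np.random.default_rng(0)
ref = rng.normal(0.0, 1.0, size=(512, 8))  # "reference" embeddings
gen = rng.normal(0.3, 1.0, size=(512, 8))  # "generated" embeddings, shifted
print(frechet_distance(ref, ref))          # ~0 for identical sets
print(frechet_distance(ref, gen))          # grows once the means diverge
```

Everything the paper critiques happens upstream of this formula: the encoder that produces `emb_a` and `emb_b` determines which acoustic differences the distance can see at all.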

[523] Discrete Optimal Transport and Voice Conversion

Anton Selitskiy, Maitreya Kocharekar

Main category: eess.AS

TL;DR: Voice conversion using discrete optimal transport for audio embedding alignment, with evaluation showing high-quality conversion and discovery of adversarial attack potential.

DetailsMotivation: Address voice conversion task using vector-based interface, aiming to align audio embeddings across speakers for effective voice transformation.

Method: Employ discrete optimal transport to align audio embeddings across speakers, approximate transport map using barycentric projection, extend previous work on kNN and OT averaging.

Result: The approach yields high-quality voice conversion; an ablation study examines the number of embeddings used; discrete OT as post-processing can cause synthetic speech to be misclassified as real, revealing a novel adversarial attack.

Conclusion: Discrete optimal transport is effective for voice conversion and reveals security vulnerabilities in audio generation systems through adversarial misclassification.

Abstract: In this work, we address the task of voice conversion (VC) using a vector-based interface. To align audio embeddings across speakers, we employ discrete optimal transport (OT) and approximate the transport map using the barycentric projection. Our evaluation demonstrates that this approach yields high-quality and effective voice conversion. We also perform an ablation study on the number of embeddings used, extending previous work on simple averaging of kNN and OT results. Additionally, we show that applying discrete OT as a post-processing step in audio generation can cause synthetic speech to be misclassified as real, revealing a novel and strong adversarial attack.
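The barycentric projection of a transport plan can be sketched as follows. This uses entropic (Sinkhorn) OT as a common, easy-to-implement stand-in for whatever discrete OT solver the paper employs, and the "speaker embeddings" are synthetic:

```python
import numpy as np

def sinkhorn_plan(a, b, cost, reg=0.1, iters=300):
    """Entropic OT plan between histograms a and b via Sinkhorn
    iterations; reg scales the entropy relative to the mean cost."""
    K = np.exp(-cost / (reg * cost.mean()))
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(1)
src = rng.normal(0, 1, size=(64, 4))   # "source speaker" embeddings
tgt = src + 2.0                        # "target speaker": a pure shift
cost = ((src[:, None, :] - tgt[None, :, :]) ** 2).sum(-1)
a = np.full(64, 1 / 64)                # uniform marginals
plan = sinkhorn_plan(a, a, cost)
# Barycentric projection: map each source point to the plan-weighted
# average of target points.
mapped = (plan @ tgt) / plan.sum(axis=1, keepdims=True)
print(np.abs(mapped - tgt).mean())     # compare against the raw shift of 2
```

The entropic regularizer blurs the map slightly, but the projected points land far closer to the target distribution than the untransformed sources.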

[524] Audio-Conditioned Diffusion LLMs for ASR and Deliberation Processing

Mengqi Wang, Zhan Liu, Zengrui Jin, Guangzhi Sun, Chao Zhang, Philip C. Woodland

Main category: eess.AS

TL;DR: Diffusion-based LLM LLaDA improves ASR accuracy when used as deliberation module with Whisper-LLaMA, achieving 12.3% relative WER reduction on LibriSpeech test-other.

DetailsMotivation: Explore diffusion-based large language models as an alternative to autoregressive decoders for automatic speech recognition, investigating their potential for improving accuracy through bidirectional attention and denoising capabilities.

Method: Use LLaDA diffusion model as external deliberation module for Whisper-LLaMA transcripts with random masking, low-confidence masking, and semi-autoregressive strategies; also evaluate as standalone ASR decoder with diffusion-based and semi-autoregressive decoding.

Result: Whisper-LLaDA cascade achieves 2.25%/4.94% WER on LibriSpeech test-clean/test-other (12.3% relative improvement over baseline); standalone decoder shows faster inference but slightly lower accuracy; audio-conditioned embeddings crucial for improvement.

Conclusion: Diffusion-based LLMs show promise for ASR, particularly as deliberation modules, with bidirectional attention and denoising capabilities offering accuracy improvements; audio features essential for success.

Abstract: Diffusion-based large language models (DLLMs) have recently attracted growing interest as an alternative to autoregressive decoders. In this work, we present an empirical study on using the diffusion-based large language model LLaDA for automatic speech recognition (ASR). We first investigate its use as an external deliberation-based processing module for Whisper-LLaMA transcripts. By leveraging the bidirectional attention and denoising capabilities of LLaDA, we explore random masking, low-confidence masking, and semi-autoregressive strategies, showing that Whisper-LLaDA substantially reduces WER compared with the baseline. On LibriSpeech, the best cascade system achieves 2.25%/4.94% WER on test-clean/test-other, representing a 12.3% relative improvement over the Whisper-LLaMA baseline on the test-other split. In contrast, a plain-text LLaDA without acoustic features fails to improve accuracy, highlighting the importance of audio-conditioned embeddings. We further evaluate Whisper-LLaDA as a standalone decoder for ASR with diffusion-based and semi-autoregressive decoding. Most experimental configurations achieve faster inference than the Whisper-LLaMA baseline, although recognition accuracy is slightly lower. These findings offer an empirical view of diffusion-based LLMs for ASR and point to promising directions for improvements. Code and model are open-sourced at https://github.com/liuzhan22/Diffusion-ASR.
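The low-confidence masking strategy mentioned above can be illustrated in isolation: keep high-confidence tokens, re-mask the least confident fraction, and let the denoiser revise them on the next iteration. A toy sketch (the mask id and ratio are hypothetical, not LLaDA's actual values):

```python
import numpy as np

MASK_ID = -1  # hypothetical mask-token id

def low_confidence_remask(tokens, confidences, ratio=0.3):
    """Re-mask the lowest-confidence fraction of positions so a
    diffusion-style denoiser can revise them next step."""
    tokens = tokens.copy()
    k = max(1, int(len(tokens) * ratio))
    worst = np.argsort(confidences)[:k]  # k least-confident positions
    tokens[worst] = MASK_ID
    return tokens, worst

toks = np.array([12, 7, 99, 3, 42, 8, 15, 63, 27, 5])
conf = np.array([0.9, 0.2, 0.8, 0.1, 0.95, 0.6, 0.3, 0.85, 0.7, 0.5])
new_toks, masked = low_confidence_remask(toks, conf, ratio=0.3)
print(sorted(masked.tolist()))  # positions handed back to the denoiser
```

Random masking would instead pick `worst` uniformly at random; the semi-autoregressive variant applies such masking block by block from left to right.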

[525] Resp-Agent: An Agent-Based System for Multimodal Respiratory Sound Generation and Disease Diagnosis

Pengfei Zhang, Tianxin Xie, Minghao Yang, Li Liu

Main category: eess.AS

TL;DR: Resp-Agent is a multimodal system for respiratory auscultation that tackles information loss from spectrogram conversion and data scarcity: an active adversarial curriculum agent orchestrates a modality-weaving diagnoser and a flow-matching generator in a closed loop.

DetailsMotivation: Current deep learning for respiratory auscultation suffers from information loss when converting signals to spectrograms (discarding transient acoustic events and clinical context) and limited data availability with severe class imbalance.

Method: Uses Active Adversarial Curriculum Agent (Thinker-A²CA) as central controller to identify diagnostic weaknesses and schedule targeted synthesis. Introduces modality-weaving Diagnoser that weaves clinical text with audio tokens via global attention and sparse audio anchors. Designs flow matching Generator that adapts text-only LLM via modality injection to synthesize hard-to-diagnose samples. Built on Resp-229k benchmark corpus of 229k recordings with LLM-distilled clinical narratives.

Result: Resp-Agent consistently outperforms prior approaches across diverse evaluation settings, improving diagnostic robustness under data scarcity and long-tailed class imbalance.

Conclusion: The proposed system effectively addresses fundamental challenges in respiratory auscultation through multimodal integration, active learning, and targeted data synthesis.

Abstract: Deep learning-based respiratory auscultation is currently hindered by two fundamental challenges: (i) inherent information loss, as converting signals into spectrograms discards transient acoustic events and clinical context; (ii) limited data availability, exacerbated by severe class imbalance. To bridge these gaps, we present Resp-Agent, an autonomous multimodal system orchestrated by a novel Active Adversarial Curriculum Agent (Thinker-A²CA). Unlike static pipelines, Thinker-A²CA serves as a central controller that actively identifies diagnostic weaknesses and schedules targeted synthesis in a closed loop. To address the representation gap, we introduce a modality-weaving Diagnoser that weaves clinical text with audio tokens via strategic global attention and sparse audio anchors, capturing both long-range clinical context and millisecond-level transients. To address the data gap, we design a flow matching Generator that adapts a text-only Large Language Model (LLM) via modality injection, decoupling pathological content from acoustic style to synthesize hard-to-diagnose samples. As a foundation for this work, we introduce Resp-229k, a benchmark corpus of 229k recordings paired with LLM-distilled clinical narratives. Extensive experiments demonstrate that Resp-Agent consistently outperforms prior approaches across diverse evaluation settings, improving diagnostic robustness under data scarcity and long-tailed class imbalance. Our code and data are available at https://github.com/zpforlove/Resp-Agent.

eess.IV

[526] SALIENT: Frequency-Aware Paired Diffusion for Controllable Long-Tail CT Detection

Yifan Li, Mehrdad Salimitari, Taiyu Zhang, Guang Li, David Dreizin

Main category: eess.IV

TL;DR: SALIENT: A mask-conditioned wavelet-domain diffusion framework for synthesizing paired lesion-masking volumes to improve rare lesion detection in whole-body CT scans under extreme class imbalance.

DetailsMotivation: Rare lesion detection in whole-body CT faces extreme class imbalance and low target-to-volume ratios, causing precision collapse despite high AUROC. Existing diffusion approaches are computationally expensive and lack controllable attribute-level regulation and paired supervision for accountable training.

Method: SALIENT uses mask-conditioned wavelet-domain diffusion that synthesizes paired lesion-masking volumes. Instead of pixel-space denoising, it performs structured diffusion over discrete wavelet coefficients, separating low-frequency brightness from high-frequency structural detail. Uses learnable frequency-aware objectives to disentangle target/background attributes, 3D VAE for volumetric lesion masks, and semi-supervised teacher for paired pseudo-labels.

Result: Improves generative realism with higher MS-SSIM (0.63 to 0.83) and lower FID (118.4 to 46.5). SALIENT-augmented training improves long-tail detection performance with disproportionate AUPRC gains across low prevalences and target-to-volume ratios. Optimal synthetic ratios shift from 2x to 4x as labeled seed size decreases.

Conclusion: Frequency-aware diffusion enables controllable, computationally efficient precision rescue in long-tail CT detection. SALIENT demonstrates effective synthetic augmentation for rare lesion detection under extreme class imbalance.

Abstract: Detection of rare lesions in whole-body CT is fundamentally limited by extreme class imbalance and low target-to-volume ratios, producing precision collapse despite high AUROC. Synthetic augmentation with diffusion models offers promise, yet pixel-space diffusion is computationally expensive, and existing mask-conditioned approaches lack controllable attribute-level regulation and paired supervision for accountable training. We introduce SALIENT, a mask-conditioned wavelet-domain diffusion framework that synthesizes paired lesion-masking volumes for controllable CT augmentation under long-tail regimes. Instead of denoising in pixel space, SALIENT performs structured diffusion over discrete wavelet coefficients, explicitly separating low-frequency brightness from high-frequency structural detail. Learnable frequency-aware objectives disentangle target and background attributes (structure, contrast, edge fidelity), enabling interpretable and stable optimization. A 3D VAE generates diverse volumetric lesion masks, and a semi-supervised teacher produces paired slice-level pseudo-labels for downstream mask-guided detection. SALIENT improves generative realism, as reflected by higher MS-SSIM (0.63 to 0.83) and lower FID (118.4 to 46.5). In a separate downstream evaluation, SALIENT-augmented training improves long-tail detection performance, yielding disproportionate AUPRC gains across low prevalences and target-to-volume ratios. Optimal synthetic ratios shift from 2x to 4x as labeled seed size decreases, indicating a seed-dependent augmentation regime under low-label conditions. SALIENT demonstrates that frequency-aware diffusion enables controllable, computationally efficient precision rescue in long-tail CT detection.
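The low-/high-frequency separation that wavelet-domain diffusion relies on can be seen in a single-level 2D Haar transform: the LL band carries low-frequency brightness while LH/HL/HH carry structural detail. A minimal sketch; SALIENT's actual wavelet choice and diffusion machinery are not reproduced here:

```python
import numpy as np

def haar2d(x):
    """One-level 2D Haar transform: LL is the low-frequency (brightness)
    band; LH, HL, HH hold horizontal / vertical / diagonal detail."""
    a = (x[0::2] + x[1::2]) / 2.0        # row averages
    d = (x[0::2] - x[1::2]) / 2.0        # row differences
    ll = (a[:, 0::2] + a[:, 1::2]) / 2.0
    lh = (a[:, 0::2] - a[:, 1::2]) / 2.0
    hl = (d[:, 0::2] + d[:, 1::2]) / 2.0
    hh = (d[:, 0::2] - d[:, 1::2]) / 2.0
    return ll, lh, hl, hh

flat = np.full((8, 8), 5.0)              # constant "brightness only" patch
ll, lh, hl, hh = haar2d(flat)
print(ll.mean(), np.abs(lh).max(), np.abs(hl).max(), np.abs(hh).max())
```

Diffusing over these coefficient bands separately is what lets a model regulate brightness and structural detail independently, which pixel-space denoising entangles.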

[527] SGDC: Structurally-Guided Dynamic Convolution for Medical Image Segmentation

Bo Shi, Wei-ping Zhu, M. N. S. Swamy

Main category: eess.IV

TL;DR: SGDC improves medical image segmentation by using structure-guided dynamic convolution instead of average pooling to preserve fine details and boundaries.

DetailsMotivation: Current dynamic convolution methods in medical segmentation use average pooling that collapses high-frequency spatial details, leading to over-smoothed predictions and poor boundary fidelity for clinical structures.

Method: Proposes Structure-Guided Dynamic Convolution (SGDC) with an explicitly supervised structure-extraction branch to guide dynamic kernel generation and gating signals for structure-aware feature modulation, fusing high-fidelity boundary information with semantic features.

Result: Achieves SOTA on ISIC 2016, PH2, ISIC 2018, and CoNIC datasets, reducing Hausdorff Distance (HD95) by 2.05 and providing 0.99%-1.49% IoU gains over pooling-based baselines.

Conclusion: SGDC effectively prevents information loss from average pooling, improves boundary fidelity, and has potential for extension to other fine-grained vision tasks like small-object detection.

Abstract: Spatially variant dynamic convolution provides a principled approach to integrating spatial adaptivity into deep neural networks. However, mainstream designs in medical segmentation commonly generate dynamic kernels through average pooling, which implicitly collapses high-frequency spatial details into a coarse, spatially-compressed representation, leading to over-smoothed predictions that degrade the fidelity of fine-grained clinical structures. To address this limitation, we propose a novel Structure-Guided Dynamic Convolution (SGDC) mechanism, which leverages an explicitly supervised structure-extraction branch to guide the generation of dynamic kernels and gating signals for structure-aware feature modulation. Specifically, the high-fidelity boundary information from this auxiliary branch is fused with semantic features to enable spatially-precise feature modulation. By replacing context aggregation with pixel-wise structural guidance, the proposed design effectively prevents the information loss introduced by average pooling. Experimental results show that SGDC achieves state-of-the-art performance on ISIC 2016, PH2, ISIC 2018, and CoNIC datasets, delivering superior boundary fidelity by reducing the Hausdorff Distance (HD95) by 2.05, and providing consistent IoU gains of 0.99%-1.49% over pooling-based baselines. Moreover, the mechanism exhibits strong potential for extension to other fine-grained, structure-sensitive vision tasks, such as small-object detection, offering a principled solution for preserving structural integrity in medical image analysis. To facilitate reproducibility and encourage further research, the implementation code for both our SGE and SGDC modules has been publicly released at https://github.com/solstice0621/SGDC.
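HD95, the boundary metric reported above, is the 95th-percentile symmetric Hausdorff distance between two boundary point sets; a small sketch on synthetic contours:

```python
import numpy as np

def hd95(pts_a, pts_b):
    """95th-percentile symmetric Hausdorff distance between point sets:
    for each direction, take the 95th percentile of nearest-neighbour
    distances, then the max over both directions."""
    d = np.linalg.norm(pts_a[:, None, :] - pts_b[None, :, :], axis=-1)
    a_to_b = np.percentile(d.min(axis=1), 95)  # each a-point to nearest b
    b_to_a = np.percentile(d.min(axis=0), 95)
    return max(a_to_b, b_to_a)

# Boundary of a 10x10 square, and the same contour shifted right by 1 px.
square = np.array([[x, y] for x in range(10) for y in (0, 9)] +
                  [[x, y] for x in (0, 9) for y in range(10)], float)
shifted = square + np.array([1.0, 0.0])
print(hd95(square, square))   # identical contours
print(hd95(square, shifted))  # 1-pixel boundary displacement
```

The percentile (rather than the max) is what makes HD95 robust to a few stray boundary pixels while still punishing systematic boundary error.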

[528] SegReg: Latent Space Regularization for Improved Medical Image Segmentation

Puru Vaish, Amin Ranem, Felix Meister, Tobias Heimann, Christoph Brune, Jelmer M. Wolterink

Main category: eess.IV

TL;DR: SegReg: A latent-space regularization framework for medical image segmentation that improves domain generalization and continual learning by constraining feature representations in U-Net models.

DetailsMotivation: Current medical image segmentation models use voxel-wise losses that only constrain predictions in output space, leaving latent feature representations largely unconstrained, which limits generalization capabilities.

Method: Proposes SegReg, a latent-space regularization framework that operates on feature maps of U-Net models to encourage structured embeddings while remaining fully compatible with standard segmentation losses. Integrated with nnU-Net framework.

Result: Demonstrated consistent improvements in domain generalization on prostate, cardiac, and hippocampus segmentation tasks. Also showed that explicit latent regularization improves continual learning by reducing task drift and enhancing forward transfer across sequential tasks without adding memory or extra parameters.

Conclusion: Latent-space regularization is a practical approach for building more generalizable and continual-learning-ready medical image segmentation models.

Abstract: Medical image segmentation models are typically optimised with voxel-wise losses that constrain predictions only in the output space. This leaves latent feature representations largely unconstrained, potentially limiting generalisation. We propose {SegReg}, a latent-space regularisation framework that operates on feature maps of U-Net models to encourage structured embeddings while remaining fully compatible with standard segmentation losses. Integrated with the nnU-Net framework, we evaluate SegReg on prostate, cardiac, and hippocampus segmentation and demonstrate consistent improvements in domain generalisation. Furthermore, we show that explicit latent regularisation improves continual learning by reducing task drift and enhancing forward transfer across sequential tasks without adding memory or any extra parameters. These results highlight latent-space regularisation as a practical approach for building more generalisable and continual-learning-ready models.

[529] Few-Shot Continual Learning for 3D Brain MRI with Frozen Foundation Models

Chi-Sheng Chen, Xinyu Zhang, Guan-Ying Chen, Qiuzhe Xie, Fan Zhang, En-Jui Kuo

Main category: eess.IV

TL;DR: Frozen foundation models with task-specific LoRA adapters enable few-shot continual learning for 3D medical imaging without catastrophic forgetting.

DetailsMotivation: Foundation models pretrained on large-scale 3D medical imaging data face challenges when adapted to multiple downstream tasks under continual learning with limited labeled data, particularly avoiding catastrophic forgetting while maintaining performance across tasks.

Method: Combine frozen pretrained backbone with task-specific Low-Rank Adaptation (LoRA) modules for few-shot continual learning. Tasks arrive sequentially (tumor segmentation and brain age estimation) with no replay of previous task data. Each task gets dedicated LoRA adapter; only adapter and task-specific head are trained while backbone remains frozen.

Result: LoRA approach achieves best balanced performance: T1 Dice 0.62±0.07, T2 MAE 0.16±0.05, with zero forgetting and <0.1% trainable parameters per task. Sequential full fine-tuning suffers severe forgetting (T1 Dice drops from 0.80 to 0.16), while sequential linear probing achieves strong T1 but fails on T2.

Conclusion: Frozen foundation models with task-specific LoRA adapters offer practical solution for few-shot continual learning when both tasks must be maintained, eliminating catastrophic forgetting by design while maintaining strong performance across tasks.

Abstract: Foundation models pretrained on large-scale 3D medical imaging data face challenges when adapted to multiple downstream tasks under continual learning with limited labeled data. We address few-shot continual learning for 3D brain MRI by combining a frozen pretrained backbone with task-specific Low-Rank Adaptation (LoRA) modules. Tasks arrive sequentially – tumor segmentation (BraTS) and brain age estimation (IXI) – with no replay of previous task data. Each task receives a dedicated LoRA adapter; only the adapter and task-specific head are trained while the backbone remains frozen, thereby eliminating catastrophic forgetting by design (BWT=0). In continual learning, sequential full fine-tuning suffers severe forgetting (T1 Dice drops from 0.80 to 0.16 after T2), while sequential linear probing achieves strong T1 (Dice 0.79) but fails on T2 (MAE 1.45). Our LoRA approach achieves the best balanced performance across both tasks: T1 Dice 0.62±0.07, T2 MAE 0.16±0.05, with zero forgetting and <0.1% trainable parameters per task, though with noted systematic age underestimation in T2 (Wilcoxon p < 0.001). Frozen foundation models with task-specific LoRA adapters thus offer a practical solution when both tasks must be maintained under few-shot continual learning.
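The frozen-backbone-plus-LoRA recipe can be sketched for a single linear layer: the pretrained weight stays fixed and only a rank-r update B @ A is trainable, with B zero-initialized so each adapter starts as an exact no-op. Dimensions below are toy values, not the paper's model:

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update B @ A.
    Only A and B are trained per task; W is shared and frozen, so
    earlier tasks are untouched by construction."""
    def __init__(self, w, rank=4, seed=0):
        rng = np.random.default_rng(seed)
        self.w = w                                        # frozen, d_out x d_in
        self.a = rng.normal(0, 0.01, (rank, w.shape[1]))  # trainable
        self.b = np.zeros((w.shape[0], rank))             # trainable, zero init

    def __call__(self, x):
        return x @ (self.w + self.b @ self.a).T

d_in, d_out = 16, 8
w = np.random.default_rng(42).normal(size=(d_out, d_in))
layer = LoRALinear(w, rank=2)
x = np.ones((1, d_in))
base = x @ w.T
print(np.allclose(layer(x), base))      # zero-init B => exact no-op at start
n_adapter = layer.a.size + layer.b.size
print(n_adapter, w.size)                # adapter params vs frozen params
```

At these toy widths the saving is modest, but at realistic layer sizes the rank-r adapter is a vanishingly small fraction of the frozen weights, matching the paper's <0.1% trainable parameters per task.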

[530] Hierarchical Multi-Scale Graph Learning with Knowledge-Guided Attention for Whole-Slide Image Survival Analysis

Bin Xu, Yufei Zhou, Boling Song, Jingwen Sun, Yang Bian, Cheng Lu, Ye Wu, Jianfei Tu, Xiangxue Wang

Main category: eess.IV

TL;DR: HMKGN is a hierarchical graph network for cancer prognosis from whole-slide images that models multi-scale spatial relationships through dynamic graphs at cellular and slide levels.

DetailsMotivation: Current attention-based MIL methods ignore spatial organization in WSIs, while graph-based MIL uses static handcrafted graphs, limiting their ability to capture hierarchical spatial relationships crucial for cancer prognostication.

Method: Hierarchical Multi-scale Knowledge-aware Graph Network (HMKGN) with two-level dynamic graphs: local cellular-level graphs aggregate spatially proximate patches within ROIs, and global slide-level graph integrates ROI features. Multi-scale integration combines coarse contextual features with fine-grained structural representations.

Result: Outperforms existing MIL-based models on four TCGA cohorts (KIRC, LGG, PAAD, STAD) with 10.85% better concordance indices and statistically significant patient survival risk stratification (log-rank p < 0.05).

Conclusion: HMKGN effectively models hierarchical spatial relationships in WSIs for improved cancer prognosis, demonstrating the importance of capturing multi-scale spatial organization in computational pathology.

Abstract: We propose a Hierarchical Multi-scale Knowledge-aware Graph Network (HMKGN) that models multi-scale interactions and spatially hierarchical relationships within whole-slide images (WSIs) for cancer prognostication. Unlike conventional attention-based MIL, which ignores spatial organization, or graph-based MIL, which relies on static handcrafted graphs, HMKGN enforces a hierarchical structure with spatial locality constraints, wherein local cellular-level dynamic graphs aggregate spatially proximate patches within each region of interest (ROI) and a global slide-level dynamic graph integrates ROI-level features into WSI-level representations. Moreover, multi-scale integration at the ROI level combines coarse contextual features from broader views with fine-grained structural representations from local patch-graph aggregation. We evaluate HMKGN on four TCGA cohorts (KIRC, LGG, PAAD, and STAD; N=513, 487, 138, and 370) for survival prediction. It consistently outperforms existing MIL-based models, yielding improved concordance indices (10.85% better) and statistically significant stratification of patient survival risk (log-rank p < 0.05).
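The concordance index used for evaluation measures how often the predicted risk correctly orders comparable patient pairs; a plain sketch:

```python
import numpy as np

def concordance_index(times, events, risks):
    """Fraction of comparable pairs the risk score orders correctly.
    A pair (i, j) is comparable when the earlier time is an observed
    event; higher risk should predict the earlier event. Ties in
    risk count as half-concordant."""
    num = den = 0.0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i]:
                den += 1
                if risks[i] > risks[j]:
                    num += 1
                elif risks[i] == risks[j]:
                    num += 0.5
    return num / den

times  = np.array([2.0, 4.0, 6.0, 8.0])
events = np.array([1, 1, 0, 1])          # 0 marks a censored patient
perfect = np.array([4.0, 3.0, 2.0, 1.0])  # risk monotone with hazard
print(concordance_index(times, events, perfect))
print(concordance_index(times, events, -perfect))
```

A value of 1.0 is perfect ranking, 0.5 is chance; censored patients (event = 0) only ever appear as the later member of a pair.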

[531] Unsupervised Causal Prototypical Networks for De-biased Interpretable Dermoscopy Diagnosis

Junhao Jia, Yueyi Wu, Huangwei Chen, Haodong Jing, Haishuai Wang, Jiajun Bu, Lei Wu

Main category: eess.IV

TL;DR: CausalProto: An unsupervised causal prototypical network that disentangles pathological features from environmental confounders in dermoscopy images to provide transparent, high-purity visual interpretability without compromising diagnostic accuracy.

DetailsMotivation: Deep learning in dermoscopy image analysis suffers from black-box nature that hinders clinical trust. Prototypical networks offer visual transparency but suffer from selection bias in clinical data, leading to shortcut learning where environmental confounders are erroneously encoded as predictive prototypes, generating spurious visual evidence that misleads medical decision-making.

Method: Proposes CausalProto, an Unsupervised Causal Prototypical Network framed within a Structural Causal Model. Uses an Information Bottleneck-constrained encoder to enforce strict unsupervised orthogonal disentanglement between pathological features and environmental confounders. Maps decoupled representations into independent prototypical spaces, leverages learned spurious dictionary to perform backdoor adjustment via do-calculus, transforming complex causal interventions into efficient expectation pooling to marginalize environmental noise.

Result: Extensive experiments on multiple dermoscopy datasets demonstrate that CausalProto achieves superior diagnostic performance and consistently outperforms standard black box models, while simultaneously providing transparent and high purity visual interpretability without suffering from the traditional accuracy compromise.

Conclusion: CausalProto effectively addresses the confounding effects in clinical data by purifying the visual evidence chain through causal disentanglement, offering both improved diagnostic performance and trustworthy visual interpretability for medical decision-making.

Abstract: Despite the success of deep learning in dermoscopy image analysis, its inherent black-box nature hinders clinical trust, motivating the use of prototypical networks for case-based visual transparency. However, inevitable selection bias in clinical data often drives these models toward shortcut learning, where environmental confounders are erroneously encoded as predictive prototypes, generating spurious visual evidence that misleads medical decision-making. To mitigate these confounding effects, we propose CausalProto, an Unsupervised Causal Prototypical Network that fundamentally purifies the visual evidence chain. Framed within a Structural Causal Model, we employ an Information Bottleneck-constrained encoder to enforce strict unsupervised orthogonal disentanglement between pathological features and environmental confounders. By mapping these decoupled representations into independent prototypical spaces, we leverage the learned spurious dictionary to perform backdoor adjustment via do-calculus, transforming complex causal interventions into efficient expectation pooling to marginalize environmental noise. Extensive experiments on multiple dermoscopy datasets demonstrate that CausalProto achieves superior diagnostic performance and consistently outperforms standard black-box models, while simultaneously providing transparent and high-purity visual interpretability without suffering from the traditional accuracy compromise.
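Backdoor adjustment, the causal step described above, replaces the confounded p(c|x) with the marginal p(c) when averaging over confounder states: p(y|do(x)) = Σ_c p(y|x,c) p(c). A toy numerical sketch (all probabilities invented for illustration):

```python
import numpy as np

# Hypothetical binary confounder c (e.g. an imaging artefact).
p_c = np.array([0.5, 0.5])               # true marginal over c
# p(y=1 | x, c) for two inputs x and the two confounder states
p_y_given_xc = np.array([[0.9, 0.3],     # x = 0
                         [0.8, 0.2]])    # x = 1
# Selection bias: in the observed data, c co-occurs strongly with x.
p_c_given_x = np.array([[0.9, 0.1],
                        [0.1, 0.9]])

observational = (p_y_given_xc * p_c_given_x).sum(axis=1)  # p(y=1 | x)
interventional = p_y_given_xc @ p_c                       # p(y=1 | do(x))
print(observational)    # confounded comparison of x=0 vs x=1
print(interventional)   # backdoor-adjusted comparison
```

The observational gap between the two inputs is inflated by the confounder; after adjustment (the "expectation pooling" over the spurious dictionary in the paper's terms), only the genuine effect of x remains.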

[532] VideoPulse: Neonatal heart rate and peripheral capillary oxygen saturation (SpO2) estimation from contact free video

Deependra Dewagiri, Kamesh Anuradha, Pabadhi Liyanage, Helitha Kulatunga, Pamuditha Somarathne, Udaya S. K. P. Miriya Thanthrige, Nishani Lucas, Anusha Withana, Joshua P. Kulasingham

Main category: eess.IV

TL;DR: VideoPulse: A neonatal dataset and pipeline for contact-free heart rate and SpO2 estimation from facial video using rPPG techniques with 3D CNNs.

DetailsMotivation: Neonates require non-invasive vital sign monitoring since conventional adhesive probes can irritate fragile skin and increase infection risk. Remote photoplethysmography (rPPG) offers contact-free monitoring from facial video.

Method: Face alignment and artifact-aware supervision using denoised pulse oximeter signals, followed by 3D CNN backbones for heart rate and SpO2 regression with label distribution smoothing and weighted regression for SpO2. Predictions in 2-second windows.

Result: Heart rate MAE 2.97 bpm (2-second windows) and SpO2 MAE 1.69% on NBHR dataset. Cross-dataset evaluation shows 5.34 bpm MAE on VideoPulse, and fine-tuned SpO2 model achieves 1.68% MAE.

Conclusion: Short unaligned neonatal video segments can support accurate heart rate and SpO2 estimation, enabling low-cost non-invasive monitoring in neonatal intensive care.

Abstract: Remote photoplethysmography (rPPG) enables contact-free monitoring of vital signs and is especially valuable for neonates, since conventional methods often require sustained skin contact with adhesive probes that can irritate fragile skin and increase infection control burden. We present VideoPulse, a neonatal dataset and an end-to-end pipeline that estimates neonatal heart rate and peripheral capillary oxygen saturation (SpO2) from facial video. VideoPulse contains 157 recordings totaling 2.6 hours from 52 neonates with diverse face orientations. Our pipeline performs face alignment and artifact-aware supervision using denoised pulse oximeter signals, then applies 3D CNN backbones for heart rate and SpO2 regression with label distribution smoothing and weighted regression for SpO2. Predictions are produced in 2-second windows. On the NBHR neonatal dataset, we obtain a heart rate MAE of 2.97 bpm using 2-second windows (2.80 bpm at 6-second windows) and an SpO2 MAE of 1.69%. Under cross-dataset evaluation, the NBHR-trained heart rate model attains 5.34 bpm MAE on VideoPulse, and fine-tuning an NBHR-pretrained SpO2 model on VideoPulse yields an MAE of 1.68%. These results indicate that short, unaligned neonatal video segments can support accurate heart rate and SpO2 estimation, enabling low-cost, non-invasive monitoring in neonatal intensive care.
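Label distribution smoothing, used here for the imbalanced SpO2 regression, smooths the empirical label histogram with a Gaussian kernel and weights samples by inverse effective density, so rare desaturation values are not drowned out by the dominant healthy range. A sketch on synthetic SpO2-like labels (binning and kernel width are illustrative, not the paper's settings):

```python
import numpy as np

def lds_weights(labels, bins=10, sigma=1.0):
    """Label-distribution-smoothing weights: smooth the label histogram
    with a Gaussian kernel over bins, then weight each sample by the
    inverse of its smoothed (effective) bin density."""
    hist, edges = np.histogram(labels, bins=bins)
    centers = np.arange(bins)
    kernel = np.exp(-0.5 * ((centers[:, None] - centers[None, :]) / sigma) ** 2)
    eff = kernel @ hist / kernel.sum(1)          # effective density per bin
    idx = np.clip(np.digitize(labels, edges[1:-1]), 0, bins - 1)
    w = 1.0 / np.maximum(eff[idx], 1e-8)
    return w * len(w) / w.sum()                  # normalise to mean 1

rng = np.random.default_rng(0)
spo2 = np.concatenate([rng.uniform(97, 100, 900),   # common healthy range
                       rng.uniform(85, 90, 30)])    # rare desaturations
w = lds_weights(spo2, bins=15, sigma=2.0)
# Rare low-SpO2 samples should receive larger weights than common ones.
print(w[-1], w[0])
```

These weights then multiply the per-sample regression loss, which is the "weighted regression for SpO2" the abstract refers to.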

[533] Breaking the Data Barrier: Robust Few-Shot 3D Vessel Segmentation using Foundation Models

Kirato Yoshihara, Yohei Sugawara, Yuta Tokuoka, Lihang Hong

Main category: eess.IV

TL;DR: A novel framework using DINOv3 vision foundation model adapted for volumetric vessel segmentation with 3D adapters and multi-scale aggregation, achieving significant improvements in few-shot and out-of-distribution settings.

DetailsMotivation: Vessel segmentation methods require large annotated datasets and suffer from domain shifts, but acquiring extensive annotations for every new scanner/protocol is unfeasible in clinical practice.

Method: Leverages pre-trained Vision Foundation Model (DINOv3) adapted for volumetric vessel segmentation with lightweight 3D Adapter for volumetric consistency, multi-scale 3D Aggregator for hierarchical feature fusion, and Z-channel embedding to bridge 2D pre-training and 3D medical modalities.

Result: In extreme few-shot regime (5 training samples), achieved Dice score of 43.42% (30% relative improvement over nnU-Net’s 33.41%). In out-of-distribution setting, achieved 50% relative improvement over nnU-Net (21.37% vs. 14.22%). Outperformed other Transformer-based baselines by up to 45%.

Conclusion: Foundation models offer a viable cold-start solution for vessel segmentation, improving clinical reliability under data scarcity or domain shifts. The 3D adaptation mechanism and multi-scale aggregation strategy are critical for vascular continuity and robustness.

Abstract: State-of-the-art vessel segmentation methods typically require large-scale annotated datasets and suffer from severe performance degradation under domain shifts. In clinical practice, however, acquiring extensive annotations for every new scanner or protocol is unfeasible. To address this, we propose a novel framework leveraging a pre-trained Vision Foundation Model (DINOv3) adapted for volumetric vessel segmentation. We introduce a lightweight 3D Adapter for volumetric consistency, a multi-scale 3D Aggregator for hierarchical feature fusion, and Z-channel embedding to effectively bridge the gap between 2D pre-training and 3D medical modalities, enabling the model to capture continuous vascular structures from limited data. We validated our method on the TopCoW (in-domain) and Lausanne (out-of-distribution) datasets. In the extreme few-shot regime with 5 training samples, our method achieved a Dice score of 43.42%, marking a 30% relative improvement over the state-of-the-art nnU-Net (33.41%) and outperforming other Transformer-based baselines, such as SwinUNETR and UNETR, by up to 45%. Furthermore, in the out-of-distribution setting, our model demonstrated superior robustness, achieving a 50% relative improvement over nnU-Net (21.37% vs. 14.22%), which suffered from severe domain overfitting. Ablation studies confirmed that our 3D adaptation mechanism and multi-scale aggregation strategy are critical for vascular continuity and robustness. Our results suggest foundation models offer a viable cold-start solution, improving clinical reliability under data scarcity or domain shifts.
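The Z-channel embedding idea, bridging a 2D-pretrained encoder and 3D volumes, can be sketched minimally. Everything below (tensor sizes, a transformer-style sinusoidal code) is an illustrative assumption, not the paper's implementation:

```python
import numpy as np

def z_channel_embed(feats: np.ndarray) -> np.ndarray:
    """feats: (D, C, H, W) per-slice features from a frozen 2D encoder.

    Adds a sinusoidal depth code along the channel axis so downstream 3D
    layers can distinguish slice positions (a stand-in for the paper's
    Z-channel embedding).
    """
    D, C, H, W = feats.shape
    k = np.arange(C)                                   # channel index
    z = np.arange(D)[:, None]                          # depth index
    freq = 1.0 / (10000.0 ** (2 * (k // 2) / C))       # transformer-style frequencies
    code = np.where(k % 2 == 0, np.sin(z * freq), np.cos(z * freq))  # (D, C)
    return feats + code[:, :, None, None]              # broadcast over H, W

feats = np.zeros((4, 8, 2, 2))
out = z_channel_embed(feats)
assert out.shape == feats.shape
assert not np.allclose(out[0], out[1])                 # slices now distinguishable
```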

[534] FluoCLIP: Stain-Aware Focus Quality Assessment in Fluorescence Microscopy

Hyejin Park, Jiwon Yoon, Sumin Park, Suree Kim, Sinae Jang, Eunsoo Lee, Dongmin Kang, Dongbo Min

Main category: eess.IV

TL;DR: Proposes stain-aware focus quality assessment for fluorescence microscopy using FluoMix dataset and FluoCLIP framework that leverages CLIP for stain-specific focus ranking.

Motivation: Existing focus quality assessment methods treat focus quality as stain-agnostic, ignoring that fluorescent dyes cause heterogeneous focus shifts; this motivates stain-aware modeling in fluorescence microscopy.

Method: Formulates stain-aware FQA task, curates FluoMix dataset with multiple tissues/stains/focus variations, and proposes FluoCLIP - a two-stage vision-language framework using CLIP alignment with stain tokens and stain-specific rank prompts.

Result: FluoCLIP achieves strong generalization across diverse fluorescence microscopy conditions, establishing first foundation for stain-aware focus quality assessment.

Conclusion: Focus behavior in fluorescence microscopy must be modeled as function of staining characteristics, and proposed stain-aware framework with vision-language alignment provides effective solution.

Abstract: Accurate focus quality assessment (FQA) in fluorescence microscopy remains challenging, as the stain-dependent optical properties of fluorescent dyes cause abrupt and heterogeneous focus shifts. However, existing datasets and models overlook this variability, treating focus quality as a stain-agnostic problem. In this work, we formulate the task of stain-aware FQA, emphasizing that focus behavior in fluorescence microscopy must be modeled as a function of staining characteristics. Through quantitative analysis of existing datasets (FocusPath, BBBC006) and our newly curated FluoMix, we demonstrate that focus-rank relationships vary substantially across stains, underscoring the need for stain-aware modeling in fluorescence microscopy. To support this new formulation, we propose FluoMix, the first dataset for stain-aware FQA that encompasses multiple tissues, fluorescent stains, and focus variations. Building on this dataset, we propose FluoCLIP, a two-stage vision-language framework that leverages CLIP’s alignment capability to interpret focus quality in the context of biological staining. In the stain-grounding phase, FluoCLIP learns general stain representations by aligning textual stain tokens with visual features, while in the stain-guided ranking phase, it optimizes stain-specific rank prompts for ordinal focus prediction. Together, our formulation, dataset, and framework establish the first foundation for stain-aware FQA, and FluoCLIP achieves strong generalization across diverse fluorescence microscopy conditions.
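The stain-specific rank-prompt mechanism can be illustrated with a toy version. The embeddings below are orthonormal stand-ins for CLIP-encoded rank prompts, not FluoCLIP's learned prompts:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_RANKS = 16, 5

# Stand-ins for encoded rank prompts of one stain (FluoCLIP learns these
# via its stain-guided ranking phase on top of CLIP's text encoder).
rank_prompts = np.eye(N_RANKS, DIM)

def predict_rank(img_emb: np.ndarray, prompts: np.ndarray) -> int:
    """Ordinal focus rank = prompt with highest cosine similarity."""
    sims = prompts @ img_emb
    sims = sims / (np.linalg.norm(prompts, axis=1) * np.linalg.norm(img_emb))
    return int(np.argmax(sims))

img_emb = rank_prompts[3] + 0.01 * rng.normal(size=DIM)  # image near rank-3 prompt
print(predict_rank(img_emb, rank_prompts))               # → 3
```

The design choice worth noting: ranking via similarity to per-rank prompts keeps the image encoder shared across stains, while stain tokens condition which prompt set is used.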

[535] BiM-GeoAttn-Net: Linear-Time Depth Modeling with Geometry-Aware Attention for 3D Aortic Dissection CTA Segmentation

Yuan Zhang, Lei Liu, Jialin Zhang, Ya-Nan Zhang, Ling Wang, Nan Mu

Main category: eess.IV

TL;DR: BiM-GeoAttn-Net: A lightweight 3D segmentation framework for aortic dissection using bidirectional depth mamba for cross-slice dependencies and geometry-aware attention for vessel refinement.

Motivation: Accurate 3D segmentation of aortic dissection lumens in CT angiography is crucial for clinical assessment but challenging due to limited long-range context modeling (affecting inter-slice coherence) and insufficient structural discrimination under low-contrast conditions.

Method: Proposes BiM-GeoAttn-Net integrating linear-time depth-wise state-space modeling with geometry-aware vessel refinement. Uses Bidirectional Depth Mamba (BiM) to capture cross-slice dependencies efficiently and Geometry-Aware Vessel Attention (GeoAttn) module with orientation-sensitive anisotropic filtering to refine tubular structures and sharpen boundaries.

Result: Achieves Dice score of 93.35% and HD95 of 12.36 mm on multi-source AD CTA dataset, outperforming CNN-, Transformer-, and SSM-based baselines in overlap metrics while maintaining competitive boundary accuracy.

Conclusion: Coupling linear-time depth modeling with geometry-aware refinement provides an effective, computationally efficient solution for robust 3D AD segmentation.

Abstract: Accurate segmentation of aortic dissection (AD) lumens in CT angiography (CTA) is essential for quantitative morphological assessment and clinical decision-making. However, reliable 3D delineation remains challenging due to limited long-range context modeling, which compromises inter-slice coherence, and insufficient structural discrimination under low-contrast conditions. To address these limitations, we propose BiM-GeoAttn-Net, a lightweight framework that integrates linear-time depth-wise state-space modeling with geometry-aware vessel refinement. Our approach is featured by Bidirectional Depth Mamba (BiM) to efficiently capture cross-slice dependencies and Geometry-Aware Vessel Attention (GeoAttn) module that employs orientation-sensitive anisotropic filtering to refine tubular structures and sharpen ambiguous boundaries. Extensive experiments on a multi-source AD CTA dataset demonstrate that BiM-GeoAttn-Net achieves a Dice score of 93.35% and an HD95 of 12.36 mm, outperforming representative CNN-, Transformer-, and SSM-based baselines in overlap metrics while maintaining competitive boundary accuracy. These results suggest that coupling linear-time depth modeling with geometry-aware refinement provides an effective, computationally efficient solution for robust 3D AD segmentation.
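The linear-time, bidirectional cross-slice mixing pattern behind BiM can be sketched with a fixed exponential moving average along depth. A real Mamba block uses learned, input-dependent state-space parameters, so this is only a structural analogy showing why cost is O(D) rather than O(D^2):

```python
import numpy as np

def bidirectional_depth_scan(x: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """x: (D, F) per-slice features; returns features mixed along depth."""
    fwd = np.empty_like(x)
    bwd = np.empty_like(x)
    fwd[0] = x[0]
    for d in range(1, len(x)):                 # forward recurrence, O(D)
        fwd[d] = alpha * fwd[d - 1] + (1 - alpha) * x[d]
    bwd[-1] = x[-1]
    for d in range(len(x) - 2, -1, -1):        # backward recurrence, O(D)
        bwd[d] = alpha * bwd[d + 1] + (1 - alpha) * x[d]
    return 0.5 * (fwd + bwd)                   # fuse both scan directions

x = np.zeros((8, 4))
x[4] = 1.0                                     # a feature present on one slice only
y = bidirectional_depth_scan(x)
assert y[3, 0] > 0 and y[5, 0] > 0             # context propagates to both neighbors
```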

[536] Revisiting Integration of Image and Metadata for DICOM Series Classification: Cross-Attention and Dictionary Learning

Tuan Truong, Melanie Dohmen, Sara Lorio, Matthias Lenga

Main category: eess.IV

TL;DR: Multimodal framework for DICOM series classification using joint image-metadata modeling with cross-modal attention and missingness-aware metadata encoding

Motivation: DICOM series classification is challenging due to heterogeneous slice content, variable series length, and inconsistent/incomplete DICOM metadata, which hinders large-scale medical image analysis

Method: End-to-end multimodal framework with: 1) modality-aware image/metadata encoders, 2) bi-directional cross-modal attention for fusion, 3) sparse missingness-aware metadata encoder using learnable feature dictionaries without imputation, 4) 2.5D visual encoder handling variable series length

Result: Outperforms image-only, metadata-only, and multimodal 2D/3D baselines on Duke Liver MRI dataset and multi-institutional cohort, showing improved robustness and generalization

Conclusion: Explicitly modeling metadata sparsity and cross-modal interactions improves DICOM series classification robustness, enabling better medical image analysis pipelines

Abstract: Automated identification of DICOM image series is essential for large-scale medical image analysis, quality control, protocol harmonization, and reliable downstream processing. However, DICOM series classification remains challenging due to heterogeneous slice content, variable series length, and entirely missing, incomplete or inconsistent DICOM metadata. We propose an end-to-end multimodal framework for DICOM series classification that jointly models image content and acquisition metadata while explicitly accounting for all these challenges. (i) Images and metadata are encoded with modality-aware modules and fused using a bi-directional cross-modal attention mechanism. (ii) Metadata is processed by a sparse, missingness-aware encoder based on learnable feature dictionaries and value-conditioned modulation. By design, the approach does not require any form of imputation. (iii) Variability in series length and image data dimensions is handled via a 2.5D visual encoder and attention operating on equidistantly sampled slices. We evaluate the proposed approach on the publicly available Duke Liver MRI dataset and a large multi-institutional in-house cohort, assessing both in-domain performance and out-of-domain generalization. Across all evaluation settings, the proposed method consistently outperforms relevant image only, metadata-only and multimodal 2D/3D baselines. The results demonstrate that explicitly modeling metadata sparsity and cross-modal interactions improves robustness for DICOM series classification.
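The imputation-free, missingness-aware encoding can be sketched as follows. Field names, dimensions, and the simple value-scaling rule are illustrative assumptions; the paper's value-conditioned modulation is more elaborate than this:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8
FIELDS = ["EchoTime", "RepetitionTime", "FlipAngle"]     # illustrative DICOM attributes

# Per-field learnable dictionaries: one vector used when the value is present,
# one dedicated "missing" token used when the tag is absent.
field_emb = {f: rng.normal(size=DIM) for f in FIELDS}
missing_emb = {f: rng.normal(size=DIM) for f in FIELDS}

def encode_metadata(record: dict) -> np.ndarray:
    parts = []
    for f in FIELDS:
        if record.get(f) is None:                        # absent tag: no imputation
            parts.append(missing_emb[f])
        else:                                            # present tag: value-modulated embedding
            parts.append(float(record[f]) * field_emb[f])
    return np.concatenate(parts)                         # (len(FIELDS) * DIM,)

vec = encode_metadata({"EchoTime": 30.0, "FlipAngle": 90.0})  # RepetitionTime missing
assert vec.shape == (len(FIELDS) * DIM,)
```

The point of the design is that "missing" becomes an informative, learnable signal rather than a value to be guessed.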

[537] Polarization Uncertainty-Guided Diffusion Model for Color Polarization Image Demosaicking

Chenggong Li, Yidong Luo, Junchao Zhang, Degui Yang

Main category: eess.IV

TL;DR: Proposes using text-to-image diffusion priors to improve polarization characteristic reconstruction in color polarization demosaicking, addressing limitations of network-based methods through uncertainty-guided diffusion.

Motivation: Existing network-based methods for color polarization demosaicking effectively recover scene intensity but have significant errors in reconstructing polarization characteristics (DOP and AOP) due to challenges in predicting numerous missing pixels and scarcity of high-quality training data.

Method: Introduces image diffusion prior from text-to-image models to overcome performance bottlenecks, explicitly models polarization uncertainty during reconstruction, and uses uncertainty to guide diffusion model in recovering high error regions.

Result: Extensive experiments demonstrate the method accurately recovers scene polarization characteristics with both high fidelity and strong visual perception.

Conclusion: The diffusion prior approach effectively addresses limitations of network-based methods for polarization characteristic reconstruction in CPDM.

Abstract: Color polarization demosaicking (CPDM) aims to reconstruct full-resolution polarization images of four directions from the color-polarization filter array (CPFA) raw image. Due to the challenge of predicting numerous missing pixels and the scarcity of high-quality training data, existing network-based methods, despite effectively recovering scene intensity information, still exhibit significant errors in reconstructing polarization characteristics (degree of polarization, DOP, and angle of polarization, AOP). To address this problem, we introduce the image diffusion prior from text-to-image (T2I) models to overcome the performance bottleneck of network-based methods, with the additional diffusion prior compensating for limited representational capacity caused by restricted data distribution. To effectively leverage the diffusion prior, we explicitly model the polarization uncertainty during reconstruction and use uncertainty to guide the diffusion model in recovering high error regions. Extensive experiments demonstrate that the proposed method accurately recovers scene polarization characteristics with both high fidelity and strong visual perception.
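For context, DOP and AOP follow from the linear Stokes parameters of the four demosaicked angle images, which is why per-channel reconstruction errors propagate strongly into the polarization characteristics the paper targets:

```python
import numpy as np

def dop_aop(i0, i45, i90, i135):
    """Degree and angle of polarization from the four angle intensities."""
    s0 = 0.5 * (i0 + i45 + i90 + i135)        # total intensity
    s1 = i0 - i90
    s2 = i45 - i135
    dop = np.sqrt(s1**2 + s2**2) / np.maximum(s0, 1e-8)  # degree of polarization
    aop = 0.5 * np.arctan2(s2, s1)                       # angle of polarization (rad)
    return dop, aop

# Fully horizontally polarized light: I0 = 1, I90 = 0, I45 = I135 = 0.5
dop, aop = dop_aop(np.array(1.0), np.array(0.5), np.array(0.0), np.array(0.5))
print(dop, aop)   # → 1.0 0.0
```

Because S1 and S2 are differences of reconstructed channels, small intensity errors that nearly cancel in S0 can dominate DOP and AOP in low-polarization regions.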

[538] Clinically-aligned ischemic stroke segmentation and ASPECTS scoring on NCCT imaging using a slice-gated loss on foundation representations

Hiba Azeem, Behraj Khan, Tahir Qasim Syed

Main category: eess.IV

TL;DR: A deep learning framework for stroke infarct segmentation on non-contrast CT that integrates anatomical priors with foundation model representations to improve ASPECTS scoring accuracy.

Motivation: Current deep learning methods for stroke infarct assessment on NCCT perform pixel-wise segmentation without modeling the structured anatomical reasoning underlying ASPECTS scoring, where basal ganglia and supraganglionic levels are clinically interpreted in a coupled manner.

Method: Proposes a clinically aligned framework combining a frozen DINOv3 backbone with a lightweight decoder and introduces a Territory-Aware Gated Loss (TAGL) to enforce BG-SG consistency during training without adding inference-time complexity.

Result: Achieves Dice score of 0.6385 on AISD dataset, outperforming prior CNN and foundation-model baselines. On proprietary ASPECTS dataset, TAGL improves mean Dice from 0.698 to 0.767.

Conclusion: Integrating foundation representations with structured clinical priors improves NCCT stroke segmentation and ASPECTS delineation, demonstrating the value of anatomically informed supervision.

Abstract: Rapid infarct assessment on non-contrast CT (NCCT) is essential for acute ischemic stroke management. Most deep learning methods perform pixel-wise segmentation without modeling the structured anatomical reasoning underlying ASPECTS scoring, where basal ganglia (BG) and supraganglionic (SG) levels are clinically interpreted in a coupled manner. We propose a clinically aligned framework that combines a frozen DINOv3 backbone with a lightweight decoder and introduce a Territory-Aware Gated Loss (TAGL) to enforce BG-SG consistency during training. This anatomically informed supervision adds no inference-time complexity. Our method achieves a Dice score of 0.6385 on AISD, outperforming prior CNN and foundation-model baselines. On a proprietary ASPECTS dataset, TAGL improves mean Dice from 0.698 to 0.767. These results demonstrate that integrating foundation representations with structured clinical priors improves NCCT stroke segmentation and ASPECTS delineation.
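A gated consistency term in the spirit of TAGL might look like the sketch below. The paper's exact gating and weighting are not specified in the summary, so treat this purely as an illustration of coupling two territory-level losses during training:

```python
import numpy as np

def soft_dice(p, g, eps=1e-6):
    """Soft Dice loss between a probability map p and binary mask g."""
    return 1.0 - (2 * (p * g).sum() + eps) / (p.sum() + g.sum() + eps)

def gated_loss(p_bg, g_bg, p_sg, g_sg, lam=0.5):
    base = soft_dice(p_bg, g_bg) + soft_dice(p_sg, g_sg)     # per-territory Dice
    gate = float(p_bg.max() > 0.5 and p_sg.max() > 0.5)      # both levels predict infarct?
    consistency = abs(p_bg.mean() - p_sg.mean())             # crude extent mismatch
    return base + lam * gate * consistency                   # penalty only when gated on

p = np.full((4, 4), 0.9)
g = np.ones((4, 4))
loss = gated_loss(p, g, p, g)
assert loss >= 0.0
```

The key property such a loss shares with TAGL is that it shapes training only; at inference the network runs as a plain per-pixel segmenter, adding no complexity.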

[539] Extending 2D foundational DINOv3 representations to 3D segmentation of neonatal brain MR images

Annayah Usman, Behraj Khan, Tahir Qasim Syed

Main category: eess.IV

TL;DR: A volumetric segmentation method that uses frozen 2D foundation model features with a structured window-based disassembly-reassembly mechanism for 3D hippocampal segmentation in infant brain MRIs.

Motivation: Precise 3D hippocampal segmentation is crucial for neurodevelopmental studies in infants, but foundation models trained on 2D visual data are limited for 3D brain anatomy analysis. There's a need to bridge 2D foundation representations with 3D medical imaging tasks.

Method: Proposes a volumetric segmentation strategy with structured window-based disassembly-reassembly: decomposes 3D MRI volumes into non-overlapping 3D windows, processes each with separate decoding arms using frozen 2D foundation features, then reassembles predictions with a dense-prediction head while maintaining anatomical consistency.

Result: Achieves Dice score of 0.65 for single 3D window hippocampal segmentation on ALBERT dataset, demonstrating that volumetric anatomical structures can be recovered from frozen 2D foundation representations through structured compositional decoding.

Conclusion: The method provides a principled, generalizable extension for using 2D foundation models in 3D medical applications, showing promise for bridging 2D visual representations with 3D anatomical analysis.

Abstract: Precise volumetric delineation of hippocampal structures is essential for quantifying neurodevelopmental trajectories in pre-term and term infants, where subtle morphological variations may carry prognostic significance. While foundation encoders trained on large-scale visual data offer discriminative representations, their 2D formulation is a limitation with respect to the 3D organization of brain anatomy. We propose a volumetric segmentation strategy that reconciles this tension through a structured window-based disassembly-reassembly mechanism: the global MRI volume is decomposed into non-overlapping 3D windows or sub-cubes, each processed via a separate decoding arm built upon frozen high-fidelity features, and subsequently reassembled prior to ground-truth correspondence using a dense-prediction head. This architecture preserves a constant decoder memory footprint while forcing predictions to lie within an anatomically consistent geometry. Evaluated on the ALBERT dataset for hippocampal segmentation, the proposed approach achieves a Dice score of 0.65 for a single 3D window. The method demonstrates that volumetric anatomical structure can be recovered from frozen 2D foundation representations through structured compositional decoding, and offers a principled and generalizable extension of foundation models to 3D medical applications.
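The disassembly-reassembly step itself is a pure tensor reshuffle and can be verified to be lossless; the per-window decoding arms are omitted in this sketch, and divisible dimensions are assumed:

```python
import numpy as np

def disassemble(vol: np.ndarray, w: int) -> np.ndarray:
    """Split a (D, H, W) volume into non-overlapping w×w×w windows."""
    D, H, W = vol.shape                        # assumes dims divisible by w
    return (vol.reshape(D // w, w, H // w, w, W // w, w)
               .transpose(0, 2, 4, 1, 3, 5)
               .reshape(-1, w, w, w))          # (n_windows, w, w, w)

def reassemble(wins: np.ndarray, shape: tuple, w: int) -> np.ndarray:
    """Inverse of disassemble: stitch windows back into the full volume."""
    D, H, W = shape
    return (wins.reshape(D // w, H // w, W // w, w, w, w)
                .transpose(0, 3, 1, 4, 2, 5)
                .reshape(D, H, W))

vol = np.arange(4 * 4 * 4).reshape(4, 4, 4).astype(float)
wins = disassemble(vol, 2)                     # 8 windows of 2×2×2
assert wins.shape == (8, 2, 2, 2)
assert np.array_equal(reassemble(wins, vol.shape, 2), vol)   # lossless round trip
```

Because each window is decoded independently, peak decoder memory depends on the window size, not the full volume, which is the constant-footprint property the abstract claims.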

[540] FermatSyn: SAM2-Enhanced Bidirectional Mamba with Isotropic Spiral Scanning for Multi-Modal Medical Image Synthesis

Feng Yuan

Main category: eess.IV

TL;DR: FermatSyn: A medical image synthesis method using SAM2-based anatomical priors, hierarchical residual downsampling, and Fermat spiral scanning with Mamba for global-local consistency and high-fidelity detail.

Motivation: Address limitations in multi-modal medical image synthesis where existing methods fail to reconcile global anatomical consistency with high-fidelity local detail, particularly for clinical applications with data scarcity.

Method: Three key components: 1) SAM2-based Prior Encoder with LoRA+ fine-tuning for domain-aware anatomical knowledge, 2) Hierarchical Residual Downsampling Module with Cross-scale Integration Network for detail preservation, 3) Fermat Spiral Scanning with Bidirectional Fermat Scan Mamba for isotropic receptive field.

Result: Outperforms state-of-the-art methods on SynthRAD2023, BraTS2019, BraTS-MEN, and BraTS-MET datasets in PSNR, SSIM, FID, and 3D structural consistency. Downstream segmentation shows no significant difference from real-image training.

Conclusion: FermatSyn effectively addresses global-local consistency in medical image synthesis, demonstrating clinical utility through downstream task performance comparable to real data.

Abstract: Multi-modal medical image synthesis is pivotal for alleviating clinical data scarcity, yet existing methods fail to reconcile global anatomical consistency with high-fidelity local detail. We propose FermatSyn, which addresses three persistent limitations: (1) a SAM2-based Prior Encoder that injects domain-aware anatomical knowledge via LoRA$^{+}$ efficient fine-tuning of a frozen SAM2 Vision Transformer; (2) a Hierarchical Residual Downsampling Module (HRDM) coupled with a Cross-scale Integration Network (CIN) that preserves high-frequency lesion details and adaptively fuses global-local representations; and (3) a continuity-constrained Fermat Spiral Scanning strategy within a Bidirectional Fermat Scan Mamba (BFS-Mamba), constructing an approximately isotropic receptive field that substantially reduces the directional bias of raster or spiral serialization. Experiments on SynthRAD2023, BraTS2019, BraTS-MEN, and BraTS-MET show FermatSyn surpasses state-of-the-art methods in PSNR, SSIM, FID, and 3D structural consistency. Downstream segmentation on synthesized images yields no significant difference from real-image training ($p>0.05$), confirming clinical utility. Code will be released upon publication. Keywords: medical image synthesis, SAM2, Mamba, Fermat spiral scanning, anatomical prior, cross-modal.
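A Fermat-spiral (golden-angle) serialization of a token grid can be sketched as below. The paper's continuity constraint is not reproduced, and the leftover-cell fallback is an assumption added here to guarantee a full permutation of the grid:

```python
import numpy as np

def fermat_order(n: int) -> list:
    """Order the cells of an n×n token grid along a Fermat spiral r = s·sqrt(t)."""
    golden = np.pi * (3 - np.sqrt(5))                 # golden angle
    c = (n - 1) / 2.0                                 # grid center
    order, seen = [], set()
    for t in range(8 * n * n):                        # oversampled spiral samples
        r = 0.5 * np.sqrt(t)
        y = int(round(c + r * np.sin(t * golden)))
        x = int(round(c + r * np.cos(t * golden)))
        if 0 <= y < n and 0 <= x < n and (y, x) not in seen:
            seen.add((y, x))
            order.append((y, x))
    # Fallback (illustrative, not from the paper): append any missed cells by radius
    leftover = [(y, x) for y in range(n) for x in range(n) if (y, x) not in seen]
    order += sorted(leftover, key=lambda p: (p[0] - c) ** 2 + (p[1] - c) ** 2)
    return order

order = fermat_order(8)
assert len(order) == 64 and len(set(order)) == 64     # a permutation of all tokens
assert order[0] == (4, 4)                             # serialization starts at the center
```

Unlike raster scans, consecutive indices in this ordering spread in all directions from the center, which is the approximate isotropy the abstract attributes to Fermat spiral scanning.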

[541] Less is More: AMBER-AFNO – a New Benchmark for Lightweight 3D Medical Image Segmentation

Andrea Dosi, Semanto Mondal, Rajib Chandra Ghosh, Massimo Brescia, Giuseppe Longo

Main category: eess.IV

TL;DR: AMBER-AFNO adapts remote sensing model to 3D medical segmentation using Adaptive Fourier Neural Operators instead of self-attention for quasi-linear computational complexity.

Motivation: To address computational bottlenecks of volumetric transformers in 3D medical image segmentation by replacing quadratic-complexity self-attention with more efficient frequency-domain operations.

Method: Adapts AMBER model from remote sensing to 3D medical segmentation, replacing multi-head self-attention with Adaptive Fourier Neural Operators (AFNO) for global token mixing in frequency domain, achieving quasi-linear complexity.

Result: Achieves state-of-the-art or near-state-of-the-art results on ACDC, Synapse, and BraTS datasets for DSC and HD95 metrics, with higher Dice scores than recent compact CNN/Transformer architectures while maintaining compact model size.

Conclusion: Frequency-domain token mixing with AFNO provides fast and efficient alternative to self-attention for 3D medical image segmentation, reducing computational burden while preserving global context modeling.

Abstract: We adapt the remote sensing-inspired AMBER model from multi-band image segmentation to 3D medical datacube segmentation. To address the computational bottleneck of the volumetric transformer, we propose the AMBER-AFNO architecture. This approach uses Adaptive Fourier Neural Operators (AFNO) instead of the multi-head self-attention mechanism. Unlike spatial pairwise interactions between tokens, global token mixing in the frequency domain avoids $\mathcal{O}(N^2)$ attention-weight calculations. As a result, AMBER-AFNO achieves quasi-linear computational complexity and linear memory scaling. This new way to model global context reduces reliance on dense transformers while preserving global contextual modeling capability. By using attention-free spectral operations, our design offers a compact parameterization and maintains a competitive computational complexity. We evaluate AMBER-AFNO on three public datasets: ACDC, Synapse, and BraTS. On these datasets, the model achieves state-of-the-art or near-state-of-the-art results for DSC and HD95. Compared with recent compact CNN and Transformer architectures, our approach yields higher Dice scores while maintaining a compact model size. Overall, our results show that frequency-domain token mixing with AFNO provides a fast and efficient alternative to self-attention mechanisms for 3D medical image segmentation.
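The core frequency-domain mixing pattern that replaces quadratic attention can be shown in a few lines. AFNO additionally applies block-diagonal MLPs and soft-thresholding per frequency mode; this diagonal-filter sketch keeps only the FFT → per-mode weighting → inverse FFT skeleton:

```python
import numpy as np

def spectral_mix(tokens: np.ndarray, filt: np.ndarray) -> np.ndarray:
    """tokens: (H, W) grid of token values; filt: complex per-mode weights.

    Global token mixing in O(N log N) via the FFT, instead of O(N^2)
    pairwise attention weights.
    """
    modes = np.fft.rfft2(tokens)                          # to frequency domain
    return np.fft.irfft2(modes * filt, s=tokens.shape)    # mix globally, back to space

tokens = np.random.default_rng(0).normal(size=(8, 8))
identity = np.ones((8, 5), dtype=complex)                 # rfft2 of (8, 8) has shape (8, 5)
assert np.allclose(spectral_mix(tokens, identity), tokens)  # identity filter is exact
```

Every output token depends on every input token through the shared frequency weights, which is how global context survives without dense attention.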

[542] Scale Equivariance Regularization and Feature Lifting in High Dynamic Range Modulo Imaging

Brayan Monroy, Jorge Bacca

Main category: eess.IV

TL;DR: Learning-based HDR restoration framework for modulo imaging using scale-equivariant regularization and feature lifting to distinguish true structure from wrapping artifacts.

Motivation: Modulo imaging enables HDR acquisition but suffers from reconstruction challenges due to ambiguities between natural image edges and artificial wrap discontinuities, requiring better methods to distinguish true structure from artifacts.

Method: Proposes a learning-based HDR restoration framework with two key strategies: (1) scale-equivariant regularization that enforces consistency under exposure variations, and (2) feature lifting input design combining raw modulo image, wrapped finite differences, and closed-form initialization.

Result: Achieves state-of-the-art performance across both perceptual and linear HDR quality metrics, demonstrating enhanced ability to distinguish true structure from wrapping artifacts.

Conclusion: The proposed framework effectively addresses the challenges of modulo imaging reconstruction through scale-equivariant regularization and feature lifting, yielding superior HDR restoration results.

Abstract: Modulo imaging enables high dynamic range (HDR) acquisition by cyclically wrapping saturated intensities, but accurate reconstruction remains challenging due to ambiguities between natural image edges and artificial wrap discontinuities. This work proposes a learning-based HDR restoration framework that incorporates two key strategies: (i) a scale-equivariant regularization that enforces consistency under exposure variations, and (ii) a feature lifting input design combining the raw modulo image, wrapped finite differences, and a closed-form initialization. Together, these components enhance the network’s ability to distinguish true structure from wrapping artifacts, yielding state-of-the-art performance across perceptual and linear HDR quality metrics.
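Why wrapped finite differences and a closed-form initialization are useful network inputs: under Itoh's condition (true neighboring differences smaller than half the modulus), the wrapped differences equal the true ones, so a 1D cumulative-sum unwrap recovers the signal exactly. The sketch below illustrates this in 1D; the paper operates on 2D images, where the condition fails at strong edges and learning takes over:

```python
import numpy as np

def unwrap_1d(y: np.ndarray, m: float) -> np.ndarray:
    """Closed-form unwrap of a modulo-m signal via wrapped finite differences."""
    d = np.diff(y)
    d_wrapped = (d + m / 2) % m - m / 2       # wrap differences into [-m/2, m/2)
    return y[0] + np.concatenate([[0.0], np.cumsum(d_wrapped)])

m = 1.0
x = np.linspace(0, 3.2, 50)                   # HDR ramp exceeding the modulus
y = x % m                                     # modulo (wrapped) observation
assert np.allclose(unwrap_1d(y, m), x)        # exact recovery under Itoh's condition
```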

Last updated: 2026-03-06
Built with Hugo, theme modified from Stack