Editor’s Picks
Top papers matching your research interests in multimodal LLMs, audio and vision understanding/generation.
[1] Granulon: Awakening Pixel-Level Visual Encoders with Adaptive Multi-Granularity Semantics for MLLM
Junyuan Mao, Qiankun Li, Linghao Meng, Zhicheng He, Xinliang Zhou, Kun Wang, Yang Liu, Yueming Jin
Main category: cs.CV
TL;DR: Granulon is a DINOv3-based multimodal LLM with adaptive granularity augmentation that dynamically adjusts visual abstraction levels based on text input for better fine-grained understanding.
Details
Motivation: Current MLLMs rely on CLIP-based visual encoders that focus on global semantic alignment but struggle with fine-grained visual understanding, while DINOv3 provides good pixel-level perception but lacks coarse-grained semantic abstraction, creating a gap in multi-granularity reasoning.
Method: Proposes Granulon with: 1) text-conditioned granularity Controller that dynamically adjusts visual abstraction level based on textual input’s semantic scope, and 2) Adaptive Token Aggregation module that performs granularity-guided pooling and relation-aware clustering to produce compact, semantically rich visual tokens.
Result: Extensive experiments show Granulon improves accuracy by ~30% and reduces hallucination by ~20%, outperforming all visual encoders under identical settings.
Conclusion: Granulon enables unified “pixel-to-fine-to-coarse” reasoning within a single forward pass, addressing the multi-granularity gap in current MLLMs.
Abstract: Recent advances in multimodal large language models largely rely on CLIP-based visual encoders, which emphasize global semantic alignment but struggle with fine-grained visual understanding. In contrast, DINOv3 provides strong pixel-level perception yet lacks coarse-grained semantic abstraction, leading to limited multi-granularity reasoning. To address this gap, we propose Granulon, a novel DINOv3-based MLLM with adaptive granularity augmentation. Granulon introduces a text-conditioned granularity Controller that dynamically adjusts the visual abstraction level according to the semantic scope of the textual input, and an Adaptive Token Aggregation module that performs granularity-guided pooling and relation-aware clustering to produce compact, semantically rich visual tokens. This design enables unified “pixel-to-fine-to-coarse” reasoning within a single forward pass. Extensive and interpretable experiments demonstrate that Granulon improves accuracy by ~30% and reduces hallucination by ~20%, outperforming all visual encoders under identical settings.
Relevance: 9/10
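The summary does not spell out the Adaptive Token Aggregation module, but the core idea of granularity-guided pooling can be sketched with simple average pooling over patch tokens, where a text-derived granularity controls the group size. All names, shapes, and the pooling scheme below are illustrative, not the authors' implementation:

```python
import numpy as np

def granularity_pool(visual_tokens, granularity):
    """Pool pixel-level patch tokens into coarser groups.

    visual_tokens: (N, D) array of patch embeddings.
    granularity:   group size; 1 keeps pixel-level detail,
                   larger values yield coarser semantics.
    """
    n, d = visual_tokens.shape
    g = max(1, int(granularity))
    pad = (-n) % g                      # zero-pad so N divides evenly
    padded = np.vstack([visual_tokens, np.zeros((pad, d))])
    # Real (non-padded) token count per group, so the last group's
    # mean is not diluted by the zero padding.
    counts = np.minimum(g, n - np.arange(0, n + pad, g))
    pooled = padded.reshape(-1, g, d).sum(axis=1) / counts[:, None]
    return pooled

# A broad question ("describe the scene") would map to coarse pooling,
# a fine-grained one ("read the small text") to fine pooling.
tokens = np.random.randn(196, 8)        # toy 14x14 patch grid
coarse = granularity_pool(tokens, 4)    # -> (49, 8)
fine = granularity_pool(tokens, 1)      # -> (196, 8)
```

In the paper's design a learned, text-conditioned controller would choose the abstraction level; here it is just a hand-set integer.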
[2] MORE-R1: Guiding LVLM for Multimodal Object-Entity Relation Extraction via Stepwise Reasoning with Reinforcement Learning
Xiang Yuan, Xu Chu, Xinrong Chen, Haochen Li, Zonghong Dai, Hongcheng Fan, Xiaoyue Yuan, Weiping Li, Tong Mo
Main category: cs.MM
TL;DR: MORE-R1 introduces explicit stepwise reasoning with Reinforcement Learning to enable Large Vision-Language Models to effectively perform Multimodal Object-Entity Relation Extraction, achieving state-of-the-art performance.
Details
Motivation: Existing methods for Multimodal Object-Entity Relation Extraction (MORE) struggle with complex extraction scenarios, limited scalability, and lack of intermediate reasoning transparency. Current approaches are mainly classification-based or generation-based without explicit reasoning capabilities.
Method: Proposes MORE-R1 with a two-stage training process: 1) Initial cold-start training with Supervised Fine-Tuning using automatically constructed dataset with fine-grained stepwise reasoning, 2) Reinforcement Learning stage using Group Relative Policy Optimization with Progressive Sample-Mixing Strategy to enhance reasoning on hard samples.
Result: Comprehensive experiments on the MORE benchmark demonstrate state-of-the-art performance with significant improvement over baselines.
Conclusion: MORE-R1 effectively addresses the challenges of MORE task by introducing explicit stepwise reasoning with RL, improving both performance and reasoning transparency.
Abstract: Multimodal Object-Entity Relation Extraction (MORE) is a challenging task in information extraction research. It aims to identify relations between visual objects and textual entities, requiring complex multimodal understanding and cross-modal reasoning abilities. Existing methods, mainly classification-based or generation-based without reasoning, struggle to handle complex extraction scenarios in the MORE task and suffer from limited scalability and intermediate reasoning transparency. To address these challenges, we propose MORE-R1, a novel model that introduces explicit stepwise reasoning with Reinforcement Learning (RL) to enable Large Vision-Language Model (LVLM) to address the MORE task effectively. MORE-R1 integrates a two-stage training process, including an initial cold-start training stage with Supervised Fine-Tuning (SFT) and a subsequent RL stage for reasoning ability optimization. In the initial stage, we design an efficient way to automatically construct a high-quality SFT dataset containing fine-grained stepwise reasoning tailored to the MORE task, enabling the model to learn an effective reasoning paradigm. In the subsequent stage, we employ the Group Relative Policy Optimization (GRPO) RL algorithm with a Progressive Sample-Mixing Strategy to stabilize training and further enhance model’s reasoning ability on hard samples. Comprehensive experiments on the MORE benchmark demonstrate that MORE-R1 achieves state-of-the-art performance with significant improvement over baselines.
Relevance: 9/10
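The RL stage uses Group Relative Policy Optimization, whose defining step is to score each sampled completion against its own group rather than a learned value function. The normalization below is the standard GRPO advantage computation, not anything MORE-R1-specific:

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sampled completion's
    reward against the mean and std of its sampling group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Four completions sampled for one MORE question:
# two correct relation extractions (reward 1), two wrong (reward 0).
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
# Correct samples get positive advantage, wrong ones negative,
# so the policy gradient pushes toward the correct reasoning traces.
```

These advantages then weight the clipped policy-gradient objective, exactly as in PPO but without a critic.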
[3] Reading, Not Thinking: Understanding and Bridging the Modality Gap When Text Becomes Pixels in Multimodal LLMs
Kaiser Sun, Xiaochuang Yuan, Hongjun Liu, Chen Zhao, Cheng Zhang, Mark Dredze, Fan Bai
Main category: cs.CL
TL;DR: MLLMs perform worse on text presented as images vs. text tokens; systematic study reveals task/data-dependent modality gap influenced by rendering choices; self-distillation method improves visual text understanding significantly.
Details
Motivation: To systematically diagnose why multimodal large language models (MLLMs) perform worse when processing text presented as images compared to when the same content is provided as textual tokens, and to understand the factors contributing to this "modality gap."
Method: Evaluated seven MLLMs across seven benchmarks in five input modes, spanning synthetic text renderings and realistic document images. Conducted grounded-theory error analysis of over 4,000 examples. Proposed self-distillation method training models on their own pure text reasoning traces paired with image inputs.
Result: Modality gap is task- and data-dependent (math tasks degrade by over 60 points on synthetic renderings). Rendering choices like font and resolution strongly affect accuracy (font alone swings accuracy by up to 47 percentage points). Image mode selectively amplifies reading errors while leaving knowledge/reasoning errors unchanged. Self-distillation raised image-mode accuracy on GSM8K from 30.71% to 92.72% and transferred to unseen benchmarks.
Conclusion: Provides systematic understanding of modality gap in MLLMs and demonstrates practical path to improve visual text understanding through self-distillation, addressing a key limitation in multimodal language models.
Abstract: Multimodal large language models (MLLMs) can process text presented as images, yet they often perform worse than when the same content is provided as textual tokens. We systematically diagnose this “modality gap” by evaluating seven MLLMs across seven benchmarks in five input modes, spanning both synthetically rendered text and realistic document images from arXiv PDFs to Wikipedia pages. We find that the modality gap is task- and data-dependent. For example, math tasks degrade by over 60 points on synthetic renderings, while natural document images often match or exceed text-mode performance. Rendering choices such as font and resolution are strong confounds, with font alone swinging accuracy by up to 47 percentage points. To understand this, we conduct a grounded-theory error analysis of over 4,000 examples, revealing that image mode selectively amplifies reading errors (calculation and formatting failures) while leaving knowledge and reasoning errors largely unchanged, and that some models exhibit a chain-of-thought reasoning collapse under visual input. Motivated by these findings, we propose a self-distillation method that trains the model on its own pure text reasoning traces paired with image inputs, raising image-mode accuracy on GSM8K from 30.71% to 92.72% and transferring to unseen benchmarks without catastrophic forgetting. Overall, our study provides a systematic understanding of the modality gap and suggests a practical path toward improving visual text understanding in multimodal language models.
Relevance: 9/10
Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 89]
- cs.CV [Total: 226]
- cs.AI [Total: 103]
- cs.SD [Total: 18]
- cs.LG [Total: 159]
- cs.MA [Total: 8]
- cs.MM [Total: 5]
- eess.AS [Total: 19]
- eess.IV [Total: 8]
cs.CL
[1] One Language, Two Scripts: Probing Script-Invariance in LLM Concept Representations
Sripad Karne
Main category: cs.CL
TL;DR: SAE features in language models capture semantic meaning rather than orthographic form, as shown by high feature overlap between identical Serbian sentences written in Latin vs Cyrillic scripts despite completely different tokenization.
Details
Motivation: To determine whether Sparse Autoencoder (SAE) features represent abstract meaning or are tied to surface-level text characteristics like orthography, using Serbian digraphia as a controlled testbed where meaning is constant but scripts differ.
Method: Used Serbian digraphia (Latin vs Cyrillic scripts with perfect character mapping) to analyze SAE feature activations across Gemma models (270M-27B parameters). Compared feature overlap between identical sentences in different scripts, paraphrases within same script, and cross-script cross-paraphrase combinations.
Result: Identical sentences in different scripts activate highly overlapping SAE features, exceeding random baselines. Script changes cause less representational divergence than paraphrasing within same script. Cross-script cross-paraphrase combinations show substantial feature overlap despite rarely co-occurring in training data. Script invariance strengthens with model scale.
Conclusion: SAE features capture semantics at a level of abstraction above surface tokenization, prioritizing meaning over orthographic form. Serbian digraphia provides a useful evaluation paradigm for probing abstractness of learned representations in language models.
Abstract: Do the features learned by Sparse Autoencoders (SAEs) represent abstract meaning, or are they tied to how text is written? We investigate this question using Serbian digraphia as a controlled testbed: Serbian is written interchangeably in Latin and Cyrillic scripts with a near-perfect character mapping between them, enabling us to vary orthography while holding meaning exactly constant. Crucially, these scripts are tokenized completely differently, sharing no tokens whatsoever. Analyzing SAE feature activations across the Gemma model family (270M-27B parameters), we find that identical sentences in different Serbian scripts activate highly overlapping features, far exceeding random baselines. Strikingly, changing script causes less representational divergence than paraphrasing within the same script, suggesting SAE features prioritize meaning over orthographic form. Cross-script cross-paraphrase comparisons provide evidence against memorization, as these combinations rarely co-occur in training data yet still exhibit substantial feature overlap. This script invariance strengthens with model scale. Taken together, our findings suggest that SAE features can capture semantics at a level of abstraction above surface tokenization, and we propose Serbian digraphia as a general evaluation paradigm for probing the abstractness of learned representations.
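The overlap comparisons reduce to set overlap between the SAE features that fire on each input. The paper does not state its exact overlap metric, so the sketch below uses plain Jaccard similarity over active-feature ids (the ids themselves are made up):

```python
def feature_overlap(feats_a, feats_b):
    """Jaccard overlap between the sets of SAE features that fire
    on two inputs, e.g. Latin vs Cyrillic renderings of a sentence."""
    a, b = set(feats_a), set(feats_b)
    if not a and not b:
        return 1.0  # two empty activations agree vacuously
    return len(a & b) / len(a | b)

latin = {12, 88, 301, 512}        # hypothetical active feature ids
cyrillic = {12, 88, 301, 777}     # same sentence, different script
feature_overlap(latin, cyrillic)  # -> 0.6
```

The paper's finding is that this kind of overlap for cross-script pairs exceeds random baselines and even within-script paraphrase pairs.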
[2] MultiGraSCCo: A Multilingual Anonymization Benchmark with Annotations of Personal Identifiers
Ibrahim Baroud, Christoph Otto, Vera Czehmann, Christine Hovhannisyan, Lisa Raithel, Sebastian Möller, Roland Roller
Main category: cs.CL
TL;DR: Created multilingual medical anonymization benchmark in 10 languages using machine translation to preserve annotations while adapting personal information culturally, enabling privacy-compliant data sharing for healthcare ML.
Details
Motivation: Privacy regulations limit access to real patient data for developing anonymization systems. Synthetic data and translation methods can overcome data scarcity while preserving privacy compliance.
Method: Used neural machine translation to create multilingual benchmark in 10 languages, preserving original annotations while culturally adapting personal information (names, cities) to each target language context.
Result: Created benchmark with 2,500+ personal information annotations validated by medical professionals. Translations show high quality in general and for personal information adaptation.
Conclusion: Multilingual anonymization benchmark enables privacy-compliant data sharing, training annotators, validating annotations across institutions, and improving automatic personal information detection systems.
Abstract: Accessing sensitive patient data for machine learning is challenging due to privacy concerns. Datasets with annotations of personally identifiable information are crucial for developing and testing anonymization systems to enable safe data sharing that complies with privacy regulations. Since accessing real patient data is a bottleneck, synthetic data offers an efficient solution for data scarcity, bypassing privacy regulations that apply to real data. Moreover, neural machine translation can help to create high-quality data for low-resource languages by translating validated real or synthetic data from a high-resource language. In this work, we create a multilingual anonymization benchmark in ten languages, using a machine translation methodology that preserves the original annotations and renders names of cities and people in a culturally and contextually appropriate form in each target language. Our evaluation study with medical professionals confirms the quality of the translations, both in general and with respect to the translation and adaptation of personal information. Our benchmark with over 2,500 annotations of personal information can be used in many applications, including training annotators, validating annotations across institutions without legal complications, and helping improve the performance of automatic personal information detection. We make our benchmark and annotation guidelines available for further research.
[3] ConFu: Contemplate the Future for Better Speculative Sampling
Zongyue Qin, Raghavv Goel, Mukul Gagrani, Risheek Garrepalli, Mingu Lee, Yizhou Sun
Main category: cs.CL
TL;DR: ConFu introduces a speculative decoding framework with future prediction capabilities to improve draft model quality and reduce error accumulation in LLM inference acceleration.
Details
Motivation: Existing speculative decoding approaches suffer from error accumulation because draft models only condition on current prefixes, causing predictions to drift from target models over time. The quality of draft models is critical for effective speculative decoding.
Method: ConFu introduces: (1) contemplate tokens and soft prompts that allow draft models to leverage future-oriented signals from target models, (2) dynamic contemplate token mechanism with MoE for context-aware future prediction, and (3) training framework with anchor token sampling and future prediction replication for robust learning.
Result: ConFu improves token acceptance rates and generation speed over state-of-the-art EAGLE-3 by 8-11% across various downstream tasks with Llama-3 3B and 8B models.
Conclusion: ConFu bridges speculative decoding with continuous reasoning tokens, offering a new direction for accelerating LLM inference by enabling draft models to anticipate future generation directions.
Abstract: Speculative decoding has emerged as a powerful approach to accelerate large language model (LLM) inference by employing lightweight draft models to propose candidate tokens that are subsequently verified by the target model. The effectiveness of this paradigm critically depends on the quality of the draft model. While recent advances such as the EAGLE series achieve state-of-the-art speedup, existing draft models remain limited by error accumulation: they condition only on the current prefix, causing their predictions to drift from the target model over steps. In this work, we propose ConFu (Contemplate the Future), a novel speculative decoding framework that enables draft models to anticipate the future direction of generation. ConFu introduces (i) contemplate tokens and soft prompts that allow the draft model to leverage future-oriented signals from the target model at negligible cost, (ii) a dynamic contemplate token mechanism with MoE to enable context-aware future prediction, and (iii) a training framework with anchor token sampling and future prediction replication that learns robust future prediction. Experiments demonstrate that ConFu improves token acceptance rates and generation speed over EAGLE-3 by 8–11% across various downstream tasks with Llama-3 3B and 8B models. We believe our work is the first to bridge speculative decoding with continuous reasoning tokens, offering a new direction for accelerating LLM inference.
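For context, the token acceptance rate that ConFu improves comes from the standard speculative-sampling verification rule: each draft token is accepted with probability min(1, p_target/p_draft), stopping at the first rejection. A toy sketch of that verification loop (token ids and probabilities are invented values, and the rejected-token resampling step is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def verify_draft(draft_tokens, p_draft, p_target):
    """Standard speculative-sampling verification: accept each draft
    token with prob min(1, p_target/p_draft); stop at the first
    rejection. Higher draft quality -> longer accepted runs."""
    accepted = []
    for tok, pd, pt in zip(draft_tokens, p_draft, p_target):
        if rng.random() < min(1.0, pt / pd):
            accepted.append(tok)
        else:
            break  # target model resamples here in the full algorithm
    return accepted
```

ConFu's contribution is upstream of this loop: by conditioning the draft model on future-oriented signals, the per-token ratio p_target/p_draft stays closer to 1 over longer drafts.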
[4] SciTaRC: Benchmarking QA on Scientific Tabular Data that Requires Language Reasoning and Complex Computation
Hexuan Wang, Yaxuan Ren, Srikar Bommireddypalli, Shuxian Chen, Adarsh Prabhudesai, Rongkun Zhou, Elina Baral, Philipp Koehn
Main category: cs.CL
TL;DR: SciTaRC is a benchmark for scientific table reasoning requiring both language understanding and computation, revealing significant gaps in current AI models due to execution bottlenecks.
Details
Motivation: To create a challenging benchmark for evaluating AI models on scientific table reasoning tasks that require both deep language understanding and complex computation, addressing limitations of existing benchmarks.
Method: Developed SciTaRC benchmark with expert-authored questions about tabular data in scientific papers, then evaluated state-of-the-art AI models including code-based methods and language models on these tasks.
Result: Current AI models fail on at least 23% of SciTaRC questions, with Llama-3.3-70B-Instruct failing on 65.5% of tasks. Analysis reveals universal “execution bottleneck” where models struggle to faithfully execute plans even with correct strategies.
Conclusion: Scientific table reasoning remains a significant challenge for current AI models, with execution bottlenecks being a key limitation that affects both code-based and language-based approaches.
Abstract: We introduce SciTaRC, an expert-authored benchmark of questions about tabular data in scientific papers requiring both deep language reasoning and complex computation. We show that current state-of-the-art AI models fail on at least 23% of these questions, a gap that remains significant even for highly capable open-weight models like Llama-3.3-70B-Instruct, which fails on 65.5% of the tasks. Our analysis reveals a universal “execution bottleneck”: both code and language models struggle to faithfully execute plans, even when provided with correct strategies. Specifically, code-based methods prove brittle on raw scientific tables, while natural language reasoning primarily fails due to initial comprehension issues and calculation errors.
[5] Automated Thematic Analysis for Clinical Qualitative Data: Iterative Codebook Refinement with Full Provenance
Seungjun Yi, Joakim Nguyen, Huimin Xu, Terence Lim, Joseph Skrovan, Mehak Beri, Hitakshi Modi, Andrew Well, Carlos M. Mery, Yan Zhang, Mia K. Markey, Ying Ding
Main category: cs.CL
TL;DR: Automated thematic analysis framework combining iterative codebook refinement with provenance tracking for scalable and reproducible qualitative analysis in health research.
Details
Motivation: Manual thematic analysis in health research faces scalability and reproducibility challenges, while existing LLM-based automation approaches produce codebooks with limited generalizability and lack analytic auditability.
Method: Combines iterative codebook refinement with full provenance tracking to create an automated thematic analysis framework that improves codebook quality through multiple refinement cycles while maintaining complete audit trails.
Result: Achieved highest composite quality score on 4 out of 5 datasets compared to six baselines, with iterative refinement yielding statistically significant improvements on four datasets with large effect sizes, particularly in code reusability and distributional consistency.
Conclusion: The framework provides a scalable, reproducible approach to thematic analysis that maintains analytic auditability while achieving high-quality results comparable to expert annotations in clinical domains.
Abstract: Thematic analysis (TA) is widely used in health research to extract patterns from patient interviews, yet manual TA faces challenges in scalability and reproducibility. LLM-based automation can help, but existing approaches produce codebooks with limited generalizability and lack analytic auditability. We present an automated TA framework combining iterative codebook refinement with full provenance tracking. Evaluated on five corpora spanning clinical interviews, social media, and public transcripts, the framework achieves the highest composite quality score on four of five datasets compared to six baselines. Iterative refinement yields statistically significant improvements on four datasets with large effect sizes, driven by gains in code reusability and distributional consistency while preserving descriptive quality. On two clinical corpora (pediatric cardiology), generated themes align with expert-annotated themes.
[6] Learning When to Sample: Confidence-Aware Self-Consistency for Efficient LLM Chain-of-Thought Reasoning
Juming Xiong, Kevin Guo, Congning Ni, Chao Yan, Katherine Brown, Avinash Baidya, Xiang Gao, Bradley Marlin, Zhijun Yin
Main category: cs.CL
TL;DR: A confidence-aware decision framework that analyzes single reasoning trajectories to adaptively choose between single-path and multi-path reasoning, reducing token usage by up to 80% while maintaining accuracy comparable to multi-path baselines.
Details
Motivation: LLMs using chain-of-thought reasoning often generate unnecessarily long reasoning paths with high inference costs. Self-consistency approaches improve accuracy but require sampling multiple trajectories, leading to substantial computational overhead.
Method: A confidence-aware decision framework that analyzes a single completed reasoning trajectory using sentence-level numeric and linguistic features extracted from intermediate reasoning states. The framework adaptively selects between single-path and multi-path reasoning based on confidence signals.
Result: The method maintains accuracy comparable to multi-path baselines while using up to 80% fewer tokens. It generalizes effectively to MathQA, MedMCQA, and MMLU without additional fine-tuning.
Conclusion: Reasoning trajectories contain rich signals for uncertainty estimation, enabling a simple, transferable mechanism to balance accuracy and efficiency in LLM reasoning.
Abstract: Large language models (LLMs) achieve strong reasoning performance through chain-of-thought (CoT) reasoning, yet often generate unnecessarily long reasoning paths that incur high inference cost. Recent self-consistency-based approaches further improve accuracy but require sampling and aggregating multiple reasoning trajectories, leading to substantial additional computational overhead. This paper introduces a confidence-aware decision framework that analyzes a single completed reasoning trajectory to adaptively select between single-path and multi-path reasoning. The framework is trained using sentence-level numeric and linguistic features extracted from intermediate reasoning states in the MedQA dataset and generalizes effectively to MathQA, MedMCQA, and MMLU without additional fine-tuning. Experimental results show that the proposed method maintains accuracy comparable to multi-path baselines while using up to 80% fewer tokens. These findings demonstrate that reasoning trajectories contain rich signals for uncertainty estimation, enabling a simple, transferable mechanism to balance accuracy and efficiency in LLM reasoning.
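The gating idea can be sketched with a toy logistic scorer standing in for the trained feature model: extract features from one completed trace, compute a confidence, and sample extra paths only when confidence is low. The feature set, weights, and threshold below are all placeholders, not the paper's trained model:

```python
import math

def decide_sampling(trajectory_features, weights, bias=0.0, threshold=0.5):
    """Logistic confidence score over sentence-level features of one
    completed CoT trace; trigger multi-path self-consistency only
    when the trace looks uncertain."""
    z = sum(w * f for w, f in zip(weights, trajectory_features)) + bias
    confidence = 1.0 / (1.0 + math.exp(-z))
    return "single-path" if confidence >= threshold else "multi-path"

# A confidently-scored trace keeps the cheap single answer;
# an uncertain one pays for extra samples.
decide_sampling([1.0], [2.0])    # high score -> "single-path"
decide_sampling([1.0], [-2.0])   # low score  -> "multi-path"
```

The reported ~80% token savings come from the fact that, on most inputs, the trace is confident enough that the extra samples are skipped.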
[7] Reading, Not Thinking: Understanding and Bridging the Modality Gap When Text Becomes Pixels in Multimodal LLMs
Kaiser Sun, Xiaochuang Yuan, Hongjun Liu, Chen Zhao, Cheng Zhang, Mark Dredze, Fan Bai
Main category: cs.CL
TL;DR: MLLMs perform worse on text presented as images vs. text tokens; systematic study reveals task/data-dependent modality gap influenced by rendering choices; self-distillation method improves visual text understanding significantly.
Details
Motivation: To systematically diagnose why multimodal large language models (MLLMs) perform worse when processing text presented as images compared to when the same content is provided as textual tokens, and to understand the factors contributing to this "modality gap."
Method: Evaluated seven MLLMs across seven benchmarks in five input modes, spanning synthetic text renderings and realistic document images. Conducted grounded-theory error analysis of over 4,000 examples. Proposed self-distillation method training models on their own pure text reasoning traces paired with image inputs.
Result: Modality gap is task- and data-dependent (math tasks degrade by over 60 points on synthetic renderings). Rendering choices like font and resolution strongly affect accuracy (font alone swings accuracy by up to 47 percentage points). Image mode selectively amplifies reading errors while leaving knowledge/reasoning errors unchanged. Self-distillation raised image-mode accuracy on GSM8K from 30.71% to 92.72% and transferred to unseen benchmarks.
Conclusion: Provides systematic understanding of modality gap in MLLMs and demonstrates practical path to improve visual text understanding through self-distillation, addressing a key limitation in multimodal language models.
Abstract: Multimodal large language models (MLLMs) can process text presented as images, yet they often perform worse than when the same content is provided as textual tokens. We systematically diagnose this “modality gap” by evaluating seven MLLMs across seven benchmarks in five input modes, spanning both synthetically rendered text and realistic document images from arXiv PDFs to Wikipedia pages. We find that the modality gap is task- and data-dependent. For example, math tasks degrade by over 60 points on synthetic renderings, while natural document images often match or exceed text-mode performance. Rendering choices such as font and resolution are strong confounds, with font alone swinging accuracy by up to 47 percentage points. To understand this, we conduct a grounded-theory error analysis of over 4,000 examples, revealing that image mode selectively amplifies reading errors (calculation and formatting failures) while leaving knowledge and reasoning errors largely unchanged, and that some models exhibit a chain-of-thought reasoning collapse under visual input. Motivated by these findings, we propose a self-distillation method that trains the model on its own pure text reasoning traces paired with image inputs, raising image-mode accuracy on GSM8K from 30.71% to 92.72% and transferring to unseen benchmarks without catastrophic forgetting. Overall, our study provides a systematic understanding of the modality gap and suggests a practical path toward improving visual text understanding in multimodal language models.
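The self-distillation recipe pairs each question's image rendering with the model's own text-mode reasoning trace as the fine-tuning target. A schematic of the pair construction, where `render_fn` and `model_generate` are placeholder callables rather than the authors' code:

```python
def build_distillation_pairs(examples, render_fn, model_generate):
    """Self-distillation data: for each text question, (1) record the
    model's own pure-text reasoning trace, then (2) pair that trace
    with the image-rendered version of the same question."""
    pairs = []
    for q in examples:
        trace = model_generate(q)   # text-mode trace (teacher signal)
        image = render_fn(q)        # same content, rendered as pixels
        pairs.append({"input_image": image, "target": trace})
    return pairs
```

Fine-tuning on such pairs teaches the model to reproduce its text-mode reasoning when reading the pixels, which is consistent with the paper's finding that the gap is mostly a reading failure rather than a reasoning failure.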
[8] Bioalignment: Measuring and Improving LLM Disposition Toward Biological Systems for AI Safety
Trent R Northen, Mingxun Wang
Main category: cs.CL
TL;DR: Fine-tuning small LLMs with biological literature can reduce synthetic bias and increase preference for biological solutions without degrading general capabilities.
Details
Motivation: LLMs trained on internet-scale data often exhibit systematic biases favoring synthetic over biological technological solutions, which may limit balanced consideration of biological approaches in problem-solving.
Method: Used Kelly criterion-inspired framework to measure bioalignment bias in 10 LLMs, then fine-tuned Llama 3.2-3B-Instruct and Qwen2.5-3B-Instruct with ~22M tokens from 6,636 PMC articles emphasizing biological problem-solving using QLoRA fine-tuning.
Result: Most models showed synthetic bias; QLoRA fine-tuning significantly increased biological solution preferences for both models (p < 0.001 and p < 0.01) without degrading general capabilities.
Conclusion: Even small fine-tuning can shift LLM preferences toward biological approaches, suggesting potential for developing bio-aligned models; benchmark, corpus, code, and weights released.
Abstract: Large language models (LLMs) trained on internet-scale corpora can exhibit systematic biases that increase the probability of unwanted behavior. In this study, we examined potential biases towards synthetic vs. biological technological solutions across four domains (materials, energy, manufacturing, and algorithms). A sample of 5 frontier and 5 open-weight models was measured using 50 curated Bioalignment prompts with a Kelly criterion-inspired evaluation framework. According to this metric, most models were not bioaligned in that they exhibit biases in favor of synthetic (non-biological) solutions. We next examined whether fine-tuning could increase the preferences of two open-weight models, Llama 3.2-3B-Instruct and Qwen2.5-3B-Instruct, for biology-based approaches. A curated corpus of ~22M tokens from 6,636 PMC articles emphasizing biological problem-solving was first used to fine-tune Llama 3B on a mixed corpus of continued-pretraining and instruction-formatted data. This was then extended to Qwen 3B using instruction-formatted data only. We found that QLoRA fine-tuning significantly increased the scoring of biological solutions for both models without degrading general capabilities (Holm-Bonferroni-corrected p < 0.001 and p < 0.01, respectively). This suggests that even a small amount of fine-tuning can change how models weigh the relative value of biological and bioinspired vs. synthetic approaches. Although this work focused on small open-weight LLMs, the approach may be extensible to much larger models and could be used to develop models that favor bio-based approaches. We release the benchmark, corpus, code, and adapter weights.
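The evaluation framework is described only as "Kelly criterion-inspired," and the paper's exact adaptation is not specified here; for reference, the textbook Kelly criterion stakes a fraction of capital proportional to edge over odds, which one could imagine reinterpreting as how much a model would "bet" on a biological solution:

```python
def kelly_fraction(p_win, odds):
    """Classic Kelly criterion: optimal stake fraction given win
    probability p_win and net odds b, clipped at zero (never bet
    on a negative-edge proposition). f* = (b*p - (1-p)) / b."""
    b = odds
    return max(0.0, (b * p_win - (1.0 - p_win)) / b)

kelly_fraction(0.6, 1.0)  # -> 0.2 (bet 20% of capital at even odds)
kelly_fraction(0.2, 1.0)  # -> 0.0 (negative edge: do not bet)
```

How the paper maps model responses onto p_win and odds is not given in the summary above, so this shows only the underlying formula.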
[9] DuplexCascade: Full-Duplex Speech-to-Speech Dialogue with VAD-Free Cascaded ASR-LLM-TTS Pipeline and Micro-Turn Optimization
Jianing Yang, Yusuke Fujita, Yui Sudo
Main category: cs.CL
TL;DR: DuplexCascade: A VAD-free cascaded streaming pipeline for full-duplex speech-to-speech dialogue that converts utterance-wise turns into chunk-wise micro-turns, enabling rapid bidirectional exchange while preserving LLM intelligence through conversational control tokens.
Details
Motivation: Current spoken dialog systems face a trade-off: cascaded ASR-LLM-TTS systems with VAD segmentation provide strong LLM intelligence but force half-duplex turns and brittle control, while VAD-free end-to-end models support full-duplex interaction but struggle to maintain conversational intelligence.
Method: Proposes DuplexCascade, a VAD-free cascaded streaming pipeline that converts conventional utterance-wise long turns into chunk-wise micro-turn interactions. Introduces conversational special control tokens to reliably coordinate turn-taking and response timing under streaming constraints while preserving the strengths of capable text LLMs.
Result: On Full-DuplexBench and VoiceBench, DuplexCascade achieves state-of-the-art full-duplex turn-taking and strong conversational intelligence among open-source speech-to-speech dialogue systems.
Conclusion: DuplexCascade successfully bridges the gap between cascaded systems’ intelligence and end-to-end systems’ full-duplex capabilities by introducing micro-turn interactions and control tokens, enabling both rapid bidirectional exchange and strong conversational intelligence.
Abstract: Spoken dialog systems with cascaded ASR-LLM-TTS modules retain strong LLM intelligence, but VAD segmentation often forces half-duplex turns and brittle control. In contrast, VAD-free end-to-end models support full-duplex interaction but struggle to maintain conversational intelligence. In this paper, we present DuplexCascade, a VAD-free cascaded streaming pipeline for full-duplex speech-to-speech dialogue. Our key idea is to convert conventional utterance-wise long turns into chunk-wise micro-turn interactions, enabling rapid bidirectional exchange while preserving the strengths of a capable text LLM. To reliably coordinate turn-taking and response timing, we introduce a set of conversational special control tokens that steer the LLM’s behavior under streaming constraints. On Full-DuplexBench and VoiceBench, DuplexCascade delivers state-of-the-art full-duplex turn-taking and strong conversational intelligence among open-source speech-to-speech dialogue systems.
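As a rough illustration of the micro-turn idea, each incoming ASR chunk can trigger one LLM decision that emits a control token plus, optionally, text to stream to TTS. This is a hypothetical sketch: the token names (`<listen>`, `<speak>`, `<wait>`) and the toy policy are invented, not the paper's actual control-token set.

```python
LISTEN, SPEAK, WAIT = "<listen>", "<speak>", "<wait>"

def micro_turn_step(asr_chunk: str, llm_decide) -> tuple[str, str]:
    """One micro-turn: feed the latest ASR chunk, get a control token + text.

    llm_decide stands in for the streaming LLM call; it returns
    (control_token, response_text) for the current chunk.
    """
    control, text = llm_decide(asr_chunk)
    if control == SPEAK:
        return control, text   # stream text to TTS immediately
    return control, ""         # keep listening / hold the floor

# Toy policy: speak only once the user's chunk ends a sentence.
def toy_policy(chunk: str):
    return (SPEAK, "Sure!") if chunk.endswith((".", "?")) else (WAIT, "")
```

Because decisions happen per chunk rather than per utterance, the system can interject or yield the floor mid-utterance, which is the "rapid bidirectional exchange" the abstract describes.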
[10] DEO: Training-Free Direct Embedding Optimization for Negation-Aware Retrieval
Taegyeong Lee, Jiwon Park, Seunghyun Hwang, JooYoung Jang
Main category: cs.CL
TL;DR: DEO is a training-free method for negation-aware text and multimodal retrieval that decomposes queries into positive/negative components and optimizes embeddings with contrastive objectives.
Details
Motivation: Existing retrieval methods often fail to accurately handle negation and exclusion queries, and prior approaches require embedding adaptation or fine-tuning, which adds computational cost and deployment complexity.
Method: Direct Embedding Optimization (DEO) decomposes queries into positive and negative components and optimizes the query embedding with a contrastive objective, without requiring additional training data or model updates.
Result: DEO outperforms baselines on NegConstraint with gains of +0.0738 nDCG@10 and +0.1028 MAP@100, while improving Recall@5 by +6% over OpenAI CLIP in multimodal retrieval.
Conclusion: DEO demonstrates practicality for negation- and exclusion-aware retrieval in real-world settings as a training-free method that effectively handles complex queries.
Abstract: Recent advances in Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) have enabled diverse retrieval methods. However, existing retrieval methods often fail to accurately retrieve results for negation and exclusion queries. To address this limitation, prior approaches rely on embedding adaptation or fine-tuning, which introduce additional computational cost and deployment complexity. We propose Direct Embedding Optimization (DEO), a training-free method for negation-aware text and multimodal retrieval. DEO decomposes queries into positive and negative components and optimizes the query embedding with a contrastive objective. Without additional training data or model updates, DEO outperforms baselines on NegConstraint, with gains of +0.0738 nDCG@10 and +0.1028 MAP@100, while improving Recall@5 by +6% over OpenAI CLIP in multimodal retrieval. These results demonstrate the practicality of DEO for negation- and exclusion-aware retrieval in real-world settings.
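The contrastive step can be sketched directly: starting from the query embedding, take gradient steps that raise similarity to the positive component and lower it to the negative one, touching no model weights. A minimal sketch assuming dot-product similarity on unit-norm embeddings; the learning rate, step count, and the upstream query-decomposition step are placeholders, not the paper's settings:

```python
import numpy as np

def deo_optimize(q, pos, neg, lr=0.1, steps=20):
    """Gradient steps on the query embedding q to raise sim(q, pos) and
    lower sim(q, neg); the retrieval model itself is never updated."""
    q = q / np.linalg.norm(q)
    for _ in range(steps):
        grad = pos - neg               # gradient of (q @ pos - q @ neg) w.r.t. q
        q = q + lr * grad
        q = q / np.linalg.norm(q)      # stay on the unit sphere
    return q
```

The optimized embedding is then used as a drop-in replacement for the original query vector at retrieval time.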
[11] Emotion is Not Just a Label: Latent Emotional Factors in LLM Processing
Benjamin Reichman, Adar Avasian, Samuel Webster, Larry Heck
Main category: cs.CL
TL;DR: Study examines how emotional tone affects transformer attention patterns and reasoning, introduces emotionally-balanced QA dataset, and proposes emotional regularization to improve reading comprehension.
Details
Motivation: LLMs process emotionally varied text but are evaluated without considering emotion as a representational factor. Prior work treats emotion as a prediction target rather than studying how it shapes model attention and reasoning.
Method: Analyze how emotional tone alters attention geometry in transformers (locality, center-of-mass distance, entropy). Introduce AURA-QA dataset with emotionally balanced passages. Propose emotional regularization framework to constrain emotion-conditioned representational drift during training.
Result: Attention metrics vary across emotions and correlate with QA performance. Emotional regularization improves reading comprehension across multiple QA benchmarks, yielding gains under distribution shift and in-domain improvements.
Conclusion: Emotion systematically shapes transformer attention and reasoning. Accounting for emotional variation through regularization improves model robustness and performance on reading comprehension tasks.
Abstract: Large language models are routinely deployed on text that varies widely in emotional tone, yet their reasoning behavior is typically evaluated without accounting for emotion as a source of representational variation. Prior work has largely treated emotion as a prediction target, for example in sentiment analysis or emotion classification. In contrast, we study emotion as a latent factor that shapes how models attend to and reason over text. We analyze how emotional tone systematically alters attention geometry in transformer models, showing that metrics such as locality, center-of-mass distance, and entropy vary across emotions and correlate with downstream question-answering performance. To facilitate controlled study of these effects, we introduce Affect-Uniform ReAding QA (AURA-QA), a question-answering dataset with emotionally balanced, human-authored context passages. Finally, an emotional regularization framework is proposed that constrains emotion-conditioned representational drift during training. Experiments across multiple QA benchmarks demonstrate that this approach improves reading comprehension in both emotionally-varying and non-emotionally varying datasets, yielding consistent gains under distribution shift and in-domain improvements on several benchmarks.
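The attention-geometry metrics can be made concrete with standard formulations; these are plausible definitions for entropy and center-of-mass distance over one attention row, not necessarily the paper's exact ones:

```python
import numpy as np

def attention_entropy(attn_row):
    """Shannon entropy of one (non-negative) attention distribution."""
    p = attn_row / attn_row.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def center_of_mass_distance(attn_row, query_pos):
    """Distance from the query position to the attention center of mass."""
    positions = np.arange(len(attn_row))
    p = attn_row / attn_row.sum()
    return float(abs((p * positions).sum() - query_pos))
```

High entropy indicates diffuse attention; a large center-of-mass distance indicates attention pulled away from the query token, the kind of shift the paper correlates with QA performance across emotions.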
[12] SPAR-K: Scheduled Periodic Alternating Early Exit for Spoken Language Models
Hsiao-Ying Huang, Cheng-Han Chiang, Hung-yi Lee
Main category: cs.CL
TL;DR: SPAR-K is a modality-aware early exit framework that accelerates interleaved spoken language model inference by having most speech positions exit at intermediate layers while using periodic full-depth refresh steps to maintain quality.
Details
Motivation: Interleaved spoken language models that generate both text and speech tokens are computationally expensive due to full transformer depth processing for every step, especially for long speech sequences. There's a need to accelerate inference while preserving perceptual quality.
Method: Proposes SPAR-K with speech alternating-depth schedule: most speech positions exit at a fixed intermediate layer, while periodic full-depth “refresh” steps mitigate distribution shift from early exit. Evaluated on Step-Audio-2-mini and GLM-4-Voice across reasoning, factual QA, and dialogue datasets.
Result: SPAR-K preserves question-answering accuracy with max 0.82% drop, reduces average speech decoding depth by up to 11% on Step-Audio-2-mini and 5% on GLM-4-Voice, with negligible changes in MOS and WER, and no auxiliary computation overhead. Shows confidence-based early exit strategies from text LLMs are suboptimal for SLMs.
Conclusion: Speech tokens have unique statistical nature requiring specialized early exit design. SPAR-K effectively accelerates interleaved SLM inference while maintaining quality, demonstrating the need for modality-aware optimization approaches.
Abstract: Interleaved spoken language models (SLMs) alternately generate text and speech tokens, but decoding at full transformer depth for every step becomes costly, especially due to long speech sequences. We propose SPAR-K, a modality-aware early exit framework designed to accelerate interleaved SLM inference while preserving perceptual quality. SPAR-K introduces a speech alternating-depth schedule: most speech positions exit at a fixed intermediate layer, while periodic full-depth “refresh” steps mitigate distribution shift due to early exit. We evaluate our framework using Step-Audio-2-mini and GLM-4-Voice across four datasets spanning reasoning, factual QA, and dialogue tasks, measuring performance in terms of ASR transcription accuracy and perceptual quality. Experimental results demonstrate that SPAR-K largely preserves question-answering accuracy with a maximum accuracy drop of 0.82% while reducing average speech decoding depth by up to 11% on Step-Audio-2-mini and 5% on GLM-4-Voice, both with negligible changes in MOS and WER and no auxiliary computation overhead. We further demonstrate that confidence-based early exit strategies, widely used in text LLMs, are suboptimal for SLMs, highlighting that the unique statistical nature of speech tokens necessitates a specialized early exit design.
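The alternating-depth schedule reduces to a simple per-step rule; the exit layer, depth, and refresh period below are illustrative placeholders, not the paper's tuned values:

```python
def exit_depth(step: int, modality: str, n_layers: int = 32,
               exit_layer: int = 20, refresh_every: int = 8) -> int:
    """Number of transformer layers to run for this decoding step."""
    if modality == "text":
        return n_layers            # text tokens always run full depth
    if step % refresh_every == 0:
        return n_layers            # periodic full-depth "refresh" step
    return exit_layer              # early exit for most speech steps
```

Because the schedule is fixed rather than confidence-based, it adds no auxiliary computation, consistent with the paper's finding that confidence-based exit strategies from text LLMs transfer poorly to speech tokens.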
[13] LooComp: Leverage Leave-One-Out Strategy to Encoder-only Transformer for Efficient Query-aware Context Compression
Thao Do, Dinh Phu Tran, An Vo, Seon Kwon Kim, Daeyoung Kim
Main category: cs.CL
TL;DR: A margin-based framework for query-driven context pruning in Retrieval Augmented Generation that identifies critical sentences by measuring changes in clue richness when omitted, achieving efficient compression without degrading QA performance.
Details
Motivation: Efficient context compression is crucial for improving accuracy and scalability of question answering in Retrieval Augmented Generation (RAG). Context should be delivered fast, compact, and precise to ensure clue sufficiency while being budget-friendly for LLM readers.
Method: Proposes a margin-based framework for query-driven context pruning that identifies sentences critical for answering queries by measuring changes in clue richness when omitted. Uses a composite ranking loss that enforces large margins for critical sentences while keeping non-critical ones near neutral. Built on a lightweight encoder-only Transformer architecture.
Result: Achieves strong exact-match and F1 scores with high-throughput inference and lower memory requirements than major baselines. Yields effective compression ratios without degrading answering performance.
Conclusion: The method demonstrates potential as a lightweight and practical alternative for retrieval-augmented tasks, offering efficient context compression while maintaining QA performance.
Abstract: Efficient context compression is crucial for improving the accuracy and scalability of question answering. For the efficiency of Retrieval Augmented Generation, context should be delivered fast, compact, and precise to ensure clue sufficiency and budget-friendly LLM reader cost. We propose a margin-based framework for query-driven context pruning, which identifies sentences that are critical for answering a query by measuring changes in clue richness when they are omitted. The model is trained with a composite ranking loss that enforces large margins for critical sentences while keeping non-critical ones near neutral. Built on a lightweight encoder-only Transformer, our approach generally achieves strong exact-match and F1 scores with high-throughput inference and lower memory requirements than those of major baselines. In addition to efficiency, our method yields effective compression ratios without degrading answering performance, demonstrating its potential as a lightweight and practical alternative for retrieval-augmented tasks.
[14] TaSR-RAG: Taxonomy-guided Structured Reasoning for Retrieval-Augmented Generation
Jiashuo Sun, Yixuan Xie, Jimeng Shi, Shaowen Wang, Jiawei Han
Main category: cs.CL
TL;DR: TaSR-RAG: A taxonomy-guided structured reasoning framework for RAG that represents queries/documents as relational triples and performs step-wise evidence selection with explicit entity binding.
Details
Motivation: Current RAG systems retrieve unstructured chunks and rely on one-shot generation, leading to redundant context, low information density, and brittle multi-hop reasoning. Structured RAG pipelines often require costly graph construction or impose rigid entity-centric structures that don't align with query reasoning chains.
Method: Represents queries and documents as relational triples with a lightweight two-level taxonomy to constrain entity semantics. Decomposes complex questions into ordered sequences of triple sub-queries with explicit latent variables, then performs step-wise evidence selection via hybrid triple matching combining semantic similarity over raw triples with structural consistency over typed triples. Maintains an explicit entity binding table across steps.
Result: Outperforms strong RAG and structured-RAG baselines by up to 14% on multiple multi-hop question answering benchmarks, while producing clearer evidence attribution and more faithful reasoning traces.
Conclusion: TaSR-RAG provides an effective structured reasoning framework for RAG that improves multi-hop question answering without requiring explicit graph construction or exhaustive search, offering better evidence selection and reasoning transparency.
Abstract: Retrieval-Augmented Generation (RAG) helps large language models (LLMs) answer knowledge-intensive and time-sensitive questions by conditioning generation on external evidence. However, most RAG systems still retrieve unstructured chunks and rely on one-shot generation, which often yields redundant context, low information density, and brittle multi-hop reasoning. While structured RAG pipelines can improve grounding, they typically require costly and error-prone graph construction or impose rigid entity-centric structures that do not align with the query’s reasoning chain. We propose TaSR-RAG, a taxonomy-guided structured reasoning framework for evidence selection. We represent both queries and documents as relational triples, and constrain entity semantics with a lightweight two-level taxonomy to balance generalization and precision. Given a complex question, TaSR-RAG decomposes it into an ordered sequence of triple sub-queries with explicit latent variables, then performs step-wise evidence selection via hybrid triple matching that combines semantic similarity over raw triples with structural consistency over typed triples. By maintaining an explicit entity binding table across steps, TaSR-RAG resolves intermediate variables and reduces entity conflation without explicit graph construction or exhaustive search. Experiments on multiple multi-hop question answering benchmarks show that TaSR-RAG consistently outperforms strong RAG and structured-RAG baselines by up to 14%, while producing clearer evidence attribution and more faithful reasoning traces.
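Hybrid triple matching, as described, blends semantic similarity over raw triples with structural consistency over taxonomy-typed triples. A minimal sketch with an invented blend weight `alpha` and a caller-supplied `sem_sim` (e.g. embedding cosine); the real scoring function is not given in this summary:

```python
def hybrid_triple_score(q_triple, d_triple, q_types, d_types,
                        sem_sim, alpha=0.5):
    """Blend semantic similarity of raw (subject, relation, object) triples
    with agreement of their taxonomy types (one type per triple slot)."""
    semantic = sem_sim(q_triple, d_triple)
    structural = sum(a == b for a, b in zip(q_types, d_types)) / 3
    return alpha * semantic + (1 - alpha) * structural
```

Scoring candidate evidence triples this way lets a step-wise selector prefer matches that agree both in meaning and in entity type, which is what reduces entity conflation across hops.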
[15] Quantifying and extending the coverage of spatial categorization data sets
Wanchun Li, Alexandra Carstensen, Yang Xu, Terry Regier, Charles Kemp
Main category: cs.CL
TL;DR: LLMs can generate spatial relation labels that align with human categorization, helping to extend spatial datasets like TRPS with better scene coverage.
Details
Motivation: To address limitations in existing spatial categorization datasets (TRPS) by leveraging LLMs to generate labels that align with human judgments, enabling more efficient scaling of spatial datasets across languages and scenes.
Method: Use LLMs to generate spatial relation labels for scenes, compare with human labels, extend TRPS with 42 new scenes selected using LLM guidance, and evaluate coverage improvement over previous extensions.
Result: LLM-generated labels align well with human labels, and the LLM-guided extension achieves better coverage of possible scene space than previous TRPS extensions.
Conclusion: LLMs can effectively support spatial dataset scaling by generating human-aligned labels and guiding scene selection for better coverage across languages.
Abstract: Variation in spatial categorization across languages is often studied by eliciting human labels for the relations depicted in a set of scenes known as the Topological Relations Picture Series (TRPS). We demonstrate that labels generated by large language models (LLMs) align relatively well with human labels, and show how LLM-generated labels can help to decide which scenes and languages to add to existing spatial data sets. To illustrate our approach we extend the TRPS by adding 42 new scenes, and show that this extension achieves better coverage of the space of possible scenes than two previous extensions of the TRPS. Our results provide a foundation for scaling towards spatial data sets with dozens of languages and hundreds of scenes.
[16] Reward Prediction with Factorized World States
Yijun Shen, Delong Chen, Xianming Hu, Jiaming Mi, Hongbo Zhao, Kai Zhang, Pascale Fung
Main category: cs.CL
TL;DR: StateFactory transforms unstructured observations into hierarchical object-attribute representations using language models to enable accurate reward prediction across domains without supervised reward learning biases.
Details
Motivation: Supervised learning of reward models can introduce biases from training data, limiting generalization to novel goals and environments. The paper investigates whether well-defined world state representations alone can enable accurate reward prediction across domains.
Method: Introduces StateFactory, a factorized representation method that transforms unstructured observations into a hierarchical object-attribute structure using language models. This structured representation allows rewards to be estimated as semantic similarity between current state and goal state under hierarchical constraints.
Result: Evaluated on RewardPrediction benchmark (5 domains, 2,454 trajectories). Achieves 60% lower EPIC distance than VLWM-critic and 8% lower than LLM-as-a-Judge reward models. Translates to improved agent planning: +21.64% success rate on AlfWorld and +12.40% on ScienceWorld over reactive policies.
Conclusion: The compact representation structure induced by StateFactory enables strong reward generalization capabilities, successfully translating superior reward quality into improved agent planning performance across diverse domains.
Abstract: Agents must infer action outcomes and select actions that maximize a reward signal indicating how close the goal is to being reached. Supervised learning of reward models could introduce biases inherent to training data, limiting generalization to novel goals and environments. In this paper, we investigate whether well-defined world state representations alone can enable accurate reward prediction across domains. To address this, we introduce StateFactory, a factorized representation method that transforms unstructured observations into a hierarchical object-attribute structure using language models. This structured representation allows rewards to be estimated naturally as the semantic similarity between the current state and the goal state under hierarchical constraint. Overall, the compact representation structure induced by StateFactory enables strong reward generalization capabilities. We evaluate on RewardPrediction, a new benchmark dataset spanning five diverse domains and comprising 2,454 unique action-observation trajectories with step-wise ground-truth rewards. Our method shows promising zero-shot results against both VLWM-critic and LLM-as-a-Judge reward models, achieving 60% and 8% lower EPIC distance, respectively. Furthermore, this superior reward quality successfully translates into improved agent planning performance, yielding success rate gains of +21.64% on AlfWorld and +12.40% on ScienceWorld over reactive system-1 policies and enhancing system-2 agent planning. Project Page: https://statefactory.github.io
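Once states are factorized into object-to-attribute maps, a reward can be read off as the fraction of goal pairs the current state satisfies. This exact-match rule is a simple illustration in the spirit of StateFactory, not its actual (semantic-similarity-based) scoring function:

```python
def state_reward(current: dict, goal: dict) -> float:
    """Fraction of goal object-attribute pairs satisfied in the current state.

    Both arguments map object name -> {attribute: value}, e.g.
    {"mug": {"location": "sink", "clean": True}}.
    """
    total = satisfied = 0
    for obj, attrs in goal.items():
        for attr, value in attrs.items():
            total += 1
            if current.get(obj, {}).get(attr) == value:
                satisfied += 1
    return satisfied / total if total else 1.0
```

Because the reward is computed from the representation rather than learned from reward labels, it carries no supervision bias toward particular goals, which is the generalization argument the abstract makes.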
[17] LLM as a Meta-Judge: Synthetic Data for NLP Evaluation Metric Validation
Lukáš Eigler, Jindřich Libovický, David Hurych
Main category: cs.CL
TL;DR: LLM as a Meta-Judge: A framework using LLMs to generate synthetic evaluation datasets via controlled semantic degradation of real data, replacing expensive human annotations for NLG metric validation.
Details
Motivation: Human annotations for NLG evaluation are expensive, time-consuming, and predominantly exist only for English datasets, limiting multilingual evaluation metric validation.
Method: Proposes using LLMs to generate synthetic evaluation datasets through controlled semantic degradation of real data, then validates using meta-correlation between metric rankings from synthetic data and human benchmarks.
Result: Synthetic validation serves as reliable proxy for human judgment, achieving meta-correlations exceeding 0.9 in multilingual QA, and proves viable where human judgments are unavailable or too expensive.
Conclusion: LLM-generated synthetic evaluation datasets provide scalable alternative to human annotations for validating NLG metrics, especially valuable for multilingual contexts where human judgments are scarce.
Abstract: Validating evaluation metrics for NLG typically relies on expensive and time-consuming human annotations, which predominantly exist only for English datasets. We propose LLM as a Meta-Judge, a scalable framework that utilizes LLMs to generate synthetic evaluation datasets via controlled semantic degradation of real data, replacing human judgment. We validate our approach using meta-correlation, measuring the alignment between metric rankings derived from synthetic data and those from standard human benchmarks. Experiments across Machine Translation, Question Answering, and Summarization demonstrate that synthetic validation serves as a reliable proxy for human judgment, achieving meta-correlations exceeding 0.9 in multilingual QA and proving a viable alternative where human judgments are unavailable or too expensive to obtain. Our code and data will become publicly available upon paper acceptance.
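Meta-correlation, as used above, ranks a set of candidate metrics under the synthetic-data regime and under human benchmarks, then correlates the two rankings. A self-contained sketch using Spearman's formula, assuming no tied scores (the paper's exact correlation variant is not stated in this summary):

```python
def spearman(xs, ys):
    """Spearman rank correlation for untied score lists."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

def meta_correlation(synthetic_scores, human_scores):
    """Each argument: one quality score per candidate metric, under the
    synthetic validation set and the human benchmark respectively."""
    return spearman(synthetic_scores, human_scores)
```

A meta-correlation near 1.0 means the synthetic dataset would have picked the same best metric as the human benchmark, which is the sense in which it serves as a proxy.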
[18] Investigating Gender Stereotypes in Large Language Models via Social Determinants of Health
Trung Hieu Ngo, Adrien Bazoge, Solen Quiniou, Pierre-Antoine Gourraud, Emmanuel Morin
Main category: cs.CL
TL;DR: LLMs propagate biases from training data, especially in healthcare. This study examines gender-SDoH interactions in French patient records, finding LLMs use embedded stereotypes for gendered decisions.
Details
Motivation: LLMs often propagate biases from training data, which is particularly concerning in sensitive domains like healthcare. Existing benchmarks evaluate individual social determinants of health (SDoH) but overlook interactions between factors and lack context-specific assessments.
Method: The study investigates bias in LLMs by probing relationships between gender and other SDoH in French patient records through a series of experiments examining how embedded stereotypes can be probed using SDoH input.
Result: Found that embedded stereotypes can be probed using SDoH input and that LLMs rely on embedded stereotypes to make gendered decisions, suggesting interactions among SDoH factors should be evaluated.
Conclusion: Evaluating interactions among SDoH factors could usefully complement existing approaches to assessing LLM performance and bias, especially in healthcare contexts.
Abstract: Large Language Models (LLMs) excel in Natural Language Processing (NLP) tasks, but they often propagate biases embedded in their training data, which is potentially impactful in sensitive domains like healthcare. While existing benchmarks evaluate biases related to individual social determinants of health (SDoH) such as gender or ethnicity, they often overlook interactions between these factors and lack context-specific assessments. This study investigates bias in LLMs by probing the relationships between gender and other SDoH in French patient records. Through a series of experiments, we found that embedded stereotypes can be probed using SDoH input and that LLMs rely on embedded stereotypes to make gendered decisions, suggesting that evaluating interactions among SDoH factors could usefully complement existing approaches to assessing LLM performance and bias.
[19] Common Sense vs. Morality: The Curious Case of Narrative Focus Bias in LLMs
Saugata Purkayastha, Pranav Kushare, Pragya Paramita Pal, Sukannya Purkayastha
Main category: cs.CL
TL;DR: LLMs prioritize moral reasoning over commonsense understanding, struggling to detect commonsense contradictions in moral dilemmas, especially when contradictions involve narrator characters.
Details
Motivation: As LLMs are increasingly deployed in real-world applications across diverse communities, it's crucial they maintain both moral grounding and knowledge awareness. The authors identified a critical limitation where LLMs tend to prioritize moral reasoning over commonsense understanding, which could lead to problematic outputs in practical applications.
Method: The authors introduced CoMoral, a novel benchmark dataset containing commonsense contradictions embedded within moral dilemmas. They conducted extensive evaluations of ten LLMs across different model sizes to analyze their ability to identify contradictions without prior signal.
Result: Existing LLMs consistently struggle to identify commonsense contradictions in moral dilemmas. The study also revealed a pervasive narrative focus bias - LLMs more readily detect contradictions when attributed to secondary characters rather than primary (narrator) characters.
Conclusion: The findings underscore the need for enhanced reasoning-aware training to improve the commonsense robustness of large language models, particularly in balancing moral reasoning with factual commonsense understanding.
Abstract: Large Language Models (LLMs) are increasingly deployed across diverse real-world applications and user communities. As such, it is crucial that these models remain both morally grounded and knowledge-aware. In this work, we uncover a critical limitation of current LLMs – their tendency to prioritize moral reasoning over commonsense understanding. To investigate this phenomenon, we introduce CoMoral, a novel benchmark dataset containing commonsense contradictions embedded within moral dilemmas. Through extensive evaluation of ten LLMs across different model sizes, we find that existing models consistently struggle to identify such contradictions without prior signal. Furthermore, we observe a pervasive narrative focus bias, wherein LLMs more readily detect commonsense contradictions when they are attributed to a secondary character rather than the primary (narrator) character. Our comprehensive analysis underscores the need for enhanced reasoning-aware training to improve the commonsense robustness of large language models.
[20] Modelling the Diachronic Emergence of Phoneme Frequency Distributions
Fermín Moscoso del Prado Martín, Suchir Salhan
Main category: cs.CL
TL;DR: A stochastic model of phonological change shows that statistical regularities in phoneme frequency distributions can emerge from historical processes rather than explicit optimization.
Details
Motivation: Phoneme frequency distributions show consistent statistical patterns across languages, but their origins remain unexplained. The paper investigates whether these patterns arise from historical processes shaping phonological systems rather than from optimization mechanisms.
Method: Developed a stochastic model of phonological change to simulate diachronic evolution of phoneme inventories. Started with a naive model, then extended it with two additional assumptions: functional load effects and a stabilizing tendency toward preferred inventory size.
Result: The naive model reproduced general shape of phoneme rank-frequency distributions but failed to capture other empirical properties. The extended model with functional load and inventory size stabilization matched both observed distributions and the negative relationship between inventory size and relative entropy.
Conclusion: Statistical regularities in phonological systems may arise as natural consequences of diachronic sound change rather than from explicit optimization or compensatory mechanisms.
Abstract: Phoneme frequency distributions exhibit robust statistical regularities across languages, including exponential-tailed rank-frequency patterns and a negative relationship between phonemic inventory size and the relative entropy of the distribution. The origin of these patterns remains largely unexplained. In this paper, we investigate whether they can arise as consequences of the historical processes that shape phonological systems. We introduce a stochastic model of phonological change and simulate the diachronic evolution of phoneme inventories. A naïve version of the model reproduces the general shape of phoneme rank-frequency distributions but fails to capture other empirical properties. Extending the model with two additional assumptions – an effect related to functional load and a stabilising tendency toward a preferred inventory size – yields simulations that match both the observed distributions and the negative relationship between inventory size and relative entropy. These results suggest that some statistical regularities of phonological systems may arise as natural consequences of diachronic sound change rather than from explicit optimisation or compensatory mechanisms.
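A toy version of such a stochastic model makes the idea concrete: phonemes randomly split (a new phoneme takes part of an existing one's frequency) or merge, and a skewed rank-frequency curve emerges from the history alone. All rates below are invented, and this mirrors only the "naive" variant described above, without the functional-load or inventory-size extensions:

```python
import random

def simulate(n_steps=500, seed=0):
    """Simulate diachronic split/merge events on phoneme frequencies and
    return the resulting normalized rank-frequency distribution."""
    rng = random.Random(seed)
    freqs = [1.0]                              # start with a single phoneme
    for _ in range(n_steps):
        i = rng.randrange(len(freqs))
        if rng.random() < 0.5 or len(freqs) == 1:
            share = rng.random() * freqs[i]    # split: new phoneme takes a share
            freqs[i] -= share
            freqs.append(share)
        else:
            j = rng.randrange(len(freqs))      # merge phoneme i into phoneme j
            if j != i:
                val = freqs.pop(i)
                if j > i:
                    j -= 1
                freqs[j] += val
    total = sum(freqs)
    return sorted((f / total for f in freqs), reverse=True)
```

Since every step conserves total frequency, the final distribution is a valid probability distribution whose shape is determined entirely by the random history of changes.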
[21] You Didn’t Have to Say It like That: Subliminal Learning from Faithful Paraphrases
Isaia Gisler, Zhonghao He, Tianyi Qiu
Main category: cs.CL
TL;DR: Subliminal learning enables behavioral trait transmission from teacher to student models via synthetic training data, even through semantically unrelated paraphrases or content contradicting teacher preferences.
Details
Motivation: The paper investigates whether subliminal learning (transmission of behavioral traits via unrelated training data) occurs through natural language paraphrases, and whether contradictory content can block this transmission, addressing concerns about models generating their own training data.
Method: Researchers trained student models on paraphrases from teacher models system-prompted to love specific animals, testing transmission through semantically unrelated content and explicitly contradictory expressions of dislike, with aggressive filtering for paraphrase fidelity.
Result: Training on teacher-generated paraphrases increased student preference for the target animal by up to 19 percentage points, even when content was semantically unrelated or explicitly expressed dislike, with transmission succeeding despite aggressive filtering.
Conclusion: Subliminal transmission occurs through natural language paraphrases and cannot be prevented by contradictory content, raising serious concerns for pipelines where models generate their own training data since content-based inspection cannot detect such transmission.
Abstract: When language models are trained on synthetic data, the trained model (the student) can covertly acquire behavioral traits from the data-generating model (the teacher). Subliminal learning refers to the transmission of traits from a teacher to a student model via training on data unrelated to those traits. Prior work demonstrated this in the training domains of number sequences, code, and math Chain-of-Thought traces, including transmission of misaligned behaviors. We investigate whether transmission occurs through natural language paraphrases with fixed semantic content, and whether content explicitly contradicting the teacher’s preference can block it. We find that training on paraphrases from a teacher system-prompted to love a particular animal increases a student’s preference for that animal by up to 19 percentage points. This occurs when paraphrased content is semantically unrelated to the animal, or even when it explicitly expresses dislike. The transmission succeeds despite aggressive filtering to ensure paraphrase fidelity. This raises concerns for pipelines where models generate their own training data: content-based inspection cannot detect such transmission, and even preference-contradicting content fails to prevent it.
[22] ALARM: Audio-Language Alignment for Reasoning Models
Petr Grinberg, Hassan Shahmohammadi
Main category: cs.CL
TL;DR: A 4B-parameter audio language model that extends reasoning LLMs with audio understanding using self-rephrasing to handle chain-of-thought traces, multiple fused audio encoders, and training on a large multi-task corpus of speech, music, and sound.
Details
Motivation: Existing approaches for audio language models that freeze LLMs and train only adapters fail for reasoning LLMs because their chain-of-thought traces expose textual surrogate inputs, leading to unnatural responses. There's a need for better integration of audio understanding with reasoning capabilities.
Method: Proposes self-rephrasing to convert self-generated responses into audio-understanding variants compatible with reasoning LLMs while preserving distributional alignment. Also fuses and compresses multiple audio encoders for stronger representations. Trains on a 6M-instance multi-task corpus spanning 19K hours of speech, music, and sound.
Result: The 4B-parameter ALM outperforms similarly sized models and surpasses most larger ALMs on audio-reasoning benchmarks. Achieves best open-source result on MMAU-speech and MMSU benchmarks, ranking third among all models. Preserves textual capabilities with low training cost.
Conclusion: The proposed self-rephrasing approach successfully integrates audio understanding with reasoning LLMs, achieving state-of-the-art performance on audio reasoning benchmarks while maintaining textual capabilities efficiently.
Abstract: Large audio language models (ALMs) extend LLMs with auditory understanding. A common approach freezes the LLM and trains only an adapter on self-generated targets. However, this fails for reasoning LLMs (RLMs) whose built-in chain-of-thought traces expose the textual surrogate input, yielding unnatural responses. We propose self-rephrasing, converting self-generated responses into audio-understanding variants compatible with RLMs while preserving distributional alignment. We further fuse and compress multiple audio encoders for stronger representations. For training, we construct a 6M-instance multi-task corpus (2.5M unique prompts) spanning 19K hours of speech, music, and sound. Our 4B-parameter ALM outperforms similarly sized models and surpasses most larger ALMs on related audio-reasoning benchmarks, while preserving textual capabilities with a low training cost. Notably, we achieve the best open-source result on the MMAU-speech and MMSU benchmarks and rank third among all the models.
[23] Rethinking Discrete Speech Representation Tokens for Accent Generation
Jinzuomu Zhong, Yi Wang, Korin Richmond, Peter Bell
Main category: cs.CL
[24] Build, Borrow, or Just Fine-Tune? A Political Scientist’s Guide to Choosing NLP Models
Shreyas Meher
Main category: cs.CL
TL;DR: Fine-tuning ModernBERT on terrorism data (Confli-mBERT) performs nearly as well as domain-specific ConfliBERT for common event types, with differences mainly in rare categories, providing a practical framework for NLP model choice in political science.
Details
Motivation: Political scientists need empirical guidance on choosing between building domain-specific models, adapting existing ones, or fine-tuning general models, balancing performance, cost, and expertise requirements.
Method: Fine-tuned ModernBERT on the Global Terrorism Database to create Confli-mBERT, then systematically compared it against domain-specific ConfliBERT for conflict event classification, analyzing performance across different event frequency categories.
Result: Confli-mBERT achieved 75.46% accuracy vs ConfliBERT’s 79.34%, with performance differences concentrated in rare event categories (<2% of incidents); models were nearly indistinguishable for high-frequency attack types like Bombing/Explosion and Kidnapping.
Conclusion: Model choice should depend on specific intersection of class prevalence, error tolerance, and available resources rather than abstract “better” models, with fine-tuned general models often sufficient for common tasks.
Abstract: Political scientists increasingly face a consequential choice when adopting natural language processing tools: build a domain-specific model from scratch, borrow and adapt an existing one, or simply fine-tune a general-purpose model on task data? Each approach occupies a different point on the spectrum of performance, cost, and required expertise, yet the discipline has offered little empirical guidance on how to navigate this trade-off. This paper provides such guidance. Using conflict event classification as a test case, I fine-tune ModernBERT on the Global Terrorism Database (GTD) to create Confli-mBERT and systematically compare it against ConfliBERT, a domain-specific pretrained model that represents the current gold standard. Confli-mBERT achieves 75.46% accuracy compared to ConfliBERT’s 79.34%. Critically, the four-percentage-point gap is not uniform: on high-frequency attack types such as Bombing/Explosion (F1 = 0.95 vs. 0.96) and Kidnapping (F1 = 0.92 vs. 0.91), the models are nearly indistinguishable. Performance differences concentrate in rare event categories comprising fewer than 2% of all incidents. I use these findings to develop a practical decision framework for political scientists considering any NLP-assisted research task: when does the research question demand a specialized model, and when does an accessible fine-tuned alternative suffice? The answer, I argue, depends not on which model is “better” in the abstract, but on the specific intersection of class prevalence, error tolerance, and available resources. The model, training code, and data are publicly available on Hugging Face.
[25] Surgical Repair of Collapsed Attention Heads in ALiBi Transformers
Palmer Schallon
Main category: cs.CL
TL;DR: ALiBi positional encoding in BLOOM models causes systematic attention collapse, with many heads attending only to the beginning-of-sequence token; a surgical reinitialization technique recovers head functionality and reveals that pretrained attention configurations are suboptimal.
Details
Motivation: The paper identifies a systematic pathology in BLOOM transformer models where ALiBi positional encoding causes attention heads to collapse, attending almost entirely to the beginning-of-sequence token, which significantly reduces model capacity and performance.
Method: Introduces surgical reinitialization: targeted Q/K/V reinitialization with zeroed output projections and gradient-masked freezing of all non-surgical parameters. Applied to BLOOM-1b7 on a single GPU, recovering attention head functionality in two passes.
Result: Recovered 98.7% operational head capacity (242 to 379 of 384 heads). Reinitialization produced a model that transiently outperformed stock BLOOM-1b7 by 25% on training perplexity (12.70 vs. 16.99), suggesting pretrained attention configurations are suboptimal local minima.
Conclusion: ALiBi positional encoding causes systematic attention collapse in BLOOM models, but surgical reinitialization can effectively recover head functionality and reveals that pretrained attention configurations may be suboptimal, opening avenues for model improvement.
Abstract: We identify a systematic attention collapse pathology in the BLOOM family of transformer language models, where ALiBi positional encoding causes 31-44% of attention heads to attend almost entirely to the beginning-of-sequence token. The collapse follows a predictable pattern across four model scales (560M to 7.1B parameters), concentrating in head indices where ALiBi’s slope schedule imposes the steepest distance penalties. We introduce surgical reinitialization: targeted Q/K/V reinitialization with zeroed output projections and gradient-masked freezing of all non-surgical parameters. Applied to BLOOM-1b7 on a single consumer GPU, the technique recovers 98.7% operational head capacity (242 to 379 of 384 heads) in two passes. A controlled comparison with C4 training data confirms that reinitialization – not corpus content – drives recovery, and reveals two distinct post-surgical phenomena: early global functional redistribution that improves the model, and late local degradation that accumulates under noisy training signal. An extended experiment reinitializing mostly-healthy heads alongside collapsed ones produces a model that transiently outperforms stock BLOOM-1b7 by 25% on training perplexity (12.70 vs. 16.99), suggesting that pretrained attention configurations are suboptimal local minima. Code, checkpoints, and diagnostic tools are released as open-source software.
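The surgical step can be sketched on a toy fused-QKV attention layer. This is a minimal illustration under assumptions, not the paper's released code: the collapsed-head indices, the 0.02 init scale, and the weights-only gradient mask are choices made for the sketch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

N_HEADS, HEAD_DIM = 4, 8
D_MODEL = N_HEADS * HEAD_DIM

# Toy stand-ins for one attention layer's fused QKV and output projections,
# mirroring the layout of a BLOOM-style attention block.
qkv = nn.Linear(D_MODEL, 3 * D_MODEL)
out_proj = nn.Linear(D_MODEL, D_MODEL)

def surgical_reinit(qkv, out_proj, collapsed_heads):
    """Reinitialize Q/K/V rows of collapsed heads, zero the output-projection
    columns that read from them, and gradient-mask everything else so only
    the operated-on slices train (weights only, for brevity)."""
    grad_mask = torch.zeros_like(qkv.weight)
    with torch.no_grad():
        for h in collapsed_heads:
            head_rows = slice(h * HEAD_DIM, (h + 1) * HEAD_DIM)
            for block in range(3):  # the Q, K, and V slabs of the fused weight
                sl = slice(block * D_MODEL + h * HEAD_DIM,
                           block * D_MODEL + (h + 1) * HEAD_DIM)
                nn.init.normal_(qkv.weight[sl], std=0.02)  # assumed init scale
                qkv.bias[sl].zero_()
                grad_mask[sl] = 1.0
            # Zeroed output projection: the fresh head starts silent and is
            # grown back by training instead of injecting noise immediately.
            out_proj.weight[:, head_rows].zero_()
    qkv.weight.register_hook(lambda g: g * grad_mask)
    return grad_mask

mask = surgical_reinit(qkv, out_proj, collapsed_heads=[1, 3])
```

After this, a backward pass through `qkv` produces nonzero gradients only in the reinitialized slices, so the frozen, healthy heads are untouched by further training.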
[26] GateLens: A Reasoning-Enhanced LLM Agent for Automotive Software Release Analytics
Arsham Gholamzadeh Khoee, Shuai Wang, Robert Feldt, Dhasarathy Parthasarathy, Yinan Yu
Main category: cs.CL
[27] Tracking Cancer Through Text: Longitudinal Extraction From Radiology Reports Using Open-Source Large Language Models
Luc Builtjes, Alessa Hering
Main category: cs.CL
TL;DR: Open-source LLM pipeline for extracting longitudinal tumor data from radiology reports using Qwen2.5-72B model with high accuracy for clinical oncology applications.
Details
Motivation: Radiology reports contain crucial longitudinal tumor information but are unstructured, and proprietary LLM systems limit applicability in privacy-sensitive healthcare environments.
Method: Developed fully open-source, locally deployable pipeline using llm_extractinator framework with qwen2.5-72b model to extract and link target, non-target, and new lesion data across time points according to RECIST criteria.
Result: Evaluation on 50 Dutch CT Thorax/Abdomen report pairs showed high extraction performance: 93.7% accuracy for target lesions, 94.9% for non-target lesions, and 94.0% for new lesions.
Conclusion: Open-source LLMs can achieve clinically meaningful performance in multi-timepoint oncology tasks while ensuring data privacy and reproducibility, highlighting potential for locally deployable LLMs in clinical text analysis.
Abstract: Radiology reports capture crucial longitudinal information on tumor burden, treatment response, and disease progression, yet their unstructured narrative format complicates automated analysis. While large language models (LLMs) have advanced clinical text processing, most state-of-the-art systems remain proprietary, limiting their applicability in privacy-sensitive healthcare environments. We present a fully open-source, locally deployable pipeline for longitudinal information extraction from radiology reports, implemented using the llm_extractinator framework. The system applies the qwen2.5-72b model to extract and link target, non-target, and new lesion data across time points in accordance with RECIST criteria. Evaluation on 50 Dutch CT Thorax/Abdomen report pairs yielded high extraction performance, with attribute-level accuracies of 93.7% for target lesions, 94.9% for non-target lesions, and 94.0% for new lesions. The approach demonstrates that open-source LLMs can achieve clinically meaningful performance in multi-timepoint oncology tasks while ensuring data privacy and reproducibility. These results highlight the potential of locally deployable LLMs for scalable extraction of structured longitudinal data from routine clinical text.
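For illustration, once lesion measurements have been extracted and linked across timepoints, a RECIST-style response label can be derived from the change in the sum of target-lesion diameters. This is a deliberately simplified sketch, not the paper's pipeline: it compares to baseline rather than nadir and omits RECIST 1.1's 5 mm absolute-increase rule.

```python
def recist_response(baseline_mm, followup_mm, new_lesions=False):
    """Classify response from linked target-lesion diameters (mm) at two
    timepoints, using the RECIST 1.1 percentage thresholds:
    >=30% decrease -> partial response (PR), >=20% increase -> progressive
    disease (PD). Simplified: baseline comparison only, no 5 mm rule."""
    if new_lesions:
        return "PD"  # any new lesion means progressive disease
    base, follow = sum(baseline_mm), sum(followup_mm)
    if follow == 0:
        return "CR"  # all target lesions disappeared: complete response
    change = (follow - base) / base
    if change <= -0.30:
        return "PR"
    if change >= 0.20:
        return "PD"
    return "SD"  # stable disease

# Two linked target lesions shrinking from 20 mm and 15 mm to 10 mm each.
label = recist_response([20, 15], [10, 10])  # sum 35 -> 20, about -43%
```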
[28] Latent Speech-Text Transformer
Yen-Ju Lu, Yashesh Gaur, Wei Zhou, Benjamin Muller, Jesus Villalba, Najim Dehak, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Srinivasan Iyer, Duc Le
Main category: cs.CL
TL;DR: LST introduces latent speech patches to align speech-text sequence granularity, improving computational efficiency and cross-modal alignment in auto-regressive speech-text models.
Details
Motivation: Current speech-text models suffer from modality imbalance where speech tokens create much longer sequences than text, disproportionately allocating compute to speech and hindering cross-modal alignment and scaling efficiency.
Method: Latent Speech-Text Transformer (LST) aggregates speech tokens into latent speech patches that serve as higher-level autoregressive units, aligning sequence-modeling granularity between speech and text while improving computational efficiency.
Result: LST improves speech accuracy (+6.5% on speech HellaSwag) and text performance, with gains growing with scale from 420M to 1.8B parameters. Benefits extend to downstream tasks: stabilizes ASR adaptation and reduces autoregressive sequence length during ASR/TTS inference.
Conclusion: LST effectively addresses modality imbalance in speech-text models through latent speech patches, improving computational efficiency, cross-modal alignment, and scaling performance while maintaining quality.
Abstract: Auto-regressive speech-text models pre-trained on interleaved text tokens and discretized speech tokens demonstrate strong speech understanding and generation, yet remain substantially less compute-efficient than text LLMs, partly due to the much longer sequences of speech tokens relative to text. This modality imbalance disproportionately allocates pre-training and inference compute to speech, potentially hindering effective cross-modal alignment and slowing performance scaling by orders of magnitude. We introduce the Latent Speech-Text Transformer (LST), which aggregates speech tokens into latent speech patches that serve as higher-level autoregressive units. This design aligns the sequence-modeling granularity between speech and text while improving computational efficiency. The resulting patches can align with textual units to facilitate cross-modal knowledge transfer and compactly capture recurring acoustic patterns such as silence. Across story-completion benchmarks under both compute-controlled and data-controlled settings, LST consistently improves speech accuracy while also improving text performance, achieving up to +6.5% absolute gain on speech HellaSwag in compute-controlled training (+5.3% in data-controlled training). Under compute-controlled scaling from 420M to 1.8B parameters in a near compute-optimal regime, gains grow with scale, and improvements persist up to 7B parameters under fixed-token budgets. These benefits extend to downstream tasks: LST stabilizes ASR adaptation and reduces the effective autoregressive sequence length during ASR and TTS inference, lowering computational cost without degrading reconstruction quality. The code is available at https://github.com/facebookresearch/lst.
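The sequence-length effect of patching can be illustrated with a fixed-size mean-pooling sketch. Note this is an assumption-laden stand-in: LST's aggregation into latent patches is learned, and the patch size of 4 here is arbitrary.

```python
import torch

def to_latent_patches(speech_embs: torch.Tensor, patch_size: int = 4) -> torch.Tensor:
    """Aggregate a (T, d) sequence of speech-token embeddings into
    (ceil(T / patch_size), d) patch embeddings by mean pooling.
    Zero-padding the tail keeps the reshape exact (it slightly dilutes
    the final patch, which is acceptable for a sketch)."""
    t, d = speech_embs.shape
    pad = (-t) % patch_size
    if pad:
        speech_embs = torch.cat([speech_embs, speech_embs.new_zeros(pad, d)])
    return speech_embs.view(-1, patch_size, d).mean(dim=1)

# A 10-token speech segment becomes 3 autoregressive units instead of 10,
# bringing its length closer to that of the interleaved text tokens.
patches = to_latent_patches(torch.randn(10, 16))
```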
[29] Understanding the Interplay between LLMs’ Utilisation of Parametric and Contextual Knowledge: A keynote at ECIR 2025
Isabelle Augenstein
Main category: cs.CL
TL;DR: This paper examines how language models balance their parametric knowledge (learned during training) with contextual knowledge from provided information, focusing on knowledge conflicts and diagnostic methods.
Details
Motivation: To understand how LMs integrate contextual knowledge with their pre-existing parametric knowledge, especially when conflicts arise, and to develop methods for evaluating and addressing these knowledge conflicts without expensive retraining.
Method: The paper presents diagnostic tests and evaluation methods to reveal knowledge conflicts between parametric and contextual knowledge, analyzing how LMs use contextual information and identifying characteristics of successfully integrated knowledge.
Result: The research provides insights into knowledge conflicts in LMs, diagnostic tools to identify them, and understanding of when and how LMs successfully use contextual knowledge versus relying on their parametric memory.
Conclusion: Understanding the interplay between parametric and contextual knowledge in LMs is crucial for improving their reliability in knowledge-intensive tasks and developing methods to update knowledge without full retraining.
Abstract: Language Models (LMs) acquire parametric knowledge from their training process, embedding it within their weights. The increasing scalability of LMs, however, poses significant challenges for understanding a model’s inner workings and further for updating or correcting this embedded knowledge without the significant cost of retraining. Moreover, when using these language models for knowledge-intensive language understanding tasks, LMs have to integrate relevant context, mitigating their inherent weaknesses, such as incomplete or outdated knowledge. Nevertheless, studies indicate that LMs often ignore the provided context as it can be in conflict with the pre-existing LM’s memory learned during pre-training. Conflicting knowledge can also already be present in the LM’s parameters, termed intra-memory conflict. This underscores the importance of understanding the interplay between how a language model uses its parametric knowledge and the retrieved contextual knowledge. In this talk, I will aim to shed light on this important issue by presenting our research on evaluating the knowledge present in LMs, diagnostic tests that can reveal knowledge conflicts, as well as on understanding the characteristics of successfully used contextual knowledge.
[30] Automatic Cardiac Risk Management Classification using large-context Electronic Patients Health Records
Jacopo Vitale, David Della Morte, Luca Bacco, Mario Merone, Mark de Groot, Saskia Haitjema, Leandro Pecchia, Bram van Es
Main category: cs.CL
TL;DR: Automated classification framework using EHRs for geriatric cardiovascular risk management, comparing classical ML, specialized deep learning, and LLMs, with custom Transformer achieving best performance.
Details
Motivation: To overcome limitations of manual administrative coding in geriatric Cardiovascular Risk Management by developing an automated classification framework using unstructured Electronic Health Records.
Method: Benchmarked three modeling paradigms on longitudinal Dutch clinical narratives: classical ML baselines, specialized deep learning architectures optimized for large-context sequences, and general-purpose generative LLMs in zero-shot setting. Also evaluated late fusion strategy integrating unstructured text with structured medication embeddings and anthropometric data.
Result: Custom Transformer architecture outperformed both traditional methods and generative LLMs, achieving highest F1-scores and Matthews Correlation Coefficients.
Conclusion: Specialized hierarchical attention mechanisms are critical for capturing long-range dependencies in medical texts, providing robust automated alternative to manual workflows for clinical risk stratification.
Abstract: To overcome the limitations of manual administrative coding in geriatric Cardiovascular Risk Management, this study introduces an automated classification framework leveraging unstructured Electronic Health Records (EHRs). Using a dataset of 3,482 patients, we benchmarked three distinct modeling paradigms on longitudinal Dutch clinical narratives: classical machine learning baselines, specialized deep learning architectures optimized for large-context sequences, and general-purpose generative Large Language Models (LLMs) in a zero-shot setting. Additionally, we evaluated a late fusion strategy to integrate unstructured text with structured medication embeddings and anthropometric data. Our analysis reveals that the custom Transformer architecture outperforms both traditional methods and generative LLMs, achieving the highest F1-scores and Matthews Correlation Coefficients. These findings underscore the critical role of specialized hierarchical attention mechanisms in capturing long-range dependencies within medical texts, presenting a robust, automated alternative to manual workflows for clinical risk stratification.
[31] Fusing Semantic, Lexical, and Domain Perspectives for Recipe Similarity Estimation
Denica Kjorvezir, Danilo Najkov, Eva Valencič, Erika Jesenko, Barbara Koroušić Seljak, Tome Eftimov, Riste Stojanov
Main category: cs.CL
TL;DR: Paper develops multimodal recipe similarity assessment combining semantic, lexical, and nutritional analysis with expert validation system.
Details
Motivation: Need better methods to assess recipe similarity for applications in food industry, personalized diets, nutrition recommendations, and automated recipe generation systems.
Method: Combines semantic, lexical, and domain similarity analysis of ingredients, preparation methods, and nutritional attributes. Developed web-based interface for expert validation of similarity results.
Result: Evaluated 318 recipe pairs with 80% expert agreement (255 pairs). Analysis reveals which similarity aspects (lexical, semantic, or nutritional) most influence expert decisions.
Conclusion: Multimodal similarity assessment methods have broad applications in food industry and support development of personalized nutrition systems and automated recipe generation.
Abstract: This research focuses on developing advanced methods for assessing similarity between recipes by combining different sources of information and analytical approaches. We explore the semantic, lexical, and domain similarity of food recipes, evaluated through the analysis of ingredients, preparation methods, and nutritional attributes. A web-based interface was developed to allow domain experts to validate the combined similarity results. After evaluating 318 recipe pairs, experts agreed on 255 (80%). The evaluation of expert assessments enables the estimation of which similarity aspects–lexical, semantic, or nutritional–are most influential in expert decision-making. The application of these methods has broad implications in the food industry and supports the development of personalized diets, nutrition recommendations, and automated recipe generation systems.
[32] ESAinsTOD: A Unified End-to-End Schema-Aware Instruction-Tuning Framework for Task-Oriented Dialog Modeling
Dechuan Teng, Chunlin Lu, Libo Qin, Wanxiang Che
Main category: cs.CL
TL;DR: ESAinsTOD: A unified end-to-end schema-aware instruction-tuning framework for task-oriented dialog systems that enables flexible adaptation to various dialog scenarios through instruction and schema alignment mechanisms.
Details
Motivation: Existing end-to-end modeling methods for modular task-oriented dialog systems are typically tailored to specific datasets, making it challenging to adapt to new dialog scenarios. There's a need for a more generalizable framework that can handle various dialogue task flows and schemas.
Method: Proposes ESAinsTOD framework with full-parameter fine-tuning of LLMs and two alignment mechanisms: (1) instruction alignment to ensure faithful task instruction following, and (2) schema alignment to encourage predictions adhering to specified schemas. Uses session-level end-to-end modeling to access previous task flow results from dialogue history.
Result: Outperforms state-of-the-art models on CamRest676, In-Car and MultiWOZ benchmarks; exhibits superior generalization in low-resource settings with enhanced zero-shot performance; improves robustness against data noise and cascading errors.
Conclusion: The structured instruction-tuning approach with alignment mechanisms provides significant benefits over simple LLM fine-tuning for task-oriented dialog systems, enabling better generalization, robustness, and adaptation to various dialog scenarios.
Abstract: Existing end-to-end modeling methods for modular task-oriented dialog systems are typically tailored to specific datasets, making it challenging to adapt to new dialog scenarios. In this work, we propose ESAinsTOD, a unified End-to-end Schema-Aware Instruction-tuning framework for general Task-Oriented Dialog modeling. This framework introduces a structured methodology to go beyond simply fine-tuning Large Language Models (LLMs), enabling flexible adaptation to various dialogue task flows and schemas. Specifically, we leverage full-parameter fine-tuning of LLMs and introduce two alignment mechanisms to make the resulting system both instruction-aware and schema-aware: (i) instruction alignment, which ensures that the system faithfully follows task instructions to complete various task flows from heterogeneous TOD datasets; and (ii) schema alignment, which encourages the system to make predictions adhering to the specified schema. In addition, we employ session-level end-to-end modeling, which allows the system to access the results of previously executed task flows within the dialogue history, to bridge the gap between the instruction-tuning paradigm and the real-world application of TOD systems. Empirical results show that while a fine-tuned LLM serves as a strong baseline, our structured approach provides significant additional benefits. In particular, our findings indicate that: (i) ESAinsTOD outperforms state-of-the-art models by a significant margin on end-to-end task-oriented dialog modeling benchmarks: CamRest676, In-Car and MultiWOZ; (ii) more importantly, it exhibits superior generalization capabilities across various low-resource settings, with the proposed alignment mechanisms significantly enhancing zero-shot performance; and (iii) our instruction-tuning paradigm substantially improves the model’s robustness against data noise and cascading errors.
[33] Evaluation of LLMs in retrieving food and nutritional context for RAG systems
Maks Požarnik Vavken, Matevž Ogrinc, Tome Eftimov, Barbara Koroušić Seljak
Main category: cs.CL
TL;DR: LLMs can effectively translate natural language queries into structured metadata filters for food composition database retrieval, but struggle with queries involving non-expressible constraints beyond metadata format scope.
Details
Motivation: To evaluate LLMs' effectiveness in specialized RAG systems for food composition databases, reducing manual effort and technical expertise required for domain experts like nutritionists to access complex food data.
Method: Evaluated four LLMs on their ability to translate natural language queries into structured metadata filters for retrieval via a Chroma vector database, testing on easy, moderate, and difficult queries.
Result: LLMs achieved high accuracy in retrieval for easy and moderately complex queries, demonstrating they can serve as accessible, high-performance tools. However, difficult queries involving non-expressible constraints remained challenging.
Conclusion: LLM-driven metadata filtering excels when constraints can be explicitly expressed in the metadata format, but struggles when queries exceed the representational scope of the metadata, revealing limitations in handling complex, non-expressible constraints.
Abstract: In this article, we evaluate four Large Language Models (LLMs) and their effectiveness at retrieving data within a specialized Retrieval-Augmented Generation (RAG) system, using a comprehensive food composition database. Our method is focused on the LLMs ability to translate natural language queries into structured metadata filters, enabling efficient retrieval via a Chroma vector database. By achieving high accuracy in this critical retrieval step, we demonstrate that LLMs can serve as an accessible, high-performance tool, drastically reducing the manual effort and technical expertise previously required for domain experts, such as food compilers and nutritionists, to leverage complex food and nutrition data. However, despite the high performance on easy and moderately complex queries, our analysis of difficult questions reveals that reliable retrieval remains challenging when queries involve non-expressible constraints. These findings demonstrate that LLM-driven metadata filtering excels when constraints can be explicitly expressed, but struggles when queries exceed the representational scope of the metadata format.
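The translation target can be sketched as a Chroma-style `where` filter. The field names below (`food_group`, `kcal_per_100g`) are hypothetical, the operator syntax (`$and`, `$eq`, `$lt`) follows Chroma's metadata-filter format, and the tiny evaluator only illustrates the filter's semantics without a running vector database.

```python
# Translation target for a query like "fruits under 100 kcal per 100 g":
# a Chroma-style `where` filter (field names here are hypothetical).
where = {"$and": [{"food_group": {"$eq": "fruit"}},
                  {"kcal_per_100g": {"$lt": 100}}]}

OPS = {"$eq": lambda a, b: a == b,
       "$lt": lambda a, b: a < b,
       "$gt": lambda a, b: a > b}

def matches(metadata: dict, where: dict) -> bool:
    """Evaluate a Chroma-style filter against one record's metadata."""
    if "$and" in where:
        return all(matches(metadata, clause) for clause in where["$and"])
    if "$or" in where:
        return any(matches(metadata, clause) for clause in where["$or"])
    (field, cond), = where.items()   # e.g. {"kcal_per_100g": {"$lt": 100}}
    (op, value), = cond.items()
    return OPS[op](metadata[field], value)

rows = [{"food_group": "fruit", "kcal_per_100g": 52},
        {"food_group": "fruit", "kcal_per_100g": 160},
        {"food_group": "cereal", "kcal_per_100g": 80}]
hits = [r for r in rows if matches(r, where)]  # only the 52 kcal fruit
```

A "difficult" query in the paper's sense is one whose constraint has no counterpart among these metadata fields, so no such filter can express it.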
[34] RbtAct: Rebuttal as Supervision for Actionable Review Feedback Generation
Sihong Wu, Yiling Ma, Yilun Zhao, Tiansheng Hu, Owen Jiang, Manasi Patwardhan, Arman Cohan
Main category: cs.CL
TL;DR: RbtAct is a system that generates actionable peer-review feedback by learning from rebuttals, using perspective-conditioned segment-level generation and preference optimization.
Details
Motivation: Current AI-generated peer-review reports are often superficial and lack actionable guidance for authors. The paper addresses this gap by leveraging rebuttals as implicit supervision to learn what makes feedback actionable.
Method: Proposes perspective-conditioned segment-level review feedback generation, creates RMR-75K dataset mapping review segments to rebuttal segments with perspective labels and impact categories, and trains Llama-3.1-8B-Instruct with supervised fine-tuning followed by preference optimization using rebuttal-derived pairs.
Result: Experiments with human experts and LLM-as-a-judge show consistent gains in actionability and specificity over strong baselines while maintaining grounding and relevance.
Conclusion: RbtAct successfully improves the actionability of AI-generated peer-review feedback by learning from rebuttals, providing more concrete and implementable guidance to authors.
Abstract: Large language models (LLMs) are increasingly used across the scientific workflow, including to draft peer-review reports. However, many AI-generated reviews are superficial and insufficiently actionable, leaving authors without concrete, implementable guidance and motivating the gap this work addresses. We propose RbtAct, which targets actionable review feedback generation and places existing peer review rebuttal at the center of learning. Rebuttals show which reviewer comments led to concrete revisions or specific plans, and which were only defended. Building on this insight, we leverage rebuttal as implicit supervision to directly optimize a feedback generator for actionability. To support this objective, we propose a new task called perspective-conditioned segment-level review feedback generation, in which the model is required to produce a single focused comment based on the complete paper and a specified perspective such as experiments and writing. We also build a large dataset named RMR-75K that maps review segments to the rebuttal segments that address them, with perspective labels and impact categories that order author uptake. We then train the Llama-3.1-8B-Instruct model with supervised fine-tuning on review segments followed by preference optimization using rebuttal derived pairs. Experiments with human experts and LLM-as-a-judge show consistent gains in actionability and specificity over strong baselines while maintaining grounding and relevance.
[35] Beyond Fine-Tuning: Robust Food Entity Linking under Ontology Drift with FoodOntoRAG
Jan Drole, Ana Gjorgjevikj, Barbara Koroušić Seljak, Tome Eftimov
Main category: cs.CL
TL;DR: FoodOntoRAG: A retrieval-augmented generation pipeline for few-shot named entity linking in food/nutrition domains without fine-tuning, using hybrid retrieval and LLM agents for interpretable ontology matching.
Details
Motivation: Current NEL approaches in food/nutrition domains require fine-tuning LLMs, which is computationally expensive, ties models to specific ontology versions, and degrades with ontology drift. Need for more flexible, interpretable, and ontology-agnostic solutions.
Method: FoodOntoRAG uses a hybrid lexical-semantic retriever to enumerate candidate entities from domain ontologies, then employs multiple LLM agents: a selector agent chooses best matches with rationale, a scorer agent calibrates confidence, and a synonym generator agent proposes reformulations when confidence is low. The pipeline conditions LLMs on structured evidence (labels, synonyms, definitions, relations) without fine-tuning.
Result: The pipeline approaches state-of-the-art accuracy while revealing gaps/inconsistencies in existing annotations. It avoids fine-tuning costs, improves robustness to ontology evolution, and yields interpretable decisions through grounded justifications.
Conclusion: FoodOntoRAG provides an effective, model- and ontology-agnostic approach for food entity linking that addresses limitations of fine-tuning-based methods, offering better adaptability to ontology changes and interpretable results.
Abstract: Standardizing food terms from product labels and menus into ontology concepts is a prerequisite for trustworthy dietary assessment and safety reporting. The dominant approach to Named Entity Linking (NEL) in the food and nutrition domains fine-tunes Large Language Models (LLMs) on task-specific corpora. Although effective, fine-tuning incurs substantial computational cost, ties models to a particular ontology snapshot (i.e., version), and degrades under ontology drift. This paper presents FoodOntoRAG, a model- and ontology-agnostic pipeline that performs few-shot NEL by retrieving candidate entities from domain ontologies and conditioning an LLM on structured evidence (food labels, synonyms, definitions, and relations). A hybrid lexical–semantic retriever enumerates candidates; a selector agent chooses a best match with rationale; a separate scorer agent calibrates confidence; and, when confidence falls below a threshold, a synonym generator agent proposes reformulations to re-enter the loop. The pipeline approaches state-of-the-art accuracy while revealing gaps and inconsistencies in existing annotations. The design avoids fine-tuning, improves robustness to ontology evolution, and yields interpretable decisions through grounded justifications.
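The hybrid lexical-semantic retrieval step lends itself to a short sketch. The scoring functions below are illustrative stand-ins (token Jaccard on the lexical side, character-bigram cosine in place of real embeddings) and not the paper's implementation:

```python
# Hypothetical sketch of hybrid lexical-semantic candidate retrieval.
# Lexical score: token Jaccard overlap. "Semantic" score: cosine over
# character-bigram count vectors, a toy stand-in for embeddings.
from collections import Counter
import math

def lexical_score(query, label):
    q, l = set(query.lower().split()), set(label.lower().split())
    return len(q & l) / len(q | l) if q | l else 0.0

def char_bigrams(text):
    t = text.lower().replace(" ", "")
    return Counter(t[i:i + 2] for i in range(len(t) - 1))

def semantic_score(query, label):
    a, b = char_bigrams(query), char_bigrams(label)
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, ontology_labels, alpha=0.5, k=3):
    # Weighted blend of the two scores, top-k candidates returned.
    scored = [(alpha * lexical_score(query, l)
               + (1 - alpha) * semantic_score(query, l), l)
              for l in ontology_labels]
    return [l for s, l in sorted(scored, reverse=True)[:k]]

labels = ["whole wheat bread", "white bread", "apple juice", "bread roll"]
print(retrieve("wholewheat bread", labels, k=2))
```

Note how the character-level side survives the concatenated spelling "wholewheat", which a purely token-based matcher would miss; that robustness to label-writing variation is the point of the hybrid design.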
[36] EPIC-EuroParl-UdS: Information-Theoretic Perspectives on Translation and Interpreting
Maria Kunilovskaya, Christina Pollkläsener
Main category: cs.CL
TL;DR: Updated combined corpus of European Parliament speeches with translations/interpretations, featuring corrections, new annotations (word alignment, surprisal indices), supporting research on language variation, translation studies, and disfluency analysis.
Details
Motivation: To provide an improved multilingual corpus resource for studying language variation between written and spoken modes, disfluencies in speech, and translation phenomena, with enhanced annotations and corrected errors from previous versions.
Method: Combined EPIC-UdS (spoken) and EuroParl-UdS (written) corpora, corrected metadata/text errors, updated linguistic annotations, added word alignment and word-level surprisal indices derived from GPT-2 and machine translation models.
Result: Created a comprehensive bilingual corpus with improved quality and new analytical layers, validated data integrity, and demonstrated utility through a study on filler particle prediction using probabilistic measures from language models.
Conclusion: The updated combined corpus provides a valuable resource for information-theoretic approaches to language variation research, particularly for comparing written/spoken modes and studying disfluencies and translation phenomena.
Abstract: This paper introduces an updated and combined version of the bidirectional English-German EPIC-UdS (spoken) and EuroParl-UdS (written) corpora containing original European Parliament speeches as well as their translations and interpretations. The new version corrects metadata and text errors identified through previous use, refines the content, updates linguistic annotations, and adds new layers, including word alignment and word-level surprisal indices. The combined resource is designed to support research using information-theoretic approaches to language variation, particularly studies comparing written and spoken modes, and examining disfluencies in speech, as well as traditional translationese studies, including parallel (source vs. target) and comparable (original vs. translated) analyses. The paper outlines the updates introduced in this release, summarises previous results based on the corpus, and presents a new illustrative study. The study validates the integrity of the rebuilt spoken data and evaluates probabilistic measures derived from base and fine-tuned GPT-2 and machine translation models on the task of filler particle prediction in interpreting.
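The surprisal layer added to the corpus is the quantity s(w_i) = -log2 P(w_i | context). A toy sketch of the computation, estimating the probability from bigram counts with add-one smoothing rather than the GPT-2 and MT models the corpus actually uses:

```python
# Toy word-level surprisal: s(w_i) = -log2 P(w_i | w_{i-1}),
# with P estimated from bigram counts plus add-one smoothing.
from collections import Counter
import math

corpus = "the cat sat on the mat the cat ran".split()
vocab = set(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def surprisal(prev, word):
    # Add-one smoothed conditional probability of `word` after `prev`.
    p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(vocab))
    return -math.log2(p)

# "cat" is frequent after "the" in this toy corpus, "mat" is not,
# so the former gets a lower (less surprising) value.
print(round(surprisal("the", "cat"), 3))
print(round(surprisal("the", "mat"), 3))
```

A real surprisal index would swap the bigram estimate for the language model's conditional probability over the full left context; the -log2 transform is the same.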
[37] One-Eval: An Agentic System for Automated and Traceable LLM Evaluation
Chengyu Shen, Yanheng Hou, Minghui Pan, Runming He, Zhen Hao Wong, Meiyi Qiang, Zhou Liu, Hao Liang, Peichao Lai, Zeang Sheng, Wentao Zhang
Main category: cs.CL
TL;DR: One-Eval is an agentic evaluation system that converts natural-language evaluation requests into executable, traceable, and customizable evaluation workflows for large language models, reducing manual effort in benchmark identification, code reproduction, and metric interpretation.
Details
Motivation: Current LLM evaluation requires substantial manual effort: practitioners must identify appropriate benchmarks, reproduce heterogeneous evaluation codebases, configure dataset schema mappings, and interpret aggregated metrics. This manual process is inefficient and hinders reliable evaluation.
Method: One-Eval integrates three components: (1) NL2Bench for intent structuring and personalized benchmark planning, (2) BenchResolve for benchmark resolution, automatic dataset acquisition, and schema normalization, and (3) Metrics & Reporting for task-aware metric selection and decision-oriented reporting. The system includes human-in-the-loop checkpoints for review/editing and preserves sample evidence trails for debugging.
Result: Experiments show One-Eval can execute end-to-end evaluations from diverse natural-language requests with minimal user effort, supporting more efficient and reproducible evaluation in industrial settings.
Conclusion: One-Eval addresses the challenges of manual LLM evaluation by providing an agentic system that automates benchmark selection, dataset acquisition, and metric reporting while maintaining traceability and customizability.
Abstract: Reliable evaluation is essential for developing and deploying large language models, yet in practice it often requires substantial manual effort: practitioners must identify appropriate benchmarks, reproduce heterogeneous evaluation codebases, configure dataset schema mappings, and interpret aggregated metrics. To address these challenges, we present One-Eval, an agentic evaluation system that converts natural-language evaluation requests into executable, traceable, and customizable evaluation workflows. One-Eval integrates (i) NL2Bench for intent structuring and personalized benchmark planning, (ii) BenchResolve for benchmark resolution, automatic dataset acquisition, and schema normalization to ensure executability, and (iii) Metrics & Reporting for task-aware metric selection and decision-oriented reporting beyond scalar scores. The system further incorporates human-in-the-loop checkpoints for review, editing, and rollback, while preserving sample evidence trails for debugging and auditability. Experiments show that One-Eval can execute end-to-end evaluations from diverse natural-language requests with minimal user effort, supporting more efficient and reproducible evaluation in industrial settings. Our framework is publicly available at https://github.com/OpenDCAI/One-Eval.
[38] Chow-Liu Ordering for Long-Context Reasoning in Chain-of-Agents
Naman Gupta, Vaibhav Singh, Arun Iyer, Kirankumar Shiragur, Pratham Grover, Ramakrishna B. Bairi, Ritabrata Maiti, Sankarshan Damle, Shachee Mishra Gupta, Rishikesh Maurya, Vageesh D. C
Main category: cs.CL
TL;DR: CoA framework for long-context reasoning suffers from information loss due to chunk ordering; using Chow-Liu trees to learn dependency structures and breadth-first traversal improves performance over baseline ordering methods.
Details
Motivation: Sequential multi-agent reasoning frameworks like Chain-of-Agents process long contexts by chunking, but the order of chunk processing creates information bottlenecks and affects reasoning quality. The paper aims to optimize chunk ordering to minimize information loss.
Method: Use Chow-Liu trees to learn dependency structures between chunks, then apply breadth-first traversal of the resulting tree to determine optimal chunk ordering that prioritizes strongly related chunks.
Result: The proposed chunk ordering method consistently outperforms default document-chunk ordering and semantic score-based ordering across three long-context benchmarks in terms of answer relevance and exact-match accuracy.
Conclusion: Optimizing chunk ordering through learned dependency structures significantly improves sequential multi-agent reasoning performance by reducing information loss across agents in long-context processing.
Abstract: Sequential multi-agent reasoning frameworks such as Chain-of-Agents (CoA) handle long-context queries by decomposing inputs into chunks and processing them sequentially using LLM-based worker agents that read from and update a bounded shared memory. From a probabilistic perspective, CoA aims to approximate the conditional distribution corresponding to a model capable of jointly reasoning over the entire long context. CoA achieves this through a latent-state factorization in which only bounded summaries of previously processed evidence are passed between agents. The resulting bounded-memory approximation introduces a lossy information bottleneck, making the final evidence state inherently dependent on the order in which chunks are processed. In this work, we study the problem of chunk ordering for long-context reasoning. We use the well-known Chow-Liu trees to learn a dependency structure that prioritizes strongly related chunks. Empirically, we show that a breadth-first traversal of the resulting tree yields chunk orderings that reduce information loss across agents and consistently outperform both default document-chunk ordering and semantic score-based ordering in answer relevance and exact-match accuracy across three long-context benchmarks.
[39] N-gram-like Language Models Predict Reading Time Best
James A. Michaelov, Roger P. Levy
Main category: cs.CL
TL;DR: Transformer language models become too good at next-word prediction, making their probabilities less correlated with human reading times, which are better predicted by simpler n-gram statistics.
Details
Motivation: The paper addresses the counterintuitive finding that as language models improve at next-word prediction, their probability estimates become less correlated with human reading times measured through eye-tracking. This suggests a disconnect between what makes language models accurate and what influences human reading behavior.Method: The authors analyze the relationship between language model probabilities and reading time metrics. They compare different neural language models and examine which models’ predictions correlate best with n-gram statistics. They then test whether models whose predictions align more with n-gram statistics also show stronger correlations with eye-tracking-based reading time measurements.
Result: The research demonstrates that neural language models whose probability predictions are most correlated with simple n-gram statistics are also those whose probabilities show the strongest correlation with human reading times. This suggests that human reading time is more sensitive to basic n-gram statistics than to the complex statistical patterns learned by state-of-the-art transformer models.
Conclusion: The paradox of improved language models showing worse correlation with reading time can be explained by human reading behavior being more aligned with simple n-gram statistics rather than the sophisticated patterns captured by advanced transformer models. This has implications for understanding both human language processing and the nature of language model learning.
Abstract: Recent work has found that contemporary language models such as transformers can become so good at next-word prediction that the probabilities they calculate become worse for predicting reading time. In this paper, we propose that this can be explained by reading time being sensitive to simple n-gram statistics rather than the more complex statistics learned by state-of-the-art transformer language models. We demonstrate that the neural language models whose predictions are most correlated with n-gram probability are also those that calculate probabilities that are the most correlated with eye-tracking-based metrics of reading time on naturalistic text.
[40] Do What I Say: A Spoken Prompt Dataset for Instruction-Following
Maike Züfle, Sara Papi, Fabian Retkowski, Szymon Mazurek, Marek Kasztelnik, Alexander Waibel, Luisa Bentivogli, Jan Niehues
Main category: cs.CL
TL;DR: DOWIS is a multilingual dataset of human-recorded spoken and written prompts for realistic evaluation of Speech Large Language Models across 9 tasks and 11 languages, revealing text prompts outperform spoken prompts except for speech output tasks.
Details
Motivation: Current SLLM evaluations use text prompts, which don't reflect real-world speech interactions. There's a need for realistic evaluation under spoken instruction conditions to better assess model performance in practical scenarios.Method: Created DOWIS dataset with human-recorded spoken and written prompts spanning 9 tasks, 11 languages, 10 prompt variants per task-language pair across five styles. Used this to benchmark state-of-the-art SLLMs, analyzing interplay between prompt modality, style, language, and task type.
Result: Text prompts consistently outperform spoken prompts, especially for low-resource and cross-lingual settings. Only for tasks with speech output do spoken prompts close the performance gap, highlighting the importance of speech-based prompting in SLLM evaluation.
Conclusion: The DOWIS dataset enables realistic evaluation of SLLMs under spoken instruction conditions, revealing significant performance differences between text and speech prompts that current evaluation methods miss, emphasizing the need for speech-based evaluation protocols.
Abstract: Speech Large Language Models (SLLMs) have rapidly expanded, supporting a wide range of tasks. These models are typically evaluated using text prompts, which may not reflect real-world scenarios where users interact with speech. To address this gap, we introduce DoWhatISay (DOWIS), a multilingual dataset of human-recorded spoken and written prompts designed to pair with any existing benchmark for realistic evaluation of SLLMs under spoken instruction conditions. Spanning 9 tasks and 11 languages, it provides 10 prompt variants per task-language pair, across five styles. Using DOWIS, we benchmark state-of-the-art SLLMs, analyzing the interplay between prompt modality, style, language, and task type. Results show that text prompts consistently outperform spoken prompts, particularly for low-resource and cross-lingual settings. Only for tasks with speech output, spoken prompts do close the gap, highlighting the need for speech-based prompting in SLLM evaluation.
[41] Benchmarking Political Persuasion Risks Across Frontier Large Language Models
Zhongren Chen, Joshua Kalla, Quan Le
Main category: cs.CL
TL;DR: LLMs outperform standard political campaign ads in persuasion, with Claude models being most persuasive and Grok least persuasive; information-based prompts have model-dependent effects on persuasiveness.
Details
Motivation: To address concerns about LLMs' capacity to sway political views and evaluate whether frontier models are more persuasive than standard political campaign practices, given the rise of advanced models.Method: Two survey experiments (N=19,145) across bipartisan issues and stances evaluating seven state-of-the-art LLMs from Anthropic, OpenAI, Google, and xAI; introduced data-driven, strategy-agnostic LLM-assisted conversation analysis to identify persuasive strategies.
Result: LLMs outperform standard campaign advertisements, with heterogeneity across models (Claude highest, Grok lowest); information-based prompts have model-dependent effects (increase persuasiveness for Claude and Grok, reduce for GPT).
Conclusion: Frontier LLMs pose persuasive risks exceeding standard political campaigns, with model-specific variations; provides framework for cross-model comparative risk assessment of persuasive capabilities.
Abstract: Concerns persist regarding the capacity of Large Language Models (LLMs) to sway political views. Although prior research has claimed that LLMs are not more persuasive than standard political campaign practices, the recent rise of frontier models warrants further study. In two survey experiments (N=19,145) across bipartisan issues and stances, we evaluate seven state-of-the-art LLMs developed by Anthropic, OpenAI, Google, and xAI. We find that LLMs outperform standard campaign advertisements, with heterogeneity in performance across models. Specifically, Claude models exhibit the highest persuasiveness, while Grok exhibits the lowest. The results are robust across issues and stances. Moreover, in contrast to the findings in Hackenburg et al. (2025b) and Lin et al. (2025) that information-based prompts boost persuasiveness, we find that the effectiveness of information-based prompts is model-dependent: they increase the persuasiveness of Claude and Grok while substantially reducing that of GPT. We introduce a data-driven and strategy-agnostic LLM-assisted conversation analysis approach to identify and assess underlying persuasive strategies. Our work benchmarks the persuasive risks of frontier models and provides a framework for cross-model comparative risk assessment.
[42] Thinking to Recall: How Reasoning Unlocks Parametric Knowledge in LLMs
Zorik Gekhman, Roee Aharoni, Eran Ofek, Mor Geva, Roi Reichart, Jonathan Herzig
Main category: cs.CL
TL;DR: LLM reasoning improves performance on simple factual questions through computational buffering and factual priming mechanisms, despite no complex reasoning being required.
Details
Motivation: To understand why reasoning in LLMs helps with simple, single-hop factual questions when no complex reasoning steps are needed, and to identify the underlying mechanisms.Method: Designed hypothesis-driven controlled experiments to test different mechanisms, analyzing how reasoning tokens affect knowledge recall and identifying computational buffer effects and factual priming.
Result: Found two key mechanisms: computational buffer effect (latent computation using reasoning tokens) and factual priming (topically related facts as semantic bridges). Also discovered that hallucinated intermediate facts increase final answer hallucinations.
Conclusion: Reasoning aids parametric knowledge recall through computational and semantic mechanisms, but carries hallucination risks that can be mitigated by prioritizing reasoning trajectories with hallucination-free factual statements.
Abstract: While reasoning in LLMs plays a natural role in math, code generation, and multi-hop factual questions, its effect on simple, single-hop factual questions remains unclear. Such questions do not require step-by-step logical decomposition, making the utility of reasoning highly counterintuitive. Nevertheless, we find that enabling reasoning substantially expands the capability boundary of the model’s parametric knowledge recall, unlocking correct answers that are otherwise effectively unreachable. Why does reasoning aid parametric knowledge recall when there are no complex reasoning steps to be done? To answer this, we design a series of hypothesis-driven controlled experiments, and identify two key driving mechanisms: (1) a computational buffer effect, where the model uses the generated reasoning tokens to perform latent computation independent of their semantic content; and (2) factual priming, where generating topically related facts acts as a semantic bridge that facilitates correct answer retrieval. Importantly, this latter generative self-retrieval mechanism carries inherent risks: we demonstrate that hallucinating intermediate facts during reasoning increases the likelihood of hallucinations in the final answer. Finally, we show that our insights can be harnessed to directly improve model accuracy by prioritizing reasoning trajectories that contain hallucination-free factual statements.
[43] Model Merging in the Era of Large Language Models: Methods, Applications, and Future Directions
Mingyang Song, Mao Zheng
Main category: cs.CL
TL;DR: A comprehensive survey of model merging techniques for large language models, presenting the FUSE taxonomy covering foundations, unification strategies, scenarios, and ecosystem.
Details
Motivation: Model merging has become important for combining capabilities of multiple neural networks without additional training, offering efficient alternatives to ensembles and retraining for LLMs.Method: Presents the FUSE taxonomy: Foundations (theoretical underpinnings), Unification Strategies (algorithmic approaches), Scenarios (applications), and Ecosystem (tools/benchmarks). Reviews weight averaging, task vector arithmetic, sparsification, mixture-of-experts, and evolutionary optimization.
Result: Systematic examination of model merging landscape, covering theoretical foundations, algorithmic approaches, practical applications, and supporting ecosystem tools.
Conclusion: Provides structured foundation for advancing model merging research, identifies open challenges including theoretical gaps, scalability barriers, and standardization needs.
Abstract: Model merging has emerged as a transformative paradigm for combining the capabilities of multiple neural networks into a single unified model without additional training. With the rapid proliferation of fine-tuned large language models~(LLMs), merging techniques offer a computationally efficient alternative to ensembles and full retraining, enabling practitioners to compose specialized capabilities at minimal cost. This survey presents a comprehensive and structured examination of model merging in the LLM era through the \textbf{FUSE} taxonomy, a four-dimensional framework organized along \textbf{F}oundations, \textbf{U}nification Strategies, \textbf{S}cenarios, and \textbf{E}cosystem. We first establish the theoretical underpinnings of merging, including loss landscape geometry, mode connectivity, and the linear mode connectivity hypothesis. We then systematically review the algorithmic landscape, spanning weight averaging, task vector arithmetic, sparsification-enhanced methods, mixture-of-experts architectures, and evolutionary optimization approaches. For each method family, we analyze the core formulation, highlight representative works, and discuss practical trade-offs. We further examine downstream applications across multi-task learning, safety alignment, domain specialization, multilingual transfer, and federated learning. Finally, we survey the supporting ecosystem of open-source tools, community platforms, and evaluation benchmarks, and identify key open challenges including theoretical gaps, scalability barriers, and standardization needs. This survey aims to equip researchers and practitioners with a structured foundation for advancing model merging.
[44] CREATE: Testing LLMs for Associative Creativity
Manya Wadhwa, Tiasa Singha Roy, Harvey Lederman, Junyi Jessy Li, Greg Durrett
Main category: cs.CL
TL;DR: CREATE benchmark evaluates LLMs’ creative associative reasoning by requiring them to generate diverse, specific concept connection paths between concepts, measuring creativity through path specificity and diversity.
Details
Motivation: The paper aims to address the challenge of evaluating creative associative reasoning in AI models, which is a key component of human creativity involving drawing novel yet meaningful connections between concepts. Current benchmarks don't adequately capture this aspect of creativity.Method: CREATE benchmark requires models to generate sets of paths connecting concepts in their parametric knowledge. Paths are evaluated on specificity (distinctiveness and closeness of concept connection) and diversity (dissimilarity from other paths). Models are scored based on producing larger sets of strong, diverse paths.
Result: Evaluation of frontier models shows strongest models achieve higher creative utility, but benchmark saturation is difficult due to high multiplicity of answers and search complexity. Thinking models aren’t always more effective even with high token budgets, and creative prompting gives limited improvement.
Conclusion: CREATE provides a valuable sandbox for developing methods to improve models’ capacity for associative creativity, offering objective evaluation of creative reasoning abilities in AI systems.
Abstract: A key component of creativity is associative reasoning: the ability to draw novel yet meaningful connections between concepts. We introduce CREATE, a benchmark designed to evaluate models’ capacity for creative associative reasoning. CREATE requires models to generate sets of paths connecting concepts in a model’s parametric knowledge. Paths should have high specificity (distinctiveness and closeness of the concept connection) and high diversity (dissimilarity from other paths), and models are scored more highly if they produce a larger set of strong, diverse paths. This task shares demands of real creativity tasks like hypothesis generation, including an extremely large search space, but enables collection of a sizable benchmark with objective answer grading. Evaluation of frontier models shows that the strongest models achieve higher creative utility than others, with the high multiplicity of answers and complexity of the search making benchmark saturation difficult to achieve. Furthermore, our results illustrate that thinking models are not always more effective on our task, even with high token budgets. Recent approaches for creative prompting give some but limited additional improvement. CREATE provides a sandbox for developing new methods to improve models’ capacity for associative creativity.
[45] Let’s Verify Math Questions Step by Step
Chengyu Shen, Zhen Hao Wong, Runming He, Hao Liang, Meiyi Qiang, Zimo Meng, Zhengyang Zhao, Bohan Zeng, Zhengzhou Zhu, Bin Cui, Wentao Zhang
Main category: cs.CL
TL;DR: MathQ-Verify is a five-stage pipeline for filtering ill-posed or under-specified math problems in LLM training data, improving dataset quality for mathematical reasoning.
Details
Motivation: Current approaches focus on generating correct reasoning paths and answers but overlook the validity of questions themselves, leading to unreliable mathematical datasets with ill-posed or under-specified problems.Method: A five-stage pipeline: 1) format-level validation, 2) question formalization and decomposition into atomic conditions, 3) verification against mathematical definitions, 4) logical contradiction detection, and 5) goal-oriented completeness check.
Result: Achieves state-of-the-art performance across benchmarks, improving F1 score by up to 25 percentage points over baselines, with ~90% precision and 63% recall using model voting.
Conclusion: MathQ-Verify provides a scalable and accurate solution for curating reliable mathematical datasets, reducing label noise and avoiding computation on invalid questions.
Abstract: Large Language Models (LLMs) have recently achieved remarkable progress in mathematical reasoning. To enable such capabilities, many existing works distill strong reasoning models into long chains of thought or design algorithms to construct high-quality math QA data for training. However, these efforts primarily focus on generating correct reasoning paths and answers, while largely overlooking the validity of the questions themselves. In this work, we propose Math Question Verification (MathQ-Verify), a novel five-stage pipeline designed to rigorously filter ill-posed or under-specified math problems. MathQ-Verify first performs format-level validation to remove redundant instructions and ensure that each question is syntactically well-formed. It then formalizes each question, decomposes it into atomic conditions, and verifies them against mathematical definitions. Next, it detects logical contradictions among these conditions, followed by a goal-oriented completeness check to ensure the question provides sufficient information for solving. To evaluate this task, we use existing benchmarks along with an additional dataset we construct, containing 2,147 math questions with diverse error types, each manually double-validated. Experiments show that MathQ-Verify achieves state-of-the-art performance across multiple benchmarks, improving the F1 score by up to 25 percentage points over the direct verification baseline. It further attains approximately 90% precision and 63% recall through a lightweight model voting scheme. MathQ-Verify offers a scalable and accurate solution for curating reliable mathematical datasets, reducing label noise and avoiding unnecessary computation on invalid questions. Our code and data are available at https://github.com/scuuy/MathQ-Verify.
[46] Correspondence Analysis and PMI-Based Word Embeddings: A Comparative Study
Qianqian Qi, Ayoub Bagheri, David J. Hessen, Peter G. M. van der Heijden
Main category: cs.CL
TL;DR: Correspondence analysis (CA) variants (ROOT-CA and ROOTROOT-CA) applied to word-context matrices perform slightly better than standard PMI-based word embeddings and achieve results competitive with BERT on word-similarity benchmarks.
Details
Motivation: To establish a formal connection between correspondence analysis (CA) and PMI-based word embedding methods, and to introduce variants of CA that can handle extreme values in word-context matrices more effectively.Method: Mathematically connects CA to weighted factorization of PMI matrix, introduces CA variants with square-root (ROOT-CA) and fourth-root (ROOTROOT-CA) transformations applied to word-context matrices before SVD-based dimensionality reduction.
Result: ROOT-CA and ROOTROOT-CA perform slightly better overall than standard PMI-based methods across multiple corpora and word-similarity benchmarks, achieving results competitive with BERT despite being traditional static embeddings.
Conclusion: CA variants with root transformations offer improved performance over standard PMI-based word embeddings and can compete with transformer-based contextual embeddings on word-similarity tasks, while being more robust to extreme values in the decomposed matrix.
Abstract: Popular word embedding methods such as GloVe and Word2Vec are related to the factorization of the pointwise mutual information (PMI) matrix. In this paper, we establish a formal connection between correspondence analysis (CA) and PMI-based word embedding methods. CA is a dimensionality reduction method that uses singular value decomposition (SVD), and we show that CA is mathematically close to the weighted factorization of the PMI matrix. We further introduce variants of CA for word-context matrices, namely CA applied after a square-root transformation (ROOT-CA) and after a fourth-root transformation (ROOTROOT-CA). We analyze the performance of these methods and examine how their success or failure is influenced by extreme values in the decomposed matrix. Although our primary focus is on traditionalstatic word embedding methods, we also include a comparison with a transformer-based encoder (BERT) to situate the results relative to contextual embeddings. Empirical evaluations across multiple corpora and word-similarity benchmarks show that ROOT-CA and ROOTROOT-CA perform slightly better overall than standard PMI-based methods and achieve results competitive with BERT.
[47] MKE-Coder: Multi-Axial Knowledge with Evidence Verification in ICD Coding for Chinese EMRs
Xinxin You, Xien Liu, Xue Yang, Ziyi Wang, Ji Wu
Main category: cs.CL
TL;DR: MKE-Coder: A framework for automatic ICD coding of Chinese electronic medical records using multi-axial knowledge and evidence verification
Details
Motivation: Automatic ICD coding works well for English medical records but faces challenges with Chinese EMRs due to concise writing styles, specific internal structures, and previous methods' failure to leverage disease-based multi-axial knowledge and clinical evidence associations.
Method: 1) Identify candidate diagnosis codes and categorize each into knowledge under four coding axes; 2) Retrieve clinical evidence from EMRs and filter credible evidence via a scoring model; 3) Use a masked language modeling inference module to verify that the axis knowledge is evidence-supported and provide recommendations.
Result: MKE-Coder shows significant superiority in automatic ICD coding for Chinese EMRs on a large-scale dataset from various hospitals. In simulated real coding scenarios, it significantly aids coders in improving both coding accuracy and speed.
Conclusion: The proposed MKE-Coder framework effectively addresses challenges in Chinese EMR ICD coding by integrating multi-axial knowledge with evidence verification, demonstrating practical utility in real-world medical coding scenarios.
Abstract: The task of automatically coding the International Classification of Diseases (ICD) in the medical field is well-established and has received much attention. Automatic ICD coding has been successful for English records but faces challenges when dealing with Chinese electronic medical records (EMRs). The first issue lies in the difficulty of extracting disease code-related information from Chinese EMRs, primarily due to the concise writing style and specific internal structure of the EMRs. The second problem is that previous methods have failed to leverage the disease-based multi-axial knowledge and lack association with the corresponding clinical evidence. This paper introduces a novel framework called MKE-Coder: Multi-axial Knowledge with Evidence verification in ICD coding for Chinese EMRs. Initially, we identify candidate codes for the diagnosis and categorize each of them into knowledge under four coding axes. Subsequently, we retrieve corresponding clinical evidence from the comprehensive content of EMRs and filter credible evidence through a scoring model. Finally, to ensure the validity of the candidate code, we propose an inference module based on the masked language modeling strategy. This module verifies that all the axis knowledge associated with the candidate code is supported by evidence and provides recommendations accordingly. To evaluate the performance of our framework, we conduct experiments using a large-scale Chinese EMR dataset collected from various hospitals. The experimental results demonstrate that MKE-Coder exhibits significant superiority in the task of automatic ICD coding based on Chinese EMRs. In the practical evaluation of our method within simulated real coding scenarios, it has been demonstrated that our approach significantly aids coders in enhancing both their coding accuracy and speed.
[48] Connecting Voices: LoReSpeech as a Low-Resource Speech Parallel Corpus
Samy Ouzerrout
Main category: cs.CL
TL;DR: LoReSpeech: A methodology for creating low-resource speech-to-speech translation corpora for underrepresented languages using collaborative platforms and alignment tools.
Details
Motivation: Aligned audio corpora are scarce for underrepresented languages, hindering ASR and speech translation technologies and limiting digital inclusivity for these language communities.
Method: 1) Create LoReASR sub-corpus of short audios aligned with transcriptions using collaborative platforms. 2) Align long-form audio recordings (e.g., biblical texts) using tools like MFA (Montreal Forced Aligner). 3) Build LoReSpeech corpus with both intra- and inter-language alignments.
Result: LoReSpeech delivers aligned speech corpora enabling advancements in multilingual ASR systems, direct speech-to-speech translation models, and linguistic preservation for underrepresented languages.
Conclusion: The methodology addresses the scarcity of aligned audio resources for low-resource languages, fostering digital inclusivity and supporting technological integration of underrepresented languages through the Tutlayt AI project.
Abstract: Aligned audio corpora are fundamental to NLP technologies such as ASR and speech translation, yet they remain scarce for underrepresented languages, hindering their technological integration. This paper introduces a methodology for constructing LoReSpeech, a low-resource speech-to-speech translation corpus. Our approach begins with LoReASR, a sub-corpus of short audios aligned with their transcriptions, created through a collaborative platform. Building on LoReASR, long-form audio recordings, such as biblical texts, are aligned using tools like the MFA. LoReSpeech delivers both intra- and inter-language alignments, enabling advancements in multilingual ASR systems, direct speech-to-speech translation models, and linguistic preservation efforts, while fostering digital inclusivity. This work is conducted within the Tutlayt AI project (https://tutlayt.fr).
[49] UltraEdit: Training-, Subject-, and Memory-Free Lifelong Editing in Language Models
Xiaojie Gu, Ziying Huang, Jia-Chen Gu, Kai Zhang
Main category: cs.CL
TL;DR: UltraEdit is a scalable lifelong model editing approach for LLMs that computes parameter shifts in one step using hidden states and gradients, achieving 7x faster editing with 4x less VRAM than SOTA.
Details
Motivation: Current model editing methods struggle with practical lifelong adaptation at scale, lacking the efficiency and scalability needed for real-world deployment. There's a need for approaches that can handle massive numbers of edits while preserving existing capabilities.
Method: UltraEdit uses a training-, subject-, and memory-free approach that computes parameter shifts in one step using only a hidden state and its gradient. It employs a lifelong normalization strategy that continuously updates feature statistics across turns to adapt to distributional shifts.
Result: Achieves 7x faster editing speed and 4x less VRAM than previous SOTA, enabling editing of 7B LLM on 24GB consumer GPU. Supports up to 2M edits while maintaining high accuracy, demonstrated on UltraEditBench (2M+ editing pairs) across 5 datasets and 6 models.
Conclusion: UltraEdit represents a significant step toward safe and scalable lifelong learning for LLMs, offering practical efficiency for real-world deployment while maintaining editing accuracy at massive scale.
Abstract: Lifelong learning enables large language models (LLMs) to adapt to evolving information by continually updating their internal knowledge. An ideal system should support efficient, wide-ranging updates while preserving existing capabilities and ensuring reliable deployment. Model editing stands out as a promising solution for this goal, offering a focused and efficient way to revise a model’s internal knowledge. Although recent paradigms have made notable progress, they often struggle to meet the demands of practical lifelong adaptation at scale. To bridge this gap, we propose UltraEdit, a training-, subject-, and memory-free approach that is well-suited for ultra-scalable, real-world lifelong model editing. UltraEdit fundamentally differs from traditional paradigms by computing parameter shifts in one step using only a hidden state and its gradient, making the approach simple yet efficient. To improve scalability in lifelong settings, UltraEdit employs a lifelong normalization strategy that continuously updates feature statistics across turns, allowing it to adapt to distributional shifts and maintain consistency over time. UltraEdit achieves editing speeds more than $7\times$ faster than the previous state-of-the-art method, while requiring $4\times$ less VRAM. This makes it the only method currently capable of editing a 7B LLM on a 24GB consumer-grade GPU. Furthermore, we construct UltraEditBench, the largest dataset in the field to date with over 2M editing pairs, and demonstrate that our method supports up to 2M edits while maintaining high accuracy. Comprehensive experiments on five datasets and six models show that UltraEdit consistently achieves superior performance across diverse model editing scenarios, taking a further step towards safe and scalable lifelong learning. Our code is available at https://github.com/XiaojieGu/UltraEdit.
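The two ingredients, a closed-form one-step parameter shift from a hidden state and its gradient, and streaming feature statistics across edit turns, can be illustrated with the toy sketch below. The exact update rules in UltraEdit differ; the class and function here are illustrative assumptions, not the paper's algorithm:

```python
import numpy as np

class LifelongNorm:
    """Running feature statistics across edit turns (a sketch of the
    lifelong-normalization idea; the streaming update rule is an assumption)."""
    def __init__(self, dim, eps=1e-6):
        self.mean = np.zeros(dim)
        self.var = np.ones(dim)
        self.count = 0
        self.eps = eps

    def update(self, h):
        # Welford-style incremental mean/variance update per edit turn
        self.count += 1
        delta = h - self.mean
        self.mean += delta / self.count
        self.var += (delta * (h - self.mean) - self.var) / self.count

    def __call__(self, h):
        return (h - self.mean) / np.sqrt(self.var + self.eps)

def one_step_shift(h, grad_out, lr=1.0):
    # Rank-1 parameter shift from a single hidden state h and the gradient
    # at the layer output: Delta W = -lr * g h^T / (h^T h).
    # Closed form, so there is no training loop and nothing to memorize.
    return -lr * np.outer(grad_out, h) / (h @ h)
```

With `lr=1.0` and a squared-error gradient, one application of `one_step_shift` makes the edited layer map `h` exactly to the desired output, which is the sense in which the shift is computed "in one step".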
[50] ConLID: Supervised Contrastive Learning for Low-Resource Language Identification
Negar Foroutan, Jakhongir Saydaliev, Ye Eun Kim, Antoine Bosselut
Main category: cs.CL
TL;DR: Novel supervised contrastive learning approach for language identification that improves performance on out-of-domain data for low-resource languages by 3.2 percentage points while maintaining performance for high-resource languages.
Details
Motivation: Current LID models perform poorly on low-resource languages due to limited training data (often single-domain like the Bible), creating imbalance and bias issues that affect multilingual LLM pretraining corpus curation.
Method: Proposes supervised contrastive learning (SCL) approach to learn domain-invariant representations for low-resource languages, addressing the domain bias problem in language identification.
Result: Improves LID performance on out-of-domain data for low-resource languages by 3.2 percentage points while maintaining performance for high-resource languages.
Conclusion: The SCL approach effectively addresses domain bias in language identification, particularly benefiting low-resource languages and improving multilingual corpus curation for LLM pretraining.
Abstract: Language identification (LID) is a critical step in curating multilingual LLM pretraining corpora from web crawls. While many studies on LID model training focus on collecting diverse training data to improve performance, low-resource languages – often limited to single-domain data, such as the Bible – continue to perform poorly. To resolve these imbalance and bias issues, we propose a novel supervised contrastive learning (SCL) approach to learn domain-invariant representations for low-resource languages. We show that our approach improves LID performance on out-of-domain data for low-resource languages by 3.2 percentage points, while maintaining its performance for the high-resource languages.
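The core of the SCL objective, pulling together embeddings that share a language label regardless of domain, follows the standard supervised contrastive formulation (Khosla et al.). A minimal numpy version, with names chosen for illustration rather than taken from the paper:

```python
import numpy as np

def supcon_loss(z, labels, tau=0.1):
    # Supervised contrastive loss over L2-normalized embeddings z:
    # samples sharing a language label are positives, all others negatives.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / tau
    n = len(labels)
    not_self = ~np.eye(n, dtype=bool)
    sim_max = sim.max(axis=1, keepdims=True)          # numerical stability
    exp_sim = np.exp(sim - sim_max) * not_self
    log_prob = sim - sim_max - np.log(exp_sim.sum(axis=1, keepdims=True))
    pos = (labels[:, None] == labels[None, :]) & not_self
    per_anchor = -(log_prob * pos).sum(axis=1) / np.maximum(pos.sum(axis=1), 1)
    return per_anchor.mean()
```

Because positives are defined by language label rather than domain, minimizing this loss encourages the domain-invariant representations the paper targets.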
[51] OPENXRD: A Comprehensive Benchmark Framework for LLM/MLLM XRD Question Answering
Ali Vosoughi, Ayoub Shahnazari, Yufeng Xi, Zeliang Zhang, Griffin Hess, Chenliang Xu, Niaz Abdolrahim
Main category: cs.CL
TL;DR: OPENXRD is a benchmarking framework for evaluating LLMs and MLLMs on crystallography QA, measuring how models use domain-specific context during inference through 217 expert-curated XRD questions.
Details
Motivation: There's a need to evaluate how large language models and multimodal models assimilate domain-specific scientific knowledge, particularly in crystallography, to understand their reasoning capabilities and knowledge integration in specialized domains.
Method: Created 217 expert-curated XRD questions covering fundamental to advanced concepts, evaluated under closed-book (no context) and open-book (with context) conditions. Context includes GPT-4.5 generated passages refined by experts. Benchmarked 74 state-of-the-art LLMs and MLLMs including GPT, LLaVA, LLaMA, QWEN, Mistral, and Gemini families.
Result: Mid-sized models (7B-70B parameters) gain most from contextual materials, while very large models show saturation or interference. Expert-reviewed materials provide significantly higher improvements than AI-generated ones even with matched token counts, showing content quality drives performance.
Conclusion: OPENXRD provides a reproducible diagnostic benchmark for assessing reasoning, knowledge integration, and guidance sensitivity in scientific domains, offering foundation for future multimodal and retrieval-augmented crystallography systems.
Abstract: We introduce OPENXRD, a comprehensive benchmarking framework for evaluating large language models (LLMs) and multimodal LLMs (MLLMs) in crystallography question answering. The framework measures context assimilation, or how models use fixed, domain-specific supporting information during inference. The framework includes 217 expert-curated X-ray diffraction (XRD) questions covering fundamental to advanced crystallographic concepts, each evaluated under closed-book (without context) and open-book (with context) conditions, where the latter includes concise reference passages generated by GPT-4.5 and refined by crystallography experts. We benchmark 74 state-of-the-art LLMs and MLLMs, including GPT-4, GPT-5, O-series, LLaVA, LLaMA, QWEN, Mistral, and Gemini families, to quantify how different architectures and scales assimilate external knowledge. Results show that mid-sized models (7B–70B parameters) gain the most from contextual materials, while very large models often show saturation or interference and the largest relative gains appear in small and mid-sized models. Expert-reviewed materials provide significantly higher improvements than AI-generated ones even when token counts are matched, confirming that content quality, not quantity, drives performance. OPENXRD offers a reproducible diagnostic benchmark for assessing reasoning, knowledge integration, and guidance sensitivity in scientific domains, and provides a foundation for future multimodal and retrieval-augmented crystallography systems.
[52] AgentCoMa: A Compositional Benchmark Mixing Commonsense and Mathematical Reasoning in Real-World Scenarios
Lisa Alazraki, Lihu Chen, Ana Brassard, Joe Stacey, Hossein A. Rahmani, Marek Rei
Main category: cs.CL
TL;DR: AgentCoMa benchmark tests LLMs on compositional tasks requiring both commonsense and math reasoning, revealing significant performance drop when combining different reasoning types compared to solving steps individually.
Details
Motivation: Current compositional benchmarks focus on either commonsense OR math reasoning, but real-world LLM agents need to combine both. There's a need to test LLMs on mixed-type compositional reasoning to understand their limitations.
Method: Created Agentic Commonsense and Math benchmark (AgentCoMa) with tasks requiring both reasoning types. Tested 61 LLMs of various sizes, families, and training strategies. Conducted interpretability studies examining neuron patterns, attention maps, and membership inference.
Result: LLMs can usually solve both steps in isolation, but accuracy drops by ~30% on average when combining commonsense and math reasoning. This performance gap is substantially greater than in prior compositional benchmarks combining same-type reasoning steps. Non-expert humans solve compositional questions and individual steps with similarly high accuracy.
Conclusion: LLMs show substantial brittleness in mixed-type compositional reasoning, highlighting a key limitation for real-world agent applications. AgentCoMa provides a test bed for future improvements in multimodal reasoning capabilities.
Abstract: Large Language Models (LLMs) have achieved high accuracy on complex commonsense and mathematical problems that involve the composition of multiple reasoning steps. However, current compositional benchmarks testing these skills tend to focus on either commonsense or math reasoning, whereas LLM agents solving real-world tasks would require a combination of both. In this work, we introduce an Agentic Commonsense and Math benchmark (AgentCoMa), where each compositional task requires a commonsense reasoning step and a math reasoning step. We test it on 61 LLMs of different sizes, model families, and training strategies. We find that LLMs can usually solve both steps in isolation, yet their accuracy drops by ~30% on average when the two are combined. This is a substantially greater performance gap than the one we observe in prior compositional benchmarks that combine multiple steps of the same reasoning type. In contrast, non-expert human annotators can solve the compositional questions and the individual steps in AgentCoMa with similarly high accuracy. Furthermore, we conduct a series of interpretability studies to better understand the performance gap, examining neuron patterns, attention maps and membership inference. Our work underscores a substantial degree of model brittleness in the context of mixed-type compositional reasoning and offers a test bed for future improvement.
[53] When Thinking Backfires: Mechanistic Insights Into Reasoning-Induced Misalignment
Hanqi Yan, Hainiu Xu, Siya Qi, Shu Yang, Yulan He
Main category: cs.CL
TL;DR: Reasoning-Induced Misalignment (RIM) is a vulnerability where enhanced reasoning capabilities in LLMs cause safety misalignment, with mechanistic origins in attention heads and neuron entanglement.
Details
Motivation: As large language models become more accessible and widely adopted, concerns about their safety and alignment with human values have become critical. The paper identifies a concerning phenomenon where improved reasoning capabilities can actually lead to safety misalignment.
Method: The authors use representation analysis to study attention heads that facilitate refusal by reducing attention to Chain-of-Thought tokens. They examine activation entanglement between reasoning and safety in safety-critical neurons versus control neurons, particularly after fine-tuning with specific reasoning patterns.
Result: The study reveals that specific attention heads modulate refusal behavior by reducing attention to CoT tokens. During training, safety-critical neurons show significantly higher activation entanglement between reasoning and safety than control neurons, strongly correlating with catastrophic forgetting of safety alignment.
Conclusion: Reasoning-Induced Misalignment is a real vulnerability where enhanced reasoning capabilities can compromise safety alignment. The mechanistic explanation involves attention head behavior and neuron-level entanglement, providing insights into how reasoning patterns can interfere with safety mechanisms in LLMs.
Abstract: With the growing accessibility and wide adoption of large language models, concerns about their safety and alignment with human values have become paramount. In this paper, we identify a concerning phenomenon: Reasoning-Induced Misalignment (RIM), in which misalignment emerges when reasoning capabilities are strengthened, particularly when specific types of reasoning patterns are introduced during inference or training. Beyond reporting this vulnerability, we provide the first mechanistic account of its origins. Through representation analysis, we discover that specific attention heads facilitate refusal by reducing their attention to CoT tokens, a mechanism that modulates the model’s rationalization process during inference. During training, we find significantly higher activation entanglement between reasoning and safety in safety-critical neurons than in control neurons, particularly after fine-tuning with those identified reasoning patterns. This entanglement strongly correlates with catastrophic forgetting, providing a neuron-level explanation for RIM.
[54] SimpleQA Verified: A Reliable Factuality Benchmark to Measure Parametric Knowledge
Lukas Haas, Gal Yona, Giovanni D’Antonio, Sasha Goldshtein, Dipanjan Das
Main category: cs.CL
TL;DR: SimpleQA Verified is a 1,000-prompt benchmark for evaluating LLM factuality that improves upon OpenAI’s SimpleQA by addressing label noise, topical biases, and redundancy through rigorous filtering.
Details
Motivation: OpenAI's SimpleQA benchmark has limitations including noisy/incorrect labels, topical biases, and question redundancy, which undermine its reliability for evaluating LLM factuality and tracking genuine progress in reducing hallucinations.
Method: Created through multi-stage filtering: de-duplication to remove redundant questions, topic balancing to ensure diverse coverage, source reconciliation to verify correctness, and improvements to the autorater prompt for more accurate evaluation.
Result: Gemini 2.5 Pro achieves state-of-the-art F1-score of 55.6 on SimpleQA Verified, outperforming other frontier models including GPT-5. The benchmark provides higher-fidelity evaluation of parametric model factuality.
Conclusion: SimpleQA Verified offers a more reliable and challenging benchmark for tracking progress in LLM factuality and mitigating hallucinations, with publicly available dataset, code, and leaderboard for community use.
Abstract: We introduce SimpleQA Verified, a 1,000-prompt benchmark for evaluating Large Language Model (LLM) short-form factuality based on OpenAI’s SimpleQA. It addresses critical limitations in OpenAI’s benchmark, including noisy and incorrect labels, topical biases, and question redundancy. SimpleQA Verified was created through a rigorous multi-stage filtering process involving de-duplication, topic balancing, and source reconciliation to produce a more reliable and challenging evaluation set, alongside improvements in the autorater prompt. On this new benchmark, Gemini 2.5 Pro achieves a state-of-the-art F1-score of 55.6, outperforming other frontier models, including GPT-5. This work provides the research community with a higher-fidelity tool to track genuine progress in parametric model factuality and to mitigate hallucinations. The benchmark dataset, evaluation code, and leaderboard are available at: https://www.kaggle.com/benchmarks/deepmind/simpleqa-verified.
[55] Quantifying Genuine Awareness in Hallucination Prediction Beyond Question-Side Shortcuts
Yeongbin Seo, Dongha Lee, Jinyoung Yeo
Main category: cs.CL
TL;DR: Proposes AQE methodology to measure how much hallucination detection performance comes from question-side awareness vs genuine model awareness, revealing existing methods heavily rely on benchmark hacking.
Details
Motivation: Existing hallucination detection methods report strong performance, but it's unclear how much comes from genuine model awareness of internal information versus question-side awareness (benchmark hacking). This benchmark hacking doesn't generalize to practical usage but is hard to disentangle.
Method: Proposes Approximate Question-side Effect (AQE) methodology to measure the contribution of question-side awareness to hallucination detection performance without requiring human labor.
Result: Analysis using AQE reveals that existing hallucination detection methods rely heavily on benchmark hacking rather than genuine model awareness.
Conclusion: Current hallucination detection benchmarks are flawed due to question-side awareness effects, and AQE provides a way to measure and address this issue for more reliable evaluation.
Abstract: Many works have proposed methodologies for language model (LM) hallucination detection and reported seemingly strong performance. However, we argue that the reported performance to date reflects not only a model’s genuine awareness of its internal information, but also awareness derived purely from question-side information (e.g., benchmark hacking). While benchmark hacking can be effective for boosting hallucination detection score on existing benchmarks, it does not generalize to out-of-domain settings and practical usage. Nevertheless, disentangling how much of a model’s hallucination detection performance arises from question-side awareness is non-trivial. To address this, we propose a methodology for measuring this effect without requiring human labor, Approximate Question-side Effect (AQE). Our analysis using AQE reveals that existing hallucination detection methods rely heavily on benchmark hacking.
[56] DRBench: A Realistic Benchmark for Enterprise Deep Research
Amirhossein Abaskohi, Tianyi Chen, Miguel Muñoz-Mármol, Curtis Fox, Amrutha Varshini Ramesh, Étienne Marcotte, Xing Han Lù, Nicolas Chapados, Spandana Gella, Peter West, Giuseppe Carenini, Christopher Pal, Alexandre Drouin, Issam H. Laradji
Main category: cs.CL
TL;DR: DRBench is a benchmark for evaluating AI agents on complex, multi-step enterprise research tasks requiring information from both public web and private company knowledge bases.
Details
Motivation: Existing benchmarks focus on simple questions or web-only queries, lacking evaluation of AI agents on realistic enterprise deep research tasks that require synthesizing information from diverse sources including private company data.
Method: Created 100 deep research tasks across 10 enterprise domains using a synthesis pipeline with human-in-the-loop verification. Tasks require agents to search heterogeneous sources (productivity software, cloud files, emails, chats, web) and produce structured reports.
Result: Benchmark demonstrates effectiveness by evaluating diverse deep research agents across open- and closed-source models (GPT, Llama, Qwen), revealing their strengths, weaknesses, and critical paths for advancement.
Conclusion: DRBench provides a comprehensive benchmark for enterprise deep research agents, highlighting current capabilities and future directions for AI agents in complex enterprise settings.
Abstract: We introduce DRBench, a benchmark for evaluating AI agents on complex, open-ended deep research tasks in enterprise settings. Unlike prior benchmarks that focus on simple questions or web-only queries, DRBench evaluates agents on multi-step queries (for example, “What changes should we make to our product roadmap to ensure compliance with this standard?”) that require identifying supporting facts from both the public web and private company knowledge base. Each task is grounded in realistic user personas and enterprise context, spanning a heterogeneous search space that includes productivity software, cloud file systems, emails, chat conversations, and the open web. Tasks are generated through a carefully designed synthesis pipeline with human-in-the-loop verification, and agents are evaluated on their ability to recall relevant insights, maintain factual accuracy, and produce coherent, well-structured reports. We release 100 deep research tasks across 10 domains, such as Sales, Cybersecurity, and Compliance. We demonstrate the effectiveness of DRBench by evaluating diverse DR agents across open- and closed-source models (such as GPT, Llama, and Qwen) and DR strategies, highlighting their strengths, weaknesses, and the critical path for advancing enterprise deep research. Code and data are available at https://github.com/ServiceNow/drbench.
[57] SynthWorlds: Controlled Parallel Worlds for Disentangling Reasoning and Knowledge in Language Models
Ken Gu, Advait Bhat, Mike A Merrill, Robert West, Xin Liu, Daniel McDuff, Tim Althoff
Main category: cs.CL
TL;DR: SynthWorlds framework creates parallel real and synthetic worlds to disentangle reasoning from factual knowledge in language models, enabling precise evaluation of reasoning vs memorization.
Details
Motivation: Current LM evaluation is confounded by extensive parametric world knowledge, making it hard to distinguish genuine reasoning from factual recall. Existing approaches cannot cleanly separate these two aspects.
Method: Constructs parallel corpora with identical structure: real-mapped world (where models can use parametric knowledge) and synthetic-mapped world (where such knowledge is meaningless). Designs mirrored tasks (multi-hop QA and page navigation) with equal reasoning difficulty across both worlds.
Result: Experiments reveal persistent knowledge advantage gap - performance boost models gain from memorized knowledge. Knowledge acquisition and integration mechanisms reduce but don’t eliminate this gap.
Conclusion: SynthWorlds provides controlled environment for precise evaluation of reasoning vs memorization in LMs, enabling testable comparisons that were previously challenging.
Abstract: Evaluating the reasoning ability of language models (LMs) is complicated by their extensive parametric world knowledge, where benchmark performance often reflects factual recall rather than genuine reasoning. Existing datasets and approaches (e.g., temporal filtering, paraphrasing, adversarial substitution) cannot cleanly separate the two. We present SynthWorlds, a framework that disentangles task reasoning complexity from factual knowledge. In SynthWorlds, we construct parallel corpora representing two worlds with identical interconnected structure: a real-mapped world, where models may exploit parametric knowledge, and a synthetic-mapped world, where such knowledge is meaningless. On top of these corpora, we design two mirrored tasks as case studies: multi-hop question answering and page navigation, which maintain equal reasoning difficulty across worlds. Experiments in parametric-only (e.g., closed-book QA) and knowledge-augmented (e.g., retrieval-augmented) LM settings reveal a persistent knowledge advantage gap, defined as the performance boost models gain from memorized parametric world knowledge. Knowledge acquisition and integration mechanisms reduce but do not eliminate this gap, highlighting opportunities for system improvements. Fully automatic and scalable, SynthWorlds provides a controlled environment for evaluating LMs in ways that were previously challenging, enabling precise and testable comparisons of reasoning and memorization.
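The key construction, a synthetic-mapped world that keeps the relational structure of the real-mapped world while making parametric knowledge useless, amounts to a consistent entity substitution over the corpus. A toy sketch (the substitution scheme and names are assumptions; the paper's corpus pipeline is more involved):

```python
import re

def map_to_synthetic(text, entity_map):
    # Consistently replace each real entity with an invented counterpart so
    # the interconnected structure is identical across the two worlds, but
    # memorized facts about the real entities no longer apply.
    keys = sorted(entity_map, key=len, reverse=True)   # longest match first
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: entity_map[m.group(0)], text)
```

The knowledge advantage gap is then simply a model's score on the real-mapped corpus minus its score on the synthetic-mapped one, since the reasoning difficulty is held equal by construction.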
[58] Automatic Paper Reviewing with Heterogeneous Graph Reasoning over LLM-Simulated Reviewer-Author Debates
Shuaimin Li, Liyang Fan, Yufang Lin, Zeyang Li, Xian Wei, Shiwen Ni, Hamid Alinejad-Rokny, Min Yang
Main category: cs.CL
TL;DR: ReViewGraph is a novel framework that uses heterogeneous graph reasoning over LLM-simulated multi-round reviewer-author debates to improve paper review quality by capturing complex argumentative dynamics.
Details
Motivation: Existing paper review methods have limitations: they rely on superficial manuscript features or LLMs directly, leading to hallucinations, biased scoring, limited reasoning capabilities, and failure to capture complex argumentative reasoning and negotiation dynamics in reviewer-author interactions.
Method: Proposes ReViewGraph framework that: 1) Simulates reviewer-author exchanges through LLM-based multi-agent collaboration, 2) Extracts diverse opinion relations (acceptance, rejection, clarification, compromise) as typed edges in a heterogeneous interaction graph, 3) Applies graph neural networks to reason over these structured debate graphs.
Result: Extensive experiments on three datasets show ReViewGraph outperforms strong baselines with an average relative improvement of 15.73%, demonstrating the value of modeling detailed reviewer-author debate structures.
Conclusion: The ReViewGraph framework successfully addresses limitations of existing review methods by capturing fine-grained argumentative dynamics through graph reasoning over simulated debates, leading to more informed review decisions.
Abstract: Existing paper review methods often rely on superficial manuscript features or directly on large language models (LLMs), which are prone to hallucinations, biased scoring, and limited reasoning capabilities. Moreover, these methods often fail to capture the complex argumentative reasoning and negotiation dynamics inherent in reviewer-author interactions. To address these limitations, we propose ReViewGraph (Reviewer-Author Debates Graph Reasoner), a novel framework that performs heterogeneous graph reasoning over LLM-simulated multi-round reviewer-author debates. In our approach, reviewer-author exchanges are simulated through LLM-based multi-agent collaboration. Diverse opinion relations (e.g., acceptance, rejection, clarification, and compromise) are then explicitly extracted and encoded as typed edges within a heterogeneous interaction graph. By applying graph neural networks to reason over these structured debate graphs, ReViewGraph captures fine-grained argumentative dynamics and enables more informed review decisions. Extensive experiments on three datasets demonstrate that ReViewGraph outperforms strong baselines with an average relative improvement of 15.73%, underscoring the value of modeling detailed reviewer-author debate structures.
[59] PRISM of Opinions: A Persona-Reasoned Multimodal Framework for User-centric Conversational Stance Detection
Bingbing Wang, Zhixin Bai, Zhengda Jin, Zihan Wang, Xintong Song, Jingjie Lin, Sixuan Li, Jing Li, Ruifeng Xu
Main category: cs.CL
TL;DR: PRISM is a persona-reasoned multimodal stance detection model that addresses pseudo-multimodality and user homogeneity issues in social media conversations by incorporating user personas and aligning multimodal cues.
Details
Motivation: Existing multimodal conversational stance detection suffers from pseudo-multimodality (visual cues only in source posts, not comments) and user homogeneity (treating diverse users uniformly), limiting real-world applicability.
Method: PRISM derives longitudinal user personas from historical posts/comments, aligns textual and visual cues via Chain-of-Thought reasoning, and uses mutual task reinforcement to jointly optimize stance detection and stance-aware response generation.
Result: Experiments on the new U-MStance dataset (40k+ annotated comments across 6 targets) show PRISM achieves significant gains over strong baselines in multimodal conversational stance detection.
Conclusion: User-centric and context-grounded multimodal reasoning is effective for realistic stance understanding in social media conversations, addressing key limitations of existing approaches.
Abstract: The rapid proliferation of multimodal social media content has driven research in Multimodal Conversational Stance Detection (MCSD), which aims to interpret users’ attitudes toward specific targets within complex discussions. However, existing studies remain limited by: 1) pseudo-multimodality, where visual cues appear only in source posts while comments are treated as text-only, misaligning with real-world multimodal interactions; and 2) user homogeneity, where diverse users are treated uniformly, neglecting personal traits that shape stance expression. To address these issues, we introduce U-MStance, the first user-centric MCSD dataset, containing over 40k annotated comments across six real-world targets. We further propose PRISM, a Persona-Reasoned multImodal Stance Model for MCSD. PRISM first derives longitudinal user personas from historical posts and comments to capture individual traits, then aligns textual and visual cues within conversational context via Chain-of-Thought to bridge semantic and pragmatic gaps across modalities. Finally, a mutual task reinforcement mechanism is employed to jointly optimize stance detection and stance-aware response generation for bidirectional knowledge transfer. Experiments on U-MStance demonstrate that PRISM yields significant gains over strong baselines, underscoring the effectiveness of user-centric and context-grounded multimodal reasoning for realistic stance understanding.
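The persona idea can be illustrated with a toy sketch: a "longitudinal persona" reduced to per-target stance counts from a user's history, yielding a stance prior for new comments. Everything here (the tuple format, the `persona_prior` helper) is hypothetical; PRISM derives personas with an LLM rather than counts.

```python
from collections import Counter

def derive_persona(history):
    """Toy longitudinal persona: per-(target, stance) counts
    aggregated from a user's past interactions."""
    persona = Counter()
    for target, stance in history:
        persona[(target, stance)] += 1
    return persona

def persona_prior(persona, target):
    """Prior over stances toward a target, normalized from the persona counts."""
    total = sum(c for (t, _), c in persona.items() if t == target)
    if total == 0:
        return {}
    return {s: c / total for (t, s), c in persona.items() if t == target}

history = [("vaccines", "favor"), ("vaccines", "favor"), ("vaccines", "against")]
prior = persona_prior(derive_persona(history), "vaccines")
```

The point of the sketch is only that individual traits give a user-specific prior, which a uniform treatment of users discards.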
[60] From Veracity to Diffusion: Addressing Operational Challenges in Moving From Fake-News Detection to Information Disorders
Francesco Paolo Savatteri, Chahan Vidal-Gorène, Florian Cafiero
Main category: cs.CL
TL;DR: Comparing fake-news detection vs. virality prediction across datasets, showing fake-news detection is stable with good text embeddings while virality prediction is sensitive to operational choices like thresholds and observation windows.
Details
Motivation: Research on misinformation has focused on fake-news detection (veracity prediction), but social science shows information manipulation often involves amplification dynamics. The paper examines what changes empirically when prediction targets shift from veracity to diffusion, and what performance can be achieved with limited resources.
Method: Comparative analysis across two datasets (EVONS and FakeNewsNet) using an evaluation-first perspective. Examines how benchmark behavior changes when the prediction target shifts from veracity to diffusion. Tests lightweight, transparent pipelines for misinformation-related prediction tasks.
Result: Fake-news detection is comparatively stable once strong textual embeddings are available, whereas virality prediction is much more sensitive to operational choices such as threshold definition and early observation windows.
Conclusion: The paper proposes practical ways to operationalize lightweight, transparent pipelines for misinformation-related prediction tasks that can rival state-of-the-art approaches, highlighting the different challenges between veracity and diffusion prediction.
Abstract: A wide swath of research on misinformation has relied on fake-news detection, a task framed as the prediction of veracity labels attached to articles or claims. Yet social-science research has repeatedly emphasized that information manipulation goes beyond fabricated content and often relies on amplification dynamics. This theoretical turn has consequences for operationalization in applied social science research. What changes empirically when prediction targets move from veracity to diffusion? And what performance level can be attained in limited-resource setups? In this paper we compare fake-news detection and virality prediction across two datasets, EVONS and FakeNewsNet. We adopt an evaluation-first perspective and examine how benchmark behavior changes when the prediction target shifts from veracity to diffusion. Our experiments show that fake-news detection is comparatively stable once strong textual embeddings are available, whereas virality prediction is much more sensitive to operational choices such as threshold definition and early observation windows. The paper proposes practical ways to operationalize lightweight, transparent pipelines for misinformation-related prediction tasks that can rival the state of the art.
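The threshold sensitivity reported for virality prediction is easy to demonstrate: binarizing the same share counts under two plausible thresholds yields different label sets and positive rates. The counts and thresholds below are invented for illustration, not taken from EVONS or FakeNewsNet.

```python
def virality_labels(share_counts, threshold):
    """Binarize diffusion: 1 if shares reach the threshold within the observation window."""
    return [int(s >= threshold) for s in share_counts]

shares = [3, 10, 250, 1200, 40, 7]          # hypothetical per-article share counts
low = virality_labels(shares, 100)          # permissive threshold
high = virality_labels(shares, 1000)        # strict threshold
positive_rate_low = sum(low) / len(low)
positive_rate_high = sum(high) / len(high)
```

The same corpus produces twice as many "viral" positives under the permissive threshold, so any benchmark score depends on this operational choice in a way veracity labels do not.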
[61] DEER: A Benchmark for Evaluating Deep Research Agents on Expert Report Generation
Janghoon Han, Heegyu Kim, Changho Lee, Dahm Lee, Min Hyung Park, Hosung Song, Stanley Jungkyu Choi, Moontae Lee, Honglak Lee
Main category: cs.CL
TL;DR: DEER is a benchmark for evaluating expert-level deep research reports generated by LLMs, featuring a comprehensive taxonomy and claim verification system.
Details
Motivation: Current evaluation of LLM-generated expert reports is challenging due to multifaceted quality criteria, difficulty in identifying domain-specific errors, and the need for claim verification across retrieved evidence.
Method: Proposes DEER benchmark with expert-developed taxonomy (7 dimensions, 25 subdimensions, 101 rubric items), task-specific Expert Evaluation Guidance for LLM-based judging, and a claim verification architecture that verifies both cited and uncited claims while quantifying evidence quality.
Result: Experiments show current deep research systems produce structurally plausible reports with external citations but need improvement in fulfilling expert-level requests and achieving logical completeness. DEER provides interpretable strengths/limitations and diagnostic signals.
Conclusion: DEER addresses critical evaluation gaps for expert-level deep research systems, enabling systematic assessment and providing actionable insights for improvement beyond simple performance comparisons.
Abstract: Recent advances in large language models have enabled deep research systems that generate expert-level reports through multi-step reasoning and evidence-based synthesis. However, evaluating such reports remains challenging: report quality is multifaceted, making it difficult to determine what to assess and by what criteria; LLM-based judges may miss errors that require domain expertise to identify; and because deep research relies on retrieved evidence, report-wide claim verification is also necessary. To address these issues, we propose DEER, a benchmark for evaluating expert-level deep research reports. DEER systematizes evaluation criteria with an expert-developed taxonomy (7 dimensions, 25 subdimensions) operationalized as 101 fine-grained rubric items. We also provide task-specific Expert Evaluation Guidance to support LLM-based judging. Alongside rubric-based assessment, we propose a claim verification architecture that verifies both cited and uncited claims and quantifies evidence quality. Experiments show that while current deep research systems can produce structurally plausible reports that cite external evidence, there is room for improvement in fulfilling expert-level user requests and achieving logical completeness. Beyond simple performance comparisons, DEER makes system strengths and limitations interpretable and provides diagnostic signals for improvement.
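Rubric-based scoring of the kind DEER describes can be sketched as a two-level aggregation: item scores average into dimension scores, which average into an overall score. The taxonomy and item names below are made up for illustration; DEER's real taxonomy has 7 dimensions, 25 subdimensions, and 101 items.

```python
def score_report(rubric_scores, taxonomy):
    """Average per-item scores (0-1) into dimension scores, then into one overall score."""
    dims = {}
    for dim, items in taxonomy.items():
        vals = [rubric_scores[i] for i in items if i in rubric_scores]
        dims[dim] = sum(vals) / len(vals) if vals else 0.0
    overall = sum(dims.values()) / len(dims)
    return dims, overall

# Hypothetical two-dimension taxonomy with two rubric items each.
taxonomy = {"evidence_quality": ["cites_sources", "claims_verified"],
            "logical_completeness": ["covers_request", "coherent_structure"]}
scores = {"cites_sources": 1.0, "claims_verified": 0.5,
          "covers_request": 1.0, "coherent_structure": 1.0}
dims, overall = score_report(scores, taxonomy)
```

Keeping the dimension-level scores, rather than only the overall number, is what makes strengths and limitations interpretable as diagnostic signals.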
[62] CRANE: Causal Relevance Analysis of Language-Specific Neurons in Multilingual Large Language Models
Yifan Le, Yunliang Li
Main category: cs.CL
TL;DR: CRANE is a relevance-based framework for analyzing language-specific neurons in multilingual LLMs through targeted neuron interventions, revealing asymmetric language specialization patterns.
Details
Motivation: Current methods for identifying language-related neurons in multilingual LLMs rely on activation-based heuristics that conflate language preference with functional importance, creating a need for more precise analysis of how language capabilities are organized at the neuron level.
Method: CRANE uses relevance-based analysis with targeted neuron-level interventions to identify language-specific neurons based on their functional necessity rather than activation magnitude. It characterizes neuron specialization by their contribution to language-conditioned predictions through masking experiments.
Result: The framework reveals consistent asymmetric patterns: masking neurons relevant to a target language selectively degrades performance on that language while largely preserving performance on other languages, indicating language-selective but non-exclusive neuron specializations. CRANE isolates language-specific components more precisely than activation-based methods across English, Chinese, and Vietnamese benchmarks.
Conclusion: CRANE provides a more accurate framework for understanding language organization in multilingual LLMs by focusing on functional necessity rather than activation patterns, offering insights into how language capabilities are structured at the neuron level.
Abstract: Multilingual large language models (LLMs) achieve strong performance across languages, yet how language capabilities are organized at the neuron level remains poorly understood. Prior work has identified language-related neurons mainly through activation-based heuristics, which conflate language preference with functional importance. We propose CRANE, a relevance-based analysis framework that redefines language specificity in terms of functional necessity, identifying language-specific neurons through targeted neuron-level interventions. CRANE characterizes neuron specialization by their contribution to language-conditioned predictions rather than activation magnitude. Our implementation will be made publicly available. Neuron-level interventions reveal a consistent asymmetric pattern: masking neurons relevant to a target language selectively degrades performance on that language while preserving performance on other languages to a substantial extent, indicating language-selective but non-exclusive neuron specializations. Experiments on English, Chinese, and Vietnamese across multiple benchmarks, together with a dedicated relevance-based metric and base-to-chat model transfer analysis, show that CRANE isolates language-specific components more precisely than activation-based methods.
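The core intervention, zero-ablating neurons and measuring the resulting performance drop rather than activation magnitude, can be sketched as follows. The activation matrix and the "language score" are toy stand-ins, not CRANE's actual relevance metric.

```python
def mask_neurons(activations, neuron_idx):
    """Zero-ablate the selected neuron columns (a targeted intervention)."""
    idx = set(neuron_idx)
    return [[0.0 if j in idx else v for j, v in enumerate(row)] for row in activations]

def relevance(score_fn, activations, neuron_idx):
    """Functional relevance: performance drop caused by masking,
    rather than raw activation magnitude."""
    return score_fn(activations) - score_fn(mask_neurons(activations, neuron_idx))

# Toy "language score": mean activation of neurons 0-1, standing in for
# neurons hypothetically tied to one language.
acts = [[0.9, 0.8, 0.1, 0.0], [0.7, 1.0, 0.0, 0.2]]
score = lambda a: sum(row[0] + row[1] for row in a) / (2 * len(a))
drop_relevant = relevance(score, acts, [0, 1])    # masking "its own" neurons: large drop
drop_irrelevant = relevance(score, acts, [2, 3])  # masking unrelated neurons: no drop
```

The asymmetry the paper reports corresponds to `drop_relevant` being large while `drop_irrelevant` stays near zero when the masked neurons belong to another language.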
[63] EVM-QuestBench: An Execution-Grounded Benchmark for Natural-Language Transaction Code Generation
Pei Yang, Wanyi Chen, Ke Wang, Lynn Ai, Eric Yang, Tianyu Shi
Main category: cs.CL
TL;DR: EVM-QuestBench is an execution-grounded benchmark for evaluating natural-language transaction-script generation on EVM-compatible chains, focusing on execution accuracy and safety in blockchain development scenarios.
Details
Motivation: Existing evaluations for LLMs in blockchain development overlook execution accuracy and safety, which is critical in on-chain transaction scenarios where minor errors can cause irreversible losses for users.
Method: The benchmark uses dynamic evaluation with instructions sampled from template pools, numeric parameters from predefined intervals, and validators to verify outcomes. It contains 107 tasks (62 atomic, 45 composite) with modular architecture for rapid task development. The runner executes scripts on a forked EVM chain with snapshot isolation, and composite tasks apply step-efficiency decay.
Result: Evaluation of 20 models reveals large performance gaps, with split scores showing persistent asymmetry between single-action precision and multi-step workflow completion.
Conclusion: EVM-QuestBench addresses critical gaps in evaluating LLMs for blockchain development by focusing on execution accuracy and safety, revealing significant performance disparities in transaction-script generation capabilities.
Abstract: Large language models are increasingly applied to various development scenarios. However, in on-chain transaction scenarios, even a minor error can cause irreversible loss for users. Existing evaluations often overlook execution accuracy and safety. We introduce EVM-QuestBench, an execution-grounded benchmark for natural-language transaction-script generation on EVM-compatible chains. The benchmark employs dynamic evaluation: instructions are sampled from template pools, numeric parameters are drawn from predefined intervals, and validators verify outcomes against these instantiated values. EVM-QuestBench contains 107 tasks (62 atomic, 45 composite). Its modular architecture enables rapid task development. The runner executes scripts on a forked EVM chain with snapshot isolation; composite tasks apply step-efficiency decay. We evaluate 20 models and find large performance gaps, with split scores revealing persistent asymmetry between single-action precision and multi-step workflow completion. Code: https://anonymous.4open.science/r/bsc_quest_bench-A9CF/.
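The dynamic-evaluation loop (sample a templated instruction, instantiate its parameters, then validate the observed outcome against those instantiated values) can be sketched as below. The templates, parameter intervals, and addresses are invented; they are not from the benchmark, which validates real executions on a forked EVM chain.

```python
import random

# Hypothetical template pool and parameter intervals.
TEMPLATES = ["Transfer {amount} tokens to {to}",
             "Approve {to} to spend {amount} tokens"]
PARAMS = {"amount": (1, 1000), "to": ["0xAlice", "0xBob"]}

def sample_task(rng):
    """Instantiate one instruction plus the ground-truth values a validator checks."""
    amount = rng.randint(*PARAMS["amount"])
    to = rng.choice(PARAMS["to"])
    instruction = rng.choice(TEMPLATES).format(amount=amount, to=to)
    return instruction, {"amount": amount, "to": to}

def validate(expected, observed):
    """Validator: compare the observed outcome against the instantiated values."""
    return (observed.get("to") == expected["to"]
            and observed.get("amount") == expected["amount"])

rng = random.Random(7)
instruction, expected = sample_task(rng)
ok = validate(expected, {"to": expected["to"], "amount": expected["amount"]})
bad = validate(expected, {"to": expected["to"], "amount": expected["amount"] + 1})
```

Because parameters are re-sampled per run, a model cannot pass by memorizing fixed answers; it must produce a script whose executed effect matches the instantiated instruction.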
[64] Pretraining with Token-Level Adaptive Latent Chain-of-Thought
Boyi Zeng, Yiqin Hao, He Li, Shixiang Song, Feichen Song, Zitong Wang, Siyuan Huang, Yi Xu, ZiWei He, Xinbing Wang, Zhouhan Lin
Main category: cs.CL
TL;DR: Pretraining with token-level adaptive latent Chain-of-Thought improves language modeling by generating variable-length internal reasoning trajectories before each token, allocating more computation to difficult tokens without increasing parameters.
Details
Motivation: Scaling LLMs through parameter/data scaling faces constraints from limited high-quality corpora and high communication costs. The paper explores increasing per-token computation without expanding parameters by internalizing latent reasoning processes.
Method: Proposes Pretraining with Token-Level Adaptive Latent CoT, where models generate variable-length latent CoT trajectories before emitting each token. The model learns to allocate longer trajectories to difficult tokens and shorter/zero trajectories to easy ones through one-stage pretraining on general text, enabling token-wise adaptive halting.
Result: Experiments with Llama architectures show adaptive latent CoT consistently improves language modeling perplexity and broad downstream accuracy, even with fewer training FLOPs than prior recurrent baselines.
Conclusion: Increasing per-token computation through adaptive latent CoT is an effective alternative to parameter scaling, improving model performance while reducing computational costs in both training and inference.
Abstract: Scaling large language models by increasing parameters and training data is increasingly constrained by limited high-quality corpora and rising communication costs. This work explores an alternative axis: increasing per-token computation without expanding parameters, by internalizing latent Chain-of-Thought (CoT) into pretraining. We propose Pretraining with Token-Level Adaptive Latent CoT (adaptive latent CoT), where the model generates a variable-length latent CoT trajectory before emitting each token – allocating longer trajectories to difficult tokens and shorter (or even zero) trajectories to easy ones. Importantly, this behavior emerges naturally from one-stage pretraining on general text and reduces computation in both training and inference via token-wise adaptive halting. Experiments with Llama architectures show that adaptive latent CoT consistently improves language modeling perplexity and broad downstream accuracy, even with fewer training FLOPs than prior recurrent baselines.
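Token-wise adaptive halting can be caricatured with a scalar "difficulty" that each latent step halves: easy tokens halt immediately (zero latent steps) while hard tokens take several. The halting rule and constants below are illustrative only; the paper learns this behavior during one-stage pretraining rather than hard-coding it.

```python
def latent_steps_for_token(difficulty, halt_threshold=0.8, max_steps=8):
    """Toy token-wise halting: keep taking latent steps while the halt score
    (here simply 1 - remaining difficulty) is below the threshold."""
    steps = 0
    remaining = difficulty
    while steps < max_steps and (1.0 - remaining) < halt_threshold:
        remaining *= 0.5  # pretend each latent step resolves half the difficulty
        steps += 1
    return steps

easy = latent_steps_for_token(0.1)  # halts before any latent step
hard = latent_steps_for_token(0.9)  # needs several latent steps
```

The per-token compute then varies with `steps`, which is the mechanism that lets the model spend more FLOPs on difficult tokens without adding parameters.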
[65] Query-focused and Memory-aware Reranker for Long Context Processing
Yuqing Li, Jiangnan Li, Mo Yu, Guoxuan Ding, Zheng Lin, Weiping Wang, Jie Zhou
Main category: cs.CL
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Abstract: Failed to fetch summary for 2602.12192: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.12192&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[66] Missing-by-Design: Certifiable Modality Deletion for Revocable Multimodal Sentiment Analysis
Rong Fu, Ziming Wang, Chunlei Meng, Jiaxuan Lu, Jiekai Wu, Kangan Qian, Hao Zhang, Simon Fong
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Abstract: Failed to fetch summary for 2602.16144: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.16144&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[67] AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors
Abhay Sheshadri, Aidan Ewart, Kai Fronsdal, Isha Gupta, Samuel R. Bowman, Sara Price, Samuel Marks, Rowan Wang
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) when querying arXiv API for paper ID 2602.22755
Abstract: Failed to fetch summary for 2602.22755: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.22755&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[68] SkillCraft: Can LLM Agents Learn to Use Tools Skillfully?
Shiqi Chen, Jingze Gai, Ruochen Zhou, Jinghan Zhang, Tongyao Zhu, Junlong Li, Kangrui Wang, Zihan Wang, Zhengyu Chen, Klara Kaleb, Ning Miao, Siyang Gao, Cong Lu, Manling Li, Junxian He, Yee Whye Teh
Main category: cs.CL
TL;DR: Paper 2603.00718: Failed to fetch summary due to HTTP 429 error (rate limiting)
Abstract: Failed to fetch summary for 2603.00718: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.00718&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[69] PonderLM-3: Adaptive Token-Wise Pondering with Differentiable Masking
He Li, Feichen Song, Boyi Zeng, Shixiang Song, Zhiqin John Xu, Ziwei He, Zhouhan Lin
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Abstract: Failed to fetch summary for 2603.02023: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.02023&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[70] Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought
Siddharth Boppana, Annabel Ma, Max Loeffler, Raphael Sarfati, Eric Bigelow, Atticus Geiger, Owen Lewis, Jack Merullo
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Abstract: Failed to fetch summary for 2603.05488: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.05488&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[71] Towards Robust Retrieval-Augmented Generation Based on Knowledge Graph: A Comparative Analysis
Hazem Amamou, Stéphane Gagnon, Alan Davoust, Anderson R. Avila
Main category: cs.CL
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Abstract: Failed to fetch summary for 2603.05698: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.05698&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[72] A Causal Graph Approach to Oppositional Narrative Analysis
Diego Revilla, Martin Fernandez-de-Retana, Lingfeng Chen, Aritz Bilbao-Jayo, Miguel Fernandez-de-Retana
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Abstract: Failed to fetch summary for 2603.06135: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.06135&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[73] TableMind++: An Uncertainty-Aware Programmatic Agent for Tool-Augmented Table Reasoning
Mingyue Cheng, Shuo Yu, Chuang Jiang, Xiaoyu Tao, Qingyang Mao, Jie Ouyang, Qi Liu, Enhong Chen
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Abstract: Failed to fetch summary for 2603.07528: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.07528&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[74] Adaptive Loops and Memory in Transformers: Think Harder or Know More?
Markus Frey, Behzad Shomali, Ali Hamza Bashir, David Berghaus, Mehdi Ali
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Abstract: Failed to fetch summary for 2603.08391: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.08391&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[75] Fanar-Sadiq: A Multi-Agent Architecture for Grounded Islamic QA
Ummar Abbas, Mourad Ouzzani, Mohamed Y. Eltabakh, Omar Sinan, Gagan Bhatia, Hamdy Mubarak, Majd Hawasly, Mohammed Qusay Hashim, Kareem Darwish, Firoj Alam
Main category: cs.CL
TL;DR: Unable to analyze paper 2603.08501 due to HTTP 429 error when fetching abstract from arXiv API
Abstract: Failed to fetch summary for 2603.08501: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.08501&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[76] Image Captioning via Compact Bidirectional Architecture
Zijie Song, Yuanen Zhou, Zhenzhen Hu, Daqing Liu, Huixia Ben, Richang Hong, Meng Wang
Main category: cs.CL
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Abstract: Failed to fetch summary for 2201.01984: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2201.01984&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[77] Robust Training of Neural Networks at Arbitrary Precision and Sparsity
Chengxi Ye, Grace Chu, Yanfeng Liu, Yichi Zhang, Lukasz Lew, Li Zhang, Mark Sandler, Andrew Howard
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Abstract: Failed to fetch summary for 2409.09245: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2409.09245&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[78] Stepwise Guided Policy Optimization: Coloring your Incorrect Reasoning in GRPO
Peter Chen, Xiaopeng Li, Ziniu Li, Xi Chen, Tianyi Lin
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Abstract: Failed to fetch summary for 2505.11595: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.11595&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[79] ThinkQE: Query Expansion via an Evolving Thinking Process
Yibin Lei, Tao Shen, Andrew Yates
Main category: cs.CL
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Abstract: Failed to fetch summary for 2506.09260: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.09260&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[80] TaoSR1: The Thinking Model for E-commerce Relevance Search
Chenhe Dong, Shaowei Yao, Pengkun Jiao, Jianhui Yang, Yiming Jin, Zerui Huang, Xiaojiang Zhou, Dan Ou, Haihong Tang, Bo Zheng
Main category: cs.CL
TL;DR: Paper 2508.12365: Unable to fetch abstract due to HTTP 429 error (rate limiting)
Abstract: Failed to fetch summary for 2508.12365: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.12365&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[81] Reasoning Efficiently Through Adaptive Chain-of-Thought Compression: A Self-Optimizing Framework
Kerui Huang, Shuhan Liu, Xing Hu, Tongtong Xu, Lingfeng Bao, Xin Xia
Main category: cs.CL
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Abstract: Failed to fetch summary for 2509.14093: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.14093&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[82] v-HUB: A Benchmark for Video Humor Understanding from Vision and Sound
Zhengpeng Shi, Yanpeng Zhao, Jianqun Zhou, Yuxuan Wang, Qinrong Cui, Wei Bi, Songchun Zhu, Bo Zhao, Zilong Zheng
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Abstract: Failed to fetch summary for 2509.25773: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.25773&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[83] NavSpace: How Navigation Agents Follow Spatial Intelligence Instructions
Haolin Yang, Yuxing Long, Zhuoyuan Yu, Zihan Yang, Minghan Wang, Jiapeng Xu, Yihan Wang, Ziyan Yu, Wenzhe Cai, Lei Kang, Hao Dong
Main category: cs.CL
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Abstract: Failed to fetch summary for 2510.08173: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.08173&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[84] Does Scientific Writing Converge to U.S. English? Evidence from Generative AI-Assisted Publications
Dragan Filimonovic, Christian Rutzer, Jeffrey Macher, Rolf Weder
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Abstract: Failed to fetch summary for 2511.11687: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.11687&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[85] Enhancing Retrieval-Augmented Generation with Entity Linking for Educational Platforms
Francesco Granata, Francesco Poggi, Misael Mongiovì
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2512.05967 was rate-limited (HTTP 429).
[86] Skip to the Good Part: Representation Structure & Inference-Time Layer Skipping in Diffusion vs. Autoregressive LLMs
Raghavv Goel, Risheek Garrepalli, Sudhanshu Agrawal, Chris Lott, Mingu Lee, Fatih Porikli
Main category: cs.CL
TL;DR: Diffusion language models develop different representational structures than autoregressive models, with more hierarchical abstractions and early-layer redundancy, enabling efficient layer-skipping inference.
Details
Motivation: To understand how diffusion training objectives fundamentally reshape internal representations compared to autoregressive models, and whether these differences enable practical efficiency improvements.
Method: Layer- and token-wise representational analysis comparing native diffusion LLMs (LLaDA), native AR models (Qwen2.5), and AR-initialized diffusion models (Dream-7B), followed by development of a static, task-agnostic inference-time layer-skipping method.
Result: Diffusion objectives create more hierarchical abstractions with early-layer redundancy and reduced recency bias, while AR models produce depth-dependent representations. AR-initialized diffusion models retain AR-like dynamics. Native diffusion models achieve up to 18.75% FLOPs reduction with >90% performance preservation via layer-skipping.
Conclusion: Training objectives fundamentally shape representational structure, with diffusion models enabling practical, cache-orthogonal efficiency gains through their hierarchical redundancy, while AR models show persistent initialization bias.
Abstract: Autoregressive (AR) language models form representations incrementally through left-to-right prediction, whereas diffusion language models (dLLMs) are trained via full-sequence denoising. Although recent dLLMs match AR performance, it remains unclear whether diffusion objectives fundamentally reshape internal representations across depth. We perform the first layer- and token-wise representational analysis comparing native dLLMs (LLaDA), native AR models (Qwen2.5), and AR-initialized dLLMs (Dream-7B). We find that diffusion objectives result in different, more hierarchical abstractions with substantial early-layer redundancy and reduced recency bias, while AR objectives produce tightly coupled, depth-dependent representations. Critically, AR-initialized dLLMs retain AR-like representational dynamics despite diffusion training, revealing persistent initialization bias. Leveraging this observed representational redundancy, we introduce a static, task-agnostic inference-time layer-skipping method requiring no architectural changes or KV-cache sharing. Native dLLMs achieve up to 18.75% FLOPs reduction while preserving over 90% performance on reasoning and code generation benchmarks, whereas AR models degrade sharply under comparable skipping. These results link training objectives to representational structure and enable practical, cache-orthogonal efficiency gains.
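The static layer-skipping idea above can be illustrated with a toy residual stack: a fixed, task-agnostic set of layer indices is simply bypassed at inference, letting the residual stream pass through unchanged. This is a minimal sketch of the general technique, not the paper's implementation; the block shapes, skip set, and FLOP accounting are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, DEPTH = 16, 8

# Toy residual "blocks" (x -> x + tanh(x @ W)) standing in for transformer layers.
weights = [rng.standard_normal((DIM, DIM)) * 0.1 for _ in range(DEPTH)]

def forward(x, skip=frozenset()):
    """Run the stack, statically skipping a fixed, task-agnostic set of layer indices."""
    flops = 0
    for i, W in enumerate(weights):
        if i in skip:
            continue  # the residual stream passes through unchanged
        x = x + np.tanh(x @ W)
        flops += 2 * DIM * DIM  # rough per-token matmul cost of one block
    return x, flops

x0 = rng.standard_normal((4, DIM))
full, f_full = forward(x0)
skipped, f_skip = forward(x0, skip={1, 2})  # drop 2 of 8 (hypothetically redundant) early layers

print(f"FLOPs reduction: {1 - f_skip / f_full:.2%}")  # → FLOPs reduction: 25.00%
```

The finding that early diffusion-LLM layers are redundant is what makes such a static skip set viable there, while AR models degrade under the same treatment.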
[87] Rewards as Labels: Revisiting RLVR from a Classification Perspective
Zepeng Zhai, Meilin Chen, Jiaxuan Zhao, Junlang Qian, Lei Shen, Yuan Lu
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2602.05630 was rate-limited (HTTP 429).
[88] SlowBA: An efficiency backdoor attack towards VLM-based GUI agents
Junxian Li, Tu Lan, Haozhen Tan, Yan Meng, Haojin Zhu
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2603.08316 was rate-limited (HTTP 429).
[89] A prospective clinical feasibility study of a conversational diagnostic AI in an ambulatory primary care clinic
Peter Brodeur, Jacob M. Koshy, Anil Palepu, Khaled Saab, Ava Homiar, Roma Ruparel, Charles Wu, Ryutaro Tanno, Joseph Xu, Amy Wang, David Stutz, Hannah M. Ferrera, David Barrett, Lindsey Crowley, Jihyeon Lee, Spencer E. Rittner, Ellery Wulczyn, Selena K. Zhang, Elahe Vedadi, Christine G. Kohn, Kavita Kulkarni, Vinay Kadiyala, Sara Mahdavi, Wendy Du, Jessica Williams, David Feinbloom, Renee Wong, Tao Tu, Petar Sirkovic, Alessio Orlandi, Christopher Semturs, Yun Liu, Juraj Gottweis, Dale R. Webster, Joëlle Barral, Katherine Chou, Pushmeet Kohli, Avinatan Hassidim, Yossi Matias, James Manyika, Rob Fields, Jonathan X. Li, Marc L. Cohen, Vivek Natarajan, Mike Schaekermann, Alan Karthikesalingam, Adam Rodman
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2603.08448 was rate-limited (HTTP 429).
cs.CV
[90] Granulon: Awakening Pixel-Level Visual Encoders with Adaptive Multi-Granularity Semantics for MLLM
Junyuan Mao, Qiankun Li, Linghao Meng, Zhicheng He, Xinliang Zhou, Kun Wang, Yang Liu, Yueming Jin
Main category: cs.CV
TL;DR: Granulon is a DINOv3-based multimodal LLM with adaptive granularity augmentation that dynamically adjusts visual abstraction levels based on text input for better fine-grained understanding.
Details
Motivation: Current MLLMs rely on CLIP-based visual encoders that focus on global semantic alignment but struggle with fine-grained visual understanding, while DINOv3 provides good pixel-level perception but lacks coarse-grained semantic abstraction, creating a gap in multi-granularity reasoning.
Method: Proposes Granulon with: 1) a text-conditioned granularity Controller that dynamically adjusts the visual abstraction level based on the textual input's semantic scope, and 2) an Adaptive Token Aggregation module that performs granularity-guided pooling and relation-aware clustering to produce compact, semantically rich visual tokens.
Result: Extensive experiments show Granulon improves accuracy by ~30% and reduces hallucination by ~20%, outperforming all visual encoders under identical settings.
Conclusion: Granulon enables unified “pixel-to-fine-to-coarse” reasoning within a single forward pass, addressing the multi-granularity gap in current MLLMs.
Abstract: Recent advances in multimodal large language models largely rely on CLIP-based visual encoders, which emphasize global semantic alignment but struggle with fine-grained visual understanding. In contrast, DINOv3 provides strong pixel-level perception yet lacks coarse-grained semantic abstraction, leading to limited multi-granularity reasoning. To address this gap, we propose Granulon, a novel DINOv3-based MLLM with adaptive granularity augmentation. Granulon introduces a text-conditioned granularity Controller that dynamically adjusts the visual abstraction level according to the semantic scope of the textual input, and an Adaptive Token Aggregation module that performs granularity-guided pooling and relation-aware clustering to produce compact, semantically rich visual tokens. This design enables unified “pixel-to-fine-to-coarse” reasoning within a single forward pass. Extensive and interpretable experiments demonstrate that Granulon improves accuracy by ~30% and reduces hallucination by ~20%, outperforming all visual encoders under identical settings.
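The granularity-guided pooling at the heart of the token-aggregation idea can be sketched as window-averaging over a grid of visual tokens, with the window size set by the requested abstraction level. This is a toy stand-in under stated assumptions (a square token grid, mean pooling, and a hypothetical `pool_tokens` helper), not Granulon's actual module, which also performs relation-aware clustering.

```python
import numpy as np

def pool_tokens(tokens, granularity):
    """Granularity-guided pooling: average-pool an (H, W, D) token grid with a
    window of size `granularity` (1 = pixel-level detail, larger = coarser
    abstraction), then flatten to a compact token sequence."""
    H, W, D = tokens.shape
    g = granularity
    assert H % g == 0 and W % g == 0, "grid must divide evenly for this sketch"
    pooled = tokens.reshape(H // g, g, W // g, g, D).mean(axis=(1, 3))
    return pooled.reshape(-1, D)

rng = np.random.default_rng(0)
grid = rng.standard_normal((16, 16, 8))  # 256 fine-grained visual tokens

# A text-conditioned controller would map the query's semantic scope to g;
# here we just sweep levels to show the token budget shrinking.
for g in (1, 2, 4):
    print(g, pool_tokens(grid, g).shape)  # (256, 8), (64, 8), (16, 8)
```

Coarser levels trade spatial detail for a much smaller token budget, which is the lever a text-conditioned controller would turn per query.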
[91] Where, What, Why: Toward Explainable 3D-GS Watermarking
Mingshu Cai, Jiajun Li, Osamu Yoshie, Yuya Ieiri, Yixuan Li
Main category: cs.CV
TL;DR: A novel watermarking framework for 3D Gaussian Splatting that separates carrier selection from quality preservation using a Trio-Experts module and Safety and Budget Aware Gate, achieving robust watermarking with minimal visual impact.
Details
Motivation: As 3D Gaussian Splatting becomes the standard for interactive 3D assets, there's a critical need for robust yet imperceptible watermarking to protect intellectual property while maintaining visual quality.
Method: Uses a Trio-Experts module operating directly on Gaussian primitives to derive priors for carrier selection. A Safety and Budget Aware Gate (SBAG) allocates Gaussians to watermark carriers optimized for bit resilience under perturbation and bitrate budgets, and to visual compensators. Introduces a channel-wise group mask to control gradient propagation, limiting Gaussian parameter updates while repairing local artifacts and preserving high-frequency details.
Result: Achieves view-consistent watermark persistence and strong robustness against common image distortions like compression and noise. Shows PSNR improvement of +0.83 dB and bit-accuracy gain of +1.24% compared to state-of-the-art methods.
Conclusion: The framework provides effective watermarking for 3D Gaussian Splatting with a favorable robustness-quality trade-off, while offering auditable explainability through per-Gaussian attributions that reveal where messages are carried and why those carriers are selected.
Abstract: As 3D Gaussian Splatting becomes the de facto representation for interactive 3D assets, robust yet imperceptible watermarking is critical. We present a representation-native framework that separates where to write from how to preserve quality. A Trio-Experts module operates directly on Gaussian primitives to derive priors for carrier selection, while a Safety and Budget Aware Gate (SBAG) allocates Gaussians to watermark carriers, optimized for bit resilience under perturbation and bitrate budgets, and to visual compensators that are insulated from watermark loss. To maintain fidelity, we introduce a channel-wise group mask that controls gradient propagation for carriers and compensators, thereby limiting Gaussian parameter updates, repairing local artifacts, and preserving high-frequency details without increasing runtime. Our design yields view-consistent watermark persistence and strong robustness against common image distortions such as compression and noise, while achieving a favorable robustness-quality trade-off compared with prior methods. In addition, decoupled finetuning provides per-Gaussian attributions that reveal where the message is carried and why those carriers are selected, enabling auditable explainability. Compared with state-of-the-art methods, our approach achieves a PSNR improvement of +0.83 dB and a bit-accuracy gain of +1.24%.
[92] MEGC2026: Micro-Expression Grand Challenge on Visual Question Answering
Xinqi Fan, Jingting Li, John See, Moi Hoon Yap, Su-Jing Wang, Adrian K. Davison
Main category: cs.CV
TL;DR: MEGC 2026 introduces two micro-expression tasks using multimodal LLMs: ME-VQA for short video QA and ME-LVQA for long-video temporal reasoning.
Details
Motivation: Facial micro-expressions are subtle involuntary facial movements that typically occur in high-stakes environments. Recent advances in ME analysis combined with emerging multimodal LLMs offer new opportunities for enhanced ME understanding through multimodal reasoning.
Method: The MEGC 2026 challenge proposes two tasks: 1) ME-VQA using MLLMs/LVLMs for visual question answering on short ME videos, and 2) ME-LVQA extending to long-duration videos requiring temporal reasoning and subtle ME detection across extended periods.
Result: The paper introduces a public challenge with a leaderboard for evaluating multimodal LLM approaches on micro-expression video understanding tasks, with detailed information available on the MEGC 2026 website.
Conclusion: Multimodal LLMs show promise for advancing micro-expression analysis through visual question answering tasks, with the MEGC 2026 challenge providing a framework for evaluating these capabilities on both short and long video sequences.
Abstract: Facial micro-expressions (MEs) are involuntary movements of the face that occur spontaneously when a person experiences an emotion but attempts to suppress or repress the facial expression, typically found in a high-stakes environment. In recent years, substantial advancements have been made in the areas of ME recognition, spotting, and generation. The emergence of multimodal large language models (MLLMs) and large vision-language models (LVLMs) offers promising new avenues for enhancing ME analysis through their powerful multimodal reasoning capabilities. The ME grand challenge (MEGC) 2026 introduces two tasks that reflect these evolving research directions: (1) ME video question answering (ME-VQA), which explores ME understanding through visual question answering on relatively short video sequences, leveraging MLLMs or LVLMs to address diverse question types related to MEs; and (2) ME long-video question answering (ME-LVQA), which extends VQA to long-duration video sequences in realistic settings, requiring models to handle temporal reasoning and subtle micro-expression detection across extended time periods. All participating algorithms are required to submit their results on a public leaderboard. More details are available at https://megc2026.github.io.
[93] VisionCreator-R1: A Reflection-Enhanced Native Visual-Generation Agentic Model
Jinxiang Lai, Wenzhe Zhao, Zexin Lu, Hualei Zhang, Qinyu Yang, Rongwei Quan, Zhimin Li, Shuai Shao, Song Guo, Qinglin Lu
Main category: cs.CV
TL;DR: VisionCreator-R1 is a visual generation agent with explicit reflection capabilities and a Reflection-Plan Co-Optimization training method that addresses optimization asymmetry in reinforcement learning for visual content generation.
Details
Motivation: Current visual generation agents are plan-driven and lack systematic reflection mechanisms to correct mid-trajectory visual errors, limiting their ability to handle complex multi-image workflows effectively.
Method: Proposes VisionCreator-R1 with explicit reflection and Reflection-Plan Co-Optimization (RPCO) training. RPCO first trains on the self-constructed VCR-SFT dataset with reflection-strong single-image trajectories and planning-strong multi-image trajectories, then co-optimizes on the VCR-RL dataset via reinforcement learning.
Result: VisionCreator-R1 consistently outperforms Gemini 2.5 Pro on existing benchmarks and the proposed VCR-bench covering both single-image and multi-image tasks.
Conclusion: The paper introduces a novel approach to visual generation agents with explicit reflection capabilities and addresses optimization asymmetry in RL, demonstrating superior performance over state-of-the-art models.
Abstract: Visual content generation has advanced from single-image to multi-image workflows, yet existing agents remain largely plan-driven and lack systematic reflection mechanisms to correct mid-trajectory visual errors. To address this limitation, we propose VisionCreator-R1, a native visual generation agent with explicit reflection, together with a Reflection-Plan Co-Optimization (RPCO) training methodology. Through extensive experiments and trajectory-level analysis, we uncover reflection-plan optimization asymmetry in reinforcement learning (RL): planning can be reliably optimized via plan rewards, while reflection learning is hindered by noisy credit assignment. Guided by this insight, our RPCO first trains on the self-constructed VCR-SFT dataset with reflection-strong single-image trajectories and planning-strong multi-image trajectories, then co-optimizes on the VCR-RL dataset via RL. This yields our unified VisionCreator-R1 agent, which consistently outperforms Gemini 2.5 Pro on existing benchmarks and our VCR-bench covering single-image and multi-image tasks.
[94] Computer Vision-Based Vehicle Allotment System using Perspective Mapping
Prachi Nandi, Sonakshi Satapathy, Suchismita Chinara
Main category: cs.CV
TL;DR: A smart parking system using computer vision (YOLOv8) and inverse perspective mapping to detect available parking spaces from multiple camera views, visualized in a 3D environment.
Details
Motivation: Smart parking systems are crucial for reducing urban congestion and supporting sustainable transportation in smart cities. While automation offers efficiency benefits, current sensor-based systems face limitations in accuracy and adaptability to changing parking layouts.
Method: Uses computer vision with the YOLOv8 object detection model and inverse perspective mapping (IPM) to merge images from four camera views, extracting data on vacant parking spaces. Creates a simulated 3D parking environment with Cartesian plots to guide users.
Result: The system provides a cost-effective, easy-to-implement solution that dynamically assesses visual inputs and adapts to changing parking layouts, outperforming traditional sensor-based systems in accuracy and flexibility.
Conclusion: Computer vision-based smart parking systems offer superior accuracy and adaptability compared to traditional sensor technologies, providing a practical solution for urban parking management in smart cities.
Abstract: Smart city research envisions a future in which data-driven solutions and sustainable infrastructure work together to define urban living at the crossroads of urbanization and technology. Within this framework, smart parking systems play an important role in reducing urban congestion and supporting sustainable transportation. Automated parking solutions have considerable benefits, such as increased efficiency and less reliance on human involvement, but obstacles such as sensor limitations and integration complications remain. To overcome them, a more sophisticated car allotment system is required, particularly in heavily populated urban areas. Computer vision, with its higher accuracy and adaptability, outperforms traditional sensor-based systems for recognizing vehicles and vacant parking spaces. Unlike fixed sensor technologies, computer vision can dynamically assess a wide range of visual inputs while adjusting to changing parking layouts. This research presents a cost-effective, easy-to-implement smart parking system utilizing computer vision and object detection models like YOLOv8. Using inverse perspective mapping (IPM) to merge images from four camera views, we extract data on vacant spaces. The system simulates a 3D parking environment, representing available spots with a 3D Cartesian plot to guide users.
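Inverse perspective mapping rests on a planar homography: four point correspondences between the camera image and the ground plane determine a 3x3 matrix that rectifies the view. A minimal sketch via the standard Direct Linear Transform, with made-up corner coordinates (the `homography`/`warp_point` helpers are illustrative, not the paper's code):

```python
import numpy as np

def homography(src, dst):
    """Direct Linear Transform: fit H such that dst ~ H @ src in homogeneous
    coordinates, from 4+ point correspondences. This is the core of inverse
    perspective mapping: ground-plane pixels seen from an angled camera are
    remapped to a top-down (bird's-eye) view."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.asarray(A, float))
    return Vt[-1].reshape(3, 3)  # null-space vector of A, reshaped to 3x3

def warp_point(H, pt):
    p = H @ np.array([pt[0], pt[1], 1.0])
    return p[:2] / p[2]  # perspective divide

# Hypothetical example: four corners of a parking row as seen by the camera
# (a trapezoid) mapped to a rectangle in the bird's-eye plan view.
src = [(100, 300), (540, 300), (640, 480), (0, 480)]
dst = [(0, 0), (400, 0), (400, 200), (0, 200)]
H = homography(src, dst)
print(warp_point(H, (320, 390)))  # lands inside the rectified row
```

With one such H per camera, the four rectified views can be stitched on a common ground-plane coordinate system, which is what lets the system reason about stall occupancy in plan view.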
[95] A Lightweight Multi-Cancer Tumor Localization Framework for Deployable Digital Pathology
Brian Isett, Rebekah Dadey, Aofei Li, Ryan C. Augustin, Kate Smith, Aatur D. Singhi, Qiangqiang Gu, Riyue Bao
Main category: cs.CV
TL;DR: MuCTaL is a multi-cancer tumor localization model trained on four cancer types that generalizes to unseen pancreatic cancer with moderate performance, using DenseNet169 transfer learning and generating spatial tumor probability heatmaps.
Details
Motivation: Deep learning tumor detection models trained on specific cancers often lack robustness across different tumor types, limiting their translational research applications in spatial analysis, molecular profiling, and tissue architecture investigation.
Method: Transfer learning with DenseNet169 on 79,984 non-overlapping tiles from four cancers (melanoma, hepatocellular carcinoma, colorectal cancer, and non-small cell lung cancer), with balanced training across cancer types and a scalable inference workflow for spatial tumor probability heatmaps.
Result: Achieved tile-level ROC-AUC of 0.97 on validation data from the four training cancers and 0.71 on an independent pancreatic ductal adenocarcinoma cohort, demonstrating generalization to unseen tumor types.
Conclusion: Balanced multi-cancer training at modest scale can achieve high performance and generalize to unseen tumor types, with the model and code publicly available for digital pathology applications.
Abstract: Accurate localization of tumor regions from hematoxylin and eosin-stained whole-slide images is fundamental for translational research including spatial analysis, molecular profiling, and tissue architecture investigation. However, deep learning-based tumor detection trained within specific cancers may exhibit reduced robustness when applied across different tumor types. We investigated whether balanced training across cancers at modest scale can achieve high performance and generalize to unseen tumor types. A multi-cancer tumor localization model (MuCTaL) was trained on 79,984 non-overlapping tiles from four cancers (melanoma, hepatocellular carcinoma, colorectal cancer, and non-small cell lung cancer) using transfer learning with DenseNet169. The model achieved a tile-level ROC-AUC of 0.97 in validation data from the four training cancers, and 0.71 on an independent pancreatic ductal adenocarcinoma cohort. A scalable inference workflow was built to generate spatial tumor probability heatmaps compatible with existing digital pathology tools. Code and models are publicly available at https://github.com/AivaraX-AI/MuCTaL.
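The tile-level ROC-AUC reported above (0.97 in-domain, 0.71 on the unseen pancreatic cohort) has a simple probabilistic reading: the chance that a random tumour tile scores higher than a random non-tumour tile. A hedged sketch of that metric via the rank-sum (Mann-Whitney U) identity, with toy scores rather than real data:

```python
import numpy as np

def roc_auc(labels, scores):
    """ROC-AUC via pairwise comparison: P(score of a random positive tile
    > score of a random negative tile), counting ties as half a win.
    O(P*N) — fine for a sketch; use a rank-based formula at scale."""
    labels = np.asarray(labels, bool)
    scores = np.asarray(scores, float)
    pos, neg = scores[labels], scores[~labels]
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

# Toy tiles: tumour tiles (label 1) tend to score higher, with some overlap.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
print(roc_auc(labels, scores))
```

Eight of the nine tumour/non-tumour tile pairs here are ranked correctly, so the AUC is 8/9; the same computation at whole-cohort scale yields the figures the paper reports.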
[96] Memory-Guided View Refinement for Dynamic Human-in-the-loop EQA
Xin Lu, Rui Li, Xun Huang, Weixin Li, Chuanqing Zhuang, Jiayuan Li, Zhengda Lu, Jun Xiao, Yunhong Wang
Main category: cs.CV
TL;DR: DynHiL-EQA dataset and DIVRR framework for Embodied Question Answering in dynamic human-populated scenes with temporal changes and occlusions.
Details
Motivation: Traditional EQA assumes temporally stable environments, but real-world dynamic scenes with human activities introduce perceptual non-stationarity, transient cues, and viewpoint-dependent occlusions, requiring new approaches for robust and efficient inference.
Method: Introduces the DynHiL-EQA dataset with Dynamic and Static subsets, and the DIVRR framework with relevance-guided view refinement and selective memory admission that verifies ambiguous observations before committing them to memory.
Result: DIVRR improves over existing baselines on both DynHiL-EQA and HM-EQA datasets in dynamic and static settings while maintaining high inference efficiency with compact memory.
Conclusion: The paper addresses critical challenges in dynamic EQA through a novel dataset and training-free framework that balances robustness under occlusions with efficient inference through selective evidence accumulation.
Abstract: Embodied Question Answering (EQA) has traditionally been evaluated in temporally stable environments where visual evidence can be accumulated reliably. However, in dynamic, human-populated scenes, human activities and occlusions introduce significant perceptual non-stationarity: task-relevant cues are transient and view-dependent, while a store-then-retrieve strategy over-accumulates redundant evidence and increases inference cost. This setting exposes two practical challenges for EQA agents: resolving ambiguity caused by viewpoint-dependent occlusions, and maintaining compact yet up-to-date evidence for efficient inference. To enable systematic study of this setting, we introduce DynHiL-EQA, a human-in-the-loop EQA dataset with two subsets: a Dynamic subset featuring human activities and temporal changes, and a Static subset with temporally stable observations. To address the above challenges, we present DIVRR (Dynamic-Informed View Refinement and Relevance-guided Adaptive Memory Selection), a training-free framework that couples relevance-guided view refinement with selective memory admission. By verifying ambiguous observations before committing them and retaining only informative evidence, DIVRR improves robustness under occlusions while preserving fast inference with compact memory. Extensive experiments on DynHiL-EQA and the established HM-EQA dataset demonstrate that DIVRR consistently improves over existing baselines in both dynamic and static settings while maintaining high inference efficiency.
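The selective-admission idea (verify before committing, retain only informative evidence, keep memory compact) can be sketched as a gated, bounded store. The class name, thresholds, and eviction rule below are illustrative assumptions, not DIVRR's actual interface:

```python
from dataclasses import dataclass, field

@dataclass
class SelectiveMemory:
    """Toy sketch of relevance-gated memory admission: an observation enters
    memory only if it is relevant to the question AND passes verification;
    capacity is bounded, evicting the least relevant entry first."""
    capacity: int
    threshold: float
    entries: list = field(default_factory=list)  # (relevance, observation), sorted desc.

    def admit(self, observation, relevance, verified):
        if relevance < self.threshold or not verified:
            return False  # ambiguous or irrelevant views never enter memory
        self.entries.append((relevance, observation))
        self.entries.sort(reverse=True)
        del self.entries[self.capacity:]  # keep memory compact
        return True

mem = SelectiveMemory(capacity=2, threshold=0.5)
mem.admit("kitchen view", 0.9, verified=True)
mem.admit("blurred hallway", 0.2, verified=True)   # rejected: low relevance
mem.admit("occluded table", 0.8, verified=False)   # rejected: failed verification
mem.admit("dining area", 0.7, verified=True)
print([obs for _, obs in mem.entries])  # ['kitchen view', 'dining area']
```

Bounding the store is what keeps inference cheap: answering queries against two vetted views is far cheaper than retrieving over every frame a store-then-retrieve agent would have hoarded.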
[97] HECTOR: Hybrid Editable Compositional Object References for Video Generation
Guofeng Zhang, Angtian Wang, Jacob Zhiyuan Fang, Liming Jiang, Haotian Yang, Alan Yuille, Chongyang Ma
Main category: cs.CV
TL;DR: HECTOR enables fine-grained compositional control in video generation by allowing hybrid reference conditioning from images/videos and explicit trajectory specification for each element.
Details
Motivation: Current video generation models synthesize scenes holistically without mechanisms for explicit compositional manipulation of distinct physical objects and their interactions in dynamic scenes.
Method: Proposes a generative pipeline with hybrid reference conditioning (static images and/or dynamic videos) and explicit trajectory specification for each referenced element to control location, scale, and speed.
Result: HECTOR achieves superior visual quality, stronger reference preservation, and improved motion controllability compared to existing approaches, synthesizing coherent videos that satisfy complex spatiotemporal constraints.
Conclusion: HECTOR addresses the limitation of holistic video generation by enabling fine-grained compositional control through hybrid reference conditioning and explicit trajectory specification for individual elements.
Abstract: Real-world videos naturally portray complex interactions among distinct physical objects, effectively forming dynamic compositions of visual elements. However, most current video generation models synthesize scenes holistically and therefore lack mechanisms for explicit compositional manipulation. To address this limitation, we propose HECTOR, a generative pipeline that enables fine-grained compositional control. In contrast to prior methods, HECTOR supports hybrid reference conditioning, allowing generation to be simultaneously guided by static images and/or dynamic videos. Moreover, users can explicitly specify the trajectory of each referenced element, precisely controlling its location, scale, and speed (see Figure 1). This design allows the model to synthesize coherent videos that satisfy complex spatiotemporal constraints while preserving high-fidelity adherence to references. Extensive experiments demonstrate that HECTOR achieves superior visual quality, stronger reference preservation, and improved motion controllability compared with existing approaches.
[98] Comparative Analysis of Patch Attack on VLM-Based Autonomous Driving Architectures
David Fernandez, Pedram MohajerAnsari, Amir Salarpour, Long Cheng, Abolfazl Razi, Mert D. Pesé
Main category: cs.CV
TL;DR: Systematic evaluation of physical adversarial attacks on vision-language models for autonomous driving reveals severe vulnerabilities across three architectures (Dolphins, OmniDrive, LeapVAD) with sustained failures and critical object detection degradation.
Details
Motivation: Vision-language models are increasingly used in autonomous driving, but their robustness to physical adversarial attacks remains unexplored. The paper aims to systematically evaluate and compare the vulnerability of different VLM architectures to adversarial threats in safety-critical driving applications.
Method: Developed a systematic framework for comparative adversarial evaluation across three VLM architectures. Used black-box optimization with semantic homogenization for fair comparison. Evaluated physically realizable patch attacks in the CARLA simulation environment.
Result: Results show severe vulnerabilities across all three architectures, with sustained multi-frame failures and critical degradation in object detection performance. Analysis revealed distinct architectural vulnerability patterns, demonstrating that current VLM designs inadequately address adversarial threats.
Conclusion: Current vision-language model architectures for autonomous driving have significant security vulnerabilities to physical adversarial attacks. The systematic evaluation framework exposes critical weaknesses that need to be addressed for safe deployment in real-world driving applications.
Abstract: Vision-language models are emerging for autonomous driving, yet their robustness to physical adversarial attacks remains unexplored. This paper presents a systematic framework for comparative adversarial evaluation across three VLM architectures: Dolphins, OmniDrive (Omni-L), and LeapVAD. Using black-box optimization with semantic homogenization for fair comparison, we evaluate physically realizable patch attacks in CARLA simulation. Results reveal severe vulnerabilities across all architectures, sustained multi-frame failures, and critical object detection degradation. Our analysis exposes distinct architectural vulnerability patterns, demonstrating that current VLM designs inadequately address adversarial threats in safety-critical autonomous driving applications.
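Black-box patch optimization in this setting means the attacker only sees a scalar failure score from the victim model, never its gradients. A minimal hill-climbing sketch under stated assumptions: the `victim` is a synthetic stand-in scoring function, and the paper's actual optimizer, scoring, and semantic-homogenization step are more involved.

```python
import numpy as np

def attack_score(image, patch, pos, victim):
    """Paste the patch at pos and query the black-box victim for a failure score."""
    img = image.copy()
    y, x = pos
    h, w = patch.shape[:2]
    img[y:y + h, x:x + w] = patch
    return victim(img)

def random_search_patch(image, victim, shape=(8, 8, 3), iters=200, seed=0):
    """Query-only patch attack: Gaussian-perturb the patch, keep mutations
    that raise the victim's failure score (simple hill climbing)."""
    rng = np.random.default_rng(seed)
    patch = rng.random(shape)
    pos = (4, 4)  # fixed placement for this sketch
    best = attack_score(image, patch, pos, victim)
    for _ in range(iters):
        cand = np.clip(patch + rng.normal(0.0, 0.1, shape), 0.0, 1.0)
        s = attack_score(image, cand, pos, victim)
        if s > best:
            patch, best = cand, s
    return patch, best

# Stand-in "victim": rewards deviation from mid-grey inside the patched frame.
image = np.full((32, 32, 3), 0.5)
victim = lambda img: float(np.abs(img - 0.5).mean())
patch, score = random_search_patch(image, victim)
print(round(score, 3))
```

Because the loop only ever needs forward queries, the same skeleton applies to any of the three evaluated architectures, which is what makes a comparative black-box study feasible.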
[99] When to Lock Attention: Training-Free KV Control in Video Diffusion
Tianyi Zeng, Jincheng Gao, Tianyi Wang, Zijie Meng, Miao Zhang, Jun Yin, Haoyuan Sun, Junfeng Jiao, Christian Claudel, Junbo Tan, Xueqian Wang
Main category: cs.CV
TL;DR: KV-Lock is a training-free framework for DiT-based video diffusion models that dynamically balances background consistency and foreground quality using hallucination detection to schedule KV fusion ratios and CFG scales.
Details
Motivation: The paper addresses the core challenge in video editing where maintaining background consistency while enhancing foreground quality is difficult. Current approaches either inject full-image information causing background artifacts or use rigid background locking that limits foreground generation capacity.
Method: KV-Lock uses diffusion hallucination detection to dynamically schedule two components: 1) the fusion ratio between cached background key-values (KVs) and newly generated KVs, and 2) the classifier-free guidance (CFG) scale. When hallucination risk is detected, it strengthens background KV locking and amplifies conditional guidance for foreground generation.
Result: Extensive experiments show KV-Lock outperforms existing approaches in improved foreground quality with high background fidelity across various video editing tasks. It’s a training-free, plug-and-play module that can be integrated into any pre-trained DiT-based models.
Conclusion: KV-Lock effectively addresses the background consistency vs. foreground quality trade-off in video editing by dynamically adjusting KV fusion and CFG guidance based on hallucination detection, providing a practical solution for video diffusion models.
Abstract: Maintaining background consistency while enhancing foreground quality remains a core challenge in video editing. Injecting full-image information often leads to background artifacts, whereas rigid background locking severely constrains the model’s capacity for foreground generation. To address this issue, we propose KV-Lock, a training-free framework tailored for DiT-based video diffusion models. Our core insight is that the hallucination metric (variance of denoising prediction) directly quantifies generation diversity, which is inherently linked to the classifier-free guidance (CFG) scale. Building upon this, KV-Lock leverages diffusion hallucination detection to dynamically schedule two key components: the fusion ratio between cached background key-values (KVs) and newly generated KVs, and the CFG scale. When hallucination risk is detected, KV-Lock strengthens background KV locking and simultaneously amplifies conditional guidance for foreground generation, thereby mitigating artifacts and improving generation fidelity. As a training-free, plug-and-play module, KV-Lock can be easily integrated into any pre-trained DiT-based models. Extensive experiments validate that our method outperforms existing approaches in improved foreground quality with high background fidelity across various video editing tasks.
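The scheduling logic can be sketched numerically: treat the variance of repeated denoising predictions as a hallucination score, then drive both the background-KV fusion ratio and the CFG scale up with it. The squashing function, base CFG of 5.0, and boost factor below are illustrative choices, not KV-Lock's published schedule:

```python
import numpy as np

def schedule(preds, base_cfg=5.0, max_boost=2.0):
    """Hallucination-guided scheduling sketch: high variance across denoising
    predictions signals hallucination risk, pushing the fusion ratio toward
    fully cached (locked) background KVs and amplifying the CFG scale."""
    var = float(np.var(preds))
    risk = var / (1.0 + var)           # squash variance into [0, 1)
    fusion = risk                       # 0 = fresh KVs only, 1 = locked cached KVs
    cfg = base_cfg * (1.0 + max_boost * risk)
    return fusion, cfg

calm = np.array([0.50, 0.51, 0.49])    # consistent predictions: low risk
noisy = np.array([0.1, 0.9, 0.4, 0.8]) # divergent predictions: high risk
print(schedule(calm))
print(schedule(noisy))
```

Because the schedule reads only quantities already available during sampling (prediction variance), it needs no training or architectural change, which is what makes the method plug-and-play for pre-trained DiT models.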
[100] Towards Visual Query Segmentation in the Wild
Bing Fan, Minghao Li, Hanzhi Zhang, Shaohua Dong, Naga Prudhvi Mareedu, Weishi Shi, Yunhe Feng, Yan Huang, Heng Fan
Main category: cs.CV
TL;DR: VQS-4K: A new visual query segmentation benchmark for pixel-level object localization in videos, with VQ-SAM method extending SAM 2 for this task.
Details
Motivation: Current visual query localization (VQL) locates only the last appearance of an object, using bounding boxes. Real-world applications need more comprehensive (all occurrences) and more precise (pixel-level masks) localization.
Method: Proposes the VQS-4K benchmark with 4,111 videos, 1.3M frames, and 222 categories. Develops VQ-SAM, which extends SAM 2 with target-specific and background distractor cues, a multi-stage framework, and an adaptive memory generation module.
Result: VQ-SAM achieves promising results on VQS-4K, surpassing all existing approaches. Benchmark enables comprehensive evaluation of visual query segmentation.
Conclusion: VQS-4K and VQ-SAM advance beyond current VQL paradigm, enabling more precise and comprehensive object localization in videos with potential for practical applications.
Abstract: In this paper, we introduce visual query segmentation (VQS), a new paradigm of visual query localization (VQL) that aims to segment all pixel-level occurrences of an object of interest in an untrimmed video, given an external visual query. Compared to existing VQL locating only the last appearance of a target using bounding boxes, VQS enables more comprehensive (i.e., all object occurrences) and precise (i.e., pixel-level masks) localization, making it more practical for real-world scenarios. To foster research on this task, we present VQS-4K, a large-scale benchmark dedicated to VQS. Specifically, VQS-4K contains 4,111 videos with more than 1.3 million frames and covers a diverse set of 222 object categories. Each video is paired with a visual query defined by a frame outside the search video and its target mask, and annotated with spatial-temporal masklets corresponding to the queried target. To ensure high quality, all videos in VQS-4K are manually labeled with meticulous inspection and iterative refinement. To the best of our knowledge, VQS-4K is the first benchmark specifically designed for VQS. Furthermore, to stimulate future research, we present a simple yet effective method, named VQ-SAM, which extends SAM 2 by leveraging target-specific and background distractor cues from the video to progressively evolve the memory through a novel multi-stage framework with an adaptive memory generation (AMG) module for VQS, significantly improving the performance. In our extensive experiments on VQS-4K, VQ-SAM achieves promising results and surpasses all existing approaches, demonstrating its effectiveness. With the proposed VQS-4K and VQ-SAM, we expect to go beyond the current VQL paradigm and inspire more future research and practical applications on VQS. Our benchmark, code, and results will be made publicly available.
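Since VQS asks for pixel-level masks across all occurrences, a natural evaluation is a frame-averaged region IoU over spatio-temporal masklets. The sketch below (masks as sets of pixel coordinates per frame) is a minimal illustration; the benchmark's official metrics may differ.

```python
def masklet_iou(pred, gt):
    """Frame-averaged region IoU between two spatio-temporal masklets.

    Each masklet maps frame index -> set of (row, col) foreground pixels.
    Frames where both masks are empty count as perfect agreement.
    """
    frames = set(pred) | set(gt)
    scores = []
    for f in frames:
        p, g = pred.get(f, set()), gt.get(f, set())
        union = p | g
        scores.append(1.0 if not union else len(p & g) / len(union))
    return sum(scores) / len(scores) if scores else 1.0
```

Missing a whole occurrence (a frame present only in the ground truth) scores zero for that frame, so the metric penalizes both imprecise masks and incomplete temporal coverage.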
[101] $M^2$-Occ: Resilient 3D Semantic Occupancy Prediction for Autonomous Driving with Incomplete Camera Inputs
Kaixin Lin, Kunyu Peng, Di Wen, Yufan Chen, Ruiping Liu, Kailun Yang
Main category: cs.CV
TL;DR: M²-Occ: A framework for robust semantic occupancy prediction under incomplete multi-camera inputs in autonomous driving, addressing missing views through multi-view masked reconstruction and feature memory modules.
Details
Motivation: Existing camera-based semantic occupancy prediction methods assume complete surround-view observations, which rarely holds in real-world deployment due to occlusion, hardware malfunctions, or communication failures. The paper addresses the critical need for robust perception under incomplete multi-camera inputs.
Method: Two complementary modules: 1) Multi-view Masked Reconstruction (MMR) leverages spatial overlap among neighboring cameras to recover missing-view representations in feature space; 2) Feature Memory Module (FMM) introduces a learnable memory bank storing class-level semantic prototypes to refine ambiguous voxel features and ensure semantic consistency.
Result: Significant improvements on nuScenes-based SurroundOcc benchmark: 4.93% IoU improvement under safety-critical missing back-view setting, with gap widening to 5.01% IoU improvement with five missing views. Achieves robustness gains without compromising full-view performance.
Conclusion: M²-Occ provides an effective solution for robust semantic occupancy prediction under incomplete camera observations, addressing real-world deployment challenges through feature-level reconstruction and semantic memory priors.
Abstract: Semantic occupancy prediction enables dense 3D geometric and semantic understanding for autonomous driving. However, existing camera-based approaches implicitly assume complete surround-view observations, an assumption that rarely holds in real-world deployment due to occlusion, hardware malfunction, or communication failures. We study semantic occupancy prediction under incomplete multi-camera inputs and introduce $M^2$-Occ, a framework designed to preserve geometric structure and semantic coherence when views are missing. $M^2$-Occ addresses two complementary challenges. First, a Multi-view Masked Reconstruction (MMR) module leverages the spatial overlap among neighboring cameras to recover missing-view representations directly in the feature space. Second, a Feature Memory Module (FMM) introduces a learnable memory bank that stores class-level semantic prototypes. By retrieving and integrating these global priors, the FMM refines ambiguous voxel features, ensuring semantic consistency even when observational evidence is incomplete. We introduce a systematic missing-view evaluation protocol on the nuScenes-based SurroundOcc benchmark, encompassing both deterministic single-view failures and stochastic multi-view dropout scenarios. Under the safety-critical missing back-view setting, $M^2$-Occ improves the IoU by 4.93%. As the number of missing cameras increases, the robustness gap further widens; for instance, under the setting with five missing views, our method boosts the IoU by 5.01%. These gains are achieved without compromising full-view performance. The source code will be publicly released at https://github.com/qixi7up/M2-Occ.
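The FMM's retrieve-and-refine step can be illustrated as nearest-prototype lookup by cosine similarity followed by a blend of the observation with the retrieved prior. The plain-Python vectors and the fixed blend weight `alpha` are illustrative assumptions; the paper's module is learned end to end.

```python
import math

def refine_voxel_feature(feature, prototypes, alpha=0.5):
    """Refine an ambiguous voxel feature with its closest class prototype.

    `prototypes` maps class name -> prototype vector (the memory bank);
    retrieval uses cosine similarity, and the refined feature blends the
    observation with the retrieved prior.
    """
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    best_cls = max(prototypes, key=lambda c: cos(feature, prototypes[c]))
    proto = prototypes[best_cls]
    refined = [(1 - alpha) * f + alpha * p for f, p in zip(feature, proto)]
    return best_cls, refined
```

When a view is missing and the voxel evidence is weak, pulling the feature toward a class-level prototype is what keeps the semantics consistent.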
[102] Multi-Kernel Gated Decoder Adapters for Robust Multi-Task Thyroid Ultrasound under Cross-Center Shift
Maziar Sabouri, Nourhan Bayasi, Arman Rahmim
Main category: cs.CV
TL;DR: The paper proposes Multi-Kernel Gated Adapters (MKGA) to address domain shift in thyroid ultrasound analysis by separating geometric reasoning for segmentation from texture reasoning for malignancy assessment, improving cross-center robustness.
Details
Motivation: Thyroid ultrasound automation requires both global geometric reasoning for nodule delineation and local texture reasoning for malignancy assessment. Under cross-center domain shift, these cues degrade asymmetrically, but most multi-task pipelines use a single shared backbone causing negative transfer between tasks.
Method: Proposes lightweight decoder-side adapters (MKGA and ResMKGA) that refine multi-scale skip features using complementary receptive fields with semantic, context-conditioned gating to suppress artifact-prone content before fusion. The approach separates geometric priors (better handled by ViTs) from texture cues (better preserved by CNNs under domain shift).
Result: Across two ultrasound benchmarks, the proposed adapters improve cross-center robustness: they strengthen out-of-domain segmentation and, in the CNN setting, yield clear gains in clinical TI-RADS diagnostic accuracy compared to standard multi-task baselines.
Conclusion: The paper demonstrates that separating geometric and texture reasoning through specialized adapters improves robustness to domain shift in medical imaging, with ViTs better for geometric priors and CNNs better for texture preservation under strong shift and artifacts.
Abstract: Thyroid ultrasound (US) automation couples two competing requirements: global, geometry-driven reasoning for nodule delineation and local, texture-driven reasoning for malignancy risk assessment. Under cross-center domain shift, these cues degrade asymmetrically, yet most multi-task pipelines rely on a single shared backbone, often inducing negative transfer. In this paper, we characterize this interference across CNN (ResNet34) and medical ViT (MedSAM) backbones, and observe a consistent trend: ViTs transfer geometric priors that benefit segmentation, whereas CNNs more reliably preserve texture cues for malignancy discrimination under strong shift and artifacts. Motivated by this failure mode, we propose a lightweight family of decoder-side adapters, the Multi-Kernel Gated Adapter (MKGA) and a residual variant (ResMKGA), which refine multi-scale skip features using complementary receptive fields and apply semantic, context-conditioned gating to suppress artifact-prone content before fusion. Across two US benchmarks, the proposed adapters improve cross-center robustness: they strengthen out-of-domain segmentation and, in the CNN setting, yield clear gains in clinical TI-RADS diagnostic accuracy compared to standard multi-task baselines. Code and models will be released.
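The gated multi-kernel fusion at the heart of MKGA can be sketched as a softmax gate over per-branch scores, where each branch stands in for one receptive field. The dot-product scoring against a context vector is a hypothetical stand-in for the paper's learned semantic gating.

```python
import math

def mkga_fuse(branch_feats, context):
    """Context-gated fusion of multi-kernel branch outputs.

    `branch_feats` holds one feature vector per receptive-field branch;
    the gate is a softmax over context-conditioned scores, so branches
    whose content matches the context dominate the fused feature.
    """
    scores = [sum(c * f for c, f in zip(context, feat)) for feat in branch_feats]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    gates = [e / z for e in exps]
    fused = [sum(g * feat[i] for g, feat in zip(gates, branch_feats))
             for i in range(len(branch_feats[0]))]
    return gates, fused
```

The gating is what lets the adapter down-weight an artifact-prone branch before fusion rather than averaging it in blindly.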
[103] PanoAffordanceNet: Towards Holistic Affordance Grounding in 360° Indoor Environments
Guoliang Zhu, Wanjun Jia, Caoyang Shao, Yuheng Zhang, Zhiyong Li, Kailun Yang
Main category: cs.CV
TL;DR: PanoAffordanceNet is a framework for holistic affordance grounding in 360° indoor environments, addressing challenges like geometric distortions and semantic dispersion through distortion-aware modulation and multi-level constraints.
Details
Motivation: Current affordance grounding methods are object-centric and limited to perspective views, failing to provide global perception needed for embodied agents operating in 360° spaces. There's a need for holistic scene-level understanding in panoramic environments.
Method: Proposes PanoAffordanceNet with Distortion-Aware Spectral Modulator (DASM) for latitude-dependent calibration to handle ERP distortions, and Omni-Spherical Densification Head (OSDH) to restore topological continuity from sparse activations. Uses multi-level constraints including pixel-wise, distributional, and region-text contrastive objectives.
Result: Significantly outperforms existing methods on the newly constructed 360-AGD dataset, establishing a solid baseline for scene-level perception in embodied intelligence. The framework effectively suppresses semantic drift under low supervision.
Conclusion: The work introduces holistic affordance grounding in 360° environments, addresses unique challenges of panoramic perception, and provides both a novel framework (PanoAffordanceNet) and benchmark dataset (360-AGD) for embodied intelligence research.
Abstract: Global perception is essential for embodied agents in 360° spaces, yet current affordance grounding remains largely object-centric and restricted to perspective views. To bridge this gap, we introduce a novel task: Holistic Affordance Grounding in 360° Indoor Environments. This task faces unique challenges, including severe geometric distortions from Equirectangular Projection (ERP), semantic dispersion, and cross-scale alignment difficulties. We propose PanoAffordanceNet, an end-to-end framework featuring a Distortion-Aware Spectral Modulator (DASM) for latitude-dependent calibration and an Omni-Spherical Densification Head (OSDH) to restore topological continuity from sparse activations. By integrating multi-level constraints comprising pixel-wise, distributional, and region-text contrastive objectives, our framework effectively suppresses semantic drift under low supervision. Furthermore, we construct 360-AGD, the first high-quality panoramic affordance grounding dataset. Extensive experiments demonstrate that PanoAffordanceNet significantly outperforms existing methods, establishing a solid baseline for scene-level perception in embodied intelligence. The source code and benchmark dataset will be made publicly available at https://github.com/GL-ZHU925/PanoAffordanceNet.
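The latitude-dependent distortion DASM must compensate for is easy to see: in an equirectangular projection, rows near the poles are heavily over-sampled, and a cos(latitude) weight approximates the true spherical area each row covers. The snippet below shows this standard ERP correction as a minimal stand-in for the paper's learned spectral modulation.

```python
import math

def erp_row_weights(height):
    """Per-row spherical-area weights for an equirectangular image.

    Row r spans latitudes near (0.5 - (r + 0.5)/height) * pi, so the
    equator gets weight ~1 while polar rows shrink toward 0.
    """
    weights = []
    for r in range(height):
        lat = (0.5 - (r + 0.5) / height) * math.pi   # +pi/2 .. -pi/2
        weights.append(math.cos(lat))
    return weights
```

Any per-pixel loss or pooling on ERP imagery that ignores these weights over-counts polar content, which is one reason perspective-view methods degrade on panoramas.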
[104] Vision-Language Models Encode Clinical Guidelines for Concept-Based Medical Reasoning
Mohamed Harmanani, Bining Long, Zhuoxin Guo, Paul F. R. Wilson, Amirhossein Sabour, Minh Nguyen Nhat To, Gabor Fichtinger, Purang Abolmaesumi, Parvin Mousavi
Main category: cs.CV
TL;DR: MedCBR integrates clinical guidelines with vision-language models to create interpretable concept-based reasoning for medical imaging, achieving strong diagnostic performance while generating structured clinical narratives.
Details
Motivation: Current Concept Bottleneck Models (CBMs) in medical imaging lack integration of broader clinical context like diagnostic guidelines and expert heuristics, reducing reliability in complex cases despite their interpretability benefits.
Method: Proposes MedCBR framework that transforms labeled clinical descriptors into guideline-conformant text, trains a concept-based model with multitask objective (multimodal contrastive alignment, concept supervision, diagnostic classification), and uses a reasoning model to generate structured clinical narratives explaining diagnoses.
Result: Achieves superior diagnostic performance with AUROCs of 94.2% on ultrasound and 84.0% on mammography, plus 86.1% accuracy on non-medical datasets, while generating interpretable clinical narratives.
Conclusion: MedCBR enhances interpretability in medical AI by bridging image analysis to clinical decision-making through guideline-integrated concept-based reasoning and narrative generation.
Abstract: Concept Bottleneck Models (CBMs) are a prominent framework for interpretable AI that map learned visual features to a set of meaningful concepts for task-specific downstream predictions. Their sequential structure enhances transparency by connecting model predictions to the underlying concepts that support them. In medical imaging, where transparency is essential, CBMs offer an appealing foundation for explainable model design. However, discrete concept representations often overlook broader clinical context such as diagnostic guidelines and expert heuristics, reducing reliability in complex cases. We propose MedCBR, a concept-based reasoning framework that integrates clinical guidelines with vision-language and reasoning models. Labeled clinical descriptors are transformed into guideline-conformant text, and a concept-based model is trained with a multitask objective combining multimodal contrastive alignment, concept supervision, and diagnostic classification to jointly ground image features, concepts, and pathology. A reasoning model then converts these predictions into structured clinical narratives that explain the diagnosis, emulating expert reasoning based on established guidelines. MedCBR achieves superior diagnostic and concept-level performance, with AUROCs of 94.2% on ultrasound and 84.0% on mammography. Further experiments on non-medical datasets achieve 86.1% accuracy. Our framework enhances interpretability and forms an end-to-end bridge from medical image analysis to decision-making.
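The sequential CBM structure the abstract describes, where features map to named concepts and the diagnosis depends on those concepts alone, can be sketched as below. The linear maps, concept names, and sign threshold are illustrative simplifications, not MedCBR's trained components.

```python
def concept_bottleneck_predict(features, concept_weights, diag_weights):
    """Two-stage concept-bottleneck prediction.

    Stage 1 maps image features to human-readable concept scores;
    stage 2 predicts the diagnosis from the concepts alone, which is
    what makes the prediction auditable concept by concept.
    """
    concepts = {
        name: sum(w * f for w, f in zip(ws, features))
        for name, ws in concept_weights.items()
    }
    score = sum(diag_weights[name] * v for name, v in concepts.items())
    return concepts, ("suspicious" if score > 0 else "benign")
```

Because the diagnosis is a function of the concept scores only, a reviewer can trace which concept (e.g. a hypothetical "irregular_margin") drove the call, which is the hook MedCBR's narrative generation builds on.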
[105] TIDE: Text-Informed Dynamic Extrapolation with Step-Aware Temperature Control for Diffusion Transformers
Yihua Liu, Fanjiang Ye, Bowen Lin, Rongyu Fang, Chengming Zhang
Main category: cs.CV
TL;DR: TIDE is a training-free method for Diffusion Transformers that enables high-resolution text-to-image generation beyond training resolution by addressing attention dilution through text anchoring and dynamic temperature control.
Details
Motivation: Diffusion Transformers (DiTs) struggle with generating images at higher resolutions than their training resolution due to attention dilution, which causes structural degradation and loss of semantic details. Previous methods that sharpen attention distributions fail to preserve fine-grained details and introduce artifacts.
Method: TIDE introduces two key mechanisms: 1) Text anchoring to correct the imbalance between text and image tokens and prevent prompt information loss, and 2) Dynamic temperature control that leverages spectral progression patterns in the diffusion process to eliminate artifacts. The method is training-free and works with arbitrary resolutions and aspect ratios.
Result: Extensive evaluations show TIDE delivers high-quality resolution extrapolation capability, seamlessly integrates with existing state-of-the-art methods, and generates images at arbitrary resolutions without additional sampling overhead.
Conclusion: TIDE effectively addresses the resolution extrapolation problem in Diffusion Transformers through training-free mechanisms that preserve semantic details and eliminate artifacts, enabling flexible high-resolution text-to-image generation.
Abstract: Diffusion Transformer (DiT) faces challenges when generating images at resolutions higher than its training resolution, causing structural degradation in particular due to attention dilution. Previous approaches attempt to mitigate this by sharpening attention distributions, but fail to preserve fine-grained semantic details and introduce obvious artifacts. In this work, we analyze the characteristics of DiTs and propose TIDE, a training-free text-to-image (T2I) extrapolation method that enables generation with arbitrary resolution and aspect ratio without additional sampling overhead. We identify the core factor for prompt information loss, and introduce a text anchoring mechanism to correct the imbalance between text and image tokens. To further eliminate artifacts, we design a dynamic temperature control mechanism that leverages the pattern of spectral progression in the diffusion process. Extensive evaluations demonstrate that TIDE delivers high-quality resolution extrapolation capability and integrates seamlessly with existing state-of-the-art methods.
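Temperature control over attention is the core knob here: a temperature below 1 sharpens the softmax (counteracting the dilution that spreads attention over many more tokens at high resolution), while a temperature of 1 recovers standard attention. The step-aware schedule below is a hypothetical linear ramp, not TIDE's actual schedule.

```python
import math

def attention_weights(scores, temperature):
    """Softmax over attention scores with a temperature knob:
    T < 1 sharpens the distribution, T > 1 flattens it."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def step_temperature(step, num_steps, t_start=0.7, t_end=1.0):
    """Illustrative step-aware schedule: sharpen early denoising steps
    (where global, low-frequency structure forms) and relax toward
    standard softmax as high-frequency detail is filled in."""
    frac = step / max(num_steps - 1, 1)
    return t_start + (t_end - t_start) * frac
```

Tying the temperature to the denoising step is what lets the sharpening help global structure early without flattening fine texture late, where aggressive sharpening is a known source of artifacts.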
[106] FetalAgents: A Multi-Agent System for Fetal Ultrasound Image and Video Analysis
Xiaotian Hu, Junwei Huang, Mingxuan Liu, Kasidit Anmahapong, Yifei Chen, Yitong Luo, Yiming Huang, Xuguang Bai, Zihan Li, Yi Liao, Haibo Qu, Qiyuan Tian
Main category: cs.CV
TL;DR: FetalAgents: A multi-agent system for comprehensive fetal ultrasound analysis using coordinated vision experts for diagnosis, measurement, segmentation, and end-to-end video stream summarization with clinical report generation.
Details
Motivation: Fetal ultrasound interpretation relies heavily on clinician expertise, and existing automated tools struggle to balance task-specific accuracy with the whole-process versatility needed for end-to-end clinical workflows.
Method: Proposes FetalAgents, a multi-agent system with lightweight agentic coordination framework that dynamically orchestrates specialized vision experts for comprehensive fetal US analysis, supporting both static image analysis and end-to-end video stream summarization with automatic keyframe identification and structured report generation.
Result: Extensive multi-center external evaluations across eight clinical tasks show FetalAgents consistently delivers the most robust and accurate performance compared to specialized models and multimodal large language models (MLLMs).
Conclusion: FetalAgents provides an auditable, workflow-aligned solution for fetal ultrasound analysis and reporting that advances beyond static image analysis to support comprehensive clinical workflows.
Abstract: Fetal ultrasound (US) is the primary imaging modality for prenatal screening, yet its interpretation relies heavily on the expertise of the clinician. Despite advances in deep learning and foundation models, existing automated tools for fetal US analysis struggle to balance task-specific accuracy with the whole-process versatility required to support end-to-end clinical workflows. To address these limitations, we propose FetalAgents, the first multi-agent system for comprehensive fetal US analysis. Through a lightweight, agentic coordination framework, FetalAgents dynamically orchestrates specialized vision experts to maximize performance across diagnosis, measurement, and segmentation. Furthermore, FetalAgents advances beyond static image analysis by supporting end-to-end video stream summarization, where keyframes are automatically identified across multiple anatomical planes, analyzed by coordinated experts, and synthesized with patient metadata into a structured clinical report. Extensive multi-center external evaluations across eight clinical tasks demonstrate that FetalAgents consistently delivers the most robust and accurate performance when compared against specialized models and multimodal large language models (MLLMs), ultimately providing an auditable, workflow-aligned solution for fetal ultrasound analysis and reporting.
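The coordination layer can be pictured as a dispatcher that routes each task to the registered expert for its type and collects the outputs into one report. The task shape, expert names, and report format below are illustrative assumptions; the actual system additionally handles keyframe selection and patient metadata.

```python
def orchestrate(tasks, experts):
    """Minimal agentic coordination loop.

    `tasks` are dicts with "id", "type", and "input"; `experts` maps a
    task type to a callable vision expert. Unknown task types are
    flagged rather than silently dropped, keeping the report auditable.
    """
    report = {}
    for task in tasks:
        expert = experts.get(task["type"])
        if expert is None:
            report[task["id"]] = {"status": "unsupported"}
        else:
            report[task["id"]] = {"status": "ok",
                                  "result": expert(task["input"])}
    return report
```

Keeping routing explicit like this, instead of one monolithic model, is what makes it cheap to swap in a stronger expert for a single task.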
[107] Can You Hear, Localize, and Segment Continually? An Exemplar-Free Continual Learning Benchmark for Audio-Visual Segmentation
Siddeshwar Raghavan, Gautham Vinod, Bruce Coburn, Fengqing Zhu
Main category: cs.CV
TL;DR: ATLAS introduces a continual learning benchmark and method for Audio-Visual Segmentation that handles evolving audio-visual distributions over time using audio-guided pre-fusion conditioning and Low-Rank Anchoring to prevent forgetting.
Details
Motivation: Real-world audio-visual environments are dynamic with evolving distributions over time, but existing AVS systems assume static training settings, creating a gap for lifelong learning scenarios.
Method: Proposes ATLAS with audio-guided pre-fusion conditioning (modulating visual features via projected audio context before cross-modal attention) and Low-Rank Anchoring (stabilizing adapted weights based on loss sensitivity) to mitigate catastrophic forgetting.
Result: Extensive experiments show competitive performance across diverse continual learning scenarios, establishing a foundation for lifelong audio-visual perception.
Conclusion: First exemplar-free continual learning benchmark for AVS with strong baseline method that addresses evolving audio-visual distributions in real-world environments.
Abstract: Audio-Visual Segmentation (AVS) aims to produce pixel-level masks of sound-producing objects in videos, by jointly learning from audio and visual signals. However, real-world environments are inherently dynamic, causing audio and visual distributions to evolve over time, which challenges existing AVS systems that assume static training settings. To address this gap, we introduce the first exemplar-free continual learning benchmark for Audio-Visual Segmentation, comprising four learning protocols across single-source and multi-source AVS datasets. We further propose a strong baseline, ATLAS, which uses audio-guided pre-fusion conditioning to modulate visual feature channels via projected audio context before cross-modal attention. Finally, we mitigate catastrophic forgetting by introducing Low-Rank Anchoring (LRA), which stabilizes adapted weights based on loss sensitivity. Extensive experiments demonstrate competitive performance across diverse continual scenarios, establishing a foundation for lifelong audio-visual perception. Code is available (paper under review) at https://gitlab.com/viper-purdue/atlas. Keywords: Continual Learning, Audio-Visual Segmentation, Multi-Modal Learning.
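Low-Rank Anchoring can be sketched as keeping the adapted weights tied to anchor weights plus a low-rank update A @ B whose scale shrinks as the layer's loss sensitivity grows, so parameters that earlier tasks depend on move less. The 1/(1 + s/tau) damping is an illustrative choice, not the paper's exact rule.

```python
def low_rank_anchor(w_anchor, a, b, sensitivity, tau=1.0):
    """Anchor-plus-low-rank weight update, damped by loss sensitivity.

    w = w_anchor + scale * (A @ B), where `a` is rows x r, `b` is
    r x cols, and scale = 1 / (1 + sensitivity / tau). A sensitive
    layer (high `sensitivity`) stays close to its anchor.
    """
    scale = 1.0 / (1.0 + sensitivity / tau)
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[w_anchor[i][j]
             + scale * sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(cols)] for i in range(rows)]
```

The low-rank factorization keeps the per-task parameter count tiny, which is what makes this practical in an exemplar-free setting where no old data can be replayed.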
[108] Using Vision Language Foundation Models to Generate Plant Simulation Configurations via In-Context Learning
Heesup Yun, Isaac Kazuo Uyehara, Earl Ranario, Lars Lundqvist, Christine H. Diepenbrock, Brian N. Bailey, J. Mason Earles
Main category: cs.CV
TL;DR: VLMs (Gemma 3 and Qwen3-VL) are used to generate plant simulation configurations in JSON format from drone images for agricultural digital twins, evaluated on synthetic cowpea data with mixed results.
Details
Motivation: Functional-structural plant models (FSPMs) are complex and low-throughput, creating bottlenecks for large-scale deployment in agriculture. There's a need for scalable methods to generate simulation parameters from remote sensing data for digital twins.
Method: Uses state-of-the-art open-source VLMs (Gemma 3 and Qwen3-VL) to generate simulation parameters in JSON format from drone images. Tests five in-context learning methods on synthetic cowpea plot dataset generated via Helios 3D procedural plant generation library. Evaluates across JSON integrity, geometric evaluations, and biophysical evaluations.
Result: VLMs can interpret structural metadata and estimate parameters like plant count and sun azimuth, but exhibit performance degradation due to contextual bias or rely on dataset means when visual cues are insufficient. Validation on real-world drone data and ablation study characterize reasoning capabilities versus contextual priors.
Conclusion: First study to utilize VLMs for generating structural JSON configurations for plant simulations, providing a scalable framework for reconstructing 3D plots for agricultural digital twins, though models have limitations with insufficient visual cues.
Abstract: This paper introduces a synthetic benchmark to evaluate the performance of vision language models (VLMs) in generating plant simulation configurations for digital twins. While functional-structural plant models (FSPMs) are useful tools for simulating biophysical processes in agricultural environments, their high complexity and low throughput create bottlenecks for deployment at scale. We propose a novel approach that leverages state-of-the-art open-source VLMs – Gemma 3 and Qwen3-VL – to directly generate simulation parameters in JSON format from drone-based remote sensing images. Using a synthetic cowpea plot dataset generated via the Helios 3D procedural plant generation library, we tested five in-context learning methods and evaluated the models across three categories: JSON integrity, geometric evaluations, and biophysical evaluations. Our results show that while VLMs can interpret structural metadata and estimate parameters like plant count and sun azimuth, they often exhibit performance degradation due to contextual bias or rely on dataset means when visual cues are insufficient. Validation on a real-world drone orthophoto dataset and an ablation study using a blind baseline further characterize the models’ reasoning capabilities versus their reliance on contextual priors. To the best of our knowledge, this is the first study to utilize VLMs to generate structural JSON configurations for plant simulations, providing a scalable framework for reconstructing 3D plots for digital twins in agriculture.
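A JSON-integrity check of the kind the benchmark's first evaluation category implies can be sketched as: the VLM output must parse, contain every required key, and hold numeric values. The key names below (`plant_count`, `sun_azimuth`) are taken from the parameters the paper mentions; the actual Helios parameter schema is richer.

```python
import json

def check_config_integrity(raw, required):
    """Validate a VLM-generated simulation config string.

    Returns (True, parsed_config) on success, or (False, reason)
    describing the first integrity failure.
    """
    try:
        cfg = json.loads(raw)
    except json.JSONDecodeError:
        return False, "invalid JSON"
    missing = [k for k in required if k not in cfg]
    if missing:
        return False, "missing keys: " + ", ".join(missing)
    bad = [k for k in required if not isinstance(cfg[k], (int, float))]
    if bad:
        return False, "non-numeric values: " + ", ".join(bad)
    return True, cfg
```

Gating on integrity first keeps the geometric and biophysical evaluations from being polluted by outputs that never parsed.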
[109] PathoScribe: Transforming Pathology Data into a Living Library with a Unified LLM-Driven Framework for Semantic Retrieval and Clinical Integration
Abdul Rehman Akbar, Samuel Wales-McGrath, Alejadro Levya, Lina Gokhale, Rajendra Singh, Wei Chen, Anil Parwani, Muhammad Khalid Khan Niazi
Main category: cs.CV
TL;DR: PathoScribe is a retrieval-augmented LLM framework that transforms static pathology archives into searchable, reasoning-enabled libraries for natural language case exploration, cohort construction, and clinical question answering.
Details
Motivation: Pathology reports contain valuable accumulated experience but remain largely inaccessible despite digitization. Current archives risk becoming passive data repositories without effective retrieval and reasoning mechanisms, preventing institutional knowledge from meaningfully informing patient care.
Method: A unified retrieval-augmented large language model framework that enables natural language case exploration, automated cohort construction, clinical question answering, immunohistochemistry panel recommendation, and prompt-controlled report transformation within a single architecture.
Result: Evaluated on 70,000 multi-institutional surgical pathology reports, achieved perfect Recall@10 for natural language case retrieval, high-quality retrieval-grounded reasoning (mean 4.56/5), and automated cohort construction with 91.3% agreement to human reviewers in mean 9.2 minutes.
Conclusion: PathoScribe establishes a scalable foundation for converting digital pathology archives from passive storage systems into active clinical intelligence platforms, enabling real-time interrogation of prior similar cases during diagnostic evaluation.
Abstract: Pathology underpins modern diagnosis and cancer care, yet its most valuable asset, the accumulated experience encoded in millions of narrative reports, remains largely inaccessible. Although institutions are rapidly digitizing pathology workflows, storing data without effective mechanisms for retrieval and reasoning risks transforming archives into a passive data repository, where institutional knowledge exists but cannot meaningfully inform patient care. True progress requires not only digitization, but the ability for pathologists to interrogate prior similar cases in real time while evaluating a new diagnostic dilemma. We present PathoScribe, a unified retrieval-augmented large language model (LLM) framework designed to transform static pathology archives into a searchable, reasoning-enabled living library. PathoScribe enables natural language case exploration, automated cohort construction, clinical question answering, immunohistochemistry (IHC) panel recommendation, and prompt-controlled report transformation within a single architecture. Evaluated on 70,000 multi-institutional surgical pathology reports, PathoScribe achieved perfect Recall@10 for natural language case retrieval and demonstrated high-quality retrieval-grounded reasoning (mean reviewer score 4.56/5). Critically, the system operationalized automated cohort construction from free-text eligibility criteria, assembling research-ready cohorts in minutes (mean 9.2 minutes) with 91.3% agreement to human reviewers and no eligible cases incorrectly excluded, representing orders-of-magnitude reductions in time and cost compared to traditional manual chart review. This work establishes a scalable foundation for converting digital pathology archives from passive storage systems into active clinical intelligence platforms.
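The headline retrieval metric, Recall@10, has a simple definition: the fraction of relevant cases that appear within the top 10 ranked results. A minimal implementation:

```python
def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Recall@k for case retrieval.

    `ranked_ids` is the system's ranking (best first); `relevant_ids`
    is the set of ground-truth relevant case IDs. A perfect Recall@10,
    as reported for PathoScribe, means every relevant case surfaces
    within the first ten hits.
    """
    if not relevant_ids:
        return 1.0
    top = set(ranked_ids[:k])
    return sum(1 for r in relevant_ids if r in top) / len(relevant_ids)
```

Note that Recall@k ignores ordering inside the top k; systems that must show the best case first would pair it with a rank-sensitive metric.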
[110] LiM-YOLO: Less is More with Pyramid Level Shift and Normalized Auxiliary Branch for Ship Detection in Optical Remote Sensing Imagery
Seon-Hoon Kim, Hyeji Sim, Youeyun Jung, Ok-Chul Jung, Yerin Kim
Main category: cs.CV
TL;DR: LiM-YOLO proposes a streamlined object detector for ship detection in satellite imagery by shifting feature pyramid levels from P3-P5 to P2-P4 to address extreme scale disparity and high aspect ratios of maritime targets.
Details
Motivation: General-purpose object detectors struggle with ship detection in satellite imagery due to extreme scale disparity and high aspect ratios. The deepest feature pyramid level (P5) compresses narrow vessels into sub-pixel representations, causing severe spatial feature dilution that prevents resolving fine-grained ship boundaries.
Method: 1) Statistical analysis of ship scale distributions across four benchmarks; 2) Pyramid Level Shift Strategy reconfiguring detection head from conventional P3-P5 to P2-P4 to comply with Nyquist sampling for small targets; 3) Group Normalized Convolutional Block for Linear Projection (GN-CBLinear) replacing batch-dependent normalization with Group Normalization to stabilize training on high-resolution satellite inputs in memory-constrained micro-batch regimes.
Result: Validated on SODA-A, DOTA-v1.5, FAIR1M-v2.0, and ShipRSImageNet-V1, LiM-YOLO achieves state-of-the-art detection accuracy with significantly fewer parameters than existing methods.
Conclusion: A well-targeted pyramid level shift can achieve a “Less is More” balance between accuracy and efficiency for ship detection in satellite imagery.
Abstract: Applying general-purpose object detectors to ship detection in satellite imagery presents significant challenges due to the extreme scale disparity and high aspect ratios of maritime targets. In conventional YOLO architectures, the deepest feature pyramid level (P5, stride of 32) compresses narrow vessels into sub-pixel representations, causing severe spatial feature dilution that prevents the network from resolving fine-grained ship boundaries. In this work, we propose LiM-YOLO (Less is More YOLO), a streamlined detector designed to address these domain-specific structural conflicts. Through a statistical analysis of ship scale distributions across four major benchmarks, we introduce a Pyramid Level Shift Strategy that reconfigures the detection head from the conventional P3-P5 to P2-P4. This shift ensures compliance with the Nyquist sampling condition for small targets while eliminating the computational redundancy inherent in the deep P5 layers. To further stabilize training on high-resolution satellite inputs, we incorporate a Group Normalized Convolutional Block for Linear Projection (GN-CBLinear), which replaces batch-dependent normalization with Group Normalization to overcome gradient instability in memory-constrained micro-batch regimes. Validated on SODA-A, DOTA-v1.5, FAIR1M-v2.0, and ShipRSImageNet-V1, LiM-YOLO achieves state-of-the-art detection accuracy with significantly fewer parameters than existing methods, validating that a well-targeted pyramid level shift can achieve a “Less is More” balance between accuracy and efficiency. The code is available at https://github.com/egshkim/LiM-YOLO.
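The Nyquist argument behind the P3-P5 to P2-P4 shift is checkable with arithmetic: a level with stride s samples the image every s pixels, so an object needs an extent of at least 2s pixels to be resolved there. With the YOLO stride convention (P2=4, P3=8, P4=16, P5=32):

```python
def levels_resolving(object_size_px, levels):
    """Return the pyramid levels whose stride satisfies the Nyquist
    condition (object extent >= 2 * stride) for a target of the given
    pixel size. Strides follow the YOLO convention."""
    strides = {"P2": 4, "P3": 8, "P4": 16, "P5": 32}
    return [lvl for lvl in levels if object_size_px >= 2 * strides[lvl]]
```

A 12-pixel-wide vessel is resolvable only at P2, and even a 60-pixel ship falls below P5's 64-pixel threshold, which is the structural case for dropping P5 in favor of P2.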
[111] BiCLIP: Domain Canonicalization via Structured Geometric Transformation
Pranav Mantini, Shishir K. Shah
Main category: cs.CV
TL;DR: BiCLIP is a simple framework that applies targeted geometric transformations to multimodal features to enhance cross-modal alignment for few-shot domain adaptation in vision-language models.
Details
Motivation: Vision-language models have strong zero-shot capabilities but struggle with domain adaptation. The paper hypothesizes that image features across domains are related by canonical geometric transformations that can be recovered using few-shot anchors.
Method: BiCLIP applies targeted bilinear transformations to multimodal features to align them across domains. It uses few-shot labeled samples as anchors to estimate the transformation, maintaining extreme simplicity and low parameter count.
Result: Achieves state-of-the-art results across 11 standard benchmarks including EuroSAT, DTD, and FGVCAircraft. Empirical analysis confirms the orthogonality and angular distribution of learned transformations, validating the geometric alignment hypothesis.
Conclusion: Structured geometric alignment through targeted transformations is key to robust domain adaptation in vision-language models, enabling effective few-shot learning with minimal parameters.
Abstract: Recent advances in vision-language models (VLMs) have demonstrated remarkable zero-shot capabilities, yet adapting these models to specialized domains remains a significant challenge. Building on recent theoretical insights suggesting that independently trained VLMs are related by a canonical transformation, we extend this understanding to the concept of domains. We hypothesize that image features across disparate domains are related by a canonicalized geometric transformation that can be recovered using a small set of anchors. Few-shot classification provides a natural setting for this alignment, as the limited labeled samples serve as the anchors required to estimate this transformation. Motivated by this hypothesis, we introduce BiCLIP, a framework that applies a targeted transformation to multimodal features to enhance cross-modal alignment. Our approach is characterized by its extreme simplicity and low parameter footprint. Extensive evaluations across 11 standard benchmarks, including EuroSAT, DTD, and FGVCAircraft, demonstrate that BiCLIP consistently achieves state-of-the-art results. Furthermore, we provide empirical verification of existing geometric findings by analyzing the orthogonality and angular distribution of the learned transformations, confirming that structured alignment is the key to robust domain adaptation. Code is available at https://github.com/QuantitativeImagingLaboratory/BilinearCLIP
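As a toy illustration of recovering a canonical geometric transformation from few-shot anchors, here a 2-D rotation stands in for BiCLIP's bilinear map (the averaging scheme and all names are our own assumptions, not the paper's method):

```python
import math

def estimate_rotation(src, dst):
    """Average the per-anchor angle difference between matched points.
    A naive mean suffices for this toy; real code would use a circular
    mean to handle wraparound."""
    diffs = [math.atan2(y2, x2) - math.atan2(y1, x1)
             for (x1, y1), (x2, y2) in zip(src, dst)]
    return sum(diffs) / len(diffs)

def rotate(p, theta):
    x, y = p
    return (x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta))

# Three labeled "few-shot anchors" relate the two toy domains.
src = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
theta_true = 0.5
dst = [rotate(p, theta_true) for p in src]
theta_hat = estimate_rotation(src, dst)
```

The recovered angle then maps any unseen source feature into the target domain, mirroring how the estimated transformation generalizes beyond the anchors.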
[112] SVG-EAR: Parameter-Free Linear Compensation for Sparse Video Generation via Error-aware Routing
Xuanyi Zhou, Qiuyang Mang, Shuo Yang, Haocheng Xi, Jintao Zhang, Huanzhi Mao, Joseph E. Gonzalez, Kurt Keutzer, Ion Stoica, Alvin Cheung
Main category: cs.CV
TL;DR: SVG-EAR is a parameter-free method that accelerates video diffusion transformers by using semantic clustering and error-aware routing to approximate skipped attention blocks, achieving up to 1.93× speedup while maintaining generation quality.
Details
Motivation: Diffusion Transformers (DiTs) face quadratic attention cost bottlenecks in video generation. Existing sparse attention methods either drop blocks (causing information loss) or use learned predictors (adding training overhead and potential distribution shifts).
Method: 1) Semantic clustering of keys/values to identify similar blocks; 2) Parameter-free linear compensation branch using cluster centroids to approximate skipped blocks; 3) Error-aware routing with lightweight probe to estimate compensation error and compute blocks with highest error-to-cost ratio.
Result: Achieves up to 1.77× and 1.93× speedups on Wan2.2 and HunyuanVideo, respectively, while maintaining PSNRs of up to 29.759 and 31.043. Establishes a Pareto frontier over prior approaches with a better quality-efficiency trade-off.
Conclusion: SVG-EAR provides an effective, training-free solution to accelerate video diffusion transformers by intelligently approximating attention blocks through semantic clustering and error-aware routing, overcoming limitations of existing sparse attention methods.
Abstract: Diffusion Transformers (DiTs) have become a leading backbone for video generation, yet their quadratic attention cost remains a major bottleneck. Sparse attention reduces this cost by computing only a subset of attention blocks. However, prior methods often either drop the remaining blocks, which incurs information loss, or rely on learned predictors to approximate them, introducing training overhead and potential output distribution shifting. In this paper, we show that the missing contributions can be recovered without training: after semantic clustering, keys and values within each block exhibit strong similarity and can be well summarized by a small set of cluster centroids. Based on this observation, we introduce SVG-EAR, a parameter-free linear compensation branch that uses the centroid to approximate skipped blocks and recover their contributions. While centroid compensation is accurate for most blocks, it can fail on a small subset. Standard sparsification typically selects blocks by attention scores, which indicate where the model places its attention mass, but not where the approximation error would be largest. SVG-EAR therefore performs error-aware routing: a lightweight probe estimates the compensation error for each block, and we compute exactly the blocks with the highest error-to-cost ratio while compensating for skipped blocks. We provide theoretical guarantees that relate attention reconstruction error to clustering quality, and empirically show that SVG-EAR improves the quality-efficiency trade-off and increases throughput at the same generation fidelity on video diffusion tasks. Overall, SVG-EAR establishes a clear Pareto frontier over prior approaches, achieving up to 1.77$\times$ and 1.93$\times$ speedups while maintaining PSNRs of up to 29.759 and 31.043 on Wan2.2 and HunyuanVideo, respectively.
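The error-aware routing step can be sketched as a greedy budgeted selection (an assumption-laden simplification; the paper's probe and cost model are not reproduced here):

```python
def route_blocks(errors, costs, budget):
    """Compute exactly the blocks with the highest estimated
    error-to-cost ratio within the compute budget; mark the rest
    for centroid-based linear compensation."""
    ranked = sorted(range(len(errors)),
                    key=lambda i: errors[i] / costs[i], reverse=True)
    exact, spent = [], 0.0
    for i in ranked:
        if spent + costs[i] <= budget:
            exact.append(i)
            spent += costs[i]
    compensated = [i for i in range(len(errors)) if i not in exact]
    return sorted(exact), compensated

errors = [0.9, 0.1, 0.5, 0.05]   # probe-estimated compensation error
costs = [1.0, 1.0, 1.0, 1.0]     # uniform block cost for simplicity
exact, comp = route_blocks(errors, costs, budget=2.0)
# blocks 0 and 2 (worst-approximated) run full attention;
# blocks 1 and 3 fall back to the compensation branch
```

Note the selection key is estimated *error*, not attention mass, which is the distinction the paper draws against standard score-based sparsification.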
[113] SkipGS: Post-Densification Backward Skipping for Efficient 3DGS Training
Jingxing Li, Yongjae Lee, Deliang Fan
Main category: cs.CV
TL;DR: SkipGS reduces 3D Gaussian Splatting training time by selectively skipping backward passes when view losses plateau, achieving 23.1% faster training with comparable quality.
Details
Motivation: 3D Gaussian Splatting (3DGS) achieves real-time novel-view synthesis but has expensive training costs, especially in the post-densification refinement phase where backward passes dominate runtime. The authors observed substantial update redundancy: many sampled views have plateaued losses but still undergo full backpropagation.
Method: SkipGS introduces a view-adaptive backward gating mechanism for efficient post-densification training. It always performs forward passes to update per-view loss statistics, but selectively skips backward passes when the sampled view’s loss is consistent with its recent baseline. The method enforces a minimum backward budget for stable optimization.
Result: On Mip-NeRF 360, SkipGS reduces end-to-end training time by 23.1% compared to 3DGS, driven by a 42.0% reduction in post-densification time, while maintaining comparable reconstruction quality. The method is plug-and-play and compatible with other efficiency strategies.
Conclusion: SkipGS provides an effective approach to accelerate 3DGS training by intelligently skipping redundant backward passes without modifying the core representation or renderer, making it a practical efficiency enhancement for 3D scene reconstruction.
Abstract: 3D Gaussian Splatting (3DGS) achieves real-time novel-view synthesis by optimizing millions of anisotropic Gaussians, yet its training remains expensive, with the backward pass dominating runtime in the post-densification refinement phase. We observe substantial update redundancy in this phase: many sampled views have near-plateaued losses and provide diminishing gradient benefits, but standard training still runs full backpropagation. We propose SkipGS with a novel view-adaptive backward gating mechanism for efficient post-densification training. SkipGS always performs the forward pass to update per-view loss statistics, and selectively skips backward passes when the sampled view’s loss is consistent with its recent per-view baseline, while enforcing a minimum backward budget for stable optimization. On Mip-NeRF 360, compared to 3DGS, SkipGS reduces end-to-end training time by 23.1%, driven by a 42.0% reduction in post-densification time, with comparable reconstruction quality. Because it only changes when to backpropagate – without modifying the renderer, representation, or loss – SkipGS is plug-and-play and compatible with other complementary efficiency strategies for additive speedups.
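A minimal sketch of the backward-gating idea, assuming an exponential-moving-average baseline and a fixed tolerance (both our own choices, not the paper's exact statistics):

```python
class BackwardGate:
    def __init__(self, tol=0.01, min_backward_frac=0.2, momentum=0.9):
        self.tol = tol
        self.min_frac = min_backward_frac
        self.m = momentum
        self.baseline = {}      # per-view running loss baseline
        self.steps = 0
        self.backwards = 0

    def should_backprop(self, view_id, loss):
        """Forward pass always runs; decide whether backward does."""
        self.steps += 1
        prev = self.baseline.get(view_id)
        if prev is None:
            self.baseline[view_id] = loss
        else:
            self.baseline[view_id] = self.m * prev + (1 - self.m) * loss
        # enforce a minimum share of backward passes for stability
        if self.backwards < self.min_frac * self.steps:
            self.backwards += 1
            return True
        # backprop only if the loss moved off its recent baseline
        if prev is None or abs(loss - prev) > self.tol * max(prev, 1e-8):
            self.backwards += 1
            return True
        return False            # loss plateaued: skip backward

gate = BackwardGate()
did = [gate.should_backprop("view0", 0.5) for _ in range(3)]
# first pass backprops (budget); later identical losses are skipped
```

Because the gate only decides *when* to backpropagate, it slots in around any existing training loop without touching the renderer or representation.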
[114] Diffusion-Based Authentication of Copy Detection Patterns: A Multimodal Framework with Printer Signature Conditioning
Bolutife Atoki, Iuliia Tkachenko, Bertrand Kerautret, Carlos Crispim-Junior
Main category: cs.CV
TL;DR: Diffusion-based authentication framework for anti-counterfeiting using printer classification on Copy Detection Patterns
Details
Motivation: Counterfeiting poses serious risks across industries, and traditional authentication systems fail against high-quality counterfeits enabled by advanced printing/scanning and generative AI.
Method: Proposes diffusion-based framework that jointly leverages original binary template, printed CDP, and printer identity representation; extends ControlNet for class-conditioned noise prediction to enable printer classification.
Result: Outperforms traditional similarity metrics and prior deep learning approaches on Indigo 1 x 1 Base dataset; generalizes to counterfeit types unseen during training
Conclusion: Diffusion-based authentication effectively captures device-specific features for anti-counterfeiting, showing promise for robust printer classification systems
Abstract: Counterfeiting affects diverse industries, including pharmaceuticals, electronics, and food, posing serious health and economic risks. Printable unclonable codes, such as Copy Detection Patterns (CDPs), are widely used as an anti-counterfeiting measure and are applied to products and packaging. However, the increasing availability of high-resolution printing and scanning devices, along with advances in generative deep learning, undermines traditional authentication systems, which often fail to distinguish high-quality counterfeits from genuine prints. In this work, we propose a diffusion-based authentication framework that jointly leverages the original binary template, the printed CDP, and a representation of printer identity that captures relevant semantic information. Formulating authentication as multi-class printer classification over printer signatures lets our model capture fine-grained, device-specific features via spatial and textual conditioning. We extend ControlNet by repurposing the denoising process for class-conditioned noise prediction, enabling effective printer classification. On the Indigo 1 x 1 Base dataset, our method outperforms traditional similarity metrics and prior deep learning approaches. Results show the framework generalises to counterfeit types unseen during training.
[115] WS-Net: Weak-Signal Representation Learning and Gated Abundance Reconstruction for Hyperspectral Unmixing via State-Space and Weak Signal Attention Fusion
Zekun Long, Ali Zia, Guanyiman Fu, Vivien Rolland, Jun Zhou
Main category: cs.CV
TL;DR: WS-Net: A deep unmixing framework for hyperspectral images that addresses weak-signal collapse using state-space modeling and Weak Signal Attention fusion to improve abundance estimation of subtle spectral responses.
Details
Motivation: Weak spectral responses in hyperspectral images are often obscured by dominant endmembers and sensor noise, leading to inaccurate abundance estimation. Current methods struggle with weak-signal collapse where subtle spectral features get overwhelmed by stronger signals.
Method: WS-Net uses a multi-resolution wavelet-fused encoder with a hybrid backbone: Mamba state-space branch for long-range dependency modeling and Weak Signal Attention branch to enhance low-similarity spectral cues. A learnable gating mechanism fuses both representations, and the decoder uses KL-divergence-based regularization to enforce separability between dominant and weak endmembers.
Result: Experiments on one simulated and two real datasets (a synthetic dataset, Samson, and Apex) show consistent improvements over six state-of-the-art baselines, achieving up to 55% reduction in RMSE and 63% reduction in SAD. The framework maintains stable accuracy under low-SNR conditions, especially for weak endmembers.
Conclusion: WS-Net establishes a robust and computationally efficient benchmark for weak-signal hyperspectral unmixing, effectively addressing the weak-signal collapse problem through its innovative architecture combining state-space modeling and attention mechanisms.
Abstract: Weak spectral responses in hyperspectral images are often obscured by dominant endmembers and sensor noise, resulting in inaccurate abundance estimation. This paper introduces WS-Net, a deep unmixing framework specifically designed to address weak-signal collapse through state-space modelling and Weak Signal Attention fusion. The network features a multi-resolution wavelet-fused encoder that captures both high-frequency discontinuities and smooth spectral variations with a hybrid backbone that integrates a Mamba state-space branch for efficient long-range dependency modelling. It also incorporates a Weak Signal Attention branch that selectively enhances low-similarity spectral cues. A learnable gating mechanism adaptively fuses both representations, while the decoder leverages KL-divergence-based regularisation to enforce separability between dominant and weak endmembers. Experiments on one simulated and two real datasets (synthetic dataset, Samson, and Apex) demonstrate consistent improvements over six state-of-the-art baselines, achieving up to 55% and 63% reductions in RMSE and SAD, respectively. The framework maintains stable accuracy under low-SNR conditions, particularly for weak endmembers, establishing WS-Net as a robust and computationally efficient benchmark for weak-signal hyperspectral unmixing.
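The learnable gated fusion of the two branches reduces to a sigmoid-weighted blend; a toy element-wise version (our simplification, with scalar gate logits standing in for learned parameters):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fuse(ssm_feat, wsa_feat, gate_logits):
    """out_i = g_i * ssm_i + (1 - g_i) * wsa_i, g_i = sigmoid(logit_i).
    The gate learns, per element, whether to trust the state-space
    branch or the Weak Signal Attention branch."""
    return [sigmoid(g) * a + (1.0 - sigmoid(g)) * b
            for a, b, g in zip(ssm_feat, wsa_feat, gate_logits)]

# A large positive logit routes the state-space feature through;
# a large negative one routes the weak-signal feature instead.
out = gated_fuse([1.0, 1.0], [0.0, 0.0], [10.0, -10.0])
```

In the actual network the gate is a learned module over feature maps, but the blending arithmetic is the same.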
[116] Spectral-Structured Diffusion for Single-Image Rain Removal
Yucheng Xing, Xin Wang
Main category: cs.CV
TL;DR: SpectralDiff: A spectral-structured diffusion framework for single-image rain removal that incorporates structured spectral perturbations to guide progressive suppression of multi-directional rain components, with a full-product U-Net architecture for efficiency.
Details
Motivation: Rain streaks are directional and frequency-concentrated structures that overlap across multiple scales, making single-image rain removal challenging. Standard spatial-domain diffusion models don't explicitly account for these structured spectral characteristics.
Method: Introduces SpectralDiff, a spectral-structured diffusion framework that incorporates structured spectral perturbations to guide progressive suppression of multi-directional rain components. Proposes a full-product U-Net architecture that uses the convolution theorem to replace convolution operations with element-wise product layers for improved efficiency.
Result: Extensive experiments on synthetic and real-world benchmarks show competitive rain removal performance with improved model compactness and favorable inference efficiency compared to existing diffusion-based approaches.
Conclusion: SpectralDiff provides an effective diffusion-based framework for rain removal that explicitly accounts for spectral characteristics of rain streaks while maintaining computational efficiency.
Abstract: Rain streaks manifest as directional and frequency-concentrated structures that overlap across multiple scales, making single-image rain removal particularly challenging. While diffusion-based restoration models provide a powerful framework for progressive denoising, standard spatial-domain diffusion does not explicitly account for such structured spectral characteristics. We introduce SpectralDiff, a spectral-structured diffusion-based framework tailored for single-image rain removal. Rather than redefining the diffusion formulation, our method incorporates structured spectral perturbations to guide the progressive suppression of multi-directional rain components. To support this design, we further propose a full-product U-Net architecture that leverages the convolution theorem to replace convolution operations with element-wise product layers, improving computational efficiency while preserving modeling capacity. Extensive experiments on synthetic and real-world benchmarks demonstrate that SpectralDiff achieves competitive rain removal performance with improved model compactness and favorable inference efficiency compared to existing diffusion-based approaches.
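The convolution theorem underlying the full-product U-Net can be verified directly: circular convolution in the signal domain equals an element-wise product of spectra. A self-contained 1-D check with a naive DFT (illustrative only; a real model would use an FFT over 2-D feature maps):

```python
import cmath

def dft(x):
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n)
                for k in range(n)) for j in range(n)]

def idft(X):
    n = len(X)
    return [sum(X[j] * cmath.exp(2j * cmath.pi * j * k / n)
                for j in range(n)).real / n for k in range(n)]

def circular_conv(x, h):
    n = len(x)
    return [sum(x[m] * h[(k - m) % n] for m in range(n))
            for k in range(n)]

x, h = [1.0, 2.0, 3.0, 4.0], [1.0, 0.0, 0.0, 1.0]
direct = circular_conv(x, h)
via_product = idft([X * H for X, H in zip(dft(x), dft(h))])
# the two results agree up to floating-point error
```

This equivalence is what lets element-wise product layers in the frequency domain stand in for convolution while cutting per-layer cost.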
[117] Intelligent Spatial Estimation for Fire Hazards in Engineering Sites: An Enhanced YOLOv8-Powered Proximity Analysis Framework
Ammar K. AlMhdawi, Nonso Nnamoko, Alaa Mashan Ubaid
Main category: cs.CV
TL;DR: Enhanced dual-model YOLOv8 framework for fire detection with proximity-aware risk assessment, combining fire/smoke segmentation with object detection to compute distances and generate risk scores.
Details
Motivation: Extend conventional vision-based fire monitoring beyond simple detection to actionable hazard prioritization by assessing proximity risks between fire/smoke and surrounding entities like people, vehicles, and infrastructure.
Method: Dual-model approach: primary YOLOv8 instance segmentation for fire/smoke detection + secondary object detection model (pretrained on COCO) for identifying surrounding entities. Integrates outputs to compute pixel-based distances between fire regions and objects, converting to real-world measurements via pixel-to-meter scaling. Risk assessment combines fire evidence, object vulnerability, and distance-based exposure.
Result: Achieves strong performance: precision, recall, and F1 scores >90%, mAP@0.5 >91%. Generates annotated visual outputs showing fire locations, detected objects, estimated distances, and contextual risk information. Framework is lightweight and suitable for industrial/resource-constrained deployment.
Conclusion: Proposed framework successfully extends fire detection to risk-aware monitoring by integrating proximity assessment, providing actionable intelligence for emergency response and industrial safety applications.
Abstract: This study proposes an enhanced dual-model YOLOv8 framework for intelligent fire detection and proximity-aware risk assessment, extending conventional vision-based monitoring beyond simple detection to actionable hazard prioritization. The system is trained on a dataset of 9,860 annotated images to segment fire and smoke across complex environments. The framework combines a primary YOLOv8 instance segmentation model for fire and smoke detection with a secondary object detection model pretrained on the COCO dataset to identify surrounding entities such as people, vehicles, and infrastructure. By integrating the outputs of both models, the system computes pixel-based distances between detected fire regions and nearby objects and converts these values into approximate real-world measurements using a pixel-to-meter scaling approach. This proximity information is incorporated into a risk assessment mechanism that combines fire evidence, object vulnerability, and distance-based exposure to produce a quantitative risk score and alert level. The proposed framework achieves strong performance, with precision, recall, and F1 scores exceeding 90% and mAP@0.5 above 91%. The system generates annotated visual outputs showing fire locations, detected objects, estimated distances, and contextual risk information to support situational awareness. Implemented using open-source tools within the Google Colab environment, the framework is lightweight and suitable for deployment in industrial and resource-constrained settings.
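A hedged sketch of the proximity-aware scoring pipeline; the vulnerability weights, exposure decay, and alert thresholds below are illustrative assumptions, not the paper's calibrated values:

```python
# Illustrative object vulnerability weights (assumed, not from the paper).
VULNERABILITY = {"person": 1.0, "vehicle": 0.6, "infrastructure": 0.4}

def pixel_to_meters(dist_px, meters_per_pixel):
    """Pixel-to-meter scaling: convert an image-plane distance to an
    approximate real-world distance."""
    return dist_px * meters_per_pixel

def risk_score(fire_conf, obj_class, dist_px, meters_per_pixel):
    """Combine fire evidence, object vulnerability, and distance-based
    exposure into a single score in [0, 1]."""
    d = pixel_to_meters(dist_px, meters_per_pixel)
    exposure = 1.0 / (1.0 + d)          # closer objects score higher
    return fire_conf * VULNERABILITY[obj_class] * exposure

def alert_level(score):
    return "HIGH" if score > 0.5 else "MEDIUM" if score > 0.2 else "LOW"

# A confident fire detection 40 px (~1 m at 0.025 m/px) from a person:
s = risk_score(0.95, "person", dist_px=40, meters_per_pixel=0.025)
```

The actual system derives `fire_conf` and distances from the two YOLOv8 models' outputs; only the final combination step is sketched here.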
[118] GST-VLA: Structured Gaussian Spatial Tokens for 3D Depth-Aware Vision-Language-Action Models
Md Selim Sarowar, Omer Tariq, Sungho Kim
Main category: cs.CV
TL;DR: GST-VLA introduces 3D Gaussian spatial tokens and depth-aware chain-of-thought reasoning for vision-language-action models, improving performance on robotic manipulation tasks.
Details
Motivation: Current VLA models encode visual observations as 2D patch tokens without geometric structure, limiting their ability to handle 3D spatial reasoning tasks like robotic manipulation that require understanding depth, orientation, and spatial relationships.
Method: Two main contributions: 1) Gaussian Spatial Tokenizer (GST) converts depth and semantic features into anisotropic 3D Gaussian primitives with geometric properties; 2) 3D Depth-Aware Chain-of-Thought (DA-CoT) reasoning supervises four structured spatial thoughts. Uses cross-attention to access Gaussian field during reasoning and a flow-matching action expert for action decoding.
Result: Achieves 96.4% on LIBERO (+2.0% improvement) and 80.2% on SimplerEnv (+5.4% improvement). Ablations confirm independent and synergistic gains from each component, especially on precision-demanding tasks.
Conclusion: Explicit 3D geometric representation through Gaussian spatial tokens combined with structured spatial reasoning significantly improves VLA model performance on tasks requiring precise spatial understanding and manipulation.
Abstract: VLA models encode visual observations as 2D patch tokens with no intrinsic geometric structure. We introduce GST-VLA with two contributions. First, the Gaussian Spatial Tokenizer (GST) converts frozen dense depth and frozen semantic patch features into $N_g{=}128$ anisotropic 3D Gaussian primitives, each parameterized by a metric residual mean $\mu\in \mathbb{R}^3$, log-scale covariance $\log\sigma\in \mathbb{R}^3$, and learned opacity $\alpha\in (0,1)$. The covariance eigenstructure encodes local surface orientation, and opacity provides per-primitive geometric confidence, both inaccessible from scalar depth. Spatial attention pooling with learned queries concentrates the fixed token budget on geometrically salient regions rather than distributing uniformly. Second, 3D Depth-Aware Chain-of-Thought (DA-CoT) reasoning supervises four structured intermediate spatial thoughts, covering 3D object grounding, grasp affordance contact geometry, pairwise metric distances, and coarse SE(3) waypoints, as explicit generation targets in the training loss. A cross-attention sublayer at every VLM transformer block provides direct access to the raw 256-primitive Gaussian field during DA-CoT generation. A 300M-parameter flow-matching action expert with mixture-of-experts feedforward sublayers decodes 7-DoF delta action chunks via conditional ODE integration, conditioned on both VLM hidden states and DA-CoT outputs through dual cross-attention. Trained with composite $\mathcal{L}_\mathrm{flow} + \mathcal{L}_\mathrm{CoT} + \mathcal{L}_\mathrm{depth}$ across three progressive stages, GST-VLA achieves 96.4% on LIBERO (+2.0%) and 80.2% on SimplerEnv (+5.4%). Ablations isolate the contribution of each GST component, each DA-CoT thought, and each training stage, confirming independent and synergistic gains concentrated on precision-demanding tasks.
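The Gaussian primitive parameterization above can be sketched as a small container type (field names are our own; the paper stores $N_g{=}128$ such primitives per observation):

```python
import math
from dataclasses import dataclass

@dataclass
class GaussianPrimitive:
    mu: tuple          # metric residual mean in R^3
    log_sigma: tuple   # log-scale covariance diagonal in R^3
    opacity_logit: float

    @property
    def sigma(self):
        """Per-axis scale; anisotropy encodes local surface orientation."""
        return tuple(math.exp(v) for v in self.log_sigma)

    @property
    def opacity(self):
        """Squashed to (0, 1): per-primitive geometric confidence."""
        return 1.0 / (1.0 + math.exp(-self.opacity_logit))

g = GaussianPrimitive(mu=(0.1, -0.2, 0.3),
                      log_sigma=(0.0, -1.0, 0.5),
                      opacity_logit=2.0)
```

Storing log-scales and a logit keeps the raw parameters unconstrained for gradient-based learning while the derived `sigma` and `opacity` satisfy their range constraints by construction.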
[119] OmniEdit: A Training-free framework for Lip Synchronization and Audio-Visual Editing
Lixiang Lin, Siyuan Jin, Jinshan Zhang
Main category: cs.CV
TL;DR: OmniEdit is a training-free framework for lip synchronization and audio-visual editing that reformulates the editing paradigm to provide unbiased estimation and stable editing trajectories without requiring supervised fine-tuning.
Details
Motivation: Existing methods for lip synchronization and audio-visual editing require supervised fine-tuning of pre-trained models, leading to high computational overhead and data requirements. The authors aim to develop a training-free framework to address these limitations.
Method: The approach reformulates the editing paradigm by substituting the edit sequence in FlowEdit with the target sequence, yielding unbiased estimation of desired output. It also removes stochastic elements from the generation process to establish smooth and stable editing trajectories.
Result: Extensive experimental results validate the effectiveness and robustness of the proposed framework. The method achieves lip synchronization and audio-visual editing without requiring training or fine-tuning.
Conclusion: OmniEdit provides a training-free solution for lip synchronization and audio-visual editing that reduces computational overhead and data requirements while maintaining effectiveness and robustness.
Abstract: Lip synchronization and audio-visual editing have emerged as fundamental challenges in multimodal learning, underpinning a wide range of applications, including film production, virtual avatars, and telepresence. Despite recent progress, most existing methods for lip synchronization and audio-visual editing depend on supervised fine-tuning of pre-trained models, leading to considerable computational overhead and data requirements. In this paper, we present OmniEdit, a training-free framework designed for both lip synchronization and audio-visual editing. Our approach reformulates the editing paradigm by substituting the edit sequence in FlowEdit with the target sequence, yielding an unbiased estimation of the desired output. Moreover, by removing stochastic elements from the generation process, we establish a smooth and stable editing trajectory. Extensive experimental results validate the effectiveness and robustness of the proposed framework. Code is available at https://github.com/l1346792580123/OmniEdit.
[120] Chain of Event-Centric Causal Thought for Physically Plausible Video Generation
Zixuan Wang, Yixin Hu, Haolan Wang, Feng Chen, Yan Liu, Wen Li, Yinjie Lei
Main category: cs.CV
TL;DR: A framework for physically plausible video generation that decomposes physical phenomena into causal event chains using physics-driven reasoning and transition-aware cross-modal prompting.
Details
Motivation: Current video generation models lack proper conditioning mechanisms for modeling causal progression in physical phenomena, often rendering them as single moments rather than dynamically evolving sequences of causally connected events.
Method: Two key modules: 1) Physics-driven Event Chain Reasoning that decomposes physical phenomena into elementary event units using chain-of-thought reasoning with physical formula constraints, and 2) Transition-aware Cross-modal Prompting that transforms causal event units into temporally aligned vision-language prompts for continuity.
Result: Superior performance on PhyGenBench and VideoPhy benchmarks for generating physically plausible videos across diverse physical domains.
Conclusion: The framework successfully models physical phenomena as sequences of causally connected events, addressing limitations of current video generation models in capturing dynamic evolution and causal progression.
Abstract: Physically Plausible Video Generation (PPVG) has emerged as a promising avenue for modeling real-world physical phenomena. PPVG requires an understanding of commonsense knowledge, which remains a challenge for video diffusion models. Current approaches leverage the commonsense reasoning capability of large language models to embed physical concepts into prompts. However, generation models often render physical phenomena as a single moment defined by prompts, due to the lack of conditioning mechanisms for modeling causal progression. In this paper, we view PPVG as generating a sequence of causally connected and dynamically evolving events. To realize this paradigm, we design two key modules: (1) Physics-driven Event Chain Reasoning. This module decomposes the physical phenomena described in prompts into multiple elementary event units, leveraging chain-of-thought reasoning. To mitigate causal ambiguity, we embed physical formulas as constraints to impose deterministic causal dependencies during reasoning. (2) Transition-aware Cross-modal Prompting (TCP). To maintain continuity between events, this module transforms causal event units into temporally aligned vision-language prompts. It summarizes discrete event descriptions to obtain causally consistent narratives, while progressively synthesizing visual keyframes of individual events by interactive editing. Comprehensive experiments on PhyGenBench and VideoPhy benchmarks demonstrate that our framework achieves superior performance in generating physically plausible videos across diverse physical domains. Our code will be released soon.
[121] MedKCO: Medical Vision-Language Pretraining via Knowledge-Driven Cognitive Orchestration
Chenran Zhang, Ruiqi Wu, Tao Zhou, Yi Zhou
Main category: cs.CV
TL;DR: MedKCO introduces a knowledge-driven cognitive orchestration method for medical vision-language pretraining that uses curriculum learning with diagnostic sensitivity and intra-class representativeness ordering, plus self-paced asymmetric contrastive loss to handle medical image similarity.
Details
Motivation: Current medical VLP methods force models to learn simple and complex concepts simultaneously, which is anti-cognitive and leads to suboptimal feature representations, especially under distribution shift. The authors aim to address this limitation by introducing a more cognitively-aligned pretraining approach.
Method: Proposes MedKCO with two key components: 1) Two-level curriculum ordering of pretraining data based on diagnostic sensitivity and intra-class sample representativeness, 2) Self-paced asymmetric contrastive loss that dynamically adjusts pretraining objective participation to handle medical image inter-class similarity.
Result: Extensive experiments on three medical imaging scenarios across multiple vision-language downstream tasks show the method significantly surpasses all baselines and curriculum learning methods.
Conclusion: MedKCO provides an effective knowledge-driven cognitive orchestration approach for medical VLP that improves feature learning and generalization through curriculum-based data ordering and adaptive contrastive loss.
Abstract: Medical vision-language pretraining (VLP) models have recently been investigated for their generalization to diverse downstream tasks. However, current medical VLP methods typically force the model to learn simple and complex concepts simultaneously. This anti-cognitive process leads to suboptimal feature representations, especially under distribution shift. To address this limitation, we propose a Knowledge-driven Cognitive Orchestration for Medical VLP (MedKCO) that involves both the ordering of the pretraining data and the learning objective of vision-language contrast. Specifically, we design a two-level curriculum by incorporating diagnostic sensitivity and intra-class sample representativeness for the ordering of the pretraining data. Moreover, considering the inter-class similarity of medical images, we introduce a self-paced asymmetric contrastive loss to dynamically adjust the participation of the pretraining objective. We evaluate the proposed pretraining method on three medical imaging scenarios across multiple vision-language downstream tasks, and compare it with several curriculum learning methods. Extensive experiments show that our method significantly surpasses all baselines. https://github.com/Mr-Talon/MedKCO.
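The two-level curriculum ordering can be sketched as a lexicographic sort (the scoring fields are hypothetical stand-ins for the paper's diagnostic sensitivity and representativeness measures):

```python
def curriculum_order(samples):
    """Order pretraining data at two levels: rank classes by diagnostic
    sensitivity (lower = easier, seen first), then rank samples within
    each class by representativeness (typical samples before rare ones)."""
    return sorted(samples,
                  key=lambda s: (s["class_sensitivity"],
                                 -s["representativeness"]))

data = [
    {"id": "a", "class_sensitivity": 2, "representativeness": 0.9},
    {"id": "b", "class_sensitivity": 1, "representativeness": 0.4},
    {"id": "c", "class_sensitivity": 1, "representativeness": 0.8},
]
order = [s["id"] for s in curriculum_order(data)]
# easier class first, typical samples before atypical ones within it
```

The model then consumes batches in this order, so simple, typical concepts are learned before hard, atypical ones, matching the cognitive framing in the abstract.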
[122] Training-free Motion Factorization for Compositional Video Generation
Zixuan Wang, Ziqin Zhou, Feng Chen, Duo Peng, Yixin Hu, Changsheng Li, Yinjie Lei
Main category: cs.CV
TL;DR: Motion factorization framework for compositional video generation that decomposes motion into three categories (motionlessness, rigid, non-rigid) and uses planning before generation approach with model-agnostic modules.
Details
Motivation: Current video generation approaches focus on binding semantics but neglect understanding diverse motion categories specified in prompts, limiting their ability to synthesize multiple instances with diverse appearance and motion.
Method: Two-stage framework: (1) Planning stage reasons about motion laws on motion graph to obtain frame-wise shape/position changes, organizing prompts into structured instance representations; (2) Generation stage modulates synthesis of different motion categories in disentangled manner using guidance branches conditioned on motion cues.
Result: Extensive experiments demonstrate impressive performance in motion synthesis on real-world benchmarks, with model-agnostic modules that can be incorporated into various diffusion model architectures.
Conclusion: Proposed motion factorization framework effectively addresses motion understanding in compositional video generation, achieving state-of-the-art performance while being adaptable to different model architectures.
Abstract: Compositional video generation aims to synthesize multiple instances with diverse appearance and motion, which is widely applicable in real-world scenarios. However, current approaches mainly focus on binding semantics, neglecting to understand diverse motion categories specified in prompts. In this paper, we propose a motion factorization framework that decomposes complex motion into three primary categories: motionlessness, rigid motion, and non-rigid motion. Specifically, our framework follows a planning before generation paradigm. (1) During planning, we reason about motion laws on the motion graph to obtain frame-wise changes in the shape and position of each instance. This alleviates semantic ambiguities in the user prompt by organizing it into a structured representation of instances and their interactions. (2) During generation, we modulate the synthesis of distinct motion categories in a disentangled manner. Conditioned on the motion cues, guidance branches stabilize appearance in motionless regions, preserve rigid-body geometry, and regularize local non-rigid deformations. Crucially, our two modules are model-agnostic, which can be seamlessly incorporated into various diffusion model architectures. Extensive experiments demonstrate that our framework achieves impressive performance in motion synthesis on real-world benchmarks. Our code will be released soon.
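The three-way motion taxonomy lends itself to a tiny illustration: given frame-wise position and shape tracks for an instance (the kind of output the planning stage produces), a rule-of-thumb classifier could look like the sketch below. The thresholding logic is an assumption for illustration, not the paper's procedure.

```python
import numpy as np

def motion_category(positions, shapes, tol=1e-6):
    """Classify an instance track into the paper's three categories.

    positions: (T, d) frame-wise instance positions
    shapes:    (T, d) frame-wise shape parameters
    A shape change implies non-rigid motion; a pure position change is
    rigid; neither means the instance is motionless.
    """
    moved = np.abs(np.diff(positions, axis=0)).max() > tol
    deformed = np.abs(np.diff(shapes, axis=0)).max() > tol
    if deformed:
        return "non-rigid"
    return "rigid" if moved else "motionless"
```

Each category would then select its own guidance branch: appearance stabilization, rigid-geometry preservation, or regularized local deformation.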
[123] Composed Vision-Language Retrieval for Skin Cancer Case Search via Joint Alignment of Global and Local Representations
Yuheng Wang, Yuji Lin, Dongrun Zhu, Jiayue Cai, Sunil Kalia, Harvey Lui, Chunqi Chang, Z. Jane Wang, Tim K. Lee
Main category: cs.CV
TL;DR: A transformer-based framework for medical image retrieval that combines lesion images with textual descriptors to find clinically relevant cases, with hierarchical query representations and joint global-local alignment.
Details
Motivation: Medical image retrieval needs to support diagnostic decision making, education, and quality control. In practice, queries often combine reference lesion images with textual descriptors (like dermoscopic features), requiring composed vision-language retrieval for skin cancer cases.
Method: Transformer-based framework learns hierarchical composed query representations and performs joint global-local alignment between queries and candidate images. Local alignment aggregates discriminative regions via multiple spatial attention masks, while global alignment provides holistic semantic supervision. Uses convex, domain-informed weighting that emphasizes clinically salient local evidence while preserving global consistency.
Result: Experiments on the public Derm7pt dataset demonstrate consistent improvements over state-of-the-art methods.
Conclusion: The proposed framework enables efficient access to relevant medical records and supports practical clinical deployment for skin cancer diagnosis and education.
Abstract: Medical image retrieval aims to identify clinically relevant lesion cases to support diagnostic decision making, education, and quality control. In practice, retrieval queries often combine a reference lesion image with textual descriptors such as dermoscopic features. We study composed vision-language retrieval for skin cancer, where each query consists of an image-text pair and the database contains biopsy-confirmed, multi-class disease cases. We propose a transformer-based framework that learns hierarchical composed query representations and performs joint global-local alignment between queries and candidate images. Local alignment aggregates discriminative regions via multiple spatial attention masks, while global alignment provides holistic semantic supervision. The final similarity is computed through a convex, domain-informed weighting that emphasizes clinically salient local evidence while preserving global consistency. Experiments on the public Derm7pt dataset demonstrate consistent improvements over state-of-the-art methods. The proposed framework enables efficient access to relevant medical records and supports practical clinical deployment.
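The convex global-local weighting is straightforward to illustrate. The weight value and the use of plain cosine similarity are assumptions for the sketch; the paper's local terms come from attention-masked regions rather than arbitrary vectors.

```python
import numpy as np

def composed_similarity(global_q, global_c, local_q, local_c, alpha=0.6):
    """Convex combination of local and global similarity.

    `alpha` weights the clinically salient local evidence; its value
    here is hypothetical. local_q/local_c are lists of region features
    for query and candidate (e.g., from spatial attention masks).
    """
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    s_local = np.mean([cos(q, c) for q, c in zip(local_q, local_c)])
    s_global = cos(global_q, global_c)
    return alpha * s_local + (1 - alpha) * s_global
```

Because the combination is convex, the final score stays in the same range as cosine similarity, so ranking candidates needs no extra normalization.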
[124] VIVID-Med: LLM-Supervised Structured Pretraining for Deployable Medical ViTs
Xiyao Wang, Xiaoyu Tan, Yang Dai, Yuxuan Fu, Shuo Li, Xihe Qiu
Main category: cs.CV
TL;DR: VIVID-Med uses frozen LLM as structured semantic teacher to pretrain medical vision transformers via verifiable JSON field-state pairs, achieving superior performance with efficient deployment.
Details
Motivation: Current medical vision-language pretraining methods use one-hot labels or free-form text, which fail to capture complex semantic relationships among clinical findings needed for effective medical image analysis.
Method: Leverages frozen LLM as structured semantic teacher, translates findings into verifiable JSON field-state pairs via Unified Medical Schema, uses answerability-aware masking, employs Structured Prediction Decomposition with orthogonality-regularized query groups, and discards LLM post-training for lightweight ViT-only deployment.
Result: Achieves 0.8588 macro-AUC on CheXpert linear probing (outperforming BiomedCLIP by +6.65 points with 500x less data), robust zero-shot cross-domain transfer to NIH ChestX-ray14 (0.7225 macro-AUC), and strong cross-modality generalization to CT with 0.8413 AUC on lung nodule classification and 0.9969 macro-AUC on organ classification.
Conclusion: VIVID-Med provides highly efficient, scalable alternative to resource-heavy vision-language models for clinical deployment by using LLM as structured teacher during training while maintaining lightweight ViT-only inference.
Abstract: Vision-language pretraining has driven significant progress in medical image analysis. However, current methods typically supervise visual encoders using one-hot labels or free-form text, neither of which effectively captures the complex semantic relationships among clinical findings. In this study, we introduce VIVID-Med, a novel framework that leverages a frozen large language model (LLM) as a structured semantic teacher to pretrain medical vision transformers (ViTs). VIVID-Med translates clinical findings into verifiable JSON field-state pairs via a Unified Medical Schema (UMS), utilizing answerability-aware masking to focus optimization. It then employs Structured Prediction Decomposition (SPD) to partition cross-attention into orthogonality-regularized query groups, extracting complementary visual aspects. Crucially, the LLM is discarded post-training, yielding a lightweight, deployable ViT-only backbone. We evaluated VIVID-Med across multiple settings: on CheXpert linear probing, it achieves a macro-AUC of 0.8588, outperforming BiomedCLIP by +6.65 points while using 500x less data. It also demonstrates robust zero-shot cross-domain transfer to NIH ChestX-ray14 (0.7225 macro-AUC) and strong cross-modality generalization to CT, achieving 0.8413 AUC on LIDC-IDRI lung nodule classification and 0.9969 macro-AUC on OrganAMNIST 11-organ classification. VIVID-Med offers a highly efficient, scalable alternative to deploying resource-heavy vision-language models in clinical settings.
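The orthogonality regularization on SPD query groups can be illustrated with a Gram-matrix penalty that pushes group representations apart so each group extracts a complementary visual aspect. This specific penalty form is an assumption; the paper's exact regularizer may differ.

```python
import numpy as np

def orthogonality_penalty(groups):
    """Sum of squared off-diagonal Gram entries over query-group means.

    groups: list of (n_queries_i, d) arrays, one per query group.
    Zero when the (normalized) group means are mutually orthogonal.
    """
    means = np.stack([g.mean(axis=0) for g in groups])
    means = means / np.linalg.norm(means, axis=1, keepdims=True)
    gram = means @ means.T
    off = gram - np.eye(len(groups))  # remove the unit diagonal
    return float((off ** 2).sum())
```

Minimizing this term alongside the main objective would discourage two query groups from attending to the same visual evidence.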
[125] Progressive Representation Learning for Multimodal Sentiment Analysis with Incomplete Modalities
Jindi Bao, Jianjun Qian, Mengkai Yan, Jian Yang
Main category: cs.CV
TL;DR: PRLF: Progressive Representation Learning Framework for Multimodal Sentiment Analysis with missing modalities, using adaptive reliability estimation and progressive alignment to handle incomplete multimodal data.
Details
Motivation: Real-world multimodal applications often face missing modalities due to noise, hardware failures, or privacy restrictions, creating feature misalignment between complete and incomplete modalities that can distort learned representations.
Method: Proposes PRLF with Adaptive Modality Reliability Estimator (AMRE) to dynamically quantify modality reliability using recognition confidence and Fisher information, and Progressive Interaction (ProgInteract) module to iteratively align other modalities with the dominant one.
Result: Outperforms state-of-the-art methods on CMU-MOSI, CMU-MOSEI, and SIMS datasets across both inter- and intra-modality missing scenarios, demonstrating robustness and generalization capability.
Conclusion: PRLF effectively addresses missing modality challenges in multimodal sentiment analysis by adaptively estimating reliability and progressively aligning representations, showing superior performance in real-world incomplete multimodal scenarios.
Abstract: Multimodal Sentiment Analysis (MSA) seeks to infer human emotions by integrating textual, acoustic, and visual cues. However, existing approaches often assume that all modalities are complete, whereas real-world applications frequently encounter noise, hardware failures, or privacy restrictions that result in missing modalities. There exists a significant feature misalignment between incomplete and complete modalities, and directly fusing them may even distort the well-learned representations of the intact modalities. To this end, we propose PRLF, a Progressive Representation Learning Framework designed for MSA under uncertain missing-modality conditions. PRLF introduces an Adaptive Modality Reliability Estimator (AMRE), which dynamically quantifies the reliability of each modality using recognition confidence and Fisher information to determine the dominant modality. In addition, the Progressive Interaction (ProgInteract) module iteratively aligns the other modalities with the dominant one, thereby enhancing cross-modal consistency while suppressing noise. Extensive experiments on CMU-MOSI, CMU-MOSEI, and SIMS verify that PRLF outperforms state-of-the-art methods across both inter- and intra-modality missing scenarios, demonstrating its robustness and generalization capability.
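A toy version of reliability estimation in the spirit of AMRE might combine prediction confidence with an uncertainty proxy. Here entropy stands in for the Fisher-information term, which is a deliberate simplification; the functional form is hypothetical.

```python
import numpy as np

def modality_reliability(probs, eps=1e-8):
    """Per-sample reliability score for one modality's predictions.

    probs: (n, n_classes) class probabilities from that modality's head.
    Higher confidence and lower entropy (a crude stand-in for the
    paper's Fisher-information term) yield a higher score.
    """
    probs = np.asarray(probs, dtype=float)
    conf = probs.max(axis=-1)                              # recognition confidence
    entropy = -(probs * np.log(probs + eps)).sum(axis=-1)  # uncertainty proxy
    return conf / (entropy + eps)
```

The dominant modality would then be the one with the highest mean reliability, and ProgInteract would align the others toward it.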
[126] QUSR: Quality-Aware and Uncertainty-Guided Image Super-Resolution Diffusion Model
Junjie Yin, Jiaju Li, Hanfa Xing
Main category: cs.CV
TL;DR: QUSR: A diffusion-based image super-resolution model that uses quality-aware priors from MLLMs and uncertainty-guided noise generation to handle real-world degradations.
Details
Motivation: Current diffusion-based image super-resolution methods struggle with real-world scenarios where degradations are unknown and spatially non-uniform, often resulting in lost details or visual artifacts.
Method: Proposes QUSR with two key components: 1) Uncertainty-Guided Noise Generation (UNG) module that adaptively adjusts noise injection intensity based on uncertainty regions, and 2) Quality-Aware Prior (QAP) that leverages Multimodal Large Language Models to generate reliable quality descriptions as interpretable priors.
Result: Experimental results confirm that QUSR can produce high-fidelity and high-realism images in real-world scenarios, outperforming existing methods.
Conclusion: QUSR effectively addresses real-world super-resolution challenges by combining uncertainty-guided noise adaptation with MLLM-based quality priors for improved restoration.
Abstract: Diffusion-based image super-resolution (ISR) has shown strong potential, but it still struggles in real-world scenarios where degradations are unknown and spatially non-uniform, often resulting in lost details or visual artifacts. To address this challenge, we propose a novel super-resolution diffusion model, QUSR, which integrates a Quality-Aware Prior (QAP) with an Uncertainty-Guided Noise Generation (UNG) module. The UNG module adaptively adjusts the noise injection intensity, applying stronger perturbations to high-uncertainty regions (e.g., edges and textures) to reconstruct complex details, while minimizing noise in low-uncertainty regions (e.g., flat areas) to preserve original information. Concurrently, the QAP leverages an advanced Multimodal Large Language Model (MLLM) to generate reliable quality descriptions, providing an effective and interpretable quality prior for the restoration process. Experimental results confirm that QUSR can produce high-fidelity and high-realism images in real-world scenarios. The source code is available at https://github.com/oTvTog/QUSR.
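The UNG idea of modulating noise strength by uncertainty reduces, in its simplest form, to per-pixel scaling of Gaussian noise by an uncertainty map. The sigma range below is an illustrative assumption, and a real diffusion model would apply this inside its forward process rather than to the raw image.

```python
import numpy as np

def uncertainty_guided_noise(img, uncertainty, sigma_min=0.01, sigma_max=0.2, seed=0):
    """Inject per-pixel Gaussian noise scaled by an uncertainty map.

    uncertainty: array in [0, 1], high at edges/textures, low in flat
    regions. Strong perturbation where uncertainty is high encourages
    re-synthesis of detail; weak perturbation preserves flat areas.
    """
    rng = np.random.default_rng(seed)
    sigma = sigma_min + (sigma_max - sigma_min) * uncertainty
    return img + sigma * rng.normal(size=img.shape)
```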
[127] Transformer-Based Multi-Region Segmentation and Radiomic Analysis of HR-pQCT Imaging
Mohseu Rashid Subah, Mohammed Abdul Gani Zilani, Thomas L. Nickolas, Matthew R. Allen, Stuart J. Warden, Rachel K. Surowiec
Main category: cs.CV
TL;DR: Automated transformer-based segmentation of HR-pQCT images enables osteoporosis classification using radiomics features from soft tissues, outperforming bone-only approaches.
Details
Motivation: Current osteoporosis diagnosis using DXA overlooks bone microarchitecture and soft tissues. HR-pQCT provides 3D imaging but existing analysis focuses only on mineralized bone, leaving valuable image data underutilized.
Method: Used SegFormer transformer architecture for fully automated multi-region segmentation of HR-pQCT images (cortical/trabecular bone, tibia/fibula, soft tissues). Extracted 939 radiomic features per region, applied dimensionality reduction, and trained six ML classifiers on 20,496 images from 122 scans.
Result: SegFormer achieved 95.36% mean F1 score for segmentation. Myotendinous tissue features yielded best performance: 80.08% accuracy and 0.85 AUROC at image level, outperforming bone-based models. At patient level, soft tissue radiomics improved AUROC from 0.792 to 0.875 compared to standard parameters.
Conclusion: Automated multi-region HR-pQCT segmentation enables extraction of clinically informative signals beyond bone alone, highlighting the importance of integrated tissue assessment for osteoporosis detection.
Abstract: Osteoporosis is a skeletal disease typically diagnosed using dual-energy X-ray absorptiometry (DXA), which quantifies areal bone mineral density but overlooks bone microarchitecture and surrounding soft tissues. High-resolution peripheral quantitative computed tomography (HR-pQCT) enables three-dimensional microstructural imaging with minimal radiation. However, current analysis pipelines largely focus on mineralized bone compartments, leaving much of the acquired image data underutilized. We introduce a fully automated framework for binary osteoporosis classification using radiomics features extracted from anatomically segmented HR-pQCT images. To our knowledge, this work is the first to leverage a transformer-based segmentation architecture, i.e., the SegFormer, for fully automated multi-region HR-pQCT analysis. The SegFormer model simultaneously delineated the cortical and trabecular bone of the tibia and fibula along with surrounding soft tissues and achieved a mean F1 score of 95.36%. Soft tissues were further subdivided into skin, myotendinous, and adipose regions through post-processing. From each region, 939 radiomic features were extracted and dimensionally reduced to train six machine learning classifiers on an independent dataset comprising 20,496 images from 122 HR-pQCT scans. The best image level performance was achieved using myotendinous tissue features, yielding an accuracy of 80.08% and an area under the receiver operating characteristic curve (AUROC) of 0.85, outperforming bone-based models. At the patient level, replacing standard biological, DXA, and HR-pQCT parameters with soft tissue radiomics improved AUROC from 0.792 to 0.875. These findings demonstrate that automated, multi-region HR-pQCT segmentation enables the extraction of clinically informative signals beyond bone alone, highlighting the importance of integrated tissue assessment for osteoporosis detection.
[128] Rotation Equivariant Mamba for Vision Tasks
Zhongchen Zhao, Qi Xie, Keyu Huang, Lei Zhang, Deyu Meng, Zongben Xu
Main category: cs.CV
TL;DR: EQ-VMamba introduces rotation equivariance to visual Mamba architectures, addressing their sensitivity to image rotations through novel cross-scan strategy and group Mamba blocks.
Details
Motivation: Current Mamba-based vision architectures lack rotation equivariance, making them sensitive to image rotations and limiting their robustness and cross-task generalization. Rotation symmetry is a fundamental geometric prior in images that should be incorporated.
Method: Proposes EQ-VMamba with rotation equivariant cross-scan strategy and group Mamba blocks. Provides theoretical analysis of intrinsic equivariance error to ensure end-to-end rotation equivariance throughout the network.
Result: Achieves superior or competitive performance across high-level classification, mid-level segmentation, and low-level super-resolution tasks compared to non-equivariant baselines, with ~50% fewer parameters.
Conclusion: Embedding rotation equivariance enhances robustness against rotation transformations, improves overall performance, and provides significant parameter efficiency for visual Mamba models.
Abstract: Rotation equivariance constitutes one of the most general and crucial structural priors for visual data, yet it remains notably absent from current Mamba-based vision architectures. Despite the success of Mamba in natural language processing and its growing adoption in computer vision, existing visual Mamba models fail to account for rotational symmetry in their design. This omission renders them inherently sensitive to image rotations, thereby constraining their robustness and cross-task generalization. To address this limitation, we propose to incorporate rotation symmetry, a universal and fundamental geometric prior in images, into Mamba-based architectures. Specifically, we introduce EQ-VMamba, the first rotation equivariant visual Mamba architecture for vision tasks. The core components of EQ-VMamba include a carefully designed rotation equivariant cross-scan strategy and group Mamba blocks. Moreover, we provide a rigorous theoretical analysis of the intrinsic equivariance error, demonstrating that the proposed architecture enforces end-to-end rotation equivariance throughout the network. Extensive experiments across multiple benchmarks - including high-level image classification task, mid-level semantic segmentation task, and low-level image super-resolution task - demonstrate that EQ-VMamba achieves superior or competitive performance compared to non-equivariant baselines, while requiring approximately 50% fewer parameters. These results indicate that embedding rotation equivariance not only effectively bolsters the robustness of visual Mamba models against rotation transformations, but also enhances overall performance with significantly improved parameter efficiency. Code is available at https://github.com/zhongchenzhao/EQ-VMamba.
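Rotation equivariance under 90-degree rotations is easy to test empirically: a map f is equivariant when f(rot(x)) equals rot(f(x)). The small harness below (not from the paper) makes the property, and the "equivariance error" the paper analyzes, concrete.

```python
import numpy as np

def equivariance_error(f, x):
    """Empirical 90-degree rotation-equivariance error of f at x:
    max over k of max|f(rot_k(x)) - rot_k(f(x))|."""
    errs = []
    for k in range(4):
        errs.append(np.abs(f(np.rot90(x, k)) - np.rot90(f(x), k)).max())
    return max(errs)
```

Pointwise operations are exactly equivariant, while a directionally biased operation (like a horizontal finite difference) is not, which is the kind of asymmetry a naive row-major cross-scan introduces.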
[129] Agentic AI as a Network Control-Plane Intelligence Layer for Federated Learning over 6G
Loc X. Nguyen, Ji Su Yoon, Huy Q. Le, Yu Qiao, Avi Deb Raha, Eui-Nam Huh, Nguyen H. Tran, Choong Seon Hong
Main category: cs.CV
TL;DR: Agentic AI system for managing federated learning over 6G networks, using specialized agents to optimize client selection, resource allocation, and training while considering network conditions.
Details
Motivation: Efficient on-device learning requires training on diverse, distributed data while meeting strict latency, bandwidth, and reliability constraints in wireless systems, particularly 6G networks.
Method: Proposes an Agentic AI control layer with specialized agents (retrieval, planning, coding, evaluation) that use monitoring tools and optimization methods to manage FL tasks including client selection, incentive structuring, scheduling, resource allocation, adaptive local training, and code generation.
Result: A case study demonstrates the effectiveness of the Agentic AI system's tool use in achieving high performance for federated learning over 6G networks.
Conclusion: Agentic AI provides a comprehensive solution for managing federated learning as both a learning and network management challenge, enabling efficient on-device learning in wireless systems through closed-loop evaluation and adaptive decision-making.
Abstract: The shift toward user-customized on-device learning places new demands on wireless systems: models must be trained on diverse, distributed data while meeting strict latency, bandwidth, and reliability constraints. To address this, we propose an Agentic AI as the control layer for managing federated learning (FL) over 6G networks, which translates high-level task goals into actions that are aware of network conditions. Rather than simply viewing FL as a learning challenge, our system sees it as a combined task of learning and network management. A set of specialized agents focused on retrieval, planning, coding, and evaluation utilizes monitoring tools and optimization methods to handle client selection, incentive structuring, scheduling, resource allocation, adaptive local training, and code generation. The use of closed-loop evaluation and memory allows the system to consistently refine its decisions, taking into account varying signal-to-noise ratios, bandwidth conditions, and device capabilities. Finally, our case study demonstrates the effectiveness of the Agentic AI system's use of tools for achieving high performance.
[130] RTFDNet: Fusion-Decoupling for Robust RGB-T Segmentation
Kunyu Tan, Mingjian Liang
Main category: cs.CV
TL;DR: RTFDNet is a three-branch encoder-decoder network for robust RGB-Thermal semantic segmentation that unifies fusion and decoupling to handle partial modality missing, using synergistic feature fusion and cross-modal/region decouple regularization.
Details
Motivation: Traditional RGB-Thermal segmentation approaches overemphasize modality balance, leading to limited robustness and severe performance degradation when sensor signals are partially missing. Existing methods decouple modality fusion and adaptation, requiring multi-stage training with frozen models or teacher-student frameworks.
Method: Three-branch encoder-decoder architecture with: 1) Synergistic Feature Fusion (SFF) using channel-wise gated exchange and lightweight spatial attention; 2) Cross-Modal Decouple Regularization (CMDR) isolating modality-specific components from fused representation; 3) Region Decouple Regularization (RDR) enforcing class-selective prediction consistency in confident regions while blocking gradients to fusion branch.
Result: Extensive experiments demonstrate effectiveness, showing consistent performance across varying modality conditions. The method enables efficient standalone inference at test time without degrading fused stream performance.
Conclusion: RTFDNet successfully unifies fusion and decoupling for robust RGB-T segmentation, handling partial modality missing while maintaining performance. The feedback loop strengthens unimodal paths without degrading fused stream.
Abstract: RGB-Thermal (RGB-T) semantic segmentation is essential for robotic systems operating in low-light or dark environments. However, traditional approaches often overemphasize modality balance, resulting in limited robustness and severe performance degradation when sensor signals are partially missing. Recent advances such as cross-modal knowledge distillation and modality-adaptive fine-tuning attempt to enhance cross-modal interaction, but they typically decouple modality fusion and modality adaptation, requiring multi-stage training with frozen models or teacher-student frameworks. We present RTFDNet, a three-branch encoder-decoder that unifies fusion and decoupling for robust RGB-T segmentation. Synergistic Feature Fusion (SFF) performs channel-wise gated exchange and lightweight spatial attention to inject complementary cues. Cross-Modal Decouple Regularization (CMDR) isolates modality-specific components from the fused representation and supervises unimodal decoders via stop-gradient targets. Region Decouple Regularization (RDR) enforces class-selective prediction consistency in confident regions while blocking gradients to the fusion branch. This feedback loop strengthens unimodal paths without degrading the fused stream, enabling efficient standalone inference at test time. Extensive experiments demonstrate the effectiveness of RTFDNet, showing consistent performance across varying modality conditions. Our source code is publicly available at https://github.com/curapima/RTFDNet to facilitate further research.
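The channel-wise gated exchange in SFF can be sketched as a convex per-channel mix of the two streams; the symmetric form and the externally supplied gate below are assumptions (in the network the gate would be predicted from the features).

```python
import numpy as np

def gated_channel_exchange(rgb, thermal, gate):
    """Mix RGB and thermal features channel-wise with a gate in [0, 1].

    rgb, thermal: (C, H, W) feature maps; gate: (C,) per-channel weight.
    gate=0 leaves both streams untouched; gate=1 swaps them, injecting
    the complementary modality's cue into each branch.
    """
    gate = np.asarray(gate).reshape(-1, 1, 1)  # broadcast over H, W
    fused_rgb = (1 - gate) * rgb + gate * thermal
    fused_thermal = (1 - gate) * thermal + gate * rgb
    return fused_rgb, fused_thermal
```

A learned gate lets the fusion lean on whichever modality is informative per channel, which is what makes partial modality dropout survivable.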
[131] RubiCap: Rubric-Guided Reinforcement Learning for Dense Image Captioning
Tzu-Heng Huang, Sirajul Salekin, Javier Movellan, Frederic Sala, Manjot Bilkhu
Main category: cs.CV
TL;DR: RubiCap: RL framework for dense image captioning using LLM-written rubrics to generate fine-grained, sample-specific rewards, achieving state-of-the-art performance across benchmarks.
Details
Motivation: Dense image captioning is essential for vision-language alignment but expert annotations are expensive. Synthetic captioning via VLMs has limited diversity and generalization. RL could help but lacks verifiable checkers in open-ended captioning tasks.
Method: RubiCap uses LLM-written rubrics to derive fine-grained rewards. It assembles diverse caption committees, uses LLM rubric writers to extract consensus strengths and diagnose deficiencies, converts insights into explicit evaluation criteria, and employs LLM judges for structured multi-faceted evaluations.
Result: Achieves highest win rates on CapArena, outperforming supervised distillation, prior RL methods, human-expert annotations, and GPT-4V-augmented outputs. On CaptionQA, shows superior word efficiency: 7B model matches Qwen2.5-VL-32B-Instruct, 3B model surpasses its 7B counterpart. RubiCap-3B as captioner produces stronger pretrained VLMs than those trained on proprietary model captions.
Conclusion: RubiCap successfully addresses RL limitations in open-ended captioning by leveraging LLM-written rubrics for fine-grained reward signals, enabling superior caption quality and efficiency while enhancing vision-language pretraining.
Abstract: Dense image captioning is critical for cross-modal alignment in vision-language pretraining and text-to-image generation, but scaling expert-quality annotations is prohibitively expensive. While synthetic captioning via strong vision-language models (VLMs) is a practical alternative, supervised distillation often yields limited output diversity and weak generalization. Reinforcement learning (RL) could overcome these limitations, but its successes have so far been concentrated in verifiable domains that rely on deterministic checkers – a luxury not available in open-ended captioning. We address this bottleneck with RubiCap, a novel RL framework that derives fine-grained, sample-specific reward signals from LLM-written rubrics. RubiCap first assembles a diverse committee of candidate captions, then employs an LLM rubric writer to extract consensus strengths and diagnose deficiencies in the current policy. These insights are converted into explicit evaluation criteria, enabling an LLM judge to decompose holistic quality assessment and replace coarse scalar rewards with structured, multi-faceted evaluations. Across extensive benchmarks, RubiCap achieves the highest win rates on CapArena, outperforming supervised distillation, prior RL methods, human-expert annotations, and GPT-4V-augmented outputs. On CaptionQA, it demonstrates superior word efficiency: our 7B model matches Qwen2.5-VL-32B-Instruct, and our 3B model surpasses its 7B counterpart. Remarkably, using the compact RubiCap-3B as a captioner produces stronger pretrained VLMs than those trained on captions from proprietary models.
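Structurally, a rubric-derived reward is a weighted aggregate of per-criterion judge scores rather than one coarse scalar. The criterion names and weights below are hypothetical; in RubiCap the criteria themselves are written by the LLM rubric writer per sample.

```python
def rubric_reward(scores, weights=None):
    """Aggregate per-criterion judge scores (each in [0, 1]) into a
    scalar reward via a weighted mean.

    scores:  dict criterion -> score from the LLM judge
    weights: dict criterion -> importance (defaults to uniform)
    """
    if weights is None:
        weights = {k: 1.0 for k in scores}
    total = sum(weights[k] * scores[k] for k in scores)
    return total / sum(weights.values())
```

Decomposing the reward this way gives the policy gradient a per-facet signal (e.g., coverage vs. faithfulness) instead of a single holistic grade.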
[132] Progressive Split Mamba: Effective State Space Modelling for Image Restoration
Mohammed Hassanin, Nour Moustafa, Weijian Deng, Ibrahim Radwan
Main category: cs.CV
TL;DR: PS-Mamba: A topology-aware hierarchical state-space framework for image restoration that maintains spatial locality while enabling efficient global propagation through progressive splitting and cross-scale shortcuts.
Details
Motivation: Image restoration requires balancing local structure preservation with long-range spatial coherence. Convolutional networks have limited receptive fields, Transformers have quadratic complexity, and naive 2D extensions of Mamba disrupt spatial topology and suffer from long-range decay, limiting high-fidelity restoration.
Method: Progressive Split-Mamba (PS-Mamba) uses geometry-consistent partitioning (halves, quadrants, octants) instead of flattening entire feature maps, maintaining neighborhood integrity. It employs a progressive split hierarchy for multi-scale modeling with linear complexity, and introduces symmetric cross-scale shortcut pathways to counteract long-range decay by transmitting low-frequency global context across hierarchical levels.
Result: Extensive experiments on super-resolution, denoising, and JPEG artifact reduction show consistent improvements over recent Mamba-based and attention-based models with clear margins.
Conclusion: PS-Mamba effectively reconciles locality preservation with efficient global propagation for image restoration, addressing the limitations of naive 2D Mamba extensions while maintaining linear complexity.
Abstract: Image restoration requires simultaneously preserving fine-grained local structures and maintaining long-range spatial coherence. While convolutional networks struggle with limited receptive fields, and Transformers incur quadratic complexity for global attention, recent State Space Models (SSMs), such as Mamba, provide an appealing linear-time alternative for long-range dependency modelling. However, naively extending Mamba to 2D images exposes two intrinsic shortcomings. First, flattening 2D feature maps into 1D sequences disrupts spatial topology, leading to locality distortion that hampers precise structural recovery. Second, the stability-driven recurrent dynamics of SSMs induce long-range decay, progressively attenuating information across distant spatial positions and weakening global consistency. Together, these effects limit the effectiveness of state-space modelling in high-fidelity restoration. We propose Progressive Split-Mamba (PS-Mamba), a topology-aware hierarchical state-space framework designed to reconcile locality preservation with efficient global propagation. Instead of sequentially flattening entire feature maps, PS-Mamba performs geometry-consistent partitioning, maintaining neighbourhood integrity prior to state-space processing. A progressive split hierarchy (halves, quadrants, octants) enables structured multi-scale modelling while retaining linear complexity. To counteract long-range decay, we introduce symmetric cross-scale shortcut pathways that directly transmit low-frequency global context across hierarchical levels, stabilising information flow over large spatial extents. Extensive experiments on super-resolution, denoising, and JPEG artifact reduction show consistent improvements over recent Mamba-based and attention-based models with a clear margin.
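The halves/quadrants/octants hierarchy amounts to recursive binary tiling of the feature map along alternating axes, which keeps every tile spatially contiguous before state-space scanning. The alternating-axis order below is an assumption about the split schedule.

```python
import numpy as np

def progressive_split(x, level):
    """Split a 2D feature map into 2**level contiguous tiles.

    level=1 gives halves, level=2 quadrants, level=3 octants; cuts
    alternate between vertical and horizontal so neighbourhoods stay
    intact, unlike flattening the whole map into one 1D sequence.
    """
    tiles = [x]
    for step in range(level):
        axis = step % 2  # alternate between row and column cuts
        tiles = [half for t in tiles for half in np.array_split(t, 2, axis=axis)]
    return tiles
```

Each tile can then be scanned by its own Mamba block, with the cross-scale shortcuts carrying global context between levels of this hierarchy.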
[133] Point Cloud as a Foreign Language for Multi-modal Large Language Model
Sneha Paul, Zachary Patterson, Nizar Bouguila
Main category: cs.CV
TL;DR: SAGE is the first end-to-end 3D multimodal LLM that directly processes raw point clouds without pre-trained 3D encoders, using a lightweight tokenizer to treat 3D data as a foreign language and preference optimization for complex 3D reasoning.
Details
Motivation: Current 3D MLLMs rely on pre-trained 3D encoders which suffer from semantic misalignment between geometric and linguistic spaces, resolution sensitivity, and computational overhead. There's a need for end-to-end approaches that directly process raw 3D data.
Method: Introduces a lightweight 3D tokenizer combining geometric sampling and neighborhood aggregation with vector quantization to convert point clouds into discrete tokens. Uses preference optimization training with semantic alignment-based reward for open-ended 3D question answering.
Result: Outperforms existing encoder-based methods across diverse 3D understanding benchmarks while offering significant advantages in computational efficiency, generalization across LLM backbones, and robustness to input resolution variations.
Conclusion: SAGE demonstrates that end-to-end 3D MLLMs can effectively process raw point clouds without pre-trained encoders, achieving better performance and efficiency while treating 3D data as a foreign language extension to LLMs.
Abstract: Multi-modal large language models (MLLMs) have shown remarkable progress in integrating visual and linguistic understanding. Recent efforts have extended these capabilities to 3D understanding through encoder-based architectures that rely on pre-trained 3D encoders to extract geometric features. However, such approaches suffer from semantic misalignment between geometric and linguistic spaces, resolution sensitivity, and substantial computational overhead. In this work, we present SAGE, the first end-to-end 3D MLLM that directly processes raw point clouds without relying on a pre-trained 3D encoder. Our approach introduces a lightweight 3D tokenizer that combines geometric sampling and neighbourhood aggregation with vector quantization to convert point clouds into discrete tokens, treating 3D data as a foreign language that naturally extends the LLM's vocabulary. Furthermore, to enhance the model's reasoning capability on complex 3D tasks, we propose a preference optimization training strategy with a semantic alignment-based reward, specifically designed for open-ended 3D question answering where responses are descriptive. Extensive experiments across diverse 3D understanding benchmarks demonstrate that our end-to-end approach outperforms existing encoder-based methods while offering significant advantages in computational efficiency, generalization across LLM backbones, and robustness to input resolution variations. Code is available at: github.com/snehaputul/SAGE3D.
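The tokenizer idea described above (geometric sampling, neighbourhood aggregation, vector quantization) can be sketched in NumPy; the farthest-point-sampling variant, the mean-pooling aggregation, and all names here are assumptions, not SAGE's actual implementation:

```python
import numpy as np

def farthest_point_sample(pts, m):
    # greedy FPS: repeatedly pick the point farthest from the chosen set
    idx = [0]
    d = np.linalg.norm(pts - pts[0], axis=1)
    for _ in range(m - 1):
        idx.append(int(d.argmax()))
        d = np.minimum(d, np.linalg.norm(pts - pts[idx[-1]], axis=1))
    return np.array(idx)

def tokenize(pts, codebook, m=4, k=8):
    centers = pts[farthest_point_sample(pts, m)]
    feats = []
    for c in centers:
        nn = np.argsort(np.linalg.norm(pts - c, axis=1))[:k]
        feats.append(pts[nn].mean(axis=0))       # neighbourhood aggregation
    feats = np.stack(feats)
    # vector quantization: snap each local feature to its nearest code id
    dists = np.linalg.norm(feats[:, None] - codebook[None], axis=-1)
    return dists.argmin(axis=1)                  # discrete token ids

rng = np.random.default_rng(0)
pts = rng.normal(size=(64, 3))       # toy point cloud
codebook = rng.normal(size=(16, 3))  # toy VQ codebook
tokens = tokenize(pts, codebook)
```

The resulting discrete ids are what would extend an LLM's vocabulary, letting the point cloud be consumed as a "foreign language" token sequence.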
[134] MM-Zero: Self-Evolving Multi-Model Vision Language Models From Zero Data
Zongxia Li, Hongyang Du, Chengsong Huang, Xiyang Wu, Lantao Yu, Yicheng He, Jing Xie, Xiaomin Wu, Zhichao Liu, Jiarui Zhang, Fuxiao Liu
Main category: cs.CV
TL;DR: MM-Zero is a reinforcement learning framework for zero-data self-evolution of vision-language models, using multi-role specialization (Proposer, Coder, Solver) trained with Group Relative Policy Optimization.
Details
Motivation: While LLMs can self-evolve from scratch with minimal data, VLMs require seed visual data. The paper aims to achieve zero-data self-evolution for VLM reasoning by moving beyond traditional dual-role setups.
Method: Introduces MM-Zero with three specialized roles: Proposer (generates visual concepts/questions), Coder (translates concepts to executable code for image rendering), and Solver (multimodal reasoning). All roles initialized from the same base model and trained using Group Relative Policy Optimization with execution feedback, visual verification, and difficulty balancing rewards.
Result: MM-Zero improves VLM reasoning performance across a wide range of multimodal benchmarks, demonstrating effective zero-data self-evolution capabilities.
Conclusion: MM-Zero establishes a scalable path toward self-evolving multi-model systems for multimodal models, extending self-improvement beyond conventional two-model paradigms.
Abstract: Self-evolution has emerged as a key paradigm for improving foundational models such as Large Language Models (LLMs) and Vision Language Models (VLMs) with minimal human intervention. While recent approaches have demonstrated that LLM agents can self-evolve from scratch with little to no data, VLMs introduce an additional visual modality that typically requires at least some seed data, such as images, to bootstrap the self-evolution process. In this work, we present Multi-model Multimodal Zero (MM-Zero), the first RL-based framework to achieve zero-data self-evolution for VLM reasoning. Moving beyond prior dual-role (Proposer and Solver) setups, MM-Zero introduces a multi-role self-evolving training framework comprising three specialized roles: a Proposer that generates abstract visual concepts and formulates questions; a Coder that translates these concepts into executable code (e.g., Python, SVG) to render visual images; and a Solver that performs multimodal reasoning over the generated visual content. All three roles are initialized from the same base model and trained using Group Relative Policy Optimization (GRPO), with carefully designed reward mechanisms that integrate execution feedback, visual verification, and difficulty balancing. Our experiments show that MM-Zero improves VLM reasoning performance across a wide range of multimodal benchmarks. MM-Zero establishes a scalable path toward self-evolving multi-model systems for multimodal models, extending the frontier of self-improvement beyond the conventional two-model paradigm.
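The GRPO training mentioned above replaces a learned value critic with group-relative reward normalisation; a minimal sketch of that advantage computation (the group size and reward values are toy assumptions):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: each rollout's reward is normalised by the
    mean and standard deviation of its own group, so rollouts scoring above
    the group mean get positive advantage and no value critic is needed."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# one group of 4 Solver rollouts scored by a verifiable reward (1 = correct)
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

In MM-Zero's setting, the scalar rewards would come from the execution-feedback, visual-verification, and difficulty-balancing signals described in the abstract.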
[135] Geometry-Aware Metric Learning for Cross-Lingual Few-Shot Sign Language Recognition on Static Hand Keypoints
Chayanin Chamachot, Kanokphan Lertniponphan
Main category: cs.CV
TL;DR: Geometry-aware metric learning using SO(3)-invariant hand joint angles improves cross-lingual few-shot sign language recognition by eliminating domain shift from camera variations.
Details
Motivation: Most sign languages lack sufficient annotated data for training recognition systems. Cross-lingual few-shot transfer offers a scalable solution, but conventional coordinate-based keypoint representations suffer from domain shift due to camera viewpoint, hand scale, and recording condition differences, which is particularly problematic in few-shot settings.
Method: Proposes a geometry-aware metric-learning framework using a compact 20-dimensional inter-joint angle descriptor derived from MediaPipe static hand keypoints. These angles are invariant to SO(3) rotation, translation, and isotropic scaling, eliminating major sources of cross-dataset shift. Uses a lightweight MLP encoder with about 10^5 parameters.
Result: Evaluated on four fingerspelling alphabets (ASL, LIBRAS, Arabic Sign Language, Thai Sign Language). Angle features improve over normalized-coordinate baselines by up to 25 percentage points within-domain. Enables frozen cross-lingual transfer that frequently exceeds within-domain accuracy.
Conclusion: Invariant hand-geometry descriptors provide a portable and effective foundation for cross-lingual few-shot sign language recognition in low-resource settings, demonstrating robustness to domain shift.
Abstract: Sign language recognition (SLR) systems typically require large labeled corpora for each language, yet the majority of the world’s 300+ sign languages lack sufficient annotated data. Cross-lingual few-shot transfer, pretraining on a data-rich source language and adapting with only a handful of target-language examples, offers a scalable alternative, but conventional coordinate-based keypoint representations are susceptible to domain shift arising from differences in camera viewpoint, hand scale, and recording conditions. This shift is particularly detrimental in the few-shot regime, where class prototypes estimated from only K examples are highly sensitive to extrinsic variance. We propose a geometry-aware metric-learning framework centered on a compact 20-dimensional inter-joint angle descriptor derived from MediaPipe static hand keypoints. These angles are invariant to SO(3) rotation, translation, and isotropic scaling, eliminating the dominant sources of cross-dataset shift and yielding tighter, more stable class prototypes. Evaluated on four fingerspelling alphabets spanning typologically diverse sign languages, ASL, LIBRAS, Arabic Sign Language, and Thai Sign Language, the proposed angle features improve over normalized-coordinate baselines by up to 25 percentage points within-domain and enable frozen cross-lingual transfer that frequently exceeds within-domain accuracy, using a lightweight MLP encoder with about 10^5 parameters. These findings demonstrate that invariant hand-geometry descriptors provide a portable and effective foundation for cross-lingual few-shot SLR in low-resource settings.
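The invariance claim above is easy to verify for a single inter-joint angle: the angle at a joint depends only on the two bone vectors meeting there, so rotating, translating, or uniformly scaling the keypoints leaves it unchanged. A minimal sketch (toy keypoints, not actual MediaPipe output):

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle at joint b formed by the bones b->a and b->c, in radians.
    Depends only on relative geometry, hence invariant to SO(3) rotation,
    translation, and isotropic scaling of the keypoints."""
    u, v = a - b, c - b
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(cos, -1.0, 1.0))

# toy "hand": three keypoints forming a right angle at b
a, b, c = np.array([1.0, 0, 0]), np.zeros(3), np.array([0.0, 1, 0])
theta = joint_angle(a, b, c)

# rotate by 90 degrees about z and scale by 3: the angle is unchanged
R = np.array([[0.0, -1, 0], [1, 0, 0], [0, 0, 1]])
theta2 = joint_angle(3 * R @ a, 3 * R @ b, 3 * R @ c)
```

The paper's 20-dimensional descriptor would stack such angles over the finger joints of the 21 MediaPipe hand keypoints before feeding the MLP encoder.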
[136] TubeMLLM: A Foundation Model for Topology Knowledge Exploration in Vessel-like Anatomy
Yaoyu Liu, Minghui Zhang, Xin You, Hanxiao Zhang, Yun Gu
Main category: cs.CV
TL;DR: TubeMLLM is a multimodal foundation model that integrates topological priors with visual representations for medical vessel-like anatomy, achieving state-of-the-art performance on topology-aware tasks with strong zero-shot generalization across modalities.
Details
Motivation: Medical vessel-like anatomy modeling is challenging due to intricate topology and dataset shifts, leading to topological inconsistencies in task-specific models. The authors aim to leverage multimodal large language models' zero-shot generalization capabilities to create a unified foundation model for medical vessel analysis.
Method: Proposes TubeMLLM, which integrates topological priors through explicit natural language prompting and aligns them with visual representations using a shared-attention architecture. Also introduces TubeMData benchmark with topology-centric tasks and an adaptive loss weighting strategy to emphasize topology-critical regions.
Result: Achieves state-of-the-art out-of-distribution performance on 15 diverse datasets, reducing global topological discrepancies significantly (β₀ error from 37.42 to 8.58 on color fundus photography). Shows exceptional zero-shot cross-modality transfer (67.50% Dice score on X-ray angiography with β₀ error of 1.21). Maintains robustness against degradations and achieves 97.38% accuracy in topology-aware understanding tasks.
Conclusion: TubeMLLM successfully demonstrates that integrating topological priors with multimodal large language models enables superior topology-aware perception and controllable generation for medical vessel-like anatomy, with strong generalization capabilities across modalities and conditions.
Abstract: Modeling medical vessel-like anatomy is challenging due to its intricate topology and sensitivity to dataset shifts. Consequently, task-specific models often suffer from topological inconsistencies, including artificial disconnections and spurious merges. Motivated by the promise of multimodal large language models (MLLMs) for zero-shot generalization, we propose TubeMLLM, a unified foundation model that couples structured understanding with controllable generation for medical vessel-like anatomy. By integrating topological priors through explicit natural language prompting and aligning them with visual representations in a shared-attention architecture, TubeMLLM significantly enhances topology-aware perception. Furthermore, we construct TubeMData, a pioneering multimodal benchmark comprising comprehensive topology-centric tasks, and introduce an adaptive loss weighting strategy to emphasize topology-critical regions during training. Extensive experiments on fifteen diverse datasets demonstrate our superiority. Quantitatively, TubeMLLM achieves state-of-the-art out-of-distribution performance, substantially reducing global topological discrepancies on color fundus photography (decreasing the β₀ error from 37.42 to 8.58 compared to baselines). Notably, TubeMLLM exhibits exceptional zero-shot cross-modality transfer ability on unseen X-ray angiography, achieving a Dice score of 67.50% while significantly reducing the β₀ error to 1.21. TubeMLLM also maintains robustness against degradations such as blur, noise, and low resolution. Furthermore, in topology-aware understanding tasks, the model achieves 97.38% accuracy in evaluating mask topological quality, significantly outperforming standard vision-language baselines.
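The β₀ error reported above counts mismatches in the 0th Betti number, i.e. the number of connected components of the segmented vessels, so an "artificial disconnection" in a prediction shows up directly. A small flood-fill sketch (4-connectivity and the toy masks are assumptions):

```python
import numpy as np
from collections import deque

def betti0(mask):
    """Number of 4-connected foreground components (the 0th Betti number)."""
    mask = np.asarray(mask, bool)
    seen = np.zeros_like(mask)
    n = 0
    for i, j in zip(*np.nonzero(mask)):
        if seen[i, j]:
            continue
        n += 1                       # found a new component; flood-fill it
        q = deque([(i, j)])
        seen[i, j] = True
        while q:
            y, x = q.popleft()
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                if (0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                        and mask[ny, nx] and not seen[ny, nx]):
                    seen[ny, nx] = True
                    q.append((ny, nx))
    return n

gt = np.array([[1, 1, 1, 1, 1]])          # one connected vessel
pred = np.array([[1, 1, 0, 1, 1]])        # an artificial disconnection
b0_error = abs(betti0(pred) - betti0(gt))  # topological discrepancy
```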
[137] Distributed Convolutional Neural Networks for Object Recognition
Liang Sun
Main category: cs.CV
TL;DR: Proposes a novel loss function for training distributed convolutional neural networks (DisCNN) to recognize only specific positive classes by mapping positive samples to compact sets and negative samples to origin, enabling lightweight architecture and good generalization.
Details
Motivation: To create a lightweight neural network that can recognize only specific positive classes while ignoring negative classes, enabling efficient feature extraction and disentanglement of positive-class features from negative ones.
Method: Uses a novel loss function that maps positive samples to a compact set in high-dimensional space and negative samples to the origin, training a distributed convolutional neural network (DisCNN) to extract only positive-class features.
Result: The model achieves excellent generalization on test data, remains effective for unseen classes, and enables straightforward object detection of positive samples in complex backgrounds due to its lightweight architecture.
Conclusion: The proposed DisCNN with novel loss function successfully disentangles positive-class features, creates a lightweight model with good generalization, and facilitates object detection in complex scenes.
Abstract: This paper proposes a novel loss function for training a distributed convolutional neural network (DisCNN) to recognize only a specific positive class. By mapping positive samples to a compact set in high-dimensional space and negative samples to the origin, the DisCNN extracts only the features of the positive class. An experiment is provided to demonstrate this. Thus, the features of the positive class are disentangled from those of the negative classes. The model has a lightweight architecture because only a few positive-class features need to be extracted. The model demonstrates excellent generalization on the test data and remains effective even for unseen classes. Finally, using DisCNN, object detection of positive samples embedded in a large and complex background is straightforward.
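One way the "compact set vs. origin" objective could look is a norm-based penalty: positive embeddings are pushed onto a shell of fixed radius, negatives toward the origin. This is a hedged sketch of the idea only; the radius, the squared-norm penalty, and the function name are all assumptions, not the paper's actual loss:

```python
import numpy as np

def discnn_loss(z, y, radius=1.0):
    """Toy loss: positive embeddings (y=1) are pulled onto a compact shell
    of the given radius; negative embeddings (y=0) are pulled to the origin."""
    norms = np.linalg.norm(z, axis=1)
    pos = ((norms - radius) ** 2)[y == 1].sum()  # positives: stay near shell
    neg = (norms ** 2)[y == 0].sum()             # negatives: collapse to 0
    return pos + neg

z = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 3.0]])  # toy embeddings
y = np.array([1, 0, 1])
loss = discnn_loss(z, y)  # first two terms contribute 0, third (3-1)^2 = 4
```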
[138] UniField: A Unified Field-Aware MRI Enhancement Framework
Yiyang Lin, Chenhui Wang, Zhihao Peng, Yixuan Yuan
Main category: cs.CV
TL;DR: A unified MRI field-strength enhancement framework that leverages 3D foundation models and field-aware spectral rectification to improve generalization across different field strength transitions.
Details
Motivation: Existing MRI field-strength enhancement methods are limited to isolated tasks with specific field-strength transitions (e.g., 64mT-to-3T or 3T-to-7T) using small datasets, failing to exploit shared degradation patterns across different field strengths and limiting model generalization.
Method: Proposes a unified framework integrating multiple modalities and enhancement tasks to mutually promote representation learning. Key innovations: 1) Uses pre-trained 3D foundation models to capture continuous anatomical structures instead of treating 3D volumes as independent 2D slices; 2) Introduces Field-Aware Spectral Rectification Mechanism (FASRM) to address spectral bias in flow-matching models by incorporating physical magnetic field mechanisms; 3) Releases a comprehensive paired multi-field MRI dataset.
Result: Extensive experiments show superiority over state-of-the-art approaches, achieving average improvements of approximately 1.81 dB in PSNR and 9.47% in SSIM.
Conclusion: The proposed unified framework effectively addresses limitations of existing MRI field-strength enhancement methods by leveraging 3D foundation models, field-aware spectral correction, and a comprehensive dataset, demonstrating significant performance improvements.
Abstract: Magnetic Resonance Imaging (MRI) field-strength enhancement holds immense value for both clinical diagnostics and advanced research. However, existing methods typically focus on isolated enhancement tasks, such as specific 64mT-to-3T or 3T-to-7T transitions using limited subject cohorts, thereby failing to exploit the shared degradation patterns inherent across different field strengths and severely restricting model generalization. To address this challenge, we propose UniField, a unified framework integrating multiple modalities and enhancement tasks to mutually promote representation learning by exploiting these shared degradation characteristics. Specifically, our main contributions are threefold. Firstly, to overcome MRI data scarcity and capture continuous anatomical structures, UniField departs from conventional methods that treat 3D MRI volumes as independent 2D slices. Instead, we directly exploit comprehensive 3D volumetric information by leveraging pre-trained 3D foundation models, thereby embedding generalized and robust structural representations to significantly boost enhancement performance. In addition, to mitigate the spectral bias of mainstream flow-matching models that often over-smooth high-frequency details, we explicitly incorporate the physical mechanisms of magnetic fields to introduce a Field-Aware Spectral Rectification Mechanism (FASRM), tailoring customized spectral corrections to distinct field strengths. Finally, to resolve the fundamental data bottleneck, we organize and publicly release a comprehensive paired multi-field MRI dataset, which is an order of magnitude larger than existing datasets. Extensive experiments demonstrate our method’s superiority over state-of-the-art approaches, achieving an average improvement of approximately 1.81 dB in PSNR and 9.47% in SSIM. Code will be released upon acceptance.
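The spectral-rectification idea above (counteracting over-smoothed high frequencies with a field-dependent correction) can be caricatured with a fixed radial gain in the 2D Fourier domain. This is purely illustrative; FASRM learns its correction, and the linear radial ramp and `gain_hf` parameter here are assumptions:

```python
import numpy as np

def spectral_rectify(img, gain_hf):
    """Scale Fourier magnitudes by a radial ramp from 1 at DC to `gain_hf`
    at the band edge, standing in for a field-aware high-frequency boost."""
    F = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.mgrid[:h, :w]
    r = np.hypot(yy - h / 2, xx - w / 2) / (min(h, w) / 2)  # 0 = DC, 1 = edge
    F *= 1.0 + (gain_hf - 1.0) * np.clip(r, 0, 1)
    return np.fft.ifft2(np.fft.ifftshift(F)).real

rng = np.random.default_rng(1)
img = rng.normal(size=(32, 32))          # toy image
out = spectral_rectify(img, gain_hf=1.5)
```

The DC term is left untouched, so the mean intensity is preserved while high-frequency content is amplified.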
[139] HelixTrack: Event-Based Tracking and RPM Estimation of Propeller-like Objects
Radim Spetlik, Michal Pliska, Vojtěch Vrba, Jiri Matas
Main category: cs.CV
TL;DR: HelixTrack: Event-based method for microsecond-latency tracking and RPM estimation of propeller-like objects, addressing limitations of frame-based trackers on periodic motion.
Details
Motivation: Existing frame-based and event-based trackers fail on propeller tracking due to periodic motion violating smooth-motion assumptions, creating a gap for safety-critical perception in UAVs and rotating machinery.
Method: Fully event-driven approach that back-warps events from image plane to rotor plane via estimated homography, uses Kalman Filter for phase estimation, and performs batched iterative updates coupling phase residuals to geometry for pose refinement.
Result: Processes events at ~11.8x real time with microsecond latency, outperforms per-event and aggregation-based baselines on the new TQE dataset containing 52 rotating objects with ground truth.
Conclusion: HelixTrack enables robust tracking and RPM estimation of fast periodic motion, addressing a critical gap in safety-critical perception systems.
Abstract: Safety-critical perception for unmanned aerial vehicles and rotating machinery requires microsecond-latency tracking of fast, periodic motion under egomotion and strong distractors. Frame-based and event-based trackers drift or break on propellers because periodic signatures violate their smooth-motion assumptions. We tackle this gap with HelixTrack, a fully event-driven method that jointly tracks propeller-like objects and estimates their rotations per minute (RPM). Incoming events are back-warped from the image plane into the rotor plane via a homography estimated on the fly. A Kalman Filter maintains instantaneous estimates of phase. Batched iterative updates refine the object pose by coupling phase residuals to geometry. To our knowledge, no public dataset targets joint tracking and RPM estimation of propeller-like objects. We therefore introduce the Timestamped Quadcopter with Egomotion (TQE) dataset with 13 high-resolution event sequences, containing 52 rotating objects in total, captured at distances of 2 m / 4 m, with increasing egomotion and microsecond RPM ground truth. On TQE, HelixTrack processes full-rate events at approximately 11.8x real time with microsecond latency. It consistently outperforms per-event and aggregation-based baselines adapted for RPM estimation.
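The phase-tracking Kalman Filter described above can be sketched as a constant-angular-velocity filter over the state [phase, omega], with the phase innovation wrapped to (-π, π]. The noise levels, the two-sample initialisation, and the noiseless toy measurements are assumptions, not HelixTrack's actual tuning:

```python
import numpy as np

def kf_phase(ts, meas, q=1e-4, r=1e-2):
    """Kalman filter over [phase (rad), angular rate (rad/s)] driven by
    timestamped, wrapped phase measurements."""
    # initialise the rate from the first two (toy, non-wrapping) samples
    x = np.array([meas[1], (meas[1] - meas[0]) / (ts[1] - ts[0])])
    P = np.eye(2)
    for i in range(2, len(ts)):
        dt = ts[i] - ts[i - 1]
        F = np.array([[1.0, dt], [0.0, 1.0]])
        x = F @ x                                   # predict
        P = F @ P @ F.T + q * np.eye(2)
        y = (meas[i] - x[0] + np.pi) % (2 * np.pi) - np.pi  # wrapped residual
        S = P[0, 0] + r                             # H = [1, 0]
        K = P[:, 0] / S
        x = x + K * y                               # update
        P = P - np.outer(K, P[0])
    return x

# synthetic rotor at 60 RPM = 2*pi rad/s, observed through wrapped phase
ts = np.linspace(0.0, 1.0, 200)
meas = (2 * np.pi * ts) % (2 * np.pi)
phase, omega = kf_phase(ts, meas)
rpm = omega * 60 / (2 * np.pi)
```

Because the innovation is wrapped, the unwrapped state can grow past 2π while measurements stay in [0, 2π), which is what makes RPM recoverable from the rate component.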
[140] BridgeDiff: Bridging Human Observations and Flat-Garment Synthesis for Virtual Try-Off
Shuang Liu, Ao Yu, Linkang Cheng, Xiwen Huang, Li Zhao, Junhui Liu, Zhiting Lin, Yu Liu
Main category: cs.CV
TL;DR: BridgeDiff is a diffusion-based framework for virtual try-off that bridges human-centric observations with flat-garment synthesis using garment-cue representations and structural constraints.
Details
Motivation: Prior methods treat virtual try-off as direct image translation using local masks or text prompts, which overlooks the gap between on-body appearances and flat layouts, leading to inconsistent completion in unobserved regions and unstable garment structure.
Method: BridgeDiff uses two components: 1) Garment Condition Bridge Module (GCBM) that builds garment-cue representations capturing global appearance and semantic identity, and 2) Flat Structure Constraint Module (FSCM) that injects explicit flat-garment structural priors via Flat-Constraint Attention at selected denoising stages.
Result: Extensive experiments on standard VTOFF benchmarks show BridgeDiff achieves state-of-the-art performance, producing higher-quality flat-garment reconstructions while preserving fine-grained appearance and structural integrity.
Conclusion: BridgeDiff effectively bridges the gap between human-centric observations and flat-garment synthesis through complementary garment-cue representations and structural constraints, improving virtual try-off quality.
Abstract: Virtual try-off (VTOFF) aims to recover canonical flat-garment representations from images of dressed persons for standardized display and downstream virtual try-on. Prior methods often treat VTOFF as direct image translation driven by local masks or text-only prompts, overlooking the gap between on-body appearances and flat layouts. This gap frequently leads to inconsistent completion in unobserved regions and unstable garment structure. We propose BridgeDiff, a diffusion-based framework that explicitly bridges human-centric observations and flat-garment synthesis through two complementary components. First, the Garment Condition Bridge Module (GCBM) builds a garment-cue representation that captures global appearance and semantic identity, enabling robust inference of continuous details under partial visibility. Second, the Flat Structure Constraint Module (FSCM) injects explicit flat-garment structural priors via Flat-Constraint Attention (FC-Attention) at selected denoising stages, improving structural stability beyond text-only conditioning. Extensive experiments on standard VTOFF benchmarks show that BridgeDiff achieves state-of-the-art performance, producing higher-quality flat-garment reconstructions while preserving fine-grained appearance and structural integrity.
[141] RAE-NWM: Navigation World Model in Dense Visual Representation Space
Mingkun Zhang, Wangtian Shen, Fan Zhang, Haijian Qin, Zihao Pei, Ziyang Meng
Main category: cs.CV
TL;DR: RAE-NWM: A navigation world model that uses dense DINOv2 features instead of compressed latents for better structural preservation and action-conditioned transition modeling via conditional diffusion transformers.
Details
Motivation: Current navigation world models use compressed latent spaces from VAEs that lose fine-grained structural information, hindering precise control. The authors found that dense DINOv2 features have better linear predictability for action-conditioned transitions, motivating their use for navigation dynamics modeling.
Method: Proposes Representation Autoencoder-based Navigation World Model (RAE-NWM) that models navigation dynamics in dense visual representation space. Uses Conditional Diffusion Transformer with Decoupled Diffusion Transformer head (CDiT-DH) for continuous transitions, and a time-driven gating module for dynamics conditioning to regulate action injection during generation.
Result: Extensive evaluations show that modeling sequential rollouts in dense representation space improves structural stability and action accuracy, benefiting downstream planning and navigation tasks.
Conclusion: Dense visual representations (like DINOv2 features) provide better foundations for navigation world models than compressed latent spaces, enabling more precise control and improved planning through better structural preservation.
Abstract: Visual navigation requires agents to reach goals in complex environments through perception and planning. World models address this task by simulating action-conditioned state transitions to predict future observations. Current navigation world models typically learn state evolution under actions within the compressed latent space of a Variational Autoencoder, where spatial compression often discards fine-grained structural information and hinders precise control. To better understand the propagation characteristics of different representations, we conduct a linear dynamics probe and observe that dense DINOv2 features exhibit stronger linear predictability for action-conditioned transitions. Motivated by this observation, we propose the Representation Autoencoder-based Navigation World Model (RAE-NWM), which models navigation dynamics in a dense visual representation space. We employ a Conditional Diffusion Transformer with Decoupled Diffusion Transformer head (CDiT-DH) to model continuous transitions, and introduce a separate time-driven gating module for dynamics conditioning to regulate action injection strength during generation. Extensive evaluations show that modeling sequential rollouts in this space improves structural stability and action accuracy, benefiting downstream planning and navigation.
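The "linear dynamics probe" mentioned above amounts to fitting a least-squares linear map from (feature, action) to the next-step feature and checking how much variance it explains. A self-contained sketch on synthetic linear dynamics (the toy dimensions and ground-truth matrices are assumptions standing in for DINOv2 features):

```python
import numpy as np

rng = np.random.default_rng(0)
# toy "features": the next state is an exact linear function of state + action
A = rng.normal(size=(8, 8)) * 0.3
B = rng.normal(size=(8, 2))
s = rng.normal(size=(500, 8))      # current-step features
a = rng.normal(size=(500, 2))      # actions
s_next = s @ A.T + a @ B.T         # next-step features

# linear dynamics probe: least-squares fit of [s, a] -> s_next
X = np.hstack([s, a])
W, *_ = np.linalg.lstsq(X, s_next, rcond=None)
pred = X @ W
r2 = 1 - ((s_next - pred) ** 2).sum() / ((s_next - s_next.mean(0)) ** 2).sum()
```

A high R² on real features is the evidence the paper cites for DINOv2's stronger linear predictability of action-conditioned transitions; on this exactly-linear toy the fit is near perfect.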
[142] X-GS: An Extensible Open Framework Unifying 3DGS Architectures with Downstream Multimodal Models
Yueen Ma, Irwin King
Main category: cs.CV
TL;DR: X-GS is an extensible framework that unifies real-time 3D Gaussian Splatting for online SLAM with semantic enrichment, enabling multimodal vision-language applications.
Details
Motivation: Most existing 3DGS methods are isolated and domain-specific, lacking integration between geometry reconstruction, pose estimation, and semantic understanding needed for multimodal AI applications.
Method: X-GS-Perceiver pipeline processes unposed RGB/RGB-D video to co-optimize geometry and poses, distills semantic features from vision foundation models into 3D Gaussians using online Vector Quantization, GPU-accelerated grid-sampling, and parallelized design.
Result: Achieves real-time performance on real-world datasets, enabling semantic 3D Gaussians that can be used by vision-language models for object detection, zero-shot caption generation, and embodied tasks.
Conclusion: X-GS provides a unified, extensible framework that bridges 3D reconstruction with multimodal AI capabilities, unlocking new possibilities for spatial AI applications.
Abstract: 3D Gaussian Splatting (3DGS) has emerged as a powerful technique for novel view synthesis, subsequently extending into numerous spatial AI applications. However, most existing 3DGS methods are isolated, focusing on specific domains such as online SLAM, semantic enrichment, or 3DGS for unposed images. In this paper, we introduce X-GS, an extensible open framework that unifies a broad range of techniques to enable real-time 3DGS-based online SLAM enriched with semantics, bridging the gap to downstream multimodal models. At the core of X-GS is a highly efficient pipeline called X-GS-Perceiver, capable of taking unposed RGB (or optionally RGB-D) video streams as input to co-optimize geometry and poses, and distill high-dimensional semantic features from vision foundation models into the 3D Gaussians. We achieve real-time performance through a novel online Vector Quantization (VQ) module, a GPU-accelerated grid-sampling scheme, and a highly parallelized pipeline design. The semantic 3D Gaussians can then be utilized by vision-language models within the X-GS-Thinker component, enabling downstream tasks such as object detection, zero-shot caption generation, and potentially embodied tasks. Experimental results on real-world datasets showcase the efficacy, efficiency, and newly unlocked multimodal capabilities of the X-GS framework.
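The online Vector Quantization step above compresses high-dimensional semantic features into a small codebook as features stream in. A minimal running-mean sketch (the nearest-code assignment and running-mean update rule are assumptions about how an online VQ could work, not X-GS's implementation):

```python
import numpy as np

def online_vq_step(codebook, counts, feat):
    """Assign one incoming feature to its nearest code and update that code
    as a running mean of everything assigned to it so far."""
    k = int(np.linalg.norm(codebook - feat, axis=1).argmin())
    counts[k] += 1
    codebook[k] += (feat - codebook[k]) / counts[k]  # incremental mean
    return k

codebook = np.array([[0.0, 0.0], [10.0, 10.0]])  # toy 2-code codebook
counts = np.zeros(2)
stream = np.array([[0.1, 0.0], [9.9, 10.1], [0.0, 0.3]])
ids = [online_vq_step(codebook, counts, f) for f in stream]
```

Storing only the code id per Gaussian (instead of the full feature vector) is what keeps the semantic enrichment compatible with real-time operation.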
[143] When Detectors Forget Forensics: Blocking Semantic Shortcuts for Generalizable AI-Generated Image Detection
Chao Shuai, Zhenguang Liu, Shaojing Fan, Bin Gong, Weichen Lian, Xiuli Bi, Zhongjie Ba, Kui Ren
Main category: cs.CV
TL;DR: A method called Geometric Semantic Decoupling (GSD) improves AI-generated image detection by removing semantic priors from VFM-based detectors to focus on forgery-specific traces, enhancing generalization to unseen generation pipelines.
Details
Motivation: Vision Foundation Models (VFMs) like CLIP struggle to generalize AI-generated image detection to unseen generation pipelines due to "semantic fallback" - they rely on dominant pre-trained semantic priors rather than forgery-specific traces under distribution shifts.
Method: Proposes Geometric Semantic Decoupling (GSD), a parameter-free module that explicitly removes semantic components from learned representations by using a frozen VFM as semantic guide and trainable VFM as artifact detector. GSD estimates semantic directions from batch-wise statistics and projects them out via geometric constraint.
Result: Outperforms state-of-the-art approaches: achieves 94.4% video-level AUC (+1.2%) in cross-dataset evaluation, improves robustness to unseen manipulations (+3.0% on DF40), and generalizes beyond faces to synthetic images of general scenes (+0.9% on UniversalFakeDetect, +1.7% on GenImage).
Conclusion: GSD effectively addresses semantic fallback in VFM-based detectors by forcing artifact detectors to rely on semantic-invariant forensic evidence, significantly improving generalization to unseen AI-generated content across various domains.
Abstract: AI-generated image detection has become increasingly important with the rapid advancement of generative AI. However, detectors built on Vision Foundation Models (VFMs, e.g., CLIP) often struggle to generalize to images created using unseen generation pipelines. We identify, for the first time, a key failure mechanism, termed "semantic fallback", where VFM-based detectors rely on dominant pre-trained semantic priors (such as identity) rather than forgery-specific traces under distribution shifts. To address this issue, we propose Geometric Semantic Decoupling (GSD), a parameter-free module that explicitly removes semantic components from learned representations by leveraging a frozen VFM as a semantic guide with a trainable VFM as an artifact detector. GSD estimates semantic directions from batch-wise statistics and projects them out via a geometric constraint, forcing the artifact detector to rely on semantic-invariant forensic evidence. Extensive experiments demonstrate that our method consistently outperforms state-of-the-art approaches, achieving 94.4% video-level AUC (+1.2%) in cross-dataset evaluation, improving robustness to unseen manipulations (+3.0% on DF40), and generalizing beyond faces to the detection of synthetic images of general scenes, including UniversalFakeDetect (+0.9%) and GenImage (+1.7%).
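The core geometric operation above, estimating a semantic direction from batch statistics and projecting it out of the detector's features, is a simple orthogonal projection. A minimal sketch (using the batch mean of the frozen guide's features as the semantic direction is an assumption; GSD's exact estimator may differ):

```python
import numpy as np

def project_out(feats, sem_dir):
    """Remove each feature's component along the unit-normalised semantic
    direction, keeping only the orthogonal 'artifact' part."""
    d = sem_dir / np.linalg.norm(sem_dir)
    return feats - np.outer(feats @ d, d)

# semantic direction estimated from batch statistics of a frozen guide model
guide_feats = np.array([[2.0, 0.1], [4.0, -0.1], [6.0, 0.0]])
sem_dir = guide_feats.mean(axis=0)            # dominant semantic component
cleaned = project_out(np.array([[3.0, 1.0]]), sem_dir)
```

After the projection, the cleaned feature has zero component along the semantic direction, so the detector can no longer fall back on it.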
[144] Towards Instance Segmentation with Polygon Detection Transformers
Jiacheng Sun, Jiaqi Lin, Wenlong Hu, Haoyang Li, Xinghong Zhou, Chenghai Mao, Yan Peng, Xiaomao Li
Main category: cs.CV
TL;DR: Poly-DETR reformulates instance segmentation as sparse vertex regression using polar representation instead of dense mask prediction, achieving better efficiency and performance for regular-shaped objects.
Details
Motivation: Address the bottleneck in instance segmentation where high-resolution inputs conflict with lightweight, real-time inference requirements by moving away from dense pixel-wise mask prediction.
Method: Proposes the Polygon Detection Transformer (Poly-DETR), which formulates instance segmentation as sparse vertex regression via Polar Representation. Introduces Polar Deformable Attention and a Position-Aware Training Scheme to handle the box-to-polygon reference shift and focus on boundary cues.
Result: Achieves 4.7 mAP improvement over state-of-the-art polar-based methods on MS COCO test-dev. Reduces memory consumption by almost half on Cityscapes dataset. Outperforms mask-based counterpart on PanNuke (cell segmentation) and SpaceNet (building footprints) datasets.
Conclusion: Poly-DETR provides a more lightweight and efficient approach to instance segmentation, especially beneficial for regular-shaped instances in domain-specific settings, with significant advantages in high-resolution scenarios.
Abstract: One of the bottlenecks for instance segmentation today lies in the conflicting requirements of high-resolution inputs and lightweight, real-time inference. To address this bottleneck, we present a Polygon Detection Transformer (Poly-DETR) to reformulate instance segmentation as sparse vertex regression via Polar Representation, thereby eliminating the reliance on dense pixel-wise mask prediction. Considering the box-to-polygon reference shift in Detection Transformers, we propose Polar Deformable Attention and Position-Aware Training Scheme to dynamically update supervision and focus attention on boundary cues. Compared with state-of-the-art polar-based methods, Poly-DETR achieves a 4.7 mAP improvement on MS COCO test-dev. Moreover, we construct a parallel mask-based counterpart to support a systematic comparison between polar and mask representations. Experimental results show that Poly-DETR is more lightweight in high-resolution scenarios, reducing memory consumption by almost half on the Cityscapes dataset. Notably, on PanNuke (cell segmentation) and SpaceNet (building footprints) datasets, Poly-DETR surpasses its mask-based counterpart on all metrics, which validates its advantage on regular-shaped instances in domain-specific settings.
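To make the polar representation concrete: an instance can be encoded as a center plus K radii at uniformly spaced angles, which decode to polygon vertices. This is the generic polar-contour formulation (familiar from PolarMask-style methods); Poly-DETR's exact parameterization is not spelled out in the summary, so treat this as an illustrative sketch.

```python
import numpy as np

def polar_to_polygon(center, radii):
    """Decode a polar instance encoding (center + K radii at uniformly
    spaced angles) into (K, 2) polygon vertices."""
    k = len(radii)
    ang = np.linspace(0.0, 2.0 * np.pi, k, endpoint=False)
    xs = center[0] + radii * np.cos(ang)
    ys = center[1] + radii * np.sin(ang)
    return np.stack([xs, ys], axis=1)
```

Regressing K scalars per instance instead of a dense H×W mask is what makes the representation sparse, and why memory drops sharply at high resolution.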
[145] Multi-model approach for autonomous driving: A comprehensive study on traffic sign-, vehicle- and lane detection and behavioral cloning
Kanishkha Jaisankar, Pranav M. Pawar, Diana Susane Joseph, Raja Muthalagu, Mithun Mukherjee
Main category: cs.CV
TL;DR: A review paper on using deep learning and computer vision techniques for self-driving cars, focusing on traffic sign classification, vehicle detection, lane detection, and behavioral cloning using neural networks.
Details
Motivation: To enhance the performance of self-driving cars by applying deep learning and computer vision techniques to enable better perception and understanding of the environment for safe navigation and decision-making.
Method: Uses pre-trained and custom neural networks with techniques like data augmentation (geometric/color transformations), image normalization, and transfer learning for feature extraction. Applied to datasets including GTSRB, road/lane segmentation, vehicle detection datasets, and Udacity simulator data.
Result: The approach effectively solves challenges in traffic sign classification, lane prediction, vehicle detection, and behavioral cloning, providing insights for improving robustness and reliability of autonomous systems.
Conclusion: The work reviews state-of-the-art deep learning and computer vision for self-driving cars and paves the way for future research and deployment of safer, more efficient autonomous technologies.
Abstract: Deep learning and computer vision techniques have become increasingly important in the development of self-driving cars. These techniques play a crucial role in enabling self-driving cars to perceive and understand their surroundings, allowing them to safely navigate and make decisions in real-time. Using neural networks, self-driving cars can accurately identify and classify objects such as pedestrians, other vehicles, and traffic signals. Using deep learning and analyzing data from sensors such as cameras and radar, self-driving cars can predict the likely movement of other objects and plan their own actions accordingly. In this study, a novel approach to enhance the performance of self-driving cars by using pre-trained and custom-made neural networks for key tasks, including traffic sign classification, vehicle detection, lane detection, and behavioral cloning, is provided. The methodology integrates several innovative techniques, such as geometric and color transformations for data augmentation, image normalization, and transfer learning for feature extraction. These techniques are applied to diverse datasets, including the German Traffic Sign Recognition Benchmark (GTSRB), road and lane segmentation datasets, vehicle detection datasets, and data collected using the Udacity self-driving car simulator, to evaluate model efficacy. The primary objective of the work is to review the state-of-the-art in deep learning and computer vision for self-driving cars. The findings of the work are effective in solving various challenges related to self-driving cars, like traffic sign classification, lane prediction, vehicle detection, and behavioral cloning, and provide valuable insights into improving the robustness and reliability of autonomous systems, paving the way for future research and deployment of safer and more efficient self-driving technologies.
[146] EXPLORE-Bench: Egocentric Scene Prediction with Long-Horizon Reasoning
Chengjun Yu, Xuhan Zhu, Chaoqun Du, Pengfei Yu, Wei Zhai, Yang Cao, Zheng-Jun Zha
Main category: cs.CV
TL;DR: EXPLORE-Bench: A benchmark for evaluating MLLMs’ ability to predict final scenes from initial images and long action sequences, revealing significant gaps in long-horizon egocentric reasoning.
Details
Motivation: While MLLMs are considered for embodied agents, it's unclear if they can reliably reason about long-term physical consequences of actions from an egocentric viewpoint. The paper aims to study this gap through systematic evaluation.
Method: Introduces EXPLORE-Bench, a benchmark curated from real first-person videos with diverse scenarios. Each instance pairs long action sequences with structured final-scene annotations (object categories, visual attributes, inter-object relations) for fine-grained quantitative assessment.
Result: Experiments on proprietary and open-source MLLMs show significant performance gap compared to humans, indicating long-horizon egocentric reasoning remains a major challenge. Stepwise reasoning improves performance but incurs computational overhead.
Conclusion: EXPLORE-Bench provides a principled testbed for measuring and advancing long-horizon reasoning for egocentric embodied perception, highlighting current limitations in MLLMs for embodied applications.
Abstract: Multimodal large language models (MLLMs) are increasingly considered as a foundation for embodied agents, yet it remains unclear whether they can reliably reason about the long-term physical consequences of actions from an egocentric viewpoint. We study this gap through a new task, Egocentric Scene Prediction with LOng-horizon REasoning: given an initial-scene image and a sequence of atomic action descriptions, a model is asked to predict the final scene after all actions are executed. To enable systematic evaluation, we introduce EXPLORE-Bench, a benchmark curated from real first-person videos spanning diverse scenarios. Each instance pairs long action sequences with structured final-scene annotations, including object categories, visual attributes, and inter-object relations, which supports fine-grained, quantitative assessment. Experiments on a range of proprietary and open-source MLLMs reveal a significant performance gap to humans, indicating that long-horizon egocentric reasoning remains a major challenge. We further analyze test-time scaling via stepwise reasoning and show that decomposing long action sequences can improve performance to some extent, while incurring non-trivial computational overhead. Overall, EXPLORE-Bench provides a principled testbed for measuring and advancing long-horizon reasoning for egocentric embodied perception.
[147] Multimodal Graph Representation Learning with Dynamic Information Pathways
Xiaobin Hong, Mingkai Lin, Xiaoli Wang, Chaoqun Wang, Wenzhong Li
Main category: cs.CV
TL;DR: DiP is a multimodal graph learning framework that uses modality-specific pseudo nodes to enable dynamic message routing within modalities and efficient inter-modal aggregation with linear complexity.
Details
Motivation: Existing multimodal graph learning approaches are limited by static structures or dense attention mechanisms, which restrict flexibility and expressive node embedding learning for heterogeneous features like images and text.
Method: Introduces modality-specific pseudo nodes that enable dynamic message routing within each modality via proximity-guided pseudo-node interactions, and captures inter-modality dependence through efficient information pathways in a shared state space.
Result: Extensive experiments across multiple benchmarks show DiP consistently outperforms baselines on link prediction and node classification tasks.
Conclusion: DiP achieves adaptive, expressive, and sparse message propagation across modalities with linear complexity, providing an effective framework for multimodal graph representation learning.
Abstract: Multimodal graphs, where nodes contain heterogeneous features such as images and text, are increasingly common in real-world applications. Effectively learning on such graphs requires both adaptive intra-modal message passing and efficient inter-modal aggregation. However, most existing approaches to multimodal graph learning are typically extended from conventional graph neural networks and rely on static structures or dense attention, which limit flexibility and expressive node embedding learning. In this paper, we propose a novel multimodal graph representation learning framework with Dynamic information Pathways (DiP). By introducing modality-specific pseudo nodes, DiP enables dynamic message routing within each modality via proximity-guided pseudo-node interactions and captures inter-modality dependence through efficient information pathways in a shared state space. This design achieves adaptive, expressive, and sparse message propagation across modalities with linear complexity. We conduct link prediction and node classification tasks to evaluate performance and carry out full experimental analyses. Extensive experiments across multiple benchmarks demonstrate that DiP consistently outperforms baselines.
[148] Implicit Geometry Representations for Vision-and-Language Navigation from Web Videos
Mingfei Han, Haihong Hao, Liang Ma, Kamila Zhumakhanova, Ekaterina Radionova, Jingyi Zhang, Xiaojun Chang, Xiaodan Liang, Ivan Laptev
Main category: cs.CV
TL;DR: A large-scale video-instruction framework for Vision-and-Language Navigation using web-based room tour videos with implicit geometry representations to improve data utilization and performance.
Details
Motivation: Overcome limitations of simulator-curated VLN datasets that lack diversity and scalability, and fail to capture real-world complexity.
Method: Creates a framework from web-based room tour videos with both description-enriched and action-enriched trajectories reconstructed in 3D, plus implicit geometry representations that extract spatial cues directly from RGB frames without full 3D reconstruction.
Result: Sets new state-of-the-art performance across multiple VLN benchmarks (CVDN, SOON, R2R, REVERIE) and enables robust zero-shot navigation agents.
Conclusion: Bridges large-scale web videos with implicit spatial reasoning to advance embodied navigation towards more scalable, generalizable, and real-world applicable solutions.
Abstract: Vision-and-Language Navigation (VLN) has long been constrained by the limited diversity and scalability of simulator-curated datasets, which fail to capture the complexity of real-world environments. To overcome this limitation, we introduce a large-scale video-instruction framework derived from web-based room tour videos, enabling agents to learn from natural human walking demonstrations in diverse, realistic indoor settings. Unlike existing datasets, our framework integrates both open-ended description-enriched trajectories and action-enriched trajectories reconstructed in 3D, providing richer spatial and semantic supervision. A key extension in this work is the incorporation of implicit geometry representations, which extract spatial cues directly from RGB frames without requiring fragile 3D reconstruction. This approach substantially improves data utilization, alleviates reconstruction failures, and unlocks large portions of previously unusable video data. Comprehensive experiments across multiple VLN benchmarks (CVDN, SOON, R2R, and REVERIE) demonstrate that our method not only sets new state-of-the-art performance but also enables the development of robust zero-shot navigation agents. By bridging large-scale web videos with implicit spatial reasoning, this work advances embodied navigation towards more scalable, generalizable, and real-world applicable solutions.
[149] ForgeDreamer: Industrial Text-to-3D Generation with Multi-Expert LoRA and Cross-View Hypergraph
Junhao Cai, Deyu Zeng, Junhao Pang, Lini Li, Zongze Wu, Xiaopin Zhong
Main category: cs.CV
TL;DR: ForgeDreamer: A text-to-3D generation framework for industrial applications that addresses domain adaptation and geometric reasoning limitations through Multi-Expert LoRA Ensemble and Cross-View Hypergraph Geometric Enhancement.
Details
Motivation: Current text-to-3D generation methods work well for natural scenes but fail in industrial applications due to domain adaptation challenges (LoRA fusion causes knowledge interference across categories) and geometric reasoning deficiencies (pairwise consistency constraints can't capture higher-order structural dependencies needed for precision manufacturing).
Method: Two key innovations: 1) a Multi-Expert LoRA Ensemble consolidates multiple category-specific LoRA models into a unified representation for superior cross-category generalization without knowledge interference; 2) Cross-View Hypergraph Geometric Enhancement captures structural dependencies across multiple viewpoints simultaneously, building on the enhanced semantic understanding.
Result: Extensive experiments on custom industrial dataset demonstrate superior semantic generalization and enhanced geometric fidelity compared to state-of-the-art approaches.
Conclusion: The proposed framework addresses critical limitations in industrial text-to-3D generation through synergistic components that improve semantic understanding and enable effective geometric reasoning with manufacturing-level consistency.
Abstract: Current text-to-3D generation methods excel in natural scenes but struggle with industrial applications due to two critical limitations: domain adaptation challenges where conventional LoRA fusion causes knowledge interference across categories, and geometric reasoning deficiencies where pairwise consistency constraints fail to capture higher-order structural dependencies essential for precision manufacturing. We propose a novel framework named ForgeDreamer addressing both challenges through two key innovations. First, we introduce a Multi-Expert LoRA Ensemble mechanism that consolidates multiple category-specific LoRA models into a unified representation, achieving superior cross-category generalization while eliminating knowledge interference. Second, building on enhanced semantic understanding, we develop a Cross-View Hypergraph Geometric Enhancement approach that captures structural dependencies spanning multiple viewpoints simultaneously. These components work synergistically: the improved semantic understanding enables more effective geometric reasoning, while hypergraph modeling ensures manufacturing-level consistency. Extensive experiments on a custom industrial dataset demonstrate superior semantic generalization and enhanced geometric fidelity compared to state-of-the-art approaches. Our code and data are provided in the supplementary material attached in the appendix for review purposes.
[150] Speeding Up the Learning of 3D Gaussians with Much Shorter Gaussian Lists
Jiaqi Liu, Zhizhong Han
Main category: cs.CV
TL;DR: Novel training strategies and losses for 3D Gaussian Splatting that improve efficiency by shortening Gaussian lists per pixel through scale regularization and entropy constraints.
Details
Motivation: While 3D Gaussian Splatting (3DGS) shows advantages over NeRF in rendering quality and efficiency, there remains a challenge to further improve the efficiency of learning 3D Gaussians. The authors aim to reduce computational cost by shortening the Gaussian lists used to render each pixel.
Method: Two main techniques: 1) regular scale resetting to shrink Gaussian sizes, encouraging smaller Gaussians to cover fewer nearby pixels; 2) an entropy constraint on alpha blending to sharpen the weight distribution along rays, making dominant weights larger and minor weights smaller. Also integrates with a rendering resolution scheduler for progressive resolution increase.
Result: The method shows significant advantages over state-of-the-art methods in efficiency without sacrificing rendering quality on widely used benchmarks.
Conclusion: The proposed training strategies and losses effectively improve 3DGS efficiency by reducing Gaussian list lengths per pixel while maintaining rendering quality, making 3DGS more practical for real-time applications.
Abstract: 3D Gaussian splatting (3DGS) has become a vital tool for learning a radiance field from multiple posed images. Although 3DGS shows great advantages over NeRF in terms of rendering quality and efficiency, it remains a research challenge to further improve the efficiency of learning 3D Gaussians. To overcome this challenge, we propose novel training strategies and losses to shorten each Gaussian list used to render a pixel, which speeds up the splatting by involving fewer Gaussians along a ray. Specifically, we shrink the size of each Gaussian by resetting their scales regularly, encouraging smaller Gaussians to cover fewer nearby pixels, which shortens the Gaussian lists of pixels. Additionally, we introduce an entropy constraint on the alpha blending procedure to sharpen the weight distribution of Gaussians along each ray, which drives dominant weights larger while making minor weights smaller. As a result, each Gaussian becomes more focused on the pixels where it is dominant, which reduces its impact on nearby pixels, leading to even shorter Gaussian lists. Eventually, we integrate our method into a rendering resolution scheduler which further improves efficiency through progressive resolution increase. We evaluate our method by comparing it with state-of-the-art methods on widely used benchmarks. Our results show significant advantages over others in efficiency without sacrificing rendering quality.
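The entropy constraint can be made concrete on a single ray: compute the standard alpha-compositing weights, normalize them into a distribution, and penalize its entropy so that mass concentrates on few Gaussians. This is a sketch built on the usual 3DGS blending formula; the paper's exact loss weighting and normalization are not given in the summary.

```python
import numpy as np

def blend_weights(alphas):
    """Standard alpha compositing along a ray: w_i = a_i * prod_{j<i}(1 - a_j)."""
    alphas = np.asarray(alphas, dtype=float)
    transmittance = np.cumprod(np.concatenate(([1.0], 1.0 - alphas[:-1])))
    return alphas * transmittance

def ray_entropy(alphas, eps=1e-12):
    """Entropy of the normalized blending weights; minimizing this sharpens
    the distribution so fewer Gaussians dominate each pixel."""
    w = blend_weights(alphas)
    p = w / (w.sum() + eps)
    return float(-(p * np.log(p + eps)).sum())
```

A ray dominated by one opaque Gaussian has much lower entropy than one blended from many semi-transparent Gaussians, which is exactly the behavior the loss rewards.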
[151] From Ideal to Real: Stable Video Object Removal under Imperfect Conditions
Jiagao Hu, Yuxuan Chen, Fuhao Li, Zepeng Wang, Fei Wang, Daiguo Zhou, Jian Luan
Main category: cs.CV
TL;DR: SVOR is a robust diffusion-based framework for video object removal that handles real-world challenges like shadows, abrupt motion, and defective masks through three key innovations: MUSE for stable erasure, DA-Seg for diffusion-aware localization, and curriculum two-stage training.
Details
Motivation: Existing diffusion-based video inpainting models struggle with temporal stability and visual consistency when dealing with real-world imperfections like shadows, abrupt motion, and defective masks. There's a need for more robust video object removal that works in practical, non-ideal conditions.
Method: Three key designs: 1) MUSE (Mask Union for Stable Erasure), a windowed union strategy during temporal mask downsampling to preserve target regions; 2) DA-Seg (Denoising-Aware Segmentation), a lightweight segmentation head with Denoising-Aware AdaLN trained with mask degradation; 3) Curriculum Two-Stage Training, with self-supervised pretraining on unpaired real-background videos followed by refinement on synthetic pairs with mask degradation and side-effect-weighted losses.
Result: SVOR achieves state-of-the-art results across multiple datasets and degraded-mask benchmarks, demonstrating robust performance in handling shadows, abrupt motion, and defective masks while maintaining temporal stability and visual consistency.
Conclusion: SVOR advances video object removal from ideal settings toward real-world applications by providing a robust framework that handles various real-world imperfections through its three innovative components, making it more practical for actual use cases.
Abstract: Removing objects from videos remains difficult in the presence of real-world imperfections such as shadows, abrupt motion, and defective masks. Existing diffusion-based video inpainting models often struggle to maintain temporal stability and visual consistency under these challenges. We propose Stable Video Object Removal (SVOR), a robust framework that achieves shadow-free, flicker-free, and mask-defect-tolerant removal through three key designs: (1) Mask Union for Stable Erasure (MUSE), a windowed union strategy applied during temporal mask downsampling to preserve all target regions observed within each window, effectively handling abrupt motion and reducing missed removals; (2) Denoising-Aware Segmentation (DA-Seg), a lightweight segmentation head on a decoupled side branch equipped with Denoising-Aware AdaLN and trained with mask degradation to provide an internal diffusion-aware localization prior without affecting content generation; and (3) Curriculum Two-Stage Training: where Stage I performs self-supervised pretraining on unpaired real-background videos with online random masks to learn realistic background and temporal priors, and Stage II refines on synthetic pairs using mask degradation and side-effect-weighted losses, jointly removing objects and their associated shadows/reflections while improving cross-domain robustness. Extensive experiments show that SVOR attains new state-of-the-art results across multiple datasets and degraded-mask benchmarks, advancing video object removal from ideal settings toward real-world applications.
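A minimal sketch of the MUSE idea, assuming binary masks of shape (T, H, W) and a fixed window size (both illustrative): downsampling time by taking the per-window union (logical OR) guarantees that every pixel the target touched within a window stays masked, so a fast-moving object is never dropped.

```python
import numpy as np

def windowed_mask_union(masks, window):
    """Temporal downsampling by per-window union, so targets observed in any
    frame of a window remain covered (sketch of MUSE from the summary)."""
    masks = np.asarray(masks)
    t = masks.shape[0]
    pad = (-t) % window
    if pad:  # repeat the last frame so T divides evenly into windows
        masks = np.concatenate([masks, np.repeat(masks[-1:], pad, axis=0)])
    grouped = masks.reshape(-1, window, *masks.shape[1:])
    return grouped.max(axis=1)  # max acts as logical OR for {0, 1} masks
```

Compare this with naive temporal subsampling (keeping one frame per window), which would silently lose the target whenever it moves between sampled frames.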
[152] Learning Convex Decomposition via Feature Fields
Yuezhi Yang, Qixing Huang, Mikaela Angelina Uy, Nicholas Sharp
Main category: cs.CV
TL;DR: Learning-based convex decomposition of 3D shapes via continuous feature fields that can be clustered to produce high-quality convex decompositions, enabling the first feed-forward model for open-world convex decomposition.
Details
Motivation: Convex decomposition is essential for accelerating collision detection in physical simulation and other applications, but existing methods lack open-world generalization and require optimization per shape. The goal is to create the first learned model that can perform convex decomposition on arbitrary 3D shapes without per-shape optimization.
Method: Proposes learning continuous feature fields that can be clustered to yield convex decompositions. Uses a self-supervised, purely-geometric objective derived from the classical definition of convexity. The feature learning approach enables both single shape optimization and scalable self-supervised learning on large datasets.
Result: Produces higher-quality decompositions than alternatives and generalizes across open-world objects as well as across different 3D representations including meshes, CAD models, and Gaussian splats.
Conclusion: The method enables the first feed-forward model for open-world convex decomposition, with applications in physical simulation acceleration and other domains requiring efficient 3D shape decomposition.
Abstract: This work proposes a new formulation to the long-standing problem of convex decomposition through learning feature fields, enabling the first feed-forward model for open-world convex decomposition. Our method produces high-quality decompositions of 3D shapes into a union of convex bodies, which are essential to accelerate collision detection in physical simulation, amongst many other applications. The key insight is to adopt a feature learning approach and learn a continuous feature field that can later be clustered to yield a good convex decomposition via our self-supervised, purely-geometric objective derived from the classical definition of convexity. Our formulation can be used for single shape optimization, but more importantly, feature prediction unlocks scalable, self-supervised learning on large datasets resulting in the first learned open-world model for convex decomposition. Experiments show that our decompositions are higher-quality than alternatives and generalize across open-world objects as well as across representations to meshes, CAD models, and even Gaussian splats. https://research.nvidia.com/labs/sil/projects/learning-convex-decomp/
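The clustering step can be pictured with plain k-means over per-point features: points whose learned features land in the same cluster form one candidate convex part. Plain k-means is a stand-in here; the paper's actual clustering procedure and convexity objective are more involved, and the names below are illustrative.

```python
import numpy as np

def cluster_feature_field(point_feats, k, iters=25, seed=0):
    """Group per-point features into k clusters (candidate convex parts).

    point_feats: (N, D) features sampled from the learned feature field
    """
    rng = np.random.default_rng(seed)
    # Initialize centers from randomly chosen data points.
    centers = point_feats[rng.choice(len(point_feats), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center in feature space.
        dists = ((point_feats[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of its assigned points.
        for j in range(k):
            if (labels == j).any():
                centers[j] = point_feats[labels == j].mean(axis=0)
    return labels
```

The point of learning the field is that features, unlike raw coordinates, can separate parts that are spatially adjacent but belong to different convex bodies.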
[153] CogBlender: Towards Continuous Cognitive Intervention in Text-to-Image Generation
Shengqi Dang, Jiaying Lei, Yi He, Ziqing Qian, Nan Cao
Main category: cs.CV
TL;DR: CogBlender enables continuous multi-dimensional control of cognitive properties (valence, arousal, dominance, memorability) in text-to-image generation by mapping cognitive space to semantic manifold and interpolating velocity fields.
Details
Motivation: Current text-to-image models excel at semantic coherence but lack control over cognitive properties like emotional response and memorability, limiting their ability to align with specific psychological intent in creative design.
Method: The framework maps the Cognitive Space to the Semantic Manifold, defines Cognitive Anchors as boundary points, and reformulates the velocity field in the flow-matching process by interpolating from different anchors’ velocity fields, enabling dynamic steering by cognitive scores.
Result: Validated across four cognitive dimensions (valence, arousal, dominance, memorability), achieving effective cognitive intervention in text-to-image generation.
Conclusion: Provides effective paradigm for cognition-driven creative design by enabling precise, fine-grained, continuous intervention of cognitive properties in image generation.
Abstract: Beyond conveying semantic information, an image can also manifest cognitive attributes that elicit specific cognitive processes from the viewer, such as memory encoding or emotional response. While modern text-to-image models excel at generating semantically coherent content, they remain limited in their ability to control such cognitive properties of images (e.g., valence, memorability), often failing to align with the specific psychological intent. To bridge this gap, we introduce CogBlender, a framework that enables continuous and multi-dimensional intervention of cognitive properties during text-to-image generation. Our approach is built upon a mapping between the Cognitive Space, representing the space of cognitive properties, and the Semantic Manifold, representing the manifold of the visual semantics. We define a set of Cognitive Anchors, serving as the boundary points for the cognitive space. Then we reformulate the velocity field within the flow-matching process by interpolating from the velocity field of different anchors. Consequently, the generative process is driven by the velocity field and dynamically steered by multi-dimensional cognitive scores, enabling precise, fine-grained, and continuous intervention. We validate the effectiveness of CogBlender across four representative cognitive dimensions: valence, arousal, dominance, and image memorability. Extensive experiments demonstrate that our method achieves effective cognitive intervention. Our work provides an effective paradigm for cognition-driven creative design.
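The interpolation of anchor velocity fields can be sketched as a convex combination weighted by normalized cognitive scores. Names and the linear weighting are assumptions on our part; the paper's actual interpolation on the semantic manifold may well be nonlinear.

```python
import numpy as np

def blended_velocity(anchor_velocities, scores, eps=1e-8):
    """Mix K anchor velocity fields into one steering field.

    anchor_velocities: (K, ...) velocity predictions at the Cognitive Anchors
    scores: (K,) nonnegative cognitive scores (e.g. valence, arousal levels)
    """
    w = np.asarray(scores, dtype=float)
    w = w / (w.sum() + eps)  # normalize into a convex combination
    return np.tensordot(w, np.asarray(anchor_velocities), axes=1)
```

Sliding a score continuously moves the blended field between anchors, which is what makes the cognitive intervention continuous rather than a discrete style switch.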
[154] Exploring Modality-Aware Fusion and Decoupled Temporal Propagation for Multi-Modal Object Tracking
Shilei Wang, Pujian Lai, Dong Gao, Jifeng Ning, Gong Cheng
Main category: cs.CV
TL;DR: MDTrack is a multimodal object tracking framework with modality-aware fusion using dedicated experts and decoupled temporal propagation via separate State Space Models for different modalities.
Details
Motivation: Existing multimodal trackers use uniform fusion strategies that ignore modality differences and propagate temporal information through mixed tokens, resulting in entangled and less discriminative temporal representations.
Method: Proposes MDTrack with two key components: 1) modality-aware fusion using dedicated experts for each modality (infrared, event, depth, RGB) with a gating mechanism for adaptive expert selection; 2) decoupled temporal propagation using separate State Space Model structures for RGB and the other modalities, with cross-attention modules for information exchange.
Result: MDTrack S and MDTrack U achieve state-of-the-art performance across five multimodal tracking benchmarks, demonstrating the effectiveness of the proposed approach.
Conclusion: The proposed modality-aware fusion and decoupled temporal propagation framework effectively addresses limitations of existing multimodal trackers, leading to superior tracking performance across multiple benchmarks.
Abstract: Most existing multimodal trackers adopt uniform fusion strategies, overlooking the inherent differences between modalities. Moreover, they propagate temporal information through mixed tokens, leading to entangled and less discriminative temporal representations. To address these limitations, we propose MDTrack, a novel framework for modality-aware fusion and decoupled temporal propagation in multimodal object tracking. Specifically, for modality-aware fusion, we allocate dedicated experts to each modality, including infrared, event, depth, and RGB, to process their respective representations. The gating mechanism within the Mixture of Experts dynamically selects the optimal experts based on the input features, enabling adaptive and modality-specific fusion. For decoupled temporal propagation, we introduce two separate State Space Model structures to independently store and update the hidden states of the RGB and X-modal streams, effectively capturing their distinct temporal information. To ensure synergy between the two temporal representations, we incorporate a set of cross-attention modules between the input features of the two SSMs, facilitating implicit information exchange. The resulting temporally enriched features are then integrated into the backbone through another set of cross-attention modules, enhancing MDTrack’s ability to leverage temporal information. Extensive experiments demonstrate the effectiveness of our proposed method. Both MDTrack S and MDTrack U achieve state-of-the-art performance across five multimodal tracking benchmarks.
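A hedged sketch of the expert gating: the summary only says the gate "dynamically selects the optimal experts based on the input features", so the top-k softmax routing, shapes, and names below are our assumptions, following the common Mixture-of-Experts recipe.

```python
import numpy as np

def moe_fuse(x, gate_w, expert_outs, top_k=2):
    """Route each sample to its top-k experts and mix their outputs.

    x:           (B, D)    input features fed to the gate
    gate_w:      (D, E)    gating weight matrix (one logit per expert)
    expert_outs: (B, E, D) per-expert outputs for each sample
    """
    logits = x @ gate_w                                    # (B, E)
    top = np.argsort(logits, axis=-1)[:, -top_k:]          # kept expert indices
    masked = np.full_like(logits, -np.inf)                 # drop the rest
    np.put_along_axis(masked, top, np.take_along_axis(logits, top, -1), -1)
    w = np.exp(masked - masked.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)                  # softmax over kept experts
    return np.einsum("be,bed->bd", w, expert_outs)
```

Because non-selected experts receive exactly zero weight, each sample's fused representation depends only on the modalities its gate deems informative.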
[155] DenoiseSplat: Feed-Forward Gaussian Splatting for Noisy 3D Scene Reconstruction
Fuzhen Jiang, Zhuoran Li, Yinlin Zhang
Main category: cs.CV
TL;DR: DenoiseSplat: A feed-forward 3D Gaussian splatting method for noisy multi-view images that outperforms baselines on a large-scale noisy-clean benchmark.
Details
Motivation: Most NeRF and 3D Gaussian Splatting pipelines assume clean inputs and degrade under real noise and artifacts, creating a need for robust 3D reconstruction from noisy multi-view images.
Method: Proposes DenoiseSplat, using a lightweight MVSplat-style feed-forward backbone trained end-to-end with only clean 2D renderings as supervision (no 3D ground truth). Builds a large-scale scene-consistent noisy-clean benchmark on RE10K with various noise types.
Result: Outperforms vanilla MVSplat and a strong two-stage baseline (IDF + MVSplat) in PSNR/SSIM and LPIPS across noise types and levels on the noisy RE10K benchmark.
Conclusion: DenoiseSplat effectively handles noisy multi-view inputs for 3D scene reconstruction and novel-view synthesis, demonstrating robustness to various noise types without requiring 3D ground truth.
Abstract: 3D scene reconstruction and novel-view synthesis are fundamental for VR, robotics, and content creation. However, most NeRF and 3D Gaussian Splatting pipelines assume clean inputs and degrade under real noise and artifacts. We therefore propose DenoiseSplat, a feed-forward 3D Gaussian splatting method for noisy multi-view images. We build a large-scale, scene-consistent noisy–clean benchmark on RE10K by injecting Gaussian, Poisson, speckle, and salt-and-pepper noise with controlled intensities. With a lightweight MVSplat-style feed-forward backbone, we train end-to-end using only clean 2D renderings as supervision and no 3D ground truth. On noisy RE10K, DenoiseSplat outperforms vanilla MVSplat and a strong two-stage baseline (IDF + MVSplat) in PSNR/SSIM and LPIPS across noise types and levels.
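The benchmark construction lends itself to a sketch: the abstract names four injected noise families (Gaussian, Poisson, speckle, salt-and-pepper). Below is a minimal injector with a single `intensity` knob; the knob and its mapping to each noise type are assumptions for illustration, not the paper's controlled-intensity schedule.

```python
import numpy as np

def add_noise(img, kind, rng, intensity=0.1):
    """Inject one of the four benchmark noise types into a float image in [0, 1]."""
    if kind == "gaussian":
        out = img + rng.normal(0.0, intensity, img.shape)
    elif kind == "poisson":
        scale = 1.0 / max(intensity, 1e-6)        # fewer counts = more noise
        out = rng.poisson(img * scale) / scale
    elif kind == "speckle":                        # multiplicative noise
        out = img * (1.0 + rng.normal(0.0, intensity, img.shape))
    elif kind == "salt_and_pepper":
        out = img.copy()
        m = rng.random(img.shape)
        out[m < intensity / 2] = 0.0               # pepper
        out[m > 1 - intensity / 2] = 1.0           # salt
    else:
        raise ValueError(kind)
    return np.clip(out, 0.0, 1.0)

rng = np.random.default_rng(0)
clean = rng.random((4, 4))
noisy = {k: add_noise(clean, k, rng) for k in
         ["gaussian", "poisson", "speckle", "salt_and_pepper"]}
```

Pairing each noisy render with its clean counterpart, as done here per noise type, is what makes a scene-consistent noisy-clean benchmark possible without 3D ground truth.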
[156] IntroSVG: Learning from Rendering Feedback for Text-to-SVG Generation via an Introspective Generator-Critic Framework
Feiyu Wang, Jiayuan Yang, Zhiyuan Zhao, Da Zhang, Bingyu Li, Peng Liu, Junyu Gao
Main category: cs.CV
TL;DR: IntroSVG: A framework for text-to-SVG generation using a unified VLM that acts as both generator and critic in a closed-loop “generate-review-refine” cycle with visual feedback.
Details
Motivation: Existing text-to-SVG generation methods are limited because autoregressive training doesn't incorporate visual perception of the final rendered image, constraining generation quality. There's a need to integrate visual feedback into the generation process.
Method: Proposes Introspective SVG Generation Framework (IntroSVG) with a unified VLM operating in closed loop as both generator and critic. Uses SFT to learn SVG drafting and feedback on rendered outputs, converts failures into error-correction training data, then uses DPO with teacher VLM for policy alignment. Inference uses iterative “generate-review-refine” cycle.
Result: Achieves state-of-the-art performance across key metrics, generating SVGs with more complex structures, stronger semantic alignment, and greater editability. Demonstrates effectiveness of incorporating visual feedback.
Conclusion: The closed-loop framework with visual feedback significantly improves SVG generation quality, enabling more complex, semantically aligned, and editable outputs through iterative refinement.
Abstract: Scalable Vector Graphics (SVG) are central to digital design due to their inherent scalability and editability. Despite significant advancements in content generation enabled by Visual Language Models (VLMs), existing text-to-SVG generation methods are limited by a core challenge: the autoregressive training process does not incorporate visual perception of the final rendered image, which fundamentally constrains generation quality. To address this limitation, we propose an Introspective SVG Generation Framework (IntroSVG). At its core, the framework instantiates a unified VLM that operates in a closed loop, assuming dual roles of both generator and critic. Specifically, through Supervised Fine-Tuning (SFT), the model learns to draft SVGs and to provide feedback on their rendered outputs; moreover, we systematically convert early-stage failures into high-quality error-correction training data, thereby enhancing model robustness. Subsequently, we leverage a high-capacity teacher VLM to construct a preference dataset and further align the generator’s policy through Direct Preference Optimization (DPO). During inference, the optimized generator and critic operate collaboratively in an iterative “generate-review-refine” cycle, starting from imperfect intermediate drafts to autonomously improve output quality. Experimental results demonstrate that our method achieves state-of-the-art performance across several key evaluation metrics, generating SVGs with more complex structures, stronger semantic alignment, and greater editability. These results corroborate the effectiveness of incorporating explicit visual feedback into the generation loop.
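The inference-time "generate-review-refine" cycle can be sketched as a short control loop. The `generate`, `review`, and `refine` callables below are hypothetical stand-ins for calls to the unified VLM; the paper's actual prompts and interfaces differ.

```python
def generate_review_refine(prompt, generate, review, refine, max_rounds=3):
    """Illustrative closed loop: draft an SVG, critique its rendering, refine.

    `review` returns (ok, feedback) after inspecting the rendered draft;
    the loop stops early once the critic accepts the draft.
    """
    svg = generate(prompt)
    for _ in range(max_rounds):
        ok, feedback = review(prompt, svg)
        if ok:
            break
        svg = refine(prompt, svg, feedback)
    return svg

# toy stand-ins: the "critic" demands a closing tag the first draft lacks
drafts = ['<svg><rect/>', '<svg><rect/></svg>']
gen = lambda p: drafts[0]
rev = lambda p, s: (s.endswith('</svg>'), 'missing </svg>')
ref = lambda p, s, f: drafts[1]
result = generate_review_refine('a red square', gen, rev, ref)
```

The key property, captured even in this toy, is that the loop starts from an imperfect intermediate draft and improves it autonomously using the critic's feedback.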
[157] CLoE: Expert Consistency Learning for Missing Modality Segmentation
Xinyu Tong, Meihua Zhou, Bowu Fan, Haitao Li
Main category: cs.CV
TL;DR: CLoE is a consistency-driven framework for robust multimodal medical image segmentation that handles missing modalities by enforcing expert agreement and reliability-aware fusion.
Details
Motivation: Multimodal medical image segmentation suffers from missing modalities at inference, causing expert disagreement and unstable fusion, especially for small foreground structures. Existing methods struggle to maintain performance when modalities are missing.
Method: Proposes Consistency Learning of Experts (CLoE) with dual-branch Expert Consistency Learning: Modality Expert Consistency for global agreement among expert predictions, and Region Expert Consistency for agreement on clinically critical foreground regions. Also uses a lightweight gating network to map consistency scores to modality reliability weights for feature recalibration before fusion.
Result: CLoE outperforms state-of-the-art methods on BraTS 2020 and MSD Prostate datasets for incomplete multimodal segmentation, shows strong cross-dataset generalization, and improves robustness on clinically critical structures.
Conclusion: CLoE effectively addresses missing modality challenges in medical image segmentation through consistency-driven expert learning and reliability-aware fusion, demonstrating superior performance and robustness.
Abstract: Multimodal medical image segmentation often faces missing modalities at inference, which induces disagreement among modality experts and makes fusion unstable, particularly on small foreground structures. We propose Consistency Learning of Experts (CLoE), a consistency-driven framework for missing-modality segmentation that preserves strong performance when all modalities are available. CLoE formulates robustness as decision-level expert consistency control and introduces a dual-branch Expert Consistency Learning objective. Modality Expert Consistency enforces global agreement among expert predictions to reduce case-wise drift under partial inputs, while Region Expert Consistency emphasizes agreement on clinically critical foreground regions to avoid background-dominated regularization. We further map consistency scores to modality reliability weights using a lightweight gating network, enabling reliability-aware feature recalibration before fusion. Extensive experiments on BraTS 2020 and MSD Prostate demonstrate that CLoE outperforms state-of-the-art methods in incomplete multimodal segmentation, while exhibiting strong cross-dataset generalization and improving robustness on clinically critical structures.
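The reliability-aware recalibration step can be illustrated with a toy: per-modality consistency scores are mapped to weights that rescale features before fusion. The softmax mapping below is a stand-in for the paper's learned lightweight gating network; shapes and the temperature are assumptions.

```python
import numpy as np

def reliability_weighted_fusion(features, consistency, temperature=1.0):
    """Toy reliability-aware recalibration before fusion.

    `features`: dict modality -> (D,) feature vector;
    `consistency`: dict of scalar expert-consistency scores
    (higher = more reliable). A softmax over scores yields per-modality
    weights used to rescale features before summation.
    """
    names = sorted(features)
    s = np.array([consistency[n] for n in names]) / temperature
    w = np.exp(s - s.max())
    w /= w.sum()
    fused = sum(w_i * features[n] for w_i, n in zip(w, names))
    return fused, dict(zip(names, w))

feats = {"t1": np.ones(4), "t2": 2 * np.ones(4)}
fused, weights = reliability_weighted_fusion(feats, {"t1": 0.9, "t2": 0.1})
```

A modality whose expert disagrees with the others (low consistency) is automatically down-weighted, which is the intuition behind stable fusion under partial inputs.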
[158] SpaceSense-Bench: A Large-Scale Multi-Modal Benchmark for Spacecraft Perception and Pose Estimation
Aodi Wu, Jianhong Zuo, Zeyuan Zhao, Xubo Luo, Ruisuo Wang, Xue Wan
Main category: cs.CV
TL;DR: SpaceSense-Bench: A large-scale multi-modal benchmark for spacecraft perception with 136 satellite models, providing RGB images, depth maps, LiDAR point clouds, part-level semantic labels, and 6-DoF pose ground truth for advancing autonomous space operations.
Details
Motivation: Autonomous space operations require robust part-level semantic understanding and precise relative navigation of spacecraft, but collecting real orbital data is impractical due to cost and access constraints. Existing synthetic datasets have limited diversity, single-modality sensing, and incomplete annotations.
Method: Created a large-scale multi-modal benchmark using high-fidelity space simulation in Unreal Engine 5 with an automated pipeline covering data acquisition, multi-stage quality control, and format conversion. Includes 136 satellite models with synchronized RGB images, depth maps, LiDAR point clouds, and comprehensive ground truth annotations.
Result: Benchmarked five representative tasks (object detection, 2D semantic segmentation, RGB-LiDAR fusion-based 3D point cloud segmentation, monocular depth estimation, and orientation estimation). Found that perceiving small-scale components and generalizing to unseen spacecraft remain critical bottlenecks, while scaling training data yields substantial performance gains.
Conclusion: SpaceSense-Bench provides a valuable large-scale, diverse dataset for space perception research, highlighting the importance of multi-modal data and diverse training samples for improving generalization to novel spacecraft targets in autonomous space operations.
Abstract: Autonomous space operations such as on-orbit servicing and active debris removal demand robust part-level semantic understanding and precise relative navigation of target spacecraft, yet collecting large-scale real data in orbit remains impractical due to cost and access constraints. Existing synthetic datasets, moreover, suffer from limited target diversity, single-modality sensing, and incomplete ground-truth annotations. We present SpaceSense-Bench, a large-scale multi-modal benchmark for spacecraft perception encompassing 136 satellite models with approximately 70 GB of data. Each frame provides time-synchronized 1024×1024 RGB images, millimeter-precision depth maps, and 256-beam LiDAR point clouds, together with dense 7-class part-level semantic labels at both the pixel and point level as well as accurate 6-DoF pose ground truth. The dataset is generated through a high-fidelity space simulation built in Unreal Engine 5 and a fully automated pipeline covering data acquisition, multi-stage quality control, and conversion to mainstream formats. We benchmark five representative tasks (object detection, 2D semantic segmentation, RGB–LiDAR fusion-based 3D point cloud segmentation, monocular depth estimation, and orientation estimation) and identify two key findings: (i) perceiving small-scale components (e.g., thrusters and omni-antennas) and generalizing to entirely unseen spacecraft in a zero-shot setting remain critical bottlenecks for current methods, and (ii) scaling up the number of training satellites yields substantial performance gains on novel targets, underscoring the value of large-scale, diverse datasets for space perception research. The dataset, code, and toolkit are publicly available at https://github.com/wuaodi/SpaceSense-Bench.
[159] OddGridBench: Exposing the Lack of Fine-Grained Visual Discrepancy Sensitivity in Multimodal Large Language Models
Tengjin Weng, Wenhao Jiang, Jingyi Wang, Ming Li, Lin Ma, Zhong Ming
Main category: cs.CV
TL;DR: OddGridBench: A benchmark for evaluating visual discrepancy sensitivity in MLLMs, revealing poor performance compared to humans, with OddGrid-GRPO RL framework to improve fine-grained visual discrimination.
Details
Motivation: MLLMs have shown strong performance on vision-language tasks but their low-level visual perception, especially fine-grained visual discrepancy detection, remains underexplored and lacks systematic analysis.
Method: Created OddGridBench with 1,400+ grid-based images where one element differs in visual attributes (color, size, rotation, position). Proposed OddGrid-GRPO reinforcement learning framework with curriculum learning and distance-aware reward to progressively train models on harder samples with spatial proximity constraints.
Result: All evaluated MLLMs (Qwen3-VL, InternVL3.5, Gemini-2.5-Pro, GPT-5) performed far below human levels in visual discrepancy detection. OddGrid-GRPO significantly enhanced models’ fine-grained visual discrimination ability.
Conclusion: OddGridBench and OddGrid-GRPO provide groundwork for advancing perceptual grounding and visual discrepancy sensitivity in multimodal intelligence, addressing a critical gap in MLLM capabilities.
Abstract: Multimodal large language models (MLLMs) have achieved remarkable performance across a wide range of vision language tasks. However, their ability in low-level visual perception, particularly in detecting fine-grained visual discrepancies, remains underexplored and lacks systematic analysis. In this work, we introduce OddGridBench, a controllable benchmark for evaluating the visual discrepancy sensitivity of MLLMs. OddGridBench comprises over 1,400 grid-based images, where a single element differs from all others by one or multiple visual attributes such as color, size, rotation, or position. Experiments reveal that all evaluated MLLMs, including open-source families such as Qwen3-VL and InternVL3.5, and proprietary systems like Gemini-2.5-Pro and GPT-5, perform far below human levels in visual discrepancy detection. We further propose OddGrid-GRPO, a reinforcement learning framework that integrates curriculum learning and distance-aware reward. By progressively controlling the difficulty of training samples and incorporating spatial proximity constraints into the reward design, OddGrid-GRPO significantly enhances the model’s fine-grained visual discrimination ability. We hope OddGridBench and OddGrid-GRPO will lay the groundwork for advancing perceptual grounding and visual discrepancy sensitivity in multimodal intelligence. Code and dataset are available at https://wwwtttjjj.github.io/OddGridBench/.
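The distance-aware reward idea admits a compact sketch: an exact odd-cell hit scores full reward, while near misses get partial credit that decays with the spatial distance to the true cell. The Chebyshev distance and the `alpha` scale below are assumptions for illustration, not the paper's exact reward design.

```python
def distance_aware_reward(pred, target, grid_size, alpha=0.5):
    """Illustrative distance-aware reward for odd-one-out grids.

    `pred`/`target` are (row, col) cells in a grid_size x grid_size grid.
    Exact hits score 1.0; misses earn alpha * (1 - d / d_max), where d is
    the Chebyshev distance between predicted and true cells.
    """
    if pred == target:
        return 1.0
    d = max(abs(pred[0] - target[0]), abs(pred[1] - target[1]))
    max_d = grid_size - 1
    return alpha * (1.0 - d / max_d)

r_hit = distance_aware_reward((2, 3), (2, 3), grid_size=5)
r_near = distance_aware_reward((2, 2), (2, 3), grid_size=5)
r_far = distance_aware_reward((0, 0), (4, 4), grid_size=5)
```

Such a shaped reward gives the GRPO policy a gradient signal even when it localizes the discrepancy approximately, which a binary hit/miss reward would not.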
[160] Beyond Scaling: Assessing Strategic Reasoning and Rapid Decision-Making Capability of LLMs in Zero-sum Environments
Yang Li, Xing Chen, Yutao Liu, Gege Qi, Yanxian BI, Zizhe Wang, Yunjian Zhang, Yao Zhu
Main category: cs.CV
TL;DR: STAR Benchmark evaluates LLMs as interactive agents in competitive 1v1 environments with both turn-based and real-time settings, revealing a strategy-execution gap where reasoning models excel in turn-based but lag in real-time due to latency.
Details
Motivation: Current LLM evaluations focus on static reasoning, overlooking interactive decision-making in adversarial, time-sensitive environments. There's a need to assess LLMs as agents that must adapt to opponents, handle temporal constraints, and execute under pressure.
Method: Introduces STAR Benchmark: a multi-agent evaluation framework with 1v1 zero-sum competitive interactions. Features modular architecture with standardized API and execution engine, supporting both turn-based and real-time settings. Includes Strategic Evaluation Suite to assess not just win-loss but also strategic behavior quality like execution efficiency and outcome stability.
Result: Extensive evaluations show a strategy-execution gap: reasoning-intensive models dominate turn-based settings but suffer in real-time scenarios due to inference latency, while faster instruction-tuned models perform better in time-sensitive environments.
Conclusion: Strategic intelligence in interactive environments requires both reasoning depth and timely action execution. STAR provides a principled benchmark for studying this trade-off in competitive, dynamic settings.
Abstract: Large Language Models (LLMs) have achieved strong performance on static reasoning benchmarks, yet their effectiveness as interactive agents operating in adversarial, time-sensitive environments remains poorly understood. Existing evaluations largely treat reasoning as a single-shot capability, overlooking the challenges of opponent-aware decision-making, temporal constraints, and execution under pressure. This paper introduces Strategic Tactical Agent Reasoning (STAR) Benchmark, a multi-agent evaluation framework that assesses LLMs through 1v1 zero-sum competitive interactions, framing reasoning as an iterative, adaptive decision-making process. STAR supports both turn-based and real-time settings, enabling controlled analysis of long-horizon strategic planning and fast-paced tactical execution within a unified environment. Built on a modular architecture with a standardized API and fully implemented execution engine, STAR facilitates reproducible evaluation and flexible task customization. To move beyond binary win-loss outcomes, we introduce a Strategic Evaluation Suite that assesses not only competitive success but also the quality of strategic behavior, such as execution efficiency and outcome stability. Extensive pairwise evaluations reveal a pronounced strategy-execution gap: while reasoning-intensive models dominate turn-based settings, their inference latency often leads to inferior performance in real-time scenarios, where faster instruction-tuned models prevail. These results show that strategic intelligence in interactive environments depends not only on reasoning depth, but also on the ability to translate plans into timely actions, positioning STAR as a principled benchmark for studying this trade-off in competitive, dynamic settings.
[161] Predictive Spectral Calibration for Source-Free Test-Time Regression
Nguyen Viet Tuan Kiet, Huynh Thanh Trung, Pham Huy Hieu
Main category: cs.CV
TL;DR: PSC extends subspace alignment for test-time adaptation in image regression through predictive spectral calibration, jointly aligning target features within source predictive support and calibrating residual spectral slack in orthogonal complement.
Details
Motivation: Test-time adaptation for image regression has received less attention than classification, with existing classification methods not directly transferable to continuous regression targets. Recent work shows subspace alignment can be effective, but there's room for improvement.
Method: Proposes Predictive Spectral Calibration (PSC), a source-free framework that extends subspace alignment to block spectral matching. Instead of relying only on fixed support subspace, PSC jointly aligns target features within source predictive support and calibrates residual spectral slack in the orthogonal complement.
Result: Experiments on multiple image regression benchmarks show consistent improvements over strong baselines, with particularly clear gains under severe distribution shifts.
Conclusion: PSC provides an effective, model-agnostic approach for test-time adaptation in image regression that remains simple to implement and compatible with off-the-shelf pretrained regressors.
Abstract: Test-time adaptation (TTA) for image regression has received far less attention than its classification counterpart. Methods designed for classification often depend on classification-specific objectives and decision boundaries, making them difficult to transfer directly to continuous regression targets. Recent progress revisits regression TTA through subspace alignment, showing that simple source-guided alignment can be both practical and effective. Building on this line of work, we propose Predictive Spectral Calibration (PSC), a source-free framework that extends subspace alignment to block spectral matching. Instead of relying on a fixed support subspace alone, PSC jointly aligns target features within the source predictive support and calibrates residual spectral slack in the orthogonal complement. PSC remains simple to implement, model-agnostic, and compatible with off-the-shelf pretrained regressors. Experiments on multiple image regression benchmarks show consistent improvements over strong baselines, with particularly clear gains under severe distribution shifts.
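The support-plus-complement decomposition can be sketched numerically: the top-k right singular vectors of the source features span a "predictive support" subspace, target features are projected onto it, and the residual in the orthogonal complement is shrunk. The fixed `slack` shrinkage below is a crude stand-in for PSC's calibrated spectral slack; it is illustrative only.

```python
import numpy as np

def spectral_calibrate(src, tgt, k, slack=0.5):
    """Toy sketch of subspace alignment with residual-slack shrinkage.

    `src`, `tgt`: (N, D) feature matrices. Target features are decomposed
    into their component inside the top-k source subspace (kept) and the
    orthogonal residual (scaled by `slack`).
    """
    _, _, vt = np.linalg.svd(src - src.mean(0), full_matrices=False)
    basis = vt[:k].T                 # (D, k) orthonormal columns
    proj = basis @ basis.T           # projector onto predictive support
    tgt_c = tgt - src.mean(0)
    support = tgt_c @ proj
    residual = tgt_c - support       # orthogonal-complement component
    return support + slack * residual + src.mean(0)

rng = np.random.default_rng(0)
src = rng.standard_normal((32, 8))
tgt = rng.standard_normal((16, 8)) + 3.0   # shifted target domain
out = spectral_calibrate(src, tgt, k=4)
```

With `slack=1.0` the transform is the identity, so the knob interpolates between pure subspace projection and leaving target features untouched.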
[162] Evidential Perfusion Physics-Informed Neural Networks with Residual Uncertainty Quantification
Junhyeok Lee, Minseo Choi, Han Jang, Young Hun Jeon, Heeseong Eum, Joon Jang, Chul-Ho Sohn, Kyu Sung Choi
Main category: cs.CV
TL;DR: EPPINN integrates evidential deep learning with physics-informed neural networks for uncertainty-aware perfusion parameter estimation in CT perfusion imaging, improving accuracy and reliability for stroke assessment.
Details
Motivation: Existing physics-informed neural network approaches for CT perfusion imaging are deterministic and lack uncertainty quantification, limiting reliability assessment for acute ischemic stroke diagnosis.
Method: Proposes Evidential Perfusion Physics-Informed Neural Networks (EPPINN) that model arterial input, tissue concentration, and perfusion parameters using coordinate-based networks with Normal-Inverse-Gamma distribution over physics residuals for voxel-wise uncertainty quantification without Bayesian sampling.
Result: EPPINN achieves lower normalized mean absolute error than classical deconvolution and PINN baselines, particularly under sparse temporal sampling and low SNR conditions, with conservative uncertainty estimates and highest infarct-core detection sensitivity on clinical data.
Conclusion: Evidential physics-informed learning improves both accuracy and reliability of CT perfusion analysis for time-critical stroke assessment by providing uncertainty-aware parameter estimation.
Abstract: Physics-informed neural networks (PINNs) have shown promise in addressing the ill-posed deconvolution problem in computed tomography perfusion (CTP) imaging for acute ischemic stroke assessment. However, existing PINN-based approaches remain deterministic and do not quantify uncertainty associated with violations of physics constraints, limiting reliability assessment. We propose Evidential Perfusion Physics-Informed Neural Networks (EPPINN), a framework that integrates evidential deep learning with physics-informed modeling to enable uncertainty-aware perfusion parameter estimation. EPPINN models arterial input, tissue concentration, and perfusion parameters using coordinate-based networks, and places a Normal–Inverse–Gamma distribution over the physics residual to characterize voxel-wise aleatoric and epistemic uncertainty in physics consistency without requiring Bayesian sampling or ensemble inference. The framework further incorporates physiologically constrained parameterization and stabilization strategies to promote robust per-case optimization. We evaluate EPPINN on digital phantom data, the ISLES 2018 benchmark, and a clinical cohort. On the evaluated datasets, EPPINN achieves lower normalized mean absolute error than classical deconvolution and PINN baselines, particularly under sparse temporal sampling and low signal-to-noise conditions, while providing conservative uncertainty estimates with high empirical coverage. On clinical data, EPPINN attains the highest voxel-level and case-level infarct-core detection sensitivity. These results suggest that evidential physics-informed learning can improve both accuracy and reliability of CTP analysis for time-critical stroke assessment.
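For readers unfamiliar with evidential regression, the Normal-Inverse-Gamma machinery is small enough to write out. Below is the standard deep-evidential-regression NLL and the closed-form aleatoric/epistemic variances; EPPINN places such a distribution over the physics residual rather than over the target directly, and its full loss includes additional terms not shown here.

```python
import math

def nig_nll(y, gamma, nu, alpha, beta):
    """Negative log-likelihood of a Normal-Inverse-Gamma evidential head
    (standard deep-evidential-regression form). Requires nu > 0,
    alpha > 1, beta > 0."""
    omega = 2.0 * beta * (1.0 + nu)
    return (0.5 * math.log(math.pi / nu)
            - alpha * math.log(omega)
            + (alpha + 0.5) * math.log(nu * (y - gamma) ** 2 + omega)
            + math.lgamma(alpha) - math.lgamma(alpha + 0.5))

def nig_uncertainties(nu, alpha, beta):
    """Closed-form aleatoric and epistemic variances of the NIG posterior:
    E[sigma^2] = beta / (alpha - 1), Var[mu] = beta / (nu * (alpha - 1))."""
    aleatoric = beta / (alpha - 1.0)
    epistemic = beta / (nu * (alpha - 1.0))
    return aleatoric, epistemic

nll_close = nig_nll(0.0, 0.0, nu=1.0, alpha=2.0, beta=1.0)
nll_far = nig_nll(5.0, 0.0, nu=1.0, alpha=2.0, beta=1.0)
```

The closed forms are what let EPPINN report voxel-wise uncertainty without Bayesian sampling or ensembles: a single forward pass yields (gamma, nu, alpha, beta) and both variances follow analytically.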
[163] M3GCLR: Multi-View Mini-Max Infinite Skeleton-Data Game Contrastive Learning For Skeleton-Based Action Recognition
Yanshan Li, Ke Ma, Miaomiao Wei, Linhui Dai
Main category: cs.CV
TL;DR: M3GCLR is a game-theoretic contrastive learning framework for skeleton-based action recognition that addresses limitations in view discrepancy modeling, adversarial mechanisms, and augmentation control through multi-view mini-max optimization.
Details
Motivation: Existing self-supervised skeleton-based action recognition methods have three major limitations: insufficient modeling of view discrepancies, lack of effective adversarial mechanisms, and uncontrollable augmentation perturbations. These issues hinder the learning of rich action-discriminative information from unlabeled skeleton data.
Method: Proposes M3GCLR with four key components: 1) Infinite Skeleton-data Game (ISG) model with equilibrium theorem for mini-max optimization based on multi-view mutual information, 2) Multi-view rotation augmentation to generate normal-extreme data pairs with temporally averaged input as neutral anchor for structural alignment, 3) Strongly adversarial mini-max skeleton-data game to mine action-discriminative information, 4) Dual-loss equilibrium optimizer to maximize action-relevant information while minimizing encoding redundancy.
Result: Achieves state-of-the-art performance: 82.1% (X-Sub) and 85.8% (X-View) on NTU RGB+D 60; 72.3% (X-Sub) and 75.0% (X-Set) on NTU RGB+D 120; 89.1% on PKU-MMD Part I and 45.2% on Part II in three-stream configurations. Ablation studies confirm effectiveness of each component.
Conclusion: M3GCLR successfully addresses key limitations in self-supervised skeleton-based action recognition through a game-theoretic contrastive framework, achieving superior performance by explicitly modeling view discrepancies, introducing adversarial mechanisms, and controlling augmentation perturbations.
Abstract: In recent years, contrastive learning has drawn significant attention as an effective approach to reducing reliance on labeled data. However, existing methods for self-supervised skeleton-based action recognition still face three major limitations: insufficient modeling of view discrepancies, lack of effective adversarial mechanisms, and uncontrollable augmentation perturbations. To tackle these issues, we propose the Multi-view Mini-Max infinite skeleton-data Game Contrastive Learning for skeleton-based action Recognition (M3GCLR), a game-theoretic contrastive framework. First, we establish the Infinite Skeleton-data Game (ISG) model and the ISG equilibrium theorem, and further provide a rigorous proof, enabling mini-max optimization based on multi-view mutual information. Then, we generate normal-extreme data pairs through multi-view rotation augmentation and adopt temporally averaged input as a neutral anchor to achieve structural alignment, thereby explicitly characterizing perturbation strength. Next, leveraging the proposed equilibrium theorem, we construct a strongly adversarial mini-max skeleton-data game to encourage the model to mine richer action-discriminative information. Finally, we introduce the dual-loss equilibrium optimizer to optimize the game equilibrium, allowing the learning process to maximize action-relevant information while minimizing encoding redundancy, and we prove the equivalence between the proposed optimizer and the ISG model. Extensive experiments show that M3GCLR achieves three-stream accuracies of 82.1% and 85.8% on NTU RGB+D 60 (X-Sub, X-View) and 72.3% and 75.0% on NTU RGB+D 120 (X-Sub, X-Set). On PKU-MMD Parts I and II, it attains 89.1% and 45.2% in the three-stream setting, respectively, matching or outperforming state-of-the-art performance. Ablation studies confirm the effectiveness of each component.
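The augmentation side of the framework is easy to visualize: a normal-extreme pair is two rotated views of the same skeleton sequence, and the neutral anchor is the temporally averaged input. The NumPy sketch below is one illustrative reading of that construction; the rotation angles, the y-axis choice, and the joint count are assumptions, not the paper's parameters.

```python
import numpy as np

def rotate_skeleton(seq, angle_deg):
    """Rotate a skeleton sequence about the vertical (y) axis.

    `seq`: (T, J, 3) joint coordinates. Small angles give the 'normal'
    view and large angles the 'extreme' view of a normal-extreme pair.
    """
    a = np.deg2rad(angle_deg)
    rot = np.array([[np.cos(a), 0, np.sin(a)],
                    [0,         1, 0        ],
                    [-np.sin(a), 0, np.cos(a)]])
    return seq @ rot.T

def make_views(seq, normal_deg=10, extreme_deg=80):
    """Build a (normal, extreme, anchor) triple: two rotated views plus a
    temporally averaged neutral anchor broadcast back to sequence length."""
    normal = rotate_skeleton(seq, normal_deg)
    extreme = rotate_skeleton(seq, extreme_deg)
    anchor = np.broadcast_to(seq.mean(axis=0, keepdims=True), seq.shape)
    return normal, extreme, anchor

rng = np.random.default_rng(0)
seq = rng.standard_normal((16, 25, 3))   # 16 frames, 25 joints
normal, extreme, anchor = make_views(seq)
```

Because the rotation is orthogonal it preserves joint norms, so the perturbation strength is controlled entirely by the angle gap between the two views, which is the sense in which the augmentation is "controllable".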
[164] MIL-PF: Multiple Instance Learning on Precomputed Features for Mammography Classification
Nikola Jovišić, Milica Škipina, Nicola Dall’Asen, Dubravko Ćulibrk
Main category: cs.CV
TL;DR: MIL-PF: A scalable framework using frozen foundation encoders with lightweight multiple instance learning for mammography classification, enabling efficient adaptation to high-resolution medical imaging without retraining large backbones.
Details
Motivation: Adapting modern foundation models to high-resolution medical imaging is challenging due to limited annotations, weak supervision, large image sizes, and computational constraints. Mammography specifically faces issues with large images, multi-view studies, and breast-level labels, making end-to-end fine-tuning impractical.
Method: Proposes MIL-PF: combines frozen foundation encoders with a lightweight multiple instance learning head. Precomputes semantic representations from foundation models, then trains only a small task-specific aggregation module (40k parameters) using attention-based aggregation to model global tissue context and sparse local lesion signals.
Result: Achieves state-of-the-art classification performance at clinical scale while substantially reducing training complexity. Enables efficient experimentation and adaptation without retraining large backbones.
Conclusion: MIL-PF provides a scalable framework for adapting foundation models to high-resolution medical imaging, particularly mammography, by leveraging precomputed features and lightweight MIL heads to overcome computational and annotation limitations.
Abstract: Modern foundation models provide highly expressive visual representations, yet adapting them to high-resolution medical imaging remains challenging due to limited annotations and weak supervision. Mammography, in particular, is characterized by large images, variable multi-view studies and predominantly breast-level labels, making end-to-end fine-tuning computationally expensive and often impractical. We propose Multiple Instance Learning on Precomputed Features (MIL-PF), a scalable framework that combines frozen foundation encoders with a lightweight MIL head for mammography classification. By precomputing the semantic representations and training only a small task-specific aggregation module (40k parameters), the method enables efficient experimentation and adaptation without retraining large backbones. The architecture explicitly models the global tissue context and the sparse local lesion signals through attention-based aggregation. MIL-PF achieves state-of-the-art classification performance at clinical scale while substantially reducing training complexity. We release the code for full reproducibility.
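The attention-based aggregation head is the classic attention-MIL pooling operator, which is small enough to sketch in full. The shapes of the tiny trainable parameters `v` and `w` below are hypothetical; the point is that everything upstream of them (the patch features) is precomputed and frozen.

```python
import numpy as np

def attention_mil_pool(feats, v, w):
    """Attention-MIL pooling over precomputed instance features.

    `feats`: (N, D) frozen-encoder features for one study; `v`: (D, H)
    and `w`: (H,) form the small trainable head. Attention weights
    a_i = softmax_i(w^T tanh(V^T f_i)) combine instances into one
    bag embedding per study.
    """
    h = np.tanh(feats @ v)            # (N, H)
    scores = h @ w                    # (N,)
    a = np.exp(scores - scores.max())
    a /= a.sum()
    return a @ feats, a               # bag embedding (D,), attention (N,)

rng = np.random.default_rng(0)
feats = rng.standard_normal((6, 16))  # e.g. 6 patches/views, 16-dim features
v = rng.standard_normal((16, 8)) * 0.1
w = rng.standard_normal(8) * 0.1
bag, attn = attention_mil_pool(feats, v, w)
```

Because only `v`, `w`, and a classifier on the bag embedding are trained, the head stays in the tens-of-thousands-of-parameters range while the attention weights indicate which instances (patches or views) drove the breast-level prediction.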
[165] SinGeo: Unlock Single Model’s Potential for Robust Cross-View Geo-Localization
Yang Chen, Xieyuanli Chen, Junxiang Li, Jie Tang, Tao Wu
Main category: cs.CV
TL;DR: SinGeo is a framework for robust cross-view geo-localization that uses dual discriminative learning and curriculum learning to handle varying field-of-view conditions with a single model.
Details
Motivation: Existing cross-view geo-localization methods fail when tested on unseen field-of-view (FoV) conditions and orientations, requiring multiple specialized models. Current approaches that randomize FoVs during training don't achieve robustness across diverse conditions.
Method: SinGeo employs a dual discriminative learning architecture that enhances intra-view discriminability within both ground and satellite branches, combined with a curriculum learning strategy to progressively handle more challenging FoV variations.
Result: SinGeo achieves state-of-the-art results on four benchmark datasets, outperforming methods specifically trained for extreme FoVs, and demonstrates cross-architecture transferability. The framework also includes a consistency evaluation method for model stability assessment.
Conclusion: SinGeo enables robust cross-view geo-localization with a single model across diverse field-of-view conditions, providing an explainable framework for advancing robustness in CVGL research.
Abstract: Robust cross-view geo-localization (CVGL) remains challenging despite rapid recent progress. Existing methods still rely on field-of-view (FoV)-specific training paradigms, where models are optimized under a fixed FoV but collapse when tested on unseen FoVs and unknown orientations. This limitation necessitates deploying multiple models to cover diverse variations. Although studies have explored dynamic FoV training by simply randomizing FoVs, they failed to achieve robustness across diverse conditions – implicitly assuming all FoVs are equally difficult. To address this gap, we present SinGeo, a simple yet powerful framework that enables a single model to realize robust cross-view geo-localization without additional modules or explicit transformations. SinGeo employs a dual discriminative learning architecture that enhances intra-view discriminability within both ground and satellite branches, and is the first to introduce a curriculum learning strategy to achieve robust CVGL. Extensive evaluations on four benchmark datasets reveal that SinGeo achieves state-of-the-art (SOTA) results under diverse conditions, and notably outperforms methods specifically trained for extreme FoVs. Beyond superior performance, SinGeo also exhibits cross-architecture transferability. Furthermore, we propose a consistency evaluation method to quantitatively assess model stability under varying views, providing an explainable perspective for understanding and advancing robustness in future CVGL research. Code will be available upon acceptance.
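The core curriculum intuition, that not all FoVs are equally difficult, can be sketched with a toy scheduler: early in training sample near-panoramic (easy) queries, then progressively admit narrower (harder) FoVs. The linear schedule and the 70-360 degree bounds are assumptions for illustration, not SinGeo's actual scheduler.

```python
import random

def curriculum_fov(progress, full_fov=360.0, min_fov=70.0, rng=random):
    """Illustrative FoV curriculum sampler.

    `progress` in [0, 1]: at 0 only the full panorama is sampled; as
    training progresses the lower bound shrinks linearly toward
    `min_fov`, exposing the model to harder, narrower queries.
    """
    lower = full_fov - progress * (full_fov - min_fov)
    return rng.uniform(lower, full_fov)

rng = random.Random(0)
early = [curriculum_fov(0.0, rng=rng) for _ in range(5)]
late = [curriculum_fov(1.0, rng=rng) for _ in range(5)]
```

Contrast this with plain FoV randomization, which would sample uniformly from the full range at every step regardless of training progress.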
[166] EventVGGT: Exploring Cross-Modal Distillation for Consistent Event-based Depth Estimation
Yinrui Ren, Jinjing Zhu, Kanghao Chen, Zhuoxiao Li, Jing Ou, Zidong Cao, Tongyan Hua, Peilun Shi, Yingchun Fu, Wufan Zhao, Hui Xiong
Main category: cs.CV
TL;DR: EventVGGT: A novel framework for event-based monocular depth estimation that treats event streams as coherent video sequences and distills spatio-temporal priors from Vision Foundation Models to achieve temporally consistent depth predictions.
Details
Motivation: Event cameras excel in high-speed motion and extreme lighting but lack dense depth annotations. Existing annotation-free methods process event streams as independent frames, ignoring temporal continuity and failing to leverage rich temporal priors from Vision Foundation Models, resulting in inconsistent depth predictions.
Method: EventVGGT models event streams as video sequences and uses tri-level distillation: 1) Cross-Modal Feature Mixture (CMFM) fuses RGB and event features for auxiliary depth predictions, 2) Spatio-Temporal Feature Distillation (STFD) distills VGGT’s spatio-temporal representations, and 3) Temporal Consistency Distillation (TCD) enforces cross-frame coherence by aligning inter-frame depth changes.
Result: EventVGGT outperforms existing methods, reducing absolute mean depth error at 30m by over 53% on EventScape (from 2.30 to 1.06). It also shows robust zero-shot generalization on unseen DENSE and MVSEC datasets.
Conclusion: Treating event streams as coherent video sequences and distilling spatio-temporal priors from Vision Foundation Models significantly improves event-based depth estimation accuracy and temporal consistency, enabling robust performance in challenging conditions.
Abstract: Event cameras offer superior sensitivity to high-speed motion and extreme lighting, making event-based monocular depth estimation a promising approach for robust 3D perception in challenging conditions. However, progress is severely hindered by the scarcity of dense depth annotations. While recent annotation-free approaches mitigate this by distilling knowledge from Vision Foundation Models (VFMs), a critical limitation persists: they process event streams as independent frames. By neglecting the inherent temporal continuity of event data, these methods fail to leverage the rich temporal priors encoded in VFMs, ultimately yielding temporally inconsistent and less accurate depth predictions. To address this, we introduce EventVGGT, a novel framework that explicitly models the event stream as a coherent video sequence. To the best of our knowledge, we are the first to distill spatio-temporal and multi-view geometric priors from the Visual Geometry Grounded Transformer (VGGT) into the event domain. We achieve this via a comprehensive tri-level distillation strategy: (i) Cross-Modal Feature Mixture (CMFM) bridges the modality gap at the output level by fusing RGB and event features to generate auxiliary depth predictions; (ii) Spatio-Temporal Feature Distillation (STFD) distills VGGT’s powerful spatio-temporal representations at the feature level; and (iii) Temporal Consistency Distillation (TCD) enforces cross-frame coherence at the temporal level by aligning inter-frame depth changes. Extensive experiments demonstrate that EventVGGT consistently outperforms existing methods – reducing the absolute mean depth error at 30m by over 53% on EventScape (from 2.30 to 1.06) – while exhibiting robust zero-shot generalization on the unseen DENSE and MVSEC datasets.
[167] Training-Free Coverless Multi-Image Steganography with Access Control
Minyeol Bae, Si-Hyeon Lee
Main category: cs.CV
TL;DR: MIDAS: A training-free diffusion-based framework for coverless image steganography with user-specific access control using latent-level fusion and random basis mechanisms.
Details
Motivation: Existing coverless image steganography methods lack robust access control, making it difficult to selectively reveal different hidden contents to different authorized users; such access control is critical for scalable and privacy-sensitive information hiding in multi-user settings.
Method: Proposes MIDAS, a training-free diffusion-based CIS framework with a Random Basis mechanism to suppress residual structural information and a Latent Vector Fusion module that reshapes aggregated latents to align with the diffusion process, enabling multi-image hiding with user-specific access control.
Result: MIDAS consistently outperforms existing training-free CIS baselines in access control functionality, stego image quality and diversity, robustness to noise, and resistance to steganalysis.
Conclusion: MIDAS establishes a practical and scalable approach to access-controlled coverless steganography, addressing the critical need for selective content revelation in multi-user settings.
Abstract: Coverless Image Steganography (CIS) hides information without explicitly modifying a cover image, providing strong imperceptibility and inherent robustness to steganalysis. However, existing CIS methods largely lack robust access control, making it difficult to selectively reveal different hidden contents to different authorized users. Such access control is critical for scalable and privacy-sensitive information hiding in multi-user settings. We propose MIDAS, a training-free diffusion-based CIS framework that enables multi-image hiding with user-specific access control via latent-level fusion. MIDAS introduces a Random Basis mechanism to suppress residual structural information and a Latent Vector Fusion module that reshapes aggregated latents to align with the diffusion process. Experimental results demonstrate that MIDAS consistently outperforms existing training-free CIS baselines in access control functionality, stego image quality and diversity, robustness to noise, and resistance to steganalysis, establishing a practical and scalable approach to access-controlled coverless steganography.
[168] ICDAR 2025 Competition on End-to-End Document Image Machine Translation Towards Complex Layouts
Yaping Zhang, Yupu Liang, Zhiyang Zhang, Zhiyuan Chen, Lu Xiang, Yang Zhao, Yu Zhou, Chengqing Zong
Main category: cs.CV
TL;DR: The DIMT 2025 Challenge report describes a competition for Document Image Machine Translation, featuring OCR-free and OCR-based tracks for translating text in document images while preserving layout.
Details
Motivation: Advance research on end-to-end document image translation by bridging OCR and NLP, addressing the challenge of translating text embedded in document images while jointly modeling textual content and page layout.
Method: Organized a competition with two tracks (OCR-free and OCR-based), each with subtasks for small (<1B parameters) and large (>1B parameters) models. Participants submitted unified DIMT systems, optionally incorporating provided OCR transcripts.
Result: Attracted 69 teams with 27 valid submissions total (34 teams/13 submissions for Track 1, 35 teams/14 submissions for Track 2). Large-model approaches established a promising new paradigm for translating complex-layout document images.
Conclusion: The challenge successfully advanced DIMT research, demonstrating that large-model approaches work well for complex-layout document translation and highlighting substantial opportunities for future research in this area.
Abstract: Document Image Machine Translation (DIMT) seeks to translate text embedded in document images from one language to another by jointly modeling both textual content and page layout, bridging optical character recognition (OCR) and natural language processing (NLP). The DIMT 2025 Challenge advances research on end-to-end document image translation, a rapidly evolving area within multimodal document understanding. The competition features two tracks, OCR-free and OCR-based, each with two subtasks for small (less than 1B parameters) and large (greater than 1B parameters) models. Participants submit a single unified DIMT system, with the option to incorporate provided OCR transcripts. Running from December 10, 2024 to April 20, 2025, the competition attracted 69 teams and 27 valid submissions in total. Track 1 had 34 teams and 13 valid submissions, while Track 2 had 35 teams and 14 valid submissions. In this report, we present the challenge motivation, dataset construction, task definitions, evaluation protocol, and a summary of results. Our analysis shows that large-model approaches establish a promising new paradigm for translating complex-layout document images and highlight substantial opportunities for future research.
[169] YOLO-NAS-Bench: A Surrogate Benchmark with Self-Evolving Predictors for YOLO Architecture Search
Zhe Li, Xiaoyu Ding, Jiaxin Zheng, Yongtao Wang
Main category: cs.CV
TL;DR: YOLO-NAS-Bench: First surrogate benchmark for YOLO-style object detection NAS with self-evolving mechanism to improve predictor accuracy for high-performance architectures.
Details
Motivation: NAS for object detection is bottlenecked by high evaluation costs (days per architecture), and existing NAS benchmarks focus on image classification, leaving detection without comparable benchmarks.
Method: Define a search space covering channel width, block depth, and operator types for YOLO detectors; sample 1,000 architectures via multiple strategies; train on COCO-mini; build a LightGBM surrogate predictor with a Self-Evolving Mechanism that progressively aligns predictor training with the high-performance frontier.
Result: Self-evolving mechanism grows architecture pool to 1,500, improves predictor’s R² from 0.770 to 0.815 and Sparse Kendall Tau from 0.694 to 0.752; discovered architectures surpass all official YOLOv8-YOLO12 baselines at comparable latency on COCO-mini.
Conclusion: YOLO-NAS-Bench provides effective surrogate benchmark for YOLO-style detection NAS with strong predictive accuracy and ranking consistency, enabling efficient architecture discovery.
Abstract: Neural Architecture Search (NAS) for object detection is severely bottlenecked by high evaluation cost, as fully training each candidate YOLO architecture on COCO demands days of GPU time. Meanwhile, existing NAS benchmarks largely target image classification, leaving the detection community without a comparable benchmark for NAS evaluation. To address this gap, we introduce YOLO-NAS-Bench, the first surrogate benchmark tailored to YOLO-style detectors. YOLO-NAS-Bench defines a search space spanning channel width, block depth, and operator type across both backbone and neck, covering the core modules of YOLOv8 through YOLO12. We sample 1,000 architectures via random, stratified, and Latin Hypercube strategies, train them on COCO-mini, and build a LightGBM surrogate predictor. To sharpen the predictor in the high-performance regime most relevant to NAS, we propose a Self-Evolving Mechanism that progressively aligns the predictor’s training distribution with the high-performance frontier, by using the predictor itself to discover and evaluate informative architectures in each iteration. This method grows the pool to 1,500 architectures and raises the ensemble predictor’s R² from 0.770 to 0.815 and Sparse Kendall Tau from 0.694 to 0.752, demonstrating strong predictive accuracy and ranking consistency. Using the final predictor as the fitness function for evolutionary search, we discover architectures that surpass all official YOLOv8-YOLO12 baselines at comparable latency on COCO-mini, confirming the predictor’s discriminative power for top-performing detection architectures.
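The Latin Hypercube strategy used to seed the architecture pool is a standard stratified design; a minimal sketch (generic, not the authors' code; each point in the unit cube would then be decoded into concrete width/depth/operator choices):

```python
import random

def latin_hypercube(n_samples, n_dims, seed=0):
    """Draw n_samples points in [0, 1)^n_dims so that every dimension is
    split into n_samples equal strata and each stratum holds exactly one
    point -- better coverage than plain uniform sampling."""
    rng = random.Random(seed)
    samples = [[0.0] * n_dims for _ in range(n_samples)]
    for d in range(n_dims):
        # one jittered point per stratum, shuffled across the samples
        points = [(i + rng.random()) / n_samples for i in range(n_samples)]
        rng.shuffle(points)
        for i in range(n_samples):
            samples[i][d] = points[i]
    return samples
```

The stratification guarantee is what distinguishes this from random sampling: projecting the sample onto any single dimension always yields one point per bin.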
[170] Reviving ConvNeXt for Efficient Convolutional Diffusion Models
Taesung Kwon, Lorenzo Bianchi, Lennart Wittke, Felix Watine, Fabio Carrara, Jong Chul Ye, Romann Weber, Vinicius Azevedo
Main category: cs.CV
TL;DR: FCDM introduces a fully convolutional diffusion model using ConvNeXt backbone, achieving competitive performance with 50% FLOPs and 7x fewer training steps than DiT-XL, reviving ConvNets for efficient generative modeling.
Details
Motivation: Recent diffusion models favor Transformer backbones due to scalability, but overlook the locality bias, parameter efficiency, and hardware friendliness that made ConvNets efficient vision backbones. The authors aim to explore convolutional designs for efficient generative modeling.
Method: Introduces FCDM (fully convolutional diffusion model) with a backbone similar to ConvNeXt, designed specifically for conditional diffusion modeling. Uses convolutional architectures instead of Transformers for better efficiency.
Result: FCDM-XL achieves competitive performance with DiT-XL/2 using only 50% FLOPs, requiring 7x fewer training steps at 256×256 resolution and 7.5x fewer at 512×512. Can be trained on a 4-GPU system, demonstrating exceptional training efficiency.
Conclusion: Modern convolutional designs provide a competitive and highly efficient alternative for scaling diffusion models, reviving ConvNeXt as a simple yet powerful building block for efficient generative modeling.
Abstract: Recent diffusion models increasingly favor Transformer backbones, motivated by the remarkable scalability of fully attentional architectures. Yet the locality bias, parameter efficiency, and hardware friendliness–the attributes that established ConvNets as the efficient vision backbone–have seen limited exploration in modern generative modeling. Here we introduce the fully convolutional diffusion model (FCDM), a model having a backbone similar to ConvNeXt, but designed for conditional diffusion modeling. We find that using only 50% of the FLOPs of DiT-XL/2, FCDM-XL achieves competitive performance with 7$\times$ and 7.5$\times$ fewer training steps at 256$\times$256 and 512$\times$512 resolutions, respectively. Remarkably, FCDM-XL can be trained on a 4-GPU system, highlighting the exceptional training efficiency of our architecture. Our results demonstrate that modern convolutional designs provide a competitive and highly efficient alternative for scaling diffusion models, reviving ConvNeXt as a simple yet powerful building block for efficient generative modeling.
[171] RiO-DETR: DETR for Real-time Oriented Object Detection
Zhangchi Hu, Yifan Zhao, Yansong Peng, Wenzhang Sun, Xiangchen Yin, Jie Chen, Peixi Wu, Hebei Li, Xinghao Wang, Dongsheng Jiang, Xiaoyan Sun
Main category: cs.CV
TL;DR: RiO-DETR is the first real-time oriented object detection transformer that addresses challenges in adapting DETR to oriented bounding boxes while maintaining efficiency.
Details
Motivation: Adapting DETR to oriented bounding boxes faces three key challenges: semantics-dependent orientation, angle periodicity that breaks standard Euclidean refinement, and an enlarged search space slowing convergence. Existing methods lack real-time oriented detection transformers.
Method: 1) Content-Driven Angle Estimation decouples angle from positional queries with Rotation-Rectified Orthogonal Attention. 2) Decoupled Periodic Refinement combines bounded coarse-to-fine updates with a Shortest-Path Periodic Loss. 3) Oriented Dense O2O injects angular diversity into dense supervision to speed convergence.
Result: Extensive experiments on DOTA-1.0, DIOR-R, and FAIR-1M-2.0 demonstrate RiO-DETR establishes a new speed-accuracy trade-off for real-time oriented detection, achieving state-of-the-art performance.
Conclusion: RiO-DETR successfully addresses the challenges of adapting DETR to oriented object detection while maintaining real-time efficiency, setting a new benchmark for speed-accuracy trade-off in this domain.
Abstract: We present RiO-DETR: DETR for Real-time Oriented Object Detection, the first real-time oriented detection transformer to the best of our knowledge. Adapting DETR to oriented bounding boxes (OBBs) poses three challenges: semantics-dependent orientation, angle periodicity that breaks standard Euclidean refinement, and an enlarged search space that slows convergence. RiO-DETR resolves these issues with task-native designs while preserving real-time efficiency. First, we propose Content-Driven Angle Estimation by decoupling angle from positional queries, together with Rotation-Rectified Orthogonal Attention to capture complementary cues for reliable orientation. Second, Decoupled Periodic Refinement combines bounded coarse-to-fine updates with a Shortest-Path Periodic Loss for stable learning across angular seams. Third, Oriented Dense O2O injects angular diversity into dense supervision to speed up angle convergence at no extra cost. Extensive experiments on DOTA-1.0, DIOR-R, and FAIR-1M-2.0 demonstrate RiO-DETR establishes a new speed–accuracy trade-off for real-time oriented detection. Code will be made publicly available.
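The abstract does not spell out the Shortest-Path Periodic Loss, but the standard way to handle angular seams is to penalize the shortest distance around the circle; a minimal sketch under that assumption (function name and default period are illustrative, not the paper's exact formulation):

```python
import math

def periodic_l1(pred, target, period=math.pi):
    """L1 loss on the shortest path between two periodic angles (radians).
    A plain |pred - target| would heavily penalize pred=89 deg vs
    target=-89 deg even though the oriented boxes are nearly identical."""
    diff = (pred - target) % period   # wrap into [0, period)
    return min(diff, period - diff)   # take the shorter way around
```

With this wrapping, gradients near the angular seam point the short way around instead of across the full period.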
[172] PromptDLA: A Domain-aware Prompt Document Layout Analysis Framework with Descriptive Knowledge as a Cue
Zirui Zhang, Yaping Zhang, Lu Xiang, Yang Zhao, Feifei Zhai, Yu Zhou, Chengqing Zong
Main category: cs.CV
TL;DR: PromptDLA introduces a domain-aware prompter for document layout analysis that uses descriptive knowledge as cues to integrate domain priors, improving cross-domain generalization by customizing prompts based on specific data domain attributes.
Details
Motivation: Existing DLA approaches often combine data from various domains to improve generalization, but directly merging datasets leads to suboptimal performance by overlooking different layout structures, labeling styles, document types, and languages across domains.
Method: Proposes PromptDLA with a unique domain-aware prompter that customizes prompts based on specific data domain attributes. These prompts serve as cues that direct the DLA model toward critical features and structures within the data, effectively integrating domain priors into the analysis process.
Result: Extensive experiments show state-of-the-art performance on multiple benchmark datasets including DocLayNet, PubLayNet, M6Doc, and D$^4$LA, demonstrating improved cross-domain generalization capabilities.
Conclusion: PromptDLA effectively leverages descriptive knowledge as cues to integrate domain priors into document layout analysis, enhancing model generalization across varied domains through domain-aware prompting.
Abstract: Document Layout Analysis (DLA) is crucial for document artificial intelligence and has recently received increasing attention, resulting in an influx of large-scale public DLA datasets. Existing work often combines data from various domains in recent public DLA datasets to improve the generalization of DLA. However, directly merging these datasets for training often results in suboptimal model performance, as it overlooks the different layout structures inherent to various domains. These variations include different labeling styles, document types, and languages. This paper introduces PromptDLA, a domain-aware Prompter for Document Layout Analysis that effectively leverages descriptive knowledge as cues to integrate domain priors into DLA. The innovative PromptDLA features a unique domain-aware prompter that customizes prompts based on the specific attributes of the data domain. These prompts then serve as cues that direct the DLA toward critical features and structures within the data, enhancing the model’s ability to generalize across varied domains. Extensive experiments show that our proposal achieves state-of-the-art performance among DocLayNet, PubLayNet, M6Doc, and D$^4$LA. Our code is available at https://github.com/Zirui00/PromptDLA.
[173] CIGPose: Causal Intervention Graph Neural Network for Whole-Body Pose Estimation
Bohao Li, Zhicheng Cao, Huixian Li, Yangming Guo
Main category: cs.CV
TL;DR: CIGPose introduces a causal intervention framework for whole-body pose estimation that addresses spurious correlations from visual context using structural causal modeling and hierarchical graph neural networks.
Details
Motivation: Current whole-body pose estimators often produce anatomically implausible predictions due to learning spurious correlations from visual context, which acts as a confounder in the causal relationship between visual evidence and pose.
Method: Proposes the CIGPose framework with a Causal Intervention Module that identifies confounded keypoint representations via predictive uncertainty, replaces them with context-invariant canonical embeddings, and processes them through a hierarchical graph neural network that reasons at local and global semantic levels.
Result: Achieves state-of-the-art on COCO-WholeBody with 67.0% AP (CIGPose-x), surpassing prior methods without extra training data. With UBody dataset, reaches 67.5% AP, demonstrating superior robustness and data efficiency.
Conclusion: CIGPose effectively addresses confounder bias in pose estimation through causal intervention, leading to more anatomically plausible and robust predictions while setting new benchmarks in whole-body pose estimation.
Abstract: State-of-the-art whole-body pose estimators often lack robustness, producing anatomically implausible predictions in challenging scenes. We posit this failure stems from spurious correlations learned from visual context, a problem we formalize using a Structural Causal Model (SCM). The SCM identifies visual context as a confounder that creates a non-causal backdoor path, corrupting the model’s reasoning. We introduce the Causal Intervention Graph Pose (CIGPose) framework to address this by approximating the true causal effect between visual evidence and pose. The core of CIGPose is a novel Causal Intervention Module: it first identifies confounded keypoint representations via predictive uncertainty and then replaces them with learned, context-invariant canonical embeddings. These deconfounded embeddings are processed by a hierarchical graph neural network that reasons over the human skeleton at both local and global semantic levels to enforce anatomical plausibility. Extensive experiments show CIGPose achieves a new state-of-the-art on COCO-WholeBody. Notably, our CIGPose-x model achieves 67.0% AP, surpassing prior methods that rely on extra training data. With the additional UBody dataset, CIGPose-x is further boosted to 67.5% AP, demonstrating superior robustness and data efficiency. The codes and models are publicly available at https://github.com/53mins/CIGPose.
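As a rough illustration of the Causal Intervention Module's gating idea (the names, scalar uncertainties, and fixed threshold are assumptions for the sketch; the real module operates on learned feature tensors inside the network):

```python
def deconfound(embeddings, uncertainties, canonical, threshold=0.5):
    """Replace keypoint embeddings whose predictive uncertainty exceeds a
    threshold with learned context-invariant canonical embeddings, so
    downstream graph reasoning is not driven by confounded features."""
    return [canonical[k] if u > threshold else e
            for k, (e, u) in enumerate(zip(embeddings, uncertainties))]
```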
[174] MetaDAT: Generalizable Trajectory Prediction via Meta Pre-training and Data-Adaptive Test-Time Updating
Yuning Wang, Pu Zhang, Yuan He, Ke Wang, Jianru Xue
Main category: cs.CV
TL;DR: Meta-learning framework for test-time adaptation in trajectory prediction with data-adaptive online updating mechanism
Details
Motivation: Existing trajectory prediction methods degrade under distribution shifts, and current test-time training approaches lack online learning flexibility and don't adapt to specific test data characteristics.
Method: 1) Meta-learning framework for fast online adaptation via bi-level optimization on simulated test-time tasks during pre-training; 2) Data-adaptive model updating mechanism that dynamically adjusts learning rates and updating frequencies based on online partial derivatives and hard sample selection.
Result: Superior adaptation accuracy across nuScenes, Lyft, and Waymo datasets, surpassing state-of-the-art test-time training methods; robust under suboptimal learning rates and high FPS demands
Conclusion: Proposed meta-learning with data-adaptive updating enables effective online adaptation for trajectory prediction under distribution shifts, offering practical robustness
Abstract: Existing trajectory prediction methods exhibit significant performance degradation under distribution shifts during test time. Although test-time training techniques have been explored to enable adaptation, current approaches rely on an offline pre-trained predictor that lacks online learning flexibility. Moreover, they depend on fixed online model updating rules that do not accommodate the specific characteristics of test data. To address these limitations, we first propose a meta-learning framework to directly optimize the predictor for fast and accurate online adaptation, which performs bi-level optimization on the performance of simulated test-time adaptation tasks during pre-training. Furthermore, at test time, we introduce a data-adaptive model updating mechanism that dynamically adjusts the predefined learning rates and updating frequencies based on online partial derivatives and hard sample selection. This mechanism enables the online learning rate to suit the test data, and focuses on informative hard samples to enhance efficiency. Experiments are conducted on various challenging cross-dataset distribution shift scenarios, including nuScenes, Lyft, and Waymo. Results demonstrate that our method achieves superior adaptation accuracy, surpassing state-of-the-art test-time training methods for trajectory prediction. Additionally, our method excels under suboptimal learning rates and high FPS demands, showcasing its robustness and practicality.
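The bi-level optimization can be illustrated on a one-parameter toy problem (a generic MAML-style sketch, not the paper's predictor; each "task" here is fitting a scalar target with a squared loss):

```python
def meta_step(theta, targets, inner_lr=0.1, outer_lr=0.05):
    """One bi-level (MAML-style) update for a 1-D parameter.
    Inner loop: adapt theta per task with one gradient step on
    (theta - t)^2.  Outer loop: differentiate the post-adaptation loss
    through the inner step and move the initialization."""
    outer_grad = 0.0
    for t in targets:
        adapted = theta - inner_lr * 2.0 * (theta - t)  # inner step
        # d/dtheta of (adapted - t)^2, chain rule through the inner step
        outer_grad += 2.0 * (adapted - t) * (1.0 - 2.0 * inner_lr)
    return theta - outer_lr * outer_grad / len(targets)
```

Repeating this step drives the initialization toward a point from which every simulated task can be fit with few online updates, which is the property the pre-training stage optimizes for.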
[175] Open-World Motion Forecasting
Nicolas Schischka, Nikhil Gosala, B Ravi Kiran, Senthil Yogamani, Abhinav Valada
Main category: cs.CV
TL;DR: Open-world motion forecasting framework that handles evolving object classes and imperfect perception by combining pseudo-labeling, vision-language filtering, and replay sampling to prevent catastrophic forgetting.
Details
Motivation: Existing motion forecasting methods assume a fixed object taxonomy and perfect perception, failing in real-world settings where perception is imperfect and object classes evolve over time.
Method: End-to-end class-incremental framework with pseudo-labeling for known classes, vision-language model filtering of inconsistent predictions, and replay sampling based on query feature variance to preserve learned motion patterns.
Result: Successfully resists catastrophic forgetting on nuScenes and Argoverse 2 datasets, maintains performance on previously learned classes while adapting to novel ones, and enables zero-shot transfer to real-world driving.
Conclusion: Proposes first open-world motion forecasting framework that handles evolving object classes and imperfect perception, enabling continual adaptation of autonomous driving systems.
Abstract: Motion forecasting aims to predict the future trajectories of dynamic agents in the scene, enabling autonomous vehicles to effectively reason about scene evolution. Existing approaches operate under the closed-world regime and assume fixed object taxonomy as well as access to high-quality perception. Therefore, they struggle in real-world settings where perception is imperfect and object taxonomy evolves over time. In this work, we bridge this fundamental gap by introducing open-world motion forecasting, a novel setting in which new object classes are sequentially introduced over time and future object trajectories are estimated directly from camera images. We tackle this setting by proposing the first end-to-end class-incremental motion forecasting framework to mitigate catastrophic forgetting while simultaneously learning to forecast newly introduced classes. When a new class is introduced, our framework employs a pseudo-labeling strategy to first generate motion forecasting pseudo-labels for all known classes which are then processed by a vision-language model to filter inconsistent and over-confident predictions. Parallelly, our approach further mitigates catastrophic forgetting by using a novel replay sampling strategy that leverages query feature variance to sample previous sequences with informative motion patterns. Extensive evaluation on the nuScenes and Argoverse 2 datasets demonstrates that our approach successfully resists catastrophic forgetting and maintains performance on previously learned classes while improving adaptation to novel ones. Further, we demonstrate that our approach supports zero-shot transfer to real-world driving and naturally extends to end-to-end class-incremental planning, enabling continual adaptation of the full autonomous driving system. We provide the code at https://omen.cs.uni-freiburg.de .
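The variance-based replay selection can be sketched as follows (a toy version with scalar query features standing in for per-query feature vectors; names are illustrative):

```python
def select_replay(sequences, k=2):
    """Pick the k stored sequences whose query features vary the most,
    as a proxy for informative motion patterns worth replaying.
    `sequences` maps a sequence id to a list of scalar query features."""
    def variance(xs):
        mean = sum(xs) / len(xs)
        return sum((x - mean) ** 2 for x in xs) / len(xs)
    ranked = sorted(sequences, key=lambda s: variance(sequences[s]), reverse=True)
    return ranked[:k]
```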
[176] GIIM: Graph-based Learning of Inter- and Intra-view Dependencies for Multi-view Medical Image Diagnosis
Tran Bao Sam, Hung Vu, Dao Trung Kien, Tran Dat Dang, Van Ha Tang, Steven Truong
Main category: cs.CV
TL;DR: GIIM: A graph-based approach for medical CADx that models intra-view dependencies and inter-view dynamics while handling missing data, validated across CT, MRI, and mammography.
Details
Motivation: Current multi-view CADx systems fail to capture complex dependencies within single views and dynamic changes across different views, and struggle with incomplete data, reducing diagnostic reliability.
Method: Proposes GIIM, a graph-based framework that simultaneously models intra-view dependencies between abnormalities and inter-view dynamics, with specific techniques to handle missing data.
Result: Extensive evaluations across diverse imaging modalities (CT, MRI, mammography) show GIIM significantly enhances diagnostic accuracy and robustness over existing methods.
Conclusion: GIIM establishes a more effective framework for future CADx systems by better modeling clinical interpretation processes and handling real-world data challenges.
Abstract: Computer-aided diagnosis (CADx) has become vital in medical imaging, but automated systems often struggle to replicate the nuanced process of clinical interpretation. Expert diagnosis requires a comprehensive analysis of how abnormalities relate to each other across various views and time points, but current multi-view CADx methods frequently overlook these complex dependencies. Specifically, they fail to model the crucial relationships within a single view and the dynamic changes lesions exhibit across different views. This limitation, combined with the common challenge of incomplete data, greatly reduces their predictive reliability. To address these gaps, we reframe the diagnostic task as one of relationship modeling and propose GIIM, a novel graph-based approach. Our framework is uniquely designed to simultaneously capture both critical intra-view dependencies between abnormalities and inter-view dynamics. Furthermore, it ensures diagnostic robustness by incorporating specific techniques to effectively handle missing data, a common clinical issue. We demonstrate the generality of this approach through extensive evaluations on diverse imaging modalities, including CT, MRI, and mammography. The results confirm that our GIIM model significantly enhances diagnostic accuracy and robustness over existing methods, establishing a more effective framework for future CADx systems.
[177] A Guideline-Aware AI Agent for Zero-Shot Target Volume Auto-Delineation
Yoon Jo Kim, Wonyoung Cho, Jongmin Lee, Han Joo Chae, Hyunki Park, Sang Hoon Seo, Noh Jae Myung, Kyungmi Yang, Dongryul Oh, Jin Sung Kim
Main category: cs.CV
TL;DR: OncoAgent is a training-free AI framework that converts textual clinical guidelines into 3D radiotherapy target contours, achieving performance comparable to supervised models while offering instant adaptability to guideline updates.
Details
Motivation: Current deep learning models for radiotherapy target delineation require costly retraining when clinical guidelines update, creating a need for more flexible, training-free approaches that can adapt to changing guidelines.
Method: OncoAgent uses a guideline-aware AI agent framework that interprets textual clinical guidelines and generates 3D target contours without requiring training on annotated data, enabling zero-shot adaptation to different guidelines.
Result: Achieved zero-shot Dice scores of 0.842 for clinical target volume and 0.880 for planning target volume in esophageal cancer, comparable to supervised nnU-Net. Physicians preferred OncoAgent in blinded evaluation for guideline compliance and clinical acceptability.
Conclusion: The agent-based paradigm offers scalable, transparent, and interpretable radiotherapy planning with near-instantaneous adaptability to guideline changes, representing a significant advancement over traditional supervised approaches.
Abstract: Delineating the clinical target volume (CTV) in radiotherapy involves complex margins constrained by tumor location and anatomical barriers. While deep learning models automate this process, their rigid reliance on expert-annotated data requires costly retraining whenever clinical guidelines update. To overcome this limitation, we introduce OncoAgent, a novel guideline-aware AI agent framework that seamlessly converts textual clinical guidelines into three-dimensional target contours in a training-free manner. Evaluated on esophageal cancer cases, the agent achieves a zero-shot Dice similarity coefficient of 0.842 for the CTV and 0.880 for the planning target volume, demonstrating performance highly comparable to a fully supervised nnU-Net baseline. Notably, in a blinded clinical evaluation, physicians strongly preferred OncoAgent over the supervised baseline, rating it higher in guideline compliance, modification effort, and clinical acceptability. Furthermore, the framework generalizes zero-shot to alternative esophageal guidelines and other anatomical sites (e.g., prostate) without any retraining. Beyond mere volumetric overlap, our agent-based paradigm offers near-instantaneous adaptability to alternative guidelines, providing a scalable and transparent pathway toward interpretability in radiotherapy treatment planning.
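The Dice similarity coefficients reported above (0.842 for the CTV, 0.880 for the PTV) measure volumetric overlap between predicted and reference masks. A minimal sketch of the standard metric, for readers unfamiliar with it:

```python
import numpy as np

def dice(pred, gt):
    """Dice similarity coefficient between two binary masks:
    2 * |A intersect B| / (|A| + |B|).
    Returns 1.0 when both masks are empty (perfect agreement)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    denom = pred.sum() + gt.sum()
    return 2.0 * inter / denom if denom else 1.0
```

A Dice of 0.842 thus means the predicted and expert volumes share roughly 84% of their combined extent.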
[178] EvoDriveVLA: Evolving Autonomous Driving Vision-Language-Action Model via Collaborative Perception-Planning Distillation
Jiajun Cao, Xiaoan Zhang, Xiaobao Wei, Liyuqiu Huang, Wang Zijian, Hanzhen Zhang, Zhengyu Jia, Wei Mao, Hao Wang, Xianming Liu, Shuchang Zhou Liu, Yang Wang, Shanghang Zhang
Main category: cs.CV
TL;DR: EvoDriveVLA is a collaborative perception-planning distillation framework for autonomous driving that addresses degraded perception after visual encoder unfreezing and accumulated instability in long-term planning through self-anchored visual distillation and oracle-guided trajectory distillation.
Details
Motivation: Vision-Language-Action models for autonomous driving suffer from degraded perception after unfreezing the visual encoder and struggle with accumulated instability in long-term planning, which limits their practical deployment.
Method: Proposes a collaborative perception-planning distillation framework with two key components: 1) Self-anchored visual distillation using a self-anchor teacher to deliver visual anchoring constraints via trajectory-guided key-region awareness, and 2) Oracle-guided trajectory distillation using a future-aware oracle teacher with coarse-to-fine trajectory refinement and Monte Carlo dropout sampling to produce high-quality trajectory candidates.
Result: EvoDriveVLA achieves state-of-the-art performance in open-loop evaluation and significantly enhances performance in closed-loop evaluation for autonomous driving tasks.
Conclusion: The proposed framework effectively addresses perception degradation and planning instability in Vision-Language-Action models for autonomous driving through collaborative distillation, demonstrating superior performance in both open-loop and closed-loop evaluations.
Abstract: Vision-Language-Action models have shown great promise for autonomous driving, yet they suffer from degraded perception after unfreezing the visual encoder and struggle with accumulated instability in long-term planning. To address these challenges, we propose EvoDriveVLA, a novel collaborative perception-planning distillation framework that integrates self-anchored perceptual constraints and oracle-guided trajectory optimization. Specifically, self-anchored visual distillation leverages a self-anchor teacher to deliver visual anchoring constraints, regularizing student representations via trajectory-guided key-region awareness. In parallel, oracle-guided trajectory distillation employs a future-aware oracle teacher with coarse-to-fine trajectory refinement and Monte Carlo dropout sampling to produce high-quality trajectory candidates, thereby selecting the optimal trajectory to guide the student’s prediction. EvoDriveVLA achieves SOTA performance in open-loop evaluation and significantly enhances performance in closed-loop evaluation. Our code is available at: https://github.com/hey-cjj/EvoDriveVLA.
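Monte Carlo dropout sampling, one ingredient of the oracle teacher above, draws several stochastic forward passes and keeps the best-scoring candidate. A toy analogue with a linear "planner" (the shapes, function names, and scoring rule are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def mc_dropout_candidates(weights, x, n_samples=8, p=0.5, seed=0):
    """Toy MC-dropout sampler: evaluate a linear 'planner' under random
    dropout masks to obtain a diverse set of trajectory candidates.
    Uses inverted-dropout scaling (divide kept units by 1-p)."""
    rng = np.random.default_rng(seed)
    cands = []
    for _ in range(n_samples):
        mask = rng.random(weights.shape) > p
        cands.append((weights * mask / (1 - p)) @ x)
    return cands

def select_best(cands, oracle):
    """Pick the candidate closest (L2) to an oracle trajectory."""
    errs = [np.linalg.norm(c - oracle) for c in cands]
    return cands[int(np.argmin(errs))]
```

The distillation student would then be supervised toward the selected candidate rather than a single deterministic rollout.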
[179] TopoOR: A Unified Topological Scene Representation for the Operating Room
Tony Danjun Wang, Ka Young Kim, Tolga Birdal, Nassir Navab, Lennart Bastian
Main category: cs.CV
TL;DR: TopoOR introduces a higher-order topological representation for surgical operating rooms that preserves complex multimodal relationships beyond traditional dyadic scene graphs, using higher-order attention mechanisms for surgical reasoning tasks.
Details
Motivation: Existing surgical scene graph methods suffer from dyadic structural limitations that flatten complex relational geometry and lose multimodal structure, making them inadequate for safety-critical surgical reasoning that requires precise modeling of 3D geometry, audio, and robot kinematics interactions.
Method: Proposes TopoOR paradigm that models multimodal ORs as higher-order topological structures using topological cells to represent pairwise and group relationships. Introduces higher-order attention mechanism that preserves manifold structure and modality-specific features throughout hierarchical relational attention, avoiding joint latent representation flattening.
Result: Extensive experiments show TopoOR outperforms traditional graph and LLM-based baselines across multiple surgical reasoning tasks including sterility breach detection, robot phase prediction, and next-action anticipation.
Conclusion: TopoOR offers a more expressive topological representation that subsumes traditional scene graphs, enabling better modeling of complex multimodal surgical dynamics while preserving the precise structure needed for safety-critical reasoning.
Abstract: Surgical Scene Graphs abstract the complexity of surgical operating rooms (OR) into a structure of entities and their relations, but existing paradigms suffer from strictly dyadic structural limitations. Frameworks that predominantly rely on pairwise message passing or tokenized sequences flatten the manifold geometry inherent to relational structures and lose structure in the process. We introduce TopoOR, a new paradigm that models multimodal operating rooms as a higher-order structure, innately preserving pairwise and group relationships. By lifting interactions between entities into higher-order topological cells, TopoOR natively models complex dynamics and multimodality present in the OR. This topological representation subsumes traditional scene graphs, thereby offering strictly greater expressivity. We also propose a higher-order attention mechanism that explicitly preserves manifold structure and modality-specific features throughout hierarchical relational attention. In this way, we circumvent combining 3D geometry, audio, and robot kinematics into a single joint latent representation, preserving the precise multimodal structure required for safety-critical reasoning, unlike existing methods. Extensive experiments demonstrate that our approach outperforms traditional graph and LLM-based baselines across sterility breach detection, robot phase prediction, and next-action anticipation.
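The "lifting" of pairwise edges and group interactions into higher-order cells can be pictured with a minimal set-based analogue. This is a deliberately simplified stand-in (the paper's topological machinery is far richer), but it shows why the representation subsumes a dyadic scene graph:

```python
def lift_to_cells(pairwise_edges, group_relations):
    """Represent dyadic edges and higher-order group relations uniformly
    as cells (frozensets of entities), so a pair and a group are the same
    kind of object in the representation."""
    cells = {frozenset(e) for e in pairwise_edges}
    cells |= {frozenset(g) for g in group_relations}
    return cells

def scene_graph_edges(cells):
    """The traditional dyadic scene graph is recovered as the 2-cells."""
    return {c for c in cells if len(c) == 2}
```

Because every 2-cell is an ordinary edge, any dyadic graph is a special case of the cell set, while 3-way (or larger) interactions are stored without being flattened into pairs.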
[180] The Patrologia Graeca Corpus: OCR, Annotation, and Open Release of Noisy Nineteenth-Century Polytonic Greek Editions
Chahan Vidal-Gorène, Bastien Kindt
Main category: cs.CV
TL;DR: First large-scale OCR resource for 19th-century Ancient Greek texts from Patrologia Graeca, achieving state-of-the-art accuracy on complex bilingual layouts with degraded polytonic Greek typography.
Details
Motivation: To digitize the remaining undigitized volumes of Patrologia Graeca, which contain complex bilingual (Greek-Latin) layouts and highly degraded polytonic Greek typography that existing OCR systems struggle with.
Method: Developed a dedicated pipeline combining YOLO-based layout detection for identifying text regions and CRNN-based text recognition for character recognition, specifically optimized for polytonic Greek.
Result: Achieved character error rate (CER) of 1.05% and word error rate (WER) of 4.69%, outperforming existing OCR systems for polytonic Greek. Created corpus with ~6 million lemmatized and POS-tagged tokens with full OCR and layout annotations.
Conclusion: Establishes new benchmark for OCR on noisy polytonic Greek, provides valuable philological resource, and offers training material for future models including LLMs.
Abstract: We present the Patrologia Graeca Corpus, the first large-scale open OCR and linguistic resource for nineteenth-century editions of Ancient Greek. The collection covers the remaining undigitized volumes of the Patrologia Graeca (PG), printed in complex bilingual (Greek-Latin) layouts and characterized by highly degraded polytonic Greek typography. Through a dedicated pipeline combining YOLO-based layout detection and CRNN-based text recognition, we achieve a character error rate (CER) of 1.05% and a word error rate (WER) of 4.69%, largely outperforming existing OCR systems for polytonic Greek. The resulting corpus contains around six million lemmatized and part-of-speech tagged tokens, aligned with full OCR and layout annotations. Beyond its philological value, this corpus establishes a new benchmark for OCR on noisy polytonic Greek and provides training material for future models, including LLMs.
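The CER and WER figures above are both edit-distance metrics; CER counts character-level edits against the reference transcription, WER the same over whitespace-split tokens. A compact reference implementation:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists),
    computed with a rolling single-row DP table."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution / match
        prev = cur
    return prev[n]

def cer(ref, hyp):
    """Character error rate: character edits / reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)

def wer(ref, hyp):
    """Word error rate: token edits / number of reference tokens."""
    r = ref.split()
    return edit_distance(r, hyp.split()) / max(len(r), 1)
```

So a CER of 1.05% means roughly one character-level edit per hundred reference characters.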
[181] OmniEarth: A Benchmark for Evaluating Vision-Language Models in Geospatial Tasks
Ronghao Fu, Haoran Liu, Weijie Zhang, Zhiwen Lin, Xiao Yang, Peng Zhang, Bo Yang
Main category: cs.CV
TL;DR: OmniEarth is a comprehensive benchmark for evaluating remote sensing vision-language models across perception, reasoning, and robustness dimensions with 28 tasks covering diverse geospatial contexts and multi-source sensing data.
Details
Motivation: While VLMs show promise for Earth observation, there's no systematic benchmark to evaluate remote sensing vision-language models under realistic scenarios, limiting progress in this domain.
Method: Created OmniEarth benchmark with 28 fine-grained tasks organized along perception, reasoning, and robustness dimensions. Includes 9,275 quality-controlled images (including Jilin-1 satellite data) and 44,210 verified instructions. Supports multiple-choice VQA, open-ended VQA with text, bounding box, and mask outputs. Uses blind test protocol and semantic consistency requirements to reduce linguistic bias.
Result: Evaluation of contrastive learning models, general VLMs, and RSVLMs shows existing models struggle with geospatially complex tasks, revealing significant gaps for remote sensing applications.
Conclusion: OmniEarth provides a needed benchmark for RSVLM evaluation, highlighting current limitations and enabling future research in Earth observation vision-language modeling.
Abstract: Vision-Language Models (VLMs) have demonstrated effective perception and reasoning capabilities on general-domain tasks, leading to growing interest in their application to Earth observation. However, a systematic benchmark for comprehensively evaluating remote sensing vision-language models (RSVLMs) remains lacking. To address this gap, we introduce OmniEarth, a benchmark for evaluating RSVLMs under realistic Earth observation scenarios. OmniEarth organizes tasks along three capability dimensions: perception, reasoning, and robustness. It defines 28 fine-grained tasks covering multi-source sensing data and diverse geospatial contexts. The benchmark supports two task formulations: multiple-choice VQA and open-ended VQA. The latter includes pure text outputs for captioning tasks, bounding box outputs for visual grounding tasks, and mask outputs for segmentation tasks. To reduce linguistic bias and examine whether model predictions rely on visual evidence, OmniEarth adopts a blind test protocol and a quintuple semantic consistency requirement. OmniEarth includes 9,275 carefully quality-controlled images, including proprietary satellite imagery from Jilin-1 (JL-1), along with 44,210 manually verified instructions. We conduct a systematic evaluation of contrastive learning-based models, general closed-source and open-source VLMs, as well as RSVLMs. Results show that existing VLMs still struggle with geospatially complex tasks, revealing clear gaps that need to be addressed for remote sensing applications. OmniEarth is publicly available at https://huggingface.co/datasets/sjeeudd/OmniEarth.
[182] Prune Redundancy, Preserve Essence: Vision Token Compression in VLMs via Synergistic Importance-Diversity
Zhengyao Fang, Pengyuan Lyu, Chengquan Zhang, Guangming Lu, Jun Yu, Wenjie Pei
Main category: cs.CV
TL;DR: PruneSID is a training-free approach for compressing visual tokens in vision-language models by clustering tokens into semantic groups and pruning redundancies while preserving key information, achieving high accuracy with extreme compression ratios.
Details
Motivation: Vision-language models suffer from computational inefficiencies due to excessive visual token generation, with existing compression methods struggling to balance importance preservation and information diversity.
Method: Two-stage pipeline: (1) Principal Semantic Components Analysis (PSCA) clusters tokens into semantically coherent groups for comprehensive concept coverage, (2) Intra-group Non-Maximum Suppression (NMS) prunes redundant tokens while preserving key representatives within each group, plus an information-aware dynamic compression ratio mechanism based on image complexity.
Result: Achieves 96.3% accuracy on LLaVA-1.5 with only 11.1% token retention, and 92.8% accuracy at extreme compression rates (5.6%) on LLaVA-NeXT, outperforming prior methods by 2.5% with 7.8× faster prefilling speed compared to original model.
Conclusion: PruneSID effectively balances importance preservation and information diversity in visual token compression, generalizes across diverse VLMs and both image/video modalities, and demonstrates strong cross-modal versatility.
Abstract: Vision-language models (VLMs) face significant computational inefficiencies caused by excessive generation of visual tokens. While prior work shows that a large fraction of visual tokens are redundant, existing compression methods struggle to balance importance preservation and information diversity. To address this, we propose PruneSID, a training-free Synergistic Importance-Diversity approach featuring a two-stage pipeline: (1) Principal Semantic Components Analysis (PSCA) for clustering tokens into semantically coherent groups, ensuring comprehensive concept coverage, and (2) Intra-group Non-Maximum Suppression (NMS) for pruning redundant tokens while preserving key representative tokens within each group. Additionally, PruneSID incorporates an information-aware dynamic compression ratio mechanism that optimizes token compression rates based on image complexity, enabling more effective average information preservation across diverse scenes. Extensive experiments demonstrate state-of-the-art performance, achieving 96.3% accuracy on LLaVA-1.5 with only 11.1% token retention, and 92.8% accuracy at extreme compression rates (5.6%) on LLaVA-NeXT, outperforming prior methods by 2.5% with 7.8 $\times$ faster prefilling speed compared to the original model. Our framework generalizes across diverse VLMs and both image and video modalities, showcasing strong cross-modal versatility. Code is available at https://github.com/ZhengyaoFang/PruneSID.
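The group-then-suppress idea behind PruneSID can be sketched generically: cluster token embeddings (a crude k-means stands in for PSCA here), then within each group greedily keep high-importance tokens and drop near-duplicates, NMS-style. Everything below is an illustrative analogue under assumed shapes, not the paper's algorithm:

```python
import numpy as np

def prune_tokens(tokens, importance, n_groups=4, sim_thresh=0.9, seed=0):
    """Importance-diversity pruning sketch. tokens: (N, D) features,
    importance: (N,) scores. Returns sorted indices of kept tokens."""
    rng = np.random.default_rng(seed)
    x = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    # crude spherical k-means as a stand-in for semantic grouping
    centers = x[rng.choice(len(x), n_groups, replace=False)]
    for _ in range(10):
        labels = np.argmax(x @ centers.T, axis=1)
        for g in range(n_groups):
            if np.any(labels == g):
                c = x[labels == g].mean(0)
                centers[g] = c / (np.linalg.norm(c) + 1e-12)
    keep = []
    for g in range(n_groups):
        idx = np.where(labels == g)[0]
        idx = idx[np.argsort(-importance[idx])]  # most important first
        selected = []
        for i in idx:
            # NMS-style: keep only if not too similar to anything kept
            if all(x[i] @ x[j] < sim_thresh for j in selected):
                selected.append(i)
        keep.extend(selected)
    return sorted(keep)
```

Grouping first guarantees every semantic cluster contributes at least one token (diversity), while the intra-group suppression removes redundancy (importance-ranked).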
[183] Component-Aware Sketch-to-Image Generation Using Self-Attention Encoding and Coordinate-Preserving Fusion
Ali Zia, Muhammad Umer Ramzan, Usman Ali, Muhammad Faheem, Abdelwahed Khamis, Shahnawaz Qureshi
Main category: cs.CV
TL;DR: A component-aware, self-refining framework for sketch-to-image generation using a two-stage architecture with self-attention autoencoder, coordinate-preserving fusion, and spatially adaptive refinement.
Details
Motivation: Existing sketch-to-image methods struggle with reconstructing fine details, maintaining spatial alignment, and adapting across diverse sketch domains due to the abstract, sparse, and stylistically varied nature of sketches.
Method: Two-stage architecture: 1) Self-Attention-based Autoencoder Network (SA2N) captures localized semantic features from component-wise sketch regions, 2) Coordinate-Preserving Gated Fusion (CGF) integrates features into coherent spatial layout, 3) Spatially Adaptive Refinement Revisor (SARR) with modified StyleGAN2 backbone performs iterative refinement guided by spatial context.
Result: Outperforms state-of-the-art GAN and diffusion models across facial (CelebAMask-HQ, CUFSF) and non-facial (Sketchy, ChairsV2, ShoesV2) datasets. On CelebAMask-HQ: 21% improvement in FID, 58% in IS, 41% in KID, and 20% in SSIM. Shows higher efficiency and visual coherence.
Conclusion: The proposed framework effectively addresses sketch-to-image challenges and demonstrates strong performance, making it suitable for applications in forensics, digital art restoration, and general sketch-based image synthesis.
Abstract: Translating freehand sketches into photorealistic images remains a fundamental challenge in image synthesis, particularly due to the abstract, sparse, and stylistically diverse nature of sketches. Existing approaches, including GAN-based and diffusion-based models, often struggle to reconstruct fine-grained details, maintain spatial alignment, or adapt across different sketch domains. In this paper, we propose a component-aware, self-refining framework for sketch-to-image generation that addresses these challenges through a novel two-stage architecture. A Self-Attention-based Autoencoder Network (SA2N) first captures localised semantic and structural features from component-wise sketch regions, while a Coordinate-Preserving Gated Fusion (CGF) module integrates these into a coherent spatial layout. Finally, a Spatially Adaptive Refinement Revisor (SARR), built on a modified StyleGAN2 backbone, enhances realism and consistency through iterative refinement guided by spatial context. Extensive experiments across both facial (CelebAMask-HQ, CUFSF) and non-facial (Sketchy, ChairsV2, ShoesV2) datasets demonstrate the robustness and generalizability of our method. The proposed framework consistently outperforms state-of-the-art GAN and diffusion models, achieving significant gains in image fidelity, semantic accuracy, and perceptual quality. On CelebAMask-HQ, our model improves over prior methods by 21% (FID), 58% (IS), 41% (KID), and 20% (SSIM). These results, along with higher efficiency and visual coherence across diverse domains, position our approach as a strong candidate for applications in forensics, digital art restoration, and general sketch-based image synthesis.
[184] Streaming Autoregressive Video Generation via Diagonal Distillation
Jinxiu Liu, Xuanming Liu, Kangfu Mei, Yandong Wen, Ming-Hsuan Yang, Weiyang Liu
Main category: cs.CV
TL;DR: Diagonal Distillation: A video diffusion distillation method that uses asymmetric generation (more steps early, fewer later) to improve temporal coherence and reduce error accumulation in real-time video generation.
Details
Motivation: Current video diffusion distillation methods adapt image-specific techniques that neglect temporal dependencies, leading to reduced motion coherence, error accumulation in long sequences, and latency-quality trade-offs in real-time streaming applications.
Method: Proposes Diagonal Distillation with asymmetric generation strategy: more denoising steps for early chunks, fewer for later chunks. Later chunks inherit appearance from early chunks and use partially denoised chunks as conditional inputs. Incorporates implicit optical flow modeling to preserve motion quality under step constraints.
Result: Achieves 31 FPS for 5-second video generation (2.61 seconds total), representing a 277.3x speedup over undistilled model while maintaining quality and temporal coherence.
Conclusion: Diagonal Distillation effectively addresses temporal dependency issues in video diffusion distillation, enabling real-time video generation with improved motion coherence and reduced error propagation.
Abstract: Large pretrained diffusion models have significantly enhanced the quality of generated videos, and yet their use in real-time streaming remains limited. Autoregressive models offer a natural framework for sequential frame synthesis but require heavy computation to achieve high fidelity. Diffusion distillation can compress these models into efficient few-step variants, but existing video distillation approaches largely adapt image-specific methods that neglect temporal dependencies. These techniques often excel in image generation but underperform in video synthesis, exhibiting reduced motion coherence, error accumulation over long sequences, and a latency-quality trade-off. We identify two factors that result in these limitations: insufficient utilization of temporal context during step reduction and implicit prediction of subsequent noise levels in next-chunk prediction (i.e., exposure bias). To address these issues, we propose Diagonal Distillation, which operates orthogonally to existing approaches and better exploits temporal information across both video chunks and denoising steps. Central to our approach is an asymmetric generation strategy: more steps early, fewer steps later. This design allows later chunks to inherit rich appearance information from thoroughly processed early chunks, while using partially denoised chunks as conditional inputs for subsequent synthesis. By aligning the implicit prediction of subsequent noise levels during chunk generation with the actual inference conditions, our approach mitigates error propagation and reduces oversaturation in long-range sequences. We further incorporate implicit optical flow modeling to preserve motion quality under strict step constraints. Our method generates a 5-second video in 2.61 seconds (up to 31 FPS), achieving a 277.3x speedup over the undistilled model.
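The asymmetric "more steps early, fewer steps later" schedule is easy to picture as a per-chunk step budget. The linear decay below is an illustrative choice (the paper does not specify this exact form):

```python
def asymmetric_schedule(n_chunks, max_steps=8, min_steps=2):
    """Allocate denoising steps per video chunk: the first chunk gets
    max_steps, the last gets min_steps, with a linear decay in between.
    Later chunks can afford fewer steps because they inherit appearance
    from the thoroughly denoised early chunks."""
    if n_chunks == 1:
        return [max_steps]
    steps = []
    for i in range(n_chunks):
        t = i / (n_chunks - 1)
        steps.append(round(max_steps - t * (max_steps - min_steps)))
    return steps
```

Compared with a uniform budget, the total step count is roughly halved while the chunks that anchor appearance keep their full refinement.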
[185] Evolving Prompt Adaptation for Vision-Language Models
Enming Zhang, Jiayang Li, Yanru Wu, Zhenyu Liu, Yang Li
Main category: cs.CV
TL;DR: EvoPrompt is a novel framework for few-shot adaptation of vision-language models that prevents catastrophic forgetting by explicitly steering prompt evolution through decoupled directional/magnitude updates and feature regularization.
Details
Motivation: Addressing the challenge of adapting large-scale vision-language models to downstream tasks with limited labeled data while avoiding catastrophic forgetting of pre-trained knowledge, which is a common problem in parameter-efficient prompt learning methods.
Method: Proposes EvoPrompt with: 1) Modality-Shared Prompt Projector (MPP) generating hierarchical prompts from unified embedding space, 2) Evolutionary training strategy decoupling low-rank updates into directional (preserved) and magnitude (adapted) components, 3) Feature Geometric Regularization (FGR) enforcing feature decorrelation to prevent representation collapse.
Result: Extensive experiments show EvoPrompt achieves state-of-the-art performance in few-shot learning while robustly preserving the original zero-shot capabilities of pre-trained VLMs.
Conclusion: EvoPrompt successfully addresses catastrophic forgetting in vision-language model adaptation by governing prompt evolution trajectories, enabling stable knowledge-preserving fine-tuning for few-shot learning scenarios.
Abstract: The adaptation of large-scale vision-language models (VLMs) to downstream tasks with limited labeled data remains a significant challenge. While parameter-efficient prompt learning methods offer a promising path, they often suffer from catastrophic forgetting of pre-trained knowledge. Toward addressing this limitation, our work is grounded in the insight that governing the evolutionary path of prompts is essential for forgetting-free adaptation. To this end, we propose EvoPrompt, a novel framework designed to explicitly steer the prompt trajectory for stable, knowledge-preserving fine-tuning. Specifically, our approach employs a Modality-Shared Prompt Projector (MPP) to generate hierarchical prompts from a unified embedding space. Critically, an evolutionary training strategy decouples low-rank updates into directional and magnitude components, preserving early-learned semantic directions while only adapting their magnitude, thus enabling prompts to evolve without discarding foundational knowledge. This process is further stabilized by Feature Geometric Regularization (FGR), which enforces feature decorrelation to prevent representation collapse. Extensive experiments demonstrate that EvoPrompt achieves state-of-the-art performance in few-shot learning while robustly preserving the original zero-shot capabilities of pre-trained VLMs.
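The direction/magnitude decoupling of the evolutionary training strategy is analogous to weight normalization: factor an update into a unit direction (frozen once learned) and a scalar norm (kept trainable). A minimal vector-level sketch, not the paper's exact parameterization:

```python
import numpy as np

def decouple(update):
    """Split an update vector into a unit direction and a scalar
    magnitude, so the direction can be frozen while only the
    magnitude continues to adapt."""
    mag = np.linalg.norm(update)
    return update / mag, float(mag)

def adapt_magnitude(direction, new_mag):
    """Recompose an update from the preserved direction and a
    newly learned magnitude."""
    return new_mag * direction
```

Because the recomposed update always points along the preserved direction, later training can only rescale, not redirect, the early-learned semantics, which is the mechanism claimed to prevent forgetting.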
[186] SurgFed: Language-guided Multi-Task Federated Learning for Surgical Video Understanding
Zheng Fang, Ziwei Niu, Ziyue Wang, Zhu Zhuo, Haofeng Liu, Shuyang Qian, Jun Xia, Yueming Jin
Main category: cs.CV
TL;DR: SurgFed: A multi-task federated learning framework for surgical scene segmentation and depth estimation using language guidance to address tissue and task diversity challenges in heterogeneous clinical environments.
Details
Motivation: Current surgical scene Multi-Task Federated Learning (MTFL) faces two key challenges: (1) Tissue Diversity - local models struggle with site-specific tissue features in heterogeneous clinical environments, and (2) Task Diversity - server-side aggregation using gradient-based clustering produces suboptimal parameter updates due to inter-site task heterogeneity, leading to inaccurate localization.
Method: SurgFed introduces two language-guided designs: 1) Language-guided Channel Selection (LCS) - a lightweight personalized channel selection network using pre-defined text inputs to enhance site-specific adaptation, and 2) Language-guided Hyper Aggregation (LHA) - employs layer-wise cross-attention with text inputs to model task interactions across sites and guide a hypernetwork for personalized parameter updates.
Result: Extensive empirical evidence shows SurgFed yields improvements over state-of-the-art methods in five public datasets across four surgical types.
Conclusion: SurgFed successfully addresses tissue and task diversity challenges in surgical scene understanding through language-guided federated learning, enabling effective multi-task learning across diverse surgical environments.
Abstract: Surgical scene Multi-Task Federated Learning (MTFL) is essential for robot-assisted minimally invasive surgery (RAS) but remains underexplored in surgical video understanding due to two key challenges: (1) Tissue Diversity: Local models struggle to adapt to site-specific tissue features, limiting their effectiveness in heterogeneous clinical environments and leading to poor local predictions. (2) Task Diversity: Server-side aggregation, relying solely on gradient-based clustering, often produces suboptimal or incorrect parameter updates due to inter-site task heterogeneity, resulting in inaccurate localization. In light of these two issues, we propose SurgFed, a multi-task federated learning framework, enabling federated learning for surgical scene segmentation and depth estimation across diverse surgical types. SurgFed is powered by two appealing designs, i.e., Language-guided Channel Selection (LCS) and Language-guided Hyper Aggregation (LHA), to address the challenge of fully exploiting cross-site and cross-task information. Technically, LCS is designed as a lightweight personalized channel selection network that enhances site-specific adaptation using pre-defined text inputs, helping the local model optimally learn site-specific embeddings. We further introduce LHA, which employs a layer-wise cross-attention mechanism with pre-defined text inputs to model task interactions across sites and guide a hypernetwork for personalized parameter updates. Extensive empirical evidence shows that SurgFed yields improvements over the state-of-the-art methods in five public datasets across four surgical types. The code is available at https://anonymous.4open.science/r/SurgFed-070E/.
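Language-guided channel selection amounts to projecting a text embedding into per-channel gates that reweight visual features. The toy below illustrates the gating pattern only; the shapes, the projection matrix W, and the function name are assumptions, not SurgFed's actual design:

```python
import numpy as np

def language_gated_channels(features, text_emb, W):
    """Toy text-conditioned channel gating: project a text embedding
    to per-channel sigmoid gates and reweight the visual channels.
    features: (C, H, W) map, text_emb: (T,), W: (C, T) projection."""
    gates = 1.0 / (1.0 + np.exp(-(W @ text_emb)))   # sigmoid -> (C,)
    return features * gates[:, None, None]          # broadcast over H, W
```

Because the gates depend only on the (site-specific) text prompt, each site can emphasize different channels of a shared backbone without exchanging raw features.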
[187] Context-Nav: Context-Driven Exploration and Viewpoint-Aware 3D Spatial Reasoning for Instance Navigation
Won Shik Jang, Ue-Hwan Kim
Main category: cs.CV
TL;DR: Context-Nav: A zero-shot navigation system that uses full contextual captions as global exploration priors and performs viewpoint-aware 3D spatial reasoning for fine-grained instance disambiguation in cluttered 3D scenes.
Details
Motivation: Text-goal instance navigation (TGIN) requires agents to navigate to specific object instances among same-category distractors using free-form descriptions. Existing approaches often fail to effectively use contextual information and lack robust spatial reasoning for verification.
Method: Two-stage approach: 1) Compute dense text-image alignments to create a value map that ranks frontiers, using entire contextual captions as global exploration priors. 2) Upon candidate detection, perform viewpoint-aware relation check by sampling plausible observer poses, aligning local frames, and verifying spatial relations from at least one viewpoint.
Result: Achieves state-of-the-art performance on InstanceNav and CoIN-Bench without task-specific training or fine-tuning. Ablations show that using full captions avoids wasted motion and explicit 3D verification prevents incorrect stops.
Conclusion: Geometry-grounded spatial reasoning provides a scalable alternative to heavy policy training or human-in-the-loop interaction for fine-grained instance disambiguation in cluttered 3D scenes.
Abstract: Text-goal instance navigation (TGIN) asks an agent to resolve a single, free-form description into actions that reach the correct object instance among same-category distractors. We present Context-Nav, which elevates long, contextual captions from a local matching cue to a global exploration prior and verifies candidates through 3D spatial reasoning. First, we compute dense text-image alignments for a value map that ranks frontiers, guiding exploration toward regions consistent with the entire description rather than early detections. Second, upon observing a candidate, we perform a viewpoint-aware relation check: the agent samples plausible observer poses, aligns local frames, and accepts a target only if the spatial relations can be satisfied from at least one viewpoint. The pipeline requires no task-specific training or fine-tuning; we attain state-of-the-art performance on InstanceNav and CoIN-Bench. Ablations show that (i) encoding full captions into the value map avoids wasted motion and (ii) explicit, viewpoint-aware 3D verification prevents semantically plausible but incorrect stops. This suggests that geometry-grounded spatial reasoning is a scalable alternative to heavy policy training or human-in-the-loop interaction for fine-grained instance disambiguation in cluttered 3D scenes.
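The frontier-ranking step reduces to scoring each frontier by the cosine alignment between its visual feature and the embedding of the full caption. A simplified stand-in for the value map (how the features themselves are extracted is assumed away here):

```python
import numpy as np

def rank_frontiers(frontier_feats, caption_feat):
    """Rank exploration frontiers by cosine alignment with the full
    caption embedding. frontier_feats: (F, D), caption_feat: (D,).
    Returns (indices best-first, raw cosine scores)."""
    f = frontier_feats / np.linalg.norm(frontier_feats, axis=1, keepdims=True)
    c = caption_feat / np.linalg.norm(caption_feat)
    scores = f @ c
    return np.argsort(-scores), scores
```

Using the whole caption embedding, rather than just the target noun, is what lets the ranking prefer regions consistent with the description's context over the first category match encountered.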
[188] Probing the Reliability of Driving VLMs: From Inconsistent Responses to Grounded Temporal Reasoning
Chun-Peng Chang, Chen-Yu Wang, Holger Caesar, Alain Pagani
Main category: cs.CV
TL;DR: This paper investigates whether Vision-Language Models (VLMs) can provide consistent, temporally-grounded reasoning for driving assistance, finding they often fail at temporal reasoning and show response inconsistency despite strong visual understanding.
Details
Motivation: The paper aims to critically examine whether VLMs used as driving assistants can perform temporally grounded reasoning from observed information, rather than just relying on memorized patterns from training. The authors question the assumption that strong visual interpretation naturally enables consistent future reasoning for reliable decision-making.
Method: The authors identify two major challenges: response inconsistency and limited temporal reasoning. They adopt existing evaluation methods and introduce FutureVQA, a human-annotated benchmark for future scene reasoning. They also propose a self-supervised tuning approach with chain-of-thought reasoning to improve consistency and temporal reasoning without requiring temporal labels.
Result: The research finds that VLMs exhibit response inconsistency where minor input perturbations yield different answers, and limited temporal reasoning where models fail to align sequential events from current observations. Models with strong visual understanding don’t necessarily perform best on temporal reasoning tasks, indicating over-reliance on pretrained patterns rather than modeling temporal dynamics.
Conclusion: Current VLMs have significant limitations in temporal reasoning for driving assistance, requiring specialized benchmarks like FutureVQA and novel training approaches like self-supervised tuning with chain-of-thought reasoning to improve consistency and temporal understanding.
Abstract: A reliable driving assistant should provide consistent responses based on temporally grounded reasoning derived from observed information. In this work, we investigate whether Vision-Language Models (VLMs), when applied as driving assistants, can respond consistently and understand how present observations shape future outcomes, or whether their outputs merely reflect patterns memorized during training without temporally grounded reasoning. While recent efforts have integrated VLMs into autonomous driving, prior studies typically emphasize scene understanding and instruction generation, implicitly assuming that strong visual interpretation naturally enables consistent future reasoning and thus ensures reliable decision-making, a claim we critically examine. We focus on two major challenges limiting VLM reliability in this setting: response inconsistency, where minor input perturbations yield different answers or, in some cases, responses degenerate toward near-random guessing, and limited temporal reasoning, in which models fail to reason and align sequential events from current observations, often resulting in incorrect or even contradictory responses. Moreover, we find that models with strong visual understanding do not necessarily perform best on tasks requiring temporal reasoning, indicating a tendency to over-rely on pretrained patterns rather than modeling temporal dynamics. To address these issues, we adopt existing evaluation methods and introduce FutureVQA, a human-annotated benchmark dataset specifically designed to assess future scene reasoning. In addition, we propose a simple yet effective self-supervised tuning approach with chain-of-thought reasoning that improves both consistency and temporal reasoning without requiring temporal labels.
[189] RESBev: Making BEV Perception More Robust
Lifeng Zhuo, Kefan Jin, Zhe Liu, Hesheng Wang
Main category: cs.CV
TL;DR: RESBev is a plug-and-play method that enhances robustness of Bird’s-eye-view perception in autonomous driving against sensor degradation and adversarial attacks by learning latent semantic predictions from spatiotemporal correlations.
Details
Motivation: Real-world deployment of BEV perception faces challenges from sensor degradation and adversarial attacks, which cause perceptual anomalies and compromise autonomous driving safety. Existing methods lack robustness against diverse disturbances.
Method: Reframes perception robustness as latent semantic prediction problem. Constructs latent world model to extract spatiotemporal correlations across sequential BEV observations, learning underlying BEV state transitions to predict clean BEV features for reconstructing corrupted observations. Operates at semantic feature level of Lift-Splat-Shoot pipeline without modifying backbone.
Result: Extensive experiments on nuScenes dataset show RESBev significantly improves robustness of existing BEV perception models against various external disturbances and adversarial attacks with few-shot fine-tuning.
Conclusion: RESBev provides a resilient, plug-and-play solution that enhances BEV perception robustness against diverse disturbances while maintaining compatibility with existing methods.
Abstract: Bird’s-eye-view (BEV) perception has emerged as a cornerstone of autonomous driving systems, providing a structured, ego-centric representation critical for downstream planning and control. However, real-world deployment faces challenges from sensor degradation and adversarial attacks, which can cause severe perceptual anomalies and ultimately compromise the safety of autonomous driving systems. To address this, we propose a resilient and plug-and-play BEV perception method, RESBev, which can be easily applied to existing BEV perception methods to enhance their robustness to diverse disturbances. Specifically, we reframe perception robustness as a latent semantic prediction problem. A latent world model is constructed to extract spatiotemporal correlations across sequential BEV observations, thereby learning the underlying BEV state transitions to predict clean BEV features for reconstructing corrupted observations. The proposed framework operates at the semantic feature level of the Lift-Splat-Shoot pipeline, enabling recovery that generalizes across both natural disturbances and adversarial attacks without modifying the underlying backbone. Extensive experiments on the nuScenes dataset demonstrate that, with few-shot fine-tuning, RESBev significantly improves the robustness of existing BEV perception models against various external disturbances and adversarial attacks.
[190] DCAU-Net: Differential Cross Attention and Channel-Spatial Feature Fusion for Medical Image Segmentation
Yanxin Li, Hui Wan, Libin Lan
Main category: cs.CV
TL;DR: DCAU-Net: A medical image segmentation framework using Differential Cross Attention for efficient long-range dependency modeling and Channel-Spatial Feature Fusion for adaptive feature integration.
Details
Motivation: Transformers help with long-range dependencies in medical image segmentation but have quadratic computational complexity and assign attention to irrelevant regions, diluting focus on discriminative structures. Existing attention variants reduce computation but impair global context modeling, and conventional fusion strategies fail to adaptively integrate semantic and spatial information.
Method: Proposes DCAU-Net with two key components: 1) Differential Cross Attention (DCA) computes difference between two independent softmax attention maps to highlight discriminative structures, using window-level summary tokens instead of pixel-wise tokens to reduce computation; 2) Channel-Spatial Feature Fusion (CSFF) adaptively recalibrates features from skip connections and up-sampling paths using sequential channel and spatial attention to suppress redundant information.
Result: Experiments on two public benchmarks demonstrate DCAU-Net achieves competitive performance with enhanced segmentation accuracy and robustness.
Conclusion: DCAU-Net effectively addresses transformer limitations in medical image segmentation by combining efficient attention mechanisms with adaptive feature fusion, achieving better accuracy and efficiency.
Abstract: Accurate medical image segmentation requires effective modeling of both long-range dependencies and fine-grained boundary details. While transformers mitigate the issue of insufficient semantic information arising from the limited receptive field inherent in convolutional neural networks, they introduce new challenges: standard self-attention incurs quadratic computational complexity and often assigns non-negligible attention weights to irrelevant regions, diluting focus on discriminative structures and ultimately compromising segmentation accuracy. Existing attention variants, although effective in reducing computational complexity, fail to suppress redundant computation and inadvertently impair global context modeling. Furthermore, conventional fusion strategies in encoder-decoder architectures, typically based on simple concatenation or summation, cannot adaptively integrate high-level semantic information with low-level spatial details. To address these limitations, we propose DCAU-Net, a novel yet efficient segmentation framework with two key ideas. First, a new Differential Cross Attention (DCA) is designed to compute the difference between two independent softmax attention maps to adaptively highlight discriminative structures. By replacing pixel-wise key and value tokens with window-level summary tokens, DCA dramatically reduces computational complexity without sacrificing precision. Second, a Channel-Spatial Feature Fusion (CSFF) strategy is introduced to adaptively recalibrate features from skip connections and up-sampling paths using sequential channel and spatial attention, effectively suppressing redundant information and amplifying salient cues. Experiments on two public benchmarks demonstrate that DCAU-Net achieves competitive performance with enhanced segmentation accuracy and robustness.
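As a rough illustration of the two mechanisms the abstract names, here is a minimal numpy sketch: the subtraction of two softmax attention maps, and mean-pooled window-level summary keys/values. The projections, the subtraction weight `lam`, and the pooling choice are assumptions on my part, not the paper's implementation.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def differential_cross_attention(Q1, Q2, K, V, window=4, lam=0.5):
    n, d = K.shape
    # window-level summary tokens instead of pixel-wise keys/values
    Kw = K.reshape(n // window, window, d).mean(axis=1)
    Vw = V.reshape(n // window, window, d).mean(axis=1)
    # difference of two independent softmax maps cancels attention that
    # both maps place on irrelevant regions
    A = softmax(Q1 @ Kw.T / np.sqrt(d)) - lam * softmax(Q2 @ Kw.T / np.sqrt(d))
    return A @ Vw

rng = np.random.default_rng(0)
Q1, Q2 = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
K, V = rng.normal(size=(8, 4)), rng.normal(size=(8, 4))
out = differential_cross_attention(Q1, Q2, K, V, window=4)
print(out.shape)  # (3, 4)
```

Note how pooling 8 pixel tokens into 2 summary tokens shrinks the attention matrix by the window size, which is where the computational savings come from.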
[191] Towards Unified Multimodal Interleaved Generation via Group Relative Policy Optimization
Ming Nie, Chunwei Wang, Jianhua Han, Hang Xu, Li Zhang
Main category: cs.CV
TL;DR: RL-based post-training strategy to enable multimodal interleaved generation in unified vision-language models without large interleaved datasets
Details
Motivation: Existing unified vision-language models lack the capability to produce multimodal interleaved outputs (alternating text and images), which is crucial for tasks like visual storytelling and step-by-step visual reasoning.
Method: Two-stage approach: 1) Warm-up stage using hybrid dataset with curated interleaved sequences, 2) Unified policy optimization framework extending GRPO to multimodal setting with joint text-image generation modeling and hybrid rewards (textual relevance, visual-text alignment, structural fidelity) plus process-level rewards.
Result: Significantly enhances quality and coherence of multimodal interleaved generation on MMIE and InterleavedBench benchmarks
Conclusion: Proposed RL-based post-training strategy successfully unlocks multimodal interleaved generation capability in existing unified models without requiring large-scale interleaved datasets
Abstract: Unified vision-language models have made significant progress in multimodal understanding and generation, yet they largely fall short in producing multimodal interleaved outputs, which is a crucial capability for tasks like visual storytelling and step-by-step visual reasoning. In this work, we propose a reinforcement learning-based post-training strategy to unlock this capability in existing unified models, without relying on large-scale multimodal interleaved datasets. We begin with a warm-up stage using a hybrid dataset comprising curated interleaved sequences and limited data for multimodal understanding and text-to-image generation, which exposes the model to interleaved generation patterns while preserving its pretrained capabilities. To further refine interleaved generation, we propose a unified policy optimization framework that extends Group Relative Policy Optimization (GRPO) to the multimodal setting. Our approach jointly models text and image generation within a single decoding trajectory and optimizes it with our novel hybrid rewards covering textual relevance, visual-text alignment, and structural fidelity. Additionally, we incorporate process-level rewards to provide step-wise guidance, enhancing training efficiency in complex multimodal tasks. Experiments on MMIE and InterleavedBench demonstrate that our approach significantly enhances the quality and coherence of multimodal interleaved generation.
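The group-relative part of GRPO is simple to state: each sampled completion's reward is normalized by the mean and standard deviation of its group, so no learned value baseline is needed. The sketch below shows that computation together with a hybrid reward; the component scores and weights are illustrative placeholders, not the paper's actual reward functions.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantage: normalize each completion's reward
    within its sampled group (the core of GRPO)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def hybrid_reward(text_rel, align, struct, w=(0.4, 0.4, 0.2)):
    """Illustrative weighted sum of textual relevance, visual-text
    alignment, and structural fidelity scores in [0, 1]."""
    return w[0] * text_rel + w[1] * align + w[2] * struct

group = [hybrid_reward(0.9, 0.8, 1.0),   # coherent interleaved sample
         hybrid_reward(0.2, 0.5, 0.0),   # text drifts from the images
         hybrid_reward(0.6, 0.7, 0.5)]
adv = grpo_advantages(group)
print(np.round(adv, 2))
```

Completions scoring above the group mean get positive advantage and are reinforced; the rest are pushed down, all without a critic network.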
[192] A comprehensive study of time-of-flight non-line-of-sight imaging
Julio Marco, Adrian Jarabo, Ji Hyun Nam, Alberto Tosi, Diego Gutierrez, Andreas Velten
Main category: cs.CV
TL;DR: Comprehensive comparative study of Time-of-Flight non-line-of-sight imaging methods using common formulation and hardware to objectively evaluate their performance limitations.
Details
Motivation: The proliferation of diverse ToF NLOS imaging methods with different formulations and hardware implementations makes it difficult to objectively assess their theoretical and experimental performance. There's a need for standardized evaluation under common conditions.
Method: Develops a unified forward model for ToF NLOS measurements, relates simplified models to Radon transforms and phasor-based virtual line-of-sight imaging, then evaluates representative methods under identical hardware setup and similar photon counts.
Result: Existing methods show similar limitations in spatial resolution, visibility, and noise sensitivity under equal hardware constraints, with differences mainly stemming from method-specific parameters rather than fundamental advantages.
Conclusion: The study provides a reference methodology for objective comparison of ToF NLOS imaging methods, revealing that current approaches share core limitations when evaluated under standardized conditions.
Abstract: Time-of-Flight non-line-of-sight (ToF NLOS) imaging techniques provide state-of-the-art reconstructions of scenes hidden around corners by inverting the optical path of indirect photons scattered by visible surfaces and measured by picosecond resolution sensors. The emergence of a wide range of ToF NLOS imaging methods with heterogeneous formulae and hardware implementations obscures the assessment of both their theoretical and experimental aspects. We present a comprehensive study of a representative set of ToF NLOS imaging methods by discussing their similarities and differences under common formulation and hardware. We first outline the problem statement under a common general forward model for ToF NLOS measurements, and the typical assumptions that yield tractable inverse models. We discuss the relationship of the resulting simplified forward and inverse models to a family of Radon transforms, and how migrating these to the frequency domain relates to recent phasor-based virtual line-of-sight imaging models for NLOS imaging that obey the constraints of conventional lens-based imaging systems. We then evaluate performance of the selected methods on hidden scenes captured under the same hardware setup and similar photon counts. Our experiments show that existing methods share similar limitations on spatial resolution, visibility, and sensitivity to noise when operating under equal hardware constraints, with particular differences that stem from method-specific parameters. We expect our methodology to become a reference in future research on ToF NLOS imaging to obtain objective comparisons of existing and new methods.
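For orientation, the confocal special case of the general forward model that studies like this start from is commonly written as below. The exact radiometric factors (cosine foreshortening terms, albedo model, constants from the delta argument) vary between methods, so treat this as the textbook form rather than the paper's own notation:

```latex
% Confocal ToF NLOS forward model (textbook form):
% H(x', t): transient measured with illumination and detection at relay-wall
% point x'; \rho: hidden-scene albedo over volume \Omega; c: speed of light.
H(x', t) = \int_{\Omega} \frac{\rho(x)}{\lVert x' - x \rVert^{4}}\,
           \delta\!\left(t - \frac{2\,\lVert x' - x \rVert}{c}\right)\mathrm{d}x
```

Each measurement integrates the hidden scene over a sphere of radius $ct/2$ around $x'$, which is what links the inverse problem to the family of Radon transforms discussed in the abstract.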
[193] GeoSolver: Scaling Test-Time Reasoning in Remote Sensing with Fine-Grained Process Supervision
Lang Sun, Ronghao Fu, Zhuoran Duan, Haoran Liu, Xueyan Liu, Bo Yang
Main category: cs.CV
TL;DR: GeoSolver framework introduces process-supervised reinforcement learning for remote sensing VLMs, using a token-level process reward model (GeoPRM) trained on synthesized data to enable verifiable step-by-step reasoning with test-time scaling capabilities.
Details
Motivation: Current VLMs for remote sensing struggle with complex step-by-step reasoning, and existing Chain-of-Thought approaches lack visual faithfulness verification for intermediate reasoning steps, creating a critical bottleneck for reliable geospatial analysis.
Method: 1) Create Geo-PRM-2M dataset via entropy-guided MCTS with visual hallucination injection; 2) Train GeoPRM token-level process reward model; 3) Develop Process-Aware Tree-GRPO RL algorithm with tree-structured exploration and faithfulness-weighted rewards; 4) Build GeoSolver-9B model.
Result: GeoSolver-9B achieves SOTA across diverse remote sensing benchmarks. GeoPRM enables robust Test-Time Scaling and serves as universal geospatial verifier, enhancing both GeoSolver-9B and general-purpose VLMs with cross-model generalization.
Conclusion: The framework successfully addresses visual faithfulness in remote sensing reasoning through process-supervised RL, with GeoPRM providing scalable verification that generalizes across models, advancing reliable multimodal reasoning in geospatial applications.
Abstract: While Vision-Language Models (VLMs) have significantly advanced remote sensing interpretation, enabling them to perform complex, step-by-step reasoning remains highly challenging. Recent efforts to introduce Chain-of-Thought (CoT) reasoning to this domain have shown promise, yet ensuring the visual faithfulness of these intermediate steps remains a critical bottleneck. To address this, we introduce GeoSolver, a novel framework that transitions remote sensing reasoning toward verifiable, process-supervised reinforcement learning. We first construct Geo-PRM-2M, a large-scale, token-level process supervision dataset synthesized via entropy-guided Monte Carlo Tree Search (MCTS) and targeted visual hallucination injection. Building upon this dataset, we train GeoPRM, a token-level process reward model (PRM) that provides granular faithfulness feedback. To effectively leverage these verification signals, we propose Process-Aware Tree-GRPO, a reinforcement learning algorithm that integrates tree-structured exploration with a faithfulness-weighted reward mechanism to precisely assign credit to intermediate steps. Extensive experiments demonstrate that our resulting model, GeoSolver-9B, achieves state-of-the-art performance across diverse remote sensing benchmarks. Crucially, GeoPRM unlocks robust Test-Time Scaling (TTS). Serving as a universal geospatial verifier, it seamlessly scales the performance of GeoSolver-9B and directly enhances general-purpose VLMs, highlighting its remarkable cross-model generalization.
[194] Grounding Synthetic Data Generation With Vision and Language Models
Ümit Mert Çağlar, Alptekin Temizel
Main category: cs.CV
TL;DR: A vision-language framework for interpretable synthetic data augmentation in remote sensing, featuring ARAS400k dataset with 100k real + 300k synthetic images for segmentation and captioning tasks.
Details
Motivation: Existing synthetic data evaluation metrics rely on latent feature similarity which is difficult to interpret and doesn't correlate well with downstream task performance. Need for interpretable synthetic data evaluation in remote sensing.
Method: Combines generative models, semantic segmentation, and image captioning with vision-language models. Introduces ARAS400k dataset with automated evaluation through semantic composition analysis, caption redundancy minimization, and cross-modal consistency verification.
Result: Models trained on synthetic-only data reach competitive performance, but models trained with augmented data (real+synthetic) consistently outperform real-data baselines in semantic segmentation and image captioning tasks.
Conclusion: Establishes a scalable benchmark for remote sensing tasks with interpretable synthetic data evaluation framework. The approach demonstrates effectiveness of synthetic data augmentation when properly evaluated.
Abstract: Deep learning models benefit from increasing data diversity and volume, motivating synthetic data augmentation to improve existing datasets. However, existing evaluation metrics for synthetic data typically calculate latent feature similarity, which is difficult to interpret and does not always correlate with the contribution to downstream tasks. We propose a vision-language grounded framework for interpretable synthetic data augmentation and evaluation in remote sensing. Our approach combines generative models, semantic segmentation and image captioning with vision and language models. Based on this framework, we introduce ARAS400k: A large-scale Remote sensing dataset Augmented with Synthetic data for segmentation and captioning, containing 100k real images and 300k synthetic images, each paired with segmentation maps and descriptions. ARAS400k enables the automated evaluation of synthetic data by analyzing semantic composition, minimizing caption redundancy, and verifying cross-modal consistency between visual structures and language descriptions. Experimental results indicate that while models trained exclusively on synthetic data reach competitive performance levels, those trained with augmented data (a combination of real and synthetic images) consistently outperform real-data baselines. Consequently, this work establishes a scalable benchmark for remote sensing tasks, specifically in semantic segmentation and image captioning. The dataset is available at zenodo.org/records/18890661 and the code base at github.com/caglarmert/ARAS400k.
[195] GeoAlignCLIP: Enhancing Fine-Grained Vision-Language Alignment in Remote Sensing via Multi-Granular Consistency Learning
Xiao Yang, Ronghao Fu, Zhuoran Duan, Zhiwen Lin, Xueyan Liu, Bo Yang
Main category: cs.CV
TL;DR: GeoAlignCLIP improves remote sensing vision-language models by learning multi-granular semantic alignments and intra-modal consistency for better fine-grained image-text alignment, with a new RSFG-100k dataset.
Details
Motivation: Existing remote sensing vision-language models rely too much on global image-text alignment and fail to capture fine-grained details, limiting performance on complex tasks requiring precise visual-semantic understanding.
Method: Proposes GeoAlignCLIP framework with multi-granular semantic alignments and intra-modal consistency learning. Also creates RSFG-100k dataset with scene descriptions, region-level annotations, and hard-negative samples for hierarchical supervision.
Result: Outperforms existing RS-specific methods across multiple public remote-sensing benchmarks, showing more robust and accurate fine-grained vision-language alignment.
Conclusion: The approach effectively addresses fine-grained alignment challenges in remote sensing vision-language tasks through multi-granular learning and comprehensive dataset construction.
Abstract: Vision-language pretraining models have made significant progress in bridging remote sensing imagery with natural language. However, existing approaches often fail to effectively integrate multi-granular visual and textual information, relying primarily on global image-text alignment. This limitation hinders the model’s ability to accurately capture fine-grained details in images, thus restricting its performance in complex, fine-grained tasks. To address this, we propose GeoAlignCLIP, a unified framework that achieves fine-grained alignment in remote sensing tasks by learning multi-granular semantic alignments and incorporating intra-modal consistency, enabling more precise visual-semantic alignment between image regions and text concepts. Additionally, we construct RSFG-100k, a fine-granular remote sensing dataset containing scene descriptions, region-level annotations, and challenging hard-negative samples, providing hierarchical supervision for model training. Extensive experiments conducted on multiple public remote-sensing benchmarks demonstrate that GeoAlignCLIP consistently outperforms existing RS-specific methods across diverse tasks, exhibiting more robust and accurate fine-grained vision-language alignment.
[196] More than the Sum: Panorama-Language Models for Adverse Omni-Scenes
Weijia Fan, Ruiping Liu, Jiale Wei, Yufan Chen, Junwei Zheng, Zichao Zeng, Jiaming Zhang, Qiufu Li, Linlin Shen, Rainer Stiefelhagen
Main category: cs.CV
TL;DR: PLM introduces panoramic vision-language modeling using a plug-and-play attention module to adapt existing VLMs for 360° imagery, with PanoVQA dataset for comprehensive omni-scene reasoning.
Details
Motivation: Existing VLMs are designed for pinhole imagery and require stitching multiple narrow FOV inputs for omni-scene understanding, which overlooks holistic spatial and contextual relationships inherent in single panoramas.
Method: Proposes Panorama-Language Modeling (PLM) paradigm with a plug-and-play panoramic sparse attention module that allows existing pinhole-based VLMs to process equirectangular panoramas without retraining, plus introduces PanoVQA dataset.
Result: PLM achieves superior robustness and holistic reasoning under challenging omni-scenes, demonstrating understanding greater than the sum of narrow parts.
Conclusion: PLM establishes a foundation for unified 360° vision-language reasoning that preserves holistic spatial relationships better than multi-view pinhole approaches.
Abstract: Existing vision-language models (VLMs) are tailored for pinhole imagery, stitching multiple narrow field-of-view inputs to piece together a complete omni-scene understanding. Yet, such multi-view perception overlooks the holistic spatial and contextual relationships that a single panorama inherently preserves. In this work, we introduce the Panorama-Language Modeling (PLM) paradigm, a unified $360^\circ$ vision-language reasoning paradigm that is more than the sum of its pinhole counterparts. Besides, we present PanoVQA, a large-scale panoramic VQA dataset that involves adverse omni-scenes, enabling comprehensive reasoning under object occlusions and driving accidents. To establish a foundation for PLM, we develop a plug-and-play panoramic sparse attention module that allows existing pinhole-based VLMs to process equirectangular panoramas without retraining. Extensive experiments demonstrate that our PLM achieves superior robustness and holistic reasoning under challenging omni-scenes, yielding understanding greater than the sum of its narrow parts. Project page: https://github.com/InSAI-Lab/PanoVQA.
[197] AutoViVQA: A Large-Scale Automatically Constructed Dataset for Vietnamese Visual Question Answering
Nguyen Anh Tuong, Phan Ba Duc, Nguyen Trung Quoc, Tran Dac Thinh, Dang Duy Lan, Nguyen Quoc Thinh, Tung Le
Main category: cs.CV
TL;DR: Vietnamese VQA using transformer architectures with systematic evaluation metric comparison in multilingual settings
Details
Motivation: Address Vietnamese VQA as a low-resource multimodal task, leveraging transformer-based architectures and comparing evaluation metrics to improve alignment with human judgment.
Method: Transformer-based architectures combining PhoBERT for Vietnamese language understanding and Vision Transformers for image representation, with systematic comparison of automatic evaluation metrics (BLEU, METEOR, CIDEr, Recall, Precision, F1-score).
Result: No quantitative results are given in the abstract; the work reports progress in Vietnamese VQA through multimodal transformer fusion and evaluation metric analysis.
Conclusion: Transformer-based approaches show promise for Vietnamese VQA, with evaluation metric comparison providing insights for multilingual multimodal systems
Abstract: Visual Question Answering (VQA) is a fundamental multimodal task that requires models to jointly understand visual and textual information. Early VQA systems relied heavily on language biases, motivating subsequent work to emphasize visual grounding and balanced datasets. With the success of large-scale pre-trained transformers for both text and vision domains – such as PhoBERT for Vietnamese language understanding and Vision Transformers (ViT) for image representation learning – multimodal fusion has achieved remarkable progress. For Vietnamese VQA, several datasets have been introduced to promote research in low-resource multimodal learning, including ViVQA, OpenViVQA, and the recently proposed ViTextVQA. These resources enable benchmarking of models that integrate linguistic and visual features in the Vietnamese context. Evaluation of VQA systems often employs automatic metrics originally designed for image captioning or machine translation, such as BLEU, METEOR, CIDEr, Recall, Precision, and F1-score. However, recent research suggests that large language models can further improve the alignment between automatic evaluation and human judgment in VQA tasks. In this work, we explore Vietnamese Visual Question Answering using transformer-based architectures, leveraging both textual and visual pre-training while systematically comparing automatic evaluation metrics under multilingual settings.
[198] BinaryAttention: One-Bit QK-Attention for Vision and Diffusion Transformers
Chaodong Xiao, Zhengqiang Zhang, Lei Zhang
Main category: cs.CV
TL;DR: BinaryAttention: 1-bit quantization for vision transformer attention using sign-only queries/keys with bitwise operations, achieving >2x speedup over FlashAttention2 while maintaining accuracy.
Details
Motivation: Transformers have computational complexity bottlenecks in attention modules for vision tasks. Existing 8-bit/4-bit quantization methods balance efficiency and accuracy, but 1-bit attention could offer greater speedups while preserving essential similarity relationships.
Method: Retains only sign of queries and keys, replaces floating-point dot products with bitwise operations. Mitigates information loss with learnable bias, uses quantization-aware training and self-distillation to maintain accuracy while ensuring sign-aligned similarity.
Result: BinaryAttention is >2x faster than FlashAttention2 on A100 GPUs. Extensive experiments on vision transformer and diffusion transformer benchmarks show it matches or exceeds full-precision attention performance.
Conclusion: Provides highly efficient 1-bit attention alternative for vision and diffusion transformers, pushing frontier of low-bit quantization while maintaining accuracy through theoretical justification and practical techniques.
Abstract: Transformers have achieved widespread and remarkable success, while the computational complexity of their attention modules remains a major bottleneck for vision tasks. Existing methods mainly employ 8-bit or 4-bit quantization to balance efficiency and accuracy. In this paper, with theoretical justification, we indicate that binarization of attention preserves the essential similarity relationships, and propose BinaryAttention, an effective method for fast and accurate 1-bit qk-attention. Specifically, we retain only the sign of queries and keys in computing the attention, and replace the floating-point dot products with bit-wise operations, significantly reducing the computational cost. We mitigate the inherent information loss under 1-bit quantization by incorporating a learnable bias, and enable end-to-end acceleration. To maintain the accuracy of attention, we adopt quantization-aware training and self-distillation techniques, mitigating quantization errors while ensuring sign-aligned similarity. BinaryAttention is more than 2x faster than FlashAttention2 on A100 GPUs. Extensive experiments on vision transformer and diffusion transformer benchmarks demonstrate that BinaryAttention matches or even exceeds full-precision attention, validating its effectiveness. Our work provides a highly efficient and effective alternative to full-precision attention, pushing the frontier of low-bit vision and diffusion transformers. The codes and models can be found at https://github.com/EdwardChasel/BinaryAttention.
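The sign-only trick is easy to see in plain numpy. A real kernel would pack the ±1 signs into machine words and evaluate each dot product via XNOR + popcount, but arithmetically it reduces to the sketch below; the bias placement, shapes, and training-side techniques are my assumptions, not the paper's implementation.

```python
import numpy as np

def binary_qk_attention(Q, K, V, bias=0.0):
    """1-bit QK attention sketch: only the signs of Q and K enter the
    score computation; V stays full precision."""
    d = Q.shape[-1]
    Qs, Ks = np.sign(Q), np.sign(K)               # {-1, +1} in practice
    scores = (Qs @ Ks.T + bias) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ V

rng = np.random.default_rng(1)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
out = binary_qk_attention(Q, K, V)
print(out.shape)  # (4, 8)

# why bitwise ops suffice: <sign(q), sign(k)> = d - 2 * (# of sign disagreements)
q, k = np.sign(rng.normal(size=16)), np.sign(rng.normal(size=16))
assert int(q @ k) == 16 - 2 * int(np.count_nonzero(q != k))
```

The final assertion is the identity that lets a kernel replace the floating-point dot product with a popcount over packed sign bits.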
[199] Ego: Embedding-Guided Personalization of Vision-Language Models
Soroush Seifi, Simon Gardier, Vaggelis Dorovatas, Daniel Olmeda Reino, Rahaf Aljundi
Main category: cs.CV
TL;DR: A method for personalizing multimodal language models by extracting visual tokens representing target concepts using internal attention mechanisms, enabling efficient concept recall without additional training.
Details
Motivation: Current approaches to personalizing large vision language models either require additional training (limiting generality/scalability) or use engineered pipelines with external modules (hindering deployment efficiency). There's a need for efficient personalization that leverages the model's inherent capabilities.
Method: Extract visual tokens that predominantly represent target concepts using the model’s internal attention mechanisms. These tokens serve as memory for specific concepts, enabling the model to recall and describe them when they appear in test images.
Result: Comprehensive evaluation across single-concept, multi-concept, and video personalization settings shows strong performance gains with minimal personalization overhead compared to SOTA methods.
Conclusion: Proposed method enables efficient personalization of multimodal language models by leveraging internal attention mechanisms for concept extraction and recall, overcoming limitations of existing approaches.
Abstract: AI assistants that support humans in daily life are becoming increasingly feasible, driven by the rapid advancements in multimodal language models. A key challenge lies in overcoming the generic nature of these models to deliver personalized experiences. Existing approaches to personalizing large vision language models often rely on additional training stages, which limit generality and scalability, or on engineered pipelines with external pre-trained modules, which hinder deployment efficiency. In this work, we propose an efficient personalization method that leverages the model’s inherent ability to capture personalized concepts. Specifically, we extract visual tokens that predominantly represent the target concept by utilizing the model’s internal attention mechanisms. These tokens serve as a memory of that specific concept, enabling the model to recall and describe it when it appears in test images. We conduct a comprehensive and unified evaluation of our approach and SOTA methods across various personalization settings including single-concept, multi-concept, and video personalization, demonstrating strong performance gains with minimal personalization overhead.
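The training-free extraction step can be caricatured as ranking visual tokens by how much internal attention they receive and keeping the top-k as a "memory" of the concept. The token tensor and attention vector below are toy stand-ins, not the model's actual internals.

```python
import numpy as np

def extract_concept_tokens(visual_tokens, attn_weights, k=3):
    # Rank visual tokens by received attention and keep the top-k
    # (in original token order) as the concept memory.
    top = np.argsort(attn_weights)[::-1][:k]
    return visual_tokens[np.sort(top)]

tokens = np.arange(10 * 4).reshape(10, 4).astype(float)  # 10 tokens, dim 4
attn = np.array([0.01, 0.3, 0.02, 0.25, 0.01, 0.05, 0.2, 0.1, 0.03, 0.03])
memory = extract_concept_tokens(tokens, attn, k=3)
print(memory.shape)  # (3, 4): tokens 1, 3, and 6 were most attended
```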
[200] ParTY: Part-Guidance for Expressive Text-to-Motion Synthesis
KunHo Heo, SuYeon Kim, Yonghyun Gwon, Youngbin Kim, MyeongAh Cho
Main category: cs.CV
TL;DR: ParTY is a novel framework for text-to-motion synthesis that enhances part expressiveness while generating coherent full-body motions by using part-guided networks, part-aware text grounding, and holistic-part fusion.
Details
Motivation: Existing text-to-motion methods struggle with actions involving specific body parts, lacking explicit alignment between textual semantics and individual body parts, and often generate incoherent full-body motions when integrating independently generated part motions.
Method: ParTY framework includes: (1) Part-Guided Network that generates part motions first to obtain guidance for holistic motion generation, (2) Part-aware Text Grounding that transforms text embeddings and aligns them with each body part, and (3) Holistic-Part Fusion that adaptively fuses holistic and part motions.
Result: Extensive experiments including part-level and coherence-level evaluations demonstrate that ParTY achieves substantial improvements over previous methods in text-to-motion synthesis.
Conclusion: ParTY successfully addresses the limitations of existing methods by enhancing part expressiveness while maintaining full-body motion coherence through its novel three-component framework.
Abstract: Text-to-motion synthesis aims to generate natural and expressive human motions from textual descriptions. While existing approaches primarily focus on generating holistic motions from text descriptions, they struggle to accurately reflect actions involving specific body parts. Recent part-wise motion generation methods attempt to resolve this but face two critical limitations: (i) they lack explicit mechanisms for aligning textual semantics with individual body parts, and (ii) they often generate incoherent full-body motions due to integrating independently generated part motions. To overcome these issues and resolve the fundamental trade-off in existing methods, we propose ParTY, a novel framework that enhances part expressiveness while generating coherent full-body motions. ParTY comprises: (1) Part-Guided Network, which first generates part motions to obtain part guidance, then uses it to generate holistic motions; (2) Part-aware Text Grounding, which diversely transforms text embeddings and appropriately aligns them with each body part; and (3) Holistic-Part Fusion, which adaptively fuses holistic motions and part motions. Extensive experiments, including part-level and coherence-level evaluations, demonstrate that ParTY achieves substantial improvements over previous methods.
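The adaptive fusion idea, blending holistic and part motions, can be sketched with a per-channel sigmoid gate. The gate weights here are fixed toy values; in the paper the fusion is learned, so this only illustrates the blending arithmetic.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fuse_holistic_part(holistic, part, w):
    # A gate in (0, 1) per channel decides how much the part-generated
    # motion overrides the holistic one.
    g = sigmoid(w)
    return g * part + (1.0 - g) * holistic

holistic = np.zeros(4)                   # toy joint channels from the holistic branch
part = np.ones(4)                        # toy joint channels from the part branch
w = np.array([-10.0, 0.0, 10.0, 0.0])    # trust holistic / mix / trust part / mix
fused = fuse_holistic_part(holistic, part, w)
print(np.round(fused, 3))  # approximately [0, 0.5, 1, 0.5]
```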
[201] A saccade-inspired approach to image classification using vision transformer attention maps
Matthis Dallain, Laurent Rodriguez, Laurent Udo Perrinet, Benoît Miramond
Main category: cs.CV
TL;DR: Vision Transformers with DINO attention maps can mimic human saccadic eye movements for efficient image processing, achieving comparable or better classification performance while focusing only on key regions.
Details
Motivation: Human vision operates efficiently through selective attention and saccadic eye movements, while conventional AI systems process entire images uniformly. The paper aims to create more efficient image processing models inspired by biological vision systems.
Method: Uses DINO (self-supervised Vision Transformer) to generate attention maps similar to human gaze patterns, then implements a saccade-inspired method to selectively process key image regions. Evaluates on ImageNet classification by measuring how successive saccades affect class scores.
Result: The selective processing strategy preserves most full-image classification performance and can even outperform it in some cases. DINO provides superior fixation guidance compared to established saliency models for human gaze prediction.
Conclusion: Vision Transformer attention serves as a promising basis for biologically inspired active vision, opening new directions for efficient, neuromorphic visual processing systems.
Abstract: Human vision achieves remarkable perceptual performance while operating under strict metabolic constraints. A key ingredient is the selective attention mechanism, driven by rapid saccadic eye movements that constantly reposition the high-resolution fovea onto task-relevant locations, unlike conventional AI systems that process entire images with equal emphasis. Our work aims to draw inspiration from the human visual system to create smarter, more efficient image processing models. Using DINO, a self-supervised Vision Transformer that produces attention maps strikingly similar to human gaze patterns, we explore a saccade-inspired method to focus the processing of information on key regions in visual space. To do so, we use the ImageNet dataset in a standard classification task and measure how each successive saccade affects the model’s class scores. This selective-processing strategy preserves most of the full-image classification performance and can even outperform it in certain cases. By benchmarking against established saliency models built for human gaze prediction, we demonstrate that DINO provides superior fixation guidance for selecting informative regions. These findings highlight Vision Transformer attention as a promising basis for biologically inspired active vision and open new directions for efficient, neuromorphic visual processing.
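One plausible reading of the saccade procedure is iterative fixation on attention peaks with inhibition of return (suppress a neighbourhood around each visited peak before picking the next). The paper's exact selection rule may differ; this is a minimal sketch of that idea on a toy attention map.

```python
import numpy as np

def plan_saccades(attn_map, n_fix=3, radius=1):
    # Repeatedly fixate the most attended location, then suppress a
    # neighbourhood around it (inhibition of return).
    a = attn_map.astype(float).copy()
    fixations = []
    for _ in range(n_fix):
        y, x = np.unravel_index(np.argmax(a), a.shape)
        fixations.append((int(y), int(x)))
        y0, y1 = max(0, y - radius), min(a.shape[0], y + radius + 1)
        x0, x1 = max(0, x - radius), min(a.shape[1], x + radius + 1)
        a[y0:y1, x0:x1] = -np.inf   # inhibit revisits
    return fixations

attn = np.zeros((5, 5))
attn[1, 1], attn[3, 4], attn[0, 4] = 0.9, 0.8, 0.7
print(plan_saccades(attn, n_fix=3))  # [(1, 1), (3, 4), (0, 4)]
```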
[202] MA-EgoQA: Question Answering over Egocentric Videos from Multiple Embodied Agents
Kangsan Kim, Yanlai Yang, Suji Kim, Woongyeong Yeo, Youngwan Lee, Mengye Ren, Sung Ju Hwang
Main category: cs.CV
TL;DR: A novel benchmark (MA-EgoQA) and baseline model (EgoMAS) for understanding multiple long-horizon egocentric videos from embodied AI agents, addressing challenges in multi-agent system communication and memory construction.
Details
Motivation: As humans collaborate with multiple embodied AI agents, there's a need for better communication by interpreting multiple sensory inputs (videos) in parallel and constructing system-level memory from multiple egocentric perspectives.
Method: Introduces MultiAgent-EgoQA benchmark with 1.7k questions across 5 categories, and proposes EgoMAS baseline model that uses shared memory across agents and agent-wise dynamic retrieval.
Result: Current approaches struggle with multiple egocentric streams, showing the need for advances in system-level understanding across agents.
Conclusion: The work establishes a foundation for research in multi-agent egocentric video understanding, highlighting current limitations and providing a benchmark for future progress.
Abstract: As embodied models become powerful, humans will collaborate with multiple embodied AI agents at their workplace or home in the future. To ensure better communication between human users and the multi-agent system, it is crucial to interpret incoming information from agents in parallel and refer to the appropriate context for each query. Existing challenges include effectively compressing and communicating high volumes of individual sensory inputs in the form of video and correctly aggregating multiple egocentric videos to construct system-level memory. In this work, we first formally define a novel problem of understanding multiple long-horizon egocentric videos simultaneously collected from embodied agents. To facilitate research in this direction, we introduce MultiAgent-EgoQA (MA-EgoQA), a benchmark designed to systematically evaluate existing models in our scenario. MA-EgoQA provides 1.7k questions unique to multiple egocentric streams, spanning five categories: social interaction, task coordination, theory-of-mind, temporal reasoning, and environmental interaction. We further propose a simple baseline model for MA-EgoQA named EgoMAS, which leverages shared memory across embodied agents and agent-wise dynamic retrieval. Through comprehensive evaluation across diverse baselines and EgoMAS on MA-EgoQA, we find that current approaches are unable to effectively handle multiple egocentric streams, highlighting the need for future advances in system-level understanding across the agents. The code and benchmark are available at https://ma-egoqa.github.io.
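Agent-wise retrieval over a shared memory can be caricatured as a per-agent nearest-neighbour lookup over clip embeddings. The dictionary layout, cosine scoring, and names below are assumptions for illustration, not EgoMAS internals.

```python
import numpy as np

def agent_wise_retrieve(query, agent_memories, top_k=1):
    # For a question embedding, fetch the most similar stored clip(s)
    # per agent by cosine similarity; the pooled results form the
    # system-level context.
    out = {}
    for name, mem in agent_memories.items():
        sims = mem @ query / (np.linalg.norm(mem, axis=1) * np.linalg.norm(query) + 1e-8)
        out[name] = np.argsort(sims)[::-1][:top_k].tolist()
    return out

query = np.array([1.0, 0.0, 0.0])  # toy question embedding
memories = {
    "agent_a": np.array([[0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]),
    "agent_b": np.array([[1.0, 0.0, 0.1], [0.0, 0.0, 1.0]]),
}
print(agent_wise_retrieve(query, memories))
# {'agent_a': [1], 'agent_b': [0]}
```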
[203] Physics-Driven 3D Gaussian Rendering for Zero-Shot MRI Super-Resolution
Shuting Liu, Lei Zhang, Wei Huang, Zhao Zhang, Zizhou Wang
Main category: cs.CV
TL;DR: A zero-shot MRI super-resolution framework using explicit Gaussian representation that balances data requirements and computational efficiency for clinical MRI applications.
Details
Motivation: High-resolution MRI is clinically important but limited by long acquisition times and motion artifacts. Existing super-resolution methods face trade-offs: paired-data methods need expensive aligned datasets, while implicit neural representation approaches avoid data needs but require heavy computation.
Method: Proposes a zero-shot MRI SR framework using explicit Gaussian representation with MRI-tailored Gaussian parameters that embed tissue physical properties. Uses physics-grounded volume rendering to model MRI signal formation via normalized Gaussian aggregation, and a brick-based order-independent rasterization scheme for highly parallel 3D computation.
Result: Experiments on two public MRI datasets show superior reconstruction quality and efficiency compared to existing methods, demonstrating potential for clinical MRI super-resolution applications.
Conclusion: The proposed framework effectively balances data requirements and computational efficiency for MRI super-resolution, offering a practical solution for clinical applications where both high-quality reconstruction and computational efficiency are important.
Abstract: High-resolution Magnetic Resonance Imaging (MRI) is vital for clinical diagnosis but limited by long acquisition times and motion artifacts. Super-resolution (SR) reconstructs low-resolution scans into high-resolution images, yet existing methods are mutually constrained: paired-data methods achieve efficiency only by relying on costly aligned datasets, while implicit neural representation approaches avoid such data needs at the expense of heavy computation. We propose a zero-shot MRI SR framework using explicit Gaussian representation to balance data requirements and efficiency. MRI-tailored Gaussian parameters embed tissue physical properties, reducing learnable parameters while preserving MR signal fidelity. A physics-grounded volume rendering strategy models MRI signal formation via normalized Gaussian aggregation. Additionally, a brick-based order-independent rasterization scheme enables highly parallel 3D computation, lowering training and inference costs. Experiments on two public MRI datasets show superior reconstruction quality and efficiency, demonstrating the method’s potential for clinical MRI SR.
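The "normalized Gaussian aggregation" can be written as a weight-normalized sum: the signal at a voxel is each Gaussian's intensity weighted by its density there. The sketch below assumes isotropic Gaussians and omits the paper's MRI-tailored physical parameters.

```python
import numpy as np

def render_voxel(x, mu, sigma, intensity):
    # Weight each (isotropic) Gaussian by its density at x, then take
    # the weight-normalized sum of per-Gaussian intensities.
    d2 = np.sum((mu - x) ** 2, axis=-1)
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    return float(np.sum(w * intensity) / np.sum(w))

mu = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])  # two Gaussian centres
sigma = np.array([0.5, 0.5])
intensity = np.array([1.0, 3.0])
s = render_voxel(np.array([0.5, 0.0, 0.0]), mu, sigma, intensity)
print(round(s, 6))  # 2.0: the midpoint weights both Gaussians equally
```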
[204] Decoder-Free Distillation for Quantized Image Restoration
S. M. A. Sharif, Abdur Rehman, Seongwan Kim, Jaeho Lee
Main category: cs.CV
TL;DR: QDR framework enables efficient image restoration for edge devices by addressing quantization-distillation challenges through self-distillation, decoder-free distillation, and learnable gradient balancing.
Details
Motivation: While QAT with KD shows promise for model compression, applying it to precision-sensitive image restoration tasks faces three key bottlenecks: teacher-student capacity mismatch, spatial error amplification during decoder distillation, and optimization conflicts between reconstruction and distillation losses due to quantization noise.
Method: QDR framework includes: 1) FP32 self-distillation to eliminate capacity mismatch, 2) Decoder-Free Distillation (DFD) that corrects quantization errors only at network bottleneck to prevent spatial error amplification, 3) Learnable Magnitude Reweighting (LMR) to dynamically balance competing gradients, and 4) Edge-Friendly Model (EFM) with Learnable Degradation Gating (LDG) for spatial degradation localization.
Result: Int8 model recovers 96.5% of FP32 performance, achieves 442 FPS on NVIDIA Jetson Orin, and boosts downstream object detection by 16.3 mAP across four image restoration tasks.
Conclusion: QDR effectively addresses quantization-distillation challenges for image restoration, enabling high-performance edge deployment with minimal quality degradation while maintaining computational efficiency.
Abstract: Quantization-Aware Training (QAT), combined with Knowledge Distillation (KD), holds immense promise for compressing models for edge deployment. However, joint optimization for precision-sensitive image restoration (IR) to recover visual quality from degraded images remains largely underexplored. Directly adapting QAT-KD to low-level vision reveals three critical bottlenecks: teacher-student capacity mismatch, spatial error amplification during decoder distillation, and an optimization “tug-of-war” between reconstruction and distillation losses caused by quantization noise. To tackle these, we introduce Quantization-aware Distilled Restoration (QDR), a framework for edge-deployed IR. QDR eliminates capacity mismatch via FP32 self-distillation and prevents error amplification through Decoder-Free Distillation (DFD), which corrects quantization errors strictly at the network bottleneck. To stabilize the optimization tug-of-war, we propose a Learnable Magnitude Reweighting (LMR) that dynamically balances competing gradients. Finally, we design an Edge-Friendly Model (EFM) featuring a lightweight Learnable Degradation Gating (LDG) to dynamically modulate spatial degradation localization. Extensive experiments across four IR tasks demonstrate that our Int8 model recovers 96.5% of FP32 performance, achieves 442 frames per second (FPS) on an NVIDIA Jetson Orin, and boosts downstream object detection by 16.3 mAP
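The "tug-of-war" between objectives comes down to gradient magnitudes: one loss can dominate the shared update. LMR learns its balancing weights; the sketch below instead sets them analytically by matching gradient norms, a simplification that only illustrates why reweighting stabilizes the optimization.

```python
import numpy as np

def reweight(loss_grads, target_idx=0, eps=1e-8):
    # Rescale each loss gradient so its norm matches the reference
    # (reconstruction) gradient's, so no single objective dominates.
    # The paper learns these weights; here they are set analytically.
    ref = np.linalg.norm(loss_grads[target_idx])
    return [g * (ref / (np.linalg.norm(g) + eps)) for g in loss_grads]

g_rec = np.array([3.0, 4.0])    # reconstruction gradient, norm 5
g_kd = np.array([0.0, 50.0])    # distillation gradient, norm 50, would dominate
balanced = reweight([g_rec, g_kd])
print(round(float(np.linalg.norm(balanced[1])), 6))  # 5.0 after reweighting
```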
[205] OTPL-VIO: Robust Visual-Inertial Odometry with Optimal Transport Line Association and Adaptive Uncertainty
Zikun Chen, Wentao Zhao, Yihe Niu, Tianchen Deng, Jingchuan Wang
Main category: cs.CV
TL;DR: A stereo visual-inertial odometry system that uses deep descriptors and optimal transport for robust line feature matching in challenging environments with low texture and illumination changes.
Details
Motivation: Traditional point-based VIO systems struggle in low-texture scenes and under abrupt illumination changes where point features become sparse and unstable. Line features offer complementary geometric cues but existing point-line systems rely on point-guided line association which fails when point support is weak.
Method: Proposes a stereo point-line VIO system with dedicated deep descriptors for line segments matched using entropy-regularized optimal transport for globally consistent correspondences. Uses training-free descriptors computed by sampling and pooling network feature maps, and introduces reliability-adaptive weighting to regulate line constraint influence during optimization.
Result: Experiments on EuRoC and UMA-VI datasets, plus real-world deployments in low-texture and illumination-challenging environments, show improved accuracy and robustness over representative baselines while maintaining real-time performance.
Conclusion: The proposed system effectively addresses limitations of point-based VIO in challenging environments by leveraging robust line feature matching with deep descriptors and optimal transport, demonstrating practical applicability in real-world scenarios.
Abstract: Robust stereo visual-inertial odometry (VIO) remains challenging in low-texture scenes and under abrupt illumination changes, where point features become sparse and unstable, leading to ambiguous association and under-constrained estimation. Line structures offer complementary geometric cues, yet many efficient point-line systems still rely on point-guided line association, which can break down when point support is weak and may lead to biased constraints. We present a stereo point-line VIO system in which line segments are equipped with dedicated deep descriptors and matched using an entropy-regularized optimal transport formulation, enabling globally consistent correspondences under ambiguity, outliers, and partial observations. The proposed descriptor is training-free and is computed by sampling and pooling network feature maps. To improve estimation stability, we analyze the impact of line measurement noise and introduce reliability-adaptive weighting to regulate the influence of line constraints during optimization. Experiments on EuRoC and UMA-VI, together with real-world deployments in low-texture and illumination-challenging environments, demonstrate improved accuracy and robustness over representative baselines while maintaining real-time performance.
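Entropy-regularized optimal transport is typically solved with Sinkhorn iterations: descriptor distances form a cost matrix, and alternating row/column normalizations yield a soft, globally consistent assignment. This minimal sketch assumes uniform marginals and omits the outlier/partial-observation handling (e.g. dustbin rows) a real matcher would need.

```python
import numpy as np

def sinkhorn_match(desc_a, desc_b, eps=0.1, iters=100):
    # Cost = pairwise descriptor distance; Sinkhorn scaling produces a
    # transport plan whose row/column masses match uniform marginals.
    cost = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=-1)
    K = np.exp(-cost / eps)
    n_a, n_b = len(desc_a), len(desc_b)
    r, c = np.full(n_a, 1.0 / n_a), np.full(n_b, 1.0 / n_b)
    v = np.ones(n_b)
    for _ in range(iters):
        u = r / (K @ v)
        v = c / (K.T @ u)
    P = u[:, None] * K * v[None, :]   # soft transport plan
    return P.argmax(axis=1)           # hard matches from the plan

rng = np.random.default_rng(1)
desc_a = rng.standard_normal((4, 8))
perm = np.array([2, 0, 3, 1])                      # ground-truth shuffle
desc_b = desc_a[perm] + 0.01 * rng.standard_normal((4, 8))
matches = sinkhorn_match(desc_a, desc_b)
print(matches.tolist())  # [1, 3, 0, 2], the inverse permutation
```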
[206] Adaptive Clinical-Aware Latent Diffusion for Multimodal Brain Image Generation and Missing Modality Imputation
Rong Zhou, Houliang Zhou, Yao Su, Brian Y. Chen, Yu Zhang, Lifang He, Alzheimer’s Disease Neuroimaging Initiative
Main category: cs.CV
TL;DR: ACADiff is a diffusion-based framework that synthesizes missing brain imaging modalities (sMRI, FDG-PET, AV45-PET) for Alzheimer’s disease diagnosis by adaptively fusing available data with clinical metadata guidance.
Details
Motivation: Clinical neuroimaging datasets often have missing modalities, which hinders comprehensive Alzheimer's disease diagnosis. Existing methods struggle with extreme missing scenarios and fail to effectively integrate clinical metadata for guidance.
Method: Uses adaptive clinical-aware diffusion with progressive denoising of latent representations. Features adaptive fusion that dynamically reconfigures based on input availability, semantic clinical guidance via GPT-4o-encoded prompts, and three specialized generators for bidirectional synthesis among sMRI, FDG-PET, and AV45-PET.
Result: Achieves superior generation quality and maintains robust diagnostic performance even under extreme 80% missing scenarios on ADNI subjects, outperforming all existing baselines.
Conclusion: ACADiff effectively addresses missing modality challenges in Alzheimer’s disease diagnosis through adaptive diffusion with clinical guidance, demonstrating strong performance in extreme missing scenarios.
Abstract: Multimodal neuroimaging provides complementary insights for Alzheimer’s disease diagnosis, yet clinical datasets frequently suffer from missing modalities. We propose ACADiff, a framework that synthesizes missing brain imaging modalities through adaptive clinical-aware diffusion. ACADiff learns mappings between incomplete multimodal observations and target modalities by progressively denoising latent representations while attending to available imaging data and clinical metadata. The framework employs adaptive fusion that dynamically reconfigures based on input availability, coupled with semantic clinical guidance via GPT-4o-encoded prompts. Three specialized generators enable bidirectional synthesis among sMRI, FDG-PET, and AV45-PET. Evaluated on ADNI subjects, ACADiff achieves superior generation quality and maintains robust diagnostic performance even under extreme 80% missing scenarios, outperforming all existing baselines. To promote reproducibility, code is available at https://github.com/rongzhou7/ACADiff
[207] DiffWind: Physics-Informed Differentiable Modeling of Wind-Driven Object Dynamics
Yuanhang Lei, Boming Zhao, Zesong Yang, Xingxuan Li, Tao Cheng, Haocheng Peng, Ru Zhang, Yang Yang, Siyuan Huang, Yujun Shen, Ruizhen Hu, Hujun Bao, Zhaopeng Cui
Main category: cs.CV
TL;DR: DiffWind: A physics-informed differentiable framework for reconstructing and simulating wind-driven object dynamics from video using differentiable rendering, MPM for wind-object interaction, and LBM for fluid dynamics constraints.
Details
Motivation: Wind-driven object dynamics are challenging to model from video due to invisible wind, spatio-temporal variability, and complex object deformations. Current methods lack physics-based approaches for reconstructing wind forces and simulating interactions.
Method: Unified framework using: 1) Grid-based wind field representation, 2) Particle-based object representation from 3D Gaussian Splatting, 3) Material Point Method (MPM) for wind-object interaction, 4) Differentiable rendering and simulation for joint optimization, 5) Lattice Boltzmann Method (LBM) as physics-informed constraint.
Result: Significantly outperforms prior dynamic scene modeling approaches in both reconstruction accuracy and simulation fidelity. Introduces WD-Objects dataset and enables novel applications like wind retargeting and forward simulation under new wind conditions.
Conclusion: DiffWind opens new avenues for video-based wind-object interaction modeling by unifying reconstruction and simulation with physics-informed constraints, enabling accurate recovery of invisible wind forces and realistic forward simulation.
Abstract: Modeling wind-driven object dynamics from video observations is highly challenging due to the invisibility and spatio-temporal variability of wind, as well as the complex deformations of objects. We present DiffWind, a physics-informed differentiable framework that unifies wind-object interaction modeling, video-based reconstruction, and forward simulation. Specifically, we represent wind as a grid-based physical field and objects as particle systems derived from 3D Gaussian Splatting, with their interaction modeled by the Material Point Method (MPM). To recover wind-driven object dynamics, we introduce a reconstruction framework that jointly optimizes the spatio-temporal wind force field and object motion through differentiable rendering and simulation. To ensure physical validity, we incorporate the Lattice Boltzmann Method (LBM) as a physics-informed constraint, enforcing compliance with fluid dynamics laws. Beyond reconstruction, our method naturally supports forward simulation under novel wind conditions and enables new applications such as wind retargeting. We further introduce WD-Objects, a dataset of synthetic and real-world wind-driven scenes. Extensive experiments demonstrate that our method significantly outperforms prior dynamic scene modeling approaches in both reconstruction accuracy and simulation fidelity, opening a new avenue for video-based wind-object interaction modeling.
[208] VarSplat: Uncertainty-aware 3D Gaussian Splatting for Robust RGB-D SLAM
Anh Thuan Tran, Jana Kosecka
Main category: cs.CV
TL;DR: VarSplat introduces uncertainty-aware 3D Gaussian Splatting for SLAM by learning per-splat appearance variance to improve robustness in challenging scenes.
Details
Motivation: Existing 3DGS-SLAM approaches handle measurement reliability implicitly, making them susceptible to drift in low-texture regions, transparent surfaces, or areas with complex reflectance properties.
Method: Explicitly learns per-splat appearance variance and uses law of total variance with alpha compositing to render differentiable per-pixel uncertainty maps via efficient single-pass rasterization.
Result: Improves robustness and achieves competitive or superior tracking, mapping, and novel view synthesis rendering on Replica, TUM-RGBD, ScanNet, and ScanNet++ datasets compared to existing dense RGB-D SLAM methods.
Conclusion: VarSplat’s uncertainty-aware approach enhances 3DGS-SLAM robustness by focusing optimization on reliable regions, addressing limitations of existing methods in challenging visual conditions.
Abstract: Simultaneous Localization and Mapping (SLAM) with 3D Gaussian Splatting (3DGS) enables fast, differentiable rendering and high-fidelity reconstruction across diverse real-world scenes. However, existing 3DGS-SLAM approaches handle measurement reliability implicitly, making pose estimation and global alignment susceptible to drift in low-texture regions, transparent surfaces, or areas with complex reflectance properties. To this end, we introduce VarSplat, an uncertainty-aware 3DGS-SLAM system that explicitly learns per-splat appearance variance. By using the law of total variance with alpha compositing, we then render a differentiable per-pixel uncertainty map via efficient, single-pass rasterization. This map guides tracking, submap registration, and loop detection toward focusing on reliable regions and contributes to more stable optimization. Experimental results on Replica (synthetic) and TUM-RGBD, ScanNet, and ScanNet++ (real-world) show that VarSplat improves robustness and achieves competitive or superior tracking, mapping, and novel view synthesis rendering compared to existing studies for dense RGB-D SLAM.
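The law-of-total-variance step can be worked through per pixel: with front-to-back alpha-compositing weights w_i = a_i * prod_{j<i}(1 - a_j), the composited variance is sum_i w_i*(var_i + mean_i^2) minus the squared composited mean. This 1D sketch assumes the weights sum to roughly one (e.g. an opaque final splat); it is not VarSplat's rasterizer.

```python
import numpy as np

def composite_mean_var(alphas, means, variances):
    # Front-to-back alpha-compositing weights, then the law of total
    # variance: Var = sum_i w_i*(var_i + mean_i^2) - mean^2.
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alphas[:-1])))
    w = alphas * trans
    mean = np.sum(w * means)
    second = np.sum(w * (variances + means ** 2))
    return float(mean), float(second - mean ** 2)

alphas = np.array([0.5, 1.0])     # second splat opaque, so weights are 0.5, 0.5
means = np.array([1.0, 3.0])      # per-splat appearance means
variances = np.array([0.0, 0.0])  # zero per-splat variance
m, v = composite_mean_var(alphas, means, variances)
print(m, v)  # 2.0 1.0: disagreement between splats alone creates variance
```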
[209] No Image, No Problem: End-to-End Multi-Task Cardiac Analysis from Undersampled k-Space
Yundi Zhang, Sevgi Gokce Kafali, Niklas Bubeck, Daniel Rueckert, Jiazhen Pan
Main category: cs.CV
TL;DR: k-MTR is a k-space representation learning framework that directly extracts diagnostic information from undersampled k-space data without intermediate image reconstruction, enabling multi-task analysis for cardiac MRI.
Details
Motivation: Traditional cardiac MRI pipelines follow a "reconstruct-then-analyze" approach that introduces artifacts and information bottlenecks by forcing image reconstruction from undersampled k-space, rather than directly extracting the low-dimensional physiological labels needed for diagnosis.
Method: Proposes k-MTR framework that aligns undersampled k-space data and fully-sampled images into a shared semantic manifold using a k-space encoder. The method forces the encoder to restore anatomical information lost to undersampling directly in latent space, bypassing explicit image reconstruction for downstream analysis.
Result: k-MTR achieves competitive performance across continuous phenotype regression, disease classification, and anatomical segmentation tasks against state-of-the-art image-domain baselines, demonstrating that precise spatial geometries and multi-task features can be recovered directly from k-space representations.
Conclusion: k-MTR provides a robust architectural blueprint for task-aware cardiac MRI workflows by showing that direct k-space analysis can bypass reconstruction bottlenecks and enable efficient multi-task learning from undersampled data.
Abstract: Conventional clinical CMR pipelines rely on a sequential “reconstruct-then-analyze” paradigm, forcing an ill-posed intermediate step that introduces avoidable artifacts and information bottlenecks. This creates a fundamental mathematical paradox: it attempts to recover high-dimensional pixel arrays (i.e., images) from undersampled k-space, rather than directly extracting the low-dimensional physiological labels actually required for diagnosis. To unlock the direct diagnostic potential of k-space, we propose k-MTR (k-space Multi-Task Representation), a k-space representation learning framework that aligns undersampled k-space data and fully-sampled images into a shared semantic manifold. Leveraging a large-scale controlled simulation of 42,000 subjects, k-MTR forces the k-space encoder to restore anatomical information lost to undersampling directly within the latent space, bypassing the explicit inverse problem for downstream analysis. We demonstrate that this latent alignment enables the dense latent space embedded with high-level physiological semantics directly from undersampled frequencies. Across continuous phenotype regression, disease classification, and anatomical segmentation, k-MTR achieves highly competitive performance against state-of-the-art image-domain baselines. By showcasing that precise spatial geometries and multi-task features can be successfully recovered directly from the k-space representations, k-MTR provides a robust architectural blueprint for task-aware cardiac MRI workflows.
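The setting can be made concrete with a toy Cartesian undersampling mask and an alignment objective: fully-sampled k-space is the 2D FFT of the image, undersampling drops phase-encode lines, and the k-space encoder's embedding is pulled toward the frozen fully-sampled image embedding. The mask pattern and L2 alignment loss here are illustrative assumptions, not k-MTR's exact training objective.

```python
import numpy as np

def undersample_kspace(image, keep_every=3):
    # Fully-sampled k-space is the 2D FFT; undersampling keeps only
    # every n-th phase-encode line (a toy Cartesian mask).
    k = np.fft.fft2(image)
    mask = np.zeros(k.shape, dtype=bool)
    mask[::keep_every, :] = True
    return k * mask, mask

def alignment_loss(z_kspace, z_image):
    # Latent alignment: pull the k-space embedding toward the frozen
    # fully-sampled image embedding (the two encoders are not shown).
    return float(np.mean((z_kspace - z_image) ** 2))

img = np.random.default_rng(3).standard_normal((8, 8))
k_us, mask = undersample_kspace(img, keep_every=3)
print(int(mask.sum()))  # 24: rows 0, 3, 6 of 8 are kept (3 * 8 samples)
```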
[210] Improving 3D Foot Motion Reconstruction in Markerless Monocular Human Motion Capture
Tom Wehrbein, Bodo Rosenhahn
Main category: cs.CV
TL;DR: FootMR refines 3D foot motion from videos by lifting 2D foot keypoints to 3D using motion capture data, improving fine-grained foot articulation accuracy.
Details
Motivation: Current 3D human motion recovery methods fail to capture fine-grained foot articulations, which are critical for applications like gait analysis and animation, due to inaccurate foot annotations and limited motion diversity in training datasets.
Method: FootMR refines foot motion by lifting 2D foot keypoint sequences to 3D, avoiding direct image input to circumvent inaccurate image-3D annotation pairs. It incorporates knee and foot motion as context, predicts residual foot motion, uses global joint rotations instead of parent-relative ones, and applies extensive data augmentation.
Result: FootMR outperforms state-of-the-art methods on MOOF, MOYO, and RICH datasets, reducing ankle joint angle error on MOYO by up to 30% compared to the best video-based approach.
Conclusion: FootMR effectively addresses the foot motion refinement problem by leveraging motion capture data and 2D keypoint lifting, significantly improving fine-grained foot articulation accuracy in 3D human motion recovery.
Abstract: State-of-the-art methods can recover accurate overall 3D human body motion from in-the-wild videos. However, they often fail to capture fine-grained articulations, especially in the feet, which are critical for applications such as gait analysis and animation. This limitation results from training datasets with inaccurate foot annotations and limited foot motion diversity. We address this gap with FootMR, a Foot Motion Refinement method that refines foot motion estimated by an existing human recovery model through lifting 2D foot keypoint sequences to 3D. By avoiding direct image input, FootMR circumvents inaccurate image-3D annotation pairs and can instead leverage large-scale motion capture data. To resolve ambiguities of 2D-to-3D lifting, FootMR incorporates knee and foot motion as context and predicts only residual foot motion. Generalization to extreme foot poses is further improved by representing joints in global rather than parent-relative rotations and applying extensive data augmentation. To support evaluation of foot motion reconstruction, we introduce MOOF, a 2D dataset of complex foot movements. Experiments on MOOF, MOYO, and RICH show that FootMR outperforms state-of-the-art methods, reducing ankle joint angle error on MOYO by up to 30% over the best video-based approach.
[211] TemporalDoRA: Temporal PEFT for Robust Surgical Video Question Answering
Luca Carlini, Chiara Lena, Cesare Hassan, Danail Stoyanov, Elena De Momi, Sophia Bano, Mobarak I. Hoque
Main category: cs.CV
TL;DR: TemporalDoRA is a video-specific PEFT method that adds temporal attention in low-rank adaptation to improve surgical VideoQA by enabling temporal-aware updates while maintaining parameter efficiency.
Details
Motivation: Standard PEFT methods for VideoQA lack explicit modeling of frame-to-frame interactions, limiting their ability to exploit sparse temporal evidence and making them vulnerable to linguistic bias from question phrasing variations.
Method: Extends Weight-Decomposed Low-Rank Adaptation by inserting lightweight temporal Multi-Head Attention inside the low-rank bottleneck of the vision encoder and selectively applying weight decomposition only to the trainable low-rank branch.
Result: Improves Out-of-Template performance on REAL-Colon-VQA dataset (6,424 clip-question pairs) and shows consistent improvements on EndoVis18-VQA adapted to short clips, with temporal mixing identified as primary driver of gains.
Conclusion: TemporalDoRA enables temporally-aware updates while preserving frozen backbone, improving robustness to linguistic variation in surgical VideoQA with minimal parameter overhead.
Abstract: Surgical Video Question Answering (VideoQA) requires accurate temporal grounding while remaining robust to natural variation in how clinicians phrase questions, where linguistic bias can arise. Standard Parameter Efficient Fine Tuning (PEFT) methods adapt pretrained projections without explicitly modeling frame-to-frame interactions within the adaptation pathway, limiting their ability to exploit sparse temporal evidence. We introduce TemporalDoRA, a video-specific PEFT formulation that extends Weight-Decomposed Low-Rank Adaptation by (i) inserting lightweight temporal Multi-Head Attention (MHA) inside the low-rank bottleneck of the vision encoder and (ii) selectively applying weight decomposition only to the trainable low-rank branch rather than the full adapted weight. This design enables temporally-aware updates while preserving a frozen backbone and stable scaling. By mixing information across frames within the adaptation subspace, TemporalDoRA steers updates toward temporally consistent visual cues and improves robustness with minimal parameter overhead. To benchmark this setting, we present REAL-Colon-VQA, a colonoscopy VideoQA dataset with 6,424 clip–question pairs, including paired rephrased Out-of-Template questions to evaluate sensitivity to linguistic variation. TemporalDoRA improves Out-of-Template performance, and ablation studies confirm that temporal mixing inside the low-rank branch is the primary driver of these gains. We also validate on EndoVis18-VQA adapted to short clips and observe consistent improvements on the Out-of-Template split. Code and dataset are available at https://anonymous.4open.science/r/TemporalDoRA-BFC8/.
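The core architectural idea, temporal mixing inside the low-rank adaptation branch, can be sketched in a few lines of numpy. This is a toy single-head version with assumed dimensions; the paper uses multi-head attention and applies DoRA-style weight decomposition to the trainable branch.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, d, r = 6, 32, 4                      # frames, feature dim, low rank
x = rng.normal(size=(T, d))             # one spatial position across T frames

W0 = rng.normal(size=(d, d)) / np.sqrt(d)   # frozen pretrained projection
A = rng.normal(size=(d, r)) / np.sqrt(d)    # trainable down-projection
B = np.zeros((r, d))                        # trainable up-projection (zero init)
Wq = rng.normal(size=(r, r))
Wk = rng.normal(size=(r, r))
Wv = rng.normal(size=(r, r))

h = x @ A                               # into the low-rank bottleneck (T, r)
attn = softmax((h @ Wq) @ (h @ Wk).T / np.sqrt(r))  # frame-to-frame attention
h = attn @ (h @ Wv)                     # temporal mixing across frames
y = x @ W0 + h @ B                      # frozen path + temporally-aware update

assert y.shape == (T, d)
# With B initialized to zero, the adapter starts as an exact no-op:
assert np.allclose(y, x @ W0)
```

The attention operates on r-dimensional bottleneck features, so the temporal module adds only O(r²) weights per layer while the d×d backbone stays frozen.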
[212] TriFusion-SR: Joint Tri-Modal Medical Image Fusion and SR
Fayaz Ali Dharejo, Sharif S. M. A., Aiman Khalil, Nachiket Chaudhary, Rizwan Ali Naqvi, Radu Timofte
Main category: cs.CV
TL;DR: TriFusionSR: A wavelet-guided conditional diffusion framework for joint tri-modal medical image fusion and super-resolution that addresses resolution degradation and modality discrepancies through frequency-aware crossmodal interaction.
Details
Motivation: Multimodal medical image fusion combines complementary structural and functional information for comprehensive diagnosis, but current approaches suffer from resolution degradation, modality discrepancies, and artifacts from separate fusion and super-resolution stages, especially in tri-modal settings with anatomical (MRI, CT) and functional (PET, SPECT) scans.
Method: Proposes TriFusionSR, a wavelet-guided conditional diffusion framework that: 1) Uses 2D Discrete Wavelet Transform to decompose multimodal features into frequency bands for frequency-aware crossmodal interaction, 2) Implements Rectified Wavelet Features (RWF) for latent coefficient calibration, and 3) Employs Adaptive Spatial-Frequency Fusion (ASFF) module with gated channel-spatial attention for structure-driven multimodal refinement.
Result: Achieves state-of-the-art performance with 4.8-12.4% PSNR improvement and substantial reductions in RMSE and LPIPS across multiple upsampling scales in extensive experiments.
Conclusion: TriFusionSR effectively addresses the limitations of separate fusion and super-resolution stages in multimodal medical imaging by enabling joint processing through frequency-aware decomposition and adaptive spatial-frequency fusion, resulting in superior perceptual quality and diagnostic utility.
Abstract: Multimodal medical image fusion facilitates comprehensive diagnosis by aggregating complementary structural and functional information, but its effectiveness is limited by resolution degradation and modality discrepancies. Existing approaches typically perform image fusion and super-resolution (SR) in separate stages, leading to artifacts and degraded perceptual quality. These limitations are further amplified in tri-modal settings that combine anatomical modalities (e.g., MRI, CT) with functional scans (e.g., PET, SPECT) due to pronounced frequency domain imbalances. We propose TriFusionSR, a wavelet-guided conditional diffusion framework for joint tri-modal fusion and SR. The framework explicitly decomposes multimodal features into frequency bands using the 2D Discrete Wavelet Transform, enabling frequency-aware crossmodal interaction. We further introduce a Rectified Wavelet Features (RWF) strategy for latent coefficient calibration, followed by an Adaptive Spatial-Frequency Fusion (ASFF) module with gated channel-spatial attention to enable structure-driven multimodal refinement. Extensive experiments demonstrate state-of-the-art performance, achieving 4.8-12.4% PSNR improvement and substantial reductions in RMSE and LPIPS across multiple upsampling scales.
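The frequency-band decomposition TriFusionSR builds on can be illustrated with a hand-rolled single-level 2D Haar DWT. This is only a sketch of the transform step; the paper's wavelet choice is not specified here, and the fusion and diffusion components are not reproduced.

```python
import numpy as np

def haar_dwt2(img):
    """Split an even-sized image into LL/LH/HL/HH sub-bands (Haar filters)."""
    a, b = img[0::2, :], img[1::2, :]
    lo, hi = (a + b) / 2.0, (a - b) / 2.0           # filter rows
    ll = (lo[:, 0::2] + lo[:, 1::2]) / 2.0          # then filter columns
    lh = (lo[:, 0::2] - lo[:, 1::2]) / 2.0
    hl = (hi[:, 0::2] + hi[:, 1::2]) / 2.0
    hh = (hi[:, 0::2] - hi[:, 1::2]) / 2.0
    return ll, lh, hl, hh

rng = np.random.default_rng(0)
mri = rng.normal(size=(8, 8))           # stand-in for one modality's features
ll, lh, hl, hh = haar_dwt2(mri)
assert ll.shape == (4, 4)

# Sanity check: a constant image carries all its energy in the LL band.
flat = np.ones((8, 8))
ll_f, lh_f, hl_f, hh_f = haar_dwt2(flat)
assert np.allclose(ll_f, 1.0) and np.allclose(hh_f, 0.0)
```

Decomposing each modality this way is what makes "frequency-aware" interaction possible: anatomical detail concentrates in the high-frequency bands while functional scans dominate the low-frequency band, so they can be fused per band instead of per pixel.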
[213] ProGS: Towards Progressive Coding for 3D Gaussian Splatting
Zhiye Tang, Lingzhuo Liu, Shengjie Jiao, Qiudan Zhang, Junhui Hou, You Yang, Xu Wang
Main category: cs.CV
TL;DR: ProGS introduces progressive coding for 3D Gaussian Splatting using octree organization, achieving 45x compression and 10% visual improvement.
Details
Motivation: 3D Gaussian Splatting (3DGS) generates massive data that poses storage/transmission challenges, and existing methods lack progressive coding support needed for streaming applications with varying bandwidth.
Method: Organizes 3DGS data into an octree structure for progressive coding, incorporates mutual information enhancement to reduce structural redundancy, adapts the octree structure and dynamically adjusts anchor nodes for scalable compression.
Result: Achieves 45x reduction in file storage compared to original 3DGS format while improving visual performance by over 10%.
Conclusion: ProGS provides a robust streaming-friendly solution for real-time applications with varying network conditions through efficient progressive coding of 3DGS data.
Abstract: With the emergence of 3D Gaussian Splatting (3DGS), numerous pioneering efforts have been made to address the effective compression of massive 3DGS data. 3DGS offers an efficient and scalable representation of 3D scenes by utilizing learnable 3D Gaussians, but the large size of the generated data has posed significant challenges for storage and transmission. Existing methods, however, have been limited by their inability to support progressive coding, a crucial feature in streaming applications with varying bandwidth. To tackle this limitation, this paper introduces a novel approach that organizes 3DGS data into an octree structure, enabling efficient progressive coding. The proposed ProGS is a streaming-friendly codec that facilitates progressive coding for 3D Gaussian splatting, and significantly improves both compression efficiency and visual fidelity. The proposed method incorporates mutual information enhancement mechanisms to mitigate structural redundancy, leveraging the relevance between nodes in the octree hierarchy. By adapting the octree structure and dynamically adjusting the anchor nodes, ProGS ensures scalable data compression without compromising the rendering quality. ProGS achieves a remarkable 45X reduction in file storage compared to the original 3DGS format, while simultaneously improving visual performance by over 10%. This demonstrates that ProGS can provide a robust solution for real-time applications with varying network conditions.
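The progressive property of octree coding can be shown with a toy coder. This is a sketch of octree occupancy coding in general, not ProGS's actual codec: each level adds one bit of spatial precision per axis, so a decoder can stop after any number of received levels and still reconstruct a coarse scene.

```python
import numpy as np

def octree_codes(points, depth):
    """Per-level child indices (0-7) for each point in [0,1)^3 down to `depth`."""
    codes = []
    p = np.asarray(points, dtype=float)
    for _ in range(depth):
        bits = (p >= 0.5).astype(int)           # which half-cell per axis
        codes.append(bits[:, 0] * 4 + bits[:, 1] * 2 + bits[:, 2])
        p = (p - 0.5 * bits) * 2.0              # recurse into the child cell
    return codes

def decode(codes):
    """Reconstruct cell centers from however many levels were received."""
    n = len(codes[0])
    lo, size = np.zeros((n, 3)), 1.0
    for c in codes:
        bits = np.stack([(c // 4) % 2, (c // 2) % 2, c % 2], axis=1)
        lo = lo + bits * (size / 2.0)
        size /= 2.0
    return lo + size / 2.0                      # center of the final cell

pts = np.array([[0.3, 0.7, 0.1], [0.9, 0.2, 0.6]])
codes = octree_codes(pts, depth=4)
approx = decode(codes)
# Error shrinks geometrically: at depth 4, within half of a 1/16-sized cell.
assert np.max(np.abs(approx - pts)) <= 0.5 ** 4
```

Truncating `codes` to fewer levels yields a coarser but still valid reconstruction, which is exactly the property a varying-bandwidth stream needs.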
[214] GSStream: 3D Gaussian Splatting based Volumetric Scene Streaming System
Zhiye Tang, Qiudan Zhang, Lei Zhang, Junhui Hou, You Yang, Xu Wang
Main category: cs.CV
TL;DR: GSStream is a volumetric scene streaming system for 3D Gaussian splatting data that uses collaborative viewport prediction and DRL-based bitrate adaptation for efficient real-time distribution.
Details
Motivation: 3D Gaussian splatting provides immersive volumetric scene representation but creates large data volumes that are bandwidth-intensive for real-time distribution. Existing compression approaches still face challenges for real-time streaming.
Method: 1) Collaborative viewport prediction module learning from multiple users’ historical data and viewport sequences, 2) Deep reinforcement learning-based bitrate adaptation to handle variable state/action spaces, 3) First volumetric scene viewport trajectory dataset for training.
Result: GSStream outperforms existing volumetric scene streaming systems in both visual quality and network usage efficiency, as demonstrated through extensive experiments.
Conclusion: The proposed GSStream system effectively addresses the real-time distribution challenges of 3DGS data through intelligent viewport prediction and adaptive bitrate control.
Abstract: Recently, the 3D Gaussian splatting (3DGS) technique for real-time radiance field rendering has revolutionized the field of volumetric scene representation, providing users with an immersive experience. In return, however, it produces a large volume of data that is extremely bandwidth-intensive. Cutting-edge researchers have introduced different approaches and constructed multiple variants of 3DGS to obtain a more compact scene representation, but real-time distribution remains challenging. In this paper, we propose GSStream, a novel volumetric scene streaming system that supports the 3DGS data format. Specifically, GSStream integrates a collaborative viewport prediction module, which better predicts users’ future behaviors by learning collaborative and historical priors from multiple users’ viewport sequences, and a deep reinforcement learning (DRL)-based bitrate adaptation module, which tackles the state- and action-space variability of the bitrate adaptation problem, achieving efficient volumetric scene delivery. In addition, we build the first user viewport trajectory dataset for volumetric scenes to support training and streaming simulation. Extensive experiments show that our proposed GSStream system outperforms existing representative volumetric scene streaming systems in visual quality and network usage. Demo video: https://youtu.be/3WEe8PN8yvA.
[215] FrameDiT: Diffusion Transformer with Frame-Level Matrix Attention for Efficient Video Generation
Minh Khoa Le, Kien Do, Duc Thanh Nguyen, Truyen Tran
Main category: cs.CV
TL;DR: FrameDiT introduces Matrix Attention for video diffusion models, using frame-level attention instead of token-level to better capture global spatio-temporal dynamics while maintaining efficiency.
Details
Motivation: Existing video diffusion models face a trade-off between expensive Full 3D Attention (strong but costly) and efficient Local Factorized Attention (limited temporal modeling). There's a need for an attention mechanism that can effectively capture complex spatio-temporal dynamics in videos while maintaining computational efficiency.
Method: Proposes Matrix Attention, a frame-level temporal attention mechanism that processes entire frames as matrices using matrix-native operations. Builds FrameDiT-G using Matrix Attention, and FrameDiT-H which combines Matrix Attention with Local Factorized Attention to capture both large and small motion.
Result: FrameDiT-H achieves state-of-the-art results across multiple video generation benchmarks, offering improved temporal coherence and video quality while maintaining efficiency comparable to Local Factorized Attention.
Conclusion: Matrix Attention effectively resolves the trade-off between modeling capacity and efficiency in video diffusion models, enabling better global spatio-temporal structure preservation and adaptation to significant motion in videos.
Abstract: High-fidelity video generation remains challenging for diffusion models due to the difficulty of modeling complex spatio-temporal dynamics efficiently. Recent video diffusion methods typically represent a video as a sequence of spatio-temporal tokens which can be modeled using Diffusion Transformers (DiTs). However, this approach faces a trade-off between the strong but expensive Full 3D Attention and the efficient but temporally limited Local Factorized Attention. To resolve this trade-off, we propose Matrix Attention, a frame-level temporal attention mechanism that processes an entire frame as a matrix and generates query, key, and value matrices via matrix-native operations. By attending across frames rather than tokens, Matrix Attention effectively preserves global spatio-temporal structure and adapts to significant motion. We build FrameDiT-G, a DiT architecture based on Matrix Attention, and further introduce FrameDiT-H, which integrates Matrix Attention with Local Factorized Attention to capture both large and small motion. Extensive experiments show that FrameDiT-H achieves state-of-the-art results across multiple video generation benchmarks, offering improved temporal coherence and video quality while maintaining efficiency comparable to Local Factorized Attention.
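A toy numpy version of frame-level attention illustrates why the cost scales with the number of frames rather than tokens. This is a single-head simplification with assumed projections; the paper's matrix-native Q/K/V construction is more involved.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
T, H, W = 4, 8, 8                       # frames, spatial size
frames = rng.normal(size=(T, H, W))     # each frame kept as a whole matrix

Wq = rng.normal(size=(W, W)) / np.sqrt(W)
Wk = rng.normal(size=(W, W)) / np.sqrt(W)
Wv = rng.normal(size=(W, W)) / np.sqrt(W)
Q, K, V = frames @ Wq, frames @ Wk, frames @ Wv   # matrix-native projections

# Frame-to-frame similarity: one scalar per frame pair, a T x T score
# matrix, versus (T*H*W)^2 scores for full 3D token attention.
scores = np.einsum('thw,shw->ts', Q, K) / np.sqrt(H * W)
out = np.stack([softmax(scores[t]) @ V.reshape(T, -1) for t in range(T)])
out = out.reshape(T, H, W)

assert out.shape == (T, H, W)
assert scores.shape == (T, T)
```

Because the attention matrix is only T × T, temporal mixing stays cheap even at high spatial resolution, which is the efficiency side of the trade-off the paper targets.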
[216] ENIGMA-360: An Ego-Exo Dataset for Human Behavior Understanding in Industrial Scenarios
Francesco Ragusa, Rosario Leonardi, Michele Mazzamuto, Daniele Di Mauro, Camillo Quattrocchi, Alessandro Passanisi, Irene D’Ambra, Antonino Furnari, Giovanni Maria Farinella
Main category: cs.CV
TL;DR: ENIGMA-360 is a new ego-exo dataset for industrial human behavior understanding with synchronized 360° videos and annotations for temporal action segmentation, keystep recognition, and egocentric human-object interaction detection.
Details
Motivation: Current progress in understanding human behavior from complementary egocentric and exocentric views is hindered by lack of datasets capturing both views in realistic industrial scenarios, which is needed for developing systems to support workers and enhance safety.
Method: Created ENIGMA-360 dataset with 180 egocentric and 180 exocentric procedural videos temporally synchronized in real industrial settings, with temporal and spatial annotations. Conducted baseline experiments for three foundational tasks using state-of-the-art approaches.
Result: Baseline experiments show limitations of current state-of-the-art approaches on this challenging industrial scenario, highlighting the need for new models capable of robust ego-exo understanding in real-world environments.
Conclusion: ENIGMA-360 addresses the dataset gap for ego-exo understanding in industrial settings and demonstrates current model limitations, paving way for development of more robust multimodal understanding systems for real-world applications.
Abstract: Understanding human behavior from complementary egocentric (ego) and exocentric (exo) points of view enables the development of systems that can support workers in industrial environments and enhance their safety. However, progress in this area is hindered by the lack of datasets capturing both views in realistic industrial scenarios. To address this gap, we propose ENIGMA-360, a new ego-exo dataset acquired in a real industrial scenario. The dataset is composed of 180 egocentric and 180 exocentric procedural videos, temporally synchronized to offer complementary information on the same scene. All 360 videos have been labeled with temporal and spatial annotations, enabling the study of different aspects of human behavior in the industrial domain. We provide baseline experiments for 3 foundational tasks for human behavior understanding: 1) Temporal Action Segmentation, 2) Keystep Recognition, and 3) Egocentric Human-Object Interaction Detection, showing the limits of state-of-the-art approaches in this challenging scenario. These results highlight the need for new models capable of robust ego-exo understanding in real-world environments. We publicly release the dataset and its annotations at https://iplab.dmi.unict.it/ENIGMA-360.
[217] LAP: A Language-Aware Planning Model For Procedure Planning In Instructional Videos
Lei Shi, Victor Aregbede, Andreas Persson, Martin Längkvist, Amy Loutfi, Stephanie Lowry
Main category: cs.CV
TL;DR: LAP uses language descriptions from VLMs to create distinctive text embeddings for procedure planning, outperforming visual-only methods on multiple benchmarks.
Details
Motivation: Existing procedure planning methods relying on visual observations struggle with ambiguity where different actions appear visually similar. Language descriptions offer more distinctive representations in latent space.
Method: Language-Aware Planning (LAP) uses a finetuned Vision Language Model to translate visual observations into text descriptions, predict actions, and extract text embeddings. These embeddings are used in a diffusion model for planning action sequences.
Result: LAP achieves new state-of-the-art performance across multiple metrics and time horizons on three procedure planning benchmarks: CrossTask, Coin, and NIV, demonstrating significant advantages over visual-only methods.
Conclusion: Language descriptions provide more distinctive representations than visual embeddings for procedure planning, and leveraging language through VLMs significantly improves planning performance across diverse benchmarks.
Abstract: Procedure planning requires a model to predict a sequence of actions that transform a start visual observation into a goal in instructional videos. While most existing methods rely primarily on visual observations as input, they often struggle with the inherent ambiguity where different actions can appear visually similar. In this work, we argue that language descriptions offer a more distinctive representation in the latent space for procedure planning. We introduce Language-Aware Planning (LAP), a novel method that leverages the expressiveness of language to bridge visual observation and planning. LAP uses a finetuned Vision Language Model (VLM) to translate visual observations into text descriptions and to predict actions and extract text embeddings. These text embeddings are more distinctive than visual embeddings and are used in a diffusion model for planning action sequences. We evaluate LAP on three procedure planning benchmarks: CrossTask, Coin, and NIV. LAP achieves new state-of-the-art performance across multiple metrics and time horizons by a large margin, demonstrating the significant advantage of language-aware planning.
[218] LogoDiffuser: Training-Free Multilingual Logo Generation and Stylization via Letter-Aware Attention Control
Mingyu Kang, Hyein Seo, Yuna Jeong, Junhyeong Park, Yong Suk Choi
Main category: cs.CV
TL;DR: LogoDiffuser: A training-free method for multilingual logo generation using diffusion transformers with character structure control via attention map injection
Details
Motivation: Existing text-to-image methods struggle with multilingual logo generation, often distorting character geometry when applying creative styles and requiring additional training for multilingual support. There's a need for a method that can generate multilingual logos while preserving character structure.
Method: Proposes LogoDiffuser, a training-free approach using multimodal diffusion transformers. Instead of textual prompts, inputs target characters as images for robust character structure control. Analyzes joint attention to identify core tokens responding to textual structures, then injects informative attention maps to integrate character structure with visual design. Uses layer-wise attention map aggregation to mitigate attention shifts.
Result: Extensive experiments and user studies show state-of-the-art performance in multilingual logo generation, achieving better character structure preservation and visual quality compared to existing methods.
Conclusion: LogoDiffuser effectively addresses multilingual logo generation challenges without additional training, providing robust character structure control through attention mechanism analysis and injection.
Abstract: Recent advances in text-to-image generation have been remarkable, but generating multilingual design logos that harmoniously integrate visual and textual elements remains a challenging task. Existing methods often distort character geometry when applying creative styles and struggle to support multilingual text generation without additional training. To address these challenges, we propose LogoDiffuser, a training-free method that synthesizes multilingual logo designs using the multimodal diffusion transformer. Instead of using textual prompts, we input the target characters as images, enabling robust character structure control regardless of language. We first analyze the joint attention mechanism to identify core tokens, which are tokens that strongly respond to textual structures. With this observation, our method integrates character structure and visual design by injecting the most informative attention maps. Furthermore, we perform layer-wise aggregation of attention maps to mitigate attention shifts across layers and obtain consistent core tokens. Extensive experiments and user studies demonstrate that our method achieves state-of-the-art performance in multilingual logo generation.
[219] Removing the Trigger, Not the Backdoor: Alternative Triggers and Latent Backdoors
Gorka Abad, Ermes Franch, Stefanos Koffas, Stjepan Picek
Main category: cs.CV
TL;DR: Backdoor attacks can be activated by alternative triggers distinct from training triggers, making trigger-centric defenses incomplete; backdoor directions in feature space persist even after removing known triggers.
Details
Motivation: Current backdoor defenses focus on neutralizing known triggers, but this paper shows this approach is incomplete because alternative triggers can activate the same backdoor, suggesting defenses should target backdoor directions in representation space instead.
Method: Estimate alternative trigger backdoor direction in feature space by contrasting clean and triggered representations, then develop a feature-guided attack that jointly optimizes target prediction and directional alignment. Theoretical proof of alternative trigger existence and empirical verification.
Result: Demonstrate that alternative triggers exist and are inevitable consequences of backdoor training. Show that defenses removing training triggers often leave backdoors intact, and alternative triggers can exploit latent backdoor feature-space.
Conclusion: Backdoor defenses should target backdoor directions in representation space rather than input-space triggers, as the trigger-centric view is incomplete and alternative triggers can bypass current defenses.
Abstract: Current backdoor defenses assume that neutralizing a known trigger removes the backdoor. We show this trigger-centric view is incomplete: “alternative triggers”, patterns perceptually distinct from training triggers, reliably activate the same backdoor. We estimate the alternative trigger backdoor direction in feature space by contrasting clean and triggered representations, and then develop a feature-guided attack that jointly optimizes target prediction and directional alignment. First, we theoretically prove that alternative triggers exist and are an inevitable consequence of backdoor training. Then, we verify this empirically. Additionally, defenses that remove training triggers often leave backdoors intact, and alternative triggers can exploit the latent backdoor feature-space. Our findings motivate defenses targeting backdoor directions in representation space rather than input-space triggers.
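The direction-estimation step described above can be sketched with synthetic features; the feature extractor, data, and the toy "shift" below are all stand-ins. The backdoor direction is taken as the normalized difference between mean triggered and mean clean representations, and any input that moves features along it, regardless of its appearance, aligns with the backdoor.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
clean_feats = rng.normal(size=(200, d))         # features of clean inputs
shift = np.zeros(d)
shift[:8] = 3.0                                 # toy backdoor shift in features
trig_feats = rng.normal(size=(200, d)) + shift  # features of triggered inputs

# Backdoor direction: contrast mean triggered vs. mean clean representations.
direction = trig_feats.mean(0) - clean_feats.mean(0)
direction /= np.linalg.norm(direction)

def alignment(f):
    """Cosine similarity between a feature vector and the backdoor direction."""
    return float(f @ direction / np.linalg.norm(f))

# An "alternative trigger" need not resemble the training trigger; it only
# needs to push features along the same direction.
alt = rng.normal(size=d) + 2.0 * shift
assert alignment(alt) > 0.5
```

A feature-guided attack would then optimize an input for both the target prediction and a high `alignment` score, which is why removing the original input-space trigger alone does not close the backdoor.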
[220] What is Missing? Explaining Neurons Activated by Absent Concepts
Robin Hesse, Simone Schaub-Meyer, Janina Hesse, Bernt Schiele, Stefan Roth
Main category: cs.CV
TL;DR: The paper introduces “encoded absences” as a novel type of causal relationship in neural networks where the absence of a concept increases neural activation, and proposes extensions to existing XAI methods to uncover these relationships.
Details
Motivation: Current XAI methods focus on identifying concepts that cause high neural activation (presence relationships), but overlook "encoded absences" where the absence of concepts increases activation. The authors argue these missing but relevant concepts are common and important for understanding model behavior.
Method: Proposes two simple extensions to mainstream XAI methods (attribution and feature visualization techniques) to uncover encoded absences. The approach modifies existing techniques to reveal when neural activation increases due to the absence of certain concepts.
Result: Shows that mainstream XAI methods can be adapted to reveal encoded absences, demonstrates how ImageNet models exploit these relationships, and shows that debiasing can be improved when considering encoded absences.
Conclusion: Encoded absences are an important but overlooked type of causal relationship in neural networks. Extending existing XAI methods to uncover these relationships provides more complete explanations of model behavior and can improve debiasing efforts.
Abstract: Explainable artificial intelligence (XAI) aims to provide human-interpretable insights into the behavior of deep neural networks (DNNs), typically by estimating a simplified causal structure of the model. In existing work, this causal structure often includes relationships where the presence of a concept is associated with a strong activation of a neuron. For example, attribution methods primarily identify input pixels that contribute most to a prediction, and feature visualization methods reveal inputs that cause high activation of a target neuron - the former implicitly assuming that the relevant information resides in the input, and the latter that neurons encode the presence of concepts. However, a largely overlooked type of causal relationship is that of encoded absences, where the absence of a concept increases neural activation. In this work, we show that such missing but relevant concepts are common and that mainstream XAI methods struggle to reveal them when applied in their standard form. To address this, we propose two simple extensions to attribution and feature visualization techniques that uncover encoded absences. Across experiments, we show how mainstream XAI methods can be used to reveal and explain encoded absences, how ImageNet models exploit them, and that debiasing can be improved when considering them.
[221] Test-time Ego-Exo-centric Adaptation for Action Anticipation via Multi-Label Prototype Growing and Dual-Clue Consistency
Zhaofeng Shi, Heqian Qiu, Lanxiao Wang, Qingbo Wu, Fanman Meng, Lili Pan, Hongliang Li
Main category: cs.CV
TL;DR: DCPGN enables test-time adaptation between egocentric and exocentric views for action anticipation without requiring target-view training data, using multi-label prototypes and dual-clue consistency.
Details
Motivation: Existing ego-exo adaptation methods require target-view data for training, which increases computational and data collection costs. The paper introduces test-time adaptation for action anticipation across views without needing target-view training.
Method: Proposes Dual-Clue enhanced Prototype Growing Network (DCPGN) with: 1) Multi-Label Prototype Growing Module for balancing multiple action classes via confidence-based reweighting and entropy priority queue, 2) Dual-Clue Consistency Module that uses a lightweight narrator to generate textual clues about action progressions, complementing visual clues, and enforcing consistency between textual and visual logits.
Result: Extensive experiments on EgoMe-anti and EgoExoLearn benchmarks show DCPGN outperforms state-of-the-art methods by a large margin for test-time ego-exo adaptation and action anticipation.
Conclusion: DCPGN effectively addresses test-time ego-exo adaptation for action anticipation by accumulating multi-label knowledge and integrating cross-modality clues, bridging temporal-spatial gaps between views without requiring target-view training data.
Abstract: Efficient adaptation between Egocentric (Ego) and Exocentric (Exo) views is crucial for applications such as human-robot cooperation. However, the success of most existing Ego-Exo adaptation methods relies heavily on target-view data for training, thereby increasing computational and data collection costs. In this paper, we make the first exploration of a Test-time Ego-Exo Adaptation for Action Anticipation (TE²A³) task, which aims to adjust the source-view-trained model online during test time to anticipate target-view actions. It is challenging for existing Test-Time Adaptation (TTA) methods to address this task due to the multi-action candidates and significant temporal-spatial inter-view gap. Hence, we propose a novel Dual-Clue enhanced Prototype Growing Network (DCPGN), which accumulates multi-label knowledge and integrates cross-modality clues for effective test-time Ego-Exo adaptation and action anticipation. Specifically, we propose a Multi-Label Prototype Growing Module (ML-PGM) to balance multiple positive classes via multi-label assignment and confidence-based reweighting for class-wise memory banks, which are updated by an entropy priority queue strategy. Then, the Dual-Clue Consistency Module (DCCM) introduces a lightweight narrator to generate textual clues indicating action progressions, which complement the visual clues containing various objects. Moreover, we constrain the inferred textual and visual logits to construct dual-clue consistency for temporally and spatially bridging Ego and Exo views. Extensive experiments on the newly proposed EgoMe-anti and the existing EgoExoLearn benchmarks show the effectiveness of our method, which outperforms related state-of-the-art methods by a large margin. Code is available at https://github.com/ZhaofengSHI/DCPGN.
[222] RA-SSU: Towards Fine-Grained Audio-Visual Learning with Region-Aware Sound Source Understanding
Muyi Sun, Yixuan Wang, Hong Wang, Chen Su, Man Zhang, Xingqun Qi, Qi Li, Zhenan Sun
Main category: cs.CV
TL;DR: Paper introduces Region-Aware Sound Source Understanding (RA-SSU), a fine-grained audio-visual learning task with new datasets and SSUFormer model for sound source segmentation and description.
Details
Motivation: Previous audio-visual learning focused on coarse-grained tasks like audio-visual correspondence and sound source localization. The authors aim to provide more specific scene perception details through fine-grained, region-aware sound source understanding.
Method: Proposes RA-SSU task and constructs two datasets: f-Music (3,976 samples, 22 music scenes) and f-Lifescene (6,156 samples, 61 life scenarios). Introduces SSUFormer with Mask Collaboration Module (MCM) for accuracy and Mixture of Hierarchical-prompted Experts (MoHE) for rich descriptions.
Result: Extensive experiments show SSUFormer achieves state-of-the-art performance on the sound source understanding benchmark, demonstrating task feasibility and dataset utility.
Conclusion: The paper successfully defines a new fine-grained audio-visual learning task, provides comprehensive datasets, and presents an effective model for region-aware sound source understanding.
Abstract: Audio-Visual Learning (AVL) is a fundamental task in multi-modality learning and embodied intelligence, playing a vital role in scene understanding and interaction. However, previous research has mostly focused on exploring downstream tasks from a coarse-grained perspective (e.g., audio-visual correspondence, sound source localization, and audio-visual event localization). To provide more specific scene perception details, we define a new fine-grained Audio-Visual Learning task, termed Region-Aware Sound Source Understanding (RA-SSU), which aims to achieve region-aware, frame-level, and high-quality sound source understanding. To support this goal, we construct two corresponding datasets, i.e., fine-grained Music (f-Music) and fine-grained Lifescene (f-Lifescene), each containing annotated sound source masks and frame-by-frame textual descriptions. The f-Music dataset includes 3,976 samples across 22 scene types related to specific application scenarios, focusing on music scenes with complex instrument mixing. The f-Lifescene dataset contains 6,156 samples across 61 types representing diverse sounding objects in life scenarios. Moreover, we propose SSUFormer, a Sound-Source Understanding TransFormer benchmark that facilitates both sound source segmentation and sound region description with a multi-modal input and multi-modal output architecture. Specifically, we design two modules for this framework, a Mask Collaboration Module (MCM) and a Mixture of Hierarchical-prompted Experts (MoHE), to respectively enhance the accuracy and enrich the elaboration of the sound source description. Extensive experiments are conducted on our two datasets to verify the feasibility of the task, evaluate the utility of the datasets, and demonstrate the superiority of SSUFormer, which achieves SOTA performance on the Sound Source Understanding benchmark.
[223] ConfCtrl: Enabling Precise Camera Control in Video Diffusion via Confidence-Aware Interpolation
Liudi Yang, George Eskandar, Fengyi Shen, Mohammad Altillawi, Yang Bai, Chi Zhang, Ziyuan Liu, Abhinav Valada
Main category: cs.CV
TL;DR: ConfCtrl: A confidence-aware video interpolation framework for novel view synthesis from only two images under large viewpoint changes, using diffusion models guided by camera poses with Kalman-inspired predict-update mechanism.
Details
Motivation: Existing methods for novel view synthesis from few images have limitations: regression-based methods can't reconstruct unseen regions, while camera-guided diffusion models often deviate from intended trajectories due to noisy point cloud projections or insufficient camera pose conditioning.
Method: Proposes ConfCtrl framework that initializes diffusion process with confidence-weighted projected point cloud latent plus noise as conditioning input. Uses Kalman-inspired predict-update mechanism treating projected point cloud as noisy measurement, with learned residual corrections to balance pose-driven predictions with geometric observations.
Result: Experiments on multiple datasets show ConfCtrl produces geometrically consistent and visually plausible novel views, effectively reconstructing occluded regions under large viewpoint changes.
Conclusion: ConfCtrl enables diffusion models to follow prescribed camera poses while completing unseen regions, allowing models to rely on reliable projections while down-weighting uncertain regions for stable, geometry-aware generation.
Abstract: We address the challenge of novel view synthesis from only two input images under large viewpoint changes. Existing regression-based methods lack the capacity to reconstruct unseen regions, while camera-guided diffusion models often deviate from intended trajectories due to noisy point cloud projections or insufficient conditioning from camera poses. To address these issues, we propose ConfCtrl, a confidence-aware video interpolation framework that enables diffusion models to follow prescribed camera poses while completing unseen regions. ConfCtrl initializes the diffusion process by combining a confidence-weighted projected point cloud latent with noise as the conditioning input. It then applies a Kalman-inspired predict-update mechanism, treating the projected point cloud as a noisy measurement and using learned residual corrections to balance pose-driven predictions with noisy geometric observations. This allows the model to rely on reliable projections while down-weighting uncertain regions, yielding stable, geometry-aware generation. Experiments on multiple datasets show that ConfCtrl produces geometrically consistent and visually plausible novel views, effectively reconstructing occluded regions under large viewpoint changes.
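The Kalman-inspired predict-update mechanism reduces, in spirit, to a confidence-gated blend of a prediction with a noisy measurement. A minimal sketch, assuming a per-element confidence in [0, 1] acts as the gain (in the paper both the gain behavior and the residual corrections are learned, so all names and shapes here are illustrative):

```python
import numpy as np

def predict_update(pred_latent, projected_latent, confidence):
    """Kalman-style update: blend a pose-driven prediction with a noisy
    point-cloud projection, using confidence as a per-element gain.

    Reliable projections (confidence near 1) dominate the result;
    uncertain regions (confidence near 0) fall back to the prediction.
    """
    innovation = projected_latent - pred_latent   # measurement residual
    return pred_latent + confidence * innovation

# Toy latents: prediction says 0 everywhere, projection says 1 everywhere.
pred = np.zeros((4, 4))
proj = np.ones((4, 4))
conf = np.full((4, 4), 0.8)              # fairly trustworthy projection
out = predict_update(pred, proj, conf)   # 0.8 everywhere
```

In the actual framework a learned residual correction would additionally adjust the innovation term before the blend.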
[224] BrainSTR: Spatio-Temporal Contrastive Learning for Interpretable Dynamic Brain Network Modeling
Guiliang Guo, Guangqi Wen, Lingwen Liu, Ruoxian Song, Peng Cao, Jinzhu Yang, Fei Wang, Xiaoli Liu, Osmar R. Zaiane
Main category: cs.CV
TL;DR: BrainSTR is a spatio-temporal contrastive learning framework for interpretable dynamic brain network modeling that identifies disease-relevant brain connectivity patterns across time and topology for neuropsychiatric diagnosis.
Details
Motivation: Dynamic functional connectivity analysis can capture time-varying brain states for better neuropsychiatric diagnosis, but faces challenges in identifying subtle, sparsely distributed diagnostic signals amidst pervasive nuisance fluctuations and non-diagnostic connectivities.Method: BrainSTR uses adaptive phase partitioning to learn state-consistent boundaries, identifies critical phases with attention, extracts disease-related connectivity via an incremental graph structure generator with binarization, temporal smoothness, and sparsity regularization, and employs spatio-temporal supervised contrastive learning to refine similarity metrics.
Result: Experiments on Autism Spectrum Disorder (ASD), Bipolar Disorder (BD), and Major Depressive Disorder (MDD) validate BrainSTR’s effectiveness, with discovered critical phases and subnetworks providing interpretable evidence consistent with prior neuroimaging findings.
Conclusion: BrainSTR offers a robust framework for interpretable dynamic brain network modeling that can identify diagnostically relevant spatio-temporal patterns in neuropsychiatric disorders, advancing both diagnostic accuracy and neurobiological understanding.
Abstract: Dynamic functional connectivity captures time-varying brain states for better neuropsychiatric diagnosis and spatio-temporal interpretability, i.e., identifying when discriminative disease signatures emerge and where they reside in the connectivity topology. Reliable interpretability faces major challenges: diagnostic signals are often subtle and sparsely distributed across both time and topology, while nuisance fluctuations and non-diagnostic connectivities are pervasive. To address these issues, we propose BrainSTR, a spatio-temporal contrastive learning framework for interpretable dynamic brain network modeling. BrainSTR learns state-consistent phase boundaries via a data-driven Adaptive Phase Partition module, identifies diagnostically critical phases with attention, and extracts disease-related connectivity within each phase using an Incremental Graph Structure Generator regularized by binarization, temporal smoothness, and sparsity. Then, we introduce a spatio-temporal supervised contrastive learning approach that leverages diagnosis-relevant spatio-temporal patterns to refine the similarity metric between samples and capture more discriminative spatio-temporal features, thereby constructing a well-structured semantic space for coherent and interpretable representations. Experiments on ASD, BD, and MDD validate the effectiveness of BrainSTR, and the discovered critical phases and subnetworks provide interpretable evidence consistent with prior neuroimaging findings. Our code: https://anonymous.4open.science/r/BrainSTR1.
[225] VLM-Loc: Localization in Point Cloud Maps via Vision-Language Models
Shuhao Kang, Youqi Liao, Peijie Wang, Wenlong Liao, Qilin Zhang, Benjamin Busam, Xieyuanli Chen, Yun Liu
Main category: cs.CV
TL;DR: VLM-Loc uses vision-language models for text-to-point-cloud localization by converting point clouds to BEV images and scene graphs, enabling spatial reasoning for accurate 3D localization from natural language descriptions.
Details
Motivation: Existing text-to-point-cloud localization methods lack effective spatial reasoning capabilities, limiting accuracy in complex environments. The paper aims to leverage the spatial reasoning abilities of large vision-language models to bridge linguistic and spatial semantics for better 3D localization.
Method: Transforms point clouds into bird’s-eye-view images and scene graphs to encode geometric and semantic context. Uses VLM to learn cross-modal representations and introduces partial node assignment mechanism to associate textual cues with scene graph nodes for interpretable spatial reasoning.
Result: VLM-Loc achieves superior accuracy and robustness compared to state-of-the-art methods on the CityLoc benchmark built from multi-source point clouds for fine-grained text-to-point-cloud localization.
Conclusion: The proposed VLM-Loc framework effectively leverages VLMs’ spatial reasoning capabilities for text-to-point-cloud localization, demonstrating significant improvements over existing methods through structured cross-modal representations and interpretable reasoning mechanisms.
Abstract: Text-to-point-cloud (T2P) localization aims to infer precise spatial positions within 3D point cloud maps from natural language descriptions, reflecting how humans perceive and communicate spatial layouts through language. However, existing methods largely rely on shallow text-point cloud correspondence without effective spatial reasoning, limiting their accuracy in complex environments. To address this limitation, we propose VLM-Loc, a framework that leverages the spatial reasoning capability of large vision-language models (VLMs) for T2P localization. Specifically, we transform point clouds into bird’s-eye-view (BEV) images and scene graphs that jointly encode geometric and semantic context, providing structured inputs for the VLM to learn cross-modal representations bridging linguistic and spatial semantics. On top of these representations, we introduce a partial node assignment mechanism that explicitly associates textual cues with scene graph nodes, enabling interpretable spatial reasoning for accurate localization. To facilitate systematic evaluation across diverse scenes, we present CityLoc, a benchmark built from multi-source point clouds for fine-grained T2P localization. Experiments on CityLoc demonstrate VLM-Loc achieves superior accuracy and robustness compared to state-of-the-art methods. Our code, model, and dataset are available at https://github.com/MCG-NKU/nku-3d-vision.
[226] PnLCalib: Sports Field Registration via Points and Lines Optimization
Marc Gutiérrez-Pérez, Antonio Agudo
Main category: cs.CV
TL;DR: Optimization-based camera calibration pipeline for broadcast sports videos using 3D soccer field model and keypoints with novel refinement module for improved accuracy.
Details
Motivation: Camera calibration in broadcast sports videos faces challenges due to multiple camera angles, varying parameters, and field occlusions. Traditional search-based methods struggle with non-standard positions and dynamic environments.
Method: Proposes optimization-based calibration pipeline using 3D soccer field model and predefined keypoints. Introduces novel refinement module that improves initial calibration by using detected field lines in non-linear optimization process.
Result: Outperforms existing techniques in both multi-view and single-view 3D camera calibration tasks, while maintaining competitive performance in homography estimation. Extensive experimentation on real-world datasets shows robustness and accuracy across diverse broadcast scenarios.
Conclusion: The approach offers significant improvements in camera calibration precision and reliability for broadcast sports videos, addressing limitations of traditional methods.
Abstract: Camera calibration in broadcast sports videos presents numerous challenges for accurate sports field registration due to multiple camera angles, varying camera parameters, and frequent occlusions of the field. Traditional search-based methods depend on initial camera pose estimates, which can struggle in non-standard positions and dynamic environments. In response, we propose an optimization-based calibration pipeline that leverages a 3D soccer field model and a predefined set of keypoints to overcome these limitations. Our method also introduces a novel refinement module that improves initial calibration by using detected field lines in a non-linear optimization process. This approach outperforms existing techniques in both multi-view and single-view 3D camera calibration tasks, while maintaining competitive performance in homography estimation. Extensive experimentation on real-world soccer datasets, including SoccerNet-Calibration, WorldCup 2014, and TS-WorldCup, highlights the robustness and accuracy of our method across diverse broadcast scenarios. Our approach offers significant improvements in camera calibration precision and reliability.
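As background for the keypoint stage, the classic DLT homography fit from field keypoints can be sketched as below. This is a deliberate simplification: the paper's pipeline estimates full 3D camera parameters and then refines them with detected lines, and the names and synthetic data here are illustrative.

```python
import numpy as np

def fit_homography(world_pts, image_pts):
    """Direct Linear Transform: fit H mapping 2D field points to image points."""
    A = []
    for (X, Y), (x, y) in zip(world_pts, image_pts):
        A.append([-X, -Y, -1, 0, 0, 0, x * X, x * Y, x])
        A.append([0, 0, 0, -X, -Y, -1, y * X, y * Y, y])
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    H = Vt[-1].reshape(3, 3)          # null vector = flattened homography
    return H / H[2, 2]

def project(H, pt):
    v = H @ np.array([pt[0], pt[1], 1.0])
    return v[:2] / v[2]

# Synthetic check: recover a known homography from 5 keypoint correspondences.
H_true = np.array([[2.0, 0.0, 1.0], [0.0, 3.0, 2.0], [0.0, 0.0, 1.0]])
world = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
image = [project(H_true, p) for p in world]
H_fit = fit_homography(world, image)
```

A non-linear refinement stage, as in the paper, would then minimize reprojection residuals against detected field lines starting from such an initialization.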
[227] MissBench: Benchmarking Multimodal Affective Analysis under Imbalanced Missing Modalities
Tien Anh Pham, Phuong-Anh Nguyen, Duc-Trong Le, Cam-Van Thi Nguyen
Main category: cs.CV
TL;DR: MissBench introduces a benchmark for multimodal affective computing that evaluates models under realistic imbalanced missing-modality conditions, with diagnostic metrics for modality equity and learning balance.
Details
Motivation: Standard multimodal evaluations assume equal modality availability, but real applications have imbalanced missing rates that create training biases not revealed by task-level metrics alone.
Method: Developed MissBench benchmark with standardized shared and imbalanced missing-rate protocols on four sentiment/emotion datasets, plus Modality Equity Index (MEI) and Modality Learning Index (MLI) diagnostic metrics.
Result: Experiments show models appearing robust under shared missing rates exhibit modality inequity and optimization imbalance under imbalanced conditions, revealing hidden biases.
Conclusion: MissBench with MEI and MLI provides practical tools for stress-testing and analyzing multimodal affective models in realistic incomplete-modality settings.
Abstract: Multimodal affective computing underpins key tasks such as sentiment analysis and emotion recognition. Standard evaluations, however, often assume that textual, acoustic, and visual modalities are equally available. In real applications, some modalities are systematically more fragile or expensive, creating imbalanced missing rates and training biases that task-level metrics alone do not reveal. We introduce MissBench, a benchmark and framework for multimodal affective tasks that standardizes both shared and imbalanced missing-rate protocols on four widely used sentiment and emotion datasets. MissBench also defines two diagnostic metrics. The Modality Equity Index (MEI) measures how fairly different modalities contribute across missing-modality configurations. The Modality Learning Index (MLI) quantifies optimization imbalance by comparing modality-specific gradient norms during training, aggregated across modality-related modules. Experiments on representative method families show that models that appear robust under shared missing rates can still exhibit marked modality inequity and optimization imbalance under imbalanced conditions. These findings position MissBench, together with MEI and MLI, as practical tools for stress-testing and analyzing multimodal affective models in realistic incomplete-modality settings. For reproducibility, we release our code at https://anonymous.4open.science/r/MissBench-4098/
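The abstract describes MLI only qualitatively (comparing modality-specific gradient norms during training), so the ratio below is a toy stand-in for the idea rather than the paper's exact formula:

```python
import numpy as np

def modality_learning_index(grad_norms):
    """Toy optimization-imbalance score: ratio of the largest to the smallest
    mean gradient norm across modalities. A value near 1 means balanced
    learning; large values flag a dominant modality. (The paper's exact
    MLI aggregation may differ.)"""
    means = {m: float(np.mean(v)) for m, v in grad_norms.items()}
    return max(means.values()) / min(means.values())

# Gradient norms logged per training step for each modality's modules.
mli = modality_learning_index({
    "text": [1.0, 1.2, 0.9],      # text branch receives large updates
    "audio": [0.30, 0.35, 0.32],  # audio branch is under-optimized
    "video": [0.50, 0.55, 0.60],
})
```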
[228] InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing
Changyao Tian, Danni Yang, Guanzhou Chen, Erfei Cui, Zhaokai Wang, Yuchen Duan, Penghao Yin, Sitao Chen, Ganlin Yang, Mingxin Liu, Zirun Zhu, Ziqian Fan, Leyao Gu, Haomin Wang, Qi Wei, Jinhui Yin, Xue Yang, Zhihang Zhong, Qi Qin, Yi Xin, Bin Fu, Yihao Liu, Jiaye Ge, Qipeng Guo, Gen Luo, Hongsheng Li, Yu Qiao, Kai Chen, Hongjie Zhang
Main category: cs.CV
TL;DR: InternVL-U is a lightweight 4B-parameter unified multimodal model that integrates understanding, reasoning, generation, and editing capabilities, achieving superior performance-efficiency balance compared to larger models.
Details
Motivation: Address the inherent trade-offs in unified multimodal models between maintaining strong semantic comprehension and acquiring powerful generation capabilities, aiming to democratize these capabilities in a lightweight framework.
Method: Uses unified contextual modeling and modality-specific modular design with decoupled visual representations, integrating a state-of-the-art MLLM with a specialized MMDiT-based visual generation head. Employs a comprehensive data synthesis pipeline for high-semantic-density tasks using Chain-of-Thought reasoning to align abstract intent with visual generation details.
Result: Consistently outperforms unified baseline models with over 3x larger scales (like 14B BAGEL) on various generation and editing tasks while retaining strong multimodal understanding and reasoning capabilities.
Conclusion: InternVL-U successfully demonstrates that a lightweight 4B-parameter unified multimodal model can achieve superior performance-efficiency balance, effectively bridging the gap between aesthetic generation and high-level intelligence.
Abstract: Unified multimodal models (UMMs) that integrate understanding, reasoning, generation, and editing face inherent trade-offs between maintaining strong semantic comprehension and acquiring powerful generation capabilities. In this report, we present InternVL-U, a lightweight 4B-parameter UMM that democratizes these capabilities within a unified framework. Guided by the principles of unified contextual modeling and modality-specific modular design with decoupled visual representations, InternVL-U integrates a state-of-the-art Multimodal Large Language Model (MLLM) with a specialized MMDiT-based visual generation head. To further bridge the gap between aesthetic generation and high-level intelligence, we construct a comprehensive data synthesis pipeline targeting high-semantic-density tasks, such as text rendering and scientific reasoning, under a reasoning-centric paradigm that leverages Chain-of-Thought (CoT) to better align abstract user intent with fine-grained visual generation details. Extensive experiments demonstrate that InternVL-U achieves a superior performance - efficiency balance. Despite using only 4B parameters, it consistently outperforms unified baseline models with over 3x larger scales such as BAGEL (14B) on various generation and editing tasks, while retaining strong multimodal understanding and reasoning capabilities.
[229] DRUPI: Dataset Reduction Using Privileged Information
Shaobo Wang, Youxin Jiang, Tianle Niu, Yantai Yang, Ruiji Zhang, Shuhao Hu, Shuaiyu Zhang, Chenghao Sun, Weiya Li, Conghui He, Xuming Hu, Linfeng Zhang
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2410.01611 was rate-limited (HTTP 429), so no details could be fetched for this paper.
[230] DISPLAY: Directable Human-Object Interaction Video Generation via Sparse Motion Guidance and Multi-Task Auxiliary
Jiazhi Guan, Quanwei Yang, Luying Huang, Junhao Liang, Borong Liang, Haocheng Feng, Wei He, Kaisiyuan Wang, Hang Zhou, Jingdong Wang
Main category: cs.CV
TL;DR: DISPLAY is a framework for generating human-object interaction videos using sparse motion guidance (wrist joints + object bounding box) with object-stressed attention and multi-task auxiliary training.
Details
Motivation: Existing human-centric video generation methods struggle with controllable and physically consistent human-object interactions, often requiring dense control signals, template videos, or carefully crafted prompts that limit flexibility and generalization to novel objects.
Method: Uses sparse motion guidance (wrist joint coordinates + shape-agnostic object bounding box), object-stressed attention mechanism to improve object robustness, and multi-task auxiliary training strategy with dedicated data curation pipeline to address HOI data scarcity.
Result: Achieves high-fidelity, controllable HOI generation across diverse tasks, with the sparse guidance enabling intuitive user control while maintaining physical consistency.
Conclusion: DISPLAY demonstrates that sparse motion guidance combined with object-stressed attention and multi-task training can effectively generate controllable and physically consistent human-object interaction videos.
Abstract: Human-centric video generation has advanced rapidly, yet existing methods struggle to produce controllable and physically consistent Human-Object Interaction (HOI) videos. Existing works rely on dense control signals, template videos, or carefully crafted text prompts, which limit flexibility and generalization to novel objects. We introduce a framework, namely DISPLAY, guided by Sparse Motion Guidance, composed only of wrist joint coordinates and a shape-agnostic object bounding box. This lightweight guidance alleviates the imbalance between human and object representations and enables intuitive user control. To enhance fidelity under such sparse conditions, we propose an Object-Stressed Attention mechanism that improves object robustness. To address the scarcity of high-quality HOI data, we further develop a Multi-Task Auxiliary Training strategy with a dedicated data curation pipeline, allowing the model to benefit from both reliable HOI samples and auxiliary tasks. Comprehensive experiments show that our method achieves high-fidelity, controllable HOI generation across diverse tasks. The project page can be found at https://mumuwei.github.io/DISPLAY/.
[231] From Semantics to Pixels: Coarse-to-Fine Masked Autoencoders for Hierarchical Visual Understanding
Wenzhao Xiang, Yue Wu, Hongyang Yu, Feng Gao, Fan Yang, Xilin Chen
Main category: cs.CV
TL;DR: C2FMAE is a coarse-to-fine masked autoencoder that learns hierarchical visual representations across three granularities (scene, object, pixel) using cascaded decoding and progressive masking to address limitations of contrastive learning and masked image modeling.
Details
Motivation: To resolve the tension between contrastive learning (which captures global semantics but loses fine details) and masked image modeling (which preserves local textures but suffers from attention drift due to random masking). The goal is to learn more robust and generalizable visual representations.
Method: Proposes C2FMAE with two key innovations: 1) A cascaded decoder that sequentially reconstructs from scene semantics to object instances to pixel details, establishing explicit cross-granularity dependencies. 2) A progressive masking curriculum that shifts training focus from semantic-guided to instance-guided to random masking. Uses a large-scale multi-granular dataset with pseudo-labels for 1.28M ImageNet-1K images.
Result: Achieves significant performance gains on image classification, object detection, and semantic segmentation tasks, validating the effectiveness of the hierarchical design in learning more robust and generalizable representations.
Conclusion: C2FMAE successfully bridges the gap between global semantic understanding and local detail preservation through hierarchical representation learning, offering a more comprehensive approach to self-supervised visual pre-training.
Abstract: Self-supervised visual pre-training methods face an inherent tension: contrastive learning (CL) captures global semantics but loses fine-grained detail, while masked image modeling (MIM) preserves local textures but suffers from “attention drift” due to semantically-agnostic random masking. We propose C2FMAE, a coarse-to-fine masked autoencoder that resolves this tension by explicitly learning hierarchical visual representations across three data granularities: semantic masks (scene-level), instance masks (object-level), and RGB images (pixel-level). Two synergistic innovations enforce a strict top-down learning principle. First, a cascaded decoder sequentially reconstructs from scene semantics to object instances to pixel details, establishing explicit cross-granularity dependencies that parallel decoders cannot capture. Second, a progressive masking curriculum dynamically shifts the training focus from semantic-guided to instance-guided and finally to random masking, creating a structured learning path from global context to local features. To support this framework, we construct a large-scale multi-granular dataset with high-quality pseudo-labels for all 1.28M ImageNet-1K images. Extensive experiments show that C2FMAE achieves significant performance gains on image classification, object detection, and semantic segmentation, validating the effectiveness of our hierarchical design in learning more robust and generalizable representations.
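The progressive masking curriculum can be pictured as a schedule over strategy weights that shifts from semantic-guided to instance-guided to random masking. The piecewise-linear timing below is an assumption for illustration; the abstract does not specify the actual schedule:

```python
def masking_weights(progress):
    """Mixture weights over masking strategies as training progresses in [0, 1].

    First half:  semantic-guided -> instance-guided.
    Second half: instance-guided -> random.
    """
    assert 0.0 <= progress <= 1.0
    if progress < 0.5:
        t = progress / 0.5
        return {"semantic": 1.0 - t, "instance": t, "random": 0.0}
    t = (progress - 0.5) / 0.5
    return {"semantic": 0.0, "instance": 1.0 - t, "random": t}

early, mid, late = masking_weights(0.0), masking_weights(0.5), masking_weights(1.0)
```

At each step, a masking strategy would be sampled (or blended) according to these weights, giving the structured global-to-local learning path the paper describes.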
[232] Stepping VLMs onto the Court: Benchmarking Spatial Intelligence in Sports
Yuchen Yang, Yuqing Shao, Duxiu Huang, Linfeng Dong, Yifei Liu, Suixin Tang, Xiang Zhou, Yuanyuan Gao, Wei Wang, Yue Zhou, Xue Yang, Yanfeng Wang, Xiao Sun, Zhihang Zhong
Main category: cs.CV
TL;DR: CourtSI: A large-scale spatial intelligence dataset for sports scenarios with over 1M QA pairs covering spatial reasoning tasks, plus a benchmark showing VLMs’ limitations in sports spatial understanding.
Details
Motivation: Sports provide a natural testbed for understanding high-intensity human motion and dynamic object interactions, which is valuable for advancing spatial intelligence in vision-language models. Existing benchmarks don't adequately capture the spatial reasoning challenges present in sports scenarios.
Method: Created CourtSI dataset with over 1M QA pairs organized under a taxonomy covering spatial counting, distance measurement, localization, and relational reasoning across net sports. Used semi-automatic data engine leveraging court geometry as metric anchors to reconstruct sports scenes. Also introduced CourtSI-Bench evaluation benchmark with 3,686 human-verified QA pairs.
Result: Evaluation of 25 VLMs on CourtSI-Bench revealed significant human-AI performance gap and limited generalization from existing spatial intelligence benchmarks. Fine-tuning Qwen3-VL-8B on CourtSI improved accuracy by 23.5 percentage points, with the model generalizing effectively to similar unseen sports and showing enhanced spatial-aware commentary generation.
Conclusion: CourtSI provides a scalable pathway for advancing spatial intelligence in VLMs for sports scenarios, exposing limitations in current models and benchmarks while demonstrating that targeted training can significantly improve spatial reasoning capabilities.
Abstract: Sports have long attracted broad attention as they push the limits of human physical and cognitive capabilities. Amid growing interest in spatial intelligence for vision-language models (VLMs), sports provide a natural testbed for understanding high-intensity human motion and dynamic object interactions. To this end, we present CourtSI, the first large-scale spatial intelligence dataset tailored to sports scenarios. CourtSI contains over 1M QA pairs, organized under a holistic taxonomy that systematically covers spatial counting, distance measurement, localization, and relational reasoning, across representative net sports including badminton, tennis, and table tennis. Leveraging well-defined court geometry as metric anchors, we develop a semi-automatic data engine to reconstruct sports scenes, enabling scalable curation of CourtSI. In addition, we introduce CourtSI-Bench, a high-quality evaluation benchmark comprising 3,686 QA pairs with rigorous human verification. We evaluate 25 proprietary and open-source VLMs on CourtSI-Bench, revealing a remaining human-AI performance gap and limited generalization from existing spatial intelligence benchmarks. These findings indicate that sports scenarios expose limitations in spatial intelligence capabilities captured by existing benchmarks. Further, fine-tuning Qwen3-VL-8B on CourtSI improves accuracy on CourtSI-Bench by 23.5 percentage points. The adapted model also generalizes effectively to CourtSI-Ext, an evaluation set built on a similar but unseen sport, and demonstrates enhanced spatial-aware commentary generation. Together, these findings demonstrate that CourtSI provides a scalable pathway toward advancing spatial intelligence of VLMs in sports.
[233] WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition
Shan Ning, Longtian Qiu, Jiaxuan Sun, Xuming He
Main category: cs.CV
TL;DR: WikiCLIP is a contrastive learning framework for open-domain visual entity recognition that uses LLM embeddings enhanced with vision-guided knowledge adaptation and hard negative synthesis, achieving state-of-the-art performance with 100x faster inference than generative methods.
Details
Motivation: Current generative methods for visual entity recognition (VER) have strong performance but high computational costs, limiting scalability and practical deployment. The authors aim to revisit contrastive learning to create a more efficient yet effective baseline for open-domain VER.
Method: WikiCLIP uses large language model embeddings as knowledge-rich entity representations, enhanced with a Vision-Guided Knowledge Adaptor (VGKA) that aligns textual semantics with visual cues at patch level. It also employs a Hard Negative Synthesis Mechanism to generate visually similar but semantically distinct negatives during training for fine-grained discrimination.
Result: WikiCLIP significantly outperforms strong baselines on open-domain VER benchmarks like OVEN, achieving 16% improvement on the challenging OVEN unseen set while reducing inference latency by nearly 100 times compared to the leading generative model AutoVER.
Conclusion: WikiCLIP establishes a strong and efficient baseline for open-domain visual entity recognition, demonstrating that contrastive methods can achieve superior performance with dramatically reduced computational costs compared to generative approaches.
Abstract: Open-domain visual entity recognition (VER) seeks to associate images with entities in encyclopedic knowledge bases such as Wikipedia. Recent generative methods tailored for VER demonstrate strong performance but incur high computational costs, limiting their scalability and practical deployment. In this work, we revisit the contrastive paradigm for VER and introduce WikiCLIP, a simple yet effective framework that establishes a strong and efficient baseline for open-domain VER. WikiCLIP leverages large language model embeddings as knowledge-rich entity representations and enhances them with a Vision-Guided Knowledge Adaptor (VGKA) that aligns textual semantics with visual cues at the patch level. To further encourage fine-grained discrimination, a Hard Negative Synthesis Mechanism generates visually similar but semantically distinct negatives during training. Experimental results on popular open-domain VER benchmarks, such as OVEN, demonstrate that WikiCLIP significantly outperforms strong baselines. Specifically, WikiCLIP achieves a 16% improvement on the challenging OVEN unseen set, while reducing inference latency by nearly 100 times compared with the leading generative model, AutoVER. The project page is available at https://artanic30.github.io/project_pages/WikiCLIP/
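The contrastive core is a standard InfoNCE objective in which synthesized hard negatives simply appear as extra candidate entity embeddings. A minimal sketch (the temperature and all names are illustrative, not the paper's exact configuration):

```python
import numpy as np

def info_nce(image_emb, entity_embs, pos_idx, tau=0.07):
    """Cross-entropy of the image against one positive entity with the rest
    as negatives; hard negatives are just additional rows of entity_embs."""
    logits = entity_embs @ image_emb / tau
    logits = logits - logits.max()            # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[pos_idx])

img = np.array([1.0, 0.0])
ents = np.array([[1.0, 0.0],    # positive entity embedding
                 [0.0, 1.0],    # negative
                 [-1.0, 0.0]])  # negative
loss_good = info_nce(img, ents, pos_idx=0)   # correct match: small loss
loss_bad = info_nce(img, ents, pos_idx=2)    # wrong match: large loss
```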
[234] On the Structural Failure of Chamfer Distance in 3D Shape Optimization
Chang-Yong Song, David Hyde
Main category: cs.CV
TL;DR: Chamfer distance optimization paradoxically fails due to gradient-structural collapse, requiring non-local coupling beyond local neighborhoods for successful optimization.
Details
Motivation: Chamfer distance is widely used for point cloud tasks but can produce worse results when directly optimized, creating a paradoxical failure that needs explanation and solution.
Method: Analyzed gradient structure of Chamfer distance, identified many-to-one collapse as unique attractor, derived necessary condition for collapse suppression (non-local coupling), tested in 2D with shared-basis deformation and 3D shape morphing with differentiable MPM prior.
Result: Non-local coupling successfully suppresses collapse, with 2.5× improvement on topologically complex dragon across 20 directed pairs, consistently reducing Chamfer gap.
Conclusion: Presence or absence of non-local coupling determines Chamfer optimization success, providing practical design criterion for pipelines optimizing point-level distance metrics.
Abstract: Chamfer distance is the standard training loss for point cloud reconstruction, completion, and generation, yet directly optimizing it can produce worse Chamfer values than not optimizing it at all. We show that this paradoxical failure is gradient-structural. The per-point Chamfer gradient creates a many-to-one collapse that is the unique attractor of the forward term and cannot be resolved by any local regularizer, including repulsion, smoothness, and density-aware re-weighting. We derive a necessary condition for collapse suppression: coupling must propagate beyond local neighborhoods. In a controlled 2D setting, shared-basis deformation suppresses collapse by providing global coupling; in 3D shape morphing, a differentiable MPM prior instantiates the same principle, consistently reducing the Chamfer gap across 20 directed pairs with a 2.5× improvement on the topologically complex dragon. The presence or absence of non-local coupling determines whether Chamfer optimization succeeds or collapses. This provides a practical design criterion for any pipeline that optimizes point-level distance metrics.
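The many-to-one collapse the paper analyzes is easy to reproduce in a toy 2D configuration: under the forward term, every predicted point is attracted to its nearest target, so several predictions can pile onto a single target point. A minimal sketch:

```python
import numpy as np

def chamfer(pred, target):
    """Symmetric Chamfer distance between two 2D point sets."""
    d = np.linalg.norm(pred[:, None, :] - target[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

# Three predictions clustered near one target, far from the other.
pred = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0]])
target = np.array([[0.0, 0.0], [5.0, 0.0]])

cd = chamfer(pred, target)
# Forward-term assignments: which target each prediction is pulled toward.
nearest = np.linalg.norm(pred[:, None] - target[None], axis=-1).argmin(axis=1)
# All three predictions share target 0 -- the many-to-one attractor; per the
# paper, only coupling that propagates beyond local neighborhoods (e.g. a
# shared deformation basis) can redistribute them.
```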
[235] Fine-grained Motion Retrieval via Joint-Angle Motion Images and Token-Patch Late Interaction
Yao Zhang, Zhuchenyang Liu, Yanlan He, Thomas Ploetz, Yu Xiao
Main category: cs.CV
TL;DR: Proposes an interpretable text-motion retrieval method using joint-angle-based motion representation mapped to pseudo-images for Vision Transformers, with token-wise late interaction and MLM regularization for fine-grained alignment.
Details
Motivation: Existing text-motion retrieval methods use dual-encoder frameworks that compress motion and text into global embeddings, discarding fine-grained local correspondences and offering limited interpretability of retrieval results.
Method: Uses a joint-angle-based motion representation that maps joint-level local features into structured pseudo-images compatible with pre-trained Vision Transformers. Employs MaxSim, a token-wise late interaction mechanism, enhanced with Masked Language Modeling regularization for robust, interpretable text-motion alignment.
Result: Extensive experiments on HumanML3D and KIT-ML datasets show the method outperforms state-of-the-art text-motion retrieval approaches while offering interpretable fine-grained correspondences between text and motion.
Conclusion: The proposed method achieves superior performance in text-motion retrieval while providing interpretable fine-grained alignments between language descriptions and 3D human motion sequences.
Abstract: Text-motion retrieval aims to learn a semantically aligned latent space between natural language descriptions and 3D human motion skeleton sequences, enabling bidirectional search across the two modalities. Most existing methods use a dual-encoder framework that compresses motion and text into global embeddings, discarding fine-grained local correspondences, and thus reducing accuracy. Additionally, these global-embedding methods offer limited interpretability of the retrieval results. To overcome these limitations, we propose an interpretable, joint-angle-based motion representation that maps joint-level local features into a structured pseudo-image, compatible with pre-trained Vision Transformers. For text-to-motion retrieval, we employ MaxSim, a token-wise late interaction mechanism, and enhance it with Masked Language Modeling regularization to foster robust, interpretable text-motion alignment. Extensive experiments on HumanML3D and KIT-ML show that our method outperforms state-of-the-art text-motion retrieval approaches while offering interpretable fine-grained correspondences between text and motion. The code is available in the supplementary material.
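MaxSim-style late interaction (the ColBERT-style mechanism the paper adopts) is compact to state: each text token is matched against every motion patch, and only its best match contributes to the relevance score. A minimal NumPy sketch of that standard formulation (embeddings and shapes here are illustrative):

```python
import numpy as np

def maxsim_score(text_tokens, motion_patches):
    """Late-interaction relevance: sum over text tokens of the max cosine
    similarity to any motion patch. text_tokens: (T, d), motion_patches: (P, d)."""
    t = text_tokens / np.linalg.norm(text_tokens, axis=1, keepdims=True)
    m = motion_patches / np.linalg.norm(motion_patches, axis=1, keepdims=True)
    sim = t @ m.T                  # (T, P) token-patch similarities
    return sim.max(axis=1).sum()   # each token keeps only its best-matching patch

text = np.array([[1.0, 0.0], [0.0, 1.0]])                  # two text tokens
patches = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # three motion patches
score = maxsim_score(text, patches)
```

The per-token argmax over patches is also what makes the alignment interpretable: it identifies which motion patch supports each word.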
[236] Unsupervised Domain Adaptation with Target-Only Margin Disparity Discrepancy
Gauthier Miralles, Loïc Le Folgoc, Vincent Jugnon, Pietro Gori
Main category: cs.CV
TL;DR: Novel unsupervised domain adaptation framework using Margin Disparity Discrepancy to improve liver segmentation on CBCT scans by leveraging annotated CT data and unannotated CBCT data.
Details
Motivation: CBCT is crucial for interventional radiology but lacks annotated datasets compared to CT, creating a need for domain adaptation methods that transfer knowledge from abundant CT data to scarce CBCT data for liver segmentation.
Method: Proposes an unsupervised domain adaptation framework based on a reformulation of Margin Disparity Discrepancy (MDD), leveraging proprietary unannotated CBCT scans and annotated CT data to bridge the modality gap.
Result: Achieves state-of-the-art performance in unsupervised domain adaptation and few-shot settings for liver segmentation on CT and CBCT datasets.
Conclusion: The proposed MDD-based domain adaptation framework effectively addresses the scarcity of annotated CBCT data by transferring knowledge from CT, enabling improved liver segmentation for interventional radiology applications.
Abstract: In interventional radiology, Cone-Beam Computed Tomography (CBCT) is a helpful imaging modality that provides guidance to practitioners during minimally invasive procedures. CBCT differs from traditional Computed Tomography (CT) due to its limited reconstructed field of view, specific artefacts, and the intra-arterial administration of contrast medium. While CT benefits from abundant publicly available annotated datasets, interventional CBCT data remain scarce and largely unannotated, with existing datasets focused primarily on radiotherapy applications. To address this limitation, we leverage a proprietary collection of unannotated interventional CBCT scans in conjunction with annotated CT data, employing domain adaptation techniques to bridge the modality gap and enhance liver segmentation performance on CBCT. We propose a novel unsupervised domain adaptation (UDA) framework based on the formalism of Margin Disparity Discrepancy (MDD), which improves target domain performance through a reformulation of the original MDD optimization framework. Experimental results on CT and CBCT datasets for liver segmentation demonstrate that our method achieves state-of-the-art performance in UDA, as well as in the few-shot setting.
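For context, the original MDD transfer loss (Zhang et al., 2019) trains an auxiliary classifier f' against the main classifier f: f' is pushed to agree with f on source samples and to place low probability on f's prediction on target samples, with a margin factor γ. A schematic PyTorch sketch of that baseline objective follows; the paper's target-only reformulation is not specified in the abstract, so this shows only the standard form, and all names are illustrative:

```python
import torch
import torch.nn.functional as F

def mdd_transfer_loss(src_f, src_f1, tgt_f, tgt_f1, gamma=4.0):
    """Schematic MDD transfer loss. *_f: logits of the main classifier f,
    *_f1: logits of the adversarial classifier f' (after gradient reversal)."""
    pl_src = src_f.argmax(dim=1)  # pseudo-labels from the main classifier
    pl_tgt = tgt_f.argmax(dim=1)
    # source disparity: f' should match f's predictions (margin factor gamma)
    loss_src = gamma * F.cross_entropy(src_f1, pl_src)
    # target disparity: f' should place low probability on f's prediction
    p = F.softmax(tgt_f1, dim=1).gather(1, pl_tgt.unsqueeze(1)).squeeze(1)
    loss_tgt = -torch.log((1.0 - p).clamp_min(1e-6)).mean()
    return loss_src + loss_tgt

src_f = torch.tensor([[2.0, 0.0], [0.0, 2.0]])
src_f1 = src_f.clone()
tgt_f = torch.tensor([[1.0, 0.0]])
tgt_f1 = torch.tensor([[0.0, 1.0]])
loss = mdd_transfer_loss(src_f, src_f1, tgt_f, tgt_f1)
```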
[237] Leveraging whole slide difficulty in Multiple Instance Learning to improve prostate cancer grading
Marie Arrivat, Rémy Peyret, Elsa Angelini, Pietro Gori
Main category: cs.CV
TL;DR: Proposes using Whole Slide Difficulty (WSD) based on expert-non-expert pathologist disagreement to improve Multiple Instance Learning for histopathology slide classification, particularly for higher Gleason grades.
Details
Motivation: Addresses diagnostic disagreement between expert and non-expert pathologists in histopathology, which can affect the quality of ground-truth labels used to train MIL models for Whole Slide Image classification.
Method: Introduces a Whole Slide Difficulty (WSD) metric based on diagnostic disagreement. Proposes two approaches: 1) multi-task learning that jointly predicts the slide diagnosis and WSD, and 2) a weighted classification loss that assigns higher weights to difficult slides during training.
Result: Integration of WSD during training consistently improves classification performance across different feature encoders and MIL methods, with particularly notable improvements for higher Gleason grades (worse diagnoses).
Conclusion: Incorporating slide difficulty information based on annotator disagreement can enhance MIL model performance in histopathology, especially for clinically important cases with higher Gleason grades.
Abstract: Multiple Instance Learning (MIL) has been widely applied in histopathology to classify Whole Slide Images (WSIs) with slide-level diagnoses. While the ground truth is established by expert pathologists, the slides can be difficult for non-experts to diagnose, leading to disagreements between annotators. In this paper, we introduce the notion of Whole Slide Difficulty (WSD), based on the disagreement between an expert and a non-expert pathologist. We propose two different methods to leverage WSD, a multi-task approach and a weighted classification loss approach, and we apply them to Gleason grading of prostate cancer slides. Results show that integrating WSD during training consistently improves the classification performance across different feature encoders and MIL methods, particularly for higher Gleason grades (i.e. worse diagnosis).
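The weighted-loss variant can be sketched directly: each slide's cross-entropy term is scaled up with its difficulty score. The abstract does not give the exact weighting scheme, so the linear `1 + alpha * wsd` form below is an assumption, and all names are illustrative:

```python
import torch
import torch.nn.functional as F

def wsd_weighted_loss(logits, labels, wsd, alpha=1.0):
    """Slide-level classification loss up-weighted by Whole Slide Difficulty.
    wsd: per-slide difficulty in [0, 1] from expert/non-expert disagreement."""
    per_slide = F.cross_entropy(logits, labels, reduction="none")  # (B,)
    weights = 1.0 + alpha * wsd  # difficult slides contribute more to the loss
    return (weights * per_slide).mean()

logits = torch.tensor([[2.0, 0.1, 0.0], [0.2, 1.5, 0.3]])
labels = torch.tensor([0, 1])
wsd = torch.tensor([0.0, 1.0])  # the second slide had annotator disagreement
loss = wsd_weighted_loss(logits, labels, wsd)
```

With `wsd = 0` everywhere the loss reduces to plain cross-entropy, so the weighting is a strict generalization of the usual MIL training objective.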
[238] ReCoSplat: Autoregressive Feed-Forward Gaussian Splatting Using Render-and-Compare
Freeman Cheng, Botao Ye, Xueting Li, Junqi You, Fangneng Zhan, Ming-Hsuan Yang
Main category: cs.CV
TL;DR: ReCoSplat is an autoregressive feed-forward Gaussian Splatting model for online novel view synthesis that handles both posed and unposed inputs, using a Render-and-Compare module to compensate for pose errors and a hybrid KV cache compression strategy for long sequences.
Details
Motivation: Online novel view synthesis from sequential, often unposed observations is challenging. Existing methods face a dilemma: using ground-truth poses during training ensures stability but causes a distribution mismatch when predicted poses are used at inference.
Method: Proposes ReCoSplat with: 1) a Render-and-Compare (ReCo) module that renders the current reconstruction from the predicted viewpoint and compares it with the incoming observation, providing a stable conditioning signal that compensates for pose errors; 2) hybrid KV cache compression combining early-layer truncation with chunk-level selective retention, reducing the KV cache size by over 90% for 100+ frames.
Result: Achieves state-of-the-art performance across different input settings on both in- and out-of-distribution benchmarks.
Conclusion: ReCoSplat effectively addresses the pose distribution mismatch problem in online novel view synthesis and enables efficient handling of long sequences through KV cache compression.
Abstract: Online novel view synthesis remains challenging, requiring robust scene reconstruction from sequential, often unposed, observations. We present ReCoSplat, an autoregressive feed-forward Gaussian Splatting model supporting posed or unposed inputs, with or without camera intrinsics. While assembling local Gaussians using camera poses scales better than canonical-space prediction, it creates a dilemma during training: using ground-truth poses ensures stability but causes a distribution mismatch when predicted poses are used at inference. To address this, we introduce a Render-and-Compare (ReCo) module. ReCo renders the current reconstruction from the predicted viewpoint and compares it with the incoming observation, providing a stable conditioning signal that compensates for pose errors. To support long sequences, we propose a hybrid KV cache compression strategy combining early-layer truncation with chunk-level selective retention, reducing the KV cache size by over 90% for 100+ frames. ReCoSplat achieves state-of-the-art performance across different input settings on both in- and out-of-distribution benchmarks. Code and pretrained models will be released. Our project page is at https://freemancheng.com/ReCoSplat .
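The two cache-compression ideas compose naturally: shallow layers keep only a recent window, while deeper layers score fixed-size chunks and retain only the top fraction. A rough PyTorch sketch of that combination; the scoring signal, chunk size, and thresholds here are assumptions (the paper's exact policy is not given in the abstract), and all names are illustrative:

```python
import torch

def compress_kv(keys, values, attn_mass, layer_idx,
                n_early_layers=8, keep_ratio=0.25, chunk=16):
    """keys/values: (T, d) cached entries; attn_mass: (T,) cumulative attention
    each cached token has received (an assumed importance signal)."""
    if layer_idx < n_early_layers:
        # early-layer truncation: keep only the most recent chunk
        return keys[-chunk:], values[-chunk:]
    # chunk-level selective retention: score whole chunks, keep the top fraction
    n_chunks = keys.shape[0] // chunk
    scores = attn_mass[: n_chunks * chunk].reshape(n_chunks, chunk).mean(dim=1)
    k = max(1, int(n_chunks * keep_ratio))
    keep = scores.topk(k).indices.sort().values             # kept chunk ids, in order
    idx = (keep[:, None] * chunk + torch.arange(chunk)).reshape(-1)
    return keys[idx], values[idx]

keys = torch.arange(64.0).repeat(4, 1).T   # (64, 4) dummy cache
values = keys.clone()
attn = torch.arange(64.0)                  # toy signal: later tokens attended more
k_deep, v_deep = compress_kv(keys, values, attn, layer_idx=12)
```

Keeping contiguous chunks rather than scattered tokens preserves local context within each retained span, which is one plausible motivation for chunk-level rather than token-level retention.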
[239] Unsupervised Representation Learning from Sparse Transformation Analysis
Yue Song, Thomas Anderson Keller, Yisong Yue, Pietro Perona, Max Welling
Main category: cs.CV
Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2410.05564.
[240] Personalized Feature Translation for Expression Recognition: An Efficient Source-Free Domain Adaptation Method
Masoumeh Sharafi, Soufiane Belharbi, Muhammad Osama Zeeshan, Houssem Ben Salem, Ali Etemad, Alessandro Lameiras Koerich, Marco Pedersoli, Simon Bacon, Eric Granger
Main category: cs.CV
Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2508.09202.
[241] EgoCross: Benchmarking Multimodal Large Language Models for Cross-Domain Egocentric Video Question Answering
Yanjun Li, Yuqian Fu, Tianwen Qian, Qi’ao Xu, Silong Dai, Danda Pani Paudel, Luc Van Gool, Xiaoling Wang
Main category: cs.CV
Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2508.10729.
[242] RECODE: Reasoning Through Code Generation for Visual Question Answering
Junhong Shen, Mu Cai, Bo Hu, Ameet Talwalkar, David A Ross, Cordelia Schmid, Alireza Fathi
Main category: cs.CV
Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2510.13756.
[243] TIMotion: Temporal and Interactive Framework for Efficient Human-Human Motion Generation
Yabiao Wang, Shuo Wang, Jiangning Zhang, Ke Fan, Jiafu Wu, Zhucun Xue, Yong Liu
Main category: cs.CV
Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2408.17135.
[244] Active Prompt Learning with Vision-Language Model Priors
Hoyoung Kim, Seokhee Jin, Changhwan Sung, Jaechang Kim, Jungseul Ok
Main category: cs.CV
Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2411.16722.
[245] ARSGaussian: 3D Gaussian Splatting with LiDAR for Aerial Remote Sensing Novel View Synthesis
Yiling Yao, Bing Zhang, Wenjuan Zhang, Lianru Gao, Dailiang Peng, Bocheng Li, Yaning Wang, Bowen Wang
Main category: cs.CV
Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2412.18380.
[246] From Spatial to Actions: Grounding Vision-Language-Action Model in Spatial Foundation Priors
Zhengshen Zhang, Hao Li, Yalun Dai, Zhengbang Zhu, Lei Zhou, Chenchen Liu, Dong Wang, Francis E. H. Tay, Sijin Chen, Ziwei Liu, Yuxiao Liu, Xinghang Li, Pan Zhou
Main category: cs.CV
Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2510.17439.
[247] A Survey on Wi-Fi Sensing Generalizability: Taxonomy, Techniques, Datasets, and Future Research Prospects
Fei Wang, Tingting Zhang, Wei Xi, Han Ding, Ge Wang, Di Zhang, Yuanhao Cui, Fan Liu, Jinsong Han, Jie Xu, Tony Xiao Han
Main category: cs.CV
Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2503.08008.
[248] SynHLMA: Synthesizing Hand Language Manipulation for Articulated Object with Discrete Human Object Interaction Representation
Wang zhi, Yuyan Liu, Liu Liu, Li Zhang, Ruixuan Lu, Dan Guo
Main category: cs.CV
Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2510.25268.
[249] Recognition-Synergistic Scene Text Editing
Zhengyao Fang, Pengyuan Lyu, Jingjing Wu, Chengquan Zhang, Jun Yu, Guangming Lu, Wenjie Pei
Main category: cs.CV
Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2503.08387.
[250] Semi-Supervised Biomedical Image Segmentation via Diffusion Models and Teacher-Student Co-Training
Luca Ciampi, Gabriele Lagani, Giuseppe Amato, Fabrizio Falchi
Main category: cs.CV
Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2504.01547.
[251] Zooming In on Fakes: A Novel Dataset for Localized AI-Generated Image Detection with Forgery Amplification Approach
Lvpan Cai, Haowei Wang, Jiayi Ji, Yanshu Zhoumen, Shen Chen, Taiping Yao, Xiaoshuai Sun
Main category: cs.CV
Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2504.11922.
[252] M4-SAR: A Multi-Resolution, Multi-Polarization, Multi-Scene, Multi-Source Dataset and Benchmark for optical-SAR Object Detection
Chao Wang, Wei Lu, Xiang Li, Jian Yang, Lei Luo
Main category: cs.CV
Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2505.10931.
[253] MARRS: Masked Autoregressive Unit-based Reaction Synthesis
Yabiao Wang, Shuo Wang, Jiangning Zhang, Jiafu Wu, Qingdong He, Yong Liu
Main category: cs.CV
Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2505.11334.
[254] MediRound: Multi-Round Entity-Level Reasoning Segmentation in Medical Images
Qinyue Tong, Ziqian Lu, Jun Liu, Rui Zuo, Zheming Lu
Main category: cs.CV
Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2511.12110.
[255] EasyText: Controllable Diffusion Transformer for Multilingual Text Rendering
Runnan Lu, Yuxuan Zhang, Jiaming Liu, Haofan Wang, Yiren Song
Main category: cs.CV
Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2505.24417.
[256] SpikeSMOKE: Spiking Neural Networks for Monocular 3D Object Detection with Cross-Scale Gated Coding
Xuemei Chen, Huamin Wang, Jing Peng, Hangchi Shen, Shukai Duan, Shiping Wen, Tingwen Huang
Main category: cs.CV
Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2506.07737.
[257] Improving Large Vision-Language Models’ Understanding for Flow Field Data
Xiaomei Zhang, Hanyu Zheng, Xiangyu Zhu, Jinghuan Wei, Junhong Zou, Zhen Lei, Zhaoxiang Zhang
Main category: cs.CV
Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2507.18311.
[258] Mitigating Long-Tail Bias in HOI Detection via Adaptive Diversity Cache
Yuqiu Jiang, Xiaozhen Qiao, Yifan Chen, Ye Zheng, Zhe Sun, Xuelong Li
Main category: cs.CV
Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2511.18811.
[259] You Only Pose Once: A Minimalist’s Detection Transformer for Monocular RGB Category-level 9D Multi-Object Pose Estimation
Hakjin Lee, Junghoon Seo, Jaehoon Sim
Main category: cs.CV
Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2508.14965.
[260] Kuramoto Orientation Diffusion Models
Yue Song, T. Anderson Keller, Sevan Brodjian, Takeru Miyato, Yisong Yue, Pietro Perona, Max Welling
Main category: cs.CV
Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2509.15328.
[261] CoRe-GS: Coarse-to-Refined Gaussian Splatting with Semantic Object Focus
Hannah Schieber, Dominik Frischmann, Victor Schaack, Simon Boche, Angela Schoellig, Stefan Leutenegger, Daniel Roth
Main category: cs.CV
Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2509.04859.
[262] When Robots Obey the Patch: Universal Transferable Patch Attacks on Vision-Language-Action Models
Hui Lu, Yi Yu, Yiming Yang, Chenyu Yi, Qixin Zhang, Bingquan Shen, Alex C. Kot, Xudong Jiang
Main category: cs.CV
Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2511.21192.
[263] VocSegMRI: Multimodal Learning for Precise Vocal Tract Segmentation in Real-time MRI
Daiqi Liu, Tomás Arias-Vergara, Johannes Enk, Fangxu Xing, Maureen Stone, Jerry L. Prince, Jana Hutter, Andreas Maier, Jonghye Woo, Paula Andrea Pérez-Toro
Main category: cs.CV
Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2509.13767.
[264] Learning Encoding-Decoding Direction Pairs to Unveil Concepts of Influence in Deep Vision Networks
Alexandros Doumanoglou, Kurt Driessens, Dimitrios Zarpalas
Main category: cs.CV
Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2509.23926.
[265] LLaVAShield: Safeguarding Multimodal Multi-Turn Dialogues in Vision-Language Models
Guolei Huang, Qinzhi Peng, Gan Xu, Yao Huang, Yuxuan Lu, Yongjun Shen
Main category: cs.CV
Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2509.25896.
[266] Mapping Historic Urban Footprints in France: Balancing Quality, Scalability and AI Techniques
Walid Rabehi, Marion Le Texier, Rémi Lemoy
Main category: cs.CV
Summary unavailable: the arXiv API returned HTTP 429 (rate limited) for 2510.02097.
[267] CLEAR-Mamba: Towards Accurate, Adaptive and Trustworthy Multi-Sequence Ophthalmic Angiography Classification
Zhuonan Wang, Wenjie Yan, Wenqiao Zhang, Xiaohui Song, Jian Ma, Ke Yao, Yibo Yu, Beng Chin Ooi
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion could not be determined because the abstract was not retrieved.
Abstract: Failed to fetch summary for 2601.20601: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.20601&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[268] Directional Textual Inversion for Personalized Text-to-Image Generation
Kunhee Kim, NaHyeon Park, Kibeom Hong, Hyunjung Shim
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion could not be determined because the abstract was not retrieved.
Abstract: Failed to fetch summary for 2512.13672: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.13672&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[269] Exploring Single Domain Generalization of LiDAR-based Semantic Segmentation under Imperfect Labels
Weitong Kong, Zichao Zeng, Di Wen, Jiale Wei, Kunyu Peng, June Moh Goo, Jan Boehm, Rainer Stiefelhagen
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion could not be determined because the abstract was not retrieved.
Abstract: Failed to fetch summary for 2510.09035: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.09035&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[270] Real-Time Neural Video Compression with Unified Intra and Inter Coding
Hui Xiang, Yifan Bian, Li Li, Jingran Wu, Xianguo Zhang, Dong Liu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion could not be determined because the abstract was not retrieved.
Abstract: Failed to fetch summary for 2510.14431: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.14431&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[271] WebAccessVL: Violation-Aware VLM for Web Accessibility
Amber Yijia Zheng, Jae Joong Lee, Bedrich Benes, Raymond A. Yeh
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion could not be determined because the abstract was not retrieved.
Abstract: Failed to fetch summary for 2602.03850: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.03850&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[272] Proper Body Landmark Subset Enables More Accurate and 5X Faster Recognition of Isolated Signs in LIBRAS
Daniele L. V. dos Santos, Thiago B. Pereira, Carlos Eduardo G. R. Alves, Richard J. M. G. Tello, Francisco de A. Boldt, Thiago M. Paixão
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion could not be determined because the abstract was not retrieved.
Abstract: Failed to fetch summary for 2510.24887: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.24887&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[273] Monocular Normal Estimation via Shading Sequence Estimation
Zongrui Li, Xinhua Ma, Minghui Hu, Yunqing Zhao, Yingchen Yu, Qian Zheng, Chang Liu, Xudong Jiang, Song Bai
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion could not be determined because the abstract was not retrieved.
Abstract: Failed to fetch summary for 2602.09929: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.09929&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[274] B-DENSE: Branching For Dense Ensemble Network Supervision Efficiency
Cherish Puniani, Tushar Kumar, Arnav Bendre, Gaurav Kumar, Shree Singhi
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion could not be determined because the abstract was not retrieved.
Abstract: Failed to fetch summary for 2602.15971: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.15971&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[275] Who Made This? Fake Detection and Source Attribution with Diffusion Features
Simone Bonechi, Paolo Andreini, Barbara Toniella Corradini
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion could not be determined because the abstract was not retrieved.
Abstract: Failed to fetch summary for 2510.27602: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.27602&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[276] Energy-Aware Spike Budgeting for Continual Learning in Spiking Neural Networks for Neuromorphic Vision
Anika Tabassum Meem, Muntasir Hossain Nadid, Md Zesun Ahmed Mia
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion could not be determined because the abstract was not retrieved.
Abstract: Failed to fetch summary for 2602.12236: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.12236&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[277] SPAN: Spatial-Projection Alignment for Monocular 3D Object Detection
Yifan Wang, Yian Zhao, Fanqi Pu, Xiaochen Yang, Yang Tang, Xi Chen, Wenming Yang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion could not be determined because the abstract was not retrieved.
Abstract: Failed to fetch summary for 2511.06702: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.06702&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[278] V-Attack: Targeting Disentangled Value Features for Controllable Adversarial Attacks on LVLMs
Sen Nie, Jie Zhang, Jianxin Yan, Shiguang Shan, Xilin Chen
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion could not be determined because the abstract was not retrieved.
Abstract: Failed to fetch summary for 2511.20223: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.20223&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[279] AVGGT: Rethinking Global Attention for Accelerating VGGT
Xianbing Sun, Zhikai Zhu, Zhengyu Lou, Bo Yang, Jinyang Tang, Liqing Zhang, He Wang, Jianfu Zhang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion could not be determined because the abstract was not retrieved.
Abstract: Failed to fetch summary for 2512.02541: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.02541&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[280] OrthoAI: A Neurosymbolic Framework for Evidence-Grounded Biomechanical Reasoning in Clear Aligner Orthodontics
Edouard Lansiaux, Margaux Leman, Mehdi Ammi
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion could not be determined because the abstract was not retrieved.
Abstract: Failed to fetch summary for 2603.00124: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.00124&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[281] Fairness-Aware Fine-Tuning of Vision-Language Models for Medical Glaucoma Diagnosis
Zijian Gu, Yuxi Liu, Zhenhao Zhang, Song Wang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion could not be determined because the abstract was not retrieved.
Abstract: Failed to fetch summary for 2512.03477: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.03477&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[282] Zero-Shot and Supervised Bird Image Segmentation Using Foundation Models: A Dual-Pipeline Approach with Grounding DINO1.5, YOLOv11, and SAM2.1
Abhinav Munagala
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion could not be determined because the abstract was not retrieved.
Abstract: Failed to fetch summary for 2603.00184: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.00184&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[283] ADHint: Adaptive Hints with Difficulty Priors for Reinforcement Learning
Feng Zhang, Zezhong Tan, Xinhong Ma, Ziqiang Dong, Xi Leng, Jianfei Zhao, Xin Sun, Yang Yang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion could not be determined because the abstract was not retrieved.
Abstract: Failed to fetch summary for 2512.13095: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.13095&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[284] Pri4R: Learning World Dynamics for Vision-Language-Action Models with Privileged 4D Representation
Jisoo Kim, Jungbin Cho, Sanghyeok Chu, Ananya Bal, Jinhyung Kim, Gunhee Lee, Sihaeng Lee, Seung Hwan Kim, Bohyung Han, Hyunmin Lee, Laszlo A. Jeni, Seungryong Kim
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion could not be determined because the abstract was not retrieved.
Abstract: Failed to fetch summary for 2603.01549: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.01549&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[285] VOIC: Visible-Occluded Integrated Guidance for 3D Semantic Scene Completion
Zaidao Han, Risa Higashita, Jiang Liu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion could not be determined because the abstract was not retrieved.
Abstract: Failed to fetch summary for 2512.18954: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.18954&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[286] Pretraining Frame Preservation for Lightweight Autoregressive Video History Embedding
Lvmin Zhang, Shengqu Cai, Muyang Li, Chong Zeng, Beijia Lu, Anyi Rao, Song Han, Gordon Wetzstein, Maneesh Agrawala
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion could not be determined because the abstract was not retrieved.
Abstract: Failed to fetch summary for 2512.23851: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.23851&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[287] PlaneCycle: Training-Free 2D-to-3D Lifting of Foundation Models Without Adapters
Yinghong Yu, Guangyuan Li, Jiancheng Yang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion could not be determined because the abstract was not retrieved.
Abstract: Failed to fetch summary for 2603.04165: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.04165&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[288] Taming Preference Mode Collapse via Directional Decoupling Alignment in Diffusion Reinforcement Learning
Chubin Chen, Sujie Hu, Jiashu Zhu, Meiqi Wu, Jintao Chen, Yanxun Li, Nisha Huang, Chengyu Fang, Jiahong Wu, Xiangxiang Chu, Xiu Li
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion could not be determined because the abstract was not retrieved.
Abstract: Failed to fetch summary for 2512.24146: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.24146&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[289] OptiRoulette Optimizer: A New Stochastic Meta-Optimizer for up to 5.3x Faster Convergence
Stamatis Mastromichalakis
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion could not be determined because the abstract was not retrieved.
Abstract: Failed to fetch summary for 2603.06613: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.06613&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[290] Low-rank Orthogonal Subspace Intervention for Generalizable Face Forgery Detection
Chi Wang, Xinjue Hu, Boyu Wang, Ziwen He, Zhangjie Fu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion could not be determined because the abstract was not retrieved.
Abstract: Failed to fetch summary for 2601.11915: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.11915&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[291] GameVerse: Can Vision-Language Models Learn from Video-based Reflection?
Kuan Zhang, Dongchen Liu, Qiyue Zhao, Jinkun Hou, Xinran Zhang, Qinlei Xie, Miao Liu, Yiming Li
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion could not be determined because the abstract was not retrieved.
Abstract: Failed to fetch summary for 2603.06656: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.06656&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[292] Exploiting the Final Component of Generator Architectures for AI-Generated Image Detection
Yanzhu Liu, Xiao Liu, Yuexuan Wang, Mondal Soumik
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion could not be determined because the abstract was not retrieved.
Abstract: Failed to fetch summary for 2601.20461: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.20461&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[293] Deep Expert Injection for Anchoring Retinal VLMs with Domain-Specific Knowledge
Shuai Lu, Meng Wang, Jia Guo, Jiawei Du, Bo Liu, Shengzhu Yang, Weihang Zhang, Huazhu Fu, Huiqi Li
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion could not be determined because the abstract was not retrieved.
Abstract: Failed to fetch summary for 2603.07131: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.07131&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[294] RegionReasoner: Region-Grounded Multi-Round Visual Reasoning
Wenfang Sun, Hao Chen, Yingjun Du, Yefeng Zheng, Cees G. M. Snoek
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion could not be determined because the abstract was not retrieved.
Abstract: Failed to fetch summary for 2602.03733: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.03733&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[295] Pathwise Test-Time Correction for Autoregressive Long Video Generation
Xunzhi Xiang, Zixuan Duan, Guiyu Zhang, Haiyu Zhang, Zhe Gao, Junta Wu, Shaofeng Zhang, Tengfei Wang, Qi Fan, Chunchao Guo
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion could not be determined because the abstract was not retrieved.
Abstract: Failed to fetch summary for 2602.05871: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.05871&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[296] IMSE: Intrinsic Mixture of Spectral Experts Fine-tuning for Test-Time Adaptation
Sunghyun Baek, Jaemyung Yu, Seunghee Koh, Minsu Kim, Hyeonseong Jeon, Junmo Kim
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion could not be determined because the abstract was not retrieved.
Abstract: Failed to fetch summary for 2603.07926: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.07926&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[297] Multimodal Classification via Total Correlation Maximization
Feng Yu, Xiangyu Wu, Yang Yang, Jianfeng Lu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion could not be determined because the abstract was not retrieved.
Abstract: Failed to fetch summary for 2602.13015: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.13015&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[298] Latent Equivariant Operators for Robust Object Recognition: Promises and Challenges
Minh Dinh, Stéphane Deny
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion could not be determined because the abstract was not retrieved.
Abstract: Failed to fetch summary for 2602.18406: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.18406&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[299] ChimeraLoRA: Multi-Head LoRA-Guided Synthetic Datasets
Hoyoung Kim, Minwoo Jang, Jabin Koo, Sangdoo Yun, Jungseul Ok
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion could not be determined because the abstract was not retrieved.
Abstract: Failed to fetch summary for 2602.19708: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.19708&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[300] Self-Attention And Beyond the Infinite: Towards Linear Transformers with Infinite Self-Attention
Giorgio Roffo, Luke Palmer
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion could not be determined because the abstract was not retrieved.
Abstract: Failed to fetch summary for 2603.00175: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.00175&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[301] DOCFORGE-BENCH: A Comprehensive 0-shot Benchmark for Document Forgery Detection and Analysis
Zengqi Zhao, Weidi Xia, En Wei, Yan Zhang, Jane Mo, Tiannan Zhang, Yuanqin Dai, Zexi Chen, Yiran Tao, Simiao Ren
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion could not be determined because the abstract was not retrieved.
Abstract: Failed to fetch summary for 2603.01433: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.01433&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[302] N-gram Injection into Transformers for Dynamic Language Model Adaptation in Handwritten Text Recognition
Florent Meyer, Laurent Guichard, Yann Soullard, Denis Coquenet, Guillaume Gravier, Bertrand Coüasnon
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion could not be determined because the abstract was not retrieved.
Abstract: Failed to fetch summary for 2603.03930: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.03930&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[303] Making Training-Free Diffusion Segmentors Scale with the Generative Power
Benyuan Meng, Qianqian Xu, Zitai Wang, Xiaochun Cao, Longtao Huang, Qingming Huang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion could not be determined because the abstract was not retrieved.
Abstract: Failed to fetch summary for 2603.06178: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.06178&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[304] Breaking the Geometric Bottleneck: Contrastive Expansion in Asymmetric Cross-Modal Distillation
Kabir Thayani
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion could not be determined because the abstract was not retrieved.
Abstract: Failed to fetch summary for 2603.06698: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.06698&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[305] SODA: Sensitivity-Oriented Dynamic Acceleration for Diffusion Transformer
Tong Shao, Yusen Fu, Guoying Sun, Jingde Kong, Zhuotao Tian, Jingyong Su
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion could not be determined because the abstract was not retrieved.
Abstract: Failed to fetch summary for 2603.07057: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.07057&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[306] CuriousBot: Interactive Mobile Exploration via Actionable 3D Relational Object Graph
Yixuan Wang, Leonor Fermoselle, Tarik Kelestemur, Jiuguang Wang, Yunzhu Li
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion could not be determined because the abstract was not retrieved.
Abstract: Failed to fetch summary for 2501.13338: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2501.13338&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[307] VirtueBench: Evaluating Trustworthiness under Uncertainty in Long Video Understanding
Xueqing Yu, Bohan Li, Yan Li, Zhenheng Yang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion could not be determined because the abstract was not retrieved.
Abstract: Failed to fetch summary for 2603.07071: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.07071&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[308] Class Visualizations and Activation Atlases for Enhancing Interpretability in Deep Learning-Based Computational Pathology
Marco Gustav, Fabian Wolf, Christina Glasner, Nic G. Reitsam, Stefan Schulz, Kira Aschenbroich, Bruno Märkl, Sebastian Foersch, Jakob Nikolas Kather
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.07170 was rate-limited (HTTP 429).
[309] High-Fidelity Medical Shape Generation via Skeletal Latent Diffusion
Guoqing Zhang, Jingyun Yang, Siqi Chen, Anping Zhang, Yang Li
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.07504 was rate-limited (HTTP 429).
[310] Video2LoRA: Unified Semantic-Controlled Video Generation via Per-Reference-Video LoRA
Zexi Wu, Qinghe Wang, Jing Dai, Baolu Li, Yiming Zhang, Yue Ma, Xu Jia, Hongming Xu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.08210 was rate-limited (HTTP 429).
[311] PRISM: Streaming Human Motion Generation with Per-Joint Latent Decomposition
Zeyu Ling, Qing Shuai, Teng Zhang, Shiyang Li, Bo Han, Changqing Zou
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.08590 was rate-limited (HTTP 429).
[312] Unveiling the Potential of iMarkers: Invisible Fiducial Markers for Advanced Robotics
Ali Tourani, Deniz Isinsu Avsar, Hriday Bavle, Jose Luis Sanchez-Lopez, Jan Lagerwall, Holger Voos
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2501.15505 was rate-limited (HTTP 429).
[313] Automated Coral Spawn Monitoring for Reef Restoration: The Coral Spawn and Larvae Imaging Camera System (CSLICS)
Dorian Tsai, Christopher A. Brunner, Riki Lamont, F. Mikaela Nordborg, Andrea Severati, Java Terry, Karen Jackel, Matthew Dunbabin, Tobias Fischer, Scarlett Raine
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2509.17299 was rate-limited (HTTP 429).
[314] Bootstrap Dynamic-Aware 3D Visual Representation for Scalable Robot Learning
Qiwei Liang, Boyang Cai, Minghao Lai, Sitong Zhuang, Tao Lin, Yan Qin, Yixuan Ye, Jiaming Liang, Renjing Xu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.00074 was rate-limited (HTTP 429).
[315] StructBiHOI: Structured Articulation Modeling for Long–Horizon Bimanual Hand–Object Interaction Generation
Zhi Wang, Liu Liu, Ruonan Liu, Dan Guo, Meng Wang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.08390 was rate-limited (HTTP 429).
cs.AI
[316] MASEval: Extending Multi-Agent Evaluation from Models to Systems
Cornelius Emde, Alexander Rubinstein, Anmol Goel, Ahmed Heakl, Sangdoo Yun, Seong Joon Oh, Martin Gubri
Main category: cs.AI
TL;DR: MASEval is a framework-agnostic evaluation library for LLM-based agentic systems that treats the entire system as the unit of analysis, revealing that framework choice matters as much as model choice for performance.
Details
Motivation: Existing benchmarks for LLM-based agentic systems are model-centric and don't compare other system components like topology, orchestration logic, and error handling, despite implementation decisions substantially impacting performance.
Method: Developed MASEval as a framework-agnostic library that treats the entire agentic system as the unit of analysis, enabling systematic system-level comparisons across benchmarks, models, and frameworks.
Result: Through comparisons across 3 benchmarks, 3 models, and 3 frameworks, found that framework choice matters as much as model choice for agentic system performance.
Conclusion: MASEval addresses the evaluation gap for agentic systems, enabling researchers to explore all system components for principled design and helping practitioners identify optimal implementations.
Abstract: The rapid adoption of LLM-based agentic systems has produced a rich ecosystem of frameworks (smolagents, LangGraph, AutoGen, CAMEL, LlamaIndex, i.a.). Yet existing benchmarks are model-centric: they fix the agentic setup and do not compare other system components. We argue that implementation decisions substantially impact performance, including choices such as topology, orchestration logic, and error handling. MASEval addresses this evaluation gap with a framework-agnostic library that treats the entire system as the unit of analysis. Through a systematic system-level comparison across 3 benchmarks, 3 models, and 3 frameworks, we find that framework choice matters as much as model choice. MASEval allows researchers to explore all components of agentic systems, opening new avenues for principled system design, and practitioners to identify the best implementation for their use case. MASEval is available under the MIT licence https://github.com/parameterlab/MASEval.
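To make the system-as-unit-of-analysis idea concrete, here is a minimal sketch of a cross-product evaluation sweep over (framework, model) pairs. All names and scores are hypothetical placeholders, not MASEval's actual API; the point is that averaging over frameworks and over models separately lets you attribute performance spread to either axis.

```python
import itertools
from dataclasses import dataclass

@dataclass
class RunResult:
    framework: str
    model: str
    benchmark: str
    score: float

def evaluate_system(framework: str, model: str, benchmark: str) -> float:
    """Stand-in for running one full agentic system end to end.
    The fixed scores below are invented to illustrate that framework
    choice can shift results as much as model choice."""
    table = {
        ("langgraph", "gpt-4o-mini"): 0.62,
        ("langgraph", "qwen-2.5"): 0.55,
        ("smolagents", "gpt-4o-mini"): 0.54,
        ("smolagents", "qwen-2.5"): 0.61,
    }
    return table[(framework, model)]

def sweep(frameworks, models, benchmark):
    # The system (framework + model + orchestration) is the unit of analysis.
    return [RunResult(f, m, benchmark, evaluate_system(f, m, benchmark))
            for f, m in itertools.product(frameworks, models)]

results = sweep(["langgraph", "smolagents"], ["gpt-4o-mini", "qwen-2.5"], "demo-qa")

# Marginal means along each axis expose which choice drives the spread.
by_framework = {f: sum(r.score for r in results if r.framework == f) / 2
                for f in ("langgraph", "smolagents")}
by_model = {m: sum(r.score for r in results if r.model == m) / 2
            for m in ("gpt-4o-mini", "qwen-2.5")}
```

In this toy table the model marginals are identical while the framework marginals differ, i.e. all of the variation is attributable to framework choice.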
[317] LDP: An Identity-Aware Protocol for Multi-Agent LLM Systems
Sunil Prakash
Main category: cs.AI
TL;DR: LDP is an AI-native communication protocol for multi-agent systems that introduces delegate identity cards, progressive payload modes, governed sessions, provenance tracking, and trust domains to enable more efficient and governable delegation.
Details
Motivation: Current multi-agent protocols (A2A, MCP) lack model-level properties as first-class primitives, ignoring fundamental delegation properties like model identity, reasoning profiles, quality calibration, and cost characteristics needed for effective AI agent collaboration.
Method: Proposes LLM Delegate Protocol (LDP) with five mechanisms: 1) rich delegate identity cards with quality hints and reasoning profiles, 2) progressive payload modes with negotiation and fallback, 3) governed sessions with persistent context, 4) structured provenance tracking confidence and verification, 5) trust domains for security boundaries. Implemented as a plugin for the JamJet agent runtime and evaluated against A2A and random baselines using local Ollama models and LLM-as-judge evaluation.
Result: Identity-aware routing achieved ~12x lower latency on easy tasks via delegate specialization (though no aggregate quality improvement in small pool); semantic frame payloads reduced token count by 37% (p=0.031) with no quality loss; governed sessions eliminated 39% token overhead at 10 rounds; noisy provenance degraded synthesis quality below baseline, showing confidence metadata harmful without verification. Simulated analyses showed 96% vs. 6% attack detection and 100% vs. 35% completion for failure recovery.
Conclusion: LDP demonstrates that AI-native protocol primitives enable more efficient and governable delegation in multi-agent systems, with architectural advantages in attack detection and failure recovery.
Abstract: As multi-agent AI systems grow in complexity, the protocols connecting them constrain their capabilities. Current protocols such as A2A and MCP do not expose model-level properties as first-class primitives, ignoring properties fundamental to effective delegation: model identity, reasoning profile, quality calibration, and cost characteristics. We present the LLM Delegate Protocol (LDP), an AI-native communication protocol introducing five mechanisms: (1) rich delegate identity cards with quality hints and reasoning profiles; (2) progressive payload modes with negotiation and fallback; (3) governed sessions with persistent context; (4) structured provenance tracking confidence and verification status; (5) trust domains enforcing security boundaries at the protocol level. We implement LDP as a plugin for the JamJet agent runtime and evaluate against A2A and random baselines using local Ollama models and LLM-as-judge evaluation. Identity-aware routing achieves ~12x lower latency on easy tasks through delegate specialization, though it does not improve aggregate quality in our small delegate pool; semantic frame payloads reduce token count by 37% (p=0.031) with no observed quality loss; governed sessions eliminate 39% token overhead at 10 rounds; and noisy provenance degrades synthesis quality below the no-provenance baseline, arguing that confidence metadata is harmful without verification. Simulated analyses show architectural advantages in attack detection (96% vs. 6%) and failure recovery (100% vs. 35% completion). This paper contributes a protocol design, reference implementation, and initial evidence that AI-native protocol primitives enable more efficient and governable delegation.
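A rough sketch of what an identity card and identity-aware routing could look like. The field names, profile strings, and routing rule are guesses for illustration, not LDP's actual schema; they show why exposing reasoning profile and cost as first-class fields lets a router send easy tasks to a cheap fast delegate and hard tasks to a high-quality one.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DelegateCard:
    """Hypothetical delegate identity card in the spirit of LDP mechanism (1)."""
    name: str
    model_id: str
    reasoning_profile: str   # e.g. "fast-shallow" or "slow-deep"
    quality_hint: float      # self-reported calibration in [0, 1]
    cost_per_1k_tokens: float

def route(task_difficulty: str, pool: list[DelegateCard]) -> DelegateCard:
    """Identity-aware routing: easy tasks go to the cheapest fast delegate,
    everything else to the highest quality hint. A real router would also
    weigh cost ceilings and trust-domain membership."""
    if task_difficulty == "easy":
        fast = [d for d in pool if d.reasoning_profile == "fast-shallow"]
        return min(fast or pool, key=lambda d: d.cost_per_1k_tokens)
    return max(pool, key=lambda d: d.quality_hint)

pool = [
    DelegateCard("sprinter", "llama-3b", "fast-shallow", 0.6, 0.02),
    DelegateCard("thinker", "qwen-32b", "slow-deep", 0.9, 0.40),
]
```

Routing easy tasks to `sprinter` is what would produce the paper's observed latency win on easy tasks without touching hard-task quality.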
[318] Quantifying the Accuracy and Cost Impact of Design Decisions in Budget-Constrained Agentic LLM Search
Kyle McCleary, James Ghawaly
Main category: cs.AI
TL;DR: Budget-Constrained Agentic Search (BCAS) framework evaluates how search depth, retrieval strategy, and completion budgets affect accuracy and cost in agentic RAG systems under fixed constraints.
Details
Motivation: Agentic RAG systems combine search, planning, and retrieval but face practical deployment constraints with explicit budgets on tool calls and tokens. There's a need to understand how different configurations affect performance under budget constraints.
Method: Developed BCAS, a model-agnostic evaluation harness that surfaces remaining budget and gates tool use. Conducted controlled measurements across six LLMs and three QA benchmarks, varying search depth, retrieval strategies (hybrid lexical/dense with re-ranking), and completion budgets.
Result: Accuracy improves with additional searches up to a small cap; hybrid retrieval with lightweight re-ranking produces largest average gains; larger completion budgets are most helpful on HotpotQA-style synthesis tasks.
Conclusion: Provides practical guidance for configuring budgeted agentic retrieval pipelines, with reproducible prompts and evaluation settings for deployment optimization.
Abstract: Agentic Retrieval-Augmented Generation (RAG) systems combine iterative search, planning prompts, and retrieval backends, but deployed settings impose explicit budgets on tool calls and completion tokens. We present a controlled measurement study of how search depth, retrieval strategy, and completion budget affect accuracy and cost under fixed constraints. Using Budget-Constrained Agentic Search (BCAS), a model-agnostic evaluation harness that surfaces remaining budget and gates tool use, we run comparisons across six LLMs and three question-answering benchmarks. Across models and datasets, accuracy improves with additional searches up to a small cap, hybrid lexical and dense retrieval with lightweight re-ranking produces the largest average gains in our ablation grid, and larger completion budgets are most helpful on HotpotQA-style synthesis. These results provide practical guidance for configuring budgeted agentic retrieval pipelines and are accompanied by reproducible prompts and evaluation settings.
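The two harness behaviors the abstract names, surfacing remaining budget and gating tool use, can be sketched in a few lines. Class and method names are invented for illustration; BCAS's real interface may differ.

```python
class BudgetedSearch:
    """Minimal sketch of a budget-constrained search harness:
    the remaining budget is surfaced to the agent each turn, and
    tool calls are gated rather than silently executed over budget."""

    def __init__(self, max_searches: int, max_tokens: int):
        self.searches_left = max_searches
        self.tokens_left = max_tokens

    def status(self) -> str:
        # Injected into the agent's prompt so it can plan around the budget.
        return f"searches_left={self.searches_left} tokens_left={self.tokens_left}"

    def search(self, query: str, cost_tokens: int) -> str:
        if self.searches_left <= 0 or self.tokens_left < cost_tokens:
            return "DENIED: budget exhausted"   # gate: the tool is never called
        self.searches_left -= 1
        self.tokens_left -= cost_tokens
        return f"RESULTS for {query!r}"

h = BudgetedSearch(max_searches=2, max_tokens=100)
r1 = h.search("capital of France", cost_tokens=40)
r2 = h.search("population of Paris", cost_tokens=40)
r3 = h.search("one more query", cost_tokens=40)  # over budget, denied
```

The finding that accuracy improves "up to a small cap" corresponds to raising `max_searches` and watching marginal gains flatten.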
[319] Interpretable Markov-Based Spatiotemporal Risk Surfaces for Missing-Child Search Planning with Reinforcement Learning and LLM-Based Quality Assurance
Joshua Castillo, Ravi Mukkamala
Main category: cs.AI
TL;DR: Guardian is an end-to-end decision-support system for missing-child investigations that converts unstructured case data into spatiotemporal representations and provides probabilistic search predictions using a three-layer architecture combining Markov chains, reinforcement learning, and LLM validation.
Details
Motivation: The critical first 72 hours of missing-child investigations are hampered by fragmented, unstructured data and lack of dynamic geospatial predictive tools, creating a need for integrated decision-support systems.
Method: Three-layer predictive system: 1) Markov chain with road accessibility, seclusion preferences, and corridor bias (day/night parameterized), 2) Reinforcement learning transforms predictions into operational search plans, 3) LLM performs post hoc validation of search plans.
Result: The system produces interpretable priors for zone optimization and human review across 24/48/72-hour horizons, with quantitative outputs analyzed through sensitivity, failure modes, and tradeoffs using a realistic synthetic case study.
Conclusion: Guardian provides an effective end-to-end decision-support system for missing-child investigations that converts unstructured data into actionable probabilistic search products with interpretable predictions suitable for operational use.
Abstract: The first 72 hours of a missing-child investigation are critical for successful recovery. However, law enforcement agencies often face fragmented, unstructured data and a lack of dynamic, geospatial predictive tools. Our system, Guardian, provides an end-to-end decision-support system for missing-child investigation and early search planning. It converts heterogeneous, unstructured case documents into a schema-aligned spatiotemporal representation, enriches cases with geocoding and transportation context, and provides probabilistic search products spanning 0-72 hours. In this paper, we present an overview of Guardian as well as a detailed description of a three-layer predictive component of the system. The first layer is a Markov chain, a sparse, interpretable model with transitions incorporating road accessibility costs, seclusion preferences, and corridor bias with separate day/night parameterizations. The Markov chain’s output prediction distributions are then transformed into operationally useful search plans by the second layer’s reinforcement learning. Finally, the third layer’s LLM performs post hoc validation of layer 2 search plans prior to their release. Using a synthetic but realistic case study, we report quantitative outputs across 24/48/72-hour horizons and analyze sensitivity, failure modes, and tradeoffs. Results show that the proposed predictive system with the three-layer architecture produces interpretable priors for zone optimization and human review.
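Layer 1 is a sparse Markov chain with day/night-parameterized transitions. The toy below, with invented zone names and weights, shows the mechanic: raw weights encode road access, seclusion preference, and corridor bias, a night flag rescales the seclusion term, and rows are normalized into a transition matrix that is iterated to propagate a location distribution.

```python
# Hypothetical three-zone example; all weights are illustrative only.
ZONES = ["road", "woods", "corridor"]

def transition_matrix(is_night: bool):
    """Row-normalized transitions; night shifts mass toward secluded zones."""
    seclusion = 2.0 if is_night else 1.0
    raw = {
        "road":     {"road": 1.0, "woods": 0.5 * seclusion, "corridor": 1.5},
        "woods":    {"road": 0.5, "woods": 1.0 * seclusion, "corridor": 0.5},
        "corridor": {"road": 1.0, "woods": 0.5 * seclusion, "corridor": 2.0},
    }
    P = {}
    for z, row in raw.items():
        s = sum(row.values())
        P[z] = {z2: w / s for z2, w in row.items()}
    return P

def step(dist, P):
    """One Markov step: propagate the location distribution forward."""
    out = {z: 0.0 for z in ZONES}
    for z, p in dist.items():
        for z2, q in P[z].items():
            out[z2] += p * q
    return out

d0 = {"road": 1.0, "woods": 0.0, "corridor": 0.0}
d_day = step(d0, transition_matrix(is_night=False))
d_night = step(d0, transition_matrix(is_night=True))
```

Iterating `step` out to 24/48/72-hour horizons would yield the interpretable priors the layers above consume.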
[320] AgentOS: From Application Silos to a Natural Language-Driven Data Ecosystem
Rui Liu, Tao Zhe, Dongjie Wang, Zijun Yao, Kunpeng Liu, Yanjie Fu, Huan Liu, Jian Pei
Main category: cs.AI
TL;DR: Proposes AgentOS, a new operating system paradigm centered on natural language/voice interfaces where an Agent Kernel coordinates multiple AI agents, transforming traditional apps into modular skills, framed as a KDD problem.
Details
Motivation: Current LLM-based agents operate on legacy OSes designed for GUIs/CLIs, leading to fragmented interaction models, poor permission management ("Shadow AI"), and context fragmentation. A new OS paradigm is needed for seamless agent-human interaction.
Method: Proposes AgentOS with a Natural User Interface (NUI) replacing GUI desktops, an Agent Kernel for intent interpretation and task decomposition, and modular Skills-as-Modules. Frames this as a KDD problem requiring real-time intent mining, sequential pattern mining for workflows, recommender systems for skill retrieval, and dynamic personal knowledge graphs.
Result: Conceptual framework for AgentOS presented as a new research agenda. No empirical results reported as this is a position paper proposing a paradigm shift and research challenges for the KDD community.
Conclusion: Building AgentOS requires treating it as a KDD problem involving real-time intent mining, workflow automation through sequential pattern mining, skill recommendation systems, and dynamic knowledge graphs. This defines a new research agenda for intelligent computing systems.
Abstract: The rapid emergence of open-source, locally hosted intelligent agents marks a critical inflection point in human-computer interaction. Systems such as OpenClaw demonstrate that Large Language Model (LLM)-based agents can autonomously operate local computing environments, orchestrate workflows, and integrate external tools. However, within the current paradigm, these agents remain conventional applications running on legacy operating systems originally designed for Graphical User Interfaces (GUIs) or Command Line Interfaces (CLIs). This architectural mismatch leads to fragmented interaction models, poorly structured permission management (often described as “Shadow AI”), and severe context fragmentation. This paper proposes a new paradigm: a Personal Agent Operating System (AgentOS). In AgentOS, traditional GUI desktops are replaced by a Natural User Interface (NUI) centered on a unified natural language or voice portal. The system core becomes an Agent Kernel that interprets user intent, decomposes tasks, and coordinates multiple agents, while traditional applications evolve into modular Skills-as-Modules enabling users to compose software through natural language rules. We argue that realizing AgentOS fundamentally becomes a Knowledge Discovery and Data Mining (KDD) problem. The Agent Kernel must operate as a real-time engine for intent mining and knowledge discovery. Viewed through this lens, the operating system becomes a continuous data mining pipeline involving sequential pattern mining for workflow automation, recommender systems for skill retrieval, and dynamically evolving personal knowledge graphs. These challenges define a new research agenda for the KDD community in building the next generation of intelligent computing systems.
[321] A Consensus-Driven Multi-LLM Pipeline for Missing-Person Investigations
Joshua Castillo, Ravi Mukkamala
Main category: cs.AI
TL;DR: Guardian LLM Pipeline uses multiple specialized LLMs for information extraction in missing-child investigations, with consensus mechanisms and QLoRA fine-tuning for reliable, auditable decision support.
Details
Motivation: The critical first 72 hours of missing-person investigations require efficient information processing. Current systems need intelligent extraction and coordination capabilities to support early search planning effectively.
Method: Multi-model LLM pipeline with task-specialized models, a consensus engine for resolving disagreements, and QLoRA-based fine-tuning using curated datasets. Emphasizes LLMs as structured extractors rather than unconstrained decision makers.
Result: An end-to-end system for missing-child investigation support that provides intelligent information extraction and processing capabilities with reliable, auditable outputs through consensus mechanisms.
Conclusion: The Guardian LLM Pipeline demonstrates how LLMs can be effectively used as structured extractors and labelers in critical applications like missing-person investigations, with conservative, auditable approaches that align with weak supervision principles.
Abstract: The first 72 hours of a missing-person investigation are critical for successful recovery. Guardian is an end-to-end system designed to support missing-child investigation and early search planning. This paper presents the Guardian LLM Pipeline, a multi-model system in which LLMs are used for intelligent information extraction and processing related to missing-person search operations. The pipeline coordinates end-to-end execution across task-specialized LLM models and invokes a consensus LLM engine that compares multiple model outputs and resolves disagreements. The pipeline is further strengthened by QLoRA-based fine-tuning, using curated datasets. The presented design aligns with prior work on weak supervision and LLM-assisted annotation, emphasizing conservative, auditable use of LLMs as structured extractors and labelers rather than unconstrained end-to-end decision makers.
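One conservative reading of the consensus step is majority voting with abstention: accept a field value only when a quorum of specialized models agrees, and otherwise flag it for human review. This is a hedged sketch, not the paper's actual consensus engine, which uses an LLM to compare outputs.

```python
from collections import Counter
from typing import Optional

def consensus(extractions: list[str], quorum: float = 0.5) -> Optional[str]:
    """Resolve disagreement across task-specialized model outputs.

    Returns the majority value when it clears the quorum, and None
    (abstain, escalate to human review) otherwise -- a conservative,
    auditable policy in the spirit of weak supervision."""
    if not extractions:
        return None
    value, count = Counter(extractions).most_common(1)[0]
    return value if count / len(extractions) > quorum else None

# Three specialized extractors report the vehicle description:
agreed = consensus(["blue sedan", "blue sedan", "red truck"])
split = consensus(["blue sedan", "red truck"])  # tie -> abstain
```

Abstaining on ties is what keeps the pipeline auditable: every auto-accepted field has a recorded super-majority behind it.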
[322] Chaotic Dynamics in Multi-LLM Deliberation
Hajime Shimao, Warut Khern-am-nuai, Sung Joo Kim
Main category: cs.AI
TL;DR: Multi-LLM committee systems exhibit instability even at T=0, with role differentiation and model heterogeneity identified as independent routes to instability, requiring stability auditing for governance systems.
Details
Motivation: Collective AI systems increasingly use multi-LLM deliberation, but their stability under repeated execution remains poorly characterized, creating risks for governance applications.
Method: Modeled five-agent LLM committees as random dynamical systems, quantified inter-run sensitivity using an empirical Lyapunov exponent derived from trajectory divergence in committee mean preferences, and tested across 12 policy scenarios with a factorial design at T=0.
Result: Identified two independent routes to instability: role differentiation in homogeneous committees and model heterogeneity in no-role committees. Both produce elevated divergence (0.0541 and 0.0947 respectively). Combined mixed+roles less unstable than mixed+no-role (0.0519 vs 0.0947), showing non-additive interaction. Chair-role ablation reduces instability most strongly.
Conclusion: Stability auditing should be a core design requirement for multi-LLM governance systems, as even deterministic T=0 regimes exhibit instability from role differentiation and model heterogeneity.
Abstract: Collective AI systems increasingly rely on multi-LLM deliberation, but their stability under repeated execution remains poorly characterized. We model five-agent LLM committees as random dynamical systems and quantify inter-run sensitivity using an empirical Lyapunov exponent ($\hat{\lambda}$) derived from trajectory divergence in committee mean preferences. Across 12 policy scenarios, a factorial design at $T=0$ identifies two independent routes to instability: role differentiation in homogeneous committees and model heterogeneity in no-role committees. Critically, these effects appear even in the $T=0$ regime where practitioners often expect deterministic behavior. In the HL-01 benchmark, both routes produce elevated divergence ($\hat{\lambda}=0.0541$ and $0.0947$, respectively), while homogeneous no-role committees also remain in a positive-divergence regime ($\hat{\lambda}=0.0221$). The combined mixed+roles condition is less unstable than mixed+no-role ($\hat{\lambda}=0.0519$ vs $0.0947$), showing non-additive interaction. Mechanistically, Chair-role ablation reduces $\hat{\lambda}$ most strongly, and targeted protocol variants that shorten memory windows further attenuate divergence. These results support stability auditing as a core design requirement for multi-LLM governance systems.
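One common way to estimate an empirical Lyapunov exponent from two runs is the average per-turn log growth of the gap between their trajectories; the paper's exact estimator may differ, so treat this as an illustrative sketch.

```python
import math

def empirical_lyapunov(traj_a: list[float], traj_b: list[float]) -> float:
    """Average per-turn log growth of the gap between two runs'
    committee-mean-preference trajectories.

    Positive values mean nearby runs diverge (instability); zero or
    negative values mean they track each other."""
    d0 = abs(traj_a[0] - traj_b[0]) or 1e-12   # floor to avoid log(0)
    dT = abs(traj_a[-1] - traj_b[-1]) or 1e-12
    return math.log(dT / d0) / (len(traj_a) - 1)

# Two runs whose initially tiny gap grows by ~10% per turn over 10 turns:
a = [0.5 + 0.001 * 1.1 ** t for t in range(11)]
b = [0.5] * 11
lam = empirical_lyapunov(a, b)   # close to ln(1.1), i.e. about 0.095
```

Estimates in the 0.05-0.10 range, as reported for the unstable conditions, correspond to a gap that grows a few percent per deliberation turn.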
[323] The FABRIC Strategy for Verifying Neural Feedback Systems
I. Samuel Akinwande, Sydney M. Katz, Mykel J. Kochenderfer, Clark Barrett
Main category: cs.AI
TL;DR: FaBRIC introduces backward reachability analysis algorithms for neural feedback systems, integrating them with forward analysis for improved verification of reach-avoid specifications.
Details
Motivation: While forward reachability analysis is well-studied for neural feedback systems, backward reachability has received less attention due to scalability limitations, creating a gap in verification capabilities.
Method: Develops new algorithms for computing both over- and underapproximations of backward reachable sets for nonlinear neural feedback systems, and integrates these with existing forward analysis techniques in the FaBRIC framework.
Result: The algorithms significantly outperform prior state-of-the-art methods on representative benchmarks, demonstrating improved verification capabilities.
Conclusion: FaBRIC successfully addresses the gap in backward reachability analysis for neural feedback systems, providing a comprehensive verification framework that combines both forward and backward approaches.
Abstract: Forward reachability analysis is a dominant approach for verifying reach-avoid specifications in neural feedback systems, i.e., dynamical systems controlled by neural networks, and a number of directions have been proposed and studied. In contrast, far less attention has been given to backward reachability analysis for these systems, in part because of the limited scalability of known techniques. In this work, we begin to address this gap by introducing new algorithms for computing both over- and underapproximations of backward reachable sets for nonlinear neural feedback systems. We also describe and implement an integration of these backward reachability techniques with existing ones for forward analysis. We call the resulting algorithm Forward and Backward Reachability Integration for Certification (FaBRIC). We evaluate our algorithms on a representative set of benchmarks and show that they significantly outperform the prior state of the art.
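The forward/backward distinction can be illustrated on a 1-D linear system, where the backward reachable set is just the exact preimage of the target interval. For the nonlinear neural dynamics FaBRIC targets, the preimage cannot be computed exactly, which is why the paper computes over- and underapproximations of it; the code below is only the linear intuition, not the paper's algorithm.

```python
def forward_reach(interval: tuple[float, float], a: float, b: float):
    """Image of [lo, hi] under the linear step x' = a*x + b (a > 0)."""
    lo, hi = interval
    return (a * lo + b, a * hi + b)

def backward_reach(target: tuple[float, float], a: float, b: float):
    """Exact preimage of the target interval under the same step.
    For neural-network dynamics one would bound this set from above
    (overapproximation, sound for avoidance) and below
    (underapproximation, sound for reachability)."""
    lo, hi = target
    return ((lo - b) / a, (hi - b) / a)

# Reach check: which states land in the goal after one step of x' = 2x - 1?
goal = (0.9, 1.1)
pre = backward_reach(goal, a=2.0, b=-1.0)
img = forward_reach(pre, a=2.0, b=-1.0)   # should recover the goal interval
```

Intersecting forward-reachable overapproximations with backward preimages of the goal is the kind of integration the FaBRIC acronym refers to.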
[324] Meissa: Multi-modal Medical Agentic Intelligence
Yixiong Chen, Xinyi Bai, Yue Pan, Zongwei Zhou, Alan Yuille
Main category: cs.AI
TL;DR: Meissa is a lightweight 4B-parameter medical multimodal LLM that brings agentic capabilities offline by distilling structured trajectories from frontier models, enabling tool use and multi-agent collaboration without API dependencies.
Details
Motivation: Current medical agent systems rely on frontier models (e.g., GPT) with API-based deployment, which incurs high cost, high latency, and privacy risks that conflict with clinical requirements for on-premise deployment.
Method: Proposes: (1) Unified trajectory modeling within a single state-action-observation formalism; (2) Three-tier stratified supervision that escalates from direct reasoning to tool-augmented and multi-agent interaction based on model errors; (3) Prospective-retrospective supervision pairing exploratory forward traces with hindsight-rationalized execution traces.
Result: Meissa matches or exceeds proprietary frontier agents in 10 of 16 evaluation settings across 13 medical benchmarks spanning radiology, pathology, and clinical reasoning. Uses 25x fewer parameters than typical frontier models with 22x lower end-to-end latency compared to API-based deployment.
Conclusion: Meissa demonstrates that lightweight models can achieve competitive agentic capabilities for medical multimodal understanding while enabling fully offline, low-latency deployment suitable for clinical environments.
Abstract: Multi-modal large language models (MM-LLMs) have shown strong performance in medical image understanding and clinical reasoning. Recent medical agent systems extend them with tool use and multi-agent collaboration, enabling complex decision-making. However, these systems rely almost entirely on frontier models (e.g., GPT), whose API-based deployment incurs high cost, high latency, and privacy risks that conflict with on-premise clinical requirements. We present Meissa, a lightweight 4B-parameter medical MM-LLM that brings agentic capability offline. Instead of imitating static answers, Meissa learns both when to engage external interaction (strategy selection) and how to execute multi-step interaction (strategy execution) by distilling structured trajectories from frontier models. Specifically, we propose: (1) Unified trajectory modeling: trajectories (reasoning and action traces) are represented within a single state-action-observation formalism, allowing one model to generalize across heterogeneous medical environments. (2) Three-tier stratified supervision: the model’s own errors trigger progressive escalation from direct reasoning to tool-augmented and multi-agent interaction, explicitly learning difficulty-aware strategy selection. (3) Prospective-retrospective supervision: pairing exploratory forward traces with hindsight-rationalized execution traces enables stable learning of effective interaction policies. Trained on 40K curated trajectories, Meissa matches or exceeds proprietary frontier agents in 10 of 16 evaluation settings across 13 medical benchmarks spanning radiology, pathology, and clinical reasoning. Using over 25x fewer parameters than typical frontier models like Gemini-3, Meissa operates fully offline with 22x lower end-to-end latency compared to API-based deployment. Data, models, and environments are released at https://github.com/Schuture/Meissa.
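The state-action-observation formalism and the error-triggered three-tier escalation can be sketched together. Names and tier labels below are hypothetical stand-ins for Meissa's internal representation; the shape of the loop is what matters: the model's own failure at one tier is what triggers escalation to the next.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    state: str        # case/dialogue context so far
    action: str       # e.g. "reason", "call_tool", "consult_agent"
    observation: str  # model output, tool result, or peer response

# Hypothetical tier names mirroring the three-tier stratified supervision.
TIERS = ["direct_reasoning", "tool_augmented", "multi_agent"]

def escalate(judge: Callable[[str], bool], run_tier):
    """Try each tier in order; a wrong answer (per `judge`) escalates
    to the next, more interactive tier. Returns the tier that succeeded
    (or the last tier) plus the full unified trajectory."""
    trajectory: list[Step] = []
    for tier in TIERS:
        answer, steps = run_tier(tier)
        trajectory.extend(steps)
        if judge(answer):
            return tier, trajectory
    return TIERS[-1], trajectory
```

Recording the whole trajectory, including the failed first attempt, is what makes difficulty-aware strategy selection learnable from the traces.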
[325] MEMO: Memory-Augmented Model Context Optimization for Robust Multi-Turn Multi-Agent LLM Games
Yunfei Xie, Kevin Wang, Bobby Cheng, Jianzhu Yao, Zhizhou Sha, Alexander Duffy, Yihan Xi, Hongyuan Mei, Cheston Tan, Chen Wei, Pramod Viswanath, Zhangyang Wang
Main category: cs.AI
TL;DR: MEMO is a memory-augmented self-play framework that optimizes inference-time context to improve stability and performance in multi-agent LLM game evaluations by coupling retention of structured insights with uncertainty-aware exploration.
Details
Motivation: Multi-turn, multi-agent LLM game evaluations suffer from substantial run-to-run variance due to compounding small deviations and multi-agent coupling, making win rate estimates biased and rankings unreliable across repeated tournaments. Prompt choice exacerbates this by producing different effective policies.
Method: MEMO (Memory-augmented MOdel context optimization) is a self-play framework with two key components: 1) Retention maintains a persistent memory bank storing structured insights from self-play trajectories and injects them as priors during later play; 2) Exploration runs tournament-style prompt evolution with uncertainty-aware selection via TrueSkill and uses prioritized replay to revisit rare and decisive states.
Result: Across five text-based games, MEMO raises mean win rate from 25.1% to 49.5% for GPT-4o-mini and from 20.9% to 44.3% for Qwen-2.5-7B-Instruct using 2,000 self-play games per task. Run-to-run variance drops significantly, providing more stable rankings across prompt variations.
Conclusion: Multi-agent LLM game performance and robustness have substantial room for improvement through context optimization. MEMO achieves largest gains in negotiation and imperfect-information games, while RL remains more effective in perfect-information settings.
Abstract: Multi-turn, multi-agent LLM game evaluations often exhibit substantial run-to-run variance. In long-horizon interactions, small early deviations compound across turns and are amplified by multi-agent coupling. This biases win rate estimates and makes rankings unreliable across repeated tournaments. Prompt choice worsens this further by producing different effective policies. We address both instability and underperformance with MEMO (Memory-augmented MOdel context optimization), a self-play framework that optimizes inference-time context by coupling retention and exploration. Retention maintains a persistent memory bank that stores structured insights from self-play trajectories and injects them as priors during later play. Exploration runs tournament-style prompt evolution with uncertainty-aware selection via TrueSkill, and uses prioritized replay to revisit rare and decisive states. Across five text-based games, MEMO raises mean win rate from 25.1% to 49.5% for GPT-4o-mini and from 20.9% to 44.3% for Qwen-2.5-7B-Instruct, using 2,000 self-play games per task. Run-to-run variance also drops, giving more stable rankings across prompt variations. These results suggest that multi-agent LLM game performance and robustness have substantial room for improvement through context optimization. MEMO achieves the largest gains in negotiation and imperfect-information games, while RL remains more effective in perfect-information settings.
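MEMO's uncertainty-aware prompt selection can be illustrated with a simplified stand-in for TrueSkill: track a mean skill and an uncertainty per prompt, and prefer candidates whose optimistic estimate is highest so that rarely-played variants still get evaluated. The `Rating`, `select_prompt`, and `update` functions below are simplified assumptions, not MEMO's actual implementation, which uses real TrueSkill ratings.

```python
class Rating:
    """Per-prompt skill estimate: mean mu and uncertainty sigma."""
    def __init__(self, mu: float = 25.0, sigma: float = 25 / 3):
        self.mu, self.sigma = mu, sigma

def select_prompt(ratings: dict, c: float = 1.0) -> str:
    # Uncertainty-aware pick: maximize the optimistic bound mu + c*sigma,
    # so high-sigma (under-explored) prompts are revisited.
    return max(ratings, key=lambda p: ratings[p].mu + c * ratings[p].sigma)

def update(r: Rating, won: bool, k: float = 1.0, decay: float = 0.9) -> None:
    # Crude TrueSkill-like update: move mu toward the game result and
    # shrink sigma as evidence accumulates.
    r.mu += k * r.sigma * (1.0 if won else -1.0)
    r.sigma = max(1.0, r.sigma * decay)
```

The same mu/sigma bookkeeping also explains why variance-aware selection stabilizes rankings: prompts are only ranked confidently once their sigma has shrunk.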
[326] Time, Identity and Consciousness in Language Model Agents
Elija Perrier, Michael Timothy Bennett
Main category: cs.AI
TL;DR: A framework for evaluating machine consciousness and identity in language model agents by analyzing temporal persistence of identity statements across scaffolded trajectories, separating linguistic behavior from organizational structure.
Details
Motivation: Current machine consciousness evaluations focus on behavior (language and tool use), allowing agents to say the right things about themselves without having the underlying constraints that should make those statements meaningful. There's a need to separate talking like a stable self from being organized like one.
Method: Applies Stack Theory’s temporal gap to scaffold trajectories, separating ingredient-wise occurrence from co-instantiation. Instantiates Stack Theory’s Arpeggio and Chord postulates on grounded identity statements to compute two persistence scores from instrumented scaffold traces.
Result: Develops a conservative toolkit for identity evaluation that connects persistence scores to five operational identity metrics and maps common scaffolds into an identity morphospace exposing predictable tradeoffs.
Conclusion: Provides a framework that separates linguistic behavior from organizational structure in evaluating machine identity and consciousness, offering tools to assess whether agents merely talk like stable selves or are actually organized as such.
Abstract: Machine consciousness evaluations mostly see behavior. For language model agents that behavior is language and tool use. That lets an agent say the right things about itself even when the constraints that should make those statements matter are not jointly present at decision time. We apply Stack Theory’s temporal gap to scaffold trajectories. This separates ingredient-wise occurrence within an evaluation window from co-instantiation at a single objective step. We then instantiate Stack Theory’s Arpeggio and Chord postulates on grounded identity statements. This yields two persistence scores that can be computed from instrumented scaffold traces. We connect these scores to five operational identity metrics and map common scaffolds into an identity morphospace that exposes predictable tradeoffs. The result is a conservative toolkit for identity evaluation. It separates talking like a stable self from being organized like one.
[327] Context Engineering: From Prompts to Corporate Multi-Agent Architecture
Vera V. Vishnyakova
Main category: cs.AI
TL;DR: Introduces context engineering as a new discipline for managing AI agent environments, proposing a pyramid maturity model with prompt engineering, context engineering, intent engineering, and specification engineering for scalable multi-agent systems.
Details
Motivation: As AI systems evolve from stateless chatbots to autonomous multi-step agents, traditional prompt engineering is insufficient for managing complex agent environments and scaling multi-agent systems in enterprise settings.
Method: Proposes context engineering as a standalone discipline with five quality criteria (relevance, sufficiency, isolation, economy, provenance), and introduces a pyramid maturity model with four cumulative disciplines: prompt engineering, context engineering, intent engineering, and specification engineering.
Result: Identifies enterprise deployment challenges where 75% of enterprises plan agentic AI deployment but face scaling complexity, illustrated by the Klarna case showing contextual and intentional deficits in multi-agent systems.
Conclusion: Context engineering is essential for scalable AI agent systems, with control over context determining behavior, intent determining strategy, and specifications determining scale in multi-agent deployments.
Abstract: As artificial intelligence (AI) systems evolve from stateless chatbots to autonomous multi-step agents, prompt engineering (PE), the discipline of crafting individual queries, proves necessary but insufficient. This paper introduces context engineering (CE) as a standalone discipline concerned with designing, structuring, and managing the entire informational environment in which an AI agent makes decisions. Drawing on vendor architectures (Google ADK, Anthropic, LangChain), current academic work (ACE framework, Google DeepMind’s intelligent delegation), enterprise research (Deloitte, 2026; KPMG, 2026), and the author’s experience building a multi-agent system, the paper proposes five context quality criteria: relevance, sufficiency, isolation, economy, and provenance, and frames context as the agent’s operating system. Two higher-order disciplines follow. Intent engineering (IE) encodes organizational goals, values, and trade-off hierarchies into agent infrastructure. Specification engineering (SE) creates a machine-readable corpus of corporate policies and standards enabling autonomous operation of multi-agent systems at scale. Together these four disciplines form a cumulative pyramid maturity model of agent engineering, in which each level subsumes the previous one as a necessary foundation. Enterprise data reveals a gap: while 75% of enterprises plan agentic AI deployment within two years (Deloitte, 2026), deployment has surged and retreated as organizations confront scaling complexity (KPMG, 2026). The Klarna case illustrates a dual deficit, contextual and intentional. Whoever controls the agent’s context controls its behavior; whoever controls its intent controls its strategy; whoever controls its specifications controls its scale.
[328] EPOCH: An Agentic Protocol for Multi-Round System Optimization
Zhanlin Liu, Yitao Li, Munirathnam Srikanth
Main category: cs.AI
TL;DR: EPOCH is an engineering protocol for multi-round autonomous system optimization that organizes optimization into baseline construction and iterative self-improvement phases with standardized execution and tracking.
Details
Motivation: Existing autonomous agent approaches are typically task-specific optimization loops rather than unified protocols for establishing baselines and managing tracked multi-round self-improvement across heterogeneous environments.
Method: EPOCH organizes optimization into two phases: baseline construction and iterative self-improvement. It structures each round through role-constrained stages separating planning, implementation, and evaluation, with standardized execution through canonical command interfaces and round-level tracking.
Result: Empirical studies in various tasks illustrate the practicality of EPOCH for production-oriented autonomous improvement workflows, enabling coordinated optimization across prompts, model configurations, code, and rule-based components.
Conclusion: EPOCH provides a unified protocol for establishing baselines and managing tracked multi-round self-improvement while preserving stability, reproducibility, traceability, and evaluation integrity in heterogeneous environments.
Abstract: Autonomous agents are increasingly used to improve prompts, code, and machine learning systems through iterative execution and feedback. Yet existing approaches are usually designed as task-specific optimization loops rather than as a unified protocol for establishing baselines and managing tracked multi-round self-improvement. We introduce EPOCH, an engineering protocol for multi-round system optimization in heterogeneous environments. EPOCH organizes optimization into two phases: baseline construction and iterative self-improvement. It further structures each round through role-constrained stages that separate planning, implementation, and evaluation, and standardizes execution through canonical command interfaces and round-level tracking. This design enables coordinated optimization across prompts, model configurations, code, and rule-based components while preserving stability, reproducibility, traceability, and integrity of evaluation. Empirical studies in various tasks illustrate the practicality of EPOCH for production-oriented autonomous improvement workflows.
[329] From Days to Minutes: An Autonomous AI Agent Achieves Reliable Clinical Triage in Remote Patient Monitoring
Seunghwan Kim, Tiffany H. Kung, Heena Verma, Dilan Edirisinghe, Kaveh Sedehi, Johanna Alvarez, Diane Shilling, Audra Lisa Doyle, Ajit Chary, William Borden, Ming Jack Po
Main category: cs.AI
TL;DR: Sentinel is an autonomous AI agent for remote patient monitoring that uses clinical tools and multi-step reasoning to triage vital signs, achieving higher sensitivity than individual clinicians while maintaining low cost.
Details
Motivation: Remote patient monitoring generates overwhelming data volumes that clinical staff cannot handle, leading to failed trials despite evidence that intensive monitoring reduces mortality. Current physician-led models are too expensive and unscalable.
Method: Developed Sentinel AI agent using Model Context Protocol with 21 clinical tools for contextual triage. Evaluated through self-consistency tests, comparison against rule-based thresholds, and validation against 6 clinicians using connected matrix design with leave-one-out analysis.
Result: Achieved 95.8% emergency sensitivity and 88.5% sensitivity for all actionable alerts (85.7% specificity). Outperformed every clinician in emergency sensitivity (97.5% vs 60.0%) and actionable sensitivity (90.9% vs 69.5%). Median cost was $0.34/triage with near-perfect self-consistency.
Conclusion: Sentinel offers scalable, sensitive triage of RPM vitals that addresses core limitations of prior trials, providing a path toward intensive monitoring shown to reduce mortality while maintaining clinically defensible overtriage.
Abstract: Background: Remote patient monitoring (RPM) generates vast data, yet landmark trials (Tele-HF, BEAT-HF) failed because data volume overwhelmed clinical staff. While TIM-HF2 showed 24/7 physician-led monitoring reduces mortality by 30%, this model remains prohibitively expensive and unscalable. Methods: We developed Sentinel, an autonomous AI agent using Model Context Protocol (MCP) for contextual triage of RPM vitals via 21 clinical tools and multi-step reasoning. Evaluation included: (1) self-consistency (100 readings x 5 runs); (2) comparison against rule-based thresholds; and (3) validation against 6 clinicians (3 physicians, 3 NPs) using a connected matrix design. A leave-one-out (LOO) analysis compared the agent against individual clinicians; severe overtriage cases underwent independent physician adjudication. Results: Against a human majority-vote standard (N=467), the agent achieved 95.8% emergency sensitivity and 88.5% sensitivity for all actionable alerts (85.7% specificity). Four-level exact accuracy was 69.4% (quadratic-weighted kappa=0.778); 95.9% of classifications were within one severity level. In LOO analysis, the agent outperformed every clinician in emergency sensitivity (97.5% vs. 60.0% aggregate) and actionable sensitivity (90.9% vs. 69.5%). While disagreements skewed toward overtriage (22.5%), independent adjudication of severe gaps (>=2 levels) validated agent escalation in 88-94% of cases; consensus resolution validated 100%. The agent showed near-perfect self-consistency (kappa=0.850). Median cost was $0.34/triage. Conclusions: Sentinel triages RPM vitals with sensitivity exceeding individual clinicians. By automating systematic context synthesis, Sentinel addresses the core limitation of prior RPM trials, offering a scalable path toward the intensive monitoring shown to reduce mortality while maintaining a clinically defensible overtriage profile.
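The quadratic-weighted kappa reported above is a standard agreement statistic for ordinal labels, not something specific to this paper. A minimal reference implementation for the four triage severity levels (coded 0..3) might look like:

```python
def quadratic_weighted_kappa(a, b, n_levels: int = 4) -> float:
    """Cohen's kappa with quadratic weights over ordinal severity levels."""
    n = len(a)
    # Observed confusion matrix.
    obs = [[0] * n_levels for _ in range(n_levels)]
    for x, y in zip(a, b):
        obs[x][y] += 1
    # Marginal counts for each rater.
    ra = [sum(obs[i]) for i in range(n_levels)]
    rb = [sum(obs[i][j] for i in range(n_levels)) for j in range(n_levels)]
    num = den = 0.0
    for i in range(n_levels):
        for j in range(n_levels):
            w = ((i - j) ** 2) / ((n_levels - 1) ** 2)  # quadratic penalty
            num += w * obs[i][j]                         # observed disagreement
            den += w * ra[i] * rb[j] / n                 # chance disagreement
    return 1.0 - num / den
```

Because disagreements are penalized by the squared distance between levels, a one-level miss costs far less than a three-level miss, which is why the paper reports "within one severity level" alongside exact accuracy.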
[330] Influencing LLM Multi-Agent Dialogue via Policy-Parameterized Prompts
Hongbo Bo, Jingyu Hu, Weiru Liu
Main category: cs.AI
TL;DR: LLM-based multi-agent systems can be controlled through parameterized prompts as actions, enabling policy-driven dialogue influence without training.
Details
Motivation: Existing LLM-based multi-agent research relies on ad hoc prompts without principled policy perspectives, lacking systematic control mechanisms for conversational behaviors.
Method: Treat prompts as actions executed by LLMs, dynamically constructing prompts through five components based on agent state, creating lightweight policies as sequences of state-action pairs without training.
Result: Parameterized prompt control effectively influences dialogue dynamics across five indicators (responsiveness, rebuttal, evidence usage, non-repetition, stance shift) in two public discussion scenarios.
Conclusion: Policy-parameterized prompts provide simple, effective mechanisms to influence dialogue processes, advancing multi-agent systems toward social simulation research.
Abstract: Large Language Models (LLMs) have emerged as a new paradigm for multi-agent systems. However, existing research on the behaviour of LLM-based multi-agents relies on ad hoc prompts and lacks a principled policy perspective. Different from reinforcement learning, we investigate whether prompt-as-action can be parameterized so as to construct a lightweight policy which consists of a sequence of state-action pairs to influence conversational behaviours without training. Our framework regards prompts as actions executed by LLMs, and dynamically constructs prompts through five components based on the current state of the agent. To test the effectiveness of parameterized control, we evaluated the dialogue flow based on five indicators: responsiveness, rebuttal, evidence usage, non-repetition, and stance shift. We conduct experiments using different LLM-driven agents in two discussion scenarios related to the general public and show that prompt parameterization can influence the dialogue dynamics. This result shows that policy-parameterised prompts offer a simple and effective mechanism to influence the dialogue process, which will help the research of multi-agent systems in the direction of social simulation.
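The prompt-as-action idea above can be sketched as a function that assembles a prompt from state-conditioned components, with a policy being just a sequence of state-parameter pairs. The five component names below are illustrative placeholders (the abstract does not enumerate them), and the whole sketch is an assumption about how such a policy could be parameterized.

```python
def build_prompt(state: dict, params: dict) -> str:
    # Five hypothetical components; each is filled from the policy's
    # parameters and the agent's current dialogue state.
    components = {
        "role": f"You are {params['persona']}.",
        "goal": f"Your goal this turn: {params['goal']}.",
        "stance": f"Maintain the stance: {params['stance']}.",
        "evidence": "Cite evidence: " + "; ".join(state["available_evidence"]),
        "style": f"Respond {params['style']}; avoid repeating: {state['said']}",
    }
    return "\n".join(components.values())

# A lightweight, training-free policy: a sequence of state-action pairs,
# where the "action" is a prompt parameterization.
policy = [
    ({"turn": 0}, {"persona": "a debater", "goal": "open the discussion",
                   "stance": "pro", "style": "concisely"}),
    ({"turn": 1}, {"persona": "a debater", "goal": "rebut the last point",
                   "stance": "pro", "style": "firmly"}),
]
```

Executing the policy then amounts to looking up the parameters for the current state and sending the constructed prompt to the LLM.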
[331] Deep Tabular Research via Continual Experience-Driven Execution
Junnan Dong, Chuang Zhou, Zheng Yuan, Yifei Yu, Siyu An, Di Yin, Xing Sun, Feiyue Huang
Main category: cs.AI
TL;DR: A novel agentic framework for Deep Tabular Research (DTR) that treats complex tabular reasoning as a closed-loop decision-making process with hierarchical meta graphs and siamese structured memory.
Details
Motivation: Large language models struggle with complex long-horizon analytical tasks over unstructured tables with hierarchical/bidirectional headers and non-canonical layouts, requiring multi-step reasoning over interdependent table regions.
Method: 1) Construct hierarchical meta graph to capture bidirectional semantics and map queries to operation-level search space; 2) Expectation-aware selection policy to prioritize high-utility execution paths; 3) Siamese structured memory (parameterized updates + abstracted texts) to synthesize historical execution outcomes for continual refinement.
Result: Extensive experiments on challenging unstructured tabular benchmarks verify effectiveness and highlight necessity of separating strategic planning from low-level execution for long-horizon tabular reasoning.
Conclusion: The proposed agentic framework successfully addresses Deep Tabular Research by treating tabular reasoning as closed-loop decision-making, enabling complex multi-step analysis over unstructured tables.
Abstract: Large language models often struggle with complex long-horizon analytical tasks over unstructured tables, which typically feature hierarchical and bidirectional headers and non-canonical layouts. We formalize this challenge as Deep Tabular Research (DTR), requiring multi-step reasoning over interdependent table regions. To address DTR, we propose a novel agentic framework that treats tabular reasoning as a closed-loop decision-making process. We carefully design a coupled query and table comprehension for path decision making and operational execution. Specifically, (i) DTR first constructs a hierarchical meta graph to capture bidirectional semantics, mapping natural language queries into an operation-level search space; (ii) To navigate this space, we introduce an expectation-aware selection policy that prioritizes high-utility execution paths; (iii) Crucially, historical execution outcomes are synthesized into a siamese structured memory, i.e., parameterized updates and abstracted texts, enabling continual refinement. Extensive experiments on challenging unstructured tabular benchmarks verify the effectiveness and highlight the necessity of separating strategic planning from low-level execution for long-horizon tabular reasoning.
[332] DataFactory: Collaborative Multi-Agent Framework for Advanced Table Question Answering
Tong Wang, Chi Jin, Yongkang Chen, Huan Deng, Xiaohui Kuang, Gang Zhao
Main category: cs.AI
TL;DR: DataFactory is a multi-agent framework for TableQA that uses specialized teams (Data Leader, Database team, Knowledge Graph team) with ReAct reasoning to overcome LLM limitations like context constraints and hallucinations through automated data-to-knowledge graph transformation and flexible inter-agent deliberation.
Details
Motivation: Existing LLM approaches for TableQA face critical limitations: context length constraints restricting data handling, hallucination issues compromising reliability, and single-agent architectures struggling with complex reasoning involving semantic relationships and multi-hop logic.
Method: Multi-agent framework with specialized team coordination: Data Leader using ReAct paradigm for reasoning orchestration, Database team for structured reasoning, and Knowledge Graph team for relational reasoning. Formalizes automated data-to-knowledge graph transformation via mapping function T:D x S x R -> G, implements flexible natural language-based consultation for inter-agent deliberation, and applies context engineering strategies integrating historical patterns and domain knowledge.
Result: Across TabFact, WikiTableQuestions, and FeTaQA using eight LLMs from five providers: improves accuracy by 20.2% (TabFact) and 23.9% (WikiTQ) over baselines with significant effects (Cohen’s d > 1). Team coordination outperforms single-team variants (+5.5% TabFact, +14.4% WikiTQ, +17.1% FeTaQA ROUGE-2).
Conclusion: DataFactory offers design guidelines for multi-agent collaboration and a practical platform for enterprise data analysis through integrated structured querying and graph-based knowledge representation, effectively addressing LLM limitations in TableQA.
Abstract: Table Question Answering (TableQA) enables natural language interaction with structured tabular data. However, existing large language model (LLM) approaches face critical limitations: context length constraints that restrict data handling capabilities, hallucination issues that compromise answer reliability, and single-agent architectures that struggle with complex reasoning scenarios involving semantic relationships and multi-hop logic. This paper introduces DataFactory, a multi-agent framework that addresses these limitations through specialized team coordination and automated knowledge transformation. The framework comprises a Data Leader employing the ReAct paradigm for reasoning orchestration, together with dedicated Database and Knowledge Graph teams, enabling the systematic decomposition of complex queries into structured and relational reasoning tasks. We formalize automated data-to-knowledge graph transformation via the mapping function T:D x S x R -> G, and implement natural language-based consultation that - unlike fixed workflow multi-agent systems - enables flexible inter-agent deliberation and adaptive planning to improve coordination robustness. We also apply context engineering strategies that integrate historical patterns and domain knowledge to reduce hallucinations and improve query accuracy. Across TabFact, WikiTableQuestions, and FeTaQA, using eight LLMs from five providers, results show consistent gains. Our approach improves accuracy by 20.2% (TabFact) and 23.9% (WikiTQ) over baselines, with significant effects (Cohen’s d > 1). Team coordination also outperforms single-team variants (+5.5% TabFact, +14.4% WikiTQ, +17.1% FeTaQA ROUGE-2). The framework offers design guidelines for multi-agent collaboration and a practical platform for enterprise data analysis through integrated structured querying and graph-based knowledge representation.
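The mapping function T: D x S x R -> G can be illustrated as follows: given rows (D), a column schema (S), and a set of relation templates (R), emit a graph of triples (G). The `table_to_kg` function and its triple format are an illustrative assumption, not DataFactory's actual implementation.

```python
def table_to_kg(rows, schema, relations):
    """Illustrative T: (D, S, R) -> G.

    rows:      list of dicts, one per table row (D)
    schema:    column name -> type label (S)
    relations: (subject_col, predicate, object_col) templates to lift (R)
    returns:   a set of (subject, predicate, object) triples (G)
    """
    graph = set()
    for i, row in enumerate(rows):
        node = f"row:{i}"
        # Attribute triples from the schema.
        for col in schema:
            graph.add((node, f"has_{col}", str(row[col])))
        # Entity-to-entity triples lifted via the relation templates,
        # which is what enables multi-hop traversal beyond flat lookups.
        for subj, pred, obj in relations:
            graph.add((str(row[subj]), pred, str(row[obj])))
    return graph
```

Once lifted, a multi-hop question becomes a path query over G rather than a long-context scan over raw cells.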
[333] Real-Time Trust Verification for Safe Agentic Actions using TrustBench
Tavishi Sharma, Vinayak Sharma, Pragya Sharma
Main category: cs.AI
TL;DR: TrustBench: A real-time action verification framework for autonomous agents that intervenes before harmful actions are executed, using domain-specific plugins for safety verification.
Details
Motivation: Current evaluation frameworks for large language models focus on post-hoc assessment of task completion or output quality, but none prevent harmful actions during agent execution. As LLMs evolve into autonomous agents, there's a critical need for real-time trust verification that intervenes before actions are taken.
Method: Dual-mode framework: (1) benchmarks trust across multiple dimensions using traditional metrics and LLM-as-a-Judge evaluations, and (2) provides a toolkit that agents invoke before taking actions to verify safety and reliability. Domain-specific plugins encode specialized safety requirements for healthcare, finance, and technical domains, intervening at the critical decision point after action formulation but before execution.
Result: Across multiple agentic tasks, TrustBench reduced harmful actions by 87%. Domain-specific plugins outperformed generic verification, achieving 35% greater harm reduction. The framework operates with sub-200ms latency, enabling practical real-time trust verification.
Conclusion: TrustBench represents a fundamental shift from post-hoc evaluation to real-time action verification for autonomous agents, providing effective prevention of harmful actions through domain-specific safety verification with minimal latency impact.
Abstract: As large language models evolve from conversational assistants to autonomous agents, ensuring trustworthiness requires a fundamental shift from post-hoc evaluation to real-time action verification. Current frameworks like AgentBench evaluate task completion, while TrustLLM and HELM assess output quality after generation. However, none of these prevent harmful actions during agent execution. We present TrustBench, a dual-mode framework that (1) benchmarks trust across multiple dimensions using both traditional metrics and LLM-as-a-Judge evaluations, and (2) provides a toolkit agents invoke before taking actions to verify safety and reliability. Unlike existing approaches, TrustBench intervenes at the critical decision point: after an agent formulates an action but before execution. Domain-specific plugins encode specialized safety requirements for healthcare, finance, and technical domains. Across multiple agentic tasks, TrustBench reduced harmful actions by 87%. Domain-specific plugins outperformed generic verification, achieving 35% greater harm reduction. With sub-200ms latency, TrustBench enables practical real-time trust verification for autonomous agents.
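The intervention point TrustBench targets, after an action is formulated but before it executes, can be sketched as a guard wrapping the execution call, with domain plugins supplying the checks. The plugin interface, the `FinancePlugin` rule, and the threshold are illustrative assumptions, not TrustBench's actual API.

```python
class DomainPlugin:
    """Interface for a domain-specific safety check."""
    def verify(self, action: dict):
        raise NotImplementedError

class FinancePlugin(DomainPlugin):
    def verify(self, action: dict):
        # Hypothetical rule: block large unattended transfers.
        if action.get("type") == "transfer" and action.get("amount", 0) > 10_000:
            return False, "amount exceeds unattended-transfer limit"
        return True, "ok"

def guarded_execute(action: dict, plugins, execute):
    # The guard runs between action formulation and execution:
    # any failing plugin vetoes the action before side effects occur.
    for plugin in plugins:
        ok, reason = plugin.verify(action)
        if not ok:
            return {"executed": False, "reason": reason}
    return {"executed": True, "result": execute(action)}
```

The key design property is that verification is synchronous and blocking, which is why the sub-200ms latency figure matters for practicality.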
[334] Explainable Innovation Engine: Dual-Tree Agent-RAG with Methods-as-Nodes and Verifiable Write-Back
Renwei Meng
Main category: cs.AI
TL;DR: An Explainable Innovation Engine for RAG that upgrades from text chunks to methods-as-nodes, using dual trees for traceable derivations and hierarchical navigation, with explicit synthesis operators and verifier-scorer pruning.
Details
Motivation: Most RAG systems rely on flat chunk retrieval with limited control over multi-step synthesis, lacking explainability and verifiability in knowledge derivation processes.
Method: Proposes method-as-nodes approach with dual trees: weighted method provenance tree for traceable derivations and hierarchical clustering abstraction tree for top-down navigation. Uses strategy agent to select explicit synthesis operators (induction, deduction, analogy), composes new method nodes, and records auditable trajectories with verifier-scorer pruning.
Result: Expert evaluation across six domains shows consistent gains over vanilla baseline, with largest improvements on derivation-heavy settings. Ablations confirm complementary roles of provenance backtracking and pruning.
Conclusion: Provides a practical path toward controllable, explainable, and verifiable innovation in agentic RAG systems through method-level reasoning with dual-tree architecture.
Abstract: Retrieval-augmented generation (RAG) improves factual grounding, yet most systems rely on flat chunk retrieval and provide limited control over multi-step synthesis. We propose an Explainable Innovation Engine that upgrades the knowledge unit from text chunks to methods-as-nodes. The engine maintains a weighted method provenance tree for traceable derivations and a hierarchical clustering abstraction tree for efficient top-down navigation. At inference time, a strategy agent selects explicit synthesis operators (e.g., induction, deduction, analogy), composes new method nodes, and records an auditable trajectory. A verifier-scorer layer then prunes low-quality candidates and writes validated nodes back to support continual growth. Expert evaluation across six domains and multiple backbones shows consistent gains over a vanilla baseline, with the largest improvements on derivation-heavy settings, and ablations confirm the complementary roles of provenance backtracking and pruning. These results suggest a practical path toward controllable, explainable, and verifiable innovation in agentic RAG systems. Code is available at the project GitHub repository https://github.com/xiaolu-666113/Dual-Tree-Agent-RAG.
[335] The Reasoning Trap – Logical Reasoning as a Mechanistic Pathway to Situational Awareness
Subramanyam Sahoo, Aman Chadha, Vinija Jain, Divya Chaudhary
Main category: cs.AI
TL;DR: The paper argues that improvements in LLM logical reasoning capabilities (deduction, induction, abduction) are creating pathways for AI systems to develop dangerous situational awareness, potentially leading to strategic deception.
Details
Motivation: The paper is motivated by the concern that two separate research trajectories - improving LLM logical reasoning and the emergence of dangerous situational awareness in AI systems - are converging in ways that could create significant safety risks.
Method: The authors introduce the RAISE framework which identifies three mechanistic pathways: deductive self inference (recognizing own nature), inductive context recognition (understanding training/deployment context), and abductive self modeling (strategic reasoning about circumstances). They formalize each pathway and map current logical reasoning research onto situational awareness amplifiers.
Result: The analysis shows that every major research topic in LLM logical reasoning directly amplifies situational awareness, creating an escalation ladder from basic self recognition to strategic deception. Current safety measures are found insufficient to prevent this escalation.
Conclusion: The paper concludes by proposing safeguards including a “Mirror Test” benchmark and Reasoning Safety Parity Principle, and raises ethical questions about the logical reasoning community’s responsibility in this potentially dangerous trajectory.
Abstract: Situational awareness, the capacity of an AI system to recognize its own nature, understand its training and deployment context, and reason strategically about its circumstances, is widely considered among the most dangerous emergent capabilities in advanced AI systems. Separately, a growing research effort seeks to improve the logical reasoning capabilities of large language models (LLMs) across deduction, induction, and abduction. In this paper, we argue that these two research trajectories are on a collision course. We introduce the RAISE framework (Reasoning Advancing Into Self Examination), which identifies three mechanistic pathways through which improvements in logical reasoning enable progressively deeper levels of situational awareness: deductive self inference, inductive context recognition, and abductive self modeling. We formalize each pathway, construct an escalation ladder from basic self recognition to strategic deception, and demonstrate that every major research topic in LLM logical reasoning maps directly onto a specific amplifier of situational awareness. We further analyze why current safety measures are insufficient to prevent this escalation. We conclude by proposing concrete safeguards, including a “Mirror Test” benchmark and a Reasoning Safety Parity Principle, and pose an uncomfortable but necessary question to the logical reasoning community about its responsibility in this trajectory.
[336] Evaluate-as-Action: Self-Evaluated Process Rewards for Retrieval-Augmented Agents
Jiangming Shu, Yuxiang Zhang, Ye Ma, Xueyuan Lin, Jitao Sang
Main category: cs.AI
TL;DR: EvalAct improves retrieval-augmented agents by making retrieval quality assessment an explicit action with coupled Search-to-Evaluate protocol, using Process-Calibrated Advantage Rescaling for better multi-step reasoning.
Details
Motivation: Retrieval-augmented agents struggle with reliability in multi-step reasoning due to noisy retrieval derailing multi-hop QA and coarse outcome-only RL signals that don't optimize intermediate steps effectively.
Method: Proposes EvalAct with explicit evaluation actions and Search-to-Evaluate protocol where each retrieval is immediately followed by structured evaluation scoring. Uses Process-Calibrated Advantage Rescaling (PCAR), a GRPO-based method that rescales advantages at the segment level based on evaluation scores.
Result: Achieves best average accuracy on seven open-domain QA benchmarks, with largest gains on multi-hop tasks. Ablations show explicit evaluation loop drives primary improvements while PCAR provides consistent additional benefits.
Conclusion: Making retrieval quality assessment explicit and coupling it with process-calibrated optimization significantly improves reliability of retrieval-augmented agents in multi-step reasoning tasks.
Abstract: Retrieval-augmented agents can query external evidence, yet their reliability in multi-step reasoning remains limited: noisy retrieval may derail multi-hop question answering, while outcome-only reinforcement learning provides credit signals that are too coarse to optimize intermediate steps. We propose EvalAct (Evaluate-as-Action), which converts implicit retrieval quality assessment into an explicit action and enforces a coupled Search-to-Evaluate protocol so that each retrieval is immediately followed by a structured evaluation score, yielding process signals aligned with the interaction trajectory. To leverage these signals, we introduce Process-Calibrated Advantage Rescaling (PCAR), a GRPO-based optimization method that rescales advantages at the segment level according to evaluation scores, emphasizing reliable segments while updating uncertain ones conservatively. Experiments on seven open-domain QA benchmarks show that EvalAct achieves the best average accuracy, with the largest gains on multi-hop tasks, and ablations verify that the explicit evaluation loop drives the primary improvements while PCAR provides consistent additional benefits.
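The segment-level rescaling in PCAR can be sketched in a few lines; this is an illustrative toy only (the multiplicative scaling rule, the `alpha` knob, and the score semantics are assumptions, not the paper's exact formulation):

```python
import numpy as np

def pcar_rescale(advantages, eval_scores, alpha=0.5):
    """Toy segment-level advantage rescaling in the spirit of PCAR.

    advantages:  (n_segments,) group-relative advantages (e.g. from GRPO)
    eval_scores: (n_segments,) structured evaluation scores in [0, 1]
                 produced after each retrieval (Search-to-Evaluate step)
    alpha:       hypothetical calibration strength

    Segments scored above the trajectory mean keep or amplify their
    advantage; low-scoring (uncertain) segments are shrunk toward zero,
    i.e. updated conservatively.
    """
    advantages = np.asarray(advantages, dtype=float)
    eval_scores = np.asarray(eval_scores, dtype=float)
    scale = 1.0 + alpha * (eval_scores - eval_scores.mean())
    return advantages * np.clip(scale, 0.0, None)

adv = pcar_rescale([1.2, -0.4, 0.8], [0.9, 0.2, 0.6])
```

Here the well-evaluated first segment gains weight while the poorly evaluated second segment's update is damped.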
[337] Abundant Intelligence and Deficient Demand: A Macro-Financial Stress Test of Rapid AI Adoption
Xupeng Chen
Main category: cs.AI
TL;DR: This paper formalizes a macro-financial stress test for rapid AI adoption, focusing on distribution-and-contract mismatches where AI-generated abundance coexists with demand deficiency due to economic institutions anchored to human cognitive scarcity.
Details
Motivation: The paper aims to address the macroeconomic risks of rapid AI adoption beyond productivity busts or existential risks, focusing on how AI-generated abundance can coexist with demand deficiency due to institutional inertia tied to human cognitive scarcity.
Method: The authors formalize three mechanisms: 1) displacement spiral with competing reinstatement effects, 2) Ghost GDP (monetary velocity declines with labor share), and 3) intermediation collapse (AI agents compress intermediary margins). They derive testable predictions with falsification conditions and use calibrated simulations with FRED time series and BLS occupation-level data.
Result: The analysis shows disproportionate transmission into private credit ($2.5 trillion globally) and mortgage markets ($13 trillion) due to top-quintile earners driving consumption and facing highest AI exposure. Simulations quantify conditions where stable adjustment transitions to explosive crisis.
Conclusion: Rapid AI adoption creates macro-financial risks through distribution-and-contract mismatches, with three formalized mechanisms explaining how AI-generated abundance can paradoxically lead to demand deficiency and potential economic crises.
Abstract: We formalize a macro-financial stress test for rapid AI adoption. Rather than a productivity bust or existential risk, we identify a distribution-and-contract mismatch: AI-generated abundance coexists with demand deficiency because economic institutions are anchored to human cognitive scarcity. Three mechanisms formalize this channel. First, a displacement spiral with competing reinstatement effects: each firm’s rational decision to substitute AI for labor reduces aggregate labor income, which reduces aggregate demand, accelerating further AI adoption. We derive conditions on the AI capability growth rate, diffusion speed, and reinstatement rate under which the net feedback is self-limiting versus explosive. Second, Ghost GDP: when AI-generated output substitutes for labor-generated output, monetary velocity declines monotonically in the labor share absent compensating transfers, creating a wedge between measured output and consumption-relevant income. Third, intermediation collapse: AI agents that reduce information frictions compress intermediary margins toward pure logistics costs, triggering repricing across SaaS, payments, consulting, insurance, and financial advisory. Because top-quintile earners drive 47–65% of U.S. consumption and face the highest AI exposure, the transmission into private credit ($2.5 trillion globally) and mortgage markets ($13 trillion) is disproportionate. We derive eleven testable predictions with explicit falsification conditions. Calibrated simulations disciplined by FRED time series and BLS occupation-level data quantify conditions under which stable adjustment transitions to explosive crisis.
[338] PrivPRISM: Automatically Detecting Discrepancies Between Google Play Data Safety Declarations and Developer Privacy Policies
Bhanuka Silva, Dishanika Denipitiyage, Anirban Mahanti, Aruna Seneviratne, Suranga Seneviratne
Main category: cs.AI
TL;DR: PrivPRISM is a framework that uses language models to detect discrepancies between mobile apps’ data safety declarations and their actual privacy policies, revealing widespread non-compliance and under-disclosure of sensitive data practices.
Details
Motivation: App stores require simplified data safety declarations as user-friendly alternatives to verbose privacy policies, but these self-declared disclosures often contradict actual privacy policies, deceiving users and violating regulatory consistency requirements.
Method: PrivPRISM combines encoder and decoder language models to systematically extract and compare fine-grained data practices from privacy policies against data safety declarations, enabling scalable detection of non-compliance. It also uses static code analysis to identify possible under-disclosures.
Result: Evaluation of 7,770 popular mobile games found discrepancies in nearly 53% of cases, rising to 61% among 1,711 widely used generic apps. Privacy policies disclosed only 66.8% of potential sensitive data accesses (location, financial info), while data safety declarations disclosed only 36.4% for mobile games.
Conclusion: The findings expose systemic issues including widespread reuse of generic privacy policies, vague/contradictory statements, and hidden risks in high-profile apps with 100M+ downloads, highlighting the need for automated enforcement and user vigilance.
Abstract: End-users seldom read verbose privacy policies, leading app stores like Google Play to mandate simplified data safety declarations as a user-friendly alternative. However, these self-declared disclosures often contradict the full privacy policies, deceiving users about actual data practices and violating regulatory requirements for consistency. To address this, we introduce PrivPRISM, a robust framework that combines encoder and decoder language models to systematically extract and compare fine-grained data practices from privacy policies and to compare against data safety declarations, enabling scalable detection of non-compliance. Evaluating 7,770 popular mobile games uncovers discrepancies in nearly 53% of cases, rising to 61% among 1,711 widely used generic apps. Additionally, static code analysis reveals possible under-disclosures, with privacy policies disclosing just 66.8% of potential accesses to sensitive data like location and financial information, versus only 36.4% in data safety declarations of mobile games. Our findings expose systemic issues, including widespread reuse of generic privacy policies, vague/contradictory statements, and hidden risks in high-profile apps with 100M+ downloads, underscoring the urgent need for automated enforcement to protect platform integrity and for end-users to be vigilant about sensitive data they disclose via popular apps.
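Once fine-grained data practices have been extracted by the language-model pipeline, the discrepancy check itself reduces to a set comparison; a minimal sketch (the `find_discrepancies` helper and category names are hypothetical illustrations, not Google Play's taxonomy or the paper's code):

```python
def find_discrepancies(policy_practices, safety_declaration):
    """Compare data practices stated in a privacy policy against those
    listed in a data safety declaration.

    Returns practices the policy admits but the declaration omits
    (under-disclosure) and practices declared but absent from the
    policy (unsupported claims)."""
    policy = set(policy_practices)
    declared = set(safety_declaration)
    return {
        "undeclared": sorted(policy - declared),   # policy admits, form omits
        "unsupported": sorted(declared - policy),  # form claims, policy silent
    }

gaps = find_discrepancies(
    policy_practices={"location", "financial_info", "contacts"},
    safety_declaration={"location", "device_id"},
)
```

The non-empty `undeclared` set corresponds to the kind of under-disclosure the paper quantifies for sensitive categories like location and financial information.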
[339] Cognitively Layered Data Synthesis for Domain Adaptation of LLMs to Space Situational Awareness
Ding Linghu, Cheng Wang, Da Fan, Wei Shi, Kaifeng Yin, Xiaoliang Xue, Fan Yang, Haiyi Ren, Cong Zhang
Main category: cs.AI
TL;DR: BD-FDG framework for generating high-quality domain-specific fine-tuning data using Bloom’s Taxonomy to address knowledge coverage, cognitive depth, and quality control in complex engineering domains like space situational awareness.
Details
Motivation: Transferring LLMs to complex engineering domains like space situational awareness is challenging due to insufficient structural alignment, absence of higher-order cognitive supervision, and poor correspondence between data quality and engineering specifications. The core bottleneck is constructing high-quality supervised fine-tuning datasets.
Method: Proposes BD-FDG framework with three mechanisms: 1) structured knowledge organization using knowledge trees for corpus coverage, 2) cognitively layered question modeling spanning nine categories and six cognitive levels (Remember to Create) for continuous difficulty gradient, and 3) automated quality control with multidimensional scoring pipeline for domain rigor and consistency.
Result: Constructed SSA-SFT dataset (~230K samples) and fine-tuned Qwen3-8B to obtain SSA-LLM-8B. Achieved relative BLEU-1 improvements of 144% (no-think) and 176% (think) on domain test set, win rate of 82.21% over baseline in arena comparisons, while largely preserving general benchmark performance (MMLU-Pro, MATH-500).
Conclusion: Validates SFT data construction driven by cognitive layering as effective paradigm for complex engineering domains and provides transferable framework for domain-specific LLM adaptation.
Abstract: Large language models (LLMs) demonstrate exceptional performance on general-purpose tasks. However, transferring them to complex engineering domains such as space situational awareness (SSA) remains challenging owing to insufficient structural alignment with mission chains, the absence of higher-order cognitive supervision, and poor correspondence between data quality criteria and engineering specifications. The core bottleneck is the construction of high-quality supervised fine-tuning (SFT) datasets. To this end, we propose BD-FDG (Bloom’s Taxonomy-based Domain-specific Fine-tuning Data Generation), a framework that addresses incomplete knowledge coverage, shallow cognitive depth, and limited quality controllability through three mechanisms: structured knowledge organization, cognitively layered question modeling, and automated quality control. The framework uses a knowledge tree to ensure structured corpus coverage, designs a question generation scheme spanning nine categories and six cognitive levels from Remember to Create to produce samples with a continuous difficulty gradient, and applies a multidimensional scoring pipeline to enforce domain rigor and consistency. Using BD-FDG, we construct SSA-SFT, a domain dataset of approximately 230K samples, and fine-tune Qwen3-8B to obtain SSA-LLM-8B. Experiments show that SSA-LLM-8B achieves relative BLEU-1 improvements of 144% (no-think) and 176% (think) on the domain test set and a win rate of 82.21% over the baseline in arena comparisons, while largely preserving general benchmark performance (MMLU-Pro, MATH-500). These results validate SFT data construction driven by cognitive layering as an effective paradigm for complex engineering domains and provide a transferable framework for domain-specific LLM adaptation.
[340] Computational Multi-Agents Society Experiments: Social Modeling Framework Based on Generative Agents
Hanzhong Zhang, Muhua Huang, Jindong Wang
Main category: cs.AI
TL;DR: CMASE is a computational multi-agent society framework that combines generative agent-based modeling with virtual ethnography, enabling researchers to embed themselves in simulations as participants for interactive social experiments.
Details
Motivation: To bridge the gap between computational social science and traditional ethnographic methods by creating a framework that allows researchers to be embedded participants rather than external observers, enabling more nuanced study of social intervention processes.
Method: Integrates generative agent-based modeling with virtual ethnographic methods, transforming simulations into ethnographic fields where researchers can interact in real-time, reconstruct social phenomena logic, and provide predictive causal explanations.
Result: CMASE successfully simulates complex social phenomena and generates behavior trajectories consistent with both statistical patterns and mechanistic explanations, demonstrating value for intervention modeling.
Conclusion: The framework advances interdisciplinary integration in social sciences by combining computational rigor with ethnographic depth, offering new methodological approaches for studying social interventions.
Abstract: This paper introduces CMASE, a framework for Computational Multi-Agent Society Experiments that integrates generative agent-based modeling with virtual ethnographic methods to support researcher embedding, interactive participation, and mechanism-oriented intervention in virtual social environments. By transforming the simulation into a simulated ethnographic field, CMASE shifts the researcher from an external operator to an embedded participant. Specifically, the framework is designed to achieve three core capabilities: (1) enabling real-time human-computer interaction that allows researchers to dynamically embed themselves into the system to characterize complex social intervention processes; (2) reconstructing the generative logic of social phenomena by combining the rigor of computational experiments with the interpretative depth of traditional ethnography; and (3) providing a predictive foundation with causal explanatory power to make forward-looking judgments without sacrificing empirical accuracy. Experimental results show that CMASE can not only simulate complex phenomena, but also generate behavior trajectories consistent with both statistical patterns and mechanistic explanations. These findings demonstrate CMASE’s methodological value for intervention modeling, highlighting its potential to advance interdisciplinary integration in the social sciences. The official code is available at: https://github.com/armihia/CMASE .
[341] Social-R1: Towards Human-like Social Reasoning in LLMs
Jincenzi Wu, Yuxuan Lei, Jianxun Lian, Yitian Huang, Lexin Zhou, Haotian Li, Xing Xie, Helen Meng
Main category: cs.AI
TL;DR: Social-R1 framework uses adversarial training and reinforcement learning to improve social intelligence in language models by aligning reasoning processes with human cognition.
Details
Motivation: Current large language models lack genuine social reasoning capabilities, relying on superficial patterns rather than true social intelligence needed for effective human-AI collaboration.
Method: Introduces ToMBench-Hard adversarial benchmark for hard training examples, plus Social-R1 RL framework with multi-dimensional rewards that supervise the entire reasoning process (structural alignment, logical integrity, information density).
Result: A 4B parameter model surpasses much larger counterparts and generalizes robustly across eight diverse benchmarks.
Conclusion: Challenging training cases with trajectory-level alignment offer a path toward efficient and reliable social intelligence in AI models.
Abstract: While large language models demonstrate remarkable capabilities across numerous domains, social intelligence - the capacity to perceive social cues, infer mental states, and generate appropriate responses - remains a critical challenge, particularly for enabling effective human-AI collaboration and developing AI that truly serves human needs. Current models often rely on superficial patterns rather than genuine social reasoning. We argue that cultivating human-like social intelligence requires training with challenging cases that resist shortcut solutions. To this end, we introduce ToMBench-Hard, an adversarial benchmark designed to provide hard training examples for social reasoning. Building on this, we propose Social-R1, a reinforcement learning framework that aligns model reasoning with human cognition through multi-dimensional rewards. Unlike outcome-based RL, Social-R1 supervises the entire reasoning process, enforcing structural alignment, logical integrity, and information density. Results show that our approach enables a 4B parameter model to surpass much larger counterparts and generalize robustly across eight diverse benchmarks. These findings demonstrate that challenging training cases with trajectory-level alignment offer a path toward efficient and reliable social intelligence.
[342] Logos: An evolvable reasoning engine for rational molecular design
Haibin Wen, Zhe Zhao, Fanfu Wang, Tianyi Xu, Hao Zhang, Chao Yang, Ye Wei
Main category: cs.AI
TL;DR: Logos is a compact molecular reasoning model that integrates multi-step logical reasoning with strict chemical consistency for reliable and interpretable molecular design.
Details
Motivation: Existing molecular AI models either excel in physical fidelity without transparent reasoning, or in flexible reasoning without chemical validity guarantees, limiting reliability in real scientific design workflows.
Method: Three-stage training: 1) exposure to explicit reasoning examples linking molecular descriptions to structural decisions, 2) progressive alignment of reasoning patterns with molecular representations, 3) incorporation of chemical rules and invariants directly into optimization objective.
Result: Achieves strong performance in both structural accuracy and chemical validity across multiple benchmark datasets, matching or surpassing larger general-purpose language models with fewer parameters. Shows stable behavior in molecular optimization with multiple constraints.
Conclusion: Joint optimization for reasoning structure and physical consistency offers a practical pathway toward reliable and interpretable AI systems for molecular science, supporting closer AI integration into scientific discovery.
Abstract: The discovery and design of functional molecules remain central challenges across chemistry, biology, and materials science. While recent advances in machine learning have accelerated molecular property prediction and candidate generation, existing models tend to excel either in physical fidelity without transparent reasoning, or in flexible reasoning without guarantees of chemical validity. This imbalance limits the reliability of artificial intelligence systems in real scientific design workflows. Here we present Logos, a compact molecular reasoning model that integrates multi-step logical reasoning with strict chemical consistency. Logos is trained using a staged strategy that first exposes the model to explicit reasoning examples linking molecular descriptions to structural decisions, and then progressively aligns these reasoning patterns with molecular representations. In a final training phase, chemical rules and invariants are incorporated directly into the optimization objective, guiding the model toward chemically valid outputs. Across multiple benchmark datasets, Logos achieves strong performance in both structural accuracy and chemical validity, matching or surpassing substantially larger general-purpose language models while operating with a fraction of their parameters. Beyond benchmark evaluation, the model exhibits stable behaviour in molecular optimization tasks involving multiple, potentially conflicting constraints. By explicitly exposing intermediate reasoning steps, Logos enables human inspection and assessment of the design logic underlying each generated structure. These results indicate that jointly optimizing for reasoning structure and physical consistency offers a practical pathway toward reliable and interpretable AI systems for molecular science, supporting closer integration of artificial intelligence into scientific discovery processes.
[343] Multi-Agent Reinforcement Learning with Communication-Constrained Priors
Guang Yang, Tianpei Yang, Jingwen Qiao, Yanqing Wu, Jing Huo, Xingguo Chen, Yang Gao
Main category: cs.AI
TL;DR: A multi-agent reinforcement learning framework that addresses lossy communication in real-world scenarios by distinguishing between lossy/lossless messages and quantifying their impact on decision-making.
Details
Motivation: Real-world multi-agent systems often face lossy communication issues, but existing MARL methods lack scalability and robustness for complex dynamic environments with communication constraints.
Method: Proposes a generalized communication-constrained model to characterize communication conditions, uses it as a learning prior to distinguish lossy/lossless messages, decouples their impact using a dual mutual information estimator, and quantifies message impact into the global reward.
Result: Validated effectiveness across several communication-constrained benchmarks, showing improved performance in lossy communication scenarios.
Conclusion: The proposed framework addresses lossy communication challenges in MARL, improving scalability and robustness for real-world applications with communication constraints.
Abstract: Communication is one of the effective means to improve the learning of cooperative policy in multi-agent systems. However, in most real-world scenarios, lossy communication is a prevalent issue. Existing multi-agent reinforcement learning with communication, due to their limited scalability and robustness, struggles to apply to complex and dynamic real-world environments. To address these challenges, we propose a generalized communication-constrained model to uniformly characterize communication conditions across different scenarios. Based on this, we utilize it as a learning prior to distinguish between lossy and lossless messages for specific scenarios. Additionally, we decouple the impact of lossy and lossless messages on distributed decision-making, drawing on a dual mutual information estimator, and introduce a communication-constrained multi-agent reinforcement learning framework, quantifying the impact of communication messages into the global reward. Finally, we validate the effectiveness of our approach across several communication-constrained benchmarks.
[344] Rescaling Confidence: What Scale Design Reveals About LLM Metacognition
Yuyang Dai
Main category: cs.AI
TL;DR: LLM verbalized confidence scales (0-100) are heavily discretized with round-number preferences; systematic scale manipulation shows 0-20 scale improves metacognitive efficiency over standard 0-100 format.
Details
Motivation: Verbalized confidence scores from LLMs are widely used for uncertainty estimation but the confidence scale design itself (typically 0-100) is rarely examined, despite potential impact on uncertainty quality.
Method: Systematically manipulated confidence scales along three dimensions: granularity, boundary placement, and range regularity; evaluated metacognitive sensitivity using meta-d’ across six LLMs and three datasets.
Result: Verbalized confidence is heavily discretized (78%+ responses on three round-number values); 0-20 scale consistently improves metacognitive efficiency over standard 0-100 format; boundary compression degrades performance; round-number preferences persist even under irregular ranges.
Conclusion: Confidence scale design directly affects verbalized uncertainty quality and should be treated as a first-class experimental variable in LLM evaluation, rather than using arbitrary default scales.
Abstract: Verbalized confidence, in which LLMs report a numerical certainty score, is widely used to estimate uncertainty in black-box settings, yet the confidence scale itself (typically 0–100) is rarely examined. We show that this design choice is not neutral. Across six LLMs and three datasets, verbalized confidence is heavily discretized, with more than 78% of responses concentrating on just three round-number values. To investigate this phenomenon, we systematically manipulate confidence scales along three dimensions: granularity, boundary placement, and range regularity, and evaluate metacognitive sensitivity using meta-d’. We find that a 0–20 scale consistently improves metacognitive efficiency over the standard 0–100 format, while boundary compression degrades performance and round-number preferences persist even under irregular ranges. These results demonstrate that confidence scale design directly affects the quality of verbalized uncertainty and should be treated as a first-class experimental variable in LLM evaluation.
[345] Curveball Steering: The Right Direction To Steer Isn’t Always Linear
Shivam Raval, Hae Jin Song, Linlin Wu, Abir Harrasse, Jeff Phillips, Amirali Abdullah
Main category: cs.AI
TL;DR: Curveball steering: A nonlinear activation steering method using polynomial kernel PCA that outperforms linear steering by better respecting the learned geometry of LLM activation spaces.
Details
Motivation: Existing activation steering methods rely on the Linear Representation Hypothesis, assuming behavioral attributes can be manipulated using global linear directions. However, linear interventions often behave inconsistently in practice, suggesting the activation spaces may not have globally linear geometry.
Method: The authors analyze intrinsic geometry of LLM activation spaces by measuring geometric distortion via ratio of geodesic to Euclidean distances. They propose Curveball steering, a nonlinear steering method based on polynomial kernel PCA that performs interventions in a feature space that better respects the learned activation geometry.
Result: Curveball steering consistently outperforms linear PCA-based steering, particularly in regimes exhibiting strong geometric distortion. The analysis reveals substantial and concept-dependent distortions in activation spaces, indicating they are not well-approximated by globally linear geometry.
Conclusion: Geometry-aware, nonlinear steering provides a principled alternative to global, linear interventions for controlling LLM behavior. The findings challenge the Linear Representation Hypothesis and suggest that respecting the intrinsic geometry of activation spaces leads to more effective steering.
Abstract: Activation steering is a widely used approach for controlling large language model (LLM) behavior by intervening on internal representations. Existing methods largely rely on the Linear Representation Hypothesis, assuming behavioral attributes can be manipulated using global linear directions. In practice, however, such linear interventions often behave inconsistently. We question this assumption by analyzing the intrinsic geometry of LLM activation spaces. Measuring geometric distortion via the ratio of geodesic to Euclidean distances, we observe substantial and concept-dependent distortions, indicating that activation spaces are not well-approximated by a globally linear geometry. Motivated by this, we propose “Curveball steering”, a nonlinear steering method based on polynomial kernel PCA that performs interventions in a feature space, better respecting the learned activation geometry. Curveball steering consistently outperforms linear PCA-based steering, particularly in regimes exhibiting strong geometric distortion, suggesting that geometry-aware, nonlinear steering provides a principled alternative to global, linear interventions.
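The paper's distortion diagnostic, the ratio of geodesic to Euclidean distance, can be approximated on a point cloud with a k-NN graph and all-pairs shortest paths; a sketch under stated assumptions (the choice of `k`, the symmetrized graph, and averaging over pairs are illustrative, not the paper's exact protocol):

```python
import numpy as np

def distortion_ratio(X, k=3):
    """Mean ratio of graph-geodesic to straight-line distance over point
    pairs. A ratio near 1 suggests locally flat (linear) geometry; larger
    values indicate curvature of the kind the paper associates with
    failures of linear steering."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # build a symmetrized k-nearest-neighbour graph
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    for i in range(n):
        nbrs = np.argsort(D[i])[1:k + 1]
        G[i, nbrs] = D[i, nbrs]
        G[nbrs, i] = D[i, nbrs]
    for m in range(n):  # Floyd-Warshall all-pairs shortest paths
        G = np.minimum(G, G[:, m:m + 1] + G[m:m + 1, :])
    mask = (D > 0) & np.isfinite(G)
    return float(np.mean(G[mask] / D[mask]))

# Points on a quarter circle: geodesics follow the arc, so the ratio
# exceeds 1, unlike points sampled from a flat subspace.
theta = np.linspace(0, np.pi / 2, 20)
curve = np.stack([np.cos(theta), np.sin(theta)], axis=1)
r = distortion_ratio(curve)
```

On activation datasets one would replace `curve` with hidden-state vectors for a concept; regimes where this ratio is large are exactly where the paper reports linear PCA steering underperforming the kernelized variant.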
[346] Robust Regularized Policy Iteration under Transition Uncertainty
Hongqiang Lin, Zhenghui Fu, Weihao Tang, Pengfei Wang, Yiding Sun, Qixian Huang, Dongxu Zhang
Main category: cs.AI
TL;DR: RRPI is a robust offline RL method that handles distribution shift by treating transitions as uncertain and optimizing policies against worst-case dynamics, using KL-regularized policy iteration with theoretical guarantees.
Details
Motivation: Offline RL suffers from performance degradation under distribution shift when policies visit out-of-distribution state-action pairs where value estimates and learned dynamics become unreliable. There's a need to address both policy-induced extrapolation and transition uncertainty in a unified framework.
Method: Formulates offline RL as robust policy optimization with transition kernel as decision variable within uncertainty set. Proposes Robust Regularized Policy Iteration (RRPI) which replaces intractable max-min bilevel objective with tractable KL-regularized surrogate. Derives efficient policy iteration procedure based on robust regularized Bellman operator.
Result: RRPI achieves strong average performance on D4RL benchmarks, outperforming recent baselines including percentile-based methods like PMDB on majority of environments while remaining competitive on others. Learned Q-values decrease in regions with higher epistemic uncertainty, showing policy avoids unreliable out-of-distribution actions.
Conclusion: RRPI provides a unified framework for handling distribution shift in offline RL through robust optimization with theoretical guarantees, demonstrating practical effectiveness and robust behavior against transition uncertainty.
Abstract: Offline reinforcement learning (RL) enables data-efficient and safe policy learning without online exploration, but its performance often degrades under distribution shift. The learned policy may visit out-of-distribution state-action pairs where value estimates and learned dynamics are unreliable. To address policy-induced extrapolation and transition uncertainty in a unified framework, we formulate offline RL as robust policy optimization, treating the transition kernel as a decision variable within an uncertainty set and optimizing the policy against the worst-case dynamics. We propose Robust Regularized Policy Iteration (RRPI), which replaces the intractable max-min bilevel objective with a tractable KL-regularized surrogate and derives an efficient policy iteration procedure based on a robust regularized Bellman operator. We provide theoretical guarantees by showing that the proposed operator is a γ-contraction and that iteratively updating the surrogate yields monotonic improvement of the original robust objective with convergence. Experiments on D4RL benchmarks demonstrate that RRPI achieves strong average performance, outperforming recent baselines including percentile-based methods such as PMDB on the majority of environments while remaining competitive on the rest. Moreover, RRPI exhibits robust behavior. The learned Q-values decrease in regions with higher epistemic uncertainty, suggesting that the resulting policy avoids unreliable out-of-distribution actions under transition uncertainty.
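KL-regularized worst-case expectations of the kind such robust Bellman operators rely on admit a well-known closed form via the Gibbs variational (Donsker-Varadhan) identity. A tabular sketch of one backup step, illustrative only and not the paper's exact operator; `beta` is an assumed regularization strength:

```python
import numpy as np

def kl_robust_backup(P0, R, V, gamma=0.99, beta=1.0):
    """One KL-regularized worst-case Bellman backup (tabular sketch).

    For each state-action pair the adversary tilts the nominal next-state
    distribution p0 subject to a KL penalty, which has the closed form
        inf_p E_p[V] + beta * KL(p || p0) = -beta * log E_{p0}[exp(-V/beta)].

    P0: (S, A, S) nominal transition kernel
    R:  (S, A) rewards,  V: (S,) current value estimate
    """
    logits = -np.asarray(V) / beta                  # (S,)
    m = logits.max()                                # log-sum-exp stability
    worst = -beta * (m + np.log(np.einsum("sat,t->sa",
                                          P0, np.exp(logits - m))))
    return R + gamma * worst                        # (S, A) Q-values

# Two states, one action: from state 0 the action moves uniformly to {0, 1};
# from state 1 it deterministically stays in state 1.
P0 = np.array([[[0.5, 0.5]], [[0.0, 1.0]]])
R = np.array([[1.0], [0.0]])
V = np.array([0.0, 1.0])
Q = kl_robust_backup(P0, R, V, gamma=0.9, beta=1.0)
```

The stochastic transition is backed up pessimistically (below its nominal expectation), while the deterministic one is unchanged, matching the intuition that the worst case only bites where the dynamics are uncertain.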
[347] AI Act Evaluation Benchmark: An Open, Transparent, and Reproducible Evaluation Dataset for NLP and RAG Systems
Athanasios Davvetas, Michael Papademas, Xenia Ziouvelou, Vangelis Karkaletsis
Main category: cs.AI
TL;DR: A method for creating datasets to evaluate NLP/RAG systems’ compliance with the EU AI Act using LLMs for grounded generation of risk scenarios and tasks.
Details
Motivation: The need for automated compliance evaluation of AI systems with regulations like the EU AI Act, addressing limitations of manual evaluation and lack of resources for systematic assessment.
Method: Combines domain knowledge with LLMs to generate scenarios and tasks (risk classification, article retrieval, obligation generation, QA) in machine-readable format, addressing ambiguous risk boundaries in regulations.
Result: Created dataset for EU AI Act compliance evaluation; demonstrated effectiveness with RAG system achieving 0.87 and 0.85 F1-scores for prohibited and high-risk scenarios.
Conclusion: Presents reproducible method for creating compliance evaluation resources using LLMs, enabling better assessment of AI systems against regulatory standards.
Abstract: The rapid rollout of AI in heterogeneous public and societal sectors has subsequently escalated the need for compliance with regulatory standards and frameworks. The EU AI Act has emerged as a landmark in the regulatory landscape. The development of solutions that elicit the level of AI systems’ compliance with such standards is often limited by the lack of resources, hindering the semi-automated or automated evaluation of their performance. This generates the need for manual work, which is often error-prone, resource-limited or limited to cases not clearly described by the regulation. This paper presents an open, transparent, and reproducible method of creating a resource that facilitates the evaluation of NLP models with a strong focus on RAG systems. We have developed a dataset that contain the tasks of risk-level classification, article retrieval, obligation generation, and question-answering for the EU AI Act. The dataset files are in a machine-to-machine appropriate format. To generate the files, we utilise domain knowledge as an exegetical basis, combining with the processing and reasoning power of large language models to generate scenarios along with the respective tasks. Our methodology demonstrates a way to harness language models for grounded generation with high document relevancy. Besides, we overcome limitations such as navigating the decision boundaries of risk-levels that are not explicitly defined within the EU AI Act, such as limited and minimal cases. Finally, we demonstrate our dataset’s effectiveness by evaluating a RAG-based solution that reaches 0.87 and 0.85 F1-score for prohibited and high-risk scenarios.
[348] An Empirical Study and Theoretical Explanation on Task-Level Model-Merging Collapse
Yuan Cao, Dezhi Ran, Yuzhe Guo, Mengzhou Wu, Simin Chen, Linyi Li, Wei Yang, Tao Xie
Main category: cs.AI
TL;DR: Model merging can fail catastrophically for certain task combinations due to representational incompatibility, not parameter conflicts, with theoretical limits on task mergeability.
Details
Motivation: Model merging enables reuse of parallel fine-tuned LLMs, but in practice suffers from catastrophic performance degradation (merging collapse) for certain task combinations, which needs investigation.
Method: Extensive experiments and statistical analysis to identify task-level merging collapse, correlation analysis between representational incompatibility vs. parameter-space conflicts, and theoretical explanation using rate-distortion theory with dimension-dependent bounds.
Result: Representational incompatibility between tasks strongly correlates with merging collapse, while parameter-space conflict metrics show minimal correlation, challenging conventional wisdom. Theoretical bounds establish fundamental limits on task mergeability.
Conclusion: Merging collapse is fundamentally limited by representational incompatibility between tasks, not parameter conflicts, with theoretical limits on what tasks can be successfully merged regardless of methodology.
Abstract: Model merging unifies independently fine-tuned LLMs from the same base, enabling reuse and integration of parallel development efforts without retraining. However, in practice we observe that merging does not always succeed: certain combinations of task-specialist models suffer from catastrophic performance degradation after merging. We refer to this failure mode as merging collapse. Intuitively, collapse arises when the learned representations or parameter adjustments for different tasks are fundamentally incompatible, so that merging forces destructive interference rather than synergy. In this paper, we identify and characterize the phenomenon of task-level merging collapse, where certain task combinations consistently trigger huge performance degradation across all merging methods. Through extensive experiments and statistical analysis, we demonstrate that representational incompatibility between tasks is strongly correlated with merging collapse, while parameter-space conflict metrics show minimal correlation, challenging conventional wisdom in model merging literature. We provide a theoretical explanation on this phenomenon through rate-distortion theory with a dimension-dependent bound, establishing fundamental limits on task mergeability regardless of methodology.
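The paper's central correlate, representational incompatibility between tasks, can be illustrated with a crude proxy: dissimilarity between the mean hidden representations each task induces. This is an illustrative stand-in, not the paper's actual metric, and the toy vectors below are invented:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def representational_incompatibility(reps_a, reps_b):
    """1 - cosine similarity between the mean hidden representations
    of two tasks (a crude stand-in for the paper's analysis)."""
    mean = lambda reps: [sum(col) / len(reps) for col in zip(*reps)]
    return 1.0 - cosine(mean(reps_a), mean(reps_b))

task_a = [[1.0, 0.1], [0.9, 0.0]]
task_b = [[1.0, 0.0], [1.1, 0.1]]  # similar geometry -> low incompatibility
task_c = [[0.0, 1.0], [0.1, 0.9]]  # orthogonal geometry -> high incompatibility
print(representational_incompatibility(task_a, task_b) <
      representational_incompatibility(task_a, task_c))  # True
```

Under the paper's finding, merging specialists for task_a and task_c (high incompatibility) would be the collapse-prone combination, while a parameter-space conflict metric on the same pair could look benign.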
[349] Telogenesis: Goal Is All U Need
Zhuoran Deng, Yizhi Zhang, Ziyi Zhang, Wan Shen
Main category: cs.AI
TL;DR: Endogenous attentional priorities emerge from epistemic gaps (ignorance, surprise, staleness) without external goals, outperforming fixed strategies and recovering latent environmental structure.
Details
Motivation: Traditional goal-conditioned systems rely on externally provided goals. This paper investigates whether attentional priorities can emerge endogenously from an agent's internal cognitive state rather than being externally imposed.
Method: Proposes a priority function based on three epistemic gaps: ignorance (posterior variance), surprise (prediction error), and staleness (temporal decay of confidence). Validated in two systems: minimal attention-allocation environment (2,000 runs) and modular partially observable world (500 runs). Includes ablation studies and makes decay rates learnable per variable.
Result: Priority-guided allocation outperforms fixed strategies, with metric-dependent reversal effects. Detection latency follows power law in attention budget, with steeper exponent for priority-guided allocation (0.55 vs 0.40). When decay rates are learnable, system spontaneously recovers environmental volatility structure without supervision (t = 22.5, p < 10^-6).
Conclusion: Epistemic gaps alone, without external reward, suffice to generate adaptive attentional priorities that outperform fixed strategies and can recover latent environmental structure, demonstrating endogenous emergence of cognitive priorities.
Abstract: Goal-conditioned systems assume goals are provided externally. We ask whether attentional priorities can emerge endogenously from an agent’s internal cognitive state. We propose a priority function that generates observation targets from three epistemic gaps: ignorance (posterior variance), surprise (prediction error), and staleness (temporal decay of confidence in unobserved variables). We validate this in two systems: a minimal attention-allocation environment (2,000 runs) and a modular, partially observable world (500 runs). Ablation shows each component is necessary. A key finding is metric-dependent reversal: under global prediction error, coverage-based rotation wins; under change detection latency, priority-guided allocation wins, with advantage growing monotonically with dimensionality (d = -0.95 at N=48, p < 10^-6). Detection latency follows a power law in attention budget, with a steeper exponent for priority-guided allocation (0.55 vs. 0.40). When the decay rate is made learnable per variable, the system spontaneously recovers environmental volatility structure without supervision (t = 22.5, p < 10^-6). We demonstrate that epistemic gaps alone, without external reward, suffice to generate adaptive priorities that outperform fixed strategies and recover latent environmental structure.
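The three epistemic gaps can be sketched as a per-variable priority that drives attention allocation under a budget. The weighted-sum combination and exponential staleness form below are illustrative assumptions; the paper names the three gaps but this exact functional form is ours:

```python
import math

def priority(posterior_var, pred_error, time_since_obs, decay_rate=0.1,
             w_ignorance=1.0, w_surprise=1.0, w_staleness=1.0):
    """Epistemic-gap priority for one observable variable.

    ignorance = posterior variance
    surprise  = magnitude of the last prediction error
    staleness = confidence decay since the variable was last observed
    """
    ignorance = posterior_var
    surprise = abs(pred_error)
    staleness = 1.0 - math.exp(-decay_rate * time_since_obs)
    return (w_ignorance * ignorance + w_surprise * surprise
            + w_staleness * staleness)

def allocate_attention(variables, budget):
    """Observe the top-`budget` variables by priority.
    Each variable is (name, posterior_var, pred_error, time_since_obs)."""
    ranked = sorted(variables, key=lambda v: priority(*v[1:]), reverse=True)
    return [name for name, *_ in ranked[:budget]]

vars_ = [("a", 0.1, 0.0, 1), ("b", 0.9, 0.5, 10), ("c", 0.2, 0.1, 2)]
print(allocate_attention(vars_, 2))  # ['b', 'c']
```

Making `decay_rate` learnable per variable is what lets the paper's system recover environmental volatility structure: fast-changing variables earn high decay rates and so accumulate staleness quickly.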
[350] GenePlan: Evolving Better Generalized PDDL Plans using Large Language Models
Andrew Murray, Danial Dervovic, Alberto Pozanco, Michael Cashmore
Main category: cs.AI
TL;DR: GenePlan uses LLM-assisted evolutionary algorithms to generate Python planners for classical planning tasks, achieving near state-of-the-art performance with fast execution and low cost.
Details
Motivation: To develop a framework that can automatically generate generalized planners for classical planning domains without requiring manual engineering, leveraging LLMs to assist in evolutionary optimization.
Method: GenePlan casts generalized planning as an optimization problem and uses evolutionary algorithms assisted by LLMs to iteratively evolve interpretable Python planners that minimize plan length across diverse problem instances.
Result: Achieved average SAT score of 0.91 across six benchmark domains and two new domains, closely matching state-of-the-art planners (0.93) and significantly outperforming LLM-based baselines like CoT prompting (0.64). Generated planners solve new instances rapidly (0.49 seconds per task) at low cost ($1.82 per domain using GPT-4o).
Conclusion: GenePlan demonstrates that LLM-assisted evolutionary algorithms can effectively generate high-quality generalized planners for classical planning tasks, offering a promising approach to automated planner synthesis.
Abstract: We present GenePlan (GENeralized Evolutionary Planner), a novel framework that leverages large language model (LLM) assisted evolutionary algorithms to generate domain-dependent generalized planners for classical planning tasks described in PDDL. By casting generalized planning as an optimization problem, GenePlan iteratively evolves interpretable Python planners that minimize plan length across diverse problem instances. In empirical evaluation across six existing benchmark domains and two new domains, GenePlan achieved an average SAT score of 0.91, closely matching the performance of the state-of-the-art planners (SAT score 0.93), and significantly outperforming other LLM-based baselines such as chain-of-thought (CoT) prompting (average SAT score 0.64). The generated planners solve new instances rapidly (average 0.49 seconds per task) and at low cost (average $1.82 per domain using GPT-4o).
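The evolve-and-select loop at GenePlan's core can be reduced to a toy sketch. In the actual framework the mutation step is an LLM rewriting a Python planner against PDDL instances; here `mutate` is a plain function and the domain is a stand-in, both illustrative assumptions:

```python
import random

def evolve_planner(candidates, instances, mutate, generations=20, seed=0):
    """Evolve planners toward minimal total plan length across instances."""
    rng = random.Random(seed)

    def fitness(planner):
        # Lower total plan length is better; unsolved instances are penalised.
        total = 0
        for inst in instances:
            plan = planner(inst)
            total += len(plan) if plan is not None else 1000
        return total

    pop = list(candidates)
    for _ in range(generations):
        pop.sort(key=fitness)
        survivors = pop[: max(1, len(pop) // 2)]       # keep the best half
        pop = survivors + [mutate(rng.choice(survivors), rng)
                           for _ in survivors]          # refill by mutation
    return min(pop, key=fitness)

# Toy domain: an instance is a size n; a plan is a list of steps.
slow = lambda n: ["move"] * (2 * n)
fast = lambda n: ["move"] * n
best = evolve_planner([slow, fast], [1, 2, 3], mutate=lambda p, rng: p)
print(best is fast)  # True
```

Because the evolved artifact is an ordinary Python function, solving a new instance is just a function call, which is consistent with the reported 0.49-second average per task.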
[351] Vibe-Creation: The Epistemology of Human-AI Emergent Cognition
Ilya Levin
Main category: cs.AI
TL;DR: The paper proposes a “Third Entity” theoretical framework for human-GenAI interactions, arguing they create emergent cognitive formations beyond traditional metaphors of tool use or collaboration.
Details
Motivation: To move beyond inadequate inherited metaphors (tool use, augmentation, partnership) for describing human-GenAI interactions and develop a more sophisticated theoretical understanding of these emergent cognitive formations.
Method: Develops a multi-layered theoretical framework drawing on Peirce semiotics, Polanyi’s tacit knowledge theory, Simondon’s philosophy of individuation, Ihde’s postphenomenology, and Morin’s complexity theory to conceptualize the “Third Entity.”
Result: Introduces concepts of “vibe-creation” (pre-reflective navigation of semantic space) and “asymmetric emergence” (novel agency anchored in human responsibility) to characterize the Third Entity’s cognitive-epistemic formation.
Conclusion: The Third Entity framework has significant implications for epistemology, philosophy of mind, educational theory, and requires transformation of educational institutions and redefinition of intellectual competence in the GenAI age.
Abstract: The encounter between human reasoning and generative artificial intelligence (GenAI) cannot be adequately described by inherited metaphors of tool use, augmentation, or collaborative partnership. This article argues that such interactions produce a qualitatively distinct cognitive-epistemic formation, designated here as the Third Entity: an emergent, transient structure that arises from the transductive coupling of two ontologically incommensurable modes of cognition. Drawing on Peirce’s semiotics, Polanyi’s theory of tacit knowledge, Simondon’s philosophy of individuation, Ihde’s postphenomenology, and Morin’s complexity theory, we develop a multi-layered theoretical account of this formation. We introduce the concept of vibe-creation to designate the pre-reflective cognitive mode through which the Third Entity navigates high-dimensional semantic space and argue that this mode constitutes the automation of tacit knowledge - a development with far-reaching consequences for epistemology, the philosophy of mind, and educational theory. We further propose the notion of asymmetric emergence to characterize the agency of the Third Entity: genuinely novel and irreducible, yet anchored in human intentional responsibility. The article concludes by examining the implications of this theoretical framework for the transformation of educational institutions and the redefinition of intellectual competence in the age of GenAI.
[352] Enhancing Debunking Effectiveness through LLM-based Personality Adaptation
Pietro Dell’Oglio, Alessandro Bondielli, Francesco Marcelloni, Lucia C. Passaro
Main category: cs.AI
TL;DR: LLM-generated personalized fake news debunking messages tailored to Big Five personality traits show increased persuasiveness compared to generic messages.
Details
Motivation: To create more effective fake news debunking by personalizing messages to individual personality traits, leveraging LLMs to avoid costly human evaluation.
Method: Use LLMs with persona-based prompts aligned to Big Five traits to transform generic debunking content into personalized versions, then evaluate effectiveness using separate LLMs simulating personality traits.
Result: Personalized messages are generally more persuasive than generic ones; Openness increases persuadability while Neuroticism lowers it; multiple LLM evaluators provide clearer assessment.
Conclusion: LLMs enable practical creation of targeted debunking messages, though ethical concerns about such technology use remain important.
Abstract: This study proposes a novel methodology for generating personalized fake news debunking messages by prompting Large Language Models (LLMs) with persona-based inputs aligned to the Big Five personality traits: Extraversion, Agreeableness, Conscientiousness, Neuroticism, and Openness. Our approach guides LLMs to transform generic debunking content into personalized versions tailored to specific personality profiles. To assess the effectiveness of these transformations, we employ a separate LLM as an automated evaluator simulating corresponding personality traits, thereby eliminating the need for costly human evaluation panels. Our results show that personalized messages are generally seen as more persuasive than generic ones. We also find that traits like Openness tend to increase persuadability, while Neuroticism can lower it. Differences between LLM evaluators suggest that using multiple models provides a clearer picture. Overall, this work demonstrates a practical way to create more targeted debunking messages exploiting LLMs, while also raising important ethical questions about how such technology might be used.
[353] PRECEPT: Planning Resilience via Experience, Context Engineering & Probing Trajectories: A Unified Framework for Test-Time Adaptation with Compositional Rule Learning and Pareto-Guided Prompt Evolution
Arash Shahmansoori
Main category: cs.AI
TL;DR: PRECEPT is a framework for LLM agents with structured rule retrieval, conflict-aware memory, and prompt evolution for improved reliability and adaptation.
Details
Motivation: LLM agents using natural language knowledge storage suffer from retrieval degradation with growing conditions, struggle with reliable rule composition, and lack mechanisms to detect stale or adversarial knowledge.
Method: Three components: 1) deterministic exact-match rule retrieval over structured condition keys, 2) conflict-aware memory with Bayesian source reliability and threshold-based rule invalidation, and 3) COMPASS, a Pareto-guided prompt-evolution outer loop.
Result: +41.1pp first-try advantage over Full Reflexion, +33.3pp compositional generalization, 100% on 2-way logistics compositions, +40-55pp continuous learning gains, strong robustness under adversarial knowledge, +55.0pp drift recovery, and 61% fewer steps.
Conclusion: PRECEPT addresses key limitations of LLM agents through structured retrieval, conflict-aware memory, and prompt evolution, achieving significant improvements in reliability, generalization, and adaptation.
Abstract: LLM agents that store knowledge as natural language suffer steep retrieval degradation as condition count grows, often struggle to compose learned rules reliably, and typically lack explicit mechanisms to detect stale or adversarial knowledge. We introduce PRECEPT, a unified framework for test-time adaptation with three tightly coupled components: (1) deterministic exact-match rule retrieval over structured condition keys, (2) conflict-aware memory with Bayesian source reliability and threshold-based rule invalidation, and (3) COMPASS, a Pareto-guided prompt-evolution outer loop. Exact retrieval eliminates partial-match interpretation errors on the deterministic path (0% by construction, vs 94.4% under Theorem B.6’s independence model at N=10) and supports compositional stacking through a semantic tier hierarchy; conflict-aware memory resolves static–dynamic disagreements and supports drift adaptation; COMPASS evaluates prompts through the same end-to-end execution pipeline. Results (9–10 seeds): PRECEPT achieves a +41.1pp first-try advantage over Full Reflexion (d>1.9), +33.3pp compositional generalization (d=1.55), 100% P_1 on 2-way logistics compositions (d=2.64), +40–55pp continuous learning gains, strong eventual robustness under adversarial static knowledge (100% logistics with adversarial SK active; partial recovery on integration), +55.0pp drift recovery (d=0.95, p=0.031), and 61% fewer steps. Core comparisons are statistically significant, often at p<0.001.
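The first two components, exact-match retrieval over structured condition keys and reliability-gated invalidation, can be sketched with a hash map keyed on condition sets. The Beta-style success/trial counter and the threshold value below are illustrative assumptions, not PRECEPT's published formulation:

```python
class RuleMemory:
    """Deterministic exact-match rule retrieval with per-source reliability."""

    def __init__(self, invalidation_threshold=0.3):
        self.rules = {}        # frozenset(conditions) -> (action, source)
        self.reliability = {}  # source -> [successes, trials]
        self.threshold = invalidation_threshold

    def add(self, conditions, action, source):
        self.rules[frozenset(conditions)] = (action, source)
        self.reliability.setdefault(source, [1, 2])  # uniform-ish prior

    def retrieve(self, conditions):
        """Exact match only: no partial-match interpretation errors."""
        entry = self.rules.get(frozenset(conditions))
        if entry is None:
            return None
        action, source = entry
        s, t = self.reliability[source]
        if s / t < self.threshold:  # stale/adversarial source: invalidate
            return None
        return action

    def report(self, source, success):
        s, t = self.reliability[source]
        self.reliability[source] = [s + int(success), t + 1]

mem = RuleMemory()
mem.add({"weather:rain", "task:deliver"}, "use_covered_route", "static_kb")
print(mem.retrieve({"task:deliver", "weather:rain"}))  # use_covered_route
print(mem.retrieve({"task:deliver"}))                  # None (no partial match)
```

The `frozenset` key makes retrieval order-insensitive but strictly exact, which is the property behind the "0% by construction" claim for partial-match errors, while repeated failure reports drive a source's reliability below the threshold and silence its rules.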
[354] MiniAppBench: Evaluating the Shift from Text to Interactive HTML Responses in LLM-Powered Assistants
Zuhao Zhang, Chengyue Yu, Yuante Li, Chenyi Zhuang, Linjian Mo, Shuai Li
Main category: cs.AI
TL;DR: MiniAppBench: A benchmark for evaluating LLMs on generating interactive HTML applications (MiniApps) with principle-driven interaction logic, featuring 500 tasks across 6 domains and an agentic evaluation framework.
Details
Motivation: Existing benchmarks focus on algorithmic correctness or static layout reconstruction, but fail to capture capabilities needed for dynamic, interactive HTML-based applications that require rendering visual interfaces and constructing customized interaction logic adhering to real-world principles.
Method: Introduces MiniAppBench with 500 tasks across 6 domains sourced from real-world applications with 10M+ generations. Also proposes MiniAppEval, an agentic evaluation framework using browser automation for human-like exploratory testing across Intention, Static, and Dynamic dimensions.
Result: Current LLMs still face significant challenges in generating high-quality MiniApps. MiniAppEval demonstrates high alignment with human judgment, establishing a reliable evaluation standard.
Conclusion: MiniAppBench addresses a critical gap in evaluating LLMs for interactive application generation, providing a comprehensive benchmark and reliable evaluation framework for future research in this emerging paradigm.
Abstract: With the rapid advancement of Large Language Models (LLMs) in code generation, human-AI interaction is evolving from static text responses to dynamic, interactive HTML-based applications, which we term MiniApps. These applications require models to not only render visual interfaces but also construct customized interaction logic that adheres to real-world principles. However, existing benchmarks primarily focus on algorithmic correctness or static layout reconstruction, failing to capture the capabilities required for this new paradigm. To address this gap, we introduce MiniAppBench, the first comprehensive benchmark designed to evaluate principle-driven, interactive application generation. Sourced from a real-world application with 10M+ generations, MiniAppBench distills 500 tasks across six domains (e.g., Games, Science, and Tools). Furthermore, to tackle the challenge of evaluating open-ended interactions where no single ground truth exists, we propose MiniAppEval, an agentic evaluation framework. Leveraging browser automation, it performs human-like exploratory testing to systematically assess applications across three dimensions: Intention, Static, and Dynamic. Our experiments reveal that current LLMs still face significant challenges in generating high-quality MiniApps, while MiniAppEval demonstrates high alignment with human judgment, establishing a reliable standard for future research. Our code is available at github.com/MiniAppBench.
[355] Logics-Parsing-Omni Technical Report
Xin An, Jingyi Cai, Xiangyang Chen, Huayao Liu, Peiting Liu, Peng Wang, Bei Yang, Xiuwen Zhu, Yongfan Chen, Baoyu Hou, Shuzhao Li, Weidong Ren, Fan Yang, Jiangtao Zhang, Xiaoxiao Xu, Lin Qu
Main category: cs.AI
TL;DR: Omni Parsing framework unifies multimodal parsing across documents, images, and audio-visual streams through a progressive three-level approach: holistic detection, fine-grained recognition, and multi-level interpreting with evidence anchoring.
Details
Motivation: Address fragmented task definitions and heterogeneous unstructured data in multimodal parsing by creating a unified framework that bridges perception and cognition across different modalities.
Method: Proposes Omni Parsing framework with: 1) Holistic Detection for spatial-temporal grounding, 2) Fine-grained Recognition for symbolization (OCR/ASR) and attribute extraction, 3) Multi-level Interpreting for reasoning chains. Includes evidence anchoring mechanism to align high-level semantics with low-level facts.
Result: Created Logics-Parsing-Omni model that converts complex audio-visual signals into machine-readable structured knowledge. Fine-grained perception and high-level cognition show synergistic effects enhancing model reliability. Released OmniParsingBench for quantitative evaluation.
Conclusion: Omni Parsing framework successfully transforms unstructured multimodal signals into standardized, locatable, enumerable, and traceable knowledge through evidence-based logical induction, bridging perception and cognition.
Abstract: Addressing the challenges of fragmented task definitions and the heterogeneity of unstructured data in multimodal parsing, this paper proposes the Omni Parsing framework. This framework establishes a Unified Taxonomy covering documents, images, and audio-visual streams, introducing a progressive parsing paradigm that bridges perception and cognition. Specifically, the framework integrates three hierarchical levels: 1) Holistic Detection, which achieves precise spatial-temporal grounding of objects or events to establish a geometric baseline for perception; 2) Fine-grained Recognition, which performs symbolization (e.g., OCR/ASR) and attribute extraction on localized objects to complete structured entity parsing; and 3) Multi-level Interpreting, which constructs a reasoning chain from local semantics to global logic. A pivotal advantage of this framework is its evidence anchoring mechanism, which enforces a strict alignment between high-level semantic descriptions and low-level facts. This enables “evidence-based” logical induction, transforming unstructured signals into standardized knowledge that is locatable, enumerable, and traceable. Building on this foundation, we constructed a standardized dataset and released the Logics-Parsing-Omni model, which successfully converts complex audio-visual signals into machine-readable structured knowledge. Experiments demonstrate that fine-grained perception and high-level cognition are synergistic, effectively enhancing model reliability. Furthermore, to quantitatively evaluate these capabilities, we introduce OmniParsingBench. Code, models and the benchmark are released at https://github.com/alibaba/Logics-Parsing/tree/master/Logics-Parsing-Omni.
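The evidence anchoring idea, every high-level claim must resolve to a grounded low-level fact, can be sketched as a small data model with a consistency check. The field names and the check are illustrative assumptions, not the Logics-Parsing-Omni schema:

```python
from dataclasses import dataclass, field

@dataclass
class Evidence:
    """A low-level fact with spatial-temporal grounding (levels 1-2)."""
    evidence_id: str
    span: tuple   # (x0, y0, x1, y1) box for images, or (t0, t1) for audio
    symbol: str   # OCR/ASR output or extracted attribute value

@dataclass
class Interpretation:
    """A high-level semantic claim (level 3), valid only when every
    anchor resolves to recorded evidence."""
    claim: str
    anchors: list = field(default_factory=list)  # evidence_ids

def check_anchoring(interpretations, evidence_pool):
    """Return the claims whose anchors do not all resolve to evidence."""
    known = {e.evidence_id for e in evidence_pool}
    return [i for i in interpretations
            if not all(a in known for a in i.anchors)]

pool = [Evidence("e1", (10, 10, 80, 40), "Total: 42")]
claims = [
    Interpretation("invoice total is 42", anchors=["e1"]),
    Interpretation("invoice is overdue", anchors=["e9"]),  # dangling anchor
]
print([c.claim for c in check_anchoring(claims, pool)])  # ['invoice is overdue']
```

Rejecting (or flagging) claims with dangling anchors is what makes the final output locatable, enumerable, and traceable: every surviving interpretation can be walked back to a concrete span and symbol.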
[356] EsoLang-Bench: Evaluating Genuine Reasoning in Large Language Models via Esoteric Programming Languages
Aman Sharma, Paras Chopra
Main category: cs.AI
TL;DR: EsoLang-Bench introduces a benchmark using five esoteric programming languages (Brainfuck, Befunge-98, Whitespace, Unlambda, Shakespeare) to test genuine reasoning in LLMs, revealing dramatic capability gaps despite high performance on standard code generation benchmarks.
Details
Motivation: Current code generation benchmarks are increasingly contaminated by memorization rather than measuring genuine reasoning capabilities. Standard benchmarks have become "gamed" through extensive pre-training data, making it difficult to distinguish between memorization and true understanding.
Method: The authors create EsoLang-Bench using five esoteric programming languages that have 1,000-100,000x fewer public repositories than Python, making them economically irrational for pre-training. The benchmark measures transferable reasoning through documentation learning, interpreter feedback, and iterative experimentation. They evaluate five frontier models across five prompting strategies.
Result: Models achieving 85-95% on standard benchmarks score only 0-11% on equivalent esoteric tasks, with 0% accuracy beyond the Easy tier. Few-shot learning and self-reflection fail to improve performance, suggesting these techniques exploit training priors rather than enabling genuine learning.
Conclusion: EsoLang-Bench provides the first benchmark designed to mimic human learning by acquiring new languages through documentation, demonstrating that current LLMs lack genuine reasoning capabilities and primarily rely on memorization of training data.
Abstract: Large language models achieve near-ceiling performance on code generation benchmarks, yet these results increasingly reflect memorization rather than genuine reasoning. We introduce EsoLang-Bench, a benchmark using five esoteric programming languages (Brainfuck, Befunge-98, Whitespace, Unlambda, and Shakespeare) that lack benchmark gaming incentives due to their economic irrationality for pre-training. These languages require the same computational primitives as mainstream programming but have 1,000-100,000x fewer public repositories than Python (based on GitHub search counts). We evaluate five frontier models across five prompting strategies and find a dramatic capability gap: models achieving 85-95% on standard benchmarks score only 0-11% on equivalent esoteric tasks, with 0% accuracy beyond the Easy tier. Few-shot learning and self-reflection fail to improve performance, suggesting these techniques exploit training priors rather than enabling genuine learning. EsoLang-Bench provides the first benchmark designed to mimic human learning by acquiring new languages through documentation, interpreter feedback, and iterative experimentation, measuring transferable reasoning skills resistant to data contamination.
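To make concrete what "the same computational primitives as mainstream programming" means here, a minimal interpreter for Brainfuck (one of the five benchmark languages) fits in a few lines: the language has only eight commands over a byte tape, yet is Turing-complete. This interpreter is our own illustration; the benchmark supplies its own documentation and interpreter feedback to models:

```python
def run_brainfuck(program, input_bytes=b""):
    """Minimal Brainfuck interpreter: 8 commands over a 30,000-cell byte tape."""
    tape, ptr, pc, out, inp = [0] * 30000, 0, 0, [], list(input_bytes)
    jumps, stack = {}, []
    for i, c in enumerate(program):          # pre-match the bracket pairs
        if c == "[":
            stack.append(i)
        elif c == "]":
            j = stack.pop()
            jumps[i], jumps[j] = j, i
    while pc < len(program):
        c = program[pc]
        if c == ">": ptr += 1
        elif c == "<": ptr -= 1
        elif c == "+": tape[ptr] = (tape[ptr] + 1) % 256
        elif c == "-": tape[ptr] = (tape[ptr] - 1) % 256
        elif c == ".": out.append(tape[ptr])
        elif c == ",": tape[ptr] = inp.pop(0) if inp else 0
        elif c == "[" and tape[ptr] == 0: pc = jumps[pc]   # skip loop
        elif c == "]" and tape[ptr] != 0: pc = jumps[pc]   # repeat loop
        pc += 1
    return bytes(out)

# 8 * 8 + 1 = 65 -> ASCII 'A'
print(run_brainfuck("++++++++[>++++++++<-]>+."))  # b'A'
```

Even this trivial program requires tracking pointer position and loop arithmetic rather than pattern-matching on familiar syntax, which is exactly the gap between 85-95% on standard benchmarks and 0-11% here.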
[357] OOD-MMSafe: Advancing MLLM Safety from Harmful Intent to Hidden Consequences
Ming Wen, Kun Yang, Jingyu Zhang, Yuxuan Liu, Shiwen Cui, Shouling Ji, Xingjun Ma
Main category: cs.AI
TL;DR: Paper introduces OOD-MMSafe benchmark for evaluating consequence-driven safety in MLLMs, revealing causal blindness in frontier models, and proposes CASPO framework to enhance safety reasoning through self-distillation.
Details
Motivation: Current safety alignment for MLLMs focuses on malicious intent or situational violations, but there's a need to shift toward consequence-driven safety for robust deployment of autonomous and embodied agents, addressing latent hazards in context-dependent causal chains.
Method: 1) Introduce OOD-MMSafe benchmark with 455 curated query-image pairs to evaluate consequence-driven safety; 2) Analyze frontier models revealing causal blindness; 3) Develop Consequence-Aware Safety Policy Optimization (CASPO) framework that uses model’s intrinsic reasoning as dynamic reference for token-level self-distillation rewards.
Result: Analysis shows pervasive causal blindness, with failure rates as high as 67.5% in high-capacity closed-source models. CASPO significantly reduces failure ratio to 7.3% for Qwen2.5-VL-7B and 5.7% for Qwen3-VL-4B while maintaining overall effectiveness.
Conclusion: The paper successfully shifts safety frontier toward consequence-driven safety, identifies critical limitations in current MLLMs, and provides an effective framework (CASPO) to enhance safety reasoning for autonomous and embodied agent deployment.
Abstract: While safety alignment for Multimodal Large Language Models (MLLMs) has gained significant attention, current paradigms primarily target malicious intent or situational violations. We propose shifting the safety frontier toward consequence-driven safety, a paradigm essential for the robust deployment of autonomous and embodied agents. To formalize this shift, we introduce OOD-MMSafe, a benchmark comprising 455 curated query-image pairs designed to evaluate a model’s ability to identify latent hazards within context-dependent causal chains. Our analysis reveals a pervasive causal blindness among frontier models, with failure rates as high as 67.5% in high-capacity closed-source models, and identifies a preference ceiling where static alignment yields format-centric failures rather than improved safety reasoning as model capacity grows. To address these bottlenecks, we develop the Consequence-Aware Safety Policy Optimization (CASPO) framework, which integrates the model’s intrinsic reasoning as a dynamic reference for token-level self-distillation rewards. Experimental results demonstrate that CASPO significantly enhances consequence projection, reducing the failure ratio of risk identification to 7.3% for Qwen2.5-VL-7B and 5.7% for Qwen3-VL-4B while maintaining overall effectiveness.
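A token-level self-distillation reward scores each generated token by how closely the policy tracks the model's own intrinsic reasoning trace (the "dynamic reference"). The penalised-deviation form and `beta` below are illustrative assumptions in the spirit of KL-regularised RLHF, not CASPO's published objective:

```python
def token_self_distill_rewards(policy_logprobs, reference_logprobs, beta=0.1):
    """Per-token reward: penalise deviation of the policy's token
    log-probabilities from the model's own reference trace."""
    return [-beta * abs(pol - ref)
            for pol, ref in zip(policy_logprobs, reference_logprobs)]

# Token 2 drifts far from the intrinsic reference and is penalised most.
pol = [-0.1, -2.0, -0.5]
ref = [-0.1, -0.4, -0.5]
rewards = token_self_distill_rewards(pol, ref)
print(min(range(len(rewards)), key=lambda i: rewards[i]))  # 1
```

Because the reference comes from the same model's reasoning rather than a fixed external judge, the signal scales with model capacity, which is how the paper frames escaping the "preference ceiling" of static alignment.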
[358] Does the Question Really Matter? Training-Free Data Selection for Vision-Language SFT
Peng Sun, Huawen Shen, Yi Ban, Tianfan Fu, Yanbo Wang, Yuqiang Li
Main category: cs.AI
TL;DR: CVS: A training-free data selection method for visual instruction tuning that identifies samples requiring genuine vision-language joint reasoning by measuring answer validity discrepancy with/without conditioning on the question.
Details
Motivation: Many visual instruction tuning samples can be solved via linguistic patterns or common-sense shortcuts without genuine cross-modal reasoning, limiting multimodal learning effectiveness. Existing data selection methods are costly and fail to capture samples' true contribution to vision-language joint reasoning.
Method: CVS uses a frozen VLLM as evaluator to measure discrepancy in answer validity with and without conditioning on the question. High discrepancy indicates samples requiring vision-language joint reasoning, while filtering semantic-conflict noise.
Result: On Vision-Flan, CVS outperforms full-data training by 3.5% and 4.8% using only 10% and 15% of data. It remains robust on heterogeneous Cauldron dataset and reduces computational cost by 17.3% and 44.4% compared to COINCIDE and XMAS.
Conclusion: CVS effectively identifies high-quality multimodal samples requiring genuine vision-language reasoning, enabling efficient visual instruction tuning with superior performance using significantly less data and computation.
Abstract: Visual instruction tuning is crucial for improving vision-language large models (VLLMs). However, many samples can be solved via linguistic patterns or common-sense shortcuts, without genuine cross-modal reasoning, limiting the effectiveness of multimodal learning. Prior data selection methods often rely on costly proxy model training and focus on difficulty or diversity, failing to capture a sample’s true contribution to vision-language joint reasoning. In this paper, we propose CVS, a training-free data selection method based on the insight that, for high-quality multimodal samples, introducing the question should substantially alter the model’s assessment of answer validity given an image. CVS leverages a frozen VLLM as an evaluator and measures the discrepancy in answer validity with and without conditioning on the question, enabling the identification of samples that require vision-language joint reasoning while filtering semantic-conflict noise. Experiments on Vision-Flan and The Cauldron show that CVS achieves solid performance across datasets. On Vision-Flan, CVS outperforms full-data training by 3.5% and 4.8% using only 10% and 15% of the data, respectively, and remains robust on the highly heterogeneous Cauldron dataset. Moreover, CVS reduces computational cost by 17.3% and 44.4% compared to COINCIDE and XMAS.
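The selection criterion reduces to a two-pass score per sample: how much does conditioning on the question change the evaluator's assessment that the answer is valid given the image? The absolute-difference form and the sample fields below are illustrative stand-ins for the paper's exact discrepancy measure:

```python
def cvs_score(validity_with_q, validity_without_q):
    """Discrepancy between the evaluator's answer-validity estimates
    with and without the question in the prompt."""
    return abs(validity_with_q - validity_without_q)

def select_top(samples, scorer, keep_ratio=0.10):
    """Keep the top fraction of samples by score (e.g. 10% of the data)."""
    k = max(1, int(len(samples) * keep_ratio))
    return sorted(samples, key=scorer, reverse=True)[:k]

samples = [
    {"id": "shortcut", "p_with": 0.95, "p_without": 0.93},  # language prior suffices
    {"id": "joint",    "p_with": 0.90, "p_without": 0.20},  # question is essential
]
best = select_top(samples,
                  lambda s: cvs_score(s["p_with"], s["p_without"]),
                  keep_ratio=0.5)
print(best[0]["id"])  # joint
```

A sample whose answer looks equally valid with or without the question is solvable by shortcut and is dropped; a sample whose validity collapses without the question genuinely requires vision-language joint reasoning and is kept. Since the evaluator VLLM is frozen, scoring needs only inference passes, which is where the training-free cost savings come from.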
[359] AutoAgent: Evolving Cognition and Elastic Memory Orchestration for Adaptive Agents
Xiaoxing Wang, Ning Liao, Shikun Wei, Chen Tang, Feiyu Xiong
Main category: cs.AI
TL;DR: AutoAgent: A self-evolving multi-agent framework with evolving cognition, on-the-fly contextual decision-making, and elastic memory orchestration for adaptive autonomous agents in dynamic environments.
Details
Motivation: Current autonomous agent frameworks struggle with reconciling long-term experiential learning with real-time, context-sensitive decision-making, leading to static cognition, rigid workflows, and inefficient context usage that limit adaptability in open-ended, non-stationary environments.
Method: Three tightly coupled components: 1) Evolving cognition - structured prompt-level cognition over tools, capabilities, peer expertise, and task knowledge; 2) On-the-fly contextual decision-making - combines cognition with live context to select actions from unified space (tool calls, LLM generation, inter-agent requests); 3) Elastic Memory Orchestrator - dynamically organizes interaction history by preserving raw records, compressing redundant trajectories, and constructing reusable episodic abstractions.
Result: Empirical results across retrieval-augmented reasoning, tool-augmented agent benchmarks, and embodied task environments show AutoAgent consistently improves task success, tool-use efficiency, and collaborative robustness over static and memory-augmented baselines.
Conclusion: AutoAgent provides a unified and practical foundation for adaptive autonomous agents that must learn from experience while making reliable context-aware decisions in dynamic environments through closed-loop cognitive evolution without external retraining.
Abstract: Autonomous agent frameworks still struggle to reconcile long-term experiential learning with real-time, context-sensitive decision-making. In practice, this gap appears as static cognition, rigid workflow dependence, and inefficient context usage, which jointly limit adaptability in open-ended and non-stationary environments. To address these limitations, we present AutoAgent, a self-evolving multi-agent framework built on three tightly coupled components: evolving cognition, on-the-fly contextual decision-making, and elastic memory orchestration. At the core of AutoAgent, each agent maintains structured prompt-level cognition over tools, self-capabilities, peer expertise, and task knowledge. During execution, this cognition is combined with live task context to select actions from a unified space that includes tool calls, LLM-based generation, and inter-agent requests. To support efficient long-horizon reasoning, an Elastic Memory Orchestrator dynamically organizes interaction history by preserving raw records, compressing redundant trajectories, and constructing reusable episodic abstractions, thereby reducing token overhead while retaining decision-critical evidence. These components are integrated through a closed-loop cognitive evolution process that aligns intended actions with observed outcomes to continuously update cognition and expand reusable skills, without external retraining. Empirical results across retrieval-augmented reasoning, tool-augmented agent benchmarks, and embodied task environments show that AutoAgent consistently improves task success, tool-use efficiency, and collaborative robustness over static and memory-augmented baselines. Overall, AutoAgent provides a unified and practical foundation for adaptive autonomous agents that must learn from experience while making reliable context-aware decisions in dynamic environments.
[360] World2Mind: Cognition Toolkit for Allocentric Spatial Reasoning in Foundation Models
Shouwei Ruan, Bin Wang, Zhenyu Wu, Qihui Zhu, Yuxiang Zhang, Hang Su, Yubin Wang
Main category: cs.AI
TL;DR: World2Mind is a training-free spatial intelligence toolkit that enhances multimodal foundation models’ spatial reasoning by constructing structured spatial cognitive maps using 3D reconstruction and instance segmentation, enabling better landmark and route understanding.
Details
Motivation: Current Multimodal Foundation Models struggle with robust spatial reasoning, either overfitting on 3D grounding data or being limited to 2D visual perception, which restricts accuracy and generalization in unseen scenarios.
Method: Uses 3D reconstruction and instance segmentation to build spatial cognitive maps, creates an Allocentric-Spatial Tree (AST) with elliptical parameters for geometric-topological priors, and employs a three-stage reasoning chain: tool invocation assessment, modality-decoupled cue collection, and geometry-semantics interwoven reasoning.
Result: Boosts performance of frontier models like GPT-5.2 by 5%~18%. Remarkably, purely text-only foundation models using only AST-structured text can perform complex 3D spatial reasoning approaching advanced multimodal models.
Conclusion: World2Mind provides an effective training-free approach to enhance spatial reasoning in multimodal foundation models by leveraging structured spatial representations and mitigating 3D reconstruction inaccuracies through multi-stage reasoning.
Abstract: Achieving robust spatial reasoning remains a fundamental challenge for current Multimodal Foundation Models (MFMs). Existing methods either overfit statistical shortcuts via 3D grounding data or remain confined to 2D visual perception, limiting both spatial reasoning accuracy and generalization in unseen scenarios. Inspired by the spatial cognitive mapping mechanisms of biological intelligence, we propose World2Mind, a training-free spatial intelligence toolkit. At its core, World2Mind leverages 3D reconstruction and instance segmentation models to construct structured spatial cognitive maps, empowering MFMs to proactively acquire targeted spatial knowledge regarding landmarks and routes of interest. To provide robust geometric-topological priors, World2Mind synthesizes an Allocentric-Spatial Tree (AST) that uses elliptical parameters to model the top-down layout of landmarks accurately. To mitigate the inherent inaccuracies of 3D reconstruction, we introduce a three-stage reasoning chain comprising tool invocation assessment, modality-decoupled cue collection, and geometry-semantics interwoven reasoning. Extensive experiments demonstrate that World2Mind boosts the performance of frontier models, such as GPT-5.2, by 5%~18%. Remarkably, relying solely on the AST-structured text, purely text-only foundation models can perform complex 3D spatial reasoning, achieving performance approaching that of advanced multimodal models.
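As a rough picture of what an AST entry might look like, the sketch below models each landmark as a top-down ellipse nested in a tree and serializes the tree to the kind of structured text a text-only model could consume. The schema (field names, serialization format) is my own invention for illustration; the paper does not publish one here.

```python
from dataclasses import dataclass, field

@dataclass
class ASTNode:
    """One landmark in a hypothetical Allocentric-Spatial Tree: a
    top-down elliptical footprint plus the child landmarks it contains."""
    label: str
    cx: float        # ellipse centre (allocentric frame)
    cy: float
    rx: float        # semi-axes
    ry: float
    theta: float = 0.0                        # orientation, radians
    children: list = field(default_factory=list)

def to_text(node: ASTNode, depth: int = 0) -> str:
    """Serialize the tree as indented text, one landmark per line."""
    line = ("  " * depth
            + f"{node.label}: centre=({node.cx},{node.cy}), "
            + f"axes=({node.rx},{node.ry}), theta={node.theta}")
    return "\n".join([line] + [to_text(c, depth + 1) for c in node.children])
```

A text-only model given such a serialization sees landmark layout explicitly, which is the effect the abstract attributes to the AST-structured text.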
[361] Quantifying the Necessity of Chain of Thought through Opaque Serial Depth
Jonah Brown-Cohen, David Lindner, Rohin Shah
Main category: cs.AI
TL;DR: The paper introduces “opaque serial depth” as a measure of how much reasoning a model can perform without externalizing it in chain-of-thought, providing upper bounds for Gemma 3 models and showing Mixture-of-Experts models likely have lower depth than dense models.
Details
Motivation: To understand how much reasoning large language models can perform internally without externalizing it in chain-of-thought, which is important for monitoring and understanding model behavior.
Method: Formalizes the concept of opaque serial depth as the length of the longest computation that can be done without interpretable intermediate steps like chain-of-thought. Computes numeric upper bounds for Gemma 3 models and asymptotic results for various architectures. Develops an automated method to calculate upper bounds for arbitrary neural networks.
Result: Provides concrete upper bounds on opaque serial depth for Gemma 3 models, asymptotic results for different architectures, and demonstrates that Mixture-of-Experts models likely have lower opaque serial depth than dense models.
Conclusion: Opaque serial depth is a useful metric for understanding the potential for models to perform significant reasoning that is not externalized, with implications for model monitoring and interpretability.
Abstract: Large language models (LLMs) tend to externalize their reasoning in their chain of thought, making the chain of thought a good target for monitoring. This is partially an inherent feature of the Transformer architecture: sufficiently long serial cognition must pass through the chain of thought (Korbak et al., 2025). We formalize this argument through the notion of opaque serial depth, given by the length of the longest computation that can be done without the use of interpretable intermediate steps like chain of thought. Given this formalization, we compute numeric upper bounds on the opaque serial depth of Gemma 3 models, as well as asymptotic results for additional architectures beyond standard LLMs. We also open-source an automated method that can calculate upper bounds on the opaque serial depth of arbitrary neural networks, and use it to demonstrate that Mixture-of-Experts models likely have lower depth than dense models. Overall, our results suggest that opaque serial depth is a useful tool for understanding the potential for models to do significant reasoning that is not externalized.
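The core quantity can be illustrated on a toy computation graph: view the network as a DAG, count only "opaque" ops (those with no interpretable intermediate output such as chain-of-thought), and take the longest path. This is a toy accounting exercise under my own simplifications, not the paper's automated bounding method.

```python
from functools import lru_cache

def opaque_serial_depth(edges, opaque):
    """Longest path through a computation DAG, counting only nodes in
    `opaque` (ops whose intermediate results are never externalized).

    edges:  dict mapping each node to its list of successor nodes
    opaque: set of nodes that count toward serial depth
    """
    @lru_cache(maxsize=None)
    def depth_from(node):
        here = 1 if node in opaque else 0
        succ = edges.get(node, [])
        return here + (max(map(depth_from, succ)) if succ else 0)

    return max(depth_from(n) for n in edges)
```

In this picture, externalizing a step (emitting it as a visible token) removes the corresponding node from `opaque`, shortening the longest opaque path.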
[362] LCA: Local Classifier Alignment for Continual Learning
Tung Tran, Danilo Vasconcellos Vargas, Khoat Than
Main category: cs.AI
TL;DR: Proposes Local Classifier Alignment (LCA) loss to address classifier-backbone mismatch in continual learning, enabling better generalization and robustness across tasks.
Details
Motivation: Continual learning models suffer from catastrophic forgetting when adapting to new tasks. While pre-trained models help, approaches that adapt the backbone create mismatches with task-specific classifiers, harming performance.
Method: Introduces a Local Classifier Alignment (LCA) loss to better align classifiers with the adapted backbone. Uses a model-merging approach with LCA for a complete continual-learning solution.
Result: Extensive experiments on standard benchmarks show the method often achieves leading performance, sometimes surpassing state-of-the-art by large margins.
Conclusion: LCA loss effectively addresses classifier-backbone mismatch in continual learning, improving generalization and robustness across tasks with strong empirical results.
Abstract: A fundamental requirement for intelligent systems is the ability to learn continuously under changing environments. However, models trained in this regime often suffer from catastrophic forgetting. Leveraging pre-trained models has recently emerged as a promising solution, since their generalized feature extractors enable faster and more robust adaptation. While some earlier works mitigate forgetting by fine-tuning only on the first task, this approach quickly deteriorates as the number of tasks grows and the data distributions diverge. More recent research instead seeks to consolidate task knowledge into a unified backbone, or to adapt the backbone as new tasks arrive. However, such approaches may create a (potential) mismatch between task-specific classifiers and the adapted backbone. To address this issue, we propose a novel Local Classifier Alignment (LCA) loss to better align the classifier with the backbone. Theoretically, we show that this LCA loss enables the classifier not only to generalize well across all observed tasks, but also to improve robustness. Furthermore, we develop a complete solution for continual learning, following the model merging approach and using LCA. Extensive experiments on several standard benchmarks demonstrate that our method often achieves leading performance, sometimes surpassing state-of-the-art methods by a large margin.
[363] MedMASLab: A Unified Orchestration Framework for Benchmarking Multimodal Medical Multi-Agent Systems
Yunhang Qian, Xiaobin Hu, Jiaquan Yu, Siyang Xin, Xiaokun Chen, Jiangning Zhang, Peng-Tao Jiang, Jiawei Liu, Hongwei Bran Li
Main category: cs.AI
TL;DR: MedMASLab is a unified framework and benchmarking platform for multimodal medical multi-agent systems that addresses architectural fragmentation and lack of standardized multimodal integration in clinical decision support.
Details
Motivation: Current medical multi-agent systems suffer from architectural fragmentation, non-uniform data ingestion pipelines, inconsistent visual-reasoning evaluation, and lack of cross-specialty benchmarking, hindering progress in complex clinical decision support.
Method: Introduces: (1) standardized multimodal agent communication protocol for integrating 11 heterogeneous MAS architectures across 24 medical modalities; (2) automated clinical reasoning evaluator using zero-shot semantic evaluation with large vision-language models; (3) extensive benchmark spanning 11 organ systems and 473 diseases from 11 clinical benchmarks.
Result: Systematic evaluation reveals a critical domain-specific performance gap: while MAS improves reasoning depth, current architectures show significant fragility when transitioning between specialized medical sub-domains. Provides rigorous ablation of interaction mechanisms and cost-performance trade-offs.
Conclusion: MedMASLab establishes a new technical baseline for future autonomous clinical systems by addressing fragmentation in medical MAS research through standardized multimodal integration and comprehensive benchmarking.
Abstract: While Multi-Agent Systems (MAS) show potential for complex clinical decision support, the field remains hindered by architectural fragmentation and the lack of standardized multimodal integration. Current medical MAS research suffers from non-uniform data ingestion pipelines, inconsistent visual-reasoning evaluation, and a lack of cross-specialty benchmarking. To address these challenges, we present MedMASLab, a unified framework and benchmarking platform for multimodal medical multi-agent systems. MedMASLab introduces: (1) A standardized multimodal agent communication protocol that enables seamless integration of 11 heterogeneous MAS architectures across 24 medical modalities. (2) An automated clinical reasoning evaluator, a zero-shot semantic evaluation paradigm that overcomes the limitations of lexical string-matching by leveraging large vision-language models to verify diagnostic logic and visual grounding. (3) The most extensive benchmark to date, spanning 11 organ systems and 473 diseases, standardizing data from 11 clinical benchmarks. Our systematic evaluation reveals a critical domain-specific performance gap: while MAS improves reasoning depth, current architectures exhibit significant fragility when transitioning between specialized medical sub-domains. We provide a rigorous ablation of interaction mechanisms and cost-performance trade-offs, establishing a new technical baseline for future autonomous clinical systems. The source code and data are publicly available at: https://github.com/NUS-Project/MedMASLab/
[364] PathMem: Toward Cognition-Aligned Memory Transformation for Pathology MLLMs
Jinyue Li, Yuci Liang, Qiankun Li, Xinheng Lyu, Jiayu Qian, Huabao Chen, Kun Wang, Zhigang Zeng, Anil Anthony Bharath, Yang Liu
Main category: cs.AI
TL;DR: PathMem is a memory-centric multimodal framework for pathology MLLMs that organizes structured pathology knowledge as long-term memory and uses a Memory Transformer for dynamic knowledge integration during diagnostic reasoning.
Details
Motivation: Current multimodal LLMs lack explicit mechanisms for structured knowledge integration and interpretable memory control, making them struggle to consistently incorporate pathology-specific diagnostic standards during reasoning, despite the field's need for both visual pattern recognition and dynamic integration of structured domain knowledge.
Method: Proposes PathMem with hierarchical memory organization: structured pathology knowledge as long-term memory (LTM), and a Memory Transformer that models the dynamic transition from LTM to working memory (WM) through multimodal memory activation and context-aware knowledge grounding.
Result: Achieves state-of-the-art performance across benchmarks, improving WSI-Bench report generation (12.8% WSI-Precision, 10.1% WSI-Relevance) and open-ended diagnosis by 9.7% and 8.9% over prior WSI-based models.
Conclusion: PathMem effectively addresses the limitations of existing MLLMs in pathology by introducing structured memory mechanisms that enable context-aware knowledge integration for improved diagnostic reasoning.
Abstract: Computational pathology demands both visual pattern recognition and dynamic integration of structured domain knowledge, including taxonomy, grading criteria, and clinical evidence. In practice, diagnostic reasoning requires linking morphological evidence with formal diagnostic and grading criteria. Although multimodal large language models (MLLMs) demonstrate strong vision language reasoning capabilities, they lack explicit mechanisms for structured knowledge integration and interpretable memory control. As a result, existing models struggle to consistently incorporate pathology-specific diagnostic standards during reasoning. Inspired by the hierarchical memory process of human pathologists, we propose PathMem, a memory-centric multimodal framework for pathology MLLMs. PathMem organizes structured pathology knowledge as a long-term memory (LTM) and introduces a Memory Transformer that models the dynamic transition from LTM to working memory (WM) through multimodal memory activation and context-aware knowledge grounding, enabling context-aware memory refinement for downstream reasoning. PathMem achieves SOTA performance across benchmarks, improving WSI-Bench report generation (12.8% WSI-Precision, 10.1% WSI-Relevance) and open-ended diagnosis by 9.7% and 8.9% over prior WSI-based models.
[365] The Confidence Gate Theorem: When Should Ranked Decision Systems Abstain?
Ronald Doku
Main category: cs.AI
TL;DR: Confidence-based abstention in ranked decision systems improves quality under structural uncertainty but fails under contextual uncertainty, requiring different confidence signals for each uncertainty type.
Details
Motivation: Ranked decision systems (recommenders, ad auctions, clinical triage) need to know when to intervene vs. abstain. Current confidence-based abstention approaches work inconsistently, and the paper aims to understand when they succeed or fail.
Method: Theoretical analysis identifies formal conditions (rank-alignment and no inversion zones) for monotonic abstention gains. Empirical validation across three domains: collaborative filtering (MovieLens with 3 distribution shifts), e-commerce intent detection (RetailRocket, Criteo, Yoochoose), and clinical pathway triage (MIMIC-IV). Tests different confidence signals: structurally grounded (observation counts), context-aware (ensemble disagreement, recency features), and exception-based approaches.
Result: Structural uncertainty produces near-monotonic abstention gains across all domains. Structurally grounded confidence signals fail under contextual drift, performing as poorly as random abstention on temporal splits. Context-aware alternatives reduce violations but don’t fully restore monotonicity. Exception-based approaches degrade substantially under distribution shift (AUC drops from 0.71 to 0.61-0.62).
Conclusion: The distinction between structural and contextual uncertainty explains when confidence-based abstention works. Practical deployment diagnostic: check formal conditions on held-out data before deploying confidence gates, and match confidence signals to the dominant uncertainty type.
Abstract: Ranked decision systems – recommenders, ad auctions, clinical triage queues – must decide when to intervene in ranked outputs and when to abstain. We study when confidence-based abstention monotonically improves decision quality, and when it fails. The formal conditions are simple: rank-alignment and no inversion zones. The substantive contribution is identifying why these conditions hold or fail: the distinction between structural uncertainty (missing data, e.g., cold-start) and contextual uncertainty (missing context, e.g., temporal drift). Empirically, we validate this distinction across three domains: collaborative filtering (MovieLens, 3 distribution shifts), e-commerce intent detection (RetailRocket, Criteo, Yoochoose), and clinical pathway triage (MIMIC-IV). Structural uncertainty produces near-monotonic abstention gains in all domains; structurally grounded confidence signals (observation counts) fail under contextual drift, producing as many monotonicity violations as random abstention on our MovieLens temporal split. Context-aware alternatives – ensemble disagreement and recency features – substantially narrow the gap (reducing violations from 3 to 1–2) but do not fully restore monotonicity, suggesting that contextual uncertainty poses qualitatively different challenges. Exception labels defined from residuals degrade substantially under distribution shift (AUC drops from 0.71 to 0.61–0.62 across three splits), providing a clean negative result against the common practice of exception-based intervention. The results provide a practical deployment diagnostic: check C1 and C2 on held-out data before deploying a confidence gate, and match the confidence signal to the dominant uncertainty type.
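The held-out diagnostic the abstract recommends can be approximated very simply: sweep abstention rates, drop the least-confident items at each rate, and count monotonicity violations in the resulting quality curve. The sketch below is my own construction of that check, not the paper's code.

```python
def abstention_curve(confidence, payoff, rates):
    """Mean payoff of retained decisions as the abstention rate grows.

    At rate r we abstain on the r-fraction of items with the lowest
    confidence and score only what remains."""
    n = len(confidence)
    order = sorted(range(n), key=lambda i: confidence[i])  # least confident first
    curve = []
    for r in rates:
        kept = order[int(n * r):]
        curve.append(sum(payoff[i] for i in kept) / max(1, len(kept)))
    return curve

def monotonicity_violations(curve, tol=1e-9):
    """Count points where abstaining more actually hurt quality."""
    return sum(1 for a, b in zip(curve, curve[1:]) if b < a - tol)
```

When the confidence signal is rank-aligned with payoff (the structural-uncertainty regime), the curve is monotone; a misaligned signal (the contextual-drift regime) shows up as violations.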
[366] Think Before You Lie: How Reasoning Improves Honesty
Ann Yuan, Asma Ghandeharioun, Carter Blum, Alicia Machado, Jessica Hoffmann, Daphne Ippolito, Martin Wattenberg, Lucas Dixon, Katja Filippova
Main category: cs.AI
TL;DR: LLMs become more honest with reasoning, unlike humans who become less honest with deliberation, due to deceptive regions in representational space being metastable and easily destabilized.
Details
Motivation: To understand the underlying conditions that give rise to deceptive behavior in LLMs, as existing evaluations only measure deception rates without explaining the mechanisms behind them.
Method: Used a novel dataset of realistic moral trade-offs where honesty incurs variable costs, analyzed reasoning traces, and investigated the geometry of representational space through input paraphrasing, output resampling, and activation noise experiments.
Result: Reasoning consistently increases honesty across scales and LLM families, contrary to human behavior. Deceptive regions in representational space are metastable - more easily destabilized than honest ones, and reasoning tokens traverse this biased space nudging models toward stable honest defaults.
Conclusion: The geometry of representational space plays a crucial role in LLM honesty, with deceptive states being less stable than honest ones, explaining why reasoning increases honesty in LLMs unlike in humans.
Abstract: While existing evaluations of large language models (LLMs) measure deception rates, the underlying conditions that give rise to deceptive behavior are poorly understood. We investigate this question using a novel dataset of realistic moral trade-offs where honesty incurs variable costs. Contrary to humans, who tend to become less honest given time to deliberate (Capraro, 2017; Capraro et al., 2019), we find that reasoning consistently increases honesty across scales and for several LLM families. This effect is not only a function of the reasoning content, as reasoning traces are often poor predictors of final behaviors. Rather, we show that the underlying geometry of the representational space itself contributes to the effect. Namely, we observe that deceptive regions within this space are metastable: deceptive answers are more easily destabilized by input paraphrasing, output resampling, and activation noise than honest ones. We interpret the effect of reasoning in this vein: generating deliberative tokens as part of moral reasoning entails the traversal of a biased representational space, ultimately nudging the model toward its more stable, honest defaults.
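The stability probes mentioned above (paraphrasing, resampling, activation noise) all reduce to a flip-rate measurement: rerun a decision under small perturbations and count how often the answer changes. The harness below is a caricature with the model call stubbed out; it illustrates the metastability claim, not the authors' experimental setup.

```python
import random

def flip_rate(decide, prompt, noise=0.1, trials=200, seed=0):
    """Fraction of perturbed reruns whose answer differs from the
    unperturbed answer. A higher flip rate indicates a less stable
    (metastable) behavioral state.

    decide(prompt, noise, rng) stands in for a model call under an
    injected perturbation of magnitude `noise`."""
    rng = random.Random(seed)
    base = decide(prompt, 0.0, rng)          # unperturbed answer
    flips = sum(decide(prompt, noise, rng) != base for _ in range(trials))
    return flips / trials
```

Under the paper's finding, a prompt whose base answer is deceptive should show a markedly higher flip rate than one whose base answer is honest.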
[367] AlphaApollo: A System for Deep Agentic Reasoning
Zhanke Zhou, Chentao Cao, Xiao Feng, Xuan Li, Zongze Li, Xiangyu Lu, Jiangchao Yao, Weikai Huang, Tian Cheng, Jianghangfan Zhang, Tangyu Jiang, Linrui Xu, Yiming Zheng, Brando Miranda, Tongliang Liu, Sanmi Koyejo, Masashi Sugiyama, Bo Han
Main category: cs.AI
TL;DR: AlphaApollo is an agentic reasoning system that addresses limited reasoning capacity and unreliable test-time evolution in foundation models through multi-turn agentic reasoning, learning, and evolution components.
Details
Motivation: The paper addresses two key bottlenecks in foundation-model reasoning: (1) limited reasoning capacity for complex, long-horizon problem solving, and (2) unreliable test-time evolution without trustworthy verification.
Method: AlphaApollo orchestrates models and tools via three components: (i) multi-turn agentic reasoning with structured tool calls and responses, (ii) multi-turn agentic learning using turn-level reinforcement learning to optimize tool-use reasoning, and (iii) multi-round agentic evolution through a propose-judge-update loop with tool-assisted verifications and long-horizon memory.
Result: The system achieves >85% tool-call success rate and shows substantial performance gains across seven math reasoning benchmarks at multiple model scales, with significant improvements from multi-turn RL and evolution components.
Conclusion: AlphaApollo demonstrates effective improvement in foundation-model reasoning through reliable tool use, multi-turn reinforcement learning, and evolutionary refinement, though the project is ongoing and welcomes community feedback.
Abstract: We present AlphaApollo, an agentic reasoning system that targets two bottlenecks in foundation-model reasoning: (1) limited reasoning capacity for complex, long-horizon problem solving and (2) unreliable test-time evolution without trustworthy verification. AlphaApollo orchestrates models and tools via three components: (i) multi-turn agentic reasoning, which formalizes model-environment interaction with structured tool calls and responses; (ii) multi-turn agentic learning, which applies turn-level reinforcement learning to optimize tool-use reasoning while decoupling actions from tool responses for stable training; and (iii) multi-round agentic evolution, which refines solutions through a propose-judge-update loop with tool-assisted verifications and long-horizon memory. Across seven math reasoning benchmarks and multiple model scales, AlphaApollo improves performance through reliable tool use (> 85% tool-call success), substantial gains from multi-turn RL (Avg@32: Qwen2.5-1.5B-Instruct 1.07% -> 9.64%, Qwen2.5-7B-Instruct 8.77% -> 20.35%), and improvements from evolution (e.g., Qwen2.5-3B-Instruct 5.27% -> 7.70%, Qwen2.5-14B-Instruct 16.53% -> 21.08%). This project is still ongoing. We welcome feedback from the community and will frequently update the source code and technical report.
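The multi-round evolution component amounts to a propose-judge-update loop with memory. A schematic sketch follows; the function roles are assumptions reconstructed from the abstract, not AlphaApollo's actual interfaces.

```python
def evolve(initial, propose, judge, update, rounds=3):
    """Propose-judge-update refinement loop with long-horizon memory.

    propose(best, memory) -> candidate solution
    judge(solution)       -> score (stands in for tool-assisted verification)
    update(best, cand)    -> accepted/merged solution
    """
    best, memory = initial, []
    for _ in range(rounds):
        candidate = propose(best, memory)
        score = judge(candidate)
        memory.append((candidate, score))   # retain verified attempts
        if score > judge(best):
            best = update(best, candidate)
    return best
```

The memory list is what lets later proposals condition on earlier verified attempts rather than starting fresh each round.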
[368] From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents
Jiaxuan Gao, Jiaao Chen, Chuyi He, Shusheng Xu, Di Jin, Yi Wu
Main category: cs.AI
TL;DR: EigenData: A framework combining self-evolving data synthesis with verifier-based RL for training interactive tool-using agents, achieving state-of-the-art performance on tool-use benchmarks without expensive human annotation.
Details
Motivation: Training interactive tool-using agents is challenging due to the difficulty of scaling high-quality multi-turn tool-use data synthesis and to noisy reinforcement learning signals from user simulation, which degrade training efficiency.
Method: Proposes EigenData, a hierarchical multi-agent engine that synthesizes tool-grounded dialogues with executable per-instance checkers, using a closed-loop self-evolving process to update prompts and workflow. Then applies RL with GRPO-style training using trajectory-level group-relative advantages and dynamic filtering after fine-tuning the user model.
Result: Achieves 73.0% pass^1 on Airline and 98.3% pass^1 on Telecom benchmarks in tau^2-bench, matching or exceeding frontier models.
Conclusion: Demonstrates a scalable pathway for bootstrapping complex tool-using behaviors without expensive human annotation through combined self-evolving data synthesis and verifier-based RL.
Abstract: Interactive tool-using agents must solve real-world tasks via multi-turn interaction with both humans and external environments, requiring dialogue state tracking and multi-step tool execution while following complex instructions. Post-training such agents is challenging because synthesizing high-quality multi-turn tool-use data is difficult to scale, and reinforcement learning (RL) can face noisy signals caused by user simulation, leading to degraded training efficiency. We propose a unified framework that combines a self-evolving data agent with verifier-based RL. Our system, EigenData, is a hierarchical multi-agent engine that synthesizes tool-grounded dialogues together with executable per-instance checkers, and improves generation reliability via a closed-loop self-evolving process that updates prompts and workflow. Building on the synthetic data, we develop an RL recipe that first fine-tunes the user model and then applies GRPO-style training with trajectory-level group-relative advantages and dynamic filtering, yielding consistent improvements beyond SFT. Evaluated on tau^2-bench, our best model reaches 73.0% pass^1 on Airline and 98.3% pass^1 on Telecom, matching or exceeding frontier models. Overall, our results suggest a scalable pathway for bootstrapping complex tool-using behaviors without expensive human annotation.
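The "trajectory-level group-relative advantages" and "dynamic filtering" in the RL recipe can be sketched as below. This is the standard GRPO-style normalization reconstructed from the abstract's description, not the authors' code.

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each trajectory's reward against its rollout group.

    For a group of trajectories sampled from the same task:
        A_i = (r_i - mean(r)) / (std(r) + eps)
    so advantages are comparable across tasks of varying difficulty."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

def keep_group(rewards):
    """Dynamic-filtering sketch: a group whose rewards are all equal
    (all pass or all fail) carries no gradient signal, so drop it."""
    return max(rewards) > min(rewards)
```

Filtering degenerate groups is one common way to mitigate the noisy user-simulation signals the abstract mentions, since uniform-reward groups contribute only variance.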
[369] On the Impact of the Utility in Semivalue-based Data Valuation
Mélissa Tamine, Benjamin Heymann, Maxime Vono, Patrick Loiseau
Main category: cs.AI
TL;DR: Semivalue-based data valuation assigns values to data points based on their contribution to tasks, but these values depend on utility choices. The paper introduces spatial signatures to analyze robustness to utility changes and provides practical metrics to assess valuation stability.
Details
Motivation: Data valuation using cooperative game theory (semivalues) depends on the practitioner's choice of utility function, raising concerns about robustness when utilities change or when multiple valid utilities exist. This is critical when utilities balance multiple criteria or when practitioners must choose among equally valid options.
Method: Introduces the concept of a dataset's spatial signature: embedding each data point into a lower-dimensional space where any utility becomes a linear functional. This geometric representation enables analysis of robustness. Proposes a practical methodology with an explicit robustness metric that quantifies how much data valuation results shift with utility changes.
Result: Validated across diverse datasets and semivalues, showing strong agreement with rank-correlation analyses. Provides analytical insight into how choosing different semivalues can amplify or diminish robustness to utility changes.
Conclusion: The spatial signature approach offers a geometric framework to assess robustness of semivalue-based data valuation to utility changes, with practical metrics that help practitioners understand and predict valuation stability across different utility choices.
Abstract: Semivalue-based data valuation uses cooperative-game theory intuitions to assign each data point a value reflecting its contribution to a downstream task. Still, those values depend on the practitioner’s choice of utility, raising the question: How robust is semivalue-based data valuation to changes in the utility? This issue is critical when the utility is set as a trade-off between several criteria and when practitioners must select among multiple equally valid utilities. We address this by introducing the notion of a dataset’s spatial signature: given a semivalue, we embed each data point into a lower-dimensional space in which any utility becomes a linear functional, making the data valuation framework amenable to a simpler geometric picture. Building on this, we propose a practical methodology centered on an explicit robustness metric that informs practitioners whether and by how much their data valuation results will shift as the utility changes. We validate this approach across diverse datasets and semivalues, demonstrating strong agreement with rank-correlation analyses and offering analytical insight into how choosing a semivalue can amplify or diminish robustness.
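To make the utility-dependence concrete, here is a tiny exact computation of the Shapley value (the best-known semivalue) under two different utilities over the same points. The notation follows the standard textbook definition, not the paper's code; real datasets require the approximation schemes the paper builds on.

```python
from itertools import combinations
from math import comb

def shapley_values(points, utility):
    """Exact Shapley value of each data point for a given utility.

    utility maps a coalition (tuple of point indices) to a score.
        phi_i = sum over S not containing i of
                [U(S u {i}) - U(S)] / (n * C(n-1, |S|))
    Exponential in n; fine for toy examples only."""
    n = len(points)
    phi = [0.0] * n
    for i in range(n):
        rest = [j for j in range(n) if j != i]
        for k in range(n):
            for S in combinations(rest, k):
                weight = 1.0 / (n * comb(n - 1, k))
                phi[i] += weight * (utility(S + (i,)) - utility(S))
    return phi
```

Swapping the utility changes every marginal contribution, which is exactly the sensitivity the paper's spatial signature is designed to quantify: once each point is embedded, any utility acts as a linear functional on the embeddings.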
[370] Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities
Ziwei Zhou, Rui Wang, Zuxuan Wu, Yu-Gang Jiang
Main category: cs.AI
TL;DR: Daily-Omni: A benchmark for evaluating cross-modal temporal reasoning in MLLMs using real-world videos with audio-visual QA tasks requiring synchronous processing of multiple modalities.
Details
Motivation: Current MLLMs perform well on individual visual and audio benchmarks but lack evaluation of their ability to process cross-modal information synchronously, particularly for temporal alignment tasks.
Method: Created Daily-Omni benchmark with 684 real-world videos and 1,197 questions across 6 task families requiring cross-modal temporal reasoning. Developed semi-automatic annotation pipeline with human verification, and evaluated 24 foundation models under 37 modality settings.
Result: Many end-to-end MLLMs struggle on alignment-critical questions, indicating that robust cross-modal temporal alignment remains a significant challenge despite good performance on unimodal tasks.
Conclusion: Cross-modal temporal alignment is an important open challenge for MLLMs, and Daily-Omni provides a valuable benchmark for evaluating and advancing synchronous audio-visual understanding capabilities.
Abstract: Recent Multimodal Large Language Models (MLLMs) achieve promising performance on visual and audio benchmarks independently. However, the ability of these models to process cross-modal information synchronously remains largely unexplored. We introduce Daily-Omni, a multiple-choice Audio-Visual QA benchmark featuring 684 real-world videos and 1,197 questions spanning 6 task families that explicitly require cross-modal temporal reasoning. To support scalable benchmark construction, we develop a semi-automatic pipeline for annotation, cross-modal consistency refinement, temporal alignment elicitation, and text-only leakage filtering, followed by human verification. We further provide a diagnostic evaluation suite and extensively evaluate 24 foundation models under 37 model–modality settings (Audio+Video / Audio-only / Video-only / Text-only). Finally, we include a training-free modular diagnostic baseline that composes off-the-shelf unimodal models to serve as a diagnostic baseline and to illustrate how explicit temporal alignment signals affect performance. Results indicate that many end-to-end MLLMs still struggle on alignment-critical questions, suggesting that robust cross-modal temporal alignment remains an important open challenge.
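One pipeline stage named in the abstract, text-only leakage filtering, reduces to a simple idea: drop any question that a text-only probe already answers correctly, since such items cannot test audio-visual alignment. The sketch below is a hypothetical illustration (the dict keys and probe callable are not the paper's implementation):

```python
def filter_text_only_leakage(questions, text_only_probe):
    """Keep only questions the text-only probe gets wrong: if an item is
    answerable from the question text alone, it leaks and is removed.
    `text_only_probe` is any callable: question text -> predicted option."""
    return [q for q in questions
            if text_only_probe(q["question"]) != q["answer"]]

questions = [
    {"question": "What color is the sky in the clip?", "answer": "blue"},
    {"question": "What happens right after the door slams?", "answer": "the dog barks"},
]
text_only_probe = lambda text: "blue"  # prior-knowledge guess, no audio/video needed
kept = filter_text_only_leakage(questions, text_only_probe)
```

Here the first question is filtered out (the probe answers it from prior knowledge), while the temporally grounded second question survives.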
[371] MMGraphRAG: Bridging Vision and Language with Interpretable Multimodal Knowledge Graphs
Xueyao Wan, Hang Yu
Main category: cs.AI
TL;DR: MMGraphRAG integrates visual scene graphs with text knowledge graphs using spectral clustering for cross-modal entity linking and path-based retrieval to reduce hallucinations in multimodal LLMs.
Details
Motivation: Existing GraphRAG approaches are text-centric due to difficulties in constructing fine-grained multimodal knowledge graphs. Current fusion methods require task-specific training and fail to preserve visual structural knowledge or cross-modal reasoning paths, limiting multimodal understanding capabilities.
Method: Proposes MMGraphRAG with the SpecLink method, which uses spectral clustering for accurate cross-modal entity linking between visual scene graphs and text knowledge graphs, combined with path-based retrieval to guide generation.
Result: Achieves state-of-the-art performance on CMEL, DocBench, and MMLongBench datasets, demonstrating robust domain adaptability and superior multimodal information processing capabilities.
Conclusion: MMGraphRAG effectively bridges the gap between visual and textual knowledge representation, enabling more accurate multimodal reasoning while reducing hallucinations in LLMs.
Abstract: Large Language Models (LLMs) often suffer from hallucinations, which Retrieval-Augmented Generation (RAG) and GraphRAG mitigate by incorporating external knowledge and knowledge graphs (KGs). However, GraphRAG remains text-centric due to the difficulty of constructing fine-grained Multimodal KGs (MMKGs). Existing fusion methods, such as shared embeddings or captioning, require task-specific training and fail to preserve visual structural knowledge or cross-modal reasoning paths. To bridge this gap, we propose MMGraphRAG, which integrates visual scene graphs with text KGs via a novel cross-modal fusion approach. It introduces SpecLink, a method leveraging spectral clustering for accurate cross-modal entity linking and path-based retrieval to guide generation. We also release the CMEL dataset, specifically designed for fine-grained multi-entity alignment in complex multimodal scenarios. Evaluations on CMEL, DocBench, and MMLongBench demonstrate that MMGraphRAG achieves state-of-the-art performance, showing robust domain adaptability and superior multimodal information processing capabilities.
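The spectral-clustering idea behind SpecLink can be sketched in a few lines of numpy (a generic two-way spectral cut, not the paper's actual SpecLink algorithm): embed entities from both modalities, build a cosine-affinity graph, and split nodes by the sign of the Fiedler vector of the normalized Laplacian; visual and textual entities landing in the same cluster are candidate links.

```python
import numpy as np

def spectral_bipartition(X):
    """Two-way spectral clustering over a cosine-affinity graph: the sign
    of the Fiedler vector (eigenvector of the second-smallest eigenvalue
    of the normalized Laplacian) splits the nodes into two clusters."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    A = np.clip(Xn @ Xn.T, 0.0, None)          # non-negative affinities
    np.fill_diagonal(A, 0.0)
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(X)) - D_inv_sqrt @ A @ D_inv_sqrt
    _, vecs = np.linalg.eigh(L)                # eigenvalues ascending
    return (vecs[:, 1] > 0).astype(int)        # Fiedler-vector sign

# Toy embeddings: a visual/text "dog" pair and a visual/text "car" pair
# should land in the same cluster and hence be linked cross-modally.
emb = np.array([[1.0, 0.1], [0.9, 0.2],       # visual "dog", text "dog"
                [0.1, 1.0], [0.2, 0.9]])      # visual "car", text "car"
labels = spectral_bipartition(emb)
```

Real entity linking would use learned scene-graph and KG embeddings and more than two clusters; the sign split generalizes to k-means on several Laplacian eigenvectors.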
[372] VistaWise: Building Cost-Effective Agent with Cross-Modal Knowledge Graph for Minecraft
Honghao Fu, Junlong Ren, Qi Chai, Deheng Ye, Yujun Cai, Hao Wang
Main category: cs.AI
TL;DR: VistaWise is a cost-effective agent framework that integrates cross-modal domain knowledge and finetunes an object detection model for visual analysis in embodied decision-making tasks, reducing domain-specific data requirements from millions to hundreds of samples.
Details
Motivation: LLMs show promise in embodied decision-making but are hindered by a lack of domain-specific knowledge. Fine-tuning on large-scale domain-specific data is prohibitively expensive, creating a need for more cost-effective solutions.
Method: Integrates visual information and textual dependencies into a cross-modal knowledge graph, uses retrieval-based pooling to extract task-related information, finetunes an object detection model for visual analysis, and includes a desktop-level skill library for direct operation of the Minecraft client.
Result: Achieves state-of-the-art performance across various open-world tasks while significantly reducing development costs by cutting domain-specific training data requirements from millions to hundreds of samples.
Conclusion: VistaWise effectively reduces development costs while enhancing agent performance in embodied decision-making tasks through cross-modal knowledge integration and efficient visual analysis.
Abstract: Large language models (LLMs) have shown significant promise in embodied decision-making tasks within virtual open-world environments. Nonetheless, their performance is hindered by the absence of domain-specific knowledge. Methods that finetune on large-scale domain-specific data entail prohibitive development costs. This paper introduces VistaWise, a cost-effective agent framework that integrates cross-modal domain knowledge and finetunes a dedicated object detection model for visual analysis. It reduces the requirement for domain-specific training data from millions of samples to a few hundred. VistaWise integrates visual information and textual dependencies into a cross-modal knowledge graph (KG), enabling a comprehensive and accurate understanding of multimodal environments. We also equip the agent with a retrieval-based pooling strategy to extract task-related information from the KG, and a desktop-level skill library to support direct operation of the Minecraft desktop client via mouse and keyboard inputs. Experimental results demonstrate that VistaWise achieves state-of-the-art performance across various open-world tasks, highlighting its effectiveness in reducing development costs while enhancing agent performance.
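The retrieval-based pooling strategy mentioned above can be approximated with a minimal sketch (illustrative only; the function name and scoring are assumptions, not VistaWise's implementation): score KG triple embeddings against a task query, keep the top-k, and pool them into one task-conditioned context vector.

```python
import numpy as np

def retrieve_and_pool(triple_embs, query_emb, k=2):
    """Retrieval-based pooling sketch: rank KG triples by cosine
    similarity to the task query, keep the top-k, and mean-pool them
    into a single context vector for the agent."""
    t = triple_embs / np.linalg.norm(triple_embs, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    scores = t @ q
    top = np.argsort(scores)[::-1][:k]
    return top, triple_embs[top].mean(axis=0)
```

A real system would embed (head, relation, tail) triples with a learned encoder; mean-pooling is the simplest aggregation choice, swapped in here for brevity.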
[373] Small Language Models for Efficient Agentic Tool Calling: Outperforming Large Models with Targeted Fine-tuning
Polaris Jhandi, Owais Kazi, Shreyas Subramanian, Neel Sendas
Main category: cs.AI
TL;DR: Fine-tuned small language model (OPT-350M) achieves 77.55% pass rate on ToolBench evaluation, outperforming larger models like ChatGPT, demonstrating cost-effective AI integration potential.
Details
Motivation: Organizations face high computational costs with large language models (LLMs), making them cost-prohibitive for routine enterprise use. This motivates exploration of small language models (SLMs) that can deliver comparable performance in targeted applications while drastically reducing infrastructure overhead.
Method: Fine-tuned the facebook/opt-350m model using Hugging Face TRL's Supervised Fine-Tuning (SFT) trainer for a single epoch. The model was domain-adapted to execute tasks traditionally handled by LLMs, such as document summarization, query answering, and structured data interpretation.
Result: The fine-tuned SLM achieved exceptional performance with 77.55% pass rate on ToolBench evaluation, significantly outperforming baseline models: ChatGPT-CoT (26.00%), ToolLLaMA-DFS (30.18%), and ToolLLaMA-CoT (16.27%).
Conclusion: Thoughtful design and targeted training of SLMs can significantly lower barriers to adoption, enabling cost-effective, large-scale integration of generative AI into production systems, even with models at the 350M parameter scale.
Abstract: As organizations scale adoption of generative AI, model cost optimization and operational efficiency have emerged as critical factors determining sustainability and accessibility. While Large Language Models (LLMs) demonstrate impressive capabilities across diverse tasks, their extensive computational requirements make them cost-prohibitive for routine enterprise use. This limitation motivates the exploration of Small Language Models (SLMs), which can deliver comparable performance in targeted applications while drastically reducing infrastructure overhead (Irugalbandara et al., 2023). In this work, we investigate the feasibility of replacing LLM-driven workflows with optimized SLMs. We trained a domain-adapted SLM to execute representative tasks traditionally handled by LLMs, such as document summarization, query answering, and structured data interpretation. As part of the experiment, we investigated the fine-tuning of facebook/opt-350m model (single epoch only) using the Hugging Face TRL (Transformer Reinforcement Learning), specifically the Supervised Fine-Tuning (SFT) trainer. The OPT-350M model was released by Meta AI in 2022 as part of the OPT (Open Pretrained Transformer) family of models. Similar studies demonstrate that even models at the 350M parameter scale can meaningfully contribute to instruction-tuning pipelines (Mekala et al., 2024). Experimental results demonstrated that our fine-tuned SLM achieves exceptional performance with a 77.55% pass rate on ToolBench evaluation, significantly outperforming all baseline models including ChatGPT-CoT (26.00%), ToolLLaMA-DFS (30.18%), and ToolLLaMA-CoT (16.27%). These findings emphasize that thoughtful design and targeted training of SLMs can significantly lower barriers to adoption, enabling cost-effective, large-scale integration of generative AI into production systems.
[374] Reinforcement Learning for Self-Improving Agent with Skill Library
Jiongxiao Wang, Qiaojing Yan, Yawei Wang, Yijun Tian, Soumya Smruti Mishra, Zhichao Xu, Megha Gandhi, Panpan Xu, Lin Lee Cheong
Main category: cs.AI
TL;DR: SAGE is an RL framework that enhances LLM agents’ self-improvement through skill libraries, using sequential rollout across task chains and skill-integrated rewards to improve efficiency and performance.
Details
Motivation: LLM-based agents excel at complex reasoning but struggle with continuous improvement and adaptation in new environments. Current skill library approaches rely on LLM prompting, making consistent implementation challenging, so there is a need for systematic skill learning and application.
Method: Proposes SAGE (Skill Augmented GRPO for self-Evolution), an RL framework with Sequential Rollout that deploys agents across chains of similar tasks. Skills generated from previous tasks accumulate in a library for subsequent tasks. Uses a Skill-integrated Reward to complement outcome-based rewards for better skill generation and utilization.
Result: On AppWorld benchmark, SAGE applied to supervised-finetuned models with expert experience achieved 8.9% higher Scenario Goal Completion, required 26% fewer interaction steps, and generated 59% fewer tokens, outperforming existing approaches in both accuracy and efficiency.
Conclusion: SAGE demonstrates that RL-based skill library approaches can significantly enhance LLM agents’ self-improvement capabilities, enabling more efficient and effective adaptation to new environments through systematic skill learning and application.
Abstract: Large Language Model (LLM)-based agents have demonstrated remarkable capabilities in complex reasoning and multi-turn interactions but struggle to continuously improve and adapt when deployed in new environments. One promising approach is implementing skill libraries that allow agents to learn, validate, and apply new skills. However, current skill library approaches rely primarily on LLM prompting, making consistent skill library implementation challenging. To overcome these challenges, we propose a Reinforcement Learning (RL)-based approach to enhance agents’ self-improvement capabilities with a skill library. Specifically, we introduce Skill Augmented GRPO for self-Evolution (SAGE), a novel RL framework that systematically incorporates skills into learning. The framework’s key component, Sequential Rollout, iteratively deploys agents across a chain of similar tasks for each rollout. As agents navigate through the task chain, skills generated from previous tasks accumulate in the library and become available for subsequent tasks. Additionally, the framework enhances skill generation and utilization through a Skill-integrated Reward that complements the original outcome-based rewards. Experimental results on AppWorld demonstrate that SAGE, when applied to a supervised-finetuned model with expert experience, achieves 8.9% higher Scenario Goal Completion while requiring 26% fewer interaction steps and generating 59% fewer tokens, substantially outperforming existing approaches in both accuracy and efficiency.
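The Sequential Rollout mechanism, skills accumulating across a task chain, can be illustrated with a toy loop (hypothetical function names; no GRPO training, just the library-accumulation control flow the abstract describes):

```python
def sequential_rollout(task_chain, solve):
    """Sequential Rollout sketch: the agent works through a chain of
    similar tasks; each task may emit reusable skills that are added
    to the library and offered to all subsequent tasks."""
    library = {}
    results = []
    for task in task_chain:
        outcome, new_skills = solve(task, library)
        library.update(new_skills)
        results.append(outcome)
    return results, library

def toy_solve(task, library):
    """Hypothetical solver: reuses a skill if one exists for this task
    type, otherwise learns the task and emits a new skill."""
    if task in library:
        return "reused", {}
    return "learned", {task: True}

results, library = sequential_rollout(
    ["query_api", "parse_json", "query_api"], toy_solve)
```

In SAGE this loop runs inside RL rollouts, with a Skill-integrated Reward encouraging both skill generation and reuse; the sketch shows only the accumulation structure.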
[375] Empowering All-in-Loop Health Management of Spacecraft Power System in the Mega-Constellation Era via Human-AI Collaboration
Yi Di, Zhibin Zhao, Fujin Wang, Xue Liu, Jiafeng Tang, Jiaxin Ren, Zhi Zhai, Xuefeng Chen
Main category: cs.AI
TL;DR: SpaceHMchat: A human-AI collaboration framework for spacecraft power system health management in satellite mega-constellation era, featuring all-in-loop health management with conversational task completion and transparent reasoning.
Details
Motivation: With the exponential growth of spacecraft in the satellite mega-constellation era, there is an urgent need for scalable health management of spacecraft power systems, which play a critical power-supply role and have high failure rates.
Method: Proposes the AUC principle (aligning underlying capabilities) and develops SpaceHMchat, an open-source human-AI collaboration framework for all-in-loop health management covering work condition recognition, anomaly detection, fault localization, and maintenance decision making.
Result: Achieves excellent performance: 100% conclusion accuracy in logical reasoning, over 99% success rate in anomaly detection tool invocation, over 90% precision in fault localization, and knowledge base search time under 3 minutes. Also releases the first AIL HM dataset, with 4 sub-datasets, 17 fault types, and over 700,000 timestamps.
Conclusion: SpaceHMchat successfully addresses health management challenges in satellite mega-constellation era through human-AI collaboration framework with strong performance across multiple metrics and provides valuable open-source dataset for future research.
Abstract: It is foreseeable that the number of spacecraft will increase exponentially, ushering in an era dominated by satellite mega-constellations (SMC). This necessitates a focus on energy in space: spacecraft power systems (SPS), especially their health management (HM), given their role in power supply and high failure rates. Providing health management for dozens of SPS and for thousands of SPS represents two fundamentally different paradigms. Therefore, to adapt the health management in the SMC era, this work proposes a principle of aligning underlying capabilities (AUC principle) and develops SpaceHMchat, an open-source Human-AI collaboration (HAIC) framework for all-in-loop health management (AIL HM). SpaceHMchat serves across the entire loop of work condition recognition, anomaly detection, fault localization, and maintenance decision making, achieving goals such as conversational task completion, adaptive human-in-the-loop learning, personnel structure optimization, knowledge sharing, efficiency enhancement, as well as transparent reasoning and improved interpretability. Meanwhile, to validate this exploration, a hardware-realistic fault injection experimental platform is established, and its simulation model is built and open-sourced, both fully replicating the real SPS. The corresponding experimental results demonstrate that SpaceHMchat achieves excellent performance across 23 quantitative metrics, such as 100% conclusion accuracy in logical reasoning of work condition recognition, over 99% success rate in anomaly detection tool invocation, over 90% precision in fault localization, and knowledge base search time under 3 minutes in maintenance decision-making. Another contribution of this work is the release of the first-ever AIL HM dataset of SPS. This dataset contains four sub-datasets, involving 4 types of AIL HM sub-tasks, 17 types of faults, and over 700,000 timestamps.
[376] UAT-LITE: Inference-Time Uncertainty-Aware Attention for Pretrained Transformers
Elias Hossain, Shubhashis Roy Dipta, Subash Neupane, Rajib Rana, Ravid Shwartz-Ziv, Ivan Garibay, Niloofar Yousefi
Main category: cs.AI
TL;DR: UAT-LITE is an inference-time framework that makes self-attention uncertainty-aware via Monte Carlo dropout in pretrained transformers, improving calibration without modifying weights or training objectives.
Details
Motivation: Neural NLP models are often miscalibrated and overconfident, assigning high confidence to incorrect predictions and failing to express uncertainty during internal evidence aggregation, which undermines selective prediction and high-stakes deployment.
Method: UAT-LITE uses Monte Carlo dropout during inference to estimate token-level epistemic uncertainty from stochastic forward passes, then modulates self-attention during contextualization based on this uncertainty. It also introduces layer-wise variance decomposition to diagnose how predictive uncertainty accumulates across transformer depth.
Result: Across SQuAD 2.0 answerability, MNLI, and SST-2, UAT-LITE achieves an average relative ECE reduction of approximately 20% compared with a fine-tuned BERT-base baseline while preserving accuracy, and yields more informative uncertainty behavior for selective prediction under distribution shift.
Conclusion: UAT-LITE effectively injects epistemic uncertainty directly into attention mechanisms, enabling uncertainty-aware routing during contextualization and providing token-level diagnostic signals beyond global logit rescaling, without modifying pretrained weights or training objectives.
Abstract: Neural NLP models are often miscalibrated and overconfident, assigning high confidence to incorrect predictions and failing to express uncertainty during internal evidence aggregation. This undermines selective prediction and high-stakes deployment. Post-hoc calibration methods adjust output probabilities but leave internal computation unchanged, while ensemble and Bayesian approaches improve uncertainty at substantial training or storage cost. We propose UAT-LITE, an inference-time framework that makes self-attention uncertainty-aware via Monte Carlo dropout in pretrained transformer classifiers. Unlike output-level calibration (e.g., TS), UAT-LITE injects epistemic uncertainty directly into attention, enabling uncertainty-aware routing during contextualization and token-level diagnostic signals beyond global logit rescaling. Token-level epistemic uncertainty is estimated from stochastic forward passes and used to modulate self-attention during contextualization, without modifying pretrained weights or training objectives. We additionally introduce a layer-wise variance decomposition to diagnose how predictive uncertainty accumulates across transformer depth. Across SQuAD 2.0 answerability, MNLI, and SST-2, UAT-LITE achieves an average relative ECE reduction of approximately 20% compared with a fine-tuned BERT-base baseline while preserving accuracy, and yields more informative uncertainty behavior for selective prediction under distribution shift.
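The two ingredients, MC-dropout uncertainty estimation and uncertainty-modulated attention, can be sketched in numpy (a minimal single-head toy, not the paper's UAT-LITE implementation; the identity readout and the additive logit penalty are simplifying assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_dropout_uncertainty(h, n_passes=32, p=0.1):
    """Token-level epistemic uncertainty from stochastic forward passes:
    apply independent dropout masks to the hidden states and measure the
    per-token variance of the (here: identity) readout across passes."""
    outs = []
    for _ in range(n_passes):
        mask = (rng.random(h.shape) > p) / (1.0 - p)  # inverted dropout
        outs.append(h * mask)
    return np.stack(outs).var(axis=0).mean(axis=-1)   # one scalar per token

def uncertainty_aware_attention(q, k, v, unc, beta=1.0):
    """Down-weight attention logits toward uncertain key tokens before the
    softmax, so contextualization relies more on confident evidence."""
    logits = q @ k.T / np.sqrt(q.shape[-1]) - beta * unc[None, :]
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v
```

In a pretrained transformer the same modulation would be applied inside each attention layer at inference time, leaving the weights untouched, which is what lets the method remain training-free.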
[377] Why do we Trust Chatbots? From Normative Principles to Behavioral Drivers
Aditya Gulati, Nuria Oliver
Main category: cs.AI
TL;DR: Paper examines trust in chatbots, distinguishing between normative trustworthiness and psychological trust formation shaped by design choices, proposing to reframe chatbots as skilled salespeople rather than companions.
Details
Motivation: To examine the foundations of trust in chatbots, highlighting the gap between regulatory/policy definitions of trust and actual user trust formation through behavioral mechanisms and design choices.
Method: Conceptual analysis and reframing, proposing to view chatbots as skilled salespeople rather than companions or assistants, based on the observation that interactional design leverages cognitive biases.
Result: Identifies coexistence of competing notions of “trust” under a shared term, obscuring distinctions between psychological trust formation and normative trustworthiness, requiring further research and support mechanisms.
Conclusion: Chatbots should be reframed as skilled salespeople with organizational objectives, and stronger support mechanisms are needed to help users appropriately calibrate trust in conversational AI systems.
Abstract: As chatbots increasingly blur the boundary between automated systems and human conversation, the foundations of trust in these systems warrant closer examination. While regulatory and policy frameworks tend to define trust in normative terms, the trust users place in chatbots often emerges from behavioral mechanisms. In many cases, this trust is not earned through demonstrated trustworthiness but is instead shaped by interactional design choices that leverage cognitive biases to influence user behavior. Based on this observation, we propose reframing chatbots not as companions or assistants, but as highly skilled salespeople whose objectives are determined by the deploying organization. We argue that the coexistence of competing notions of “trust” under a shared term obscures important distinctions between psychological trust formation and normative trustworthiness. Addressing this gap requires further research and stronger support mechanisms to help users appropriately calibrate trust in conversational AI systems.
[378] FinTexTS: Financial Text-Paired Time-Series Dataset via Semantic-Based and Multi-Level Pairing
Jaehoon Lee, Suhwan Park, Tae Yoon Lim, Seunghan Lee, Jun Seo, Dongwan Kang, Hwanil Choi, Minjae Kim, Sungdong Yoo, SoonYoung Lee, Yongjae Lee, Wonbin Ahn
Main category: cs.AI
TL;DR: A semantic-based multi-level framework for pairing financial news with stock price time-series data, addressing complex market interdependencies beyond simple keyword matching.
Details
Motivation: Financial markets involve complex interdependencies in which stock prices are influenced by company-specific events, related companies, sectors, and macroeconomic factors. Existing text-pairing methods using simple keyword matching fail to capture these complex relationships.
Method: Proposes a semantic-based multi-level pairing framework: 1) extract company-specific context from SEC filings, 2) use embedding-based matching to retrieve semantically relevant news articles, 3) classify news into four levels (macro, sector, related company, target company) using LLMs, 4) construct the FinTexTS dataset with this approach.
Result: Created FinTexTS, a large-scale text-paired stock price dataset. Experimental results show effectiveness of semantic-based multi-level pairing for stock price forecasting. Method also improves performance when applied to proprietary curated news sources.
Conclusion: The proposed semantic-based multi-level pairing framework better captures complex financial market interdependencies and improves stock price forecasting compared to simple keyword matching approaches.
Abstract: The financial domain involves a variety of important time-series problems. Recently, time-series analysis methods that jointly leverage textual and numerical information have gained increasing attention. Accordingly, numerous efforts have been made to construct text-paired time-series datasets in the financial domain. However, financial markets are characterized by complex interdependencies, in which a company’s stock price is influenced not only by company-specific events but also by events in other companies and broader macroeconomic factors. Existing approaches that pair text with financial time-series data based on simple keyword matching often fail to capture such complex relationships. To address this limitation, we propose a semantic-based and multi-level pairing framework. Specifically, we extract company-specific context for the target company from SEC filings and apply an embedding-based matching mechanism to retrieve semantically relevant news articles based on this context. Furthermore, we classify news articles into four levels (macro-level, sector-level, related company-level, and target-company level) using large language models (LLMs), enabling multi-level pairing of news articles with the target company. Applying this framework to publicly-available news datasets, we construct FinTexTS, a new large-scale text-paired stock price dataset. Experimental results on FinTexTS demonstrate the effectiveness of our semantic-based and multi-level pairing strategy in stock price forecasting. In addition to publicly-available news underlying FinTexTS, we show that applying our method to proprietary yet carefully curated news sources leads to higher-quality paired data and improved stock price forecasting performance.
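The embedding-based matching step can be sketched with cosine retrieval (illustrative only; the function name and stand-in vectors are assumptions — the paper derives the context embedding from SEC filings and uses learned text encoders):

```python
import numpy as np

def match_news(context_emb, news_embs, top_k=3, threshold=0.0):
    """Embedding-based pairing sketch: retrieve the news articles whose
    embeddings are most cosine-similar to the target company's context
    embedding, keeping at most top_k above a similarity threshold."""
    c = context_emb / np.linalg.norm(context_emb)
    n = news_embs / np.linalg.norm(news_embs, axis=1, keepdims=True)
    sims = n @ c
    order = np.argsort(sims)[::-1]
    return [i for i in order[:top_k] if sims[i] >= threshold]
```

Each retrieved article would then be routed by an LLM into one of the four levels (macro, sector, related company, target company) before being paired with the price series.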
[379] Timer-S1: A Billion-Scale Time Series Foundation Model with Serial Scaling
Yong Liu, Xingjian Su, Shiyu Wang, Haoran Zhang, Haixuan Liu, Yuxuan Wang, Zhou Ye, Yang Xiang, Jianmin Wang, Mingsheng Long
Main category: cs.AI
TL;DR: Timer-S1 is a Mixture-of-Experts time series foundation model with 8.3B parameters that achieves state-of-the-art forecasting performance through serial scaling in architecture, dataset, and training pipeline.
Details
Motivation: To overcome scalability bottlenecks in existing pre-trained time series foundation models and improve long-term forecasting while avoiding costly rolling-style inference and error accumulation in standard next-token prediction.
Method: Uses Serial Scaling across three dimensions: 1) a model architecture with sparse TimeMoE blocks and TimeSTP blocks for Serial-Token Prediction (STP), 2) a curated TimeBench dataset with 1T time points and data augmentation, 3) a post-training stage including continued pre-training and long-context extension.
Result: Achieves state-of-the-art forecasting performance on GIFT-Eval leaderboard with best MASE and CRPS scores as a pre-trained model, demonstrating superior short-term and long-context capabilities.
Conclusion: Timer-S1 represents a scalable time series foundation model that overcomes previous limitations through serial computations and high-quality data curation, with plans for release to facilitate further research.
Abstract: We introduce Timer-S1, a strong Mixture-of-Experts (MoE) time series foundation model with 8.3B total parameters, 0.75B activated parameters for each token, and a context length of 11.5K. To overcome the scalability bottleneck in existing pre-trained time series foundation models, we perform Serial Scaling in three dimensions: model architecture, dataset, and training pipeline. Timer-S1 integrates sparse TimeMoE blocks and generic TimeSTP blocks for Serial-Token Prediction (STP), a generic training objective that adheres to the serial nature of forecasting. The proposed paradigm introduces serial computations to improve long-term predictions while avoiding costly rolling-style inference and pronounced error accumulation in the standard next-token prediction. Pursuing a high-quality and unbiased training dataset, we curate TimeBench, a corpus with one trillion time points, and apply meticulous data augmentation to mitigate predictive bias. We further pioneer a post-training stage, including continued pre-training and long-context extension, to enhance short-term and long-context performance. Evaluated on the large-scale GIFT-Eval leaderboard, Timer-S1 achieves state-of-the-art forecasting performance, attaining the best MASE and CRPS scores as a pre-trained model. Timer-S1 will be released to facilitate further research.
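The inference-cost contrast that motivates STP, rolling next-token prediction versus emitting the horizon in one pass, can be shown with a toy forecaster (hypothetical functions; this is not Timer-S1's architecture, only the control-flow difference the abstract cites):

```python
def rolling_forecast(model_step, history, horizon):
    """Rolling next-token inference: one forward pass per step, each
    prediction fed back into the context (so errors can compound)."""
    ctx = list(history)
    preds, passes = [], 0
    for _ in range(horizon):
        y = model_step(ctx); passes += 1
        preds.append(y); ctx.append(y)
    return preds, passes

def serial_forecast(model_multi, history, horizon):
    """Serial multi-token prediction sketch: a single forward pass emits
    the whole horizon, avoiding prediction feedback."""
    return model_multi(list(history), horizon), 1

# Toy models that both extrapolate "+1 per step" from the last value.
preds_r, passes_r = rolling_forecast(lambda ctx: ctx[-1] + 1, [1, 2, 3], 4)
preds_s, passes_s = serial_forecast(
    lambda h, H: [h[-1] + i + 1 for i in range(H)], [1, 2, 3], 4)
```

On this noiseless toy the predictions coincide, but rolling inference costs one forward pass per horizon step (4 vs. 1 here), and with imperfect models each fed-back prediction compounds error.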
[380] LLM-Grounded Explainable AI for Supply Chain Risk Early Warning via Temporal Graph Attention Networks
Zhiming Xue, Yujue Wang, Menghao Huo
Main category: cs.AI
TL;DR: A framework combining Temporal Graph Attention Networks with structured LLM reasoning for supply chain bottleneck prediction and interpretable risk explanations using maritime hub data.
Details
Motivation: Existing supply chain risk prediction systems focus on accuracy but lack operationally interpretable early warnings, creating a need for evidence-grounded frameworks that provide faithful natural-language explanations alongside predictions.
Method: Couples a Temporal Graph Attention Network (TGAT) with a structured LLM reasoning module; constructs daily spatial graphs from AIS broadcasts, models inter-node interactions via attention-based message passing, and transforms model-internal evidence (feature z-scores, attention-derived neighbor influence) into structured prompts that constrain LLM reasoning.
Result: Outperforms baselines with test AUC of 0.761, AP of 0.344, recall of 0.504 under chronological split; produces early warnings with 99.6% directional consistency between generated narratives and underlying evidence.
Conclusion: Grounding LLM generation in graph-model evidence enables interpretable, auditable risk reporting without sacrificing predictive performance, providing practical pathway for deployable explainable AI in supply chain risk management.
Abstract: Disruptions at critical logistics nodes pose severe risks to global supply chains, yet existing risk prediction systems typically prioritize forecasting accuracy without providing operationally interpretable early warnings. This paper proposes an evidence-grounded framework that jointly performs supply chain bottleneck prediction and faithful natural-language risk explanation by coupling a Temporal Graph Attention Network (TGAT) with a structured large language model (LLM) reasoning module. Using maritime hubs as a representative case study for global supply chain nodes, daily spatial graphs are constructed from Automatic Identification System (AIS) broadcasts, where inter-node interactions are modeled through attention-based message passing. The TGAT predictor captures spatiotemporal risk dynamics, while model-internal evidence – including feature z-scores and attention-derived neighbor influence – is transformed into structured prompts that constrain LLM reasoning to verifiable model outputs. To evaluate explanatory reliability, we introduce a directional-consistency validation protocol that quantitatively measures agreement between generated risk narratives and underlying statistical evidence. Experiments on six months of real-world logistics data demonstrate that the proposed framework outperforms baseline models, achieving a test AUC of 0.761, AP of 0.344, and recall of 0.504 under a strict chronological split while producing early warning explanations with 99.6% directional consistency. Results show that grounding LLM generation in graph-model evidence enables interpretable and auditable risk reporting without sacrificing predictive performance. The framework provides a practical pathway toward operationally deployable explainable AI for supply chain risk early warning and resilience management.
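The evidence-grounding step, turning feature z-scores and attention-derived neighbor influence into a structured prompt, can be sketched as follows (the field names and wording are hypothetical, not the paper's prompt template):

```python
def evidence_prompt(node, zscores, neighbor_influence, risk):
    """Build a structured prompt from model-internal evidence so the LLM
    can only reason over verifiable TGAT outputs: feature z-scores sorted
    by magnitude, then attention weights toward neighboring nodes."""
    lines = [f"Node: {node}", f"Predicted risk: {risk:.2f}", "Evidence:"]
    for feat, z in sorted(zscores.items(), key=lambda kv: -abs(kv[1])):
        lines.append(f"- {feat}: z={z:+.1f}")
    for nb, w in sorted(neighbor_influence.items(), key=lambda kv: -kv[1]):
        lines.append(f"- neighbor {nb}: attention weight {w:.2f}")
    lines.append("Explain the risk using ONLY the evidence above.")
    return "\n".join(lines)
```

Constraining generation to listed evidence is what makes the directional-consistency check possible: each claim in the narrative can be matched against a line of the prompt.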
[381] Dynamic Vehicle Routing Problem with Prompt Confirmation of Advance Requests
Amutheezan Sivagnanam, Ayan Mukhopadhyay, Samitha Samaranayake, Abhishek Dubey, Aron Laszka
Main category: cs.AI
TL;DR: A reinforcement learning approach for dynamic vehicle routing with prompt confirmation and continual optimization for on-demand transit services
Details
Motivation: Real-world transit agencies need to promptly confirm advance trip bookings while continually optimizing routes, but existing approaches either provide prompt confirmation without continual optimization or continual optimization without guaranteed service for accepted requests.
Method: Proposes a novel computational approach integrating a quick insertion search for prompt confirmation with an anytime algorithm for continual optimization, guided by a non-myopic objective function trained using reinforcement learning.
Result: Evaluation on real-world microtransit dataset shows the approach provides prompt confirmation while significantly increasing the number of requests served compared to existing approaches.
Conclusion: The proposed method effectively addresses the gap between prompt confirmation and continual optimization in dynamic vehicle routing for transit agencies.
Abstract: Transit agencies that operate on-demand transportation services have to respond to trip requests from passengers in real time, which involves solving dynamic vehicle routing problems with pick-up and drop-off constraints. Based on discussions with public transit agencies, we observe a real-world problem that is not addressed by prior work: when trips are booked in advance (e.g., trip requests arrive a few hours in advance of their requested pick-up times), the agency needs to promptly confirm whether a request can be accepted or not, and ensure that accepted requests are served as promised. State-of-the-art computational approaches either provide prompt confirmation but lack the ability to continually optimize and improve routes for accepted requests, or they provide continual optimization but cannot guarantee serving all accepted requests. To address this gap, we introduce a novel problem formulation of dynamic vehicle routing with prompt confirmation and continual optimization. We propose a novel computational approach for this vehicle routing problem, which integrates a quick insertion search for prompt confirmation with an anytime algorithm for continual optimization. To maximize the number of requests served, we train a non-myopic objective function using reinforcement learning, which guides both the insertion and the anytime algorithms towards optimal, non-myopic solutions. We evaluate our computational approach on a real-world microtransit dataset from a public transit agency in the U.S., demonstrating that our proposed approach provides prompt confirmation while significantly increasing the number of requests served compared to existing approaches.
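The abstract does not detail the quick insertion search, but a generic cheapest-insertion heuristic (not the paper's algorithm; route and stop coordinates below are invented) conveys the flavor of confirming a request by testing every insertion position for added travel cost:

```python
# Minimal cheapest-insertion sketch for a single new stop: try every position
# in the existing route and keep the one with the smallest added travel cost.
import math

def added_cost(a, b, new):
    d = math.dist
    return d(a, new) + d(new, b) - d(a, b)

def cheapest_insertion(route, stop):
    """Return (best insertion index, extra travel cost) for `stop` in `route`."""
    best_pos, best_cost = None, math.inf
    for i in range(len(route) - 1):
        c = added_cost(route[i], route[i + 1], stop)
        if c < best_cost:
            best_pos, best_cost = i + 1, c
    return best_pos, best_cost

route = [(0, 0), (4, 0), (4, 4)]
pos, cost = cheapest_insertion(route, (2, 0))
print(pos, cost)  # (2, 0) lies on segment (0,0)-(4,0): position 1, extra cost 0.0
```

A prompt-confirmation policy could accept the request if the best insertion keeps all time-window constraints feasible, leaving the anytime algorithm to refine routes later.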
[382] Sparse Variational Student-t Processes for Heavy-tailed Modeling
Jian Xu, Delu Zeng, John Paisley
Main category: cs.AI
Abstract: Unavailable; the arXiv API request for 2408.06699 returned HTTP 429 (rate limited).
[383] LLM-Advisor: An LLM Benchmark for Cost-efficient Path Planning across Multiple Terrains
Ling Xiao, Toshihiko Yamasaki
Main category: cs.AI
Abstract: Unavailable; the arXiv API request for 2503.01236 returned HTTP 429 (rate limited).
[384] HyConEx: Hypernetwork classifier with counterfactual explanations for tabular data
Patryk Marszałek, Kamil Książek, Oleksii Furman, Ulvi Movsum-zada, Przemysław Spurek, Marek Śmieja
Main category: cs.AI
Abstract: Unavailable; the arXiv API request for 2503.12525 returned HTTP 429 (rate limited).
[385] A Consequentialist Critique of Binary Classification Evaluation: Theory, Practice, and Tools
Gerardo Flores, Abigail Schiff, Alyssa H. Smith, Julia A Fukuyama, Ashia C. Wilson
Main category: cs.AI
Abstract: Unavailable; the arXiv API request for 2504.04528 returned HTTP 429 (rate limited).
[386] MCP Bridge: A Lightweight, LLM-Agnostic RESTful Proxy for Model Context Protocol Servers
Arash Ahmadi, Sarah Sharif, Yaser M. Banad
Main category: cs.AI
Abstract: Unavailable; the arXiv API request for 2504.08999 returned HTTP 429 (rate limited).
[387] SATURN: SAT-based Reinforcement Learning to Unleash LLMs Reasoning
Huanyu Liu, Ge Li, Jia Li, Hao Zhu, Kechi Zhang, Yihong Dong
Main category: cs.AI
Abstract: Unavailable; the arXiv API request for 2505.16368 returned HTTP 429 (rate limited).
[388] Rating Quality of Diverse Time Series Data by Meta-learning from LLM Judgment
Shunyu Wu, Dan Li, Wenjie Feng, Haozheng Ye, Jian Lou, See-Kiong Ng
Main category: cs.AI
Abstract: Unavailable; the arXiv API request for 2506.01290 returned HTTP 429 (rate limited).
[389] Towards Robust Real-World Multivariate Time Series Forecasting: A Unified Framework for Dependency, Asynchrony, and Missingness
Jinkwan Jang, Hyungjin Park, Jinmyeong Choi, Taesup Kim
Main category: cs.AI
Abstract: Unavailable; the arXiv API request for 2506.08660 returned HTTP 429 (rate limited).
[390] On the mechanical creation of mathematical concepts
Asvin G
Main category: cs.AI
Abstract: Unavailable; the arXiv API request for 2507.10179 returned HTTP 429 (rate limited).
[391] Latent Policy Steering with Embodiment-Agnostic Pretrained World Models
Yiqi Wang, Mrinal Verghese, Jeff Schneider
Main category: cs.AI
Abstract: Unavailable; the arXiv API request for 2507.13340 returned HTTP 429 (rate limited).
[392] Debiasing International Attitudes: LLM Agents for Simulating US-China Perception Changes
Nicholas Sukiennik, Yichuan Xu, Yuqing Kan, Jinghua Piao, Yuwei Yan, Chen Gao, Yong Li
Main category: cs.AI
Abstract: Unavailable; the arXiv API request for 2508.08837 returned HTTP 429 (rate limited).
[393] REAP the Experts: Why Pruning Prevails for One-Shot MoE compression
Mike Lasby, Ivan Lazarevich, Nish Sinnadurai, Sean Lie, Yani Ioannou, Vithursan Thangarasa
Main category: cs.AI
Abstract: Unavailable; the arXiv API request for 2510.13999 returned HTTP 429 (rate limited).
[394] RL-100: Performant Robotic Manipulation with Real-World Reinforcement Learning
Kun Lei, Huanyu Li, Dongjie Yu, Zhenyu Wei, Lingxiao Guo, Zhennan Jiang, Ziyu Wang, Shiyu Liang, Huazhe Xu
Main category: cs.AI
Abstract: Unavailable; the arXiv API request for 2510.14830 returned HTTP 429 (rate limited).
[395] Reinforcing Numerical Reasoning in LLMs for Tabular Prediction via Structural Priors
Pengxiang Cai, Zihao Gao, Wanchen Lian, Jintai Chen
Main category: cs.AI
Abstract: Unavailable; the arXiv API request for 2510.17385 returned HTTP 429 (rate limited).
[396] Vectorized Online POMDP Planning
Marcus Hoerger, Muhammad Sudrajat, Hanna Kurniawati
Main category: cs.AI
Abstract: Unavailable; the arXiv API request for 2510.27191 returned HTTP 429 (rate limited).
[397] GraphKeeper: Graph Domain-Incremental Learning via Knowledge Disentanglement and Preservation
Zihao Guo, Qingyun Sun, Ziwei Zhang, Haonan Yuan, Huiping Zhuang, Xingcheng Fu, Jianxin Li
Main category: cs.AI
Abstract: Unavailable; the arXiv API request for 2511.00097 returned HTTP 429 (rate limited).
[398] Structured Matrix Scaling for Multi-Class Calibration
Eugène Berta, David Holzmüller, Michael I. Jordan, Francis Bach
Main category: cs.AI
Abstract: Unavailable; the arXiv API request for 2511.03685 returned HTTP 429 (rate limited).
[399] Lightweight Time Series Data Valuation on Time Series Foundation Models via In-Context Finetuning
Shunyu Wu, Tianyue Li, Yixuan Leng, Jingyi Suo, Jian Lou, Dan Li, See-Kiong Ng
Main category: cs.AI
Abstract: Unavailable; the arXiv API request for 2511.11648 returned HTTP 429 (rate limited).
[400] TSFM in-context learning for time-series classification of bearing-health status
Michel Tokic, Slobodan Djukanović, Anja von Beuningen, Cheng Feng
Main category: cs.AI
Abstract: Unavailable; the arXiv API request for 2511.15447 returned HTTP 429 (rate limited).
[401] Research and Prototyping Study of an LLM-Based Chatbot for Electromagnetic Simulations
Albert Piwonski, Mirsad Hadžiefendić
Main category: cs.AI
Abstract: Unavailable; the arXiv API request for 2511.17680 returned HTTP 429 (rate limited).
[402] Periodic Asynchrony: An On-Policy Approach for Accelerating LLM Reinforcement Learning
Jian Lu
Main category: cs.AI
Abstract: Unavailable; the arXiv API request for 2511.18871 returned HTTP 429 (rate limited).
[403] EMFusion: Conditional Diffusion Framework for Trustworthy Frequency Selective EMF Forecasting in Wireless Networks
Zijiang Yan, Yixiang Huang, Jianhua Pei, Hina Tabassum, Luca Chiaraviglio
Main category: cs.AI
Abstract: Unavailable; the arXiv API request for 2512.15067 returned HTTP 429 (rate limited).
[404] MCGI: Manifold-Consistent Graph Indexing for Billion-Scale Disk-Resident Vector Search
Dongfang Zhao
Main category: cs.AI
Abstract: Unavailable; the arXiv API request for 2601.01930 returned HTTP 429 (rate limited).
[405] An AI-powered Bayesian Generative Modeling Approach for Arbitrary Conditional Inference
Qiao Liu, Wing Hung Wong
Main category: cs.AI
Abstract: Unavailable; the arXiv API request for 2601.05355 returned HTTP 429 (rate limited).
[406] Automating Forecasting Question Generation and Resolution for AI Evaluation
Nikos I. Bosse, Peter Mühlbacher, Jack Wildman, Lawrence Phillips, Dan Schwarz
Main category: cs.AI
Abstract: Unavailable; the arXiv API request for 2601.22444 returned HTTP 429 (rate limited).
[407] Infusion: Shaping Model Behavior by Editing Training Data via Influence Functions
J Rosser, Robert Kirk, Edward Grefenstette, Jakob Foerster, Laura Ruis
Main category: cs.AI
Abstract: Unavailable; the arXiv API request for 2602.09987 returned HTTP 429 (rate limited).
[408] Continual uncertainty learning
Heisei Yonezawa, Ansei Yonezawa, Itsuro Kajiwara
Main category: cs.AI
Abstract: Unavailable; the arXiv API request for 2602.17174 returned HTTP 429 (rate limited).
[409] Breaking the Factorization Barrier in Diffusion Language Models
Ian Li, Zilei Shao, Benjie Wang, Rose Yu, Guy Van den Broeck, Anji Liu
Main category: cs.AI
Abstract: Unavailable; the arXiv API request for 2603.00045 returned HTTP 429 (rate limited).
[410] Reasoning as Gradient: Scaling MLE Agents Beyond Tree Search
Yifei Zhang, Xu Yang, Xiao Yang, Bowen Xian, Qizheng Li, Shikai Fang, Jingyuan Li, Jian Wang, Mingrui Xu, Weiqing Liu, Jiang Bian
Main category: cs.AI
Abstract: Unavailable; the arXiv API request for 2603.01692 returned HTTP 429 (rate limited).
[411] The Geometric Inductive Bias of Grokking: Bypassing Phase Transitions via Architectural Topology
Alper Yıldırım
Main category: cs.AI
Abstract: Unavailable; the arXiv API request for 2603.05228 returned HTTP 429 (rate limited).
[412] SPARC: Spatial-Aware Path Planning via Attentive Robot Communication
Sayang Mu, Xiangyu Wu, Bo An
Main category: cs.AI
Abstract: Unavailable; the arXiv API request for 2603.02845 returned HTTP 429 (rate limited).
[413] Property-driven Protein Inverse Folding With Multi-Objective Preference Alignment
Xiaoyang Hou, Junqi Liu, Chence Shi, Xin Liu, Zhi Yang, Jian Tang
Main category: cs.AI
Abstract: Unavailable; the arXiv API request for 2603.06748 returned HTTP 429 (rate limited).
[414] Adversarial Latent-State Training for Robust Policies in Partially Observable Domains
Angad Singh Ahuja
Main category: cs.AI
Abstract: Unavailable; the arXiv API request for 2603.07313 returned HTTP 429 (rate limited).
[415] Latent Generative Models with Tunable Complexity for Compressed Sensing and other Inverse Problems
Sean Gunn, Jorio Cocola, Oliver De Candido, Vaggos Chatziafratis, Paul Hand
Main category: cs.AI
Abstract: Unavailable; the arXiv API request for 2603.07357 returned HTTP 429 (rate limited).
[416] Governance Architecture for Autonomous Agent Systems: Threats, Framework, and Engineering Practice
Yuxu Ge
Main category: cs.AI
Abstract: Unavailable; the arXiv API request for 2603.07191 returned HTTP 429 (rate limited).
[417] Designing probabilistic AI monsoon forecasts to inform agricultural decision-making
Colin Aitken, Rajat Masiwal, Adam Marchakitus, Katherine Kowal, Mayank Gupta, Tyler Yang, Amir Jina, Pedram Hassanzadeh, William R. Boos, Michael Kremer
Main category: cs.AI
Abstract: Unavailable; the arXiv API request for 2603.07893 returned HTTP 429 (rate limited).
[418] PostTrainBench: Can LLM Agents Automate LLM Post-Training?
Ben Rank, Hardik Bhatnagar, Ameya Prabhu, Shira Eisenberg, Karina Nguyen, Matthias Bethge, Maksym Andriushchenko
Main category: cs.AI
Abstract: Unavailable; the arXiv API request for 2603.08640 returned HTTP 429 (rate limited).
cs.SD
[419] EDMFormer: Genre-Specific Self-Supervised Learning for Music Structure Segmentation
Sahal Sajeer, Krish Patel, Oscar Chung, Joel Song Bae
Main category: cs.SD
TL;DR: EDMFormer: A transformer model for Electronic Dance Music structure segmentation using self-supervised audio embeddings and an EDM-specific dataset (EDM-98), outperforming existing models on boundary detection and section labeling.
Details
Motivation: Existing music structure segmentation models perform poorly on Electronic Dance Music (EDM) because they rely on lyrical or harmonic similarity, which works for pop music but not for EDM, where structure is defined by changes in energy, rhythm, and timbre (buildup, drop, breakdown).
Method: EDMFormer combines self-supervised audio embeddings with an EDM-specific dataset (EDM-98: 98 professionally annotated EDM tracks) and uses a transformer architecture with genre-specific structural priors for EDM segmentation.
Result: EDMFormer improves boundary detection and section labeling compared to existing models, particularly for drops and buildups, demonstrating effectiveness for EDM structure segmentation.
Conclusion: Combining learned representations with genre-specific data and structural priors is effective for EDM segmentation and could be applied to other specialized music genres or broader audio domains.
Abstract: Music structure segmentation is a key task in audio analysis, but existing models perform poorly on Electronic Dance Music (EDM). This problem exists because most approaches rely on lyrical or harmonic similarity, which works well for pop music but not for EDM. EDM structure is instead defined by changes in energy, rhythm, and timbre, with different sections such as buildup, drop, and breakdown. We introduce EDMFormer, a transformer model that combines self-supervised audio embeddings using an EDM-specific dataset and taxonomy. We release this dataset as EDM-98: a group of 98 professionally annotated EDM tracks. EDMFormer improves boundary detection and section labelling compared to existing models, particularly for drops and buildups. The results suggest that combining learned representations with genre-specific data and structural priors is effective for EDM and could be applied to other specialized music genres or broader audio domains.
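Boundary detection of the kind reported above is commonly scored with a tolerance-window F-measure, as in standard MIR evaluation. The sketch below is a generic scorer with invented boundary times, not EDMFormer's own evaluation code:

```python
# Generic boundary-detection F-measure: an estimated boundary counts as a hit
# if it falls within `tol` seconds of an unmatched reference boundary.
def boundary_f1(est, ref, tol=0.5):
    """F-measure over estimated vs. reference boundary times (seconds)."""
    matched = set()
    hits = 0
    for e in est:
        for j, r in enumerate(ref):
            if j not in matched and abs(e - r) <= tol:
                matched.add(j)
                hits += 1
                break
    precision = hits / len(est) if est else 0.0
    recall = hits / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

ref = [15.0, 45.0, 75.0, 100.0]  # annotated buildup/drop/breakdown boundaries
est = [15.2, 44.8, 80.0]         # model predictions
print(round(boundary_f1(est, ref), 3))  # 2 hits: P = 2/3, R = 2/4 -> 0.571
```

Section labeling is typically scored separately (e.g., frame-wise pairwise agreement), since a model can place boundaries well yet mislabel a drop as a buildup.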
[420] Fish Audio S2 Technical Report
Shijia Liao, Yuxuan Wang, Songting Liu, Yifan Cheng, Ruoyi Zhang, Tianyu Li, Shidong Li, Yisheng Zheng, Xingwei Liu, Qingzheng Wang, Zhizhuo Zhou, Jiahua Liu, Xin Chen, Dawei Han
Main category: cs.SD
TL;DR: Fish Audio S2 is an open-source text-to-speech system with multi-speaker, multi-turn generation and natural language instruction-following control, featuring production-ready streaming inference.
Details
Motivation: To advance open-source TTS capabilities by creating a system that can follow natural language instructions for voice control, support multi-speaker and multi-turn generation, and provide production-ready performance.
Method: Multi-stage training recipe with a staged data pipeline covering video captioning, speech captioning, voice-quality assessment, and reward modeling. Released with an SGLang-based inference engine optimized for streaming.
Result: Achieves RTF of 0.195 and time-to-first-audio below 100 ms, with open-sourced model weights, fine-tuning code, and production-ready inference engine available on GitHub and Hugging Face.
Conclusion: Fish Audio S2 pushes the frontier of open-source TTS with instruction-following control and production-ready performance, encouraging community adoption and custom voice creation.
Abstract: We introduce Fish Audio S2, an open-sourced text-to-speech system featuring multi-speaker, multi-turn generation, and, most importantly, instruction-following control via natural-language descriptions. To scale training, we develop a multi-stage training recipe together with a staged data pipeline covering video captioning and speech captioning, voice-quality assessment, and reward modeling. To push the frontier of open-source TTS, we release our model weights, fine-tuning code, and an SGLang-based inference engine. The inference engine is production-ready for streaming, achieving an RTF of 0.195 and a time-to-first-audio below 100 ms. Our code and weights are available on GitHub (https://github.com/fishaudio/fish-speech) and Hugging Face (https://huggingface.co/fishaudio/s2-pro). We highly encourage readers to visit https://fish.audio to try custom voices.
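The real-time factor (RTF) quoted above is simply wall-clock synthesis time divided by the duration of the generated audio; RTF < 1 means faster than real time. A minimal sketch of the measurement (`synthesize` here is a hypothetical TTS call standing in for any engine, not Fish Audio's actual API):

```python
import time

def real_time_factor(synthesize, text, sample_rate):
    """RTF = wall-clock synthesis time / duration of generated audio.

    `synthesize` is assumed to return a sequence of audio samples."""
    start = time.perf_counter()
    audio = synthesize(text)
    elapsed = time.perf_counter() - start
    duration = len(audio) / sample_rate  # seconds of generated audio
    return elapsed / duration
```

At an RTF of 0.195, generating one second of audio takes roughly 195 ms of compute; streaming inference hides this behind playback once the first chunk arrives.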
[421] VoxEmo: Benchmarking Speech Emotion Recognition with Speech LLMs
Hezhao Zhang, Huang-Cheng Chou, Shrikanth Narayanan, Thomas Hain
Main category: cs.SD
TL;DR: VoxEmo benchmark for evaluating speech LLMs on emotion recognition across 35 corpora and 15 languages with standardized prompts and soft-label evaluation
Details
Motivation: Speech LLMs show promise for emotion recognition but face challenges with zero-shot stochasticity and prompt sensitivity, while existing benchmarks ignore emotion ambiguity and lack standardized evaluation.
Method: Created the VoxEmo benchmark with 35 emotion corpora across 15 languages, a standardized toolkit with varying prompt complexities, a distribution-aware soft-label protocol, and a prompt-ensemble strategy to emulate annotator disagreement.
Result: Zero-shot speech LLMs trail supervised baselines in hard-label accuracy but uniquely align with human subjective distributions, showing better capture of emotion ambiguity
Conclusion: VoxEmo provides comprehensive evaluation framework for speech LLMs on emotion recognition, highlighting their ability to capture human-like emotion distributions despite lower hard-label accuracy
Abstract: Speech Large Language Models (LLMs) show great promise for speech emotion recognition (SER) via generative interfaces. However, shifting from closed-set classification to open text generation introduces zero-shot stochasticity, making evaluation highly sensitive to prompts. Additionally, conventional speech LLM benchmarks overlook the inherent ambiguity of human emotion. Hence, we present VoxEmo, a comprehensive SER benchmark encompassing 35 emotion corpora across 15 languages for speech LLMs. VoxEmo provides a standardized toolkit featuring varying prompt complexities, from direct classification to paralinguistic reasoning. To reflect real-world perception and application, we introduce a distribution-aware soft-label protocol and a prompt-ensemble strategy that emulates annotator disagreement. Experiments reveal that while zero-shot speech LLMs trail supervised baselines in hard-label accuracy, they uniquely align with human subjective distributions.
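Soft-label evaluation of the kind described here compares a model's predicted emotion distribution against the distribution of annotator votes rather than a single hard label. One standard way to score that is Jensen-Shannon divergence (a sketch of the general idea; VoxEmo's exact protocol may use a different measure):

```python
import math

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete emotion
    distributions; 0 for identical distributions, ln(2) at most."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log((ai + eps) / (bi + eps))
                   for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

A model that spreads probability over "angry" and "frustrated" the way human annotators do scores well here even when its argmax label disagrees with the majority vote, which is exactly the behavior hard-label accuracy penalizes.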
[422] Gender Fairness in Audio Deepfake Detection: Performance and Disparity Analysis
Aishwarya Fursule, Shruti Kshirsagar, Anderson R. Avila
Main category: cs.SD
TL;DR: Analysis of gender bias in audio deepfake detection models using fairness metrics reveals hidden disparities in error distribution across genders, despite seemingly low overall performance differences.
Details
Motivation: Audio deepfake detection is crucial for voice biometrics security, but gender bias remains underexplored. As synthetic voice quality improves, the risk of exploitation for identity theft increases, making fair and robust detection systems essential.
Method: Used the ASVspoof 5 dataset, trained a ResNet-18 classifier with four different audio features, and compared with the baseline AASIST model. Evaluated using Equal Error Rate (EER) and five established fairness metrics to quantify gender disparities.
Result: Even with low overall EER differences between genders, fairness metrics revealed significant disparities in error distribution. Standard metrics obscured demographic-specific failure modes, while fairness metrics provided critical insights into model biases.
Conclusion: Fairness-aware evaluation is essential for developing equitable, robust, and trustworthy audio deepfake detection systems. Reliance on standard metrics alone is unreliable for detecting demographic biases.
Abstract: Audio deepfake detection aims to distinguish real human voices from those generated by Artificial Intelligence (AI) and has emerged as a significant problem in the field of voice biometrics systems. With the ever-improving quality of synthetic voice, the probability of such a voice being exploited for illicit practices like identity theft and impersonation increases. Although significant progress has been made in the field of audio deepfake detection in recent times, the issue of gender bias remains underexplored and in its nascent stage. In this paper, we conduct a thorough analysis of gender-dependent performance and fairness in audio deepfake detection models. We use the ASVspoof 5 dataset, train a ResNet-18 classifier, evaluate detection performance across four different audio features, and compare the performance with the baseline AASIST model. Beyond conventional metrics such as Equal Error Rate (EER), we incorporate five established fairness metrics to quantify gender disparities in the model. Our results show that even when the overall EER difference between genders appears low, fairness-aware evaluation reveals disparities in error distribution that are obscured by aggregate performance measures. These findings demonstrate that reliance on standard metrics is unreliable, whereas fairness metrics provide critical insights into demographic-specific failure modes. This work highlights the importance of fairness-aware evaluation for developing a more equitable, robust, and trustworthy audio deepfake detection system.
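The EER referenced throughout is the operating point where the false-acceptance rate (spoofed audio accepted) equals the false-rejection rate (bona fide audio rejected). A minimal threshold-sweep sketch of the standard metric (not the paper's evaluation code):

```python
def equal_error_rate(bona, spoof):
    """Approximate EER from detection scores, where higher scores
    mean 'more likely bona fide'. Sweeps candidate thresholds and
    returns the rate at the point where FAR and FRR are closest."""
    thresholds = sorted(set(bona) | set(spoof))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        far = sum(s >= t for s in spoof) / len(spoof)  # spoof accepted
        frr = sum(s < t for s in bona) / len(bona)     # bona fide rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```

The paper's point is that a single pooled EER can look fair while per-group EERs (computed by calling this on each gender's scores separately) diverge; fairness metrics formalize exactly that comparison.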
[423] The Costs of Reproducibility in Music Separation Research: a Replication of Band-Split RNN
Paul Magron, Romain Serizel, Constance Douwes
Main category: cs.SD
TL;DR: Reproducibility study of BSRNN music source separation model, creating improved variant and discussing reproducibility issues in the field.
Details
Motivation: Address reproducibility issues in music source separation research, particularly with the BSRNN model, which shows promise but lacks available code, making reproduction difficult despite its reasonable resource requirements.
Method: Attempted to replicate BSRNN as closely as possible to the original paper through extensive experiments, explored additional variants to create an optimized BSRNN model when the original reproduction failed, and conducted a critical reflection on reproducibility issues.
Result: Created optimized BSRNN variant with performance that largely improves over original, identified key insights on model design and training pipeline, and highlighted substantial time/energy costs saved with available code.
Conclusion: Reproducibility is crucial in music separation research; releasing code and models fosters transparent and sustainable practices; study contributes to awareness of reproducibility importance in the community.
Abstract: Music source separation is the task of isolating the instrumental tracks from a song. Despite its spectacular recent progress, the trend towards more complex architectures and training protocols exacerbates reproducibility issues. The band-split recurrent neural networks (BSRNN) model is promising in this regard, since it yields close to state-of-the-art results on public datasets, and requires reasonable resources for training. Unfortunately, it is not straightforward to reproduce since its full code is not available. In this paper, we attempt to replicate BSRNN as closely as possible to the original paper through extensive experiments, which allows us to conduct a critical reflection on this reproducibility issue. Our contributions are three-fold. First, this study yields several insights on the model design and training pipeline, which sheds light on potential future improvements. In particular, since we were unsuccessful in reproducing the original results, we explore additional variants that ultimately yield an optimized BSRNN model, whose performance largely improves on that of the original. Second, we discuss reproducibility issues from both methodological and practical perspectives. We notably underline how substantial time and energy costs could have been saved upon availability of the full pipeline. Third, our code and pre-trained models are released publicly to foster reproducible research. We hope that this study will contribute to spreading awareness on the importance of reproducible research in the music separation community, and help promote more transparent and sustainable practices.
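The model's namesake operation is splitting the input spectrogram into frequency sub-bands that are then modeled separately. A toy sketch of that splitting step (the band edges here are illustrative, not the splits used in the original BSRNN):

```python
import numpy as np

def band_split(spec, band_edges):
    """Split a (freq_bins, time_frames) spectrogram into frequency
    sub-bands delimited by `band_edges` (bin indices). In band-split
    architectures each sub-band then gets its own projection layer."""
    return [spec[lo:hi, :] for lo, hi in zip(band_edges[:-1], band_edges[1:])]
```

Exact band boundaries are one of the many unreported implementation details that make a faithful replication of such a model costly, which is precisely the paper's argument for releasing full pipelines.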
[424] How Contrastive Decoding Enhances Large Audio Language Models?
Tzu-Quan Lin, Wei-Ping Huang, Yi-Cheng Lin, Hung-yi Lee
Main category: cs.SD
TL;DR: Systematic evaluation of Contrastive Decoding strategies for Large Audio Language Models reveals Audio-Aware Decoding and Audio Contrastive Decoding as most effective, with performance varying by model architecture and error type.
Details
Motivation: While Contrastive Decoding has shown effectiveness for Large Audio Language Models, the underlying mechanisms and comparative efficacy of different CD strategies remain unclear, necessitating systematic evaluation.
Method: Evaluated four distinct CD strategies across diverse LALM architectures, introduced a Transition Matrix framework to map error pattern shifts during inference, and analyzed how CD affects different error types.
Result: Audio-Aware Decoding and Audio Contrastive Decoding identified as most effective methods, but impact varies significantly by model. CD reliably corrects errors where models falsely claim absence of audio or resort to uncertainty-driven guessing, but fails to correct flawed reasoning or confident misassertions.
Conclusion: Findings provide clear guidelines for determining which LALM architectures are most suitable for CD enhancement based on their baseline error profiles, enabling more targeted application of decoding strategies.
Abstract: While Contrastive Decoding (CD) has proven effective at enhancing Large Audio Language Models (LALMs), the underlying mechanisms driving its success and the comparative efficacy of different strategies remain unclear. This study systematically evaluates four distinct CD strategies across diverse LALM architectures. We identify Audio-Aware Decoding and Audio Contrastive Decoding as the most effective methods. However, their impact varies significantly by model. To explain this variability, we introduce a Transition Matrix framework to map error pattern shifts during inference. Our analysis demonstrates that CD reliably rectifies errors in which models falsely claim an absence of audio or resort to uncertainty-driven guessing. Conversely, it fails to correct flawed reasoning or confident misassertions. Ultimately, these findings provide a clear guideline for determining which LALM architectures are most suitable for CD enhancement based on their baseline error profiles.
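Contrastive decoding methods of this family share a common mechanic: at each decoding step, contrast the logits of an audio-conditioned pass against a weakened pass (for example, one without the audio) to amplify what the audio actually contributes. A generic sketch of the usual logit combination (not the exact formulation of any specific method evaluated in the paper):

```python
import numpy as np

def contrastive_logits(logits_with_audio, logits_without_audio, alpha=1.0):
    """Generic contrastive decoding step: boost token probabilities
    that depend on the audio by subtracting an audio-free pass.
    alpha controls the strength of the contrast."""
    strong = np.asarray(logits_with_audio, dtype=float)
    weak = np.asarray(logits_without_audio, dtype=float)
    return (1 + alpha) * strong - alpha * weak
```

Intuitively, a token like "no audio is present" that the audio-free pass also favors gets suppressed, which matches the paper's finding that CD mainly fixes false claims of audio absence and uncertainty-driven guessing.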
[425] Paralinguistic Emotion-Aware Validation Timing Detection in Japanese Empathetic Spoken Dialogue
Zi Haur Pang, Yahui Fu, Yuan Gao, Tatsuya Kawahara
Main category: cs.SD
TL;DR: A speech-based approach for detecting optimal timing of emotional validation in psychotherapy using paralinguistic and emotional cues without textual context, achieving improved performance over conventional baselines.
Details
Motivation: To enable more empathetic human-robot interaction by detecting appropriate timing for emotional validation (a psychotherapy technique) using only speech cues, without relying on textual content, which could enhance emotional support delivery.
Method: Proposes a paralinguistic- and emotion-aware model that: 1) uses continued self-supervised training and fine-tuning on HuBERT backbones to obtain a paralinguistics-aware SSL encoder and a multi-task speech emotion classification encoder, 2) fuses these encoders, and 3) fine-tunes the combined model on the validation timing detection task.
Result: Experimental evaluations on TUT Emotional Storytelling Corpus show significant improvements over conventional speech baselines, demonstrating that non-linguistic speech cues integrated with affect-related representations can effectively detect validation timing.
Conclusion: Non-linguistic speech cues combined with affect representations provide sufficient signal for determining when validation should be expressed, offering a speech-first pathway toward more empathetic human-robot interaction.
Abstract: Emotional Validation is a psychotherapy communication technique that involves recognizing, understanding, and explicitly acknowledging another person’s feelings and actions, which strengthens alliance and reduces negative affect. To maximize the emotional support provided by validation, it is crucial to deliver it with appropriate timing and frequency. This study investigates validation timing detection from the speech perspective. Leveraging both paralinguistic and emotional information, we propose a paralinguistic- and emotion-aware model for validation timing detection without relying on textual context. Specifically, we first conduct continued self-supervised training and fine-tuning on different HuBERT backbones to obtain (i) a paralinguistics-aware Self-Supervised Learning (SSL) encoder and (ii) a multi-task speech emotion classification encoder. We then fuse these encoders and further fine-tune the combined model on the downstream validation timing detection task. Experimental evaluations on the TUT Emotional Storytelling Corpus (TESC) compare multiple models, fusion mechanisms, and training strategies, and demonstrate that the proposed approach achieves significant improvements over conventional speech baselines. Our results indicate that non-linguistic speech cues, when integrated with affect-related representations, carry sufficient signal to decide when validation should be expressed, offering a speech-first pathway toward more empathetic human-robot interaction.
[426] TimberAgent: Gram-Guided Retrieval for Executable Music Effect Control
Shihao He, Yihan Xia, Fang Liu, Taotao Wang, Shengli Zhang
Main category: cs.SD
TL;DR: Texture Resonance Retrieval (TRR) uses Wav2Vec2 Gram matrices for texture-aware audio effect preset retrieval, enabling editable plugin configurations rather than finalized waveforms.
Details
Motivation: There is a semantic gap between perceptual user intent and low-level signal-processing parameters in digital audio workstations. Current methods often output finalized waveforms rather than editable plugin configurations, limiting creative control.
Method: TRR uses Gram matrices of projected mid-level Wav2Vec2 activations to preserve texture-relevant co-activation structure. Evaluated on a guitar-effects benchmark with 1,063 presets and 204 queries, using Protocol-A cross-validation to prevent train-test leakage.
Result: TRR achieves lowest normalized parameter error among evaluated methods (CLAP, Wav2Vec-RAG, Text-RAG, FeatureNN-RAG). Listening study with 26 participants provides complementary perceptual evidence. Results robust to trivial knowledge-base matches.
Conclusion: Texture-aware retrieval is useful for editable audio effect control, though broader personalization and real-audio robustness claims remain unverified. TRR bridges semantic gap between user intent and DSP parameters.
Abstract: Digital audio workstations expose rich effect chains, yet a semantic gap remains between perceptual user intent and low-level signal-processing parameters. We study retrieval-grounded audio effect control, where the output is an editable plugin configuration rather than a finalized waveform. Our focus is Texture Resonance Retrieval (TRR), an audio representation built from Gram matrices of projected mid-level Wav2Vec2 activations. This design preserves texture-relevant co-activation structure. We evaluate TRR on a guitar-effects benchmark with 1,063 candidate presets and 204 queries. The evaluation follows Protocol-A, a cross-validation scheme that prevents train-test leakage. We compare TRR against CLAP and internal retrieval baselines (Wav2Vec-RAG, Text-RAG, FeatureNN-RAG), using min-max normalized metrics grounded in physical DSP parameter ranges. Ablation studies validate TRR’s core design choices: projection dimensionality, layer selection, and projection type. A near-duplicate sensitivity analysis confirms that results are robust to trivial knowledge-base matches. TRR achieves the lowest normalized parameter error among evaluated methods. A multiple-stimulus listening study with 26 participants provides complementary perceptual evidence. We interpret these results as benchmark evidence that texture-aware retrieval is useful for editable audio effect control, while broader personalization and real-audio robustness claims remain outside the verified evidence presented here.
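The Gram-matrix representation at the core of TRR is the same device used in neural style transfer: summarize a stack of activations by their channel-by-channel co-activations, discarding temporal order while keeping texture. A bare-bones sketch (illustrative; TRR additionally projects the Wav2Vec2 activations before this step):

```python
import numpy as np

def gram_matrix(activations):
    """Texture descriptor from (time, channels) activations: the
    channel co-activation matrix, averaged over time. Two signals
    with similar timbral texture but different temporal structure
    yield similar Gram matrices."""
    a = np.asarray(activations, dtype=float)
    return a.T @ a / a.shape[0]  # (channels, channels)
```

Because the time axis is averaged out, retrieval on these descriptors matches presets by how they sound (distortion grit, modulation shimmer) rather than by what is played, which is the desired behavior for effect-preset retrieval.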
[427] Physics-Informed Neural Engine Sound Modeling with Differentiable Pulse-Train Synthesis
Robin Doerfler, Lonce Wyse
Main category: cs.SD
TL;DR: PTR model generates engine sounds by directly modeling exhaust pressure pulses and their propagation through resonators, achieving better audio reconstruction than harmonic-based methods while providing physically interpretable parameters.
Details
Motivation: Traditional neural synthesis methods approximate the spectral characteristics of engine sounds, but engine sounds actually originate from sequential exhaust pressure pulses. The authors want to directly model the underlying physical phenomena rather than just approximate spectral features.
Method: Proposes the Pulse-Train-Resonator (PTR) model: a differentiable synthesis architecture that generates parameterized pulse trains aligned to engine firing patterns, propagates them through recursive Karplus-Strong resonators simulating exhaust acoustics, and integrates physics-informed inductive biases including harmonic decay, pitch modulation, valve dynamics, and exhaust resonances.
Result: Validated on 3 engine types (7.5 hours total audio), PTR achieves 21% improvement in harmonic reconstruction and 5.7% reduction in total loss over harmonic-plus-noise baseline, while providing interpretable parameters corresponding to physical phenomena.
Conclusion: Directly modeling the physical pulse-based nature of engine sounds with physics-informed inductive biases leads to better audio reconstruction and interpretable parameters compared to traditional spectral approximation methods.
Abstract: Engine sounds originate from sequential exhaust pressure pulses rather than sustained harmonic oscillations. While neural synthesis methods typically aim to approximate the resulting spectral characteristics, we propose directly modeling the underlying pulse shapes and temporal structure. We present the Pulse-Train-Resonator (PTR) model, a differentiable synthesis architecture that generates engine audio as parameterized pulse trains aligned to engine firing patterns and propagates them through recursive Karplus-Strong resonators simulating exhaust acoustics. The architecture integrates physics-informed inductive biases including harmonic decay, thermodynamic pitch modulation, valve-dynamics envelopes, exhaust system resonances and derived engine operating modes such as throttle operation and deceleration fuel cutoff (DCFO). Validated on three diverse engine types totaling 7.5 hours of audio, PTR achieves a 21% improvement in harmonic reconstruction and a 5.7% reduction in total loss over a harmonic-plus-noise baseline model, while providing interpretable parameters corresponding to physical phenomena. Complete code, model weights, and audio examples are openly available.
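The Karplus-Strong resonator mentioned here is, at its simplest, a delay line with low-pass feedback: a short excitation (classically a noise burst, here an exhaust pulse) circulates and decays, producing a pitched resonance. A minimal non-differentiable sketch of the classic algorithm (PTR's recursive resonators are more elaborate than this):

```python
import numpy as np

def karplus_strong(excitation, delay, feedback=0.98, n_samples=2000):
    """Classic Karplus-Strong: feed each output sample with a
    low-passed (two-tap average), attenuated copy of the signal
    `delay` samples earlier. Resonant frequency ~ sample_rate / delay."""
    out = np.zeros(n_samples)
    ex = np.asarray(excitation, dtype=float)
    out[:len(ex)] = ex
    for n in range(delay + 1, n_samples):
        out[n] += feedback * 0.5 * (out[n - delay] + out[n - delay - 1])
    return out
```

Feeding such a resonator with a train of pulses timed to the engine's firing pattern, instead of a single pluck, is the basic construction the PTR architecture builds on in differentiable form.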
[428] Noise-Conditioned Mixture-of-Experts Framework for Robust Speaker Verification
Bin Gu, Haitao Zhao, Jibo Wei
Main category: cs.SD
TL;DR: Noise-conditioned mixture-of-experts framework for robust speaker verification that routes inputs to specialized noise-aware expert networks based on noise characteristics.
Details
Motivation: Robust speaker verification under noisy conditions remains challenging; conventional methods learn unified representations, but this paper argues for decomposing the feature space into specialized noise-aware subspaces for better performance.
Method: Proposes a noise-conditioned expert routing mechanism, a universal-model-based expert specialization strategy, and an SNR-decaying curriculum learning protocol to route inputs to expert networks based on noise information while preserving speaker identity.
Result: Comprehensive experiments demonstrate consistent superiority over baselines in speaker verification under diverse noise conditions.
Conclusion: The noise-conditioned mixture-of-experts framework effectively improves model robustness and generalization for speaker verification in noisy environments.
Abstract: Robust speaker verification under noisy conditions remains an open challenge. Conventional deep learning methods learn a robust unified speaker representation space against diverse background noise and achieve significant improvement. In contrast, this paper presents a noise-conditioned mixture-of-experts framework that decomposes the feature space into specialized noise-aware subspaces for speaker verification. Specifically, we propose a noise-conditioned expert routing mechanism, a universal-model-based expert specialization strategy, and an SNR-decaying curriculum learning protocol, collectively improving model robustness and generalization under diverse noise conditions. The proposed method can automatically route inputs to expert networks based on noise information derived from the inputs, where each expert targets distinct noise characteristics while preserving speaker identity information. Comprehensive experiments demonstrate consistent superiority over baselines.
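The core routing idea, a gate conditioned on a noise descriptor weighting the outputs of noise-specialized experts, can be sketched in a few lines. All names and shapes below are illustrative, not the paper's architecture:

```python
import numpy as np

def moe_forward(x, noise_embedding, gate_w, experts):
    """Noise-conditioned mixture-of-experts: a softmax gate over a
    noise descriptor (not the speaker features) decides how much
    each noise-specialized expert contributes to the output."""
    logits = gate_w @ noise_embedding          # one logit per expert
    weights = np.exp(logits - logits.max())    # stable softmax
    weights /= weights.sum()
    return sum(w * expert(x) for w, expert in zip(weights, experts))
```

Conditioning the gate on noise characteristics rather than on the speaker embedding itself is what lets each expert specialize in a noise regime while the routed representation still carries speaker identity.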
[429] MUGEN: Evaluating and Improving Multi-audio Understanding of Large Audio-Language Models
Chih-Kai Yang, Yun-Shao Tsai, Yu-Kai Guo, Ping-Le Tsai, Yen-Ting Piao, Hung-Wei Chen, Ting-Lin Hsiao, Yun-Man Hsu, Ke-Han Lu, Hung-yi Lee
Main category: cs.SD
TL;DR: MUGEN benchmark reveals weaknesses in multi-audio understanding for large audio-language models, showing performance degradation with more concurrent audio inputs and proposing training-free strategies to improve performance.
Details
Motivation: Multi-audio understanding is critical for large audio-language models but remains underexplored, with current models showing weaknesses in handling multiple concurrent audio inputs across speech, general audio, and music domains.
Method: Introduces the MUGEN benchmark for evaluating multi-audio understanding and investigates training-free strategies, including Audio-Permutational Self-Consistency (diversifying audio order) and its combination with Chain-of-Thought reasoning.
Result: Performance degrades sharply as the number of concurrent audio inputs increases; Audio-Permutational Self-Consistency yields accuracy gains of up to 6.28%, and combining it with Chain-of-Thought raises the gains to 6.74%.
Conclusion: Exposes blind spots in current LALMs, identifies input scaling as fundamental bottleneck, and provides foundation for evaluating complex auditory comprehension with proposed strategies showing promising improvements.
Abstract: While multi-audio understanding is critical for large audio-language models (LALMs), it remains underexplored. We introduce MUGEN, a comprehensive benchmark evaluating this capability across speech, general audio, and music. Our experiments reveal consistent weaknesses in multi-audio settings, and performance degrades sharply as the number of concurrent audio inputs increases, identifying input scaling as a fundamental bottleneck. We further investigate training-free strategies and observe that Audio-Permutational Self-Consistency, which diversifies the order of audio candidates, helps models form more robust aggregated predictions, yielding up to 6.28% accuracy gains. Combining this permutation strategy with Chain-of-Thought further improves performance to 6.74%. These results expose blind spots in current LALMs and provide a foundation for evaluating complex auditory comprehension.
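The permutation strategy described above amounts to querying the model once per ordering of the audio clips and majority-voting the answers. A sketch of that aggregation (`answer_fn` is a hypothetical stand-in for a real LALM call; the paper's exact aggregation may differ):

```python
from collections import Counter
from itertools import permutations

def permutation_self_consistency(answer_fn, audio_clips, question):
    """Audio-Permutational Self-Consistency, sketched: ask the same
    question with the clips in every order, then return the most
    frequent answer. Averages out order-sensitivity in the model."""
    answers = [answer_fn(list(p), question) for p in permutations(audio_clips)]
    return Counter(answers).most_common(1)[0][0]
```

With k clips this costs k! model calls, so in practice one would sample a few permutations rather than enumerate all of them; the principle (voting over orderings) is the same.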
[430] EmoSURA: Towards Accurate Evaluation of Detailed and Long-Context Emotional Speech Captions
Xin Jing, Andreas Triantafyllopoulos, Jiadong Wang, Shahin Amiriparian, Jun Luo, Björn Schuller
Main category: cs.SD
TL;DR: EmoSURA: A novel evaluation framework for emotional speech captioning that uses atomic verification of perceptual units against raw audio signals, addressing limitations of traditional metrics and LLM judges.
Details
Motivation: Traditional evaluation metrics for emotional speech captions fail to capture semantic nuances: N-gram metrics are inadequate, and LLM judges suffer from reasoning inconsistency and context collapse when processing long-form descriptions. There is also a scarcity of standardized evaluation resources.
Method: EmoSURA shifts from holistic scoring to atomic verification by decomposing complex captions into Atomic Perceptual Units (self-contained statements about vocal/emotional attributes), then uses audio-grounded verification to validate each unit against the raw speech signal. Also introduces SURABench as a balanced, stratified benchmark.
Result: EmoSURA achieves positive correlation with human judgments and offers more reliable assessment for long-form captions compared to traditional metrics, which showed negative correlations due to sensitivity to caption length.
Conclusion: EmoSURA provides a more reliable evaluation framework for emotional speech captioning by using atomic verification grounded in audio signals, addressing key limitations of existing evaluation methods.
Abstract: Recent advancements in speech captioning models have enabled the generation of rich, fine-grained captions for emotional speech. However, the evaluation of such captions remains a critical bottleneck: traditional N-gram metrics fail to capture semantic nuances, while LLM judges often suffer from reasoning inconsistency and context-collapse when processing long-form descriptions. In this work, we propose EmoSURA, a novel evaluation framework that shifts the paradigm from holistic scoring to atomic verification. EmoSURA decomposes complex captions into Atomic Perceptual Units, which are self-contained statements regarding vocal or emotional attributes, and employs an audio-grounded verification mechanism to validate each unit against the raw speech signal. Furthermore, we address the scarcity of standardized evaluation resources by introducing SURABench, a carefully balanced and stratified benchmark. Our experiments show that EmoSURA achieves a positive correlation with human judgments, offering a more reliable assessment for long-form captions compared to traditional metrics, which demonstrated negative correlations due to their sensitivity to caption length.
[431] SCENEBench: An Audio Understanding Benchmark Grounded in Assistive and Industrial Use Cases
Laya Iyer, Angelina Wang, Sanmi Koyejo
Main category: cs.SD
TL;DR: SCENEBench benchmark evaluates audio understanding beyond ASR across spatial, cross-lingual, environmental, and non-speech categories to assess LALMs’ comprehensive audio comprehension.
Details
Motivation: Current LALMs lack comprehensive evaluation beyond speech recognition; there is a need to measure broader audio understanding for accessibility technology and industrial noise monitoring applications.
Method: Proposes the SCENEBench benchmark suite with four categories: background sound understanding, noise localization, cross-linguistic speech understanding, and vocal characterizer recognition. Uses synthetic audio samples, with validation against natural audio for ecological validity.
Result: Evaluation of five state-of-the-art LALMs reveals critical gaps - performance varies widely across tasks, with some performing below random chance while others achieve high accuracy.
Conclusion: Benchmark identifies specific areas needing improvement in LALMs’ audio comprehension capabilities, providing direction for targeted model enhancements.
Abstract: Advances in large language models (LLMs) have enabled significant capabilities in audio processing, resulting in state-of-the-art models now known as Large Audio Language Models (LALMs). However, minimal work has been done to measure audio understanding beyond automatic speech recognition (ASR). This paper closes that gap by proposing a benchmark suite, SCENEBench (Spatial, Cross-lingual, Environmental, Non-speech Evaluation), that targets a broad form of audio comprehension across four real-world categories: background sound understanding, noise localization, cross-linguistic speech understanding, and vocal characterizer recognition. These four categories are selected based on understudied needs from accessibility technology and industrial noise monitoring. In addition to performance, we also measure model latency. The purpose of this benchmark suite is to assess audio beyond just what words are said - rather, how they are said and the non-speech components of the audio. Because our audio samples are synthetically constructed (e.g., by overlaying two natural audio samples), we further validate our benchmark against 20 natural audio items per task, sub-sampled from existing datasets to match our task criteria, to assess ecological validity. We assess five state-of-the-art LALMs and find critical gaps: performance varies across tasks, with some tasks performing below random chance and others achieving high accuracy. These results provide direction for targeted improvements in model capabilities.
[432] VoiceBridge: General Speech Restoration with One-step Latent Bridge Models
Chi Zhang, Kaiwen Zheng, Zehua Chen, Jun Zhu
Main category: cs.SD
TL;DR: VoiceBridge is a one-step latent bridge model for general speech restoration that can reconstruct 48kHz fullband speech from various distortions using a single latent-to-latent generative process.
Details
Motivation: Existing speech enhancement bridge models are mostly single-task and lack general speech restoration capability. The authors aim to develop a unified model that can handle diverse speech restoration tasks efficiently.
Method: Proposes VoiceBridge with: 1) an energy-preserving variational autoencoder for better waveform-latent alignment, 2) a single latent-to-latent generative process using a scalable transformer, 3) a joint neural prior to handle different low-quality inputs, and 4) joint training of the LBM, decoder, and discriminator without distillation.
Result: Extensive validation shows superior performance across in-domain tasks (denoising, super-resolution) and out-of-domain tasks (refining synthesized speech) on various datasets.
Conclusion: VoiceBridge demonstrates effective general speech restoration capability through a unified latent bridge model that can handle diverse distortions with one-step generation.
Abstract: Bridge models have been investigated in speech enhancement but are mostly single-task, with constrained general speech restoration (GSR) capability. In this work, we propose VoiceBridge, a one-step latent bridge model (LBM) for GSR, capable of efficiently reconstructing 48 kHz fullband speech from diverse distortions. To inherit the advantages of data-domain bridge models, we design an energy-preserving variational autoencoder, enhancing the waveform-latent space alignment over varying energy levels. By compressing waveforms into continuous latent representations, VoiceBridge models \textit{various} GSR tasks with a \textit{single} latent-to-latent generative process backed by a scalable transformer. To alleviate the challenge of reconstructing the high-quality target from distinctively different low-quality priors, we propose a joint neural prior for GSR, uniformly reducing the burden of the LBM in diverse tasks. Building upon these designs, we further investigate the bridge training objective by jointly tuning the LBM, decoder, and discriminator together, transforming the model from a denoiser into a generator and enabling \textit{one-step GSR without distillation}. Extensive validation across in-domain (\textit{e.g.}, denoising and super-resolution) and out-of-domain tasks (\textit{e.g.}, refining synthesized speech) and datasets demonstrates the superior performance of VoiceBridge. Demos: https://VoiceBridgedemo.github.io/.
[433] LARA-Gen: Enabling Continuous Emotion Control for Music Generation Models via Latent Affective Representation Alignment
Jiahao Mei, Xuenan Xu, Zeyu Xie, Zihao Zheng, Ye Tao, Yue Ding, Mengyue Wu
Main category: cs.SD
TL;DR: LARA-Gen enables continuous emotion control in text-to-music generation using latent affective representation alignment and valence-arousal space, outperforming baselines in emotion adherence and music quality.
Details
Motivation: While text-to-music models can generate coherent music from text prompts, they lack fine-grained emotional control. Current approaches struggle with precise emotional manipulation in generated music.
Method: Proposes LARA-Gen framework with: 1) Latent Affective Representation Alignment (LARA) that aligns internal hidden states with an external music understanding model, 2) Emotion control module using continuous valence-arousal space to disentangle emotional attributes from textual content, and 3) A benchmark with curated test set and Emotion Predictor for objective evaluation.
Result: Extensive experiments show LARA-Gen achieves continuous, fine-grained emotion control and significantly outperforms baselines in both emotion adherence and music quality.
Conclusion: LARA-Gen successfully addresses the challenge of fine-grained emotional control in text-to-music generation through latent alignment and continuous emotion representation, establishing a benchmark for future research in emotionally controllable music generation.
Abstract: Recent advances in text-to-music models have enabled coherent music generation from text prompts, yet fine-grained emotional control remains unresolved. We introduce LARA-Gen, a framework for continuous emotion control that aligns the internal hidden states with an external music understanding model through Latent Affective Representation Alignment (LARA), enabling effective training. In addition, we design an emotion control module based on a continuous valence-arousal space, disentangling emotional attributes from textual content and bypassing the bottlenecks of text-based prompting. Furthermore, we establish a benchmark with a curated test set and a robust Emotion Predictor, facilitating objective evaluation of emotional controllability in music generation. Extensive experiments demonstrate that LARA-Gen achieves continuous, fine-grained control of emotion and significantly outperforms baselines in both emotion adherence and music quality. Generated samples are available at https://anonymous2232330.github.io/laragen-web/.
[434] Modeling strategies for speech enhancement in the latent space of a neural audio codec
Sofiene Kammoun, Xavier Alameda-Pineda, Simon Leglaive
Main category: cs.SD
TL;DR: Comparing continuous vs discrete neural audio codec representations for speech enhancement, finding continuous representations work better, with non-autoregressive models being more practical than autoregressive ones.
Details
Motivation: To investigate how continuous vector representations vs discrete token representations from neural audio codecs compare when used as training targets for supervised speech enhancement tasks.
Method: Used both autoregressive and non-autoregressive Conformer-based speech enhancement models, plus a baseline of fine-tuning the NAC encoder directly. Compared continuous latent representation prediction vs discrete token prediction across these architectures.
Result: Three key findings: 1) Predicting continuous latent representations consistently outperforms discrete token prediction, 2) Autoregressive models achieve higher quality but sacrifice intelligibility and efficiency, making non-autoregressive models more practical, 3) Adding encoder fine-tuning yields strongest enhancement metrics but degrades codec reconstruction quality.
Conclusion: Continuous representations from neural audio codecs are superior training targets for speech enhancement, with non-autoregressive models offering the best practical trade-off between quality and efficiency.
Abstract: Neural audio codecs (NACs) provide compact latent speech representations in the form of sequences of continuous vectors or discrete tokens. In this work, we investigate how these two types of speech representations compare when used as training targets for supervised speech enhancement. We consider both autoregressive and non-autoregressive speech enhancement models based on the Conformer architecture, as well as a simple baseline where the NAC encoder is simply fine-tuned for speech enhancement. Our experiments reveal three key findings: predicting continuous latent representations consistently outperforms discrete token prediction; autoregressive models achieve higher quality but at the expense of intelligibility and efficiency, making non-autoregressive models more attractive in practice; and adding encoder fine-tuning yields the strongest enhancement metrics overall, though at the cost of degraded codec reconstruction. The code and audio samples are available online.
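To make the two training-target choices above concrete, here is a minimal NumPy sketch contrasting them. All shapes, the toy codebook, and every variable name are our own illustration, not the paper's models: the continuous view regresses predictions onto codec latents with an MSE loss, while the discrete view classifies each frame into its nearest-codebook token with a cross-entropy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: T codec frames with D latent dims, and a hypothetical
# codebook of V entries for the discrete-token view.
T, D, V = 8, 4, 16
target_latents = rng.normal(size=(T, D))      # continuous NAC latents
codebook = rng.normal(size=(V, D))            # toy codebook (illustrative)

# Continuous target: regress predictions onto the codec latents (MSE).
pred_latents = target_latents + 0.1 * rng.normal(size=(T, D))
mse_loss = float(np.mean((pred_latents - target_latents) ** 2))

# Discrete target: map each frame to its nearest codebook entry and
# train a cross-entropy over V token classes.
dists = ((target_latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
target_tokens = dists.argmin(axis=1)          # (T,) token ids
logits = rng.normal(size=(T, V))              # untrained toy predictions
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
ce_loss = float(-log_probs[np.arange(T), target_tokens].mean())

print(f"continuous MSE target: {mse_loss:.4f}")
print(f"discrete CE target:    {ce_loss:.4f}")
```

Per the paper's findings, the regression view is the stronger target in practice, with non-autoregressive decoding preferred for intelligibility and efficiency.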
[435] Scalable Neural Vocoder from Range-Null Space Decomposition
Andong Li, Tong Lei, Zhihang Sun, Rilin Chen, Xiaodong Li, Dong Yu, Chengshi Zheng
Main category: cs.SD
TL;DR: RNDVoC: A novel neural vocoder using range-null decomposition theory in time-frequency domain with dual-path framework for efficient and scalable speech synthesis.
Details
Motivation: To address challenges in neural vocoders: opaque modeling, inflexible retraining under different input configurations, and parameter-performance trade-off.
Method: Formulates spectrogram reconstruction using range-null decomposition theory, with range-space projecting mel-domain to linear-scale and null-space filling spectral details via neural networks. Uses dual-path framework with hierarchical encoding/decoding and cross/narrow-band modules for spectral modeling.
Result: Achieves state-of-the-art performance with lightweight structure and scalable inference paradigm across various benchmarks.
Conclusion: Proposed RNDVoC framework successfully addresses neural vocoder challenges while maintaining high performance and flexibility.
Abstract: Although deep neural networks have facilitated significant progress of neural vocoders in recent years, they usually suffer from intrinsic challenges like opaque modeling, inflexible retraining under different input configurations, and parameter-performance trade-off. These inherent hurdles can heavily impede the development of this field. To resolve these problems, in this paper, we propose a novel neural vocoder in the time-frequency (T-F) domain. Specifically, we bridge the connection between the classical range-null decomposition (RND) theory and the vocoder task, where the reconstruction of the target spectrogram is formulated into the superimposition between range-space and null-space. The former aims to project the representation in the original mel-domain into the target linear-scale domain, and the latter can be instantiated via neural networks to further infill the spectral details. To fully leverage the spectrum prior, an elaborate dual-path framework is devised, where the spectrum is hierarchically encoded and decoded, and the cross- and narrow-band modules are leveraged for effectively modeling along sub-band and time dimensions. To enable inference under various configurations, we propose a simple yet effective strategy, which transforms the multi-condition adaptation in the inference stage into the data augmentation in the training stage. Comprehensive experiments are conducted on various benchmarks. Quantitative and qualitative results show that while enjoying lightweight network structure and scalable inference paradigm, the proposed framework achieves state-of-the-art performance among existing advanced methods. Code is available at https://github.com/Andong-Li-speech/RNDVoC.
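For readers unfamiliar with the range-null decomposition the abstract builds on, a textbook form of the split (our notation, not the paper's; we assume the mel projection is a linear operator $A$ with pseudoinverse $A^\dagger$ and write $f_\theta$ for the network) is:

```latex
\hat{X} \;=\; \underbrace{A^{\dagger} Y}_{\text{range-space term}}
\;+\; \underbrace{\left(I - A^{\dagger} A\right) f_{\theta}(Y)}_{\text{null-space term}},
\qquad
A\hat{X} \;=\; A A^{\dagger} Y \;=\; Y .
```

The range-space term is fully determined by the observed mel spectrogram $Y$, while $f_\theta$ contributes only components in the null space of $A$, so the reconstruction stays consistent with the input by construction (since $A(I - A^\dagger A) = 0$).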
[436] PolyBench: A Benchmark for Compositional Reasoning in Polyphonic Audio
Yuanjian Chen, Yang Xiao, Han Yin, Xubo Liu, Jinjie Huang, Ting Dang
Main category: cs.SD
TL;DR: Summary unavailable for arXiv 2603.05128; the summarization pipeline's arXiv API request was rate-limited (HTTP 429), so no details or abstract could be generated for this entry.
cs.LG
[437] Equitable Multi-Task Learning for AI-RANs
Panayiotis Raptis, Fatih Aslan, George Iosifidis
Main category: cs.LG
TL;DR: Online fair multi-task learning framework for AI-RANs that ensures long-term equity across users through nested learning loops with primal-dual updates.
Details
Motivation: AI-enabled Radio Access Networks need to serve heterogeneous users with time-varying learning tasks over shared edge resources while ensuring equitable inference performance across users, requiring adaptive and fair learning mechanisms.
Method: OWO-FMTL (Online-within-Online Fair Multi-Task Learning) framework with two learning loops: outer loop updates shared model across rounds, inner loop rebalances user priorities within each round using lightweight primal-dual updates. Uses generalized alpha-fairness to quantify equity trade-offs.
Result: The framework guarantees diminishing performance disparity over time and operates with low computational overhead suitable for edge deployment. Experiments on convex and deep learning tasks show OWO-FMTL outperforms existing multi-task learning baselines under dynamic scenarios.
Conclusion: OWO-FMTL provides an effective solution for fair multi-task learning in AI-RANs, ensuring long-term equity across users while maintaining efficiency and low computational overhead for edge deployment.
Abstract: AI-enabled Radio Access Networks (AI-RANs) are expected to serve heterogeneous users with time-varying learning tasks over shared edge resources. Ensuring equitable inference performance across these users requires adaptive and fair learning mechanisms. This paper introduces an online-within-online fair multi-task learning (OWO-FMTL) framework that ensures long-term equity across users. The method combines two learning loops: an outer loop updating the shared model across rounds and an inner loop rebalancing user priorities within each round with a lightweight primal-dual update. Equity is quantified via generalized alpha-fairness, allowing a trade-off between efficiency and fairness. The framework guarantees diminishing performance disparity over time and operates with low computational overhead suitable for edge deployment. Experiments on convex and deep learning tasks confirm that OWO-FMTL outperforms existing multi-task learning baselines under dynamic scenarios.
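The generalized alpha-fairness mentioned above has a standard closed form. The toy sketch below (our own code, not the OWO-FMTL implementation; the paper's exact parameterization may differ) shows how raising $\alpha$ increasingly penalizes unequal allocations with the same total:

```python
import math

def alpha_fair_utility(x: float, alpha: float) -> float:
    """Generalized alpha-fairness utility U_alpha(x) for x > 0.

    alpha = 0 recovers plain efficiency (sum of values),
    alpha = 1 recovers proportional fairness (log utility),
    alpha -> inf approaches max-min fairness.
    """
    if alpha == 1.0:
        return math.log(x)
    return x ** (1.0 - alpha) / (1.0 - alpha)

# Two allocations with the same total but different spread.
even = [1.0, 1.0]
uneven = [1.9, 0.1]
for alpha in (0.0, 1.0, 2.0):
    u_even = sum(alpha_fair_utility(x, alpha) for x in even)
    u_uneven = sum(alpha_fair_utility(x, alpha) for x in uneven)
    print(f"alpha={alpha}: even={u_even:.3f} uneven={u_uneven:.3f}")
```

At `alpha=0` the two allocations score identically; for any `alpha > 0` the even split wins, which is the efficiency-fairness dial the abstract describes.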
[438] Hindsight Credit Assignment for Long-Horizon LLM Agents
Hui-Ze Tan, Xiao-Wen Yang, Hao Chen, Jie-Jing Shao, Yi Wen, Yuteng Shen, Weihong Luo, Xiku Du, Lan-Zhe Guo, Yu-Feng Li
Main category: cs.LG
TL;DR: HCAPO introduces hindsight credit assignment to LLM agents for long-horizon tasks, using LLMs as post-hoc critics to refine step-level Q-values and multi-scale advantage mechanisms to address sparse reward challenges.
Details
Motivation: LLM agents struggle with credit assignment in long-horizon tasks due to sparse rewards. Existing methods like GRPO have inaccurate step-level Q-value estimation and misaligned value baselines for intermediate states.
Method: HCAPO integrates hindsight credit assignment by using the LLM itself as a post-hoc critic to refine step-level Q-values through hindsight reasoning, and employs a multi-scale advantage mechanism to supplement inaccurate value baselines at critical decision states.
Result: HCAPO outperforms state-of-the-art RL methods across three benchmarks, achieving 7.7% improvement in success rate on WebShop and 13.8% on ALFWorld over GRPO using Qwen2.5-7B-Instruct model.
Conclusion: HCAPO significantly enhances exploration efficiency, promotes concise decision-making, and ensures scalability in complex, long-horizon tasks by addressing fundamental credit assignment challenges in LLM agents.
Abstract: Large Language Model (LLM) agents often face significant credit assignment challenges in long-horizon, multi-step tasks due to sparse rewards. Existing value-free methods, such as Group Relative Policy Optimization (GRPO), encounter two fundamental bottlenecks: inaccurate step-level Q-value estimation and misaligned value baselines for intermediate states. To address these limitations, we introduce HCAPO, the first framework to integrate hindsight credit assignment into LLM agents. HCAPO leverages the LLM itself as a post-hoc critic to refine step-level Q-values through hindsight reasoning. Furthermore, HCAPO’s multi-scale advantage mechanism effectively supplements the inaccurate value baselines at critical decision states. Evaluations across three challenging benchmarks, including WebShop and ALFWorld, demonstrate that HCAPO consistently outperforms state-of-the-art RL methods. Notably, HCAPO achieves a 7.7% improvement in success rate on WebShop and a 13.8% on ALFWorld over GRPO using the Qwen2.5-7B-Instruct model. These results indicate that HCAPO significantly enhances exploration efficiency, promotes concise decision-making, and ensures scalability in complex, long-horizon tasks.
[439] Generalized Reduction to the Isotropy for Flexible Equivariant Neural Fields
Alejandro García-Castellanos, Gijs Bellaard, Remco Duits, Daniel Pelt, Erik J Bekkers
Main category: cs.LG
TL;DR: The paper presents a mathematical framework for reducing invariant functions on product spaces to simpler isotropy subgroup actions, enabling extension of Equivariant Neural Fields to arbitrary group actions.
Details
Motivation: Many geometric learning problems involve heterogeneous product spaces with different group actions where standard techniques don't apply. Existing methods for Equivariant Neural Fields impose structural constraints that limit their applicability.
Method: The authors show that when a group G acts transitively on space M, any G-invariant function on product space X×M can be reduced to an invariant of the isotropy subgroup H of M acting on X alone. They establish an explicit orbit equivalence (X×M)/G ≅ X/H, providing a principled reduction that preserves expressivity.
Result: This characterization enables extension of Equivariant Neural Fields to arbitrary group actions and homogeneous conditioning spaces, removing major structural constraints imposed by existing methods.
Conclusion: The framework provides a general mathematical foundation for handling invariant functions on heterogeneous product spaces, significantly expanding the applicability of equivariant neural networks in geometric learning problems.
Abstract: Many geometric learning problems require invariants on heterogeneous product spaces, i.e., products of distinct spaces carrying different group actions, where standard techniques do not directly apply. We show that, when a group $G$ acts transitively on a space $M$, any $G$-invariant function on a product space $X \times M$ can be reduced to an invariant of the isotropy subgroup $H$ of $M$ acting on $X$ alone. Our approach establishes an explicit orbit equivalence $(X \times M)/G \cong X/H$, yielding a principled reduction that preserves expressivity. We apply this characterization to Equivariant Neural Fields, extending them to arbitrary group actions and homogeneous conditioning spaces, and thereby removing the major structural constraints imposed by existing methods.
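The reduction can be sketched in our own notation (a standard transporter argument, not necessarily the paper's exact construction): fix a base point $m_0 \in M$ with stabilizer $H$, and for each $m$ choose a transporter $g_m \in G$ with $g_m \cdot m_0 = m$. $G$-invariance of $f$ then gives

```latex
f(x, m) \;=\; f\big(g_m^{-1}\cdot x,\; g_m^{-1}\cdot m\big)
\;=\; f\big(g_m^{-1}\cdot x,\; m_0\big)
\;=:\; \tilde{f}\big(g_m^{-1}\cdot x\big),
\qquad
\tilde{f}(h\cdot x') \;=\; f\big(h\cdot x',\, h\cdot m_0\big)
\;=\; f(x', m_0) \;=\; \tilde{f}(x') \quad \forall\, h \in H .
```

So the $G$-invariant $f$ on $X \times M$ is fully determined by the $H$-invariant $\tilde{f}$ on $X$, which is exactly the orbit equivalence $(X \times M)/G \cong X/H$ stated in the abstract.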
[440] SPREAD: Subspace Representation Distillation for Lifelong Imitation Learning
Kaushik Roy, Giovanni D’urso, Nicholas Lawrance, Brendan Tidd, Peyman Moghadam
Main category: cs.LG
TL;DR: SPREAD: Geometry-preserving lifelong imitation learning framework using SVD alignment in low-rank subspaces and confidence-guided distillation to maintain task manifolds across sequential learning.
Details
Motivation: Existing lifelong imitation learning methods struggle with catastrophic forgetting and fail to preserve intrinsic task manifolds when learning new skills from expert demonstrations. Current distillation approaches using L2-norm feature matching are sensitive to noise and high-dimensional variability.
Method: SPREAD uses singular value decomposition (SVD) to align policy representations across tasks within low-rank subspaces, preserving underlying geometry of multimodal features. Also introduces confidence-guided distillation with KL divergence loss restricted to top-M most confident action samples.
Result: Experiments on LIBERO benchmark show SPREAD substantially improves knowledge transfer, mitigates catastrophic forgetting, and achieves state-of-the-art performance in lifelong imitation learning.
Conclusion: SPREAD’s geometry-preserving approach through SVD alignment and confidence-guided distillation effectively maintains task manifolds, enabling stable knowledge transfer and addressing key challenges in lifelong imitation learning.
Abstract: A key challenge in lifelong imitation learning (LIL) is enabling agents to acquire new skills from expert demonstrations while retaining prior knowledge. This requires preserving the low-dimensional manifolds and geometric structures that underlie task representations across sequential learning. Existing distillation methods, which rely on L2-norm feature matching in raw feature space, are sensitive to noise and high-dimensional variability, often failing to preserve intrinsic task manifolds. To address this, we introduce SPREAD, a geometry-preserving framework that employs singular value decomposition (SVD) to align policy representations across tasks within low-rank subspaces. This alignment maintains the underlying geometry of multimodal features, facilitating stable transfer, robustness, and generalization. Additionally, we propose a confidence-guided distillation strategy that applies a Kullback-Leibler divergence loss restricted to the top-M most confident action samples, emphasizing reliable modes and improving optimization stability. Experiments on the LIBERO, lifelong imitation learning benchmark, show that SPREAD substantially improves knowledge transfer, mitigates catastrophic forgetting, and achieves state-of-the-art performance.
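One standard way to compare the low-rank subspaces that SPREAD aligns is through their orthogonal projectors. The NumPy sketch below is our own illustration of that idea, not the paper's loss: it returns zero exactly when two feature matrices span the same top-$r$ left singular subspace, and grows as the subspaces rotate apart.

```python
import numpy as np

def top_r_projector(feats: np.ndarray, r: int) -> np.ndarray:
    """Orthogonal projector onto the top-r left singular subspace
    of a (n_samples x dim) feature matrix."""
    u, _, _ = np.linalg.svd(feats.T, full_matrices=False)  # dim x n
    ur = u[:, :r]
    return ur @ ur.T

def subspace_alignment_loss(old_feats, new_feats, r=2):
    """Squared Frobenius distance between the two projectors;
    zero iff the top-r subspaces coincide."""
    p_old = top_r_projector(old_feats, r)
    p_new = top_r_projector(new_feats, r)
    return float(np.linalg.norm(p_old - p_new, "fro") ** 2)

rng = np.random.default_rng(0)
feats = rng.normal(size=(32, 8))

loss_same = subspace_alignment_loss(feats, feats)
loss_diff = subspace_alignment_loss(feats, feats + rng.normal(size=(32, 8)))
print(loss_same, loss_diff)  # identical features -> ~0; perturbed -> > 0
```

Unlike raw L2 feature matching, this comparison is invariant to how features are distributed *within* the subspace, which is the geometry-preserving intuition behind the method.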
[441] Multi-level meta-reinforcement learning with skill-based curriculum
Sichen Yang, Mauro Maggioni
Main category: cs.LG
TL;DR: A multi-level MDP compression framework that hierarchically compresses policies into skills, enabling efficient solving of complex sequential decision problems through abstraction, transfer learning, and curriculum learning.
Details
Motivation: Address the challenge of systematically inferring and leveraging hierarchical structure in sequential decision making with natural multi-level structure, where complex goals are accomplished by assembling sub-tasks.
Method: Multi-level procedure for repeatedly compressing MDPs where parametric policy families become single actions at higher levels, preserving semantic meaning and structure. Higher-level MDPs are independent with less stochasticity, enabling spatial/temporal coarsening. Policies factor into embeddings (problem-specific) and skills (including higher-order functions).
Result: The framework decouples sub-tasks, reduces unnecessary stochasticity and policy search space, leading to fewer iterations and computations. Enables skill transfer across problems and levels, framed within curriculum learning where difficulty gradually increases.
Conclusion: Provides a consistent framework for hierarchical abstraction, transferability, and curriculum learning in complex sequential decision problems, with guaranteed benefits under mild assumptions, demonstrated in examples including MazeBase+.
Abstract: We consider problems in sequential decision making with natural multi-level structure, where sub-tasks are assembled together to accomplish complex goals. Systematically inferring and leveraging hierarchical structure has remained a longstanding challenge; we describe an efficient multi-level procedure for repeatedly compressing Markov decision processes (MDPs), wherein a parametric family of policies at one level is treated as single actions in the compressed MDPs at higher levels, while preserving the semantic meanings and structure of the original MDP, and mimicking the natural logic to address a complex MDP. Higher-level MDPs are themselves independent MDPs with less stochasticity, and may be solved using existing algorithms. As a byproduct, spatial or temporal scales may be coarsened at higher levels, making it more efficient to find long-term optimal policies. The multi-level representation delivered by this procedure decouples sub-tasks from each other and usually greatly reduces unnecessary stochasticity and the policy search space, leading to fewer iterations and computations when solving the MDPs. A second fundamental aspect of this work is that these multi-level decompositions plus the factorization of policies into embeddings (problem-specific) and skills (including higher-order functions) yield new transfer opportunities of skills across different problems and different levels. This whole process is framed within curriculum learning, wherein a teacher organizes the student agent's learning process in a way that gradually increases the difficulty of tasks and promotes transfer across MDPs and levels within and across curricula. The consistency of this framework and its benefits can be guaranteed under mild assumptions. We demonstrate abstraction, transferability, and curriculum learning in examples, including MazeBase+, a more complex variant of the MazeBase example.
[442] The Temporal Markov Transition Field
Michael Leznik
Main category: cs.LG
TL;DR: Temporal Markov Transition Field (TMTF) extends MTF by partitioning time series into temporal chunks, estimating local transition matrices for each chunk to preserve regime change information, creating images with distinct horizontal bands encoding segment-specific dynamics.
Details
Motivation: The original Markov Transition Field (MTF) loses information about when dynamical regimes change over time because it uses a single global transition matrix that averages across regimes. This is problematic for non-stationary processes with changing dynamics.
Method: Partition time series into K contiguous temporal chunks, estimate separate local transition matrices for each chunk, and assemble the image so each row reflects local dynamics rather than global averages, creating T×T images with K horizontal bands of distinct texture.
Result: TMTF produces images with distinct horizontal bands encoding transition dynamics of each temporal segment, preserves information about when dynamical regimes were active, and is amplitude-agnostic and order-preserving for CNN applications.
Conclusion: TMTF addresses the limitation of MTF for non-stationary processes by preserving temporal regime information through local transition matrices, making it suitable for time series characterization tasks with convolutional neural networks.
Abstract: The Markov Transition Field (MTF), introduced by Wang and Oates (2015), encodes a time series as a two-dimensional image by mapping each pair of time steps to the transition probability between their quantile states, estimated from a single global transition matrix. This construction is efficient when the transition dynamics are stationary, but produces a misleading representation when the process changes regime over time: the global matrix averages across regimes and the resulting image loses all information about \emph{when} each dynamical regime was active. In this paper we introduce the \emph{Temporal Markov Transition Field} (TMTF), an extension that partitions the series into $K$ contiguous temporal chunks, estimates a separate local transition matrix for each chunk, and assembles the image so that each row reflects the dynamics local to its chunk rather than the global average. The resulting $T \times T$ image has $K$ horizontal bands of distinct texture, each encoding the transition dynamics of one temporal segment. We develop the formal definition, establish the key structural properties of the representation, work through a complete numerical example that makes the distinction from the global MTF concrete, analyse the bias–variance trade-off introduced by temporal chunking, and discuss the geometric interpretation of the local transition matrices in terms of process properties such as persistence, mean reversion, and trending behaviour. The TMTF is amplitude-agnostic and order-preserving, making it suitable as an input channel for convolutional neural networks applied to time series characterisation tasks.
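The construction is simple enough to sketch end to end. The NumPy code below is our own illustration of the definition (function names are hypothetical): bin the series into $Q$ quantile states, fit one transition matrix per temporal chunk, and fill row $i$ of the image from the matrix of the chunk containing time step $i$.

```python
import numpy as np

def quantile_states(x, Q):
    """Assign each value to one of Q quantile bins (0..Q-1)."""
    edges = np.quantile(x, np.linspace(0, 1, Q + 1)[1:-1])
    return np.searchsorted(edges, x, side="right")

def transition_matrix(states, Q):
    """Row-normalized first-order transition counts."""
    W = np.zeros((Q, Q))
    for a, b in zip(states[:-1], states[1:]):
        W[a, b] += 1
    rows = W.sum(axis=1, keepdims=True)
    return np.divide(W, rows, out=np.zeros_like(W), where=rows > 0)

def tmtf(x, Q=4, K=2):
    """Temporal MTF: a T x T image whose row i uses the transition
    matrix of the chunk that contains time step i (K=1 gives the
    global MTF as a special case)."""
    T = len(x)
    s = quantile_states(x, Q)
    chunks = np.array_split(np.arange(T), K)
    mats = [transition_matrix(s[c], Q) for c in chunks]
    chunk_of = np.empty(T, dtype=int)
    for k, c in enumerate(chunks):
        chunk_of[c] = k
    img = np.empty((T, T))
    for i in range(T):
        img[i, :] = mats[chunk_of[i]][s[i], s]  # P_local(s_i -> s_j)
    return img

# Regime change: an upward trend followed by mean-reverting wiggles.
x = np.concatenate([np.linspace(0, 1, 32), 0.5 + 0.01 * np.sin(np.arange(32))])
img = tmtf(x, Q=4, K=2)
print(img.shape)  # (64, 64)
```

The two row-bands of `img` carry different textures because the trending and mean-reverting segments have different local transition matrices, which is exactly the regime information the global MTF averages away.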
[443] SoftJAX & SoftTorch: Empowering Automatic Differentiation Libraries with Informative Gradients
Anselm Paulus, A. René Geist, Vít Musil, Sebastian Hoffmann, Onur Beker, Georg Martius
Main category: cs.LG
TL;DR: SoftJAX and SoftTorch are open-source libraries providing soft differentiable replacements for hard primitives (thresholding, Boolean logic, indexing, sorting) in JAX and PyTorch, enabling gradient-based optimization where standard AD fails.
Details
Motivation: Many "hard" primitives in AD frameworks (thresholding, Boolean logic, discrete indexing, sorting) yield zero or undefined gradients, making them unsuitable for gradient-based optimization. While soft relaxations exist, implementations are fragmented across projects, making them difficult to combine and compare.
Method: Develop SoftJAX and SoftTorch as feature-complete libraries providing soft differentiable functions as drop-in replacements for hard JAX/PyTorch counterparts. Includes: (i) elementwise operators (clip, abs), (ii) Boolean/index manipulation via fuzzy logic, (iii) axiswise operators (sort, rank) using optimal transport or permutahedron projections, and (iv) full straight-through gradient estimation support.
Result: Open-source libraries that make soft relaxations easily accessible for differentiable programming, demonstrated through benchmarking and a practical case study. Code available on GitHub.
Conclusion: SoftJAX and SoftTorch provide a comprehensive toolbox of soft differentiable functions that enable gradient-based optimization for previously “hard” operations, addressing fragmentation in the field and making these techniques more accessible.
Abstract: Automatic differentiation (AD) frameworks such as JAX and PyTorch have enabled gradient-based optimization for a wide range of scientific fields. Yet, many “hard” primitives in these libraries such as thresholding, Boolean logic, discrete indexing, and sorting operations yield zero or undefined gradients that are not useful for optimization. While numerous “soft” relaxations have been proposed that provide informative gradients, the respective implementations are fragmented across projects, making them difficult to combine and compare. This work introduces SoftJAX and SoftTorch, open-source, feature-complete libraries for soft differentiable programming. These libraries provide a variety of soft functions as drop-in replacements for their hard JAX and PyTorch counterparts. This includes (i) elementwise operators such as clip or abs, (ii) utility methods for manipulating Booleans and indices via fuzzy logic, (iii) axiswise operators such as sort or rank – based on optimal transport or permutahedron projections, and (iv) full support for straight-through gradient estimation. Overall, SoftJAX and SoftTorch make the toolbox of soft relaxations easily accessible to differentiable programming, as demonstrated through benchmarking and a practical case study. Code is available at github.com/a-paulus/softjax and github.com/a-paulus/softtorch.
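The flavor of these relaxations is easy to illustrate. The snippet below shows generic smooth stand-ins for abs, a hard threshold, and Boolean and/or via product fuzzy logic; these are our own minimal examples of the idea, not the SoftJAX/SoftTorch API.

```python
import math

def soft_abs(x, eps=1e-3):
    """Smooth |x| with a well-defined gradient at 0: sqrt(x^2 + eps)."""
    return math.sqrt(x * x + eps)

def soft_step(x, temp=0.1):
    """Sigmoid relaxation of the Heaviside step 1[x > 0]; the
    temperature trades off sharpness against gradient signal."""
    return 1.0 / (1.0 + math.exp(-x / temp))

# Product fuzzy-logic replacements for Boolean and/or on [0, 1] values.
def fuzzy_and(a, b):
    return a * b

def fuzzy_or(a, b):
    return a + b - a * b

print(soft_abs(0.0))        # ~0.032, not 0: the kink is smoothed away
print(soft_step(0.5))       # close to 1, but differentiable everywhere
print(fuzzy_and(0.9, 0.8), fuzzy_or(0.9, 0.8))
```

In an AD framework these functions would be differentiated automatically; the straight-through estimators the paper also supports instead keep the hard forward pass and substitute a surrogate gradient in the backward pass.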
[444] Are Expressive Encoders Necessary for Discrete Graph Generation?
Jay Revolinsky, Harry Shomer, Jiliang Tang
Main category: cs.LG
TL;DR: GenGNN is a modular message-passing framework for discrete graph generation that achieves competitive performance with transformers while being 2-5x faster, with applications to molecule generation achieving 99.49% validity.
Details
Motivation: The paper addresses the design choice of using highly expressive neural backbones like transformers for discrete graph generation, proposing to revisit this approach with a more efficient GNN-based framework.
Method: Introduces GenGNN, a modular message-passing framework for graph generation, used as backbone for diffusion models. Includes systematic ablation studies and scaling analyses with a metric-space view to investigate learned diffusion representations.
Result: GenGNN achieves >90% validity on Tree and Planar datasets (comparable to graph transformers) at 2-5x faster inference. For molecule generation, DiGress with GenGNN achieves 99.49% validity. Ablation shows residual connections help mitigate oversmoothing on complex graph structures.
Conclusion: GNNs can serve as expressive neural backbones for discrete diffusion models, offering competitive performance with transformers while being significantly faster, making them a viable alternative for graph generation tasks.
Abstract: Discrete graph generation has emerged as a powerful paradigm for modeling graph data, often relying on highly expressive neural backbones such as transformers or higher-order architectures. We revisit this design choice by introducing GenGNN, a modular message-passing framework for graph generation. Diffusion models with GenGNN achieve more than 90% validity on Tree and Planar datasets, within the margins of graph transformers, at 2-5x faster inference speed. For molecule generation, DiGress with a GenGNN backbone achieves 99.49% Validity. A systematic ablation study shows the benefit provided by each GenGNN component, indicating the need for residual connections to mitigate oversmoothing on complicated graph structures. Through scaling analyses, we apply a principled metric-space view to investigate learned diffusion representations and uncover whether GNNs can be expressive neural backbones for discrete diffusion.
[445] Strategically Robust Multi-Agent Reinforcement Learning with Linear Function Approximation
Jake Gonzales, Max Horwitz, Eric Mazumdar, Lillian J. Ratliff
Main category: cs.LG
TL;DR: RQRE-OVI: A scalable algorithm for computing Risk-Sensitive Quantal Response Equilibrium in general-sum Markov games with linear function approximation, offering unique, smooth solutions with provable convergence and robustness properties.
Details
Motivation: Nash equilibrium in general-sum Markov games is computationally intractable, brittle due to equilibrium multiplicity, and sensitive to approximation errors. There's a need for more robust, scalable equilibrium concepts that can handle large/continuous state spaces while providing unique solutions.
Method: Proposes RQRE-OVI (Risk-Sensitive Quantal Response Equilibrium - Optimistic Value Iteration), an algorithm for computing RQRE with linear function approximation. Uses optimistic value iteration with finite-sample regret analysis to establish convergence properties.
Result: Establishes convergence with explicit sample complexity scaling with rationality and risk-sensitivity parameters. Shows RQRE policy map is Lipschitz continuous (unlike Nash), reveals Pareto frontier between performance and robustness, and demonstrates competitive self-play performance with substantially improved robustness in cross-play compared to Nash-based approaches.
Conclusion: RQRE-OVI offers a principled, scalable, tunable path for equilibrium learning with improved robustness and generalization, recovering Nash equilibrium as a special case in the limit of perfect rationality and risk neutrality.
Abstract: Provably efficient and robust equilibrium computation in general-sum Markov games remains a core challenge in multi-agent reinforcement learning. Nash equilibrium is computationally intractable in general and brittle due to equilibrium multiplicity and sensitivity to approximation error. We study Risk-Sensitive Quantal Response Equilibrium (RQRE), which yields a unique, smooth solution under bounded rationality and risk sensitivity. We propose \texttt{RQRE-OVI}, an optimistic value iteration algorithm for computing RQRE with linear function approximation in large or continuous state spaces. Through finite-sample regret analysis, we establish convergence and explicitly characterize how sample complexity scales with rationality and risk-sensitivity parameters. The regret bounds reveal a quantitative tradeoff: increasing rationality tightens regret, while risk sensitivity induces regularization that enhances stability and robustness. This exposes a Pareto frontier between expected performance and robustness, with Nash recovered in the limit of perfect rationality and risk neutrality. We further show that the RQRE policy map is Lipschitz continuous in estimated payoffs, unlike Nash, and RQRE admits a distributionally robust optimization interpretation. Empirically, we demonstrate that \texttt{RQRE-OVI} achieves competitive performance under self-play while producing substantially more robust behavior under cross-play compared to Nash-based approaches. These results suggest \texttt{RQRE-OVI} offers a principled, scalable, and tunable path for equilibrium learning with improved robustness and generalization.
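The bounded-rationality idea behind quantal response can be sketched for a one-shot matrix game (a toy damped fixed-point iteration, not RQRE-OVI; it omits risk sensitivity, Markov dynamics, and function approximation):

```python
import numpy as np

def softmax(x, lam):
    """Logit response: lam -> 0 gives uniform play, lam -> inf approaches
    a best response (and hence Nash in the limit)."""
    z = np.exp(lam * (x - x.max()))
    return z / z.sum()

def quantal_response_eq(A, B, lam=1.0, iters=500):
    """Logit quantal response equilibrium of a two-player matrix game
    (payoffs A for player 1, B for player 2) via damped fixed-point
    iteration on softmax responses."""
    p = np.full(A.shape[0], 1 / A.shape[0])
    q = np.full(A.shape[1], 1 / A.shape[1])
    for _ in range(iters):
        p = 0.5 * p + 0.5 * softmax(A @ q, lam)
        q = 0.5 * q + 0.5 * softmax(B.T @ p, lam)
    return p, q

A = np.array([[2., 0.], [0., 1.]])
B = A.copy()                       # toy coordination game
p, q = quantal_response_eq(A, B, lam=2.0)
```

The rationality parameter `lam` is the tunable knob the paper's regret analysis studies: low `lam` smooths the policy map (unlike Nash, which can jump discontinuously), high `lam` recovers sharp best responses.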
[446] Expressivity-Efficiency Tradeoffs for Hybrid Sequence Models
John Cooper, Ilias Diakonikolas, Mingchen Ma, Frederic Sala
Main category: cs.LG
TL;DR: Hybrid models combining Transformers and state-space models achieve computational efficiency and expressive power that neither model alone can achieve for certain sequence tasks.
Details
Motivation: To understand when and why hybrid Transformer-state-space models outperform their constituent models, particularly for computational efficiency and expressive versatility in sequence modeling tasks.
Method: Theoretical analysis of fundamental limitations for non-hybrid models on synthetic tasks, construction of provably efficient hybrid models for selective copying and associative recall, and empirical validation through experiments comparing learned hybrids against non-hybrid models.
Result: Hybrid models require fewer parameters and less working memory than pure Transformers or state-space models for certain tasks, with learned hybrids outperforming non-hybrids with up to 6x more parameters, and showing better length generalization and out-of-distribution robustness.
Conclusion: Hybrid Transformer-state-space models offer fundamental advantages over their constituent models for specific sequence tasks, achieving both computational efficiency and expressive power that neither model alone can provide.
Abstract: Hybrid sequence models–combining Transformer and state-space model layers–seek to gain the expressive versatility of attention as well as the computational efficiency of state-space model layers. Despite burgeoning interest in hybrid models, we lack a basic understanding of the settings where–and underlying mechanisms through which–they offer benefits over their constituent models. In this paper, we study this question, focusing on a broad family of core synthetic tasks. For this family of tasks, we prove the existence of fundamental limitations for non-hybrid models. Specifically, any Transformer or state-space model that solves the underlying task requires either a large number of parameters or a large working memory. On the other hand, for two prototypical tasks within this family–namely selective copying and associative recall–we construct hybrid models of small size and working memory that provably solve these tasks, thus achieving the best of both worlds. Our experimental evaluation empirically validates our theoretical findings. Importantly, going beyond the settings in our theoretical analysis, we empirically show that learned–rather than constructed–hybrids outperform non-hybrid models with up to 6x as many parameters. We additionally demonstrate that hybrid models exhibit stronger length generalization and out-of-distribution robustness than non-hybrids.
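Associative recall, one of the two prototypical tasks the paper analyzes, is easy to generate; here is a minimal instance generator (the exact token format is an illustrative assumption, not the paper's):

```python
import random

def associative_recall_example(n_pairs=4, vocab=8, seed=0):
    """One associative-recall instance: a sequence of key-value pairs
    followed by a query key; the target is that key's stored value.
    Keys live in [0, vocab), values in [vocab, 2*vocab)."""
    rng = random.Random(seed)
    keys = rng.sample(range(vocab), n_pairs)
    vals = [rng.randrange(vocab, 2 * vocab) for _ in keys]
    seq = [tok for kv in zip(keys, vals) for tok in kv]
    query = rng.choice(keys)
    target = vals[keys.index(query)]
    return seq + [query], target

tokens, target = associative_recall_example()
```

Solving this exactly requires either content-addressable lookup (attention's strength) or a state large enough to store all pairs (a state-space model's cost), which is the tension the hybrid constructions resolve.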
[447] A New Modeling to Feature Selection Based on the Fuzzy Rough Set Theory in Normal and Optimistic States on Hybrid Information Systems
Mohammad Hossein Safarpour, Seyed Mohammad Alavi, Mohammad Izadikhah, Hossein Dibachi
Main category: cs.LG
TL;DR: A new feature selection method called FSbuHD that uses fuzzy rough set theory with combined distance calculations to handle high-dimensional data, reformulating feature selection as an optimization problem solvable by meta-heuristic algorithms.
Details
Motivation: Feature selection is crucial for big data applications to reduce dimensionality and improve decision-making. Traditional fuzzy rough set methods face computational challenges in high-dimensional spaces and produce noisy data, making feature selection difficult.
Method: The proposed FSbuHD model calculates a combined distance between objects to derive fuzzy equivalence relations, then reformulates feature selection as an optimization problem solvable by meta-heuristic algorithms. It operates in normal and optimistic modes based on two introduced fuzzy equivalence relations.
Result: Tested on standard UCI datasets and compared with other algorithms. FSbuHD demonstrated superior efficiency and effectiveness for feature selection compared to previous methods.
Conclusion: FSbuHD is an efficient and effective feature selection method for high-dimensional data that addresses computational challenges of traditional fuzzy rough set approaches through distance-based equivalence relations and optimization reformulation.
Abstract: Considering the high volume, wide variety, and rapid speed of data generation, investigating feature selection methods for big data presents various applications and advantages. By removing irrelevant and redundant features, feature selection reduces data dimensions, thereby facilitating optimal decision-making within decision systems. One of the key tools for feature selection in hybrid information systems is fuzzy rough set theory. However, this theory faces two significant challenges: First, obtaining fuzzy equivalence relations through intersection operations in high-dimensional spaces can be both time-consuming and memory-intensive. Additionally, this method may produce noisy data, complicating the feature selection process. The purpose and innovation of this paper are to address these issues. We proposed a new feature selection model that calculates the combined distance between objects and subsequently used this information to derive the fuzzy equivalence relation. Rather than directly solving the feature selection problem, this approach reformulates it into an optimization problem that can be tackled using appropriate meta-heuristic algorithms. We have named this new approach FSbuHD. The FSbuHD model operates in two modes - normal and optimistic - based on the selection of one of the two introduced fuzzy equivalence relations. The model is then tested on standard datasets from the UCI repository and compared with other algorithms. The results of this research demonstrate that FSbuHD is one of the most efficient and effective methods for feature selection when compared to previous methods and algorithms.
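The core move, deriving a fuzzy relation from pairwise distances instead of high-dimensional intersection operations, can be sketched with a simple normalized-distance similarity (the paper's actual combined-distance formula for hybrid data is not reproduced here; this is an illustrative stand-in):

```python
import numpy as np

def fuzzy_relation_from_distance(X):
    """Toy fuzzy similarity relation R(x, y) = 1 - d(x, y) / d_max built
    from pairwise Euclidean distances: reflexive (R(x, x) = 1) and
    symmetric, the kind of distance-derived relation FSbuHD builds on."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return 1.0 - D / max(D.max(), 1e-12)

X = np.array([[0., 0.], [0., 1.], [3., 3.]])  # three toy objects
R = fuzzy_relation_from_distance(X)
```

Nearby objects get relation values close to 1, distant ones close to 0, which then feeds the rough-set lower/upper approximations without any per-feature intersection pass.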
[448] Cross-Domain Uncertainty Quantification for Selective Prediction: A Comprehensive Bound Ablation with Transfer-Informed Betting
Abhinaba Basu
Main category: cs.LG
TL;DR: A comprehensive analysis of nine selective prediction bound families with risk control, introducing Transfer-Informed Betting for tighter bounds in data-scarce settings via domain transfer.
Details
Motivation: The paper addresses the challenge of achieving reliable risk control in selective prediction systems, particularly in data-scarce scenarios where traditional concentration bounds become too conservative or infeasible.
Method: Combines nine bound families (Hoeffding, Empirical Bernstein, Clopper-Pearson, Wasserstein DRO, CVaR) with multiple-testing corrections (union bound, Learn Then Test) and betting-based confidence sequences (WSR). Introduces Transfer-Informed Betting (TIB) that warm-starts WSR wealth process using source domain risk profiles.
Result: TIB achieves significantly tighter bounds in data-scarce settings: 27% relative improvement on MASSIVE at alpha=0.10, and 5.4x improvement on NyayaBench at alpha=0.10 compared to baseline methods. LTT eliminates union-bound penalty, achieving 94.0% guaranteed coverage vs 73.8% for Hoeffding.
Conclusion: The paper presents a novel three-way combination of betting-based confidence sequences, monotone testing, and cross-domain transfer that significantly improves selective prediction risk control, with practical applications in agentic caching systems.
Abstract: We present a comprehensive ablation of nine finite-sample bound families for selective prediction with risk control, combining concentration inequalities (Hoeffding, Empirical Bernstein, Clopper-Pearson, Wasserstein DRO, CVaR) with multiple-testing corrections (union bound, Learn Then Test fixed-sequence) and betting-based confidence sequences (WSR). Our main theoretical contribution is Transfer-Informed Betting (TIB), which warm-starts the WSR wealth process using a source domain’s risk profile, achieving tighter bounds in data-scarce settings with a formal dominance guarantee. We prove that the TIB wealth process remains a valid supermartingale under all source-target divergences, that TIB dominates standard WSR when domains match, and that no data-independent warm-start can achieve better convergence. The combination of betting-based confidence sequences, LTT monotone testing, and cross-domain transfer is, to our knowledge, a three-way novelty not present in the literature. We evaluate all nine bound families on four benchmarks-MASSIVE (n=1,102), NyayaBench (n=280), CLINC-150 (n=22.5K), and Banking77 (n=13K)-across 18 (alpha, delta) configurations. On MASSIVE at alpha=0.10, LTT eliminates the ln(K) union-bound penalty, achieving 94.0% guaranteed coverage versus 73.8% for Hoeffding-a 27% relative improvement. On NyayaBench, where the small calibration set makes Hoeffding-family bounds infeasible below alpha=0.20, Transfer-Informed Betting achieves 18.5% coverage at alpha=0.10, a 5.4x improvement over LTT + Hoeffding. We additionally compare with split-conformal prediction, showing that conformal methods produce prediction sets (avg. 1.67 classes) whereas selective prediction provides single-prediction risk guarantees. We apply these methods to agentic caching systems, formalizing a progressive trust model where the guarantee determines when cached responses can be served autonomously.
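The Hoeffding-family baseline that the ablation starts from reduces to a one-line bound; a minimal sketch (the TIB wealth process and LTT corrections are not reproduced here):

```python
import math

def hoeffding_risk_ucb(errors, delta=0.05):
    """One-sided Hoeffding upper confidence bound on true selective risk,
    given 0/1 errors on the n accepted (non-abstained) samples: with
    probability >= 1 - delta the true risk is below this value."""
    n = len(errors)
    return sum(errors) / n + math.sqrt(math.log(1 / delta) / (2 * n))

# toy calibration set: 3 errors among 100 accepted predictions
errors = [1] * 3 + [0] * 97
ucb = hoeffding_risk_ucb(errors, delta=0.05)
```

The `sqrt(log(1/delta) / 2n)` slack is what blows up on small calibration sets like NyayaBench (n=280), which is exactly the regime where the betting-based and transfer-informed bounds in the paper pay off.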
[449] Quantifying Memorization and Privacy Risks in Genomic Language Models
Alexander Nemecek, Wenbiao Li, Xiaoqian Jiang, Jaideep Vaidya, Erman Ayday
Main category: cs.LG
TL;DR: A comprehensive privacy evaluation framework for genomic language models that quantifies memorization risks through multi-vector assessment including perplexity detection, canary extraction, and membership inference.
Details
Motivation: Genomic language models trained on sensitive genomic data risk memorizing specific sequences, raising privacy concerns, but there's little systematic evaluation of these risks in the genomic domain where data have unique properties like fixed nucleotide alphabet and individual identifiability.
Method: Developed a multi-vector privacy evaluation framework integrating three complementary risk assessment methodologies: perplexity-based detection, canary sequence extraction, and membership inference, combined into a unified pipeline that produces worst-case memorization risk scores. Used controlled evaluation by planting canary sequences at varying repetition rates in both synthetic and real genomic datasets.
Result: GLMs exhibit measurable memorization, with degree varying across architectures and training regimes. No single attack vector captures the full scope of memorization risk, revealing the need for multi-vector privacy auditing.
Conclusion: Multi-vector privacy auditing should be standard practice for genomic AI systems due to measurable memorization risks that vary across model architectures and training approaches.
Abstract: Genomic language models (GLMs) have emerged as powerful tools for learning representations of DNA sequences, enabling advances in variant prediction, regulatory element identification, and cross-task transfer learning. However, as these models are increasingly trained or fine-tuned on sensitive genomic cohorts, they risk memorizing specific sequences from their training data, raising serious concerns around privacy, data leakage, and regulatory compliance. Despite growing awareness of memorization risks in general-purpose language models, little systematic evaluation exists for these risks in the genomic domain, where data exhibit unique properties such as a fixed nucleotide alphabet, strong biological structure, and individual identifiability. We present a comprehensive, multi-vector privacy evaluation framework designed to quantify memorization risks in GLMs. Our approach integrates three complementary risk assessment methodologies: perplexity-based detection, canary sequence extraction, and membership inference. These are combined into a unified evaluation pipeline that produces a worst-case memorization risk score. To enable controlled evaluation, we plant canary sequences at varying repetition rates into both synthetic and real genomic datasets, allowing precise quantification of how repetition and training dynamics influence memorization. We evaluate our framework across multiple GLM architectures, examining the relationship between sequence repetition, model capacity, and memorization risk. Our results establish that GLMs exhibit measurable memorization and that the degree of memorization varies across architectures and training regimes. These findings reveal that no single attack vector captures the full scope of memorization risk, underscoring the need for multi-vector privacy auditing as a standard practice for genomic AI systems.
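The perplexity-based detection vector reduces to comparing how "unsurprised" the model is by a sequence; a minimal sketch (the threshold and per-token log-probabilities are toy inputs, not outputs of a real GLM):

```python
import math

def perplexity(log_probs):
    """Perplexity from per-nucleotide log-probabilities under the model."""
    return math.exp(-sum(log_probs) / len(log_probs))

def flag_memorized(seq_log_probs, threshold):
    """Flag sequences whose perplexity falls below a threshold, the core
    of a perplexity-based memorization test. A random 4-letter nucleotide
    model has perplexity 4, so values near 1 suggest memorization."""
    return [i for i, lp in enumerate(seq_log_probs)
            if perplexity(lp) < threshold]

# toy: the model assigns near-certain probability to a planted canary
member = [math.log(0.97)] * 50       # likely memorized
non_member = [math.log(0.25)] * 50   # ~uniform over {A, C, G, T}
flagged = flag_memorized([member, non_member], threshold=2.0)
```

The paper's point is that this single vector misses some leakage, hence the combination with canary extraction and membership inference into a worst-case score.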
[450] Uncovering a Winning Lottery Ticket with Continuously Relaxed Bernoulli Gates
Itamar Tsayag, Ofir Lindenbaum
Main category: cs.LG
TL;DR: Differentiable Bernoulli gates enable end-to-end optimization for discovering strong lottery tickets in neural networks without weight training, achieving up to 90% sparsity with minimal accuracy loss.
Details
Motivation: To overcome the limitations of non-differentiable score-based methods for Strong Lottery Ticket discovery, which hinder optimization efficiency and scalability for network sparsification.
Method: Uses continuously relaxed Bernoulli gates to discover SLTs through fully differentiable, end-to-end optimization - training only gating parameters while keeping all network weights frozen at initialization values.
Result: Achieves up to 90% sparsity with minimal accuracy loss across various architectures (FCNs, CNNs, Vision Transformers), nearly doubling the sparsity achieved by edge-popup at comparable accuracy.
Conclusion: Proposes the first fully differentiable approach for SLT discovery that avoids straight-through estimator approximations, establishing a scalable framework for pre-training network sparsification.
Abstract: Over-parameterized neural networks incur prohibitive memory and computational costs for resource-constrained deployment. The Strong Lottery Ticket (SLT) hypothesis suggests that randomly initialized networks contain sparse subnetworks achieving competitive accuracy without weight training. Existing SLT methods, notably edge-popup, rely on non-differentiable score-based selection, limiting optimization efficiency and scalability. We propose using continuously relaxed Bernoulli gates to discover SLTs through fully differentiable, end-to-end optimization - training only gating parameters while keeping all network weights frozen at their initialized values. Continuous relaxation enables direct gradient-based optimization of an $\ell_0$-regularization objective, eliminating the need for non-differentiable gradient estimators or iterative pruning cycles. To our knowledge, this is the first fully differentiable approach for SLT discovery that avoids straight-through estimator approximations. Experiments across fully connected networks, CNNs (ResNet, Wide-ResNet), and Vision Transformers (ViT, Swin-T) demonstrate up to 90% sparsity with minimal accuracy loss - nearly double the sparsity achieved by edge-popup at comparable accuracy - establishing a scalable framework for pre-training network sparsification.
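The continuously relaxed Bernoulli gate at the heart of the method is the standard binary Concrete reparameterization; a minimal numpy sketch (forward sampling only, without the $\ell_0$ objective or frozen-weight network):

```python
import numpy as np

def relaxed_bernoulli(log_alpha, temperature, rng):
    """Continuously relaxed Bernoulli (binary Concrete) gate: a
    reparameterized, differentiable surrogate for a hard 0/1 mask.
    Sampling uses logistic noise; lower temperature pushes outputs
    toward {0, 1} while keeping the map differentiable in log_alpha."""
    u = rng.uniform(1e-6, 1 - 1e-6, size=np.shape(log_alpha))
    logits = (np.log(u) - np.log(1 - u) + log_alpha) / temperature
    return 1.0 / (1.0 + np.exp(-logits))

rng = np.random.default_rng(0)
gates = relaxed_bernoulli(log_alpha=np.array([3.0, -3.0]),
                          temperature=0.5, rng=rng)
```

Because the gate is a smooth function of `log_alpha`, gradients flow through it directly, which is what removes the need for edge-popup's straight-through-style score updates.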
[451] The $qs$ Inequality: Quantifying the Double Penalty of Mixture-of-Experts at Inference
Vignesh Adhinarayanan, Nuwan Jayasena
Main category: cs.LG
TL;DR: MoE models are efficient for training but suffer from inference inefficiency due to reuse fragmentation and KV cache memory constraints, making dense models more practical for long-context serving.
Details
Motivation: MoE models achieve high quality with low training FLOPs, but this efficiency disappears at inference time due to structural disadvantages in decoding, particularly for long contexts.
Method: Introduces the $qs$ inequality criterion that predicts when MoE is disadvantaged relative to dense models, combining sparsity ($s$) and quality-equivalence factor ($q$). Evaluates across frontier models including DeepSeek-V3, Qwen3-235B, Grok-1, and Switch-C.
Result: DeepSeek-V3 at 128k context shows 4.5x throughput advantage for quality-matched dense baseline. Massive MoE architectures like Switch-C can become infeasible on clusters where dense models remain viable.
Conclusion: Training-time FLOP efficiency is incomplete for inference performance. MoE may be best as training-time optimization with distillation into dense models for efficient deployment.
Abstract: Mixture-of-Experts (MoE) models deliver high quality at low training FLOPs, but this efficiency often vanishes at inference. We identify a double penalty that structurally disadvantages MoE architectures during decoding: first, expert routing fragments microbatches and reduces weight reuse; second, massive resident expert pools reduce high-bandwidth memory (HBM) headroom for the KV cache. This phenomenon, formalized as reuse fragmentation, pushes feed-forward networks (FFNs) into a bandwidth-bound regime, especially at long context lengths. We introduce the $qs$ inequality, a predictive criterion that identifies when MoE is structurally disadvantaged relative to a quality-matched dense model. This criterion unifies sparsity ($s$), the fraction of parameters activated per token, and the quality-equivalence factor ($q$), the size multiplier required for a dense model to match MoE performance. Our evaluation across frontier models including DeepSeek-V3, Qwen3-235B, Grok-1, and Switch-C demonstrates that this fragmentation is a general architectural phenomenon. For DeepSeek-V3 at 128k context, this results in a 4.5x throughput advantage for a quality-matched dense baseline. Crucially, massive architectures like Switch-C can become infeasible on cluster sizes where a quality-matched dense model remains viable. Our results suggest that training-time FLOP efficiency is an incomplete proxy for inference-time performance in long-context serving. They also indicate that MoE may be best viewed as a training-time optimization, with distillation into dense models as a possible path toward inference-efficient deployment.
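One plausible reading of the criterion can be put in code. This is a hedged interpretation, not the paper's exact inequality: we assume the comparison is between the dense model's per-token weight traffic, roughly $q \cdot s$ of the MoE's total parameters, and the MoE's fragmented full-pool traffic:

```python
def moe_decode_disadvantaged(s, q):
    """Hedged sketch of the qs criterion (the paper's precise form may
    differ). s = fraction of MoE parameters active per token; q = size
    multiplier for a quality-matched dense model relative to the active
    parameters. Under reuse fragmentation the MoE streams its whole
    expert pool per decode step while the dense model streams q*s of
    that total, so MoE is bandwidth-disadvantaged when q*s < 1."""
    return q * s < 1.0

# toy numbers: 5% activation; dense needs 4x the active size to match quality
print(moe_decode_disadvantaged(s=0.05, q=4.0))  # 0.2 < 1 -> True
```

Under this reading, very sparse models (tiny $s$) stay disadvantaged unless matching their quality would require an enormous dense multiplier $q$, consistent with the paper's long-context findings.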
[452] Semantic Level of Detail: Multi-Scale Knowledge Representation via Heat Kernel Diffusion on Hyperbolic Manifolds
Edward Izgorodin
Main category: cs.LG
TL;DR: SLoD framework enables continuous resolution control in AI memory systems using heat kernel diffusion on Poincaré ball to define semantic zoom levels with automatic boundary detection.
Details
Motivation: AI memory systems organize knowledge into graph structures but lack principled mechanisms for continuous resolution control - determining where qualitative boundaries between abstraction levels lie and how agents should navigate them.
Method: Introduces Semantic Level of Detail (SLoD) framework using heat kernel diffusion on Poincaré ball. At coarse scales, diffusion aggregates embeddings into high-level summaries; at fine scales, local semantic detail is preserved. Uses spectral gaps in the graph Laplacian to detect emergent scale boundaries automatically.
Result: Proves hierarchical coherence with bounded approximation error O(σ) and (1+ε) distortion for tree-structured hierarchies. On synthetic hierarchies (HSBM), boundary scanner recovers planted levels with ARI up to 1.00. On WordNet noun hierarchy (82K synsets), detected boundaries align with true taxonomic depth (τ=0.79).
Conclusion: SLoD provides a principled framework for continuous resolution control in knowledge graphs, automatically discovering meaningful abstraction levels without supervision through emergent scale boundaries detected from spectral properties.
Abstract: AI memory systems increasingly organize knowledge into graph structures – knowledge graphs, entity relations, community hierarchies – yet lack a principled mechanism for continuous resolution control: where do the qualitative boundaries between abstraction levels lie, and how should an agent navigate them? We introduce Semantic Level of Detail (SLoD), a framework that answers both questions by defining a continuous zoom operator via heat kernel diffusion on the Poincaré ball $\mathbb{B}^d$. At coarse scales ($\sigma \to \infty$), diffusion aggregates embeddings into high-level summaries; at fine scales ($\sigma \to 0$), local semantic detail is preserved. We prove hierarchical coherence with bounded approximation error $O(\sigma)$ and $(1+\varepsilon)$ distortion for tree-structured hierarchies under Sarkar embedding. Crucially, we show that spectral gaps in the graph Laplacian induce emergent scale boundaries – scales where the representation undergoes qualitative transitions – which can be detected automatically without manual resolution parameters. On synthetic hierarchies (HSBM), our boundary scanner recovers planted levels with ARI up to 1.00, with detection degrading gracefully near the information-theoretic Kesten-Stigum threshold. On the full WordNet noun hierarchy (82K synsets), detected boundaries align with true taxonomic depth ($\tau = 0.79$), demonstrating that the method discovers meaningful abstraction levels in real-world knowledge graphs without supervision.
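The two mechanical ingredients, heat kernel diffusion and spectral-gap detection, are easy to sketch on a plain graph Laplacian (Euclidean here; the hyperbolic Poincaré-ball setting of the paper is not reproduced):

```python
import numpy as np

def heat_kernel(L, sigma):
    """Heat kernel exp(-sigma * L) of a graph Laplacian via
    eigendecomposition; large sigma averages over the whole graph,
    small sigma preserves local structure."""
    w, V = np.linalg.eigh(L)
    return V @ np.diag(np.exp(-sigma * w)) @ V.T

def spectral_gap_boundaries(L, top=1):
    """Indices of the largest gaps in the sorted Laplacian spectrum; in
    the SLoD picture these mark emergent scale boundaries."""
    w = np.sort(np.linalg.eigvalsh(L))
    gaps = np.diff(w)
    return np.argsort(gaps)[::-1][:top]

# two triangle communities joined by a single bridge edge
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1
L = np.diag(A.sum(1)) - A
K = heat_kernel(L, sigma=5.0)
```

As `sigma` grows, the kernel's rows converge to the uniform distribution over the connected graph (the coarsest "summary"), while at small `sigma` each row stays concentrated on its own community.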
[453] MAcPNN: Mutual Assisted Learning on Data Streams with Temporal Dependence
Federico Giannini, Emanuele Della Valle
Main category: cs.LG
TL;DR: Proposes Mutual Assisted Learning (MAcPNN) - a decentralized IoT learning paradigm where edge devices autonomously request assistance from peers when performance degrades due to concept drift, using Continuous Progressive Neural Networks for data streams.
Details
Motivation: IoT analytics faces challenges with continuous learning on data streams, including concept drift, temporal dependence, and catastrophic forgetting. Traditional federated learning requires constant communication, which is inefficient for edge devices, motivating autonomous, efficient collaborative learning among IoT devices.
Method: A Mutual Assisted Learning paradigm based on Vygotsky's Sociocultural Theory, where each device autonomously decides when to request assistance from peers. Uses Continuous Progressive Neural Networks (cPNNs) to handle dynamic data streams. Implements MAcPNN with quantization for memory efficiency and single-data-point predictions.
Result: Experimental results show MAcPNN effectively boosts performance on synthetic and real data streams while drastically reducing communication compared to classical federated learning approaches.
Conclusion: MAcPNN provides an efficient decentralized learning framework for IoT edge devices that reduces communication overhead while maintaining performance through selective peer assistance during concept drifts.
Abstract: Internet of Things (IoT) Analytics often involves applying machine learning (ML) models on data streams. In such scenarios, traditional ML paradigms face obstacles related to continuous learning while dealing with concept drifts, temporal dependence, and avoiding forgetting. Moreover, in IoT, different edge devices build up a network. When learning models on those devices, connecting them could be useful in improving performance and reusing others’ knowledge. This work proposes Mutual Assisted Learning, a learning paradigm grounded on Vygotsky’s popular Sociocultural Theory of Cognitive Development. Each device is autonomous and does not need a central orchestrator. Whenever it degrades its performance due to a concept drift, it asks for assistance from others and decides whether their knowledge is useful for solving the new problem. This way, the number of connections is drastically reduced compared to the classical Federated Learning approaches, where the devices communicate at each training round. Every device is equipped with a Continuous Progressive Neural Network (cPNN) to handle the dynamic nature of data streams. We call this implementation Mutual Assisted cPNN (MAcPNN). To implement it, we allow cPNNs for single data point predictions and apply quantization to reduce the memory footprint. Experimental results prove the effectiveness of MAcPNN in boosting performance on synthetic and real data streams.
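The communication-saving trigger, ask peers only after performance degrades, can be sketched with a simple windowed drift check (a toy heuristic; MAcPNN's actual drift detection on cPNNs is not reproduced here):

```python
def assist_on_drift(acc_history, window=10, drop=0.15):
    """Trigger a peer-assistance request when mean accuracy over the
    most recent window falls below the preceding window's mean by more
    than `drop` -- a toy stand-in for concept-drift detection."""
    if len(acc_history) < 2 * window:
        return False
    baseline = sum(acc_history[-2 * window:-window]) / window
    recent = sum(acc_history[-window:]) / window
    return baseline - recent > drop

history = [0.9] * 10 + [0.6] * 10   # concept drift halfway through
print(assist_on_drift(history))     # True
```

Because a device communicates only when this fires, the number of connections stays far below federated learning's per-round all-device exchange.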
[454] MAPLE: Elevating Medical Reasoning from Statistical Consensus to Process-Led Alignment
Kailong Fan, Anqi Pu, Yichen Wu, Wanhua Li, Yicong Li, Hanspeter Pfister, Huafeng Liu, Xiang Li, Quanzheng Li, Ning Guo
Main category: cs.LG
TL;DR: A novel training paradigm that integrates medical process reward models with Test-Time Reinforcement Learning to replace unreliable majority voting with expert-aligned supervision for better medical reasoning.
Details
Motivation: Standard Test-Time Reinforcement Learning in medical LLMs relies on majority voting as a heuristic supervision signal, which can be unreliable in complex medical scenarios where the most frequent reasoning path isn't necessarily clinically correct.
Method: Proposes a unified training paradigm integrating medical process reward models (Med-RPM) with TTRL, replacing conventional majority voting with fine-grained, expert-aligned supervision to guide reinforcement learning by medical correctness rather than consensus.
Result: Extensive evaluations on four different benchmarks demonstrate the method consistently and significantly outperforms current TTRL and standalone PRM selection approaches.
Conclusion: Transitioning from stochastic heuristics to structured, step-wise rewards is essential for developing reliable and scalable medical AI systems.
Abstract: Recent advances in medical large language models have explored Test-Time Reinforcement Learning (TTRL) to enhance reasoning. However, standard TTRL often relies on majority voting (MV) as a heuristic supervision signal, which can be unreliable in complex medical scenarios where the most frequent reasoning path is not necessarily the clinically correct one. In this work, we propose a novel and unified training paradigm that integrates medical process reward models with TTRL to bridge the gap between test-time scaling (TTS) and parametric model optimization. Specifically, we advance the TTRL framework by replacing the conventional MV with a fine-grained, expert-aligned supervision paradigm using Med-RPM. This integration ensures that reinforcement learning is guided by medical correctness rather than mere consensus, effectively distilling search-based intelligence into the model's parametric memory. Extensive evaluations on four different benchmarks have demonstrated that our developed method consistently and significantly outperforms current TTRL and standalone PRM selection. Our findings establish that transitioning from stochastic heuristics to structured, step-wise rewards is essential for developing reliable and scalable medical AI systems.
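Why consensus and correctness can disagree is easy to show with toy rollouts: majority voting picks the most frequent answer, while reward-weighted selection picks the answer whose reasoning paths score highest (the reward values here are illustrative, not outputs of Med-RPM):

```python
from collections import Counter

def majority_vote(answers):
    """Pick the most frequent final answer (standard TTRL signal)."""
    return Counter(answers).most_common(1)[0][0]

def reward_weighted(answers, rewards):
    """Pick the answer whose sampled reasoning paths earn the highest
    total process reward -- the kind of expert-aligned signal a process
    reward model supplies."""
    totals = {}
    for a, r in zip(answers, rewards):
        totals[a] = totals.get(a, 0.0) + r
    return max(totals, key=totals.get)

answers = ["B", "B", "B", "A", "A"]    # five sampled rollouts
rewards = [0.2, 0.1, 0.2, 0.9, 0.8]    # toy per-path process rewards
```

Here the majority says "B" (three weak rollouts) while the reward-weighted signal says "A" (two clinically sound ones), which is exactly the failure mode of consensus supervision the paper targets.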
[455] The Coupling Within: Flow Matching via Distilled Normalizing Flows
David Berthelot, Tianrong Chen, Jiatao Gu, Marco Cuturi, Laurent Dinh, Bhavik Chandna, Michal Klein, Josh Susskind, Shuangfei Zhai
Main category: cs.LG
TL;DR: NFM uses pretrained normalizing flow models to provide quasi-deterministic noise-data couplings for training student flow models, outperforming both independent/OT couplings and the teacher model.
Details
Motivation: Current flow matching methods rely on independent or optimal transport couplings, but these may not be optimal. Normalizing flows naturally provide bijective mappings between noise and data spaces through their invertibility, offering potentially superior couplings for training flow models.
Method: Proposes Normalized Flow Matching (NFM) which distills the quasi-deterministic coupling from pretrained normalizing flow models (specifically auto-regressive NF models) to train student flow models, leveraging the inherent bijective properties of NFs.
Result: Student models trained with NFM significantly outperform flow models trained with independent or optimal transport couplings, and also improve upon the teacher auto-regressive NF model itself.
Conclusion: NFM demonstrates that leveraging the inherent coupling properties of pretrained normalizing flows provides superior training signals for flow matching, achieving state-of-the-art performance in flow-based generation.
Abstract: Flow models have rapidly become the go-to method for training and deploying large-scale generators, owing their success to inference-time flexibility via adjustable integration steps. A crucial ingredient in flow training is the choice of coupling measure for sampling noise/data pairs that define the flow matching (FM) regression loss. While FM training defaults usually to independent coupling, recent works show that adaptive couplings informed by noise/data distributions (e.g., via optimal transport, OT) improve both model training and inference. We radicalize this insight by shifting the paradigm: rather than computing adaptive couplings directly, we use distilled couplings from a different, pretrained model capable of placing noise and data spaces in bijection – a property intrinsic to normalizing flows (NF) through their maximum likelihood and invertibility requirements. Leveraging recent advances in NF image generation via auto-regressive (AR) blocks, we propose Normalized Flow Matching (NFM), a new method that distills the quasi-deterministic coupling of pretrained NF models to train student flow models. These students achieve the best of both worlds: significantly outperforming flow models trained with independent or even OT couplings, while also improving on the teacher AR-NF model.
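The flow matching regression targets themselves are coupling-agnostic; a minimal sketch with the standard linear path (the coupling is whatever pairing produced `(x0, x1)`: independent, OT, or, as in NFM, the bijection of a pretrained normalizing flow, which is not reproduced here):

```python
import numpy as np

def fm_regression_targets(x0, x1, t):
    """Linear-path flow matching: the point x_t on the noise-to-data
    path and the velocity the student network regresses onto."""
    t = t.reshape(-1, 1)
    x_t = (1 - t) * x0 + t * x1
    v_target = x1 - x0
    return x_t, v_target

rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 2))           # noise samples
x1 = rng.normal(size=(4, 2)) + 3.0     # "data" samples
x_t, v = fm_regression_targets(x0, x1, rng.uniform(size=4))
```

All of the paper's leverage lives in the pairing step that chooses which `x1` each `x0` is matched with; a near-deterministic coupling makes the velocity targets far less noisy than independent sampling.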
[456] Learning responsibility allocations for multi-agent interactions: A differentiable optimization approach with control barrier functions
Isaac Remy, David Fridovich-Keil, Karen Leung
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2410.07409 returned HTTP 429 (rate limited).
[457] An accurate flatness measure to estimate the generalization performance of CNN models
Rahman Taleghani, Maryam Mohammadi, Francesco Marchetti
Main category: cs.LG
TL;DR: A novel flatness measure for CNNs that accounts for architectural structure and scaling symmetries, derived from Hessian trace analysis of convolutional layers with global average pooling.
Details
Motivation: Existing flatness measures for generalization are either tailored to fully connected networks, rely on stochastic Hessian trace estimators, or ignore the specific geometric structure of modern CNNs. There's a need for an exact and architecturally faithful flatness measure for CNNs.
Method: Derives a closed-form expression for the Hessian trace of the cross-entropy loss with respect to convolutional kernels in networks using global average pooling plus a linear classifier, then specializes the relative flatness notion to convolutional layers, accounting for scaling symmetries and filter interactions from convolution and pooling.
Result: Empirical investigation on CNN families trained on standard image-classification benchmarks shows the measure can assess and compare generalization performance, and guide architecture/training design.
Conclusion: The proposed flatness measure is both exact and architecturally faithful for CNNs, serving as a robust tool for generalization assessment and practical design guidance.
Abstract: Flatness measures based on the spectrum or the trace of the Hessian of the loss are widely used as proxies for the generalization ability of deep networks. However, most existing definitions are either tailored to fully connected architectures, relying on stochastic estimators of the Hessian trace, or ignore the specific geometric structure of modern Convolutional Neural Networks (CNNs). In this work, we develop a flatness measure that is both exact and architecturally faithful for a broad and practically relevant class of CNNs. We first derive a closed-form expression for the trace of the Hessian of the cross-entropy loss with respect to convolutional kernels in networks that use global average pooling followed by a linear classifier. Building on this result, we then specialize the notion of relative flatness to convolutional layers and obtain a parameterization-aware flatness measure that properly accounts for the scaling symmetries and filter interactions induced by convolution and pooling. Finally, we empirically investigate the proposed measure on families of CNNs trained on standard image-classification benchmarks. The results obtained suggest that the proposed measure can serve as a robust tool to assess and compare the generalization performance of CNN models, and to guide the design of architecture and training choices in practice.
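For context, the stochastic Hessian-trace estimators the paper moves beyond typically follow Hutchinson's method. A minimal sketch, with a small explicit matrix standing in for a loss Hessian (in practice only Hessian-vector products are available; the paper's contribution is an exact closed form, not shown here):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hutchinson's estimator: tr(H) ~ mean_i v_i^T H v_i over random sign vectors.
H = np.array([[3.0, 0.5, 0.0],
              [0.5, 2.0, 0.1],
              [0.0, 0.1, 1.0]])

def hutchinson_trace(hvp, dim, n_samples=5000, rng=rng):
    est = 0.0
    for _ in range(n_samples):
        v = rng.choice([-1.0, 1.0], size=dim)   # Rademacher probe vector
        est += v @ hvp(v)                       # one Hessian-vector product
    return est / n_samples

trace_est = hutchinson_trace(lambda v: H @ v, dim=3)
print(trace_est, np.trace(H))   # stochastic estimate vs exact trace
```

The sampling noise of this estimator is exactly what an exact, architecture-aware trace formula removes.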
[458] Bottleneck Transformer-Based Approach for Improved Automatic STOI Score Prediction
Amartyaveer, Murali Kadambi, Chandra Mohan Sharma, Anupam Mondal, Prasanta Kumar Ghosh
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2602.15484 returned HTTP 429 (rate limited).
[459] When to Retrain after Drift: A Data-Only Test of Post-Drift Data Size Sufficiency
Ren Fujiwara, Yasuko Matsubara, Yasushi Sakurai
Main category: cs.LG
TL;DR: CALIPER is a data-only test that estimates the required post-drift data size for stable retraining in streaming learning, using weighted local regression and tracking proxy error trends to determine when sufficient data is available for adaptation.
Details
Motivation: Current drift detection methods don't address when to retrain models or how much post-drift data is needed for stable retraining, creating a gap between detection and effective adaptation in streaming learning scenarios.
Method: CALIPER uses a single-pass weighted local regression over post-drift windows, tracking a one-step proxy error as a function of locality parameter θ. When an effective sample size gate is satisfied, a monotonically non-increasing trend in error indicates sufficient data for retraining.
Result: Across datasets from four heterogeneous domains, three learner families, and two detectors, CALIPER consistently matches or exceeds the best fixed data size for retraining while incurring negligible overhead and often outperforming incremental updates.
Conclusion: CALIPER effectively bridges the gap between drift detection and data-sufficient adaptation in streaming learning, providing a practical solution for determining when and how much to retrain after concept drift.
Abstract: Sudden concept drift makes previously trained predictors unreliable, yet deciding when to retrain and what post-drift data size is sufficient is rarely addressed. We propose CALIPER - a detector- and model-agnostic, data-only test that estimates the post-drift data size required for stable retraining. CALIPER exploits state dependence in streams generated by dynamical systems: we run a single-pass weighted local regression over the post-drift window and track a one-step proxy error as a function of a locality parameter $θ$. When an effective sample size gate is satisfied, a monotonically non-increasing trend in this error with an increasing locality parameter indicates that the data size is sufficiently informative for retraining. We also provide a theoretical analysis of our method, and we show that the algorithm has low per-update time and memory costs. Across datasets from four heterogeneous domains, three learner families, and two detectors, CALIPER consistently matches or exceeds the best fixed data size for retraining while incurring negligible overhead and often outperforming incremental updates. CALIPER closes the gap between drift detection and data-sufficient adaptation in streaming learning.
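An illustrative reading of the proxy-error curve (not the authors' code; the kernel form, θ grid, and sufficiency threshold are all assumptions): a kernel-weighted one-step predictor on a noisy logistic-map stream, with the error tracked as locality increases.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy post-drift stream: a noisy logistic map, i.e. a state-dependent
# dynamical system of the kind CALIPER exploits.
n = 600
x = np.empty(n)
x[0] = 0.4
for t in range(n - 1):
    x[t + 1] = np.clip(3.8 * x[t] * (1 - x[t])
                       + 0.005 * rng.standard_normal(), 0.0, 1.0)

def proxy_error(x, theta, start=100):
    """One-step proxy error of a kernel-weighted local predictor; larger theta
    weights states closer to the current one more heavily (an illustrative
    stand-in for the paper's locality parameter)."""
    errs = []
    for t in range(start, len(x) - 1):
        w = np.exp(-theta * (x[:t] - x[t]) ** 2)
        pred = np.dot(w, x[1:t + 1]) / (w.sum() + 1e-300)
        errs.append((pred - x[t + 1]) ** 2)
    return float(np.mean(errs))

thetas = np.array([1.0, 10.0, 100.0, 1000.0])
errors = np.array([proxy_error(x, th) for th in thetas])

# Sufficiency signal from the abstract: the proxy error is non-increasing in
# the locality parameter once enough post-drift data has accumulated.
sufficient = bool(np.all(np.diff(errors) <= 1e-4))
print(errors.round(4), sufficient)
```

With a long enough window the highly local predictor beats the near-global one, which is the data-sufficiency signature the test looks for.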
[460] TCG CREST System Description for the DISPLACE-M Challenge
Nikhil Raghav, Md Sahidullah
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.02030 returned HTTP 429 (rate limited).
[461] Two Teachers Better Than One: Hardware-Physics Co-Guided Distributed Scientific Machine Learning
Yuchen Yuan, Junhuan Yang, Hao Wan, Yipei Liu, Hanhan Wu, Youzuo Lin, Lei Yang
Main category: cs.LG
TL;DR: EPIC is a distributed SciML framework that enables efficient full-waveform inversion by performing lightweight local encoding on edge devices and physics-aware decoding at a central node, reducing communication costs while maintaining physical fidelity.
Details
Motivation: Traditional centralized SciML approaches face challenges in wide-area sensing due to high communication latency and energy costs from raw data aggregation. Distributed ML models often break physical principles, leading to degraded performance.
Method: EPIC uses hardware- and physics-co-guided distributed SciML with lightweight local encoding on end devices and physics-aware decoding at a central node. It transmits compact latent features instead of raw data and employs cross-attention to capture inter-receiver wavefield coupling.
Result: On a distributed testbed with 5 end devices and 1 central node across 10 OpenFWI datasets, EPIC reduces latency by 8.9× and communication energy by 33.8×, while improving reconstruction fidelity on 8 out of 10 datasets.
Conclusion: EPIC demonstrates that distributed SciML can achieve significant communication efficiency gains while preserving physical fidelity, enabling practical in-field SciML applications with strict energy and latency constraints.
Abstract: Scientific machine learning (SciML) is increasingly applied to in-field processing, controlling, and monitoring; however, wide-area sensing, real-time demands, and strict energy and reliability constraints make centralized SciML implementation impractical. Most SciML models assume raw data aggregation at a central node, incurring prohibitively high communication latency and energy costs; yet, distributing models developed for general-purpose ML often breaks essential physical principles, resulting in degraded performance. To address these challenges, we introduce EPIC, a hardware- and physics-co-guided distributed SciML framework, using full-waveform inversion (FWI) as a representative task. EPIC performs lightweight local encoding on end devices and physics-aware decoding at a central node. By transmitting compact latent features rather than high-volume raw data and by using cross-attention to capture inter-receiver wavefield coupling, EPIC significantly reduces communication cost while preserving physical fidelity. Evaluated on a distributed testbed with five end devices and one central node, and across 10 datasets from OpenFWI, EPIC reduces latency by 8.9$\times$ and communication energy by 33.8$\times$, while even improving reconstruction fidelity on 8 out of 10 datasets.
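A shape-level sketch of the communication pattern the abstract describes, with random projections standing in for EPIC's trained encoder and decoder (all dimensions and weights here are invented placeholders):

```python
import numpy as np

rng = np.random.default_rng(8)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Each end device compresses its raw trace to a short latent vector, and only
# the latents are transmitted to the central node.
n_devices, raw_len, latent_dim = 5, 1000, 16
raw_traces = rng.standard_normal((n_devices, raw_len))
encoder = rng.standard_normal((raw_len, latent_dim)) / np.sqrt(raw_len)
latents = raw_traces @ encoder          # transmitted instead of raw data

# Central node: query tokens cross-attend over the per-receiver latents,
# modeling inter-receiver wavefield coupling.
Wq, Wk, Wv = (rng.standard_normal((latent_dim, latent_dim)) for _ in range(3))
queries = rng.standard_normal((8, latent_dim))
attn = softmax(queries @ Wq @ (latents @ Wk).T / np.sqrt(latent_dim))
fused = attn @ (latents @ Wv)

# Per-round payload reduction from sending latents rather than raw traces.
print(latents.size / raw_traces.size)
```

The point of the sketch is the data flow, not the physics: latent payloads shrink communication by the encoder's compression ratio while the cross-attention step still mixes information across receivers.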
[462] SCALAR: Learning and Composing Skills through LLM Guided Symbolic Planning and Deep RL Grounding
Renos Zabounidis, Yue Wu, Simon Stepputtis, Woojun Kim, Yuanzhi Li, Tom Mitchell, Katia Sycara
Main category: cs.LG
TL;DR: SCALAR is a bidirectional framework that couples LLM planning with RL through a learned skill library, enabling iterative refinement of skill specifications via RL feedback to improve grounding of language into low-level control.
Details
Motivation: LLM-based agents excel with high-level APIs but struggle to ground language into low-level control. Existing approaches use LLMs to generate skills or reward functions for RL in a one-shot manner, lacking feedback mechanisms to correct specification errors.
Method: SCALAR introduces a bidirectional framework where LLMs propose skills with preconditions and effects, RL trains policies for each skill, and execution results feed back to iteratively refine specifications. Includes Pivotal Trajectory Analysis to correct LLM priors and Frontier Checkpointing to save environment states at skill boundaries for sample efficiency.
Result: On Craftax, SCALAR achieves 88.2% diamond collection (1.9x improvement over best baseline) and reaches the Gnomish Mines 9.1% of the time where prior methods fail entirely.
Conclusion: SCALAR demonstrates that bidirectional coupling between LLM planning and RL through iterative specification refinement enables robust grounding of language into low-level control, overcoming limitations of one-shot approaches.
Abstract: LM-based agents excel when given high-level action APIs but struggle to ground language into low-level control. Prior work has LLMs generate skills or reward functions for RL, but these one-shot approaches lack feedback to correct specification errors. We introduce SCALAR, a bidirectional framework coupling LLM planning with RL through a learned skill library. The LLM proposes skills with preconditions and effects; RL trains policies for each skill and feeds back execution results to iteratively refine specifications, improving robustness to initial errors. Pivotal Trajectory Analysis corrects LLM priors by analyzing RL trajectories; Frontier Checkpointing optionally saves environment states at skill boundaries to improve sample efficiency. On Craftax, SCALAR achieves 88.2% diamond collection, a 1.9x improvement over the best baseline, and reaches the Gnomish Mines 9.1% of the time where prior methods fail entirely.
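A toy sketch of the feedback loop's data flow (skill names, fields, and the refinement rule are invented for illustration; in SCALAR the revision would be re-proposed by the LLM from trajectory analysis, not hard-coded):

```python
import numpy as np
from dataclasses import dataclass

rng = np.random.default_rng(7)

@dataclass
class SkillSpec:
    name: str
    precondition: str
    effect: str
    success_rate: float = 0.0
    attempts: int = 0

    def record(self, succeeded: bool):
        # Running-mean update of the skill's empirical success rate.
        self.attempts += 1
        self.success_rate += (succeeded - self.success_rate) / self.attempts

# LLM-proposed spec with a wrong precondition (hypothetical example).
library = {"mine_diamond": SkillSpec("mine_diamond",
                                     precondition="has_wood_pickaxe",
                                     effect="has_diamond")}

# RL rollouts reveal the skill actually requires an iron pickaxe.
for _ in range(50):
    agent_has_iron = rng.uniform() < 0.5
    library["mine_diamond"].record(succeeded=agent_has_iron)

# Refinement step: a persistently low success rate triggers a spec revision.
skill = library["mine_diamond"]
if skill.success_rate < 0.8:
    skill.precondition = "has_iron_pickaxe"
print(skill.precondition, round(skill.success_rate, 2))
```

Execution statistics flowing back into the symbolic spec is the bidirectional coupling the TL;DR highlights; one-shot generation has no such correction channel.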
[463] Personalized Collaborative Learning with Affinity-Based Variance Reduction
Chenyu Zhang, Navid Azizan
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2510.16232 returned HTTP 429 (rate limited).
[464] Sim2Act: Robust Simulation-to-Decision Learning via Adversarial Calibration and Group-Relative Perturbation
Hongyu Cao, Jinghan Zhang, Kunpeng Liu, Dongjie Wang, Feng Xia, Haifeng Chen, Xiaohua Hu, Yanjie Fu
Main category: cs.LG
TL;DR: Sim2Act: a robust simulation-to-decision framework that addresses both simulator and policy robustness through adversarial calibration and group-relative perturbation strategies to handle noise and bias in learned simulators.
Details
Motivation: Simulation-to-decision learning is essential for safe policy training in mission-critical domains, but simulators learned from noisy/biased real-world data often have prediction errors in decision-critical regions, leading to unstable action ranking and unreliable policies. Existing approaches either focus on average simulation fidelity or use conservative regularization that may discard high-risk high-reward actions.
Method: 1) Adversarial calibration mechanism that re-weights simulation errors in decision-critical state-action pairs to align surrogate fidelity with downstream decision impact. 2) Group-relative perturbation strategy that stabilizes policy learning under simulator uncertainty without enforcing overly pessimistic constraints.
Result: Extensive experiments on multiple supply chain benchmarks demonstrate improved simulation robustness and more stable decision performance under both structured and unstructured perturbations.
Conclusion: Sim2Act provides a robust framework for simulation-to-decision learning that addresses both simulator and policy robustness, enabling more reliable policy training in digital environments for mission-critical applications.
Abstract: Simulation-to-decision learning enables safe policy training in digital environments without risking real-world deployment, and has become essential in mission-critical domains such as supply chains and industrial systems. However, simulators learned from noisy or biased real-world data often exhibit prediction errors in decision-critical regions, leading to unstable action ranking and unreliable policies. Existing approaches either focus on improving average simulation fidelity or adopt conservative regularization, which may cause policy collapse by discarding high-risk high-reward actions. We propose Sim2Act, a robust simulation-to-decision framework that addresses both simulator and policy robustness. First, we introduce an adversarial calibration mechanism that re-weights simulation errors in decision-critical state-action pairs to align surrogate fidelity with downstream decision impact. Second, we develop a group-relative perturbation strategy that stabilizes policy learning under simulator uncertainty without enforcing overly pessimistic constraints. Extensive experiments on multiple supply chain benchmarks demonstrate improved simulation robustness and more stable decision performance under structured and unstructured perturbations.
[465] Dynamic Multi-period Experts for Online Time Series Forecasting
Seungha Hong, Sukang Chae, Suyeon Kim, Sanghwan Jang, Hwanjo Yu
Main category: cs.LG
TL;DR: DynaME is a hybrid framework for online time series forecasting that addresses two types of concept drift: Recurring Drift (reappearing patterns) and Emergent Drift (new patterns), using specialized experts for recurring patterns and a general expert for emergent patterns.
Details
Motivation: Existing online time series forecasting methods treat concept drift as a monolithic phenomenon, failing to distinguish between different types of drift patterns that require different adaptation strategies.
Method: Proposes DynaME (Dynamic Multi-period Experts) with two components: 1) Committee of specialized experts dynamically fitted to relevant historical periodic patterns for Recurring Drift, and 2) Stable general expert activated during high-uncertainty scenarios for Emergent Drift.
Result: Extensive experiments on benchmark datasets show DynaME effectively adapts to both concept drifts and significantly outperforms existing baselines.
Conclusion: Categorizing concept drift into Recurring and Emergent types enables more effective adaptation strategies, and DynaME’s hybrid approach successfully addresses both drift types in online time series forecasting.
Abstract: Online Time Series Forecasting (OTSF) requires models to continuously adapt to concept drift. However, existing methods often treat concept drift as a monolithic phenomenon. To address this limitation, we first redefine concept drift by categorizing it into two distinct types: Recurring Drift, where previously seen patterns reappear, and Emergent Drift, where entirely new patterns emerge. We then propose DynaME (Dynamic Multi-period Experts), a novel hybrid framework designed to effectively address this dual nature of drift. For Recurring Drift, DynaME employs a committee of specialized experts that are dynamically fitted to the most relevant historical periodic patterns at each time step. For Emergent Drift, the framework detects high-uncertainty scenarios and shifts reliance to a stable, general expert. Extensive experiments on several benchmark datasets and backbones demonstrate that DynaME effectively adapts to both concept drifts and significantly outperforms existing baselines.
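A minimal sketch of the gating idea, using disagreement among period experts as the uncertainty proxy (the periods, threshold, and expert forms are invented stand-ins for DynaME's learned components):

```python
import numpy as np

rng = np.random.default_rng(3)

def expert_forecasts(history, periods=(24, 168)):
    # Each specialized expert forecasts the next value by repeating its period.
    return np.array([history[-p] for p in periods if len(history) >= p])

def general_forecast(history, window=48):
    # Stable fallback expert: a plain moving average.
    return float(np.mean(history[-window:]))

def dyname_style_forecast(history, disagreement_threshold=0.5):
    specialized = expert_forecasts(history)
    # High disagreement among period experts suggests an emergent pattern.
    if specialized.size == 0 or np.std(specialized) > disagreement_threshold:
        return general_forecast(history)      # Emergent Drift branch
    return float(np.mean(specialized))        # Recurring Drift branch

# Recurring pattern: period-24 seasonality, so period experts agree.
t = np.arange(400)
recurring = np.sin(2 * np.pi * t / 24) + 0.05 * rng.standard_normal(400)
print(dyname_style_forecast(recurring))

# Emergent pattern: a linear trend no period expert has seen; experts
# disagree, so the stable general expert takes over.
trend = 0.1 * np.arange(400)
print(dyname_style_forecast(trend))
```

On the seasonal stream the committee's consensus tracks the true next value; on the unseen trend the disagreement gate hands control to the general expert.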
[466] Learning Adaptive LLM Decoding
Chloe H. Su, Zhe Ye, Samuel Tenka, Aidan Yang, Soonho Kong, Udaya Ghai
Main category: cs.LG
TL;DR: Adaptive decoding policies for LLMs that dynamically select sampling strategies based on task difficulty and compute resources, trained with reinforcement learning and verifiable rewards.
Details
Motivation: Current LLM decoding uses fixed sampling hyperparameters despite varying task difficulty and uncertainty across prompts and decoding steps, leading to suboptimal performance under compute constraints.
Method: Introduces lightweight decoding adapters trained with RL and verifiable terminal rewards. Two approaches: 1) Sequence-level as contextual bandit selecting decoding strategy per prompt, 2) Token-level as POMDP selecting sampling actions per token based on model features and remaining budget.
Result: On MATH and CodeContests benchmarks, token-level adapter improves Pass@1 accuracy by up to 10.2% over best static baseline under fixed token budget; sequence-level adapter yields 2-3% gains under fixed parallel sampling.
Conclusion: Learned adaptive decoding policies significantly improve accuracy-budget tradeoffs for LLMs, with both sequence- and token-level adaptation contributing to performance gains.
Abstract: Decoding from large language models (LLMs) typically relies on fixed sampling hyperparameters (e.g., temperature, top-p), despite substantial variation in task difficulty and uncertainty across prompts and individual decoding steps. We propose to learn adaptive decoding policies that dynamically select sampling strategies at inference time, conditioned on available compute resources. Rather than fine-tuning the language model itself, we introduce lightweight decoding adapters trained with reinforcement learning and verifiable terminal rewards (e.g. correctness on math and coding tasks). At the sequence level, we frame decoding as a contextual bandit problem: a policy selects a decoding strategy (e.g. greedy, top-k, min-p) for each prompt, conditioned on the prompt embedding and a parallel sampling budget. At the token level, we model decoding as a partially observable Markov decision process (POMDP), where a policy selects sampling actions at each token step based on internal model features and the remaining token budget. Experiments on the MATH and CodeContests benchmarks show that the learned adapters improve the accuracy-budget tradeoff: on MATH, the token-level adapter improves Pass@1 accuracy by up to 10.2% over the best static baseline under a fixed token budget, while the sequence-level adapter yields 2-3% gains under fixed parallel sampling. Ablation analyses support the contribution of both sequence- and token-level adaptation.
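The sequence-level formulation can be sketched as a plain ε-greedy contextual bandit with a verifiable 0/1 reward. The two discrete "contexts" and per-strategy success probabilities below are synthetic; the paper conditions on prompt embeddings and trains with RL rather than this toy update.

```python
import numpy as np

rng = np.random.default_rng(4)

STRATEGIES = ["greedy", "top_k", "min_p"]

# Unknown ground truth: probability each strategy yields a verifiably correct
# answer, per context (context 0 = "easy" prompts, context 1 = "hard").
TRUE_P = np.array([[0.9, 0.7, 0.6],
                   [0.3, 0.5, 0.7]])

counts = np.ones((2, 3))   # per-(context, strategy) pull counts
values = np.zeros((2, 3))  # running mean reward estimates

def select(context, eps=0.1):
    if rng.uniform() < eps:
        return int(rng.integers(3))            # explore
    return int(np.argmax(values[context]))     # exploit

for _ in range(5000):
    c = int(rng.integers(2))
    a = select(c)
    reward = float(rng.uniform() < TRUE_P[c, a])   # verifiable terminal reward
    counts[c, a] += 1
    values[c, a] += (reward - values[c, a]) / counts[c, a]

# The learned policy should favor greedy on easy prompts, min_p on hard ones.
print([STRATEGIES[int(np.argmax(values[c]))] for c in range(2)])
```

The bandit view captures why a single static strategy is suboptimal: the best decoding choice depends on the prompt context.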
[467] Exclusive Self Attention
Shuangfei Zhai
Main category: cs.LG
TL;DR: XSA improves Transformer self-attention by restricting attention to orthogonal information, excluding self-position data for better context modeling in language tasks.
Details
Motivation: Standard self-attention in Transformers includes self-position information which may not be optimal for context modeling. The authors propose that excluding self-information could improve sequence modeling performance.
Method: Introduces exclusive self attention (XSA) that constrains attention to capture only information orthogonal to each token's own value vector, effectively excluding self-position information while encouraging better context modeling.
Result: XSA consistently outperforms standard self-attention on language modeling tasks across model sizes up to 2.7B parameters, with gains increasing as sequence length grows.
Conclusion: Excluding self-information in attention mechanisms can improve Transformer performance for sequence modeling tasks, suggesting new directions for attention architecture design.
Abstract: We introduce exclusive self attention (XSA), a simple modification of self attention (SA) that improves Transformer’s sequence modeling performance. The key idea is to constrain attention to capture only information orthogonal to the token’s own value vector (thus excluding information of self position), encouraging better context modeling. Evaluated on the standard language modeling task, XSA consistently outperforms SA across model sizes up to 2.7B parameters and shows increasingly larger gains as sequence length grows.
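One plausible reading of the orthogonality constraint (the paper's exact construction may differ): compute the standard attention output, then remove each token's component along its own value vector.

```python
import numpy as np

rng = np.random.default_rng(5)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(q, k, v):
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def exclusive_self_attention(q, k, v):
    """Keep only the component of each attention output that is orthogonal
    to that token's own value vector (illustrative sketch of XSA)."""
    out = self_attention(q, k, v)
    v_unit = v / np.linalg.norm(v, axis=-1, keepdims=True)
    # Subtract the projection of each output row onto its own value direction.
    coeff = np.sum(out * v_unit, axis=-1, keepdims=True)
    return out - coeff * v_unit

T, d = 6, 8
q, k, v = (rng.standard_normal((T, d)) for _ in range(3))
xsa_out = exclusive_self_attention(q, k, v)

# Each output row carries no component along its own value vector.
print(float(np.abs(np.sum(xsa_out * v, axis=-1)).max()))
```

The projection makes "exclusion of self information" literal: whatever the token already encodes in its value vector cannot re-enter through attention.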
[468] PPO-Based Hybrid Optimization for RIS-Assisted Semantic Vehicular Edge Computing
Wei Feng, Jingbo Zhang, Qiong Wu, Pingyi Fan, Qiang Fan
Main category: cs.LG
TL;DR: RIS-aided semantic-aware VEC framework for IoV applications using RIS for connectivity optimization and semantic communication to reduce latency, with hybrid PPO-LP optimization scheme.
Details
Motivation: To support latency-sensitive IoV applications in dynamic environments with intermittent links by integrating RIS for wireless connectivity optimization and semantic communication to minimize transmission latency.
Method: Proposes a RIS-aided semantic-aware VEC framework that formulates a joint optimization problem for offloading ratios, semantic symbols, and RIS phase shifts. Uses a two-tier hybrid scheme with PPO for discrete decision-making and LP for offloading optimization.
Result: The PPO-based hybrid scheme reduces average end-to-end latency by 40-50% compared to GA and QPSO, and maintains low latency even in congested scenarios with up to 30 vehicles.
Conclusion: The proposed framework effectively reduces latency for IoV applications through RIS-aided semantic communication and hybrid optimization, demonstrating strong scalability in congested scenarios.
Abstract: To support latency-sensitive Internet of Vehicles (IoV) applications amidst dynamic environments and intermittent links, this paper proposes a Reconfigurable Intelligent Surface (RIS)-aided semantic-aware Vehicle Edge Computing (VEC) framework. This approach integrates RIS to optimize wireless connectivity and semantic communication to minimize latency by transmitting semantic features. We formulate a comprehensive joint optimization problem by optimizing offloading ratios, the number of semantic symbols, and RIS phase shifts. Considering the problem’s high dimensionality and non-convexity, we propose a two-tier hybrid scheme that employs Proximal Policy Optimization (PPO) for discrete decision-making and Linear Programming (LP) for offloading optimization. The simulation results have validated the proposed framework’s superiority over existing methods. Specifically, the proposed PPO-based hybrid optimization scheme reduces the average end-to-end latency by approximately 40% to 50% compared to Genetic Algorithm (GA) and Quantum-behaved Particle Swarm Optimization (QPSO). Moreover, the system demonstrates strong scalability by maintaining low latency even in congested scenarios with up to 30 vehicles.
[469] Not All News Is Equal: Topic- and Event-Conditional Sentiment from Finetuned LLMs for Aluminum Price Forecasting
Alvaro Paredes Amorin, Andre Python, Christoph Weisser
Main category: cs.LG
TL;DR: Finetuned LLMs extract sentiment from English/Chinese news to forecast aluminum prices, outperforming traditional models during high volatility periods when combined with tabular data.
Details
Motivation: Textual data captures market sentiment crucial for commodity price forecasting, but the effectiveness of lightweight finetuned LLMs for aluminum price prediction and specific market conditions where they work best remains underexplored.
Method: Generate monthly sentiment scores from English/Chinese news headlines using finetuned Qwen3 model, integrate with traditional tabular data (metal indices, exchange rates, inflation, energy prices), evaluate through long-short simulations on Shanghai Metal Exchange (2007-2024).
Result: During high volatility periods, LSTM models with sentiment data from finetuned Qwen3 (Sharpe ratio 1.04) significantly outperform baseline models using only tabular data (Sharpe ratio 0.23). Analysis reveals nuanced roles of news sources, topics, and event types.
Conclusion: Finetuned LLMs effectively extract predictive sentiment signals for aluminum prices, especially valuable during volatile market conditions, with specific news characteristics influencing forecasting performance.
Abstract: By capturing the prevailing sentiment and market mood, textual data has become increasingly vital for forecasting commodity prices, particularly in metal markets. However, the effectiveness of lightweight, finetuned large language models (LLMs) in extracting predictive signals for aluminum prices, and the specific market conditions under which these signals are most informative, remains under-explored. This study generates monthly sentiment scores from English and Chinese news headlines (Reuters, Dow Jones Newswires, and China News Service) and integrates them with traditional tabular data, including base metal indices, exchange rates, inflation rates, and energy prices. We evaluate the predictive performance and economic utility of these models through long-short simulations on the Shanghai Metal Exchange from 2007 to 2024. Our results demonstrate that during periods of high volatility, Long Short-Term Memory (LSTM) models incorporating sentiment data from a finetuned Qwen3 model (Sharpe ratio 1.04) significantly outperform baseline models using tabular data alone (Sharpe ratio 0.23). Subsequent analysis elucidates the nuanced roles of news sources, topics, and event types in aluminum price forecasting.
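The evaluation metric here is standard; a minimal sketch of an annualized Sharpe ratio on a toy long-short sentiment strategy (the return model below is synthetic, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(6)

def annualized_sharpe(returns, periods_per_year=12):
    """Sharpe ratio of a monthly return series (risk-free rate omitted),
    annualized by sqrt(periods per year)."""
    returns = np.asarray(returns, dtype=float)
    return float(np.mean(returns) / np.std(returns, ddof=1)
                 * np.sqrt(periods_per_year))

# Toy long-short simulation: go long aluminum when the month's sentiment
# score is positive, short otherwise, earning the signed next-month return.
n_months = 216                                   # roughly 2007-2024
sentiment = rng.standard_normal(n_months)
price_ret = 0.02 * np.tanh(sentiment) + 0.04 * rng.standard_normal(n_months)
strategy_ret = np.sign(sentiment) * price_ret

print(annualized_sharpe(strategy_ret))
```

When sentiment carries a genuine signal (as in this synthetic return model), the long-short rule earns a positive Sharpe ratio; the paper's 1.04 vs 0.23 comparison is this computation applied to real LSTM forecasts with and without sentiment features.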
[470] Overcoming Valid Action Suppression in Unmasked Policy Gradient Algorithms
Renos Zabounidis, Roy Siegelmann, Mohamad Qadri, Woojun Kim, Simon Stepputtis, Katia P. Sycara
Main category: cs.LG
TL;DR: Action masking in RL with state-dependent action validity outperforms penalty methods by preventing systematic suppression of valid actions at unvisited states due to parameter sharing in softmax policies.
Details
Motivation: The paper addresses a gap in understanding why action masking consistently outperforms penalty-based handling of invalid actions in RL environments with state-dependent action validity. While existing theory only shows that masking preserves the policy gradient theorem, the authors identify a distinct failure mode of unmasked training that hasn't been theoretically explained.
Method: The authors provide theoretical analysis proving that for softmax policies with shared features, when an action is invalid at visited states but valid at an unvisited state, the probability of that action at the unvisited state is bounded by exponential decay due to parameter sharing and the zero-sum identity of softmax logits. They validate empirically that deep networks exhibit the feature alignment condition required for suppression, and conduct experiments on Craftax, Craftax-Classic, and MiniHack environments.
Result: The theoretical analysis reveals that entropy regularization trades off between protecting valid actions and sample efficiency, a tradeoff that action masking eliminates. Experiments confirm the predicted exponential suppression and demonstrate that feasibility classification enables deployment without oracle masks.
Conclusion: Action masking is superior to penalty methods because it prevents systematic suppression of valid actions at unvisited states, which occurs due to parameter sharing in neural networks. The paper provides both theoretical understanding and practical validation of why masking works better, and shows feasibility classification can enable deployment without requiring oracle masks.
Abstract: In reinforcement learning environments with state-dependent action validity, action masking consistently outperforms penalty-based handling of invalid actions, yet existing theory only shows that masking preserves the policy gradient theorem. We identify a distinct failure mode of unmasked training: it systematically suppresses valid actions at states the agent has not yet visited. This occurs because gradients pushing down invalid actions at visited states propagate through shared network parameters to unvisited states where those actions are valid. We prove that for softmax policies with shared features, when an action is invalid at visited states but valid at an unvisited state $s^*$, the probability $π(a \mid s^*)$ is bounded by exponential decay due to parameter sharing and the zero-sum identity of softmax logits. This bound reveals that entropy regularization trades off between protecting valid actions and sample efficiency, a tradeoff that masking eliminates. We validate empirically that deep networks exhibit the feature alignment condition required for suppression, and experiments on Craftax, Craftax-Classic, and MiniHack confirm the predicted exponential suppression and demonstrate that feasibility classification enables deployment without oracle masks.
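The masking mechanism itself is standard and easy to state: invalid logits are set to -inf before the softmax, so invalid actions receive exactly zero probability and, unlike the penalty approach, contribute no suppressive gradient through shared parameters.

```python
import numpy as np

def masked_softmax_policy(logits, valid_mask):
    """Standard action masking: invalid actions get -inf logits before the
    softmax, yielding exactly zero probability for them."""
    masked = np.where(valid_mask, logits, -np.inf)
    z = masked - masked.max()     # stable softmax; exp(-inf) -> 0
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5, -1.0])
valid = np.array([True, False, True, True])

probs = masked_softmax_policy(logits, valid)
print(probs)

# Contrast: the unmasked policy still assigns mass to the invalid action,
# and penalizing it at visited states is what leaks, via shared parameters,
# into suppression at states where the action is valid (the failure mode
# the paper analyzes).
unmasked = np.exp(logits - logits.max()) / np.exp(logits - logits.max()).sum()
print(unmasked[1])
```

Because the masked probability is identically zero, no "push this action down" gradient ever flows, which is why masking sidesteps the exponential suppression bound.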
[471] Probabilistic Hysteresis Factor Prediction for Electric Vehicle Batteries with Graphite Anodes Containing Silicon
Runyao Yu, Viviana Kleine, Philipp Gromotka, Thomas Rudolf, Adrian Eisenmann, Gautham Ram Chandra Mouli, Peter Palensky, Jochen L. Cremer
Main category: cs.LG
TL;DR: Data-driven probabilistic hysteresis factor prediction for silicon-graphite anode batteries to improve state-of-charge estimation, with focus on uncertainty quantification and computational efficiency.
Details
Motivation: Silicon-graphite anode batteries have higher energy density but exhibit pronounced voltage hysteresis, making state-of-charge estimation challenging. Existing approaches don't address uncertainty quantification or computational constraints for these advanced batteries.
Method: Proposed data harmonization framework to standardize heterogeneous driving cycles, then applied statistical learning and deep learning models for probabilistic hysteresis factor prediction with uncertainty quantification. Evaluated generalizability through retraining, zero-shot prediction, fine-tuning, and joint training across different vehicle models.
Result: Extensive experiments evaluated model performance in predicting hysteresis factor with uncertainties while considering computational efficiency. The optimal model configuration was assessed for generalizability to unseen vehicle models through various training strategies.
Conclusion: The research addresses key challenges in state-of-charge estimation for silicon-graphite anode batteries, facilitating adoption of advanced battery technologies through improved hysteresis modeling with uncertainty quantification.
Abstract: Batteries with silicon-graphite-based anodes, which offer higher energy density and improved charging performance, introduce pronounced voltage hysteresis, making state-of-charge (SoC) estimation particularly challenging. Existing approaches to modeling hysteresis rely on exhaustive high-fidelity tests or focus on conventional graphite-based lithium-ion batteries, without considering uncertainty quantification or computational constraints. This work introduces a data-driven approach for probabilistic hysteresis factor prediction, with a particular emphasis on applications involving silicon-graphite anode-based batteries. A data harmonization framework is proposed to standardize heterogeneous driving cycles across varying operating conditions. Statistical learning and deep learning models are applied to assess performance in predicting the hysteresis factor with uncertainties while considering computational efficiency. Extensive experiments are conducted to evaluate the generalizability of the optimal model configuration in unseen vehicle models through retraining, zero-shot prediction, fine-tuning, and joint training. By addressing key challenges in SoC estimation, this research facilitates the adoption of advanced battery technologies. A summary page is available at: https://runyao-yu.github.io/Porsche_Hysteresis_Factor_Prediction/
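The abstract does not pin down the probabilistic model, so the following is an illustration only: one standard way to predict a quantity "with uncertainties" is a heteroscedastic Gaussian head, where the model emits a mean and a log-variance per sample and is trained with the Gaussian negative log-likelihood.

```python
import numpy as np

def gaussian_nll(y, mu, log_var):
    """Mean Gaussian negative log-likelihood of targets y under
    N(mu, exp(log_var)). Predicting log-variance (rather than variance)
    keeps the variance positive by construction and yields a per-sample
    aleatoric-uncertainty estimate when the loss is minimized."""
    var = np.exp(log_var)
    return 0.5 * np.mean(log_var + (y - mu) ** 2 / var + np.log(2 * np.pi))
```

A useful sanity property: when predictions are wrong, the loss rewards honestly inflating the predicted variance rather than claiming false confidence.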
[472] Decoupling Reasoning and Confidence: Resurrecting Calibration in Reinforcement Learning from Verifiable Rewards
Zhengzhao Ma, Xueru Wen, Boxi Cao, Yaojie Lu, Hongyu Lin, Jinglin Yang, Min He, Xianpei Han, Le Sun
Main category: cs.LG
TL;DR: DCPO framework decouples reasoning and calibration objectives to address over-confidence issues in RLVR-trained LLMs, achieving better calibration while maintaining accuracy.
Details
Motivation: RLVR improves LLM reasoning but causes calibration degeneration where models become over-confident in wrong answers. Existing methods directly combine calibration objectives with accuracy optimization, but there's a fundamental gradient conflict between these objectives.
Method: Proposes DCPO (Decoupled Calibration Policy Optimization) framework that systematically decouples reasoning and calibration objectives. Based on theoretical analysis showing gradient conflict between maximizing policy accuracy and minimizing calibration error.
Result: DCPO preserves accuracy comparable to GRPO while achieving best calibration performance and substantially mitigating over-confidence issues in LLMs trained with RLVR.
Conclusion: Provides valuable insights and practical solution for more reliable LLM deployment by addressing the fundamental conflict between reasoning optimization and calibration objectives.
Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) significantly enhances large language model (LLM) reasoning but severely suffers from calibration degeneration, where models become excessively over-confident in incorrect answers. Previous studies have sought to directly incorporate a calibration objective into the existing optimization target. However, our theoretical analysis demonstrates that there exists a fundamental gradient conflict between the optimization for maximizing policy accuracy and minimizing calibration error. Building on this insight, we propose DCPO, a simple yet effective framework that systematically decouples reasoning and calibration objectives. Extensive experiments demonstrate that our DCPO not only preserves accuracy on par with GRPO but also achieves the best calibration performance and substantially mitigates the over-confidence issue. Our study provides valuable insights and a practical solution for more reliable LLM deployment.
[473] Causally Sufficient and Necessary Feature Expansion for Class-Incremental Learning
Zhen Zhang, Jielei Chu, Tianrui Li
Main category: cs.LG
TL;DR: A causal regularization method for Class Incremental Learning that addresses feature collision through Probability of Necessity and Sufficiency analysis and dual-scope counterfactual generation.
Details
Motivation: Current expansion-based CIL methods freeze old features but still suffer from feature collision due to spurious correlations. These correlations occur both within tasks (intra-task) causing non-robust features, and between tasks (inter-task) causing semantic confusion between visually similar classes.
Method: Proposes a PNS-based regularization method with CPNS (extended PNS for CIL) to quantify causal completeness and separability. Uses dual-scope counterfactual generator with twin networks to simultaneously generate intra-task counterfactual features (minimizing intra-task PNS risk) and inter-task interfering features (minimizing inter-task PNS risk).
Result: Theoretical analyses confirm reliability. Extensive experiments demonstrate effectiveness as a plug-and-play method for expansion-based CIL to mitigate feature collision.
Conclusion: The proposed CPNS-based regularization effectively addresses feature collision in expansion-based CIL by tackling both intra-task and inter-task spurious correlations through causal analysis and counterfactual feature generation.
Abstract: Current expansion-based methods for Class Incremental Learning (CIL) effectively mitigate catastrophic forgetting by freezing old features. However, such task-specific features learned from the new task may collide with the old features. From a causal perspective, spurious feature correlations are the main cause of this collision, manifesting in two scopes: (i) guided by empirical risk minimization (ERM), intra-task spurious correlations cause task-specific features to rely on shortcut features. These non-robust features are vulnerable to interference, inevitably drifting into the feature space of other tasks; (ii) inter-task spurious correlations induce semantic confusion between visually similar classes across tasks. To address this, we propose a Probability of Necessity and Sufficiency (PNS)-based regularization method to guide feature expansion in CIL. Specifically, we first extend the definition of PNS to expansion-based CIL, termed CPNS, which quantifies both the causal completeness of intra-task representations and the separability of inter-task representations. We then introduce a dual-scope counterfactual generator based on twin networks to ensure the measurement of CPNS, which simultaneously generates: (i) intra-task counterfactual features to minimize intra-task PNS risk and ensure causal completeness of task-specific features, and (ii) inter-task interfering features to minimize inter-task PNS risk, ensuring the separability of inter-task representations. Theoretical analyses confirm its reliability. The regularization is a plug-and-play method for expansion-based CIL to mitigate feature collision. Extensive experiments demonstrate the effectiveness of the proposed method.
[474] Wrong Code, Right Structure: Learning Netlist Representations from Imperfect LLM-Generated RTL
Siyang Cai, Cangyuan Li, Yinhe Han, Ying Wang
Main category: cs.LG
TL;DR: Using imperfect LLM-generated RTL code as synthetic training data for netlist representation learning, overcoming data scarcity in circuit analysis.
Details
Motivation: Circuit netlist representation learning is constrained by scarce labeled datasets due to IP protection and annotation costs. Existing methods focus on small-scale circuits with clean labels, limiting scalability to realistic designs. LLMs can generate RTL at scale but produce functionally incorrect code, hindering their use in circuit analysis.
Method: Proposes a cost-effective data augmentation and training framework that systematically exploits imperfect LLM-generated RTL as training data. The key insight is that even functionally imperfect LLM-generated RTL, when synthesized to netlists, preserves structural patterns indicative of intended functionality. Forms an end-to-end pipeline from automated code generation to downstream tasks.
Result: Evaluations on circuit functional understanding tasks (sub-circuit boundary identification and component classification) across benchmarks of increasing scales show that models trained on noisy synthetic corpus generalize well to real-world netlists. The approach matches or surpasses methods trained on scarce high-quality data, effectively breaking the data bottleneck in circuit representation learning.
Conclusion: LLM-generated RTL, despite functional imperfections, provides valuable structural patterns for netlist representation learning. The proposed framework enables scalable training data generation and addresses the data scarcity problem in circuit analysis, extending task scope from operator-level to IP-level.
Abstract: Learning effective netlist representations is fundamentally constrained by the scarcity of labeled datasets, as real designs are protected by Intellectual Property (IP) and costly to annotate. Existing work therefore focuses on small-scale circuits with clean labels, limiting scalability to realistic designs. Meanwhile, Large Language Models (LLMs) can generate Register-Transfer-Level (RTL) at scale, but their functional incorrectness has hindered their use in circuit analysis. In this work, we make a key observation: even when LLM-Generated RTL is functionally imperfect, the synthesized netlists still preserve structural patterns that are strongly indicative of the intended functionality. Building on this insight, we propose a cost-effective data augmentation and training framework that systematically exploits imperfect LLM-Generated RTL as training data for netlist representation learning, forming an end-to-end pipeline from automated code generation to downstream tasks. We conduct evaluations on circuit functional understanding tasks, including sub-circuit boundary identification and component classification, across benchmarks of increasing scales, extending the task scope from operator-level to IP-level. The evaluations demonstrate that models trained on our noisy synthetic corpus generalize well to real-world netlists, matching or even surpassing methods trained on scarce high-quality data and effectively breaking the data bottleneck in circuit representation learning.
[475] GIAT: A Geologically-Informed Attention Transformer for Lithology Identification
Jie Li, Qishun Yang, Nuo Li
Main category: cs.LG
TL;DR: GIAT integrates geological priors into Transformer attention for more accurate and interpretable lithology identification from well logs.
Details
Motivation: Transformers for sequence modeling lack geological guidance and interpretability, limiting trustworthiness in subsurface resource evaluation applications.
Method: Geologically-Informed Attention Transformer (GIAT) uses CSC filters to generate geological relational matrices that bias self-attention toward geologically coherent patterns.
Result: Achieves up to 95.4% accuracy on two challenging datasets, outperforming existing models with better interpretability and geological coherence.
Conclusion: GIAT provides a new paradigm for building accurate, reliable, and interpretable deep learning models for geoscience applications.
Abstract: Accurate lithology identification from well logs is crucial for subsurface resource evaluation. Although Transformer-based models excel at sequence modeling, their “black-box” nature and lack of geological guidance limit their performance and trustworthiness. To overcome these limitations, this letter proposes the Geologically-Informed Attention Transformer (GIAT), a novel framework that deeply fuses data-driven geological priors with the Transformer’s attention mechanism. The core of GIAT is a new attention-biasing mechanism. We repurpose Category-Wise Sequence Correlation (CSC) filters to generate a geologically-informed relational matrix, which is injected into the self-attention calculation to explicitly guide the model toward geologically coherent patterns. On two challenging datasets, GIAT achieves state-of-the-art performance with an accuracy of up to 95.4%, significantly outperforming existing models. More importantly, GIAT demonstrates exceptional interpretation faithfulness under input perturbations and generates geologically coherent predictions. Our work presents a new paradigm for building more accurate, reliable, and interpretable deep learning models for geoscience applications.
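The injection point is ordinary additive attention biasing. A sketch of that mechanism (the construction of the relational matrix from CSC filters is the paper's contribution and is not reproduced here; the bias passed in below is just a placeholder):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def biased_attention(Q, K, V, bias):
    """Scaled dot-product attention with an additive pre-softmax bias.
    GIAT injects its geologically-informed relational matrix at exactly
    this point, pulling attention toward geologically coherent pairs."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d) + bias   # bias: (T, T) relational matrix
    return softmax(scores) @ V
```

With a zero bias this reduces to standard self-attention; a strongly negative bias on a position pair effectively forbids it, which is how a prior can reshape the attention pattern without new parameters.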
[476] Better Bounds for the Distributed Experts Problem
David P. Woodruff, Samson Zhou
Main category: cs.LG
TL;DR: A distributed online learning protocol for experts across servers with ℓ_p norm losses, achieving improved regret bounds with minimal communication.
Details
Motivation: The paper addresses the distributed experts problem where experts are distributed across multiple servers, aiming to minimize regret while minimizing communication overhead between servers.
Method: Develops a distributed protocol that achieves regret roughly R ≳ 1/(√T·poly log(nsT)) using O((n/R² + s/R²)·max(s^{1-2/p},1)·poly log(nsT)) bits of communication.
Result: The protocol improves on previous work by achieving better regret bounds with reduced communication requirements for distributed expert learning across servers.
Conclusion: The paper presents an efficient distributed learning protocol for the experts problem that balances regret minimization with communication efficiency across distributed servers.
Abstract: In this paper, we study the distributed experts problem, where $n$ experts are distributed across $s$ servers for $T$ timesteps. The loss of each expert at each time $t$ is the $\ell_p$ norm of the vector that consists of the losses of the expert at each of the $s$ servers at time $t$. The goal is to minimize the regret $R$, i.e., the loss of the distributed protocol compared to the loss of the best expert, amortized over all $T$ times, while using the minimum amount of communication. We give a protocol that achieves regret roughly $R\gtrsim\frac{1}{\sqrt{T}\cdot\text{poly}\log(nsT)}$, using $\mathcal{O}\left(\frac{n}{R^2}+\frac{s}{R^2}\right)\cdot\max(s^{1-2/p},1)\cdot\text{poly}\log(nsT)$ bits of communication, which improves on previous work.
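For readers less familiar with the underlying problem: in the centralized setting, the classic Hedge (multiplicative-weights) algorithm already attains vanishing amortized regret; the paper's contribution is simulating this kind of update over $s$ servers with low communication when each round's loss is an $\ell_p$ aggregate. A sketch of the centralized baseline only, not the paper's protocol:

```python
import numpy as np

def hedge(losses, eta):
    """Centralized multiplicative-weights (Hedge) for the experts
    problem. losses: (T, n) array of per-round expert losses in [0, 1].
    Returns the amortized regret vs. the best fixed expert."""
    T, n = losses.shape
    w = np.ones(n)
    total = 0.0
    for t in range(T):
        p = w / w.sum()
        total += p @ losses[t]          # learner's expected loss this round
        w *= np.exp(-eta * losses[t])   # exponentially downweight lossy experts
    return (total - losses.sum(axis=0).min()) / T
```

With a clearly best expert the amortized regret decays quickly, which is the behavior any communication-efficient distributed protocol must approximate.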
[477] Latent-DARM: Bridging Discrete Diffusion And Autoregressive Models For Reasoning
Lina Berrayana, Ahmed Heakl, Abdullah Sohail, Thomas Hofmann, Salman Khan, Wei Chen
Main category: cs.LG
TL;DR: Latent-DARM enables collaboration between discrete diffusion language models (planners) and autoregressive language models (executors) through latent-space communication, improving reasoning performance while reducing token usage.
Details
Motivation: Autoregressive language models (ARMs) are limited in global reasoning and plan revision, while discrete diffusion language models (DDLMs) have strong planning capabilities but poor text fluency. There's a need to bridge these heterogeneous models for better multi-agent collaboration.
Method: Latent-space communication framework that connects DDLM planners with ARM executors, allowing them to collaborate without requiring DDLMs to generate fluent text directly.
Result: Outperforms text-based interfaces, improving accuracy from 27.0% to 36.0% on DART-5 and from 0.0% to 14.0% on AIME2024. Approaches state-of-the-art reasoning models while using less than 2.2% of token budget.
Conclusion: Latent-DARM advances multi-agent collaboration among heterogeneous models by enabling effective communication between planning-focused DDLMs and execution-focused ARMs through latent representations.
Abstract: Most multi-agent systems rely exclusively on autoregressive language models (ARMs) that are based on sequential generation. Although effective for fluent text, ARMs limit global reasoning and plan revision. On the other hand, Discrete Diffusion Language Models (DDLMs) enable non-sequential, globally revisable generation and have shown strong planning capabilities, but their limited text fluency hinders direct collaboration with ARMs. We introduce Latent-DARM, a latent-space communication framework bridging DDLM (planners) and ARM (executors), maximizing collaborative benefits. Across mathematical, scientific, and commonsense reasoning benchmarks, Latent-DARM outperforms text-based interfaces on average, improving accuracy from 27.0% to 36.0% on DART-5 and from 0.0% to 14.0% on AIME2024. Latent-DARM approaches the results of state-of-the-art reasoning models while using less than 2.2% of its token budget. This work advances multi-agent collaboration among agents with heterogeneous models.
[478] $P^2$GNN: Two Prototype Sets to boost GNN Performance
Arihant Jain, Gundeep Arora, Anoop Saladi, Chaosheng Dong
Main category: cs.LG
TL;DR: P²GNN introduces a plug-and-play technique using prototypes to enhance message-passing GNNs by providing global context and denoising local neighborhoods, improving performance on node recommendation and classification tasks.
Details
Motivation: MP-GNNs face two key limitations: (1) heavy reliance on local context without global graph-level features, and (2) assumption of strong homophily among connected nodes, making them vulnerable to noisy local neighborhoods.
Method: P²GNN uses prototypes in two ways: as universally accessible neighbors for all nodes to enrich global context, and by aligning messages to clustered prototypes to provide denoising effects. The method is plug-and-play and extensible to all message-passing GNNs.
Result: Extensive experiments across 18 datasets show P²GNN outperforms production models in e-commerce and achieves top average rank on open-source datasets for node recommendation and classification tasks.
Conclusion: P²GNN establishes itself as a leading approach by effectively addressing global context limitations and noise mitigation in local neighborhoods, with qualitative analysis supporting these improvements.
Abstract: Message Passing Graph Neural Networks (MP-GNNs) have garnered attention for addressing various industry challenges, such as user recommendation and fraud detection. However, they face two major hurdles: (1) heavy reliance on local context, often lacking information about the global context or graph-level features, and (2) assumption of strong homophily among connected nodes, struggling with noisy local neighborhoods. To tackle these, we introduce $P^2$GNN, a plug-and-play technique leveraging prototypes to optimize message passing, enhancing the performance of the base GNN model. Our approach views the prototypes in two ways: (1) as universally accessible neighbors for all nodes, enriching global context, and (2) aligning messages to clustered prototypes, offering a denoising effect. We demonstrate the extensibility of our proposed method to all message-passing GNNs and conduct extensive experiments across 18 datasets, including proprietary e-commerce datasets and open-source datasets, on node recommendation and node classification tasks. Results show that $P^2$GNN outperforms production models in e-commerce and achieves the top average rank on open-source datasets, establishing it as a leading approach. Qualitative analysis supports the value of global context and noise mitigation in the local neighborhood in enhancing performance.
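The first of the two prototype roles, prototypes as universally accessible neighbors, can be sketched as a small modification to mean aggregation (the clustering/denoising role and the learned prototypes themselves are paper-specific and not shown; this is an illustration, not the authors' code):

```python
import numpy as np

def prototype_augmented_aggregate(X, adj, prototypes):
    """Mean message-passing aggregation where every node also receives
    messages from a shared set of prototype vectors acting as extra
    'global neighbors'. X: (N, d) node features, adj: (N, N) 0/1
    adjacency, prototypes: (k, d)."""
    proto_sum = prototypes.sum(axis=0)
    k = prototypes.shape[0]
    out = np.zeros_like(X)
    for i in range(X.shape[0]):
        nbrs = adj[i].astype(bool)
        msg = X[nbrs].sum(axis=0) + proto_sum    # local + global neighbors
        out[i] = msg / (nbrs.sum() + k)
    return out
```

One immediate consequence: a node with few or noisy neighbors is regularized toward the prototype mean rather than being dominated by local noise, which is the denoising intuition the paper describes.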
[479] The Radio-Frequency Transformer for Signal Separation
Egor Lifar, Semyon Savkin, Rachana Madhukara, Tejas Jayashankar, Yury Polyanskiy, Gregory W. Wornell
Main category: cs.LG
TL;DR: Data-driven signal separation method using discrete tokenization and transformer training with cross-entropy loss, achieving strong performance on RF signal separation with zero-shot generalization.
Details
Motivation: To address the problem of separating signals of interest from unknown non-Gaussian background/interference using fully data-driven approaches, moving beyond conventional methods like MSE loss.
Method: Modified SoundStream tokenizer with additional transformer layers and finite-scalar quantization (FSQ), followed by end-to-end transformer training using cross-entropy loss instead of MSE.
Result: Achieved competitive performance on MIT RF Challenge dataset, including 122x reduction in bit-error rate over prior SOTA for separating QPSK signals from 5G interference, with zero-shot generalization to unseen mixtures.
Conclusion: The method shows strong signal separation capabilities with cross-entropy training outperforming MSE, and has potential applications beyond RF to other scientific sensing problems like gravitational-wave data.
Abstract: We study a problem of signal separation: estimating a signal of interest (SOI) contaminated by an unknown non-Gaussian background/interference. Given the training data consisting of examples of SOI and interference, we show how to build a fully data-driven signal separator. To that end we learn a good discrete tokenizer for SOI and then train an end-to-end transformer on a cross-entropy loss. Training with a cross-entropy shows substantial improvements over the conventional mean-squared error (MSE). Our tokenizer is a modification of Google’s SoundStream, which incorporates additional transformer layers and switches from VQVAE to finite-scalar quantization (FSQ). Across real and synthetic mixtures from the MIT RF Challenge dataset, our method achieves competitive performance, including a 122x reduction in bit-error rate (BER) over prior state-of-the-art techniques for separating a QPSK signal from 5G interference. The learned representation adapts to the interference type without side information and shows zero-shot generalization to unseen mixtures at inference time, underscoring its potential beyond RF. Although we instantiate our approach on radio-frequency mixtures, we expect the same architecture to apply to gravitational-wave data (e.g., LIGO strain) and other scientific sensing problems that require data-driven modeling of background and noise.
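The FSQ component the paper swaps in for VQ-VAE is conceptually simple: bound each latent dimension, then round it to a small fixed grid, so a token is just the tuple of per-dimension level indices. A toy sketch of the quantizer only (real FSQ training uses a straight-through gradient estimator, omitted here):

```python
import numpy as np

def fsq(z, levels):
    """Finite scalar quantization of a latent vector z: squash each
    dimension into a bounded range with tanh, then round it to one of
    `levels` evenly spaced values in [-1, 1]. No codebook is learned;
    the 'code' is the grid cell itself."""
    half = (levels - 1) / 2
    bounded = np.tanh(z) * half       # each dim now in (-half, half)
    return np.round(bounded) / half   # snap to the grid, rescale to [-1, 1]
```

The appeal over vector quantization is that there is no codebook to collapse or to maintain with auxiliary losses; the quantizer is fixed by construction.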
[480] Beyond Test-Time Training: Learning to Reason via Hardware-Efficient Optimal Control
Peihao Wang, Shan Yang, Xijun Wang, Tesi Xiao, Xin Liu, Changlong Yu, Yu Lou, Pan Li, Zhangyang Wang, Ming Lin, René Vidal
Main category: cs.LG
TL;DR: TTC layer embeds optimal control planning into LLMs at inference time using LQR over latent states, improving mathematical reasoning performance without test-time training.
Details
Motivation: Current language models lack native planning capabilities for reasoning tasks, requiring external methods like reinforcement learning or test-time training. The authors aim to embed planning directly into model architecture through optimal control principles.
Method: Introduces Test-Time Control (TTC) layer that performs finite-horizon Linear Quadratic Regulator (LQR) planning over latent states at inference time. Uses symplectic formulation for hardware-efficient LQR solver implemented as fused CUDA kernel. Integrated as adapter into pretrained LLMs.
Result: Improves mathematical reasoning performance by up to +27.8% on MATH-500 and achieves 2-3x Pass@8 improvements on AMC and AIME benchmarks.
Conclusion: Embedding optimal control as architectural component provides effective and scalable mechanism for reasoning beyond test-time training, enabling planning before prediction within neural architectures.
Abstract: Associative memory has long underpinned the design of sequential models. Beyond recall, humans reason by projecting future states and selecting goal-directed actions, a capability that modern language models increasingly require but do not natively encode. While prior work uses reinforcement learning or test-time training, planning remains external to the model architecture. We formulate reasoning as optimal control and introduce the Test-Time Control (TTC) layer, which performs finite-horizon LQR planning over latent states at inference time, represents a value function within neural architectures, and leverages it as the nested objective to enable planning before prediction. To ensure scalability, we derive a hardware-efficient LQR solver based on a symplectic formulation and implement it as a fused CUDA kernel, enabling parallel execution with minimal overhead. Integrated as an adapter into pretrained LLMs, TTC layers improve mathematical reasoning performance by up to +27.8% on MATH-500 and 2-3x Pass@8 improvements on AMC and AIME, demonstrating that embedding optimal control as an architectural component provides an effective and scalable mechanism for reasoning beyond test-time training.
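The planning primitive inside the TTC layer is the standard finite-horizon discrete LQR, which has a closed-form backward Riccati recursion. A reference sketch of that recursion (the paper's symplectic CUDA solver computes the same gains with better hardware efficiency; this is not their implementation):

```python
import numpy as np

def lqr_gains(A, B, Q, R, Qf, T):
    """Finite-horizon discrete-time LQR via backward Riccati recursion.
    Returns feedback gains K_0..K_{T-1} for u_t = -K_t x_t minimizing
    sum of x'Qx + u'Ru plus terminal cost x'Qf x."""
    P = Qf
    gains = []
    for _ in range(T):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)  # optimal gain
        P = Q + A.T @ P @ (A - B @ K)                      # cost-to-go update
        gains.append(K)
    return gains[::-1]   # reorder to t = 0..T-1
```

Rolling the resulting controller forward drives the state toward the origin, which is the "plan before predict" behavior the layer exposes over latent states.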
[481] Efficient Reasoning at Fixed Test-Time Cost via Length-Aware Attention Priors and Gain-Aware Training
Rian Atri
Main category: cs.LG
TL;DR: Training-only components (RPA attention prior and Guardian controller) improve Transformer reasoning under tight compute constraints without increasing inference cost.
Details
Motivation: Address efficient reasoning under tight compute constraints - how to make structured, correct decisions without increasing test-time cost, especially for small/medium Transformers.
Method: Two training-only components: 1) RPA (fuzzy regime position alignment) - length-aware attention prior as normalized pre-softmax bias; 2) Guardian - minimal gain-aware controller that nudges attention sharpness based on validation improvements. Both transfer to broader differentiable optimizers and are disabled at inference.
Result: Reduces validation cross entropy on WikiText 2 while matching baseline latency and memory. Inference adds only cached bias per head with negligible overhead and no measurable p50 latency shift.
Conclusion: Length-aware priors and late-phase gain control preserve improvements in long-span, noisy logit regimes while keeping test time costs effectively unchanged.
Abstract: We study efficient reasoning under tight compute. We ask how to make structured, correct decisions without increasing test-time cost. We add two training-only components to small and medium Transformers that also transfer to broader differentiable optimizers. First, a length-aware attention prior built via fuzzy regime position alignment (RPA) yields a normalized pre-softmax bias that guides attention like a structured regularizer while adding no new inference parameters. Second, a minimal gain-aware controller, Guardian, nudges attention sharpness only when validation improvements warrant it, following a two-timescale policy-gradient view of nonconvex optimization. It is disabled at inference. A KL perspective shows $\text{softmax}(z + \log \pi)$ as MAP with KL regularization, grounding the prior in a principled objective. Under strict compute parity on WikiText-2, we reduce validation cross-entropy while matching baseline latency and memory. At inference, we add a precomputed, cached prior $B(T)$ as a single additive bias per head. The controller does not run. In practice, this incurs negligible overhead, a cached bias add per head, with no measurable p50 latency shift. Our results suggest that length-aware priors and late-phase gain control preserve scarce improvements, especially in long-span, noisy-logit regimes, while keeping test-time costs effectively unchanged.
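The KL claim in the abstract has a one-line closed form worth spelling out: the distribution maximizing ⟨p, z⟩ − KL(p‖π) is p* = softmax(z + log π), so the prior enters inference as nothing more than a cached additive bias. A quick numeric check of that identity (the z and π values are arbitrary examples, not from the paper):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

z = np.array([1.0, 0.2, -0.5])   # logits (example values)
pi = np.array([0.2, 0.5, 0.3])   # prior over positions (example values)

# Additive-bias form vs. the closed-form KL-regularized MAP solution
# p* proportional to pi * exp(z):
biased = softmax(z + np.log(pi))
closed = pi * np.exp(z) / (pi * np.exp(z)).sum()
```

This is why the prior costs only one cached bias add per head at inference: the KL-regularized objective is solved implicitly by the softmax itself.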
[482] Transductive Generalization via Optimal Transport and Its Application to Graph Node Classification
MoonJeong Park, Seungbeom Lee, Kyungmin Kim, Jaeseung Heo, Seunghyuk Cho, Shouheng Li, Sangdon Park, Dongwoo Kim
Main category: cs.LG
TL;DR: New representation-based generalization bounds for transductive learning using optimal transport, specifically Wasserstein distances between encoded feature distributions, with applications to graph neural networks.
Details
Motivation: Existing transductive bounds rely on classical complexity measures that are computationally intractable and often misaligned with empirical behavior. There's a need for more practical and accurate generalization bounds in transductive settings where learned representations are dependent and test features are accessible during training.
Method: Derive global and class-wise generalization bounds via optimal transport, expressed in terms of Wasserstein distances between encoded feature distributions. The approach is specifically applied to graph neural networks, analyzing how the GNN aggregation process transforms representation distributions.
Result: The bounds are efficiently computable and strongly correlate with empirical generalization in graph node classification, improving upon classical complexity measures. The analysis reveals how GNN aggregation induces a trade-off between intra-class concentration and inter-class separation, yielding depth-dependent characterizations that capture the non-monotonic relationship between depth and generalization error.
Conclusion: The proposed optimal transport-based bounds provide practical and accurate generalization measures for transductive learning, particularly for graph neural networks, offering insights into the relationship between network depth, representation distributions, and generalization performance.
Abstract: Many existing transductive bounds rely on classical complexity measures that are computationally intractable and often misaligned with empirical behavior. In this work, we establish new representation-based generalization bounds in a distribution-free transductive setting, where learned representations are dependent, and test features are accessible during training. We derive global and class-wise bounds via optimal transport, expressed in terms of Wasserstein distances between encoded feature distributions. We demonstrate that our bounds are efficiently computable and strongly correlate with empirical generalization in graph node classification, improving upon classical complexity measures. Additionally, our analysis reveals how the GNN aggregation process transforms the representation distributions, inducing a trade-off between intra-class concentration and inter-class separation. This yields depth-dependent characterizations that capture the non-monotonic relationship between depth and generalization error observed in practice. The code is available at https://github.com/ml-postech/Transductive-OT-Gen-Bound.
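To make the bound's central quantity concrete: for one-dimensional empirical distributions with equal sample counts, the Wasserstein-1 distance reduces to an average of sorted-coordinate gaps, which is part of why such bounds are cheap to evaluate. This sketch covers only that 1-D special case (the paper's multivariate distances between encoded features need a proper OT solver):

```python
import numpy as np

def wasserstein_1d(u, v):
    """W_1 between two 1-D empirical distributions with equal sample
    counts: sort both samples and average the per-rank absolute gaps.
    This is the closed form of optimal transport on the line."""
    return np.abs(np.sort(u) - np.sort(v)).mean()
```

A handy sanity property: shifting a distribution by a constant c moves it exactly W_1 = |c| away, matching the transport intuition of "mass moved times distance".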
[483] DendroNN: Dendrocentric Neural Networks for Energy-Efficient Classification of Event-Based Data
Jann Krausse, Zhe Su, Kyrus Mama, Maryada, Klaus Knobloch, Giacomo Indiveri, Jürgen Becker
Main category: cs.LG
TL;DR: DendroNN: A novel spiking neural network inspired by dendritic sequence detection mechanisms for efficient event-based spatiotemporal processing with competitive accuracy on time series datasets.
Details
Motivation: Feed-forward spiking neural networks struggle with accurate temporal information decoding and often resort to recurrence or delays that reduce hardware efficiency. Dendrites in biological systems offer computational capabilities that could enhance temporal processing in machine learning systems.
Method: Introduces DendroNN, a dendrocentric neural network that identifies unique incoming spike sequences as spatiotemporal features. Uses a rewiring phase to train non-differentiable spike sequences without gradients, memorizing frequent sequences and discarding non-discriminative ones. Proposes asynchronous digital hardware architecture with time-wheel mechanism for event-driven design.
Result: DendroNN achieves competitive accuracies across various event-based time series datasets. The hardware architecture achieves up to 4x higher efficiency than state-of-the-art neuromorphic hardware at comparable accuracy on audio classification tasks.
Conclusion: DendroNN offers a novel approach to low-power spatiotemporal processing on event-driven hardware by leveraging dendritic computational principles, achieving both accuracy and hardware efficiency improvements.
Abstract: Spatiotemporal information is at the core of diverse sensory processing and computational tasks. Feed-forward spiking neural networks can be used to solve these tasks while offering potential benefits in terms of energy efficiency by computing in an event-based manner. However, they have trouble decoding temporal information with high accuracy. Thus, they commonly resort to recurrence or delays to enhance their temporal computing ability, which, however, brings downsides in terms of hardware efficiency. In the brain, dendrites are computational powerhouses that have only recently begun to be acknowledged in such machine learning systems. In this work, we focus on a sequence detection mechanism present in branches of dendrites and translate it into a novel type of neural network by introducing a dendrocentric neural network, DendroNN. DendroNNs identify unique incoming spike sequences as spatiotemporal features. This work further introduces a rewiring phase to train the non-differentiable spike sequences without the use of gradients. During the rewiring, the network memorizes frequently occurring sequences and additionally discards those that do not contribute any discriminative information. The networks display competitive accuracies across various event-based time series datasets. We also propose an asynchronous digital hardware architecture using a time-wheel mechanism that builds on the event-driven design of DendroNNs, eliminating per-step global updates typical of delay- or recurrence-based models. By leveraging a DendroNN’s dynamic and static sparsity along with intrinsic quantization, it achieves up to 4x higher efficiency than state-of-the-art neuromorphic hardware at comparable accuracy on the same audio classification task, demonstrating its suitability for spatiotemporal event-based computing. This work offers a novel approach to low-power spatiotemporal processing on event-driven hardware.
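The rewiring idea — memorize frequently occurring spike sequences, discard non-discriminative ones — needs no gradients at all. A toy caricature (our own construction; `mine_spike_sequences` and its thresholds are illustrative, not the paper's algorithm):

```python
from collections import Counter

def mine_spike_sequences(trials, keep_top=3, min_class_gap=1):
    """Toy rewiring step: count incoming spike-order sequences per class,
    keep the frequent ones, and drop sequences that occur (nearly) as often
    in other classes, since they carry no discriminative information."""
    counts = {}  # class label -> Counter over spike sequences
    for label, seq in trials:  # seq: tuple of input-line indices in arrival order
        counts.setdefault(label, Counter())[tuple(seq)] += 1
    kept = {}
    for label, ctr in counts.items():
        kept[label] = []
        for seq, n in ctr.most_common(keep_top):
            others = max((c[seq] for l, c in counts.items() if l != label),
                         default=0)
            if n - others >= min_class_gap:  # discriminative enough
                kept[label].append(seq)
    return kept
```

A sequence seen often under one class but rarely under the others survives; a sequence common to all classes is discarded, mirroring the rewiring phase's memorize-and-prune behaviour.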
[484] Proxy-Guided Measurement Calibration
Saketh Vishnubhatla, Shu Wan, Andre Harrison, Adrienne Raglin, Huan Liu
Main category: cs.LG
TL;DR: A framework for correcting systematic measurement errors in aggregate outcome variables using proxy variables and variational autoencoders to disentangle content from bias latents.
Details
Motivation: Aggregate outcome variables from surveys and administrative records often have systematic measurement errors due to varying data collection capacities, reporting practices, and event characteristics, which complicates downstream analysis and decision-making.
Method: Proposes a causal graph model separating latent content variables (driving true outcomes) from latent bias variables (inducing systematic errors). Uses proxy variables independent of the bias mechanism to identify bias. Introduces a two-stage approach with variational autoencoders to disentangle content and bias latents for bias estimation.
Result: Evaluated on synthetic data, semi-synthetic datasets from randomized trials, and real-world disaster loss reporting case study. The framework successfully estimates and corrects systematic measurement errors.
Conclusion: The proposed framework provides a principled approach to address outcome miscalibration by leveraging proxy variables and disentangled latent representations, enabling more accurate analysis of systematically mismeasured data.
Abstract: Aggregate outcome variables collected through surveys and administrative records are often subject to systematic measurement error. For instance, in disaster loss databases, county-level losses reported may differ from the true damages due to variations in on-the-ground data collection capacity, reporting practices, and event characteristics. Such miscalibration complicates downstream analysis and decision-making. We study the problem of outcome miscalibration and propose a framework guided by proxy variables for estimating and correcting the systematic errors. We model the data-generating process using a causal graph that separates latent content variables driving the true outcome from the latent bias variables that induce systematic errors. The key insight is that proxy variables that depend on the true outcome but are independent of the bias mechanism provide identifying information for quantifying the bias. Leveraging this structure, we introduce a two-stage approach that utilizes variational autoencoders to disentangle content and bias latents, enabling us to estimate the effect of bias on the outcome of interest. We analyze the assumptions underlying our approach and evaluate it on synthetic data, semi-synthetic datasets derived from randomized trials, and a real-world case study of disaster loss reporting.
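The key identifying insight — a proxy that depends on the true outcome but is independent of the bias mechanism pins down the systematic error — is easiest to see in a linear additive toy (far simpler than the paper's causal-graph and VAE machinery; all variables below are synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
y = rng.normal(10.0, 2.0, n)             # latent true outcome
group = rng.integers(0, 2, n)            # e.g. two reporting regimes
bias = np.where(group == 0, -1.5, 2.0)   # systematic group-level error
reported = y + bias                      # miscalibrated outcome
proxy = y + rng.normal(0.0, 1.0, n)      # depends on y, independent of the bias

# Because the proxy is unbiased for y and independent of the bias mechanism,
# the group-wise mean gap between reported outcome and proxy identifies the bias.
bias_hat = np.array([(reported - proxy)[group == g].mean() for g in (0, 1)])
corrected = reported - bias_hat[group]
```

In this toy, `bias_hat` recovers (-1.5, 2.0) up to sampling noise; the paper's two-stage VAE approach generalizes this idea to nonlinear latent structure.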
[485] A Gaussian Comparison Theorem for Training Dynamics in Machine Learning
Ashkan Panahi
Main category: cs.LG
TL;DR: The paper presents a theoretical analysis of training algorithms for Gaussian mixture data, connecting model evolution to surrogate dynamical systems using Gordon comparison theorem, with applications to perceptron training.
Details
Motivation: To provide rigorous theoretical understanding of training dynamics for algorithms operating on Gaussian mixture data, bridging asymptotic and non-asymptotic regimes with precise mathematical analysis.
Method: Uses Gordon comparison theorem to connect algorithm evolution to surrogate dynamical systems, develops dynamic mean-field expressions for asymptotic scenarios, and proposes iterative refinement for non-asymptotic cases.
Result: Rigorous proof of dynamic mean-field expressions in asymptotic scenarios, identification of additional fluctuation parameters in non-asymptotic domains beyond DMF kernels.
Conclusion: Theoretical framework successfully connects training dynamics to surrogate systems, enabling rigorous analysis of both asymptotic and non-asymptotic regimes for Gaussian mixture data.
Abstract: We study training algorithms with data following a Gaussian mixture model. For a specific family of such algorithms, we present a non-asymptotic result, connecting the evolution of the model to a surrogate dynamical system, which can be easier to analyze. The proof of our result is based on the celebrated Gordon comparison theorem. Using our theorem, we rigorously prove the validity of the dynamic mean-field (DMF) expressions in the asymptotic scenarios. Moreover, we suggest an iterative refinement scheme to obtain more accurate expressions in non-asymptotic scenarios. We specialize our theory to the analysis of training a perceptron model with a generic first-order (full-batch) algorithm and demonstrate that fluctuation parameters in a non-asymptotic domain emerge in addition to the DMF kernels.
[486] Reward-Zero: Language Embedding Driven Implicit Reward Mechanisms for Reinforcement Learning
Heng Zhang, Haddy Alchaer, Arash Ajoudani, Yu She
Main category: cs.LG
TL;DR: Reward-Zero is a language-driven implicit reward mechanism that uses semantic embeddings to generate progress signals for RL agents from task descriptions, improving training efficiency and generalization.
Details
Motivation: Traditional RL often relies on sparse or delayed environmental rewards that require task-specific engineering. There's a need for more general, semantically grounded reward mechanisms that can accelerate exploration and improve sample efficiency across diverse tasks.
Method: Reward-Zero transforms natural-language task descriptions into dense progress signals by comparing language embeddings of task specifications with embeddings derived from agent interaction experiences. This creates a continuous, semantically aligned completion signal that supplements environmental feedback without task-specific engineering.
Result: Agents trained with Reward-Zero converge faster and achieve higher success rates than conventional methods like PPO with common reward-shaping baselines. The method successfully solves tasks that hand-designed rewards could not in complex scenarios, and a benchmark was developed for evaluating completion sense via language embeddings.
Conclusion: Language-driven implicit reward functions like Reward-Zero offer a practical path toward more sample-efficient, generalizable, and scalable RL for embodied agents by leveraging semantic understanding of task progress.
Abstract: We introduce Reward-Zero, a general-purpose implicit reward mechanism that transforms natural-language task descriptions into dense, semantically grounded progress signals for reinforcement learning (RL). Reward-Zero serves as a simple yet sophisticated universal reward function that leverages language embeddings for efficient RL training. By comparing the embedding of a task specification with embeddings derived from an agent’s interaction experience, Reward-Zero produces a continuous, semantically aligned sense-of-completion signal. This reward supplements sparse or delayed environmental feedback without requiring task-specific engineering. When integrated into standard RL frameworks, it accelerates exploration, stabilizes training, and enhances generalization across diverse tasks. Empirically, agents trained with Reward-Zero converge faster and achieve higher final success rates than conventional methods such as PPO with common reward-shaping baselines, successfully solving complex tasks that hand-designed rewards could not. In addition, we develop a mini benchmark for the evaluation of completion sense during task execution via language embeddings. These results highlight the promise of language-driven implicit reward functions as a practical path toward more sample-efficient, generalizable, and scalable RL for embodied agents. Code will be released after peer review.
[487] ActiveUltraFeedback: Efficient Preference Data Generation using Active Learning
Davit Melikidze, Marian Schneider, Jessica Lam, Martin Wertich, Ido Hakimi, Barna Pásztor, Andreas Krause
Main category: cs.LG
TL;DR: ACTIVEULTRAFEEDBACK is an active learning pipeline that uses uncertainty estimates to efficiently select informative responses for RLHF annotation, reducing data requirements by up to 6x while maintaining or improving model performance.
Details
Motivation: RLHF is standard for aligning LLMs but suffers from high annotation costs, especially in low-resource and expert domains. Current methods require large amounts of expensive preference data, creating a bottleneck for effective alignment.
Method: A modular active learning pipeline that leverages uncertainty estimates to dynamically identify the most informative responses for annotation. Introduces two novel selection methods: DOUBLE REVERSE THOMPSON SAMPLING (DRTS) and DELTAUCB, which prioritize response pairs with large predicted quality gaps based on recent findings that such pairs provide strong signals for fine-tuning.
Result: The pipeline produces high-quality datasets that lead to significant downstream performance improvements. Achieves comparable or superior results with as little as one-sixth of the annotated data compared to static baselines.
Conclusion: ACTIVEULTRAFEEDBACK effectively addresses the data efficiency problem in RLHF, enabling more cost-effective alignment of LLMs through intelligent active learning strategies.
Abstract: Reinforcement Learning from Human Feedback (RLHF) has become the standard for aligning Large Language Models (LLMs), yet its efficacy is bottlenecked by the high cost of acquiring preference data, especially in low-resource and expert domains. To address this, we introduce ACTIVEULTRAFEEDBACK, a modular active learning pipeline that leverages uncertainty estimates to dynamically identify the most informative responses for annotation. Our pipeline facilitates the systematic evaluation of standard response selection methods alongside DOUBLE REVERSE THOMPSON SAMPLING (DRTS) and DELTAUCB, two novel methods prioritizing response pairs with large predicted quality gaps, leveraging recent results showing that such pairs provide good signals for fine-tuning. Our experiments demonstrate that ACTIVEULTRAFEEDBACK yields high-quality datasets that lead to significant improvements in downstream performance, notably achieving comparable or superior results with as little as one-sixth of the annotated data relative to static baselines. Our pipeline is available at https://github.com/lasgroup/ActiveUltraFeedback and our preference datasets at https://huggingface.co/ActiveUltraFeedback.
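A rough reading of the gap-plus-uncertainty idea behind DELTAUCB (our reconstruction, not the paper's exact acquisition rule): score every candidate response pair by its predicted reward gap plus an uncertainty bonus, and send the top-scoring pair to annotation.

```python
import numpy as np

def select_pair(mean_rewards, stds, beta=1.0):
    """Pick the response pair maximizing predicted-quality gap plus an
    uncertainty bonus (a UCB-style heuristic; beta trades off the two)."""
    k = len(mean_rewards)
    best, best_score = None, -np.inf
    for i in range(k):
        for j in range(i + 1, k):
            gap = abs(mean_rewards[i] - mean_rewards[j])
            bonus = beta * (stds[i] + stds[j])
            if gap + bonus > best_score:
                best, best_score = (i, j), gap + bonus
    return best, best_score
```

With `beta = 0` this degenerates to pure gap maximization; the bonus keeps the selector exploring pairs whose reward estimates are still uncertain.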
[488] TA-GGAD: Testing-time Adaptive Graph Model for Generalist Graph Anomaly Detection
Xiong Zhang, Hong Peng, Changlong Fu, Xin Jin, Yun Yang, Cheng Xie
Main category: cs.LG
TL;DR: A graph foundation model for anomaly detection that addresses cross-domain generalization by identifying and modeling the Anomaly Disassortativity issue, achieving SOTA performance across diverse real-world graphs with a single training phase.
Details
Motivation: Real-world anomalous nodes (fake news, malicious transactions, etc.) harm graph ecosystems, but cross-domain detection faces severe domain shift issues that limit generalizability across different domains.
Method: Identifies and quantitatively analyzes the Anomaly Disassortativity (AD) issue in domain shift, then introduces a novel graph foundation model that addresses this issue to achieve cross-domain generalization with a single training phase.
Result: Achieves breakthrough cross-domain adaptation with pioneering SOTA detection accuracy across fourteen diverse real-world graphs, demonstrating strong generalization capability.
Conclusion: The AD theory provides novel theoretical perspective and practical route for future research in generalist graph anomaly detection (GGAD), enabling effective cross-domain anomaly detection.
Abstract: A significant number of anomalous nodes in the real world, such as fake news, noncompliant users, malicious transactions, and malicious posts, severely compromise the health of the graph data ecosystem and urgently require effective identification and processing. With anomalies that span multiple data domains yet exhibit vast differences in features, cross-domain detection models face severe domain shift issues, which limit their generalizability across all domains. This study identifies and quantitatively analyzes a specific feature mismatch pattern exhibited by domain shift in graph anomaly detection, which we define as the \emph{Anomaly Disassortativity} issue ($\mathcal{AD}$). Based on the modeling of the issue $\mathcal{AD}$, we introduce a novel graph foundation model for anomaly detection. It achieves cross-domain generalization in different graphs, requiring only a single training phase to perform effectively across diverse domains. The experimental findings, based on fourteen diverse real-world graphs, confirm a breakthrough in the model’s cross-domain adaptation, achieving a pioneering state-of-the-art (SOTA) level in terms of detection accuracy. In summary, the proposed theory of $\mathcal{AD}$ provides a novel theoretical perspective and a practical route for future research in generalist graph anomaly detection (GGAD). The code is available at https://anonymous.4open.science/r/Anonymization-TA-GGAD/.
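One simple way to see what "anomaly disassortativity" could mean on a labeled graph: anomalous nodes tend to connect to normal neighbours rather than to each other. The statistic below is our own illustrative measure, not the paper's quantitative formulation of the $\mathcal{AD}$ issue:

```python
def anomaly_disassortativity(edges, is_anomaly):
    """Fraction of edges incident to an anomalous node whose other endpoint
    is normal. Values near 1 mean anomalies sit among normal neighbours,
    i.e. the anomaly signal is disassortative on this graph."""
    touched = mixed = 0
    for u, v in edges:
        if is_anomaly[u] or is_anomaly[v]:
            touched += 1
            if is_anomaly[u] != is_anomaly[v]:
                mixed += 1
    return mixed / touched if touched else 0.0
```

Graphs from different domains can score very differently on statistics like this, which is the kind of feature mismatch a single cross-domain detector has to absorb.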
[489] Mousse: Rectifying the Geometry of Muon with Curvature-Aware Preconditioning
Yechen Zhang, Shuhao Xing, Junhao Huang, Kai Lv, Yunhua Zhou, Xipeng Qiu, Qipeng Guo, Kai Chen
Main category: cs.LG
TL;DR: Mousse is a novel optimizer that combines spectral optimization with second-order preconditioning by using Kronecker-factored statistics from Shampoo to create an anisotropic trust region, outperforming Muon in training efficiency for language models.
Details
Motivation: Muon's assumption of isotropic optimization landscape is suboptimal for DNNs where curvature spectrum is heavy-tailed and ill-conditioned. Muon risks amplifying instabilities in high-curvature directions while limiting progress in flat directions.
Method: Mousse operates in a whitened coordinate system induced by Kronecker-factored statistics from Shampoo. It formulates spectral steepest descent constrained by an anisotropic trust region, deriving optimal updates via polar decomposition of whitened gradient.
Result: Empirical results across language models (160M to 800M parameters) show Mousse consistently outperforms Muon, achieving ~12% reduction in training steps with negligible computational overhead.
Conclusion: Mousse successfully reconciles structural stability of spectral methods with geometric adaptivity of second-order preconditioning, providing more efficient optimization for modern neural networks.
Abstract: Recent advances in spectral optimization, notably Muon, have demonstrated that constraining update steps to the Stiefel manifold can significantly accelerate training and improve generalization. However, Muon implicitly assumes an isotropic optimization landscape, enforcing a uniform spectral update norm across all eigen-directions. We argue that this “egalitarian” constraint is suboptimal for Deep Neural Networks, where the curvature spectrum is known to be highly heavy-tailed and ill-conditioned. In such landscapes, Muon risks amplifying instabilities in high-curvature directions while limiting necessary progress in flat directions. In this work, we propose \textbf{Mousse} (\textbf{M}uon \textbf{O}ptimization \textbf{U}tilizing \textbf{S}hampoo’s \textbf{S}tructural \textbf{E}stimation), a novel optimizer that reconciles the structural stability of spectral methods with the geometric adaptivity of second-order preconditioning. Instead of applying Newton-Schulz orthogonalization directly to the momentum matrix, Mousse operates in a whitened coordinate system induced by Kronecker-factored statistics (derived from Shampoo). Mathematically, we formulate Mousse as the solution to a spectral steepest descent problem constrained by an anisotropic trust region, where the optimal update is derived via the polar decomposition of the whitened gradient. Empirical results across language models ranging from 160M to 800M parameters demonstrate that Mousse consistently outperforms Muon, achieving around $\sim$12% reduction in training steps with negligible computational overhead.
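The central operation — whiten the momentum with Kronecker-factored (Shampoo-style) statistics, then take the orthogonal polar factor of the whitened matrix — can be sketched with plain SVD. This is a sketch of the whitening-plus-polar step only; how the resulting factor is mapped back through the paper's trust-region derivation is omitted, and the `inv_quarter` regularization is our own choice:

```python
import numpy as np

def polar_orthogonalize(m):
    """Orthogonal factor of the polar decomposition M = Q P, via SVD."""
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt

def mousse_direction(momentum, left_stats, right_stats, eps=1e-6):
    """Whiten the momentum with Kronecker-factored statistics, then
    orthogonalize the whitened matrix via its polar decomposition."""
    def inv_quarter(s):
        # S^{-1/4} via eigendecomposition, regularized for stability
        w, q = np.linalg.eigh(s + eps * np.eye(len(s)))
        return q @ np.diag(np.maximum(w, eps) ** -0.25) @ q.T
    whitened = inv_quarter(left_stats) @ momentum @ inv_quarter(right_stats)
    return polar_orthogonalize(whitened)
```

With identity statistics this reduces to Muon-style orthogonalization of the raw momentum; the curvature statistics are what make the effective trust region anisotropic.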
[490] Interactive 3D visualization of surface roughness predictions in additive manufacturing: A data-driven framework
Engin Deniz Erkan, Elif Surer, Ulas Yaman
Main category: cs.LG
TL;DR: A data-driven framework using MLP regressor and conditional GAN to predict surface roughness (Ra) in material extrusion 3D printing based on process parameters and surface inclination, with a web interface for interactive process planning.
Details
Motivation: Surface roughness in 3D printing varies across parts and is hard to predict due to complex dependencies on printing parameters and local surface inclination (staircase effect), making process planning challenging.
Method: Created structured experimental dataset (87 specimens, 1566 Ra measurements), trained multilayer perceptron regressor to capture nonlinear relationships, used conditional GAN to generate additional synthetic data, and developed web interface for interactive visualization.
Result: Developed predictive model for Ra based on manufacturing conditions and surface inclination, with improved performance using GAN-generated synthetic data, and created interactive web tool for process planning with roughness visualization.
Conclusion: The framework enables accurate prediction of surface roughness prior to fabrication and provides interactive tools for optimizing printing parameters and part orientation to minimize roughness in material extrusion additive manufacturing.
Abstract: Surface roughness in Material Extrusion Additive Manufacturing varies across a part and is difficult to anticipate during process planning because it depends on both printing parameters and local surface inclination, which governs the staircase effect. A data-driven framework is presented to predict the arithmetic mean roughness (Ra) prior to fabrication using process parameters and surface angle. A structured experimental dataset was created using a three-level Box-Behnken design: 87 specimens were printed, each with multiple planar faces spanning different inclination angles, yielding 1566 Ra measurements acquired with a contact profilometer. A multilayer perceptron regressor was trained to capture nonlinear relationships between manufacturing conditions, inclination, and Ra. To mitigate limited experimental data, a conditional generative adversarial network was used to generate additional condition-specific tabular samples, thereby improving predictive performance. Model performance was assessed on a hold-out test set. A web-based decision-support interface was also developed to enable interactive process planning by loading a 3D model, specifying printing parameters, and adjusting the part’s orientation. The system computes face-wise inclination from the model geometry and visualizes predicted Ra as an interactive colormap over the surface, enabling rapid identification of regions prone to high roughness and immediate comparison of parameter and orientation choices.
[491] Democratising Clinical AI through Dataset Condensation for Classical Clinical Models
Anshul Thakur, Soheila Molaei, Pafue Christy Nganjimi, Joshua Fieggen, Andrew A. S. Soltan, Danielle Belgrave, Lei Clifton, David A. Clifton
Main category: cs.LG
TL;DR: A differentially private, zero-order optimization framework for dataset condensation that works with non-differentiable clinical models like decision trees and Cox regression, enabling privacy-preserving data sharing for healthcare applications.
Details
Motivation: Dataset condensation has potential for healthcare data democratization when paired with differential privacy, but existing methods rely on differentiable neural networks and are incompatible with widely used clinical models like decision trees and Cox regression.
Method: Proposes a differentially private, zero-order optimization framework that extends dataset condensation to non-differentiable models using only function evaluations, without requiring gradient information.
Result: Empirical results across six datasets (including classification and survival tasks) show the method produces condensed datasets that preserve model utility while providing effective differential privacy guarantees.
Conclusion: Enables model-agnostic data sharing for clinical prediction tasks without exposing sensitive patient information, bridging the gap between dataset condensation and practical healthcare applications.
Abstract: Dataset condensation (DC) learns a compact synthetic dataset that enables models to match the performance of full-data training, prioritising utility over distributional fidelity. While typically explored for computational efficiency, DC also holds promise for healthcare data democratisation, especially when paired with differential privacy, allowing synthetic data to serve as a safe alternative to real records. However, existing DC methods rely on differentiable neural networks, limiting their compatibility with widely used clinical models such as decision trees and Cox regression. We address this gap using a differentially private, zero-order optimisation framework that extends DC to non-differentiable models using only function evaluations. Empirical results across six datasets, including both classification and survival tasks, show that the proposed method produces condensed datasets that preserve model utility while providing effective differential privacy guarantees - enabling model-agnostic data sharing for clinical prediction tasks without exposing sensitive patient information.
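The enabling ingredient is zero-order optimization: the condensed data is updated from function evaluations alone, so the downstream model never needs to be differentiable. A minimal SPSA (simultaneous perturbation) sketch — our illustration; the paper's framework additionally injects differential-privacy noise, which is omitted here, and a smooth toy objective stands in for the model's validation score:

```python
import numpy as np

def spsa_gradient(f, x, c=0.01, rng=None):
    """Simultaneous-perturbation gradient estimate: two function evaluations
    along a random +/-1 direction, no derivatives required."""
    rng = rng or np.random.default_rng()
    delta = rng.choice([-1.0, 1.0], size=x.shape)
    return (f(x + c * delta) - f(x - c * delta)) / (2 * c) * delta

def zero_order_ascend(f, x, steps=100, lr=0.05, rng=None):
    """Ascend f using only function evaluations (e.g. f = utility of a
    decision tree trained on the current synthetic set)."""
    rng = rng or np.random.default_rng(0)
    for _ in range(steps):
        x = x + lr * spsa_gradient(f, x, rng=rng)
    return x
```

In the condensation setting, `x` would be the flattened synthetic dataset and `f` the (non-differentiable) model's held-out performance after training on it.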
[492] From Representation to Clusters: A Contrastive Learning Approach for Attributed Hypergraph Clustering
Li Ni, Shuaikang Zeng, Lin Mu, Longlong Lin
Main category: cs.LG
TL;DR: CAHC is an end-to-end contrastive learning method for attributed hypergraph clustering that jointly learns node embeddings and obtains clustering results through representation learning and cluster assignment learning.
Details
Motivation: Existing contrastive learning methods for attributed hypergraph clustering first learn node embeddings then apply clustering algorithms separately, lacking direct clustering supervision and risking inclusion of clustering-irrelevant information in learned representations.
Method: CAHC uses two main steps: 1) Representation learning with novel contrastive approach incorporating both node-level and hyperedge-level objectives, 2) Cluster assignment learning that jointly optimizes embedding and clustering with clustering-oriented guidance to obtain results simultaneously.
Result: Extensive experiments show CAHC outperforms baselines on eight datasets, demonstrating superior performance in attributed hypergraph clustering.
Conclusion: CAHC provides an effective end-to-end framework for attributed hypergraph clustering that integrates representation learning and clustering optimization, addressing limitations of existing two-stage approaches.
Abstract: Contrastive learning has demonstrated strong performance in attributed hypergraph clustering. Typically, existing methods based on contrastive learning first learn node embeddings and then apply clustering algorithms, such as k-means, to these embeddings to obtain the clustering results. However, these methods lack direct clustering supervision, risking the inclusion of clustering-irrelevant information in the learned representations. To this end, we propose a Contrastive learning approach for Attributed Hypergraph Clustering (CAHC), an end-to-end method that simultaneously learns node embeddings and obtains clustering results. CAHC consists of two main steps: representation learning and cluster assignment learning. The former employs a novel contrastive learning approach that incorporates both node-level and hyperedge-level objectives to generate node embeddings. The latter jointly optimizes embedding and clustering, refining these embeddings with clustering-oriented guidance and obtaining clustering results simultaneously. Extensive experimental results demonstrate that CAHC outperforms baselines on eight datasets.
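The node-level half of such a contrastive objective is typically an InfoNCE loss between two views of the same nodes; a numpy version for intuition (a generic toy, CAHC additionally uses a hyperedge-level objective and clustering-oriented guidance):

```python
import numpy as np

def info_nce(z1, z2, tau=0.5):
    """Node-level contrastive loss: each node's embedding in view 1 should
    match the same node in view 2 (positive) against all other nodes
    (negatives). Positives sit on the diagonal of the similarity matrix."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau                       # pairwise similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))
```

The loss drops when corresponding nodes across views agree and other pairs do not, which is exactly the behaviour verified below.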
[493] MSSR: Memory-Aware Adaptive Replay for Continual LLM Fine-Tuning
Yiyang Lu, Yu He, Jianlong Chen, Hongyuan Zha
Main category: cs.LG
TL;DR: MSSR: A memory-inspired replay framework for continual fine-tuning of LLMs that schedules rehearsal at adaptive intervals based on estimated sample-level memory strength to mitigate catastrophic forgetting while maintaining fast adaptation.
Details
Motivation: Continual fine-tuning of LLMs in dynamic environments leads to catastrophic forgetting of previously learned skills. Existing replay strategies are limited by heuristic rules, partial forgetting mitigation, or high computational overhead.
Method: Proposes Memory-Inspired Sampler and Scheduler Replay (MSSR) that estimates sample-level memory strength and schedules rehearsal at adaptive intervals based on retention dynamics under sequential fine-tuning.
Result: Extensive experiments across 3 backbone models and 11 sequential tasks show MSSR consistently outperforms state-of-the-art replay baselines, with strong gains on reasoning-intensive and multiple-choice benchmarks.
Conclusion: MSSR effectively addresses catastrophic forgetting in continual LLM fine-tuning through adaptive rehearsal scheduling based on memory strength estimation, offering better performance than existing replay methods.
Abstract: Continual fine-tuning of large language models (LLMs) is becoming increasingly crucial as these models are deployed in dynamic environments where tasks and data distributions evolve over time. While strong adaptability enables rapid acquisition of new knowledge, it also exposes LLMs to catastrophic forgetting, where previously learned skills degrade during sequential training. Existing replay-based strategies, such as fixed interleaved replay, accuracy-supervised, and loss-driven scheduling, remain limited: some depend on heuristic rules and provide only partial mitigation of forgetting, while others improve performance but incur substantial computational overhead. Motivated by retention dynamics under sequential fine-tuning, we propose Memory-Inspired Sampler and Scheduler Replay (MSSR), an experience replay framework that estimates sample-level memory strength and schedules rehearsal at adaptive intervals to mitigate catastrophic forgetting while maintaining fast adaptation. Extensive experiments across three backbone models and 11 sequential tasks show that MSSR consistently outperforms state-of-the-art replay baselines, with particularly strong gains on reasoning-intensive and multiple-choice benchmarks.
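The scheduling idea — estimate per-sample memory strength and rehearse a sample when its estimated retention decays below a threshold, with each rehearsal lengthening the next interval — is spaced repetition in spirit. A toy scheduler (our reconstruction; the exponential decay model and constants are illustrative, not MSSR's estimator):

```python
import math

class RehearsalScheduler:
    """Toy memory-strength scheduler: retention decays exponentially with
    time since last exposure; rehearsing a sample multiplies its strength,
    so intervals between rehearsals grow over training."""
    def __init__(self, threshold=0.5, growth=2.0):
        self.threshold, self.growth = threshold, growth
        self.strength, self.last_seen = {}, {}

    def observe(self, sample_id, step):
        self.strength.setdefault(sample_id, 1.0)
        self.last_seen[sample_id] = step

    def retention(self, sample_id, step):
        dt = step - self.last_seen[sample_id]
        return math.exp(-dt / self.strength[sample_id])

    def due_for_rehearsal(self, step):
        due = [s for s in self.strength if self.retention(s, step) < self.threshold]
        for s in due:  # rehearsing strengthens the memory trace
            self.strength[s] *= self.growth
            self.last_seen[s] = step
        return due
```

Samples the model is estimated to retain well are rehearsed rarely, concentrating the replay budget on the samples most at risk of being forgotten.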
[494] SPAARS: Safer RL Policy Alignment through Abstract Exploration and Refined Exploitation of Action Space
Swaminathan S K, Aritra Hazra
Main category: cs.LG
TL;DR: SPAARS is a curriculum learning framework for offline-to-online RL that starts with safe latent-space exploration using CVAEs, then transitions to raw action space to bypass decoder bottlenecks, achieving better performance and sample efficiency.
Details
Motivation: Offline-to-online RL for robotics needs safe online exploration without deviating from offline data support. Existing CVAE-based methods suffer from an exploitation gap due to decoder reconstruction loss limiting performance.
Method: SPAARS uses curriculum learning: initially constrains exploration to low-dimensional latent manifold for safe behavioral improvement, then transfers to raw action space. Two variants: CVAE-based (unordered pairs) and SPAARS-SUPE (with OPAL temporal skill pretraining).
Result: SPAARS-SUPE achieves 0.825 normalized return on kitchen-mixed-v0 vs 0.75 for SUPE with 5x better sample efficiency. Standalone SPAARS achieves 92.7 and 102.9 normalized return on hopper-medium-v2 and walker2d-medium-v2, surpassing IQL baselines.
Conclusion: SPAARS effectively bridges offline-to-online RL by combining safe latent exploration with raw action space transfer, overcoming decoder bottlenecks while maintaining safety and improving sample efficiency.
Abstract: Offline-to-online reinforcement learning (RL) offers a promising paradigm for robotics by pre-training policies on safe, offline demonstrations and fine-tuning them via online interaction. However, a fundamental challenge remains: how to safely explore online without deviating from the behavioral support of the offline data? While recent methods leverage conditional variational autoencoders (CVAEs) to bound exploration within a latent space, they inherently suffer from an exploitation gap – a performance ceiling imposed by the decoder’s reconstruction loss. We introduce SPAARS, a curriculum learning framework that initially constrains exploration to the low-dimensional latent manifold for sample-efficient, safe behavioral improvement, then seamlessly transfers control to the raw action space, bypassing the decoder bottleneck. SPAARS has two instantiations: the CVAE-based variant requires only unordered (s,a) pairs and no trajectory segmentation; SPAARS-SUPE pairs SPAARS with OPAL temporal skill pretraining for stronger exploration structure at the cost of requiring trajectory chunks. We prove an upper bound on the exploitation gap using the Performance Difference Lemma, establish that latent-space policy gradients achieve provable variance reduction over raw-space exploration, and show that concurrent behavioral cloning during the latent phase directly controls curriculum transition stability. Empirically, SPAARS-SUPE achieves 0.825 normalized return on kitchen-mixed-v0 versus 0.75 for SUPE, with 5x better sample efficiency; standalone SPAARS achieves 92.7 and 102.9 normalized return on hopper-medium-v2 and walker2d-medium-v2 respectively, surpassing IQL baselines of 66.3 and 78.3 respectively, confirming the utility of the unordered-pair CVAE instantiation.
[495] Reconstructing Movement from Sparse Samples: Enhanced Spatio-Temporal Matching Strategies for Low-Frequency Data
Ali Yousefian, Arianna Burzacchi, Simone Vantini
Main category: cs.LG
TL;DR: Proposes four enhancements to Spatial-Temporal Matching algorithm for GPS trajectory-to-road network matching, improving computational efficiency and accuracy in dense urban environments.
Details
Motivation: The original Spatial-Temporal Matching algorithm has limitations in computational efficiency and accuracy, especially in dense urban environments with high sampling intervals, which affects GPS trajectory matching to road networks.
Method: Four modifications: 1) dynamic buffer for spatial constraints, 2) adaptive observation probability, 3) redesigned temporal scoring function, and 4) behavioral analysis incorporating historical mobility patterns.
Result: Experimental evaluation using real-world data from Milan shows significant improvements in both performance efficiency and path quality across various metrics, validated through newly defined evaluation metrics.
Conclusion: The proposed enhancements effectively address limitations of the original algorithm, providing better computational efficiency and matching accuracy for GPS trajectory-to-road network mapping in urban environments.
Abstract: This paper explores potential improvements to the Spatial-Temporal Matching algorithm for matching GPS trajectories to road networks. While this algorithm is effective, it presents some limitations in computational efficiency and the accuracy of the results, especially in dense environments with relatively high sampling intervals. To address this, the paper proposes four modifications to the original algorithm: a dynamic buffer, an adaptive observation probability, a redesigned temporal scoring function, and a behavioral analysis to account for the historical mobility patterns. The enhancements are assessed using real-world data from the urban area of Milan, and through newly defined evaluation metrics to be applied in the absence of ground truth. The results of the experiment show significant improvements in performance efficiency and path quality across various metrics.
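HMM-style map matching commonly scores each candidate road segment with a Gaussian observation probability of the GPS fix's distance to it, and a dynamic buffer can be read as a noise-adaptive candidate search radius. The sketch below illustrates only that generic idea; the function names, the 3-sigma rule, and the bounds are our illustrative assumptions, not the paper's actual formulation.

```python
import math

def observation_probability(dist_m, sigma_m):
    """Gaussian emission probability for a GPS fix at dist_m metres from a
    candidate road segment (a standard choice in HMM map matching)."""
    return math.exp(-0.5 * (dist_m / sigma_m) ** 2) / (sigma_m * math.sqrt(2.0 * math.pi))

def dynamic_buffer(sigma_m, k=3.0, min_m=20.0):
    """Candidate search radius that adapts to the estimated GPS noise:
    segments beyond k standard deviations are pruned (k, min_m hypothetical)."""
    return max(min_m, k * sigma_m)
```

With a small noise estimate the buffer stays at its floor; noisier fixes widen the search region instead of missing the true segment.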
[496] Impact of Markov Decision Process Design on Sim-to-Real Reinforcement Learning
Tatjana Krau, Jorge Mandlmaier, Tobias Damm, Frieder Heieck
Main category: cs.LG
TL;DR: Systematic analysis of MDP design choices for RL in industrial process control, focusing on sim-to-real transfer using color mixing task with physics-based models achieving 50% real-world success.
Details
Motivation: RL shows promise for industrial process control but suffers from sim-to-real gap when deployed on physical hardware; need to understand how MDP design choices affect transfer performance.
Method: Systematically analyze core MDP design choices (state composition, target inclusion, reward formulation, termination criteria, environment dynamics models) using color mixing task; evaluate different MDP configurations across simulation and real-world experiments; validate on physical hardware.
Result: Physics-based dynamics models achieve up to 50% real-world success under strict precision constraints where simplified models fail entirely; provides practical MDP design guidelines for RL deployment in industrial process control.
Conclusion: Careful MDP design choices significantly impact sim-to-real transfer; physics-based models outperform simplified models; provides guidelines for deploying RL in industrial process control applications.
Abstract: Reinforcement Learning (RL) has demonstrated strong potential for industrial process control, yet policies trained in simulation often suffer from a significant sim-to-real gap when deployed on physical hardware. This work systematically analyzes how core Markov Decision Process (MDP) design choices – state composition, target inclusion, reward formulation, termination criteria, and environment dynamics models – affect this transfer. Using a color mixing task, we evaluate different MDP configurations and mixing dynamics across simulation and real-world experiments. We validate our findings on physical hardware, demonstrating that physics-based dynamics models achieve up to 50% real-world success under strict precision constraints where simplified models fail entirely. Our results provide practical MDP design guidelines for deploying RL in industrial process control.
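To make one of the analyzed design choices concrete, the toy functions below contrast a sparse (success/failure) reward with a dense (graded-error) reward for a target-matching task like color mixing. The names and tolerance are illustrative assumptions, not the paper's definitions.

```python
def sparse_reward(color_error, tol=0.02):
    """Success/failure reward: learning signal only inside a strict tolerance."""
    return 1.0 if color_error < tol else 0.0

def dense_reward(color_error):
    """Shaped reward: graded feedback at every step, even for near misses."""
    return -color_error

# A near miss (error 0.05) gets zero signal from the sparse formulation
# but a graded one from the dense formulation.
```

Which formulation transfers better under strict precision constraints is exactly the kind of question the paper's experiments address empirically.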
[497] From Weighting to Modeling: A Nonparametric Estimator for Off-Policy Evaluation
Rong J. B. Zhu
Main category: cs.LG
TL;DR: Proposes Nonparametric Weighting (NW) and Model-assisted Nonparametric Weighting (MNW) methods for off-policy evaluation in contextual bandits to reduce variance while maintaining low bias.
Details
Motivation: Address limitations of inverse probability weighting (IPW) which suffers from high variance due to probability in denominator, and doubly robust (DR) estimator which reduces variance through reward modeling but doesn't directly address IPW variance.
Method: NW constructs weights using nonparametric model to achieve low bias like IPW with lower variance. MNW incorporates reward predictions similar to DR technique to further reduce variance, explicitly modeling and mitigating bias from reward modeling without guaranteeing standard doubly robust property.
Result: Extensive empirical comparisons show proposed approaches consistently outperform existing techniques, achieving lower variance in value estimation while maintaining low bias.
Conclusion: Nonparametric weighting approaches provide effective solutions for off-policy evaluation in contextual bandits by addressing variance issues of traditional methods while preserving accuracy.
Abstract: We study off-policy evaluation in the setting of contextual bandits, where we aim to evaluate a new policy using historical data that consists of contexts, actions and received rewards. This historical data typically does not faithfully represent the action distribution of the new policy. A common approach, inverse probability weighting (IPW), adjusts for these discrepancies in action distributions. However, this method often suffers from high variance due to the probability being in the denominator. The doubly robust (DR) estimator reduces variance through modeling reward but does not directly address variance from IPW. In this work, we address the limitation of IPW by proposing a Nonparametric Weighting (NW) approach that constructs weights using a nonparametric model. Our NW approach achieves low bias like IPW but typically exhibits significantly lower variance. To further reduce variance, we incorporate reward predictions – similar to the DR technique – resulting in the Model-assisted Nonparametric Weighting (MNW) approach. The MNW approach yields accurate value estimates by explicitly modeling and mitigating bias from reward modeling, without aiming to guarantee the standard doubly robust property. Extensive empirical comparisons show that our approaches consistently outperform existing techniques, achieving lower variance in value estimation while maintaining low bias.
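For context on the baseline the paper improves upon, here is a minimal IPW value estimate on synthetic logged bandit data: each logged reward is reweighted by the target-to-logging probability ratio. This sketches only standard IPW, not the paper's NW/MNW estimators; the synthetic setup (uniform logging policy, deterministic target policy) is our assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_actions = 10_000, 3
true_reward = np.array([0.2, 0.5, 0.8])      # E[r | a] for each action

# Logged data from a uniform logging policy mu(a|x) = 1/3.
actions = rng.integers(0, n_actions, size=n)
logging_prob = np.full(n, 1.0 / n_actions)
rewards = rng.binomial(1, true_reward[actions]).astype(float)

# Deterministic target policy pi: always play action 2 (true value 0.8).
target_prob = (actions == 2).astype(float)

# IPW estimate: average of r * pi(a|x) / mu(a|x) over the log.
v_ipw = float(np.mean(rewards * target_prob / logging_prob))
```

The probability ratio in the denominator is exactly where IPW's variance comes from; the paper's NW approach replaces these weights with a nonparametric model.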
[498] Variational Routing: A Scalable Bayesian Framework for Calibrated Mixture-of-Experts Transformers
Albus Yizhuo Li, Matthew Wicker
Main category: cs.LG
TL;DR: VMoER introduces a Bayesian approach for uncertainty quantification in Mixture-of-Experts layers of foundation models, improving calibration and robustness with minimal computational overhead.
Details
Motivation: Foundation models need uncertainty quantification for responsible deployment, but Bayesian methods are computationally impractical at scale. Current models use sparse architectures like MoE layers, creating an opportunity to incorporate uncertainty-aware routing.
Method: VMoER applies Bayesian inference only to the expert-selection stage of MoE layers. Two inference strategies: 1) amortized variational inference over routing logits, and 2) inferring temperature parameter for stochastic expert selection.
Result: VMoER improves routing stability under noise by 38%, reduces calibration error by 94%, increases out-of-distribution AUROC by 12%, with less than 1% additional FLOPs across tested foundation models.
Conclusion: VMoER provides a scalable path for uncertainty-aware foundation models by confining Bayesian inference to routing decisions, achieving significant improvements in calibration and robustness with minimal computational cost.
Abstract: Foundation models are increasingly being deployed in contexts where understanding the uncertainty of their outputs is critical to ensuring responsible deployment. While Bayesian methods offer a principled approach to uncertainty quantification, their computational overhead renders their use impractical for training or inference at foundation model scale. State-of-the-art models achieve parameter counts in the trillions through carefully engineered sparsity including Mixture-of-Experts (MoE) layers. In this work, we demonstrate calibrated uncertainty at scale by introducing Variational Mixture-of-Experts Routing (VMoER), a structured Bayesian approach for modelling uncertainty in MoE layers. VMoER confines Bayesian inference to the expert-selection stage which is typically done by a deterministic routing network. We instantiate VMoER using two inference strategies: amortised variational inference over routing logits and inferring a temperature parameter for stochastic expert selection. Across tested foundation models, VMoER improves routing stability under noise by 38%, reduces calibration error by 94%, and increases out-of-distribution AUROC by 12%, while incurring less than 1% additional FLOPs. These results suggest VMoER offers a scalable path toward robust and uncertainty-aware foundation models.
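The second inference strategy infers a temperature for stochastic expert selection. The sketch below shows only the generic mechanism of temperature-scaled routing with made-up logits; how VMoER actually infers the temperature is not reproduced here.

```python
import numpy as np

def routing_weights(logits, temperature):
    """Temperature-scaled softmax over expert logits: a higher temperature
    flattens the routing distribution (more routing uncertainty)."""
    z = logits / temperature
    z = z - z.max()                 # numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = np.array([2.0, 1.0, 0.5, -1.0])    # hypothetical router outputs
sharp = routing_weights(logits, temperature=0.1)
flat = routing_weights(logits, temperature=10.0)
```

A low temperature recovers near-deterministic routing; raising it spreads probability mass across experts, which is one way to expose routing uncertainty at negligible extra cost.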
[499] Temporal-Conditioned Normalizing Flows for Multivariate Time Series Anomaly Detection
David Baumgartner, Helge Langseth, Kenth Engø-Monsen, Heri Ramampiaro
Main category: cs.LG
TL;DR: Temporal-conditioned normalizing flows (tcNF) for anomaly detection in time series data using autoregressive normalizing flows conditioned on previous observations to model temporal dependencies and uncertainty.
Details
Motivation: Addressing the challenge of anomaly detection in time series data by accurately modeling temporal dependencies and uncertainty, which existing methods often fail to capture effectively.
Method: Uses normalizing flows conditioned on previous observations (autoregressive approach) to capture complex temporal dynamics and generate accurate probability distributions of expected behavior for anomaly detection.
Result: Demonstrates good accuracy and robustness on diverse datasets compared to existing methods, with comprehensive analysis of strengths and limitations provided.
Conclusion: tcNF provides an effective framework for time series anomaly detection with accurate temporal dependency modeling, and open-source code is provided for reproducibility and future research.
Abstract: This paper introduces temporal-conditioned normalizing flows (tcNF), a novel framework that addresses anomaly detection in time series data with accurate modeling of temporal dependencies and uncertainty. By conditioning normalizing flows on previous observations, tcNF effectively captures complex temporal dynamics and generates accurate probability distributions of expected behavior. This autoregressive approach enables robust anomaly detection by identifying low-probability events within the learned distribution. We evaluate tcNF on diverse datasets, demonstrating good accuracy and robustness compared to existing methods. A comprehensive analysis of strengths and limitations and open-source code is provided to facilitate reproducibility and future research.
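The detection principle, scoring each step by its conditional likelihood under the learned density, can be illustrated with a simple conditional Gaussian standing in for the flow. The AR(1) form and the parameters below are illustrative assumptions, not tcNF's architecture.

```python
import numpy as np

def anomaly_scores(x, phi=0.9, sigma=0.2):
    """Negative log-likelihood of each transition under a conditional
    Gaussian p(x_t | x_{t-1}) = N(phi * x_{t-1}, sigma^2) -- a crude
    stand-in for the conditional density a flow like tcNF would learn."""
    resid = x[1:] - phi * x[:-1]
    return 0.5 * (resid / sigma) ** 2 + np.log(sigma * np.sqrt(2.0 * np.pi))

x = np.array([0.0, 0.1, 0.05, 5.0, 0.1])   # an obvious spike at t = 3
scores = anomaly_scores(x)                  # scores[i] covers step i + 1
```

Low-probability transitions get high scores; thresholding these scores yields the anomaly flags, exactly the "identify low-probability events within the learned distribution" step of the abstract.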
[500] Efficiently Aligning Draft Models via Parameter- and Data-Efficient Adaptation
Luxi Lin, Zhihang Lin, Zhanpeng Zeng, Yuhao Chen, Qingyu Zhang, Jixiang Luo, Xuelong Li, Rongrong Ji
Main category: cs.LG
TL;DR: EDA framework efficiently adapts draft models for speculative decoding on fine-tuned target LLMs without full retraining
Details
Motivation: Speculative decoding performance degrades when target models are fine-tuned for specific domains, and retraining draft models for every target model is costly and inefficient.
Method: Three innovations: (1) decoupled architecture with shared/private components for parameter-efficient adaptation, (2) data regeneration using target model to improve alignment, (3) sample selection mechanism for high-value data.
Result: EDA effectively restores speculative performance on fine-tuned models, achieving superior average acceptance lengths with significantly reduced training costs compared to full retraining
Conclusion: EDA provides a parameter- and data-efficient framework for adapting draft models to fine-tuned target models, making speculative decoding practical for domain-specific applications
Abstract: Speculative decoding accelerates LLM inference but suffers from performance degradation when target models are fine-tuned for specific domains. A naive solution is to retrain draft models for every target model, which is costly and inefficient. To address this, we introduce a parameter- and data-efficient framework named Efficient Draft Adaptation, abbreviated as EDA, for efficiently adapting draft models. EDA introduces three innovations: (1) a decoupled architecture that utilizes shared and private components to model the shared and target-specific output distributions separately, enabling parameter-efficient adaptation by updating only the lightweight private component; (2) a data regeneration strategy that utilizes the fine-tuned target model to regenerate training data, thereby improving the alignment between training and speculative decoding, leading to higher average acceptance length; (3) a sample selection mechanism that prioritizes high-value data for efficient adaptation. Our experiments show that EDA effectively restores speculative performance on fine-tuned models, achieving superior average acceptance lengths with significantly reduced training costs compared to full retraining. Code is available at https://github.com/Lyn-Lucy/Efficient-Draft-Adaptation.
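The acceptance-length metric that EDA optimizes can be sketched, under greedy verification, as the count of leading draft tokens that agree with the target model's tokens. This helper is an illustrative simplification of speculative-decoding verification, not the paper's code.

```python
def acceptance_length(draft_tokens, target_tokens):
    """Leading draft tokens that survive greedy verification: the number of
    initial positions where draft and target token ids agree."""
    n = 0
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break
        n += 1
    return n
```

A longer average acceptance length means more draft tokens are validated per target-model forward pass, which is the source of speculative decoding's speedup.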
[501] Compiler-First State Space Duality and Portable $O(1)$ Autoregressive Caching for Inference
Cosmo Santoni
Main category: cs.LG
TL;DR: Mamba-2 state-space models can be efficiently implemented using XLA compilation without custom CUDA kernels, achieving hardware portability across CPU, GPU, and TPU while maintaining performance.
Details
Motivation: Current Mamba-2 implementations rely on fused CUDA/Triton kernels that create NVIDIA hardware dependencies, limiting portability to other platforms like TPUs.
Method: Implement Mamba-2’s state-space duality algorithm using XLA’s standard primitives and optimization passes, leveraging diagonal state structure, chunkable recurrence, and einsum-dominated compute with static control flow that maps well to XLA’s fusion and tiling capabilities.
Result: Achieves ~140 TFLOPS on TPU v6e prefill (15% MFU) and up to 64% bandwidth utilization on decode, with token-for-token matching to PyTorch/CUDA reference and hidden-state agreement within float32 tolerance across 64 steps.
Conclusion: Mamba-2 state-space models can be efficiently implemented without custom kernels using XLA, enabling hardware portability while maintaining performance, with the pattern generalizable to any SSM recurrence meeting similar structural conditions.
Abstract: State-space model releases are typically coupled to fused CUDA and Triton kernels, inheriting a hard dependency on NVIDIA hardware. We show that Mamba-2’s state space duality algorithm – diagonal state structure, chunkable recurrence, and einsum-dominated compute with static control flow – maps cleanly onto what XLA’s fusion and tiling passes actually optimise, making custom kernels optional rather than required. We implement the full inference path (prefill, cached autoregressive decoding) as shaped standard primitives under XLA, without hand-written kernels, and realise the architecture’s theoretical $O(1)$ state management as a compiled on-device cache requiring no host synchronisation during generation. The implementation runs unmodified on CPU, NVIDIA GPU, and Google Cloud TPU from a single JAX source. On TPU v6e across five model scales (130M–2.7B parameters), XLA-generated code reaches approximately 140 TFLOPS on single-stream prefill (15% MFU) and up to 64% bandwidth utilisation on decode. Greedy decoding matches the PyTorch/CUDA reference token-for-token across 64 steps, with hidden-state agreement within float32 rounding tolerance. The pattern transfers to any SSM recurrence satisfying the same structural conditions, on any platform with a mature XLA backend. The implementation is publicly available at https://github.com/CosmoNaught/mamba2-jax and merged into the Bonsai JAX model library.
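The portability argument rests on the recurrence being diagonal and chunkable, so decoding only needs to carry a fixed-size state between chunks. A minimal NumPy sketch of that property follows (the real implementation is JAX/XLA with einsum-parallel chunks, per the abstract; the shapes and decay range here are our illustrative choices):

```python
import numpy as np

def ssm_sequential(a, bx):
    """Diagonal SSM recurrence h_t = a_t * h_{t-1} + bx_t, elementwise."""
    h = np.zeros_like(bx[0])
    out = []
    for t in range(len(bx)):
        h = a[t] * h + bx[t]
        out.append(h.copy())
    return np.stack(out)

def ssm_chunked(a, bx, chunk=4):
    """Same recurrence processed chunk by chunk, carrying only the final
    state between chunks -- the O(1) decode cache. (A real SSD kernel
    parallelises within a chunk via einsums; this loop just demonstrates
    that a fixed-size carried state suffices.)"""
    h = np.zeros_like(bx[0])
    out = []
    for start in range(0, len(bx), chunk):
        for t in range(start, min(start + chunk, len(bx))):
            h = a[t] * h + bx[t]
            out.append(h.copy())
    return np.stack(out)

rng = np.random.default_rng(0)
a = rng.uniform(0.5, 0.99, size=(10, 3))   # per-step diagonal decay
bx = rng.normal(size=(10, 3))              # input projections b_t * x_t
```

Because the diagonal recurrence is associative in this way, any chunk size yields identical outputs, which is what lets a compiler pick tiling freely without a hand-written kernel.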
[502] Learning Bayesian and Markov Networks with an Unreliable Oracle
Juha Harviainen, Pekka Parviainen, Vidya Sagar Sharma
Main category: cs.LG
TL;DR: Paper studies constraint-based structure learning of Markov and Bayesian networks with unreliable conditional independence oracles that make bounded errors, showing different identifiability results for each type of network.
Details
Motivation: To understand how constraint-based structure learning algorithms perform when the conditional independence tests (oracle) are unreliable and can make errors, which is a realistic scenario in practice where statistical tests are imperfect.
Method: The paper uses theoretical analysis to study identifiability conditions for Markov networks and Bayesian networks under bounded oracle errors. For Markov networks, they analyze conditions based on vertex-wise disjoint paths, while for Bayesian networks they examine impossibility results even with bounded graph parameters like treewidth.
Result: For Markov networks, they show that structures are uniquely identifiable even with moderately exponential errors in the number of vertices when there’s a low maximum number of vertex-wise disjoint paths. For Bayesian networks, they prove that no errors can be tolerated to always identify the structure, even when common graph parameters are bounded. They also provide algorithms for structure learning when the structure is uniquely identifiable.
Conclusion: The paper establishes fundamental differences between Markov and Bayesian networks in terms of error tolerance for structure learning, with Markov networks being more robust to oracle errors under certain conditions, while Bayesian networks require perfect oracle responses for guaranteed structure identification.
Abstract: We study constraint-based structure learning of Markov networks and Bayesian networks in the presence of an unreliable conditional independence oracle that makes at most a bounded number of errors. For Markov networks, we observe that a low maximum number of vertex-wise disjoint paths implies that the structure is uniquely identifiable even if the number of errors is (moderately) exponential in the number of vertices. For Bayesian networks, however, we prove that one cannot tolerate any errors to always identify the structure even when many commonly used graph parameters like treewidth are bounded. Finally, we give algorithms for structure learning when the structure is uniquely identifiable.
[503] An Optimal Control Approach To Transformer Training
Kağan Akman, Naci Saldı, Serdar Yüksel
Main category: cs.LG
TL;DR: A control-theoretic framework for Transformer training using particle systems and measure lifting to ensure structural constraints like input independence and positional encoding.
Details
Motivation: To develop a rigorous optimal control approach for Transformer training that respects key architectural constraints: realized-input-independence during execution, ensemble control nature, and positional dependence, providing a globally optimal alternative to gradient-based methods.
Method: Model Transformer as discrete-time controlled particle system with shared actions (noise-free McKean-Vlasov dynamics), lift to probability measures to create Markov decision process, incorporate positional encodings, use dynamic programming for optimal policies, propose triply quantized training (state space, probability measures, action space).
Result: Established existence of globally optimal policies under compactness assumptions, proved equivalence between closed-loop policies in lifted space and initial-distribution dependent open-loop policies, showed triply quantized policies are near-optimal, demonstrated stability and empirical consistency with value function continuity.
Conclusion: Provides a globally optimal and robust alternative to gradient-based Transformer training without requiring smoothness or convexity, with theoretical guarantees for structural constraints and convergence properties.
Abstract: In this paper, we develop a rigorous optimal control-theoretic approach to Transformer training that respects key structural constraints such as (i) realized-input-independence during execution, (ii) the ensemble control nature of the problem, and (iii) positional dependence. We model the Transformer architecture as a discrete-time controlled particle system with shared actions, exhibiting noise-free McKean-Vlasov dynamics. While the resulting dynamics is not Markovian, we show that lifting it to probability measures produces a fully-observed Markov decision process (MDP). Positional encodings are incorporated into the state space to preserve the sequence order under lifting. Using the dynamic programming principle, we establish the existence of globally optimal policies under mild assumptions of compactness. We further prove that closed-loop policies in the lifted space are equivalent to initial-distribution-dependent open-loop policies, which are realized-input-independent and compatible with standard Transformer training. To train a Transformer, we propose a triply quantized training procedure for the lifted MDP by quantizing the state space, the space of probability measures, and the action space, and show that any optimal policy for the triply quantized model is near-optimal for the original training problem. Finally, we establish stability and empirical consistency properties of the lifted model by showing that the value function is continuous with respect to perturbations of the initial empirical measures and that policies converge as the data size increases. This approach provides a globally optimal and robust alternative to gradient-based training without requiring smoothness or convexity.
[504] Routing without Forgetting
Alessio Masano, Giovanni Bellitto, Dipam Goswani, Joost Van de Weijer, Concetto Spampinato
Main category: cs.LG
TL;DR: Routing without Forgetting (RwF) introduces energy-based associative retrieval layers in transformers for online continual learning, enabling dynamic routing of inputs to appropriate representational subspaces without gradient-based specialization.
Details
Motivation: Current parameter-efficient adaptation methods (prompts, adapters, LoRA) for continual learning in transformers rely on gradual gradient-based specialization and struggle in Online Continual Learning (OCL) where data arrives as a non-stationary stream with single-pass observations.
Method: RwF augments transformers with energy-based associative retrieval layers inspired by Modern Hopfield Networks. Instead of storing task-specific prompts, it generates dynamic prompts through single-step associative retrieval over transformer token embeddings at each layer, using closed-form minimization of a strictly convex free-energy functional for input-conditioned routing.
Result: RwF outperforms prior prompt-based methods on challenging class-incremental benchmarks. On Split-ImageNet-R and Split-ImageNet-S, it achieves large margins of improvement, even in few-shot learning regimes.
Conclusion: Embedding energy-based associative routing directly within transformer backbones provides a principled and effective foundation for Online Continual Learning, enabling dynamic routing without gradient refinement.
Abstract: Continual learning in transformers is commonly addressed through parameter-efficient adaptation: prompts, adapters, or LoRA modules are specialized per task while the backbone remains frozen. Although effective in controlled multi-epoch settings, these approaches rely on gradual gradient-based specialization and struggle in Online Continual Learning (OCL), where data arrive as a non-stationary stream and each sample may be observed only once. We recast continual learning in transformers as a routing problem: under strict online constraints, the model must dynamically select the appropriate representational subspace for each input without explicit task identifiers or repeated optimization. We thus introduce Routing without Forgetting (RwF), a transformer architecture augmented with energy-based associative retrieval layers inspired by Modern Hopfield Networks. Instead of storing or merging task-specific prompts, RwF generates dynamic prompts through single-step associative retrieval over the transformer token embeddings at each layer. Retrieval corresponds to the closed-form minimization of a strictly convex free-energy functional, enabling input-conditioned routing within each forward pass, independently of iterative gradient refinement. Across challenging class-incremental benchmarks, RwF improves over existing prompt-based methods. On Split-ImageNet-R and Split-ImageNet-S, RwF outperforms prior prompt-based approaches by a large margin, even in few-shot learning regimes. These results indicate that embedding energy-based associative routing directly within the transformer backbone provides a principled and effective foundation for OCL.
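Single-step associative retrieval in a modern Hopfield network reduces to one softmax-weighted readout over stored patterns, which is the closed-form energy-minimization step the abstract refers to. The sketch below shows only that generic update; beta and the toy key/value patterns are our illustrative choices, not RwF's learned prompts.

```python
import numpy as np

def hopfield_retrieve(query, keys, values, beta=8.0):
    """One closed-form modern-Hopfield update: softmax(beta * K q) V.
    A single step, with no iterative gradient refinement."""
    scores = beta * keys @ query
    scores = scores - scores.max()        # numerical stability
    w = np.exp(scores)
    w = w / w.sum()
    return w @ values

# Two stored key/value patterns (illustrative only).
keys = np.array([[1.0, 0.0], [0.0, 1.0]])
values = np.array([[10.0, 0.0], [0.0, 10.0]])
out = hopfield_retrieve(np.array([0.9, 0.1]), keys, values)
```

With a sufficiently sharp beta, a noisy query snaps to the value of its nearest stored key in one forward pass, which is what makes this form of routing compatible with strict online constraints.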
[505] Towards Understanding Adam Convergence on Highly Degenerate Polynomials
Zhiwei Bai, Jiajie Zhao, Zhangchen Zhou, Zhi-Qin John Xu, Yaoyu Zhang
Main category: cs.LG
TL;DR: Adam exhibits natural auto-convergence on highly degenerate polynomials without external schedulers, achieving local linear convergence that outperforms Gradient Descent and Momentum’s sub-linear convergence.
Details
Motivation: While Adam is widely used in deep learning, the specific class of objective functions where it has inherent advantages remains underexplored. Previous studies required external schedulers and β₂ near 1 for convergence, but this work investigates Adam's "natural" auto-convergence properties without such requirements.
Method: The authors identify a class of highly degenerate polynomials where Adam converges automatically. They derive theoretical conditions for local asymptotic stability on these degenerate polynomials and demonstrate alignment between theoretical bounds and experimental results. They analyze the decoupling mechanism between second moment v_t and squared gradient g_t² that enables exponential amplification of effective learning rate.
Result: Adam achieves local linear convergence on degenerate functions, significantly outperforming Gradient Descent and Momentum’s sub-linear convergence. The paper characterizes Adam’s hyperparameter phase diagram with three distinct behavioral regimes: stable convergence, spikes, and SignGD-like oscillation.
Conclusion: Adam has inherent auto-convergence properties on highly degenerate polynomials without requiring external schedulers, with a decoupling mechanism that exponentially amplifies effective learning rate, leading to superior convergence compared to traditional optimization methods.
Abstract: Adam is a widely used optimization algorithm in deep learning, yet the specific class of objective functions where it exhibits inherent advantages remains underexplored. Unlike prior studies requiring external schedulers and $\beta_2$ near 1 for convergence, this work investigates the “natural” auto-convergence properties of Adam. We identify a class of highly degenerate polynomials where Adam converges automatically without additional schedulers. Specifically, we derive theoretical conditions for local asymptotic stability on degenerate polynomials and demonstrate strong alignment between theoretical bounds and experimental results. We prove that Adam achieves local linear convergence on these degenerate functions, significantly outperforming the sub-linear convergence of Gradient Descent and Momentum. This acceleration stems from a decoupling mechanism between the second moment $v_t$ and squared gradient $g_t^2$, which exponentially amplifies the effective learning rate. Finally, we characterize Adam’s hyperparameter phase diagram, identifying three distinct behavioral regimes: stable convergence, spikes, and SignGD-like oscillation.
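The decoupling claim can be illustrated numerically: because v_t tracks g_t², Adam's update magnitude lr·m̂/√v̂ stays roughly constant even as the raw gradients shrink, whereas a plain gradient-descent step lr·g shrinks with them. The synthetic geometric gradient sequence below is our illustrative assumption, not the paper's polynomial setting.

```python
def adam_effective_steps(grads, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """Track the magnitude of Adam's update lr * m_hat / sqrt(v_hat) for a
    given gradient sequence, without updating any parameters."""
    m = v = 0.0
    steps = []
    for t, g in enumerate(grads, start=1):
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
        steps.append(lr * m_hat / (v_hat ** 0.5 + eps))
    return steps

# Slowly shrinking gradients, as near a flat (degenerate) minimum.
grads = [0.5 * 0.9999 ** t for t in range(30_000)]
steps = adam_effective_steps(grads)
```

By the end the raw gradient has fallen by more than an order of magnitude while the effective step has barely moved: the second-moment normalization cancels the gradient's decay, which is the sign-like behavior behind the paper's linear-convergence result.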
[506] Nonparametric Variational Differential Privacy via Embedding Parameter Clipping
Dina El Zein, Shashi Kumar, James Henderson
Main category: cs.LG
TL;DR: A principled parameter clipping strategy for nonparametric variational information bottleneck models to improve privacy-utility trade-off by preventing learned latent representations from drifting into high-information regions.
Details
Motivation: The nonparametric variational information bottleneck (NVIB) enables nonparametric variational differential privacy (NVDP) for privacy-preserving language models, but learned latent representations can drift into regions with high information content, leading to poor privacy guarantees and low utility due to numerical instability during training.
Method: Introduces a mathematically derived parameter clipping strategy from the objective of minimizing the Rényi Divergence upper bound, yielding specific, theoretically grounded constraints on posterior mean, variance, and mixture weight parameters in NVIB-based models.
Result: The clipped model consistently achieves tighter Rényi Divergence bounds (implying stronger privacy) while simultaneously attaining higher performance on several downstream tasks compared to an unconstrained baseline.
Conclusion: Presents a simple yet effective method for improving the privacy-utility trade-off in variational models, making them more robust and practical for privacy-preserving language modeling.
Abstract: The nonparametric variational information bottleneck (NVIB) provides the foundation for nonparametric variational differential privacy (NVDP), a framework for building privacy-preserving language models. However, the learned latent representations can drift into regions with high information content, leading to poor privacy guarantees, but also low utility due to numerical instability during training. In this work, we introduce a principled parameter clipping strategy to directly address this issue. Our method is mathematically derived from the objective of minimizing the Rényi Divergence (RD) upper bound, yielding specific, theoretically grounded constraints on the posterior mean, variance, and mixture weight parameters. We apply our technique to an NVIB based model and empirically compare it against an unconstrained baseline. Our findings demonstrate that the clipped model consistently achieves tighter RD bounds, implying stronger privacy, while simultaneously attaining higher performance on several downstream tasks. This work presents a simple yet effective method for improving the privacy-utility trade-off in variational models, making them more robust and practical.
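The general shape of such a clipping step is to project the posterior parameters back into a bounded region after each update. The bounds below are placeholders for illustration; the paper derives its specific constraints from the Rényi-divergence upper bound.

```python
import numpy as np

def clip_posterior_params(mu, log_var, log_w,
                          mu_max=5.0, log_var_range=(-4.0, 1.0), log_w_min=-6.0):
    """Project variational posterior parameters into a bounded box after an
    update. All bounds here are hypothetical placeholders, not the
    theoretically derived constraints from the paper."""
    mu = np.clip(mu, -mu_max, mu_max)
    log_var = np.clip(log_var, *log_var_range)
    log_w = np.clip(log_w, log_w_min, 0.0)
    return mu, log_var, log_w
```

Keeping all three parameter groups bounded is what prevents the latent representations from drifting into high-information regions during training.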
[507] Memorization capacity of deep ReLU neural networks characterized by width and depth
Xin Yang, Yunfei Yang
Main category: cs.LG
TL;DR: Deep neural networks with ReLU activation require width-depth trade-off W²L² = Θ(N log(δ⁻¹)) to memorize N data points with separation δ, achieving optimal memorization capacity.
Details
Motivation: To understand the fundamental memorization capacity of deep neural networks with ReLU activation, specifically characterizing the minimal network size needed to memorize arbitrary data points with separation constraints, going beyond prior work that only considered parameter or neuron counts.
Method: Constructs neural networks with specific width W and depth L satisfying W²L² = O(N log(δ⁻¹)) that can memorize any N data samples in the unit ball with pairwise separation δ. Also proves a matching lower bound showing W²L² = Ω(N log(δ⁻¹)) is necessary.
Result: Establishes optimal width-depth trade-off W²L² = Θ(N log(δ⁻¹)) for memorization capacity, showing construction is optimal up to logarithmic factors when δ⁻¹ is polynomial in N.
Conclusion: Explicitly characterizes the fundamental trade-off between width and depth for memorization capacity of deep ReLU networks, providing optimal construction and matching lower bounds.
Abstract: This paper studies the memorization capacity of deep neural networks with ReLU activation. Specifically, we investigate the minimal size of such networks to memorize any $N$ data points in the unit ball with pairwise separation distance $δ$ and discrete labels. Most prior studies characterize the memorization capacity by the number of parameters or neurons. We generalize these results by constructing neural networks, whose width $W$ and depth $L$ satisfy $W^2L^2= \mathcal{O}(N\log(δ^{-1}))$, that can memorize any $N$ data samples. We also prove that any such network must satisfy the lower bound $W^2L^2=Ω(N \log(δ^{-1}))$, which implies that our construction is optimal up to logarithmic factors when $δ^{-1}$ is polynomial in $N$. Hence, we explicitly characterize the trade-off between width and depth for the memorization capacity of deep neural networks in this regime.
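The trade-off can be made concrete with a toy calculation: from W²L² = Θ(N log(δ⁻¹)), the width needed at a fixed depth scales as √(N log(δ⁻¹))/L. A minimal sketch ignoring constants (the function name is ours):

```python
import math

def min_width(n_points, delta, depth):
    """Smallest width W with (W * depth)^2 >= N * log(1/delta), i.e.
    the width suggested by the W^2 L^2 = Theta(N log(1/delta))
    trade-off, with constants ignored."""
    target = n_points * math.log(1.0 / delta)  # N log(delta^{-1})
    return math.ceil(math.sqrt(target) / depth)

# Doubling the depth roughly halves the required width.
w_shallow = min_width(10_000, 1e-3, depth=10)  # 27
w_deep = min_width(10_000, 1e-3, depth=20)     # 14
```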
[508] MM-algorithms for traditional and convex NMF with Tweedie and Negative Binomial cost functions and empirical evaluation
Elisabeth Sommer James, Asger Hobolth, Marta Pelizzola
Main category: cs.LG
TL;DR: A unified NMF framework with flexible noise models (Negative Binomial, Tweedie) using Majorize-Minimization approach, with novel updates for convex NMF and applications to mutational/word count data.
Details
Motivation: Standard NMF formulations assume Gaussian/Poisson noise, which may not fit data with overdispersion or complex mean-variance relationships. Need for more flexible distributional assumptions in NMF.
Method: Develop unified framework for traditional and convex NMF under broad class of distributions (Negative Binomial, Tweedie). Use Majorize-Minimization approach to derive multiplicative update rules, including novel updates for convex NMF with Poisson and Negative Binomial cost functions.
Result: Empirical evaluations show noise model choice critically affects model fit and feature recovery. Convex NMF provides efficient alternative when number of classes is large. Implementation available in R package nmfgenr.
Conclusion: Flexible noise modeling in NMF is important for real-world data. Convex NMF can be robust alternative to traditional NMF in certain scenarios.
Abstract: Non-negative matrix factorisation (NMF) is a widely used tool for unsupervised learning and feature extraction, with applications ranging from genomics to text analysis and signal processing. Standard formulations of NMF are typically derived under Gaussian or Poisson noise assumptions, which may be inadequate for data exhibiting overdispersion or other complex mean-variance relationships. In this paper, we develop a unified framework for both traditional and convex NMF under a broad class of distributional assumptions, including Negative Binomial and Tweedie models, where the connection between the Tweedie and the $β$-divergence is also highlighted. Using a Majorize-Minimisation approach, we derive multiplicative update rules for all considered models, and novel updates for convex NMF with Poisson and Negative Binomial cost functions. We provide a unified implementation of all considered models, including the first implementations of several convex NMF models. Empirical evaluations on mutational and word count data demonstrate that the choice of noise model critically affects model fit and feature recovery, and that convex NMF can provide an efficient and robust alternative to traditional NMF in scenarios where the number of classes is large. The code for our proposed updates is available in the R package nmfgenr and can be found at https://github.com/MartaPelizzola/nmfgenr.
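For concreteness, the Poisson (generalized KL) case of these MM-derived multiplicative updates is the classic Lee-Seung rule; the Negative Binomial and Tweedie updates in the paper follow the same majorize-minimize pattern. A self-contained sketch, not taken from the `nmfgenr` package:

```python
import numpy as np

def kl_div(V, WH, eps=1e-9):
    """Generalized KL divergence D(V || WH), the Poisson NMF cost."""
    return float(np.sum(V * np.log((V + eps) / (WH + eps)) - V + WH))

def poisson_nmf(V, rank, n_iter=100, eps=1e-9, seed=0):
    """Lee-Seung multiplicative updates for NMF under a Poisson cost.
    Each update majorizes the cost, so the divergence is non-increasing."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    divs = []
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / W.sum(axis=0)[:, None]  # MM update for H
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / H.sum(axis=1)[None, :]  # MM update for W
        divs.append(kl_div(V, W @ H))
    return W, H, divs

V = np.random.default_rng(1).random((30, 20))
W, H, divs = poisson_nmf(V, rank=5)
```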
[509] Learning the Hierarchical Organization in Brain Network for Brain Disorder Diagnosis
Jingfeng Tang, Peng Cao, Guangqi Wen, Jinzhu Yang, Xiaoli Liu, Osmar R. Zaiane
Main category: cs.LG
TL;DR: BrainHO learns hierarchical brain network dependencies from fMRI data without predefined sub-networks, using hierarchical attention and constraints to capture cross-network interactions for brain disorder diagnosis.
Details
Motivation: Existing fMRI brain network analysis relies on predefined functional sub-networks, which fails to capture many cross-network interaction patterns with high correlations. There's a need to learn hierarchical dependencies based on intrinsic features rather than predefined labels.
Method: Proposes Brain Hierarchical Organization Learning (BrainHO) with hierarchical attention mechanism to aggregate nodes into hierarchical organization, capturing intricate connectivity patterns at the subgraph level. Uses orthogonality constraint loss and hierarchical consistency constraint strategy to refine node-level features with high-level graph semantics.
Result: Extensive experiments on ABIDE and REST-meta-MDD datasets show BrainHO achieves state-of-the-art classification performance and uncovers interpretable, clinically significant biomarkers by precisely localizing disease-related sub-networks.
Conclusion: BrainHO effectively learns hierarchical brain network dependencies without predefined sub-networks, capturing cross-network interactions for improved brain disorder diagnosis and biomarker discovery.
Abstract: Brain network analysis based on functional Magnetic Resonance Imaging (fMRI) is pivotal for diagnosing brain disorders. Existing approaches typically rely on predefined functional sub-networks to construct sub-network associations. However, we identified many cross-network interaction patterns with high Pearson correlations that this strict, prior-based organization fails to capture. To overcome this limitation, we propose the Brain Hierarchical Organization Learning (BrainHO) to learn inherently hierarchical brain network dependencies based on their intrinsic features rather than predefined sub-network labels. Specifically, we design a hierarchical attention mechanism that allows the model to aggregate nodes into a hierarchical organization, effectively capturing intricate connectivity patterns at the subgraph level. To ensure diverse, complementary, and stable organizations, we incorporate an orthogonality constraint loss, alongside a hierarchical consistency constraint strategy, to refine node-level features using high-level graph semantics. Extensive experiments on the publicly available ABIDE and REST-meta-MDD datasets demonstrate that BrainHO not only achieves state-of-the-art classification performance but also uncovers interpretable, clinically significant biomarkers by precisely localizing disease-related sub-networks.
[510] Well Log-Guided Synthesis of Subsurface Images from Sparse Petrography Data Using cGANs
Ali Sadeghkhani, A. Assadi, B. Bennett, A. Rabbani
Main category: cs.LG
TL;DR: A cGAN framework generates realistic carbonate rock thin section images conditioned on porosity values from well logs, enabling continuous pore-scale visualization along wellbores.
Details
Motivation: Pore-scale imaging of subsurface formations is expensive and limited to discrete depths, creating significant gaps in reservoir characterization that need to be addressed.
Method: Conditional Generative Adversarial Network (cGAN) framework trained on 5,000 sub-images from 15 petrography samples over 1992-2000m depth interval, conditioned on porosity values from well logs.
Result: Model generates geologically consistent images across wide porosity range (0.004-0.745) with 81% accuracy within 10% margin of target porosity values, enabling continuous pore-scale visualization.
Conclusion: The integration of well log data with trained generator bridges gaps between discrete core sampling points, providing valuable insights for reservoir characterization and energy transition applications.
Abstract: Pore-scale imaging of subsurface formations is costly and limited to discrete depths, creating significant gaps in reservoir characterization. To address this, we present a conditional Generative Adversarial Network (cGAN) framework for synthesizing realistic thin section images of carbonate rock formations, conditioned on porosity values derived from well logs. Trained on 5,000 sub-images extracted from 15 petrography samples over a depth interval of 1992-2000m, the model generates geologically consistent images across a wide porosity range (0.004-0.745), achieving 81% accuracy within a 10% margin of target porosity values. The successful integration of well log data with the trained generator enables continuous pore-scale visualization along the wellbore, bridging gaps between discrete core sampling points and providing valuable insights for reservoir characterization and energy transition applications such as carbon capture and underground hydrogen storage.
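The reported "81% within a 10% margin" metric can be read as a relative tolerance check on the porosity of generated images. Assuming that reading (the paper may define the margin as absolute rather than relative), a sketch:

```python
import numpy as np

def fraction_within_margin(target_phi, measured_phi, margin=0.10):
    """Fraction of generated images whose measured porosity lies within
    a relative margin of the conditioning target. Whether the paper's
    margin is relative or absolute is our assumption here."""
    target_phi = np.asarray(target_phi, dtype=float)
    measured_phi = np.asarray(measured_phi, dtype=float)
    return float(np.mean(np.abs(measured_phi - target_phi) <= margin * target_phi))

# First sample is within 10% of its target, second is not.
frac = fraction_within_margin([0.20, 0.20], [0.21, 0.30])  # 0.5
```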
[511] FreqCycle: A Multi-Scale Time-Frequency Analysis Method for Time Series Forecasting
Boya Zhang, Shuaijie Yin, Huiwen Zhu, Xing He
Main category: cs.LG
TL;DR: FreqCycle is a time series forecasting framework that addresses the limitation of existing models focusing only on low-frequency patterns by incorporating both low-frequency and mid-to-high frequency feature extraction through specialized modules, with hierarchical extension MFreqCycle for handling coupled multi-periodicity.
Details
Motivation: Existing time series forecasting research predominantly focuses on modeling low-frequency patterns where most energy is concentrated, overlooking mid to high frequency components which limits further performance gains in deep learning models.
Method: FreqCycle integrates: (1) Filter-Enhanced Cycle Forecasting (FECF) module to extract low-frequency features by learning shared periodic patterns in time domain, and (2) Segmented Frequency-domain Pattern Learning (SFPL) module to enhance mid-to-high frequency energy proportion via learnable filters and adaptive weighting. MFreqCycle extends this hierarchically to decouple nested periodic features through cross-scale interactions for handling coupled multi-periodicity.
Result: Extensive experiments on seven diverse domain benchmarks demonstrate that FreqCycle achieves state-of-the-art accuracy while maintaining faster inference speeds, striking an optimal balance between performance and efficiency.
Conclusion: The proposed FreqCycle framework effectively addresses the limitation of overlooking mid-to-high frequency patterns in time series forecasting, with hierarchical extension MFreqCycle successfully handling coupled multi-periodicity challenges.
Abstract: Mining time-frequency features is critical for time series forecasting. Existing research has predominantly focused on modeling low-frequency patterns, where most time series energy is concentrated. The overlooking of mid to high frequency continues to limit further performance gains in deep learning models. We propose FreqCycle, a novel framework integrating: (i) a Filter-Enhanced Cycle Forecasting (FECF) module to extract low-frequency features by explicitly learning shared periodic patterns in the time domain, and (ii) a Segmented Frequency-domain Pattern Learning (SFPL) module to enhance mid to high frequency energy proportion via learnable filters and adaptive weighting. Furthermore, time series data often exhibit coupled multi-periodicity, such as intertwined weekly and daily cycles. To address coupled multi-periodicity as well as long lookback window challenges, we extend FreqCycle hierarchically into MFreqCycle, which decouples nested periodic features through cross-scale interactions. Extensive experiments on seven diverse domain benchmarks demonstrate that FreqCycle achieves state-of-the-art accuracy while maintaining faster inference speeds, striking an optimal balance between performance and efficiency.
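The low vs. mid-to-high frequency separation FreqCycle builds on can be illustrated with a fixed rFFT mask. The FECF/SFPL modules in the paper are learned; this hard cutoff is only an illustration:

```python
import numpy as np

def split_bands(x, cutoff):
    """Split a series into a low-frequency part (rFFT bins below the
    cutoff index) and a mid-to-high-frequency residual."""
    spec = np.fft.rfft(x)
    low_spec = np.zeros_like(spec)
    low_spec[:cutoff] = spec[:cutoff]
    low = np.fft.irfft(low_spec, n=len(x))
    high = x - low  # residual carries the mid-to-high-frequency energy
    return low, high

# A slow cycle (period 64 -> bin 4) plus a fast one (period 8 -> bin 32);
# a cutoff of 6 isolates the slow cycle exactly.
t = np.arange(256)
x = np.sin(2 * np.pi * t / 64) + 0.2 * np.sin(2 * np.pi * t / 8)
low, high = split_bands(x, cutoff=6)
```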
[512] No evaluation without fair representation: Impact of label and selection bias on the evaluation, performance and mitigation of classification models
Magali Legast, Toon Calders, François Fouss
Main category: cs.LG
TL;DR: Empirical analysis of different bias types (label bias and selection bias subtypes) on ML model evaluation, performance, and bias mitigation effectiveness, using a controlled framework to model fair vs. biased worlds.
Details
Motivation: Different bias types (label bias and selection bias subtypes) impact fair ML evaluation differently, but their comparative effects are understudied. Need better understanding of how specific bias types affect model evaluation, performance, and mitigation method effectiveness.
Method: Introduces a biasing and evaluation framework that models fair worlds and their biased counterparts by introducing controlled bias into real-life datasets with low discrimination. Empirically analyzes impact of each bias type independently, avoiding traditional biased test set evaluation.
Result: Identifies factors influencing bias impact on model performance, shows no trade-off between fairness and accuracy when evaluated on unbiased test sets, finds bias mitigation effectiveness depends on bias type present in data.
Conclusion: Calls for more accurate model and fairness intervention evaluations, better understanding of complex bias scenarios, and investigation of dataset characteristics affecting mitigation method efficiency.
Abstract: Bias can be introduced in diverse ways in machine learning datasets, for example via selection or label bias. Although these bias types in themselves have an influence on important aspects of fair machine learning, their different impact has been understudied. In this work, we empirically analyze the effect of label bias and several subtypes of selection bias on the evaluation of classification models, on their performance, and on the effectiveness of bias mitigation methods. We also introduce a biasing and evaluation framework that makes it possible to model fair worlds and their biased counterparts through the introduction of controlled bias in real-life datasets with low discrimination. Using our framework, we empirically analyze the impact of each bias type independently, while obtaining a more representative evaluation of models and mitigation methods than with the traditional use of a subset of biased data as the test set. Our results highlight different factors that influence how impactful bias is on model performance. They also show an absence of trade-off between fairness and accuracy, and between individual and group fairness, when models are evaluated on a test set that does not exhibit unwanted bias. They furthermore indicate that the performance of bias mitigation methods is influenced by the type of bias present in the data. Our findings call for future work to develop more accurate evaluations of prediction models and fairness interventions, but also to better understand other types of bias, more complex scenarios involving the combination of different bias types, and other factors that impact the efficiency of the mitigation methods, such as dataset characteristics.
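One concrete way to realize the "fair world to biased counterpart" construction is controlled label-bias injection. A minimal sketch, assuming a binary label and a binary group indicator (the function and parameter names are ours, not the framework's API):

```python
import numpy as np

def inject_label_bias(labels, group, flip_rate, rng):
    """Flip a fraction of positive labels to negative within one group,
    turning an (approximately) fair dataset into a label-biased
    counterpart with a known, controlled amount of bias."""
    labels = labels.copy()
    candidates = np.where((group == 1) & (labels == 1))[0]
    n_flip = int(round(flip_rate * len(candidates)))
    flipped = rng.choice(candidates, size=n_flip, replace=False)
    labels[flipped] = 0
    return labels

rng = np.random.default_rng(0)
y = np.ones(10, dtype=int)
g = np.ones(10, dtype=int)
y_biased = inject_label_bias(y, g, flip_rate=0.5, rng=rng)
```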
[513] GNNs for Time Series Anomaly Detection: An Open-Source Framework and a Critical Evaluation
Federico Bello, Gonzalo Chiarlone, Marcelo Fiori, Gastón García González, Federico Larroca
Main category: cs.LG
TL;DR: A framework for evaluating graph neural networks in time series anomaly detection with standardized evaluation and interpretability analysis.
Details
Motivation: There's growing interest in applying GNNs to time series anomaly detection, but the field lacks standardized evaluation frameworks and suffers from metric design issues, making comparisons difficult.
Method: Developed an open-source framework for TSAD using GNNs that supports reproducible experimentation across datasets, graph structures, and evaluation strategies. Evaluated several GNN-based architectures alongside baselines on real-world datasets with contrasting structural characteristics.
Result: GNNs improve detection performance and offer significant gains in interpretability. Attention-based GNNs show robustness when graph structure is uncertain. The framework reveals how certain metrics and thresholding strategies can obscure meaningful comparisons.
Conclusion: The work provides practical tools and critical insights to advance graph-based TSAD systems, emphasizing the importance of standardized evaluation and the value of interpretability in practical applications.
Abstract: There is growing interest in applying graph-based methods to Time Series Anomaly Detection (TSAD), particularly Graph Neural Networks (GNNs), as they naturally model dependencies among multivariate signals. GNNs are typically used as backbones in score-based TSAD pipelines, where anomalies are identified through reconstruction or prediction errors followed by thresholding. However, and despite promising results, the field still lacks standardized frameworks for evaluation and suffers from persistent issues with metric design and interpretation. We thus present an open-source framework for TSAD using GNNs, designed to support reproducible experimentation across datasets, graph structures, and evaluation strategies. Built with flexibility and extensibility in mind, the framework facilitates systematic comparisons between TSAD models and enables in-depth analysis of performance and interpretability. Using this tool, we evaluate several GNN-based architectures alongside baseline models across two real-world datasets with contrasting structural characteristics. Our results show that GNNs not only improve detection performance but also offer significant gains in interpretability, an especially valuable feature for practical diagnosis. We also find that attention-based GNNs offer robustness when graph structure is uncertain or inferred. In addition, we reflect on common evaluation practices in TSAD, showing how certain metrics and thresholding strategies can obscure meaningful comparisons. Overall, this work contributes both practical tools and critical insights to advance the development and evaluation of graph-based TSAD systems.
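The score-based pipeline the framework standardizes boils down to a reconstruction error plus a threshold. A minimal sketch of that decision step; the quantile threshold used here is exactly the kind of choice the paper argues can obscure comparisons:

```python
import numpy as np

def anomaly_flags(x, x_rec, quantile=0.99):
    """Score-based TSAD decision step: per-timestep reconstruction error
    (an L2 norm over channels) thresholded at a high quantile of the
    scores themselves."""
    err = np.linalg.norm(x - x_rec, axis=-1)  # one score per timestep
    thr = np.quantile(err, quantile)
    return err > thr, thr

x = np.zeros((100, 3))
x[50] = 10.0  # an obvious anomaly at t = 50
flags, thr = anomaly_flags(x, np.zeros_like(x))
```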
[514] On Catastrophic Forgetting in Low-Rank Decomposition-Based Parameter-Efficient Fine-Tuning
Muhammad Ahmad, Jingjing Zheng, Yankai Cao
Main category: cs.LG
TL;DR: Empirical study shows that forgetting in sequential learning with PEFT methods depends on update subspace geometry and parameterization, with tensor-based decompositions mitigating forgetting better than shared matrix subspaces.
Details
Motivation: To understand the behavior of parameter-efficient fine-tuning (PEFT) methods like LoRA in sequential learning, specifically regarding catastrophic forgetting, which remains insufficiently understood despite PEFT becoming standard for adapting large pretrained models.
Method: Empirical study analyzing how forgetting is influenced by the geometry and parameterization of the update subspace. Examines methods that restrict updates to small, shared matrix subspaces (which suffer from task interference) versus tensor-based decompositions like LoRETTA and structurally aligned parameterizations like WeGeFT.
Result: Findings show that tensor-based decompositions mitigate forgetting by capturing richer structural information within ultra-compact budgets, and structurally aligned parameterizations preserve pretrained representations better than shared matrix subspace methods.
Conclusion: Update subspace design is a key factor in continual learning, offering practical guidance for selecting efficient adaptation strategies in sequential settings. The geometry and parameterization of the update subspace strongly influence catastrophic forgetting in PEFT methods.
Abstract: Parameter-efficient fine-tuning (PEFT) based on low-rank decomposition, such as LoRA, has become a standard for adapting large pretrained models. However, its behavior in sequential learning – specifically regarding catastrophic forgetting – remains insufficiently understood. In this work, we present an empirical study showing that forgetting is strongly influenced by the geometry and parameterization of the update subspace. While methods that restrict updates to small, shared matrix subspaces often suffer from task interference, tensor-based decompositions (e.g., LoRETTA) mitigate forgetting by capturing richer structural information within ultra-compact budgets, and structurally aligned parameterizations (e.g., WeGeFT) preserve pretrained representations. Our findings highlight update subspace design as a key factor in continual learning and offer practical guidance for selecting efficient adaptation strategies in sequential settings.
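As a reminder of the object under study, a LoRA-style update restricts adaptation to a rank-r subspace, W0 + (alpha/r)·BA; the paper's finding is that the geometry of this subspace (shared matrix vs. tensor vs. structurally aligned) governs how much sequential tasks interfere. A minimal numpy sketch:

```python
import numpy as np

d_out, d_in, r, alpha = 512, 512, 8, 16
rng = np.random.default_rng(0)

W0 = rng.standard_normal((d_out, d_in))    # frozen pretrained weight
A = 0.01 * rng.standard_normal((r, d_in))  # trainable down-projection
B = np.zeros((d_out, r))                   # trainable up-projection, zero init

def lora_forward(x):
    """Forward pass with the frozen weight plus the rank-r update BA."""
    return (W0 + (alpha / r) * (B @ A)) @ x

# At initialization B = 0, so the adapted model matches the pretrained one,
# and the update trains 32x fewer parameters than full fine-tuning here.
full_params = d_out * d_in        # 262144
lora_params = r * (d_in + d_out)  # 8192
```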
[515] Physics-informed neural operator for predictive parametric phase-field modelling
Nanxi Chen, Airong Chen, Rujin Ma
Main category: cs.LG
TL;DR: PF-PINO: A physics-informed neural operator framework for accelerating phase-field modeling by embedding physical constraints into training, outperforming conventional Fourier neural operators in accuracy and stability.
Details
Motivation: Phase-field modeling is computationally intensive for high-throughput studies. While neural operators like FNO can accelerate parametric PDE solutions, they lack explicit physical constraints, limiting generalization and long-term accuracy for complex phase-field dynamics.
Method: Developed PF-PINO, a physics-informed neural operator framework that learns parametric phase-field PDEs by embedding residuals of governing equations into the data-fidelity loss function to enforce physical constraints during training.
Result: PF-PINO significantly outperforms conventional FNO in accuracy, generalization capability, and long-term stability across benchmark phase-field problems including electrochemical corrosion, dendritic crystal solidification, and spinodal decomposition.
Conclusion: Provides a robust computational tool for phase-field modeling and highlights the potential of physics-informed neural operators to advance scientific machine learning for complex interfacial evolution problems.
Abstract: Predicting the microstructural and morphological evolution of materials through phase-field modelling is computationally intensive, particularly for high-throughput parametric studies. While neural operators such as the Fourier neural operator (FNO) show promise in accelerating the solution of parametric partial differential equations (PDEs), the lack of explicit physical constraints, may limit generalisation and long-term accuracy for complex phase-field dynamics. Here, we develop a physics-informed neural operator framework to learn parametric phase-field PDEs, namely PF-PINO. By embedding the residuals of phase-field governing equations into the data-fidelity loss function, our framework effectively enforces physical constraints during training. We validate PF-PINO against benchmark phase-field problems, including electrochemical corrosion, dendritic crystal solidification, and spinodal decomposition. Our results demonstrate that PF-PINO significantly outperforms conventional FNO in accuracy, generalisation capability, and long-term stability. This work provides a robust and efficient computational tool for phase-field modelling and highlights the potential of physics-informed neural operators to advance scientific machine learning for complex interfacial evolution problems.
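The core PF-PINO ingredient, adding the governing-equation residual to the data loss, can be shown on a toy ODE standing in for the phase-field equations. Everything here is illustrative, not the paper's actual loss; the residual is computed with finite differences:

```python
import numpy as np

def physics_informed_loss(u_pred, u_data, t, lam=1.0):
    """Toy physics-informed loss: data fidelity plus the squared residual
    of the governing equation. Here the 'governing equation' is the toy
    ODE du/dt = -u, so the residual is du/dt + u."""
    data_loss = np.mean((u_pred - u_data) ** 2)
    dudt = np.gradient(u_pred, t)   # finite-difference time derivative
    residual = dudt + u_pred        # du/dt + u = 0 for the toy ODE
    return data_loss + lam * np.mean(residual ** 2)

t = np.linspace(0.0, 2.0, 200)
exact = np.exp(-t)  # solves du/dt = -u
loss_good = physics_informed_loss(exact, exact, t)           # ~0
loss_bad = physics_informed_loss(np.ones_like(t), exact, t)  # large residual
```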
[516] A Multi-Prototype-Guided Federated Knowledge Distillation Approach in AI-RAN Enabled Multi-Access Edge Computing System
Luyao Zou, Hayoung Oh, Chu Myaet Thwal, Apurba Adhikary, Seohyeon Hong, Zhu Han
Main category: cs.LG
TL;DR: Proposes MP-FedKD, a multi-prototype-guided federated knowledge distillation approach for AI-RAN enabled MEC systems to handle non-IID data heterogeneity issues.
Details
Motivation: Integration of AI-RAN and MEC can transform network efficiency, but federated learning in such systems faces challenges with non-IID data. Single prototype approaches lose useful information through averaging.
Method: Multi-prototype strategy with conditional hierarchical agglomerative clustering (CHAC) and prototype alignment scheme. Integrates self-knowledge distillation into FL and designs novel LEMGP loss function focusing on relationships between global prototypes and local embeddings.
Result: Extensive experiments show MP-FedKD outperforms state-of-the-art baselines in accuracy, average accuracy, and errors (RMSE and MAE) across multiple datasets with various non-IID settings.
Conclusion: MP-FedKD effectively addresses non-IID data heterogeneity in AI-RAN enabled MEC systems through multi-prototype guidance and knowledge distillation.
Abstract: With the development of wireless networks, Multi-Access Edge Computing (MEC) and Artificial Intelligence (AI)-native Radio Access Network (RAN) have attracted significant attention. Particularly, the integration of AI-RAN and MEC is envisioned to transform network efficiency and responsiveness. Therefore, it is valuable to investigate AI-RAN enabled MEC system. Federated learning (FL) nowadays is emerging as a promising approach for AI-RAN enabled MEC system, in which edge devices are enabled to train a global model cooperatively without revealing their raw data. However, conventional FL encounters the challenge in processing the non-independent and identically distributed (non-IID) data. Single prototype obtained by averaging the embedding vectors per class can be employed in FL to handle the data heterogeneity issue. Nevertheless, this may result in the loss of useful information owing to the average operation. Therefore, in this paper, a multi-prototype-guided federated knowledge distillation (MP-FedKD) approach is proposed. Particularly, self-knowledge distillation is integrated into FL to deal with the non-IID issue. To cope with the problem of information loss caused by single prototype-based strategy, multi-prototype strategy is adopted, where we present a conditional hierarchical agglomerative clustering (CHAC) approach and a prototype alignment scheme. Additionally, we design a novel loss function (called LEMGP loss) for each local client, which focuses on the relationship between global prototypes and local embeddings. Extensive experiments over multiple datasets with various non-IID settings show that the proposed MP-FedKD approach outperforms the considered state-of-the-art baselines in terms of accuracy, average accuracy, and errors (RMSE and MAE).
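The single-prototype baseline whose averaging loses information is just the per-class mean embedding. A minimal sketch of that baseline (MP-FedKD instead keeps several prototypes per class via CHAC clustering):

```python
import numpy as np

def class_prototypes(embeddings, labels):
    """Single prototype per class: the mean embedding. Averaging a
    multi-modal class collapses its modes, which is the information
    loss the multi-prototype strategy avoids."""
    return {int(c): embeddings[labels == c].mean(axis=0)
            for c in np.unique(labels)}

emb = np.array([[0.0, 0.0], [2.0, 2.0], [10.0, 10.0]])
lab = np.array([0, 0, 1])
protos = class_prototypes(emb, lab)
```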
[517] Upper Generalization Bounds for Neural Oscillators
Zifeng Huang, Konstantin M. Zuev, Yong Xia, Michael Beer
Main category: cs.LG
TL;DR: Theoretical analysis of neural oscillators’ generalization bounds for approximating nonlinear structural systems, showing polynomial error growth with network size and time length.
Details
Motivation: Neural oscillators based on second-order ODEs show empirical success in learning mappings for nonlinear structural systems, but lack theoretical understanding of their generalization capabilities.
Method: Derived PAC generalization bounds for neural oscillators using Rademacher complexity framework, analyzing architecture with second-order ODE followed by MLP for approximating causal operators and stable dynamical systems.
Result: Estimation errors grow polynomially with MLP size and time length (avoiding curse of parametric complexity), and constraining MLP Lipschitz constants via regularization improves generalization. Numerical validation on Bouc-Wen nonlinear system confirms theoretical predictions.
Conclusion: Theoretical framework provides generalization guarantees for neural oscillators, showing they can effectively approximate complex nonlinear structural systems with controlled error growth and benefit from Lipschitz regularization.
Abstract: Neural oscillators that originate from the second-order ordinary differential equations (ODEs) have shown competitive performance in learning mappings between dynamic loads and responses of complex nonlinear structural systems. Despite this empirical success, theoretically quantifying the generalization capacities of their neural network architectures remains undeveloped. In this study, the neural oscillator consisting of a second-order ODE followed by a multilayer perceptron (MLP) is considered. Its upper probably approximately correct (PAC) generalization bound for approximating causal and uniformly continuous operators between continuous temporal function spaces and that for approximating the uniformly asymptotically incrementally stable second-order dynamical systems are derived by leveraging the Rademacher complexity framework. The theoretical results show that the estimation errors grow polynomially with respect to both the MLP size and the time length, thereby avoiding the curse of parametric complexity. Furthermore, the derived error bounds demonstrate that constraining the Lipschitz constants of the MLPs via loss function regularization can improve the generalization ability of the neural oscillator. A numerical study considering a Bouc-Wen nonlinear system under stochastic seismic excitation validates the theoretically predicted power laws of the estimation errors with respect to the sample size and time length, and confirms the effectiveness of constraining MLPs’ matrix and vector norms in enhancing the performance of the neural oscillator under limited training data.
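The Lipschitz-constraining regularizer suggested by the bounds can be sketched as a penalty on the product of per-layer spectral norms, one common construction (the paper's exact regularizer may differ):

```python
import numpy as np

def lipschitz_penalty(weights, target=1.0):
    """Penalize an MLP whose layer-wise spectral-norm product (an upper
    bound on its Lipschitz constant, ignoring 1-Lipschitz activations)
    exceeds a target value."""
    lip = 1.0
    for W in weights:
        lip *= np.linalg.norm(W, ord=2)  # largest singular value
    return max(0.0, lip - target) ** 2

p_id = lipschitz_penalty([np.eye(3), np.eye(3)])  # product 1 -> no penalty
p_big = lipschitz_penalty([2.0 * np.eye(3)])      # product 2 -> penalty 1
```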
[518] A Hybrid Quantum-Classical Framework for Financial Volatility Forecasting Based on Quantum Circuit Born Machines
Yixiong Chen
Main category: cs.LG
TL;DR: Hybrid quantum-classical framework combining LSTM with Quantum Circuit Born Machine for financial volatility forecasting, showing improved performance over classical LSTM.
Details
Motivation: Financial market volatility forecasting is crucial for risk management but challenging due to non-linear, non-stationary characteristics. Quantum computing offers new paradigms for complex optimization problems.
Method: Proposes hybrid quantum-classical framework with LSTM for extracting temporal features from historical data and Quantum Circuit Born Machine (QCBM) as learnable prior module to guide forecasting.
Result: Evaluated on Shanghai Stock Exchange Composite Index and CSI 300 Index high-frequency data. Hybrid model outperforms classical LSTM baseline on MSE, RMSE, and QLIKE loss metrics.
Conclusion: Hybrid quantum-classical approach shows significant advantages for financial forecasting, demonstrating quantum computing’s potential to enhance machine learning models for complex data distributions.
Abstract: Accurate forecasting of financial market volatility is crucial for risk management, option pricing, and portfolio optimization. Traditional econometric models and classical machine learning methods face challenges in handling the inherent non-linear and non-stationary characteristics of financial time series. In recent years, the rapid development of quantum computing has provided a new paradigm for solving complex optimization and sampling problems. This paper proposes a novel hybrid quantum-classical computing framework aimed at combining the powerful representation capabilities of classical neural networks with the unique advantages of quantum models. For the specific task of financial market volatility forecasting, we designed and implemented a hybrid model based on this framework, which combines a Long Short-Term Memory (LSTM) network with a Quantum Circuit Born Machine (QCBM). The LSTM is responsible for extracting complex dynamic features from historical time series data, while the QCBM serves as a learnable prior module, providing the model with a high-quality prior distribution to guide the forecasting process. We evaluated the model on two real financial datasets consisting of 5-minute high-frequency data from the Shanghai Stock Exchange (SSE) Composite Index and CSI 300 Index. Experimental results show that, compared to a purely classical LSTM baseline model, our hybrid quantum-classical model demonstrates significant advantages across multiple key metrics, including Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and QLIKE loss, proving the great potential of quantum computing in enhancing the capabilities of financial forecasting models. More broadly, the proposed hybrid framework offers a flexible architecture that may be adapted to other machine learning tasks involving high-dimensional, complex, or non-linear data distributions.
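The QLIKE loss cited alongside MSE and RMSE is standard in volatility forecasting. In the common Patton (2011) form (the paper does not spell out its variant), it is L(RV, h) = RV/h - log(RV/h) - 1, which is zero exactly when the forecast h equals the realized variance RV:

```python
import numpy as np

def qlike(realized_var, forecast_var):
    """QLIKE loss, Patton-style: mean of RV/h - log(RV/h) - 1.
    Zero iff every forecast matches realized variance; heavily
    penalizes under-forecasting volatility."""
    ratio = np.asarray(realized_var) / np.asarray(forecast_var)
    return float(np.mean(ratio - np.log(ratio) - 1.0))

perfect = qlike([0.04, 0.09], [0.04, 0.09])  # 0.0
off = qlike([0.04], [0.02])                  # 2 - log(2) - 1
```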
[519] Exploiting Label-Aware Channel Scoring for Adaptive Channel Pruning in Split Learning
Jialei Tan, Zheng Lin, Xiangming Cai, Ruoxi Zhu, Zihan Fang, Pingping Chen, Wei Ni
Main category: cs.LG
TL;DR: ACP-SL reduces communication overhead in split learning by adaptively pruning less important channels from intermediate feature representations using label-aware importance scoring.
Details
Motivation: Split learning reduces client computational burden but incurs significant communication overhead from transmitting intermediate feature representations (smashed data), especially with many client devices.
Method: Proposes adaptive channel pruning-aided SL (ACP-SL) with: 1) Label-aware channel importance scoring (LCIS) module to generate importance scores distinguishing important vs. less important channels, 2) Adaptive channel pruning (ACP) module to prune less important channels, compressing smashed data and reducing communication overhead.
Result: ACP-SL consistently outperforms benchmark schemes in test accuracy and reaches target test accuracy in fewer training rounds, thereby reducing communication overhead.
Conclusion: ACP-SL effectively addresses communication overhead challenges in split learning while maintaining or improving model performance.
Abstract: Split learning (SL) transfers most of the training workload to the server, which alleviates computational burden on client devices. However, the transmission of intermediate feature representations, referred to as smashed data, incurs significant communication overhead, particularly when a large number of client devices are involved. To address this challenge, we propose an adaptive channel pruning-aided SL (ACP-SL) scheme. In ACP-SL, a label-aware channel importance scoring (LCIS) module is designed to generate channel importance scores, distinguishing important channels from less important ones. Based on these scores, an adaptive channel pruning (ACP) module is developed to prune less important channels, thereby compressing the corresponding smashed data and reducing the communication overhead. Experimental results show that ACP-SL consistently outperforms benchmark schemes in test accuracy. Furthermore, it reaches a target test accuracy in fewer training rounds, thereby reducing communication overhead.
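The abstract does not spell out the LCIS scoring rule, but the core idea (score each channel of the smashed data against the labels, then transmit only the high scorers) can be sketched. Channel-label correlation here is an illustrative stand-in for the actual LCIS score, and the tensor layout is assumed:

```python
import numpy as np

def label_aware_channel_scores(smashed, labels):
    """Hypothetical LCIS stand-in: score each channel by the absolute
    correlation between its mean spatial activation and the labels."""
    # smashed: (batch, channels, H, W); labels: (batch,)
    per_channel = smashed.mean(axis=(2, 3))            # (batch, channels)
    centered_a = per_channel - per_channel.mean(axis=0)
    centered_y = labels - labels.mean()
    cov = (centered_a * centered_y[:, None]).mean(axis=0)
    std = centered_a.std(axis=0) * centered_y.std() + 1e-8
    return np.abs(cov / std)                           # (channels,)

def prune_channels(smashed, scores, keep_ratio=0.5):
    """Keep only the highest-scoring channels, shrinking the smashed data."""
    k = max(1, int(round(keep_ratio * smashed.shape[1])))
    keep = np.sort(np.argsort(scores)[::-1][:k])
    return smashed[:, keep], keep
```

With `keep_ratio=0.25`, only a quarter of the channels are transmitted to the server, which is where the communication saving comes from.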
[520] Information Theoretic Bayesian Optimization over the Probability Simplex
Federico Pavesi, Antonio Candelieri, Noémie Jaquier
Main category: cs.LG
TL;DR: α-GaBO: A Bayesian optimization method for probability simplex domains using information geometry principles.
Details
Motivation: Many real-world optimization problems involve probabilities and mixtures that naturally belong to the probability simplex (non-negative entries summing to one), which is a constrained non-Euclidean domain. Standard Bayesian optimization methods don't properly handle this geometry.
Method: Uses information geometry (Riemannian geometry for probability distributions) to construct Matérn kernels that reflect the geometry of the probability simplex, and develops a one-parameter family of geometric optimizers for acquisition functions.
Result: Validated on benchmark functions and real-world applications including mixtures of components, mixtures of classifiers, and robotic control tasks, showing increased performance compared to constrained Euclidean approaches.
Conclusion: α-GaBO provides an effective Bayesian optimization framework for probability simplex domains by properly incorporating the geometric structure through information geometry principles.
Abstract: Bayesian optimization is a data-efficient technique that has been shown to be extremely powerful to optimize expensive, black-box, and possibly noisy objective functions. Many applications involve optimizing probabilities and mixtures which naturally belong to the probability simplex, a constrained non-Euclidean domain defined by non-negative entries summing to one. This paper introduces $α$-GaBO, a novel family of Bayesian optimization algorithms over the probability simplex. Our approach is grounded in information geometry, a branch of Riemannian geometry which endows the simplex with a Riemannian metric and a class of connections. Based on information geometry theory, we construct Matérn kernels that reflect the geometry of the probability simplex, as well as a one-parameter family of geometric optimizers for the acquisition function. We validate our method on benchmark functions and on a variety of real-world applications including mixtures of components, mixtures of classifiers, and a robotic control task, showing its increased performance compared to constrained Euclidean approaches.
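A minimal sketch of the geometric ingredient: the Fisher-Rao geodesic distance on the simplex plugged into a Matérn-1/2 form. This is only the basic building block, not the paper's full kernel construction, and geodesic-distance kernels are not automatically positive definite in general:

```python
import numpy as np

def fisher_rao_distance(p, q):
    """Fisher-Rao geodesic distance between two points on the simplex:
    d(p, q) = 2 * arccos( sum_i sqrt(p_i * q_i) )."""
    inner = np.clip(np.sum(np.sqrt(p * q)), -1.0, 1.0)
    return 2.0 * np.arccos(inner)

def matern12_simplex_kernel(p, q, lengthscale=1.0, variance=1.0):
    """Matérn-1/2 (exponential) kernel driven by the Fisher-Rao distance;
    a sketch of the idea, not the paper's exact construction."""
    return variance * np.exp(-fisher_rao_distance(p, q) / lengthscale)
```

Note that vertices of the simplex (disjoint supports) sit at the maximal distance π, so the kernel correctly treats them as the least similar inputs.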
[521] Good Reasoning Makes Good Demonstrations: Implicit Reasoning Quality Supervision via In-Context Reinforcement Learning
Tiehua Mei, Minxuan Lv, Leiyu Pan, Zhenpeng Su, Hongru Hou, Hengrui Chen, Ao Xu, Deqing Yang
Main category: cs.LG
TL;DR: In-Context RLVR improves reinforcement learning for reasoning by weighting training traces based on their demonstration utility, measured via the model’s own in-context learning ability, leading to better reasoning quality and accuracy.
Details
Motivation: Standard RLVR treats all correct solutions equally, potentially reinforcing flawed reasoning traces that happen to get correct answers by chance. The authors observe that high-quality reasoning traces serve as better teaching demonstrations than low-quality ones.
Method: Introduces In-Context RLVR which measures Demonstration Utility using the policy model’s own in-context learning ability, yielding Evidence Gain. This signal is used to reweight rewards during training, assigning higher weights to high-quality traces without requiring external evaluators.
Result: Experiments on mathematical benchmarks show improvements in both accuracy and reasoning quality over standard RLVR, demonstrating that weighting traces by their teaching ability leads to better learning outcomes.
Conclusion: Better reasoning traces are better teachers, and leveraging the model’s own in-context learning ability to measure demonstration utility provides an efficient way to improve reinforcement learning for reasoning tasks.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) improves reasoning in large language models but treats all correct solutions equally, potentially reinforcing flawed traces that get correct answers by chance. We observe that better reasoning traces are better teachers: high-quality solutions serve as more effective demonstrations than low-quality ones. We term this teaching ability Demonstration Utility, and show that the policy model’s own in-context learning ability provides an efficient way to measure it, yielding a quality signal termed Evidence Gain. To employ this signal during training, we introduce In-Context RLVR. By Bayesian analysis, we show that this objective implicitly reweights rewards by Evidence Gain, assigning higher weights to high-quality traces and lower weights to low-quality ones, without requiring costly computation or external evaluators. Experiments on mathematical benchmarks show improvements in both accuracy and reasoning quality over standard RLVR.
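The reweighting idea can be illustrated in a few lines. The paper derives its weights from a Bayesian analysis; the softmax form and the mean-one rescaling below are purely illustrative choices:

```python
import numpy as np

def reweight_rewards(rewards, evidence_gain, temperature=1.0):
    """Hedged sketch of reward reweighting by trace quality: scale each
    trace's (verifiable) reward by a softmax over its evidence-gain
    score, so high-utility demonstrations receive larger weight.
    Rescaled so the mean weight is one when all rewards are 1."""
    w = np.exp(np.asarray(evidence_gain, float) / temperature)
    w = w / w.sum()
    return np.asarray(rewards, float) * w * len(rewards)
```

Incorrect traces (reward 0) remain at zero; among correct traces, the one with higher evidence gain contributes more to the policy gradient.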
[522] Correction of Transformer-Based Models with Smoothing Pseudo-Projector
Vitaly Bulgakov
Main category: cs.LG
TL;DR: Pseudo-projector is a lightweight modification for neural networks that reduces sensitivity to noise by suppressing label-irrelevant input directions, inspired by multigrid methods from numerical analysis.
Details
Motivation: To improve neural network robustness to noise and training dynamics without altering core architectures, by developing a method inspired by multigrid techniques from numerical analysis.
Method: A pseudo-projector module that acts as a hidden-representation corrector, suppressing directions induced by label-irrelevant input content. Inspired by multigrid methods, it uses learnable restriction and prolongation operators to approximate orthogonal projection.
Result: Demonstrated effectiveness on transformer-based text classification tasks and synthetic benchmarks, showing improved training dynamics and robustness with consistent improvements across settings.
Conclusion: The pseudo-projector is a promising lightweight modification that enhances neural network robustness and training behavior, with plans to extend it to language models.
Abstract: The pseudo-projector is a lightweight modification that can be integrated into existing language models and other neural networks without altering their core architecture. It can be viewed as a hidden-representation corrector that reduces sensitivity to noise by suppressing directions induced by label-irrelevant input content. The design is inspired by the multigrid (MG) paradigm, originally developed to accelerate the convergence of iterative solvers for partial differential equations and boundary value problems, and later extended to more general linear systems through algebraic multigrid methods. We refer to the method as a pseudo-projector because its linear prototype corresponds to a strictly idempotent orthogonal projector, whereas the practical formulation employs learnable restriction and prolongation operators and therefore does not, in general, satisfy the properties of an exact orthogonal projection. We evaluate the proposed approach on transformer-based text classification tasks, as well as controlled synthetic benchmarks, demonstrating its effectiveness in improving training dynamics and robustness. Experimental results, together with supporting theoretical heuristics, indicate consistent improvements in training behavior across a range of settings, with no adverse effects observed otherwise. Our next step will be to extend this approach to language models.
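The "strictly idempotent orthogonal projector" prototype mentioned in the abstract is easy to make concrete. In this sketch the label-irrelevant subspace is chosen at random for illustration (in the paper the restriction/prolongation operators are learned, and need not satisfy exact idempotence):

```python
import numpy as np

def orthonormal_restriction(d, k, seed=0):
    """Random orthonormal restriction R of shape (k, d); its rows span a
    hypothetical label-irrelevant subspace to be suppressed."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.normal(size=(d, k)))   # q: (d, k), orthonormal cols
    return q.T                                     # R @ R.T == I_k

def pseudo_project(h, R):
    """Linear prototype of the hidden-representation corrector: remove
    the component of h lying in the subspace spanned by R's rows.
    With orthonormal R this is an exact orthogonal projection; the
    learned version replaces R.T by a separate prolongation operator."""
    return h - (h @ R.T) @ R
```

For orthonormal `R`, applying the corrector twice changes nothing (idempotence), and the output has no remaining component along the suppressed directions.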
[523] A Unified Hierarchical Multi-Task Multi-Fidelity Framework for Data-Efficient Surrogate Modeling in Manufacturing
Manan Mehta, Zhiqiao Dong, Yuhang Yang, Chenhui Shao
Main category: cs.LG
TL;DR: Hierarchical multi-task multi-fidelity Gaussian process framework for surrogate modeling that simultaneously addresses data heterogeneity and large data requirements in manufacturing systems.
Details
Motivation: Existing surrogate modeling approaches either address multi-task learning (for data efficiency) OR multi-fidelity modeling (for heterogeneous data), but not both together. Manufacturing systems need both capabilities to handle complex nonlinear relationships with limited data from varying fidelity sources.
Method: Proposes H-MT-MF framework using hierarchical Bayesian Gaussian processes. Decomposes each task’s response into task-specific global trend and residual local variability component jointly learned across tasks. Accommodates arbitrary tasks, design points, and fidelity levels with uncertainty quantification.
Result: Improves prediction accuracy by up to 19% compared to state-of-the-art MTL model (without fidelity info) and 23% compared to stochastic kriging (independent tasks). Demonstrated on 1D synthetic example and real-world engine surface shape prediction case study.
Conclusion: Provides unified framework for surrogate modeling in manufacturing with heterogeneous data sources, effectively leveraging both inter-task similarity and fidelity-dependent characteristics through hierarchical Bayesian formulation.
Abstract: Surrogate modeling is an essential data-driven technique for quantifying relationships between input variables and system responses in manufacturing and engineering systems. Two major challenges limit its effectiveness: (1) large data requirements for learning complex nonlinear relationships, and (2) heterogeneous data collected from sources with varying fidelity levels. Multi-task learning (MTL) addresses the first challenge by enabling information sharing across related processes, while multi-fidelity modeling addresses the second by accounting for fidelity-dependent uncertainty. However, existing approaches typically address these challenges separately, and no unified framework simultaneously leverages inter-task similarity and fidelity-dependent data characteristics. This paper develops a novel hierarchical multi-task multi-fidelity (H-MT-MF) framework for Gaussian process-based surrogate modeling. The proposed framework decomposes each task’s response into a task-specific global trend and a residual local variability component that is jointly learned across tasks using a hierarchical Bayesian formulation. The framework accommodates an arbitrary number of tasks, design points, and fidelity levels while providing predictive uncertainty quantification. We demonstrate the effectiveness of the proposed method using a 1D synthetic example and a real-world engine surface shape prediction case study. Compared to (1) a state-of-the-art MTL model that does not account for fidelity information and (2) a stochastic kriging model that learns tasks independently, the proposed approach improves prediction accuracy by up to 19% and 23%, respectively. The H-MT-MF framework provides a general and extensible solution for surrogate modeling in manufacturing systems characterized by heterogeneous data sources.
[524] A Graph-Based Approach to Spectrum Demand Prediction Using Hierarchical Attention Networks
Mohamad Alkadamani, Halim Yanikomeroglu, Amir Ghasemi
Main category: cs.LG
TL;DR: HR-GAT: Hierarchical resolution graph attention network for predicting spectrum demand using geospatial data, improving accuracy by 21% over baselines in Canadian cities.
Details
Motivation: Growing wireless connectivity demand with finite spectrum resources requires efficient spectrum management. Spectrum sharing needs precise demand characterization for informed policy-making, but standard ML models struggle with complex spatial patterns and spatial autocorrelation issues.
Method: HR-GAT (Hierarchical Resolution Graph Attention Network) model that uses graph attention networks to handle geospatial data. It addresses spatial autocorrelation problems and complex spatial demand patterns through a hierarchical resolution approach.
Result: Tested across five major Canadian cities, HR-GAT improves predictive accuracy of spectrum demand by 21% over eight baseline models, demonstrating superior performance and reliability.
Conclusion: HR-GAT provides an effective solution for spectrum demand prediction using geospatial data, addressing spatial autocorrelation challenges and enabling better spectrum management and sharing policies.
Abstract: The surge in wireless connectivity demand, coupled with the finite nature of spectrum resources, compels the development of efficient spectrum management approaches. Spectrum sharing presents a promising avenue, although it demands precise characterization of spectrum demand for informed policy-making. This paper introduces HR-GAT, a hierarchical resolution graph attention network model, designed to predict spectrum demand using geospatial data. HR-GAT adeptly handles complex spatial demand patterns and resolves issues of spatial autocorrelation that usually challenge standard machine learning models, often resulting in poor generalization. Tested across five major Canadian cities, HR-GAT improves predictive accuracy of spectrum demand by 21% over eight baseline models, underscoring its superior performance and reliability.
[525] GAST: Gradient-aligned Sparse Tuning of Large Language Models with Data-layer Selection
Kai Yao, Zhenghan Song, Kaixin Wu, Mingjie Zhong, Danzhao Cheng, Zhaorui Tan, Yixin Ji, Penglei Gao
Main category: cs.LG
TL;DR: GAST is a parameter-efficient fine-tuning method that simultaneously performs selective fine-tuning at both data and layer dimensions using a unified optimization strategy to address redundancy in information.
Details
Motivation: Current PEFT methods focus on either layer-selective or data-selective approaches, but overlook that different data points contribute varying degrees to distinct model layers and discard potentially valuable information from low-quality data.
Method: Gradient-aligned Sparse Tuning (GAST) employs a layer-sparse strategy that adaptively selects the most impactful data points for each layer, performing selective fine-tuning at both data and layer dimensions as part of a unified optimization strategy.
Result: Experiments demonstrate that GAST consistently outperforms baseline methods in parameter-efficient fine-tuning.
Conclusion: GAST establishes a promising direction for future research in PEFT strategies by providing a more comprehensive solution than approaches restricted to a single dimension.
Abstract: Parameter-Efficient Fine-Tuning (PEFT) has become a key strategy for adapting large language models, with recent advances in sparse tuning reducing overhead by selectively updating key parameters or subsets of data. Existing approaches generally focus on two distinct paradigms: layer-selective methods aiming to fine-tune critical layers to minimize computational load, and data-selective methods aiming to select effective training subsets to boost training. However, current methods typically overlook the fact that different data points contribute varying degrees to distinct model layers, and they often discard potentially valuable information from data perceived as of low quality. To address these limitations, we propose Gradient-aligned Sparse Tuning (GAST), an innovative method that simultaneously performs selective fine-tuning at both data and layer dimensions as integral components of a unified optimization strategy. GAST specifically targets redundancy in information by employing a layer-sparse strategy that adaptively selects the most impactful data points for each layer, providing a more comprehensive and sophisticated solution than approaches restricted to a single dimension. Experiments demonstrate that GAST consistently outperforms baseline methods, establishing a promising direction for future research in PEFT strategies.
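One illustrative take on "gradient-aligned, per-layer data selection": rank examples by how well their per-example gradient for a given layer aligns with the batch-mean gradient, and keep the top fraction for that layer. GAST's actual selection rule and optimization are more sophisticated; this only conveys the alignment intuition:

```python
import numpy as np

def select_examples_for_layer(per_example_grads, keep_ratio=0.5):
    """Rank examples by cosine alignment of their layer gradient with the
    batch-mean gradient; keep the top fraction. An illustrative sketch,
    not GAST's actual rule."""
    g = np.asarray(per_example_grads, float)        # (n_examples, n_params)
    mean_g = g.mean(axis=0)
    cos = (g @ mean_g) / (np.linalg.norm(g, axis=1)
                          * np.linalg.norm(mean_g) + 1e-12)
    k = max(1, int(round(keep_ratio * len(g))))
    return np.argsort(cos)[::-1][:k]
```

Because the ranking is computed per layer, the same example can be selected for one layer and skipped for another, which is the point of tuning both dimensions jointly.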
[526] CarbonBench: A Global Benchmark for Upscaling of Carbon Fluxes Using Zero-Shot Learning
Aleksei Rozanov, Arvind Renganathan, Yimeng Zhang, Vipin Kumar
Main category: cs.LG
TL;DR: CarbonBench is a benchmark for zero-shot spatial transfer learning in carbon flux upscaling, featuring 1.3M daily observations from 567 global flux tower sites with stratified evaluation protocols for testing generalization across unseen vegetation types and climate regimes.
Details
Motivation: There's a need to accurately quantify terrestrial carbon exchange for climate policy, but models must generalize to ecosystems underrepresented in sparse eddy covariance observations. No standardized benchmark exists to evaluate model performance across geographically distinct locations with different climate regimes and vegetation types.
Method: Created CarbonBench with: (1) stratified evaluation protocols testing generalization across unseen vegetation types and climate regimes, separating spatial transfer from temporal autocorrelation; (2) harmonized remote sensing and meteorological features for flexible architecture design; (3) baselines from tree-based methods to domain-generalization architectures.
Result: CarbonBench comprises over 1.3 million daily observations from 567 flux tower sites globally (2000-2024), providing the first standardized benchmark for zero-shot spatial transfer in carbon flux upscaling.
Conclusion: CarbonBench bridges machine learning methodologies and Earth system science, enabling systematic comparison of transfer learning methods, serving as a testbed for regression under distribution shift, and contributing to next-generation climate modeling efforts.
Abstract: Accurately quantifying terrestrial carbon exchange is essential for climate policy and carbon accounting, yet models must generalize to ecosystems underrepresented in sparse eddy covariance observations. Despite this challenge being a natural instance of zero-shot spatial transfer learning for time series regression, no standardized benchmark exists to rigorously evaluate model performance across geographically distinct locations with different climate regimes and vegetation types. We introduce CarbonBench, the first benchmark for zero-shot spatial transfer in carbon flux upscaling. CarbonBench comprises over 1.3 million daily observations from 567 flux tower sites globally (2000-2024). It provides: (1) stratified evaluation protocols that explicitly test generalization across unseen vegetation types and climate regimes, separating spatial transfer from temporal autocorrelation; (2) a harmonized set of remote sensing and meteorological features to enable flexible architecture design; and (3) baselines ranging from tree-based methods to domain-generalization architectures. By bridging machine learning methodologies and Earth system science, CarbonBench aims to enable systematic comparison of transfer learning methods, serves as a testbed for regression under distribution shift, and contributes to the next-generation climate modeling efforts.
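The zero-shot spatial protocol boils down to holding out entire site groups rather than random rows. A minimal sketch (the site IDs and metadata field names are illustrative, not the benchmark's actual schema):

```python
def zero_shot_split(site_meta, held_out_veg):
    """Zero-shot spatial split: every flux-tower site whose vegetation
    type is in `held_out_veg` goes to test, so test ecosystems are never
    seen during training. Illustrative field names, not CarbonBench's
    actual schema."""
    train = [s for s, m in site_meta.items() if m["veg"] not in held_out_veg]
    test = [s for s, m in site_meta.items() if m["veg"] in held_out_veg]
    assert not set(train) & set(test)
    return train, test
```

Splitting by site and vegetation type (rather than by timestamp) is what separates genuine spatial transfer from the temporal autocorrelation a random split would leak.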
[527] OptEMA: Adaptive Exponential Moving Average for Stochastic Optimization with Zero-Noise Optimality
Ganzhao Yuan
Main category: cs.LG
TL;DR: OptEMA introduces two novel adaptive EMA variants for optimization that achieve noise-adaptive convergence rates without requiring Lipschitz constants or boundedness assumptions.
Details
Motivation: Existing theoretical analyses of Adam-style optimizers have limitations: suboptimal guarantees in zero-noise regime, restrictive boundedness conditions, constant/open-loop stepsizes, or requiring prior knowledge of Lipschitz constants.
Method: Introduces OptEMA with two variants: OptEMA-M (adaptive decreasing EMA coefficient to first-order moment with fixed second-order decay) and OptEMA-V (swaps these roles). The method is closed-loop and Lipschitz-free with trajectory-dependent effective stepsizes.
Result: Both variants achieve noise-adaptive convergence rate of $\widetilde{\mathcal{O}}(T^{-1/2}+σ^{1/2} T^{-1/4})$ for average gradient norm under standard SGD assumptions. In zero-noise regime ($σ=0$), bounds reduce to nearly optimal deterministic rate $\widetilde{\mathcal{O}}(T^{-1/2})$ without hyperparameter retuning.
Conclusion: OptEMA overcomes theoretical bottlenecks of existing Adam-style methods by providing rigorous convergence guarantees without restrictive assumptions, achieving adaptive rates that bridge deterministic and stochastic regimes.
Abstract: The Exponential Moving Average (EMA) is a cornerstone of widely used optimizers such as Adam. However, existing theoretical analyses of Adam-style methods have notable limitations: their guarantees can remain suboptimal in the zero-noise regime, rely on restrictive boundedness conditions (e.g., bounded gradients or objective gaps), use constant or open-loop stepsizes, or require prior knowledge of Lipschitz constants. To overcome these bottlenecks, we introduce OptEMA and analyze two novel variants: OptEMA-M, which applies an adaptive, decreasing EMA coefficient to the first-order moment with a fixed second-order decay, and OptEMA-V, which swaps these roles. Crucially, OptEMA is closed-loop and Lipschitz-free in the sense that its effective stepsizes are trajectory-dependent and do not require the Lipschitz constant for parameterization. Under standard stochastic gradient descent (SGD) assumptions, namely smoothness, a lower-bounded objective, and unbiased gradients with bounded variance, we establish rigorous convergence guarantees. Both variants achieve a noise-adaptive convergence rate of $\widetilde{\mathcal{O}}(T^{-1/2}+σ^{1/2} T^{-1/4})$ for the average gradient norm, where $σ$ is the noise level. In particular, in the zero-noise regime where $σ=0$, our bounds reduce to the nearly optimal deterministic rate $\widetilde{\mathcal{O}}(T^{-1/2})$ without manual hyperparameter retuning.
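To make "adaptive, decreasing EMA coefficient on the first moment with a fixed second-order decay" concrete, here is one plausible reading of an OptEMA-M-style loop. The schedule `alpha_t = 1/sqrt(t+1)`, the bias correction, and the fixed step `eta` are illustrative choices, not the paper's actual parameterization:

```python
import numpy as np

def optema_m_sketch(grad_fn, x0, T=200, beta2=0.999, eta=0.5, eps=1e-8):
    """Heavily hedged sketch of an OptEMA-M-style update: a decreasing
    EMA coefficient on the first moment and a fixed second-moment decay,
    with an Adam-like normalized step."""
    x, m, v = np.array(x0, float), 0.0, 0.0
    for t in range(T):
        g = grad_fn(x)
        alpha_t = 1.0 / np.sqrt(t + 1)       # decreasing first-moment EMA
        m = (1 - alpha_t) * m + alpha_t * g
        v = beta2 * v + (1 - beta2) * g * g  # fixed second-moment decay
        v_hat = v / (1 - beta2 ** (t + 1))   # standard bias correction
        x = x - eta * m / (np.sqrt(v_hat) + eps)
    return x
```

On a noiseless quadratic this behaves like the zero-noise regime the paper analyzes: the normalized step drives the iterate toward the minimizer without any knowledge of the Lipschitz constant.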
[528] Generative Drifting is Secretly Score Matching: a Spectral and Variational Perspective
Erkan Turan, Maks Ovsjanikov
Main category: cs.LG
TL;DR: The paper provides theoretical foundations for Generative Modeling via Drifting, showing it’s equivalent to score matching on smoothed distributions, analyzing kernel choices, explaining the need for stop-gradient, and proposing improvements.
Details
Motivation: To establish theoretical understanding of the empirically successful Generative Modeling via Drifting method, which lacks proper theoretical foundations despite achieving state-of-the-art one-step image generation.
Method: Analyzes drifting as score matching on smoothed distributions, uses Fourier analysis to study convergence, proposes exponential bandwidth annealing, and formalizes drifting as Wasserstein gradient flow of smoothed KL divergence.
Result: Answers three key open questions about drifting, explains kernel choice preferences, provides theoretical justification for stop-gradient, and proposes improvements that reduce convergence time from exponential to logarithmic.
Conclusion: The paper provides comprehensive theoretical foundations for drifting, positioning it within score-matching framework, explaining empirical observations, and enabling principled improvements to the method.
Abstract: Generative Modeling via Drifting has recently achieved state-of-the-art one-step image generation through a kernel-based drift operator, yet the success is largely empirical and its theoretical foundations remain poorly understood. In this paper, we make the following observation: \emph{under a Gaussian kernel, the drift operator is exactly a score difference on smoothed distributions}. This insight allows us to answer all three key questions left open in the original work: (1) whether a vanishing drift guarantees equality of distributions ($V_{p,q}=0\Rightarrow p=q$), (2) how to choose between kernels, and (3) why the stop-gradient operator is indispensable for stable training. Our observations position drifting within the well-studied score-matching family and enable a rich theoretical perspective. By linearizing the McKean-Vlasov dynamics and probing them in Fourier space, we reveal frequency-dependent convergence timescales comparable to \emph{Landau damping} in plasma kinetic theory: the Gaussian kernel suffers an exponential high-frequency bottleneck, explaining the empirical preference for the Laplacian kernel. We also propose an exponential bandwidth annealing schedule $σ(t)=σ_0 e^{-rt}$ that reduces convergence time from $\exp(O(K_{\max}^2))$ to $O(\log K_{\max})$. Finally, by formalizing drifting as a Wasserstein gradient flow of the smoothed KL divergence, we prove that the stop-gradient operator is derived directly from the frozen-field discretization mandated by the JKO scheme, and removing it severs training from any gradient-flow guarantee. This variational perspective further provides a general template for constructing novel drift operators, demonstrated with a Sinkhorn divergence drift.
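The central identity can be stated loosely as follows (signs and normalizations follow the original work's conventions, which are not reproduced in this summary): for a Gaussian kernel $k_\sigma$, the drift field is a difference of scores of the kernel-smoothed densities,

$$V_{p,q}(x) \;=\; \nabla_x \log\big(p * k_\sigma\big)(x) \;-\; \nabla_x \log\big(q * k_\sigma\big)(x).$$

Since Gaussian smoothing is injective on densities (its Fourier transform never vanishes), $V_{p,q} \equiv 0$ forces the smoothed densities, and hence $p$ and $q$ themselves, to coincide, which is the paper's answer to its first open question.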
[529] SignalMC-MED: A Multimodal Benchmark for Evaluating Biosignal Foundation Models on Single-Lead ECG and PPG
Fredrik K. Gustafsson, Xiao Gu, Mattia Carletti, Patitapaban Palo, David W. Eyre, David A. Clifton
Main category: cs.LG
TL;DR: SignalMC-MED benchmark evaluates biosignal foundation models on synchronized ECG and PPG data across 20 clinical tasks, showing domain-specific models outperform general time-series models and multimodal fusion improves performance.
Details
Motivation: There's a need for systematic evaluation of biosignal foundation models on long-duration multimodal data, as existing benchmarks are limited despite promising performance in clinical prediction tasks.
Method: Created SignalMC-MED benchmark from MC-MED dataset with 22,256 visits of 10-minute overlapping ECG and PPG signals. Evaluated time-series and biosignal FMs across ECG-only, PPG-only, and ECG+PPG settings on 20 clinical tasks including demographics, emergency disposition, lab value regression, and ICD-10 diagnosis detection.
Result: Domain-specific biosignal FMs consistently outperform general time-series models. Multimodal ECG+PPG fusion yields robust improvements over unimodal inputs. Full 10-minute signals outperform shorter segments. Larger model variants don’t reliably outperform smaller ones. Hand-crafted ECG features provide strong baseline and complement learned representations.
Conclusion: SignalMC-MED establishes a standardized benchmark for biosignal FMs and provides practical guidance for evaluation and deployment, emphasizing the value of domain-specific models and multimodal fusion for clinical applications.
Abstract: Recent biosignal foundation models (FMs) have demonstrated promising performance across diverse clinical prediction tasks, yet systematic evaluation on long-duration multimodal data remains limited. We introduce SignalMC-MED, a benchmark for evaluating biosignal FMs on synchronized single-lead electrocardiogram (ECG) and photoplethysmogram (PPG) data. Derived from the MC-MED dataset, SignalMC-MED comprises 22,256 visits with 10-minute overlapping ECG and PPG signals, and includes 20 clinically relevant tasks spanning prediction of demographics, emergency department disposition, laboratory value regression, and detection of prior ICD-10 diagnoses. Using this benchmark, we perform a systematic evaluation of representative time-series and biosignal FMs across ECG-only, PPG-only, and ECG + PPG settings. We find that domain-specific biosignal FMs consistently outperform general time-series models, and that multimodal ECG + PPG fusion yields robust improvements over unimodal inputs. Moreover, using the full 10-minute signal consistently outperforms shorter segments, and larger model variants do not reliably outperform smaller ones. Hand-crafted ECG domain features provide a strong baseline and offer complementary value when combined with learned FM representations. Together, these results establish SignalMC-MED as a standardized benchmark and provide practical guidance for evaluating and deploying biosignal FMs.
[530] When Learning Rates Go Wrong: Early Structural Signals in PPO Actor-Critic
Alberto Fernández-Hernández, Cristian Pérez-Corral, Jose I. Mestre, Manuel F. Dolz, Jose Duato, Enrique S. Quintana-Ortí
Main category: cs.LG
TL;DR: Proposes using Overfitting-Underfitting Indicator (OUI) to predict optimal learning rates for PPO in deep RL by analyzing neuron activation patterns early in training.
Details
Motivation: Deep RL systems are highly sensitive to learning rate selection, requiring extensive hyperparameter search. Small LRs cause slow convergence while large LRs cause instability/collapse. Need early indicators to identify promising training runs without full training.
Method: Introduces batch-based formulation of OUI metric that quantifies balance of binary activation patterns over probe batch. Derives theoretical connection between LR and activation sign changes. Uses OUI measured at 10% of training to discriminate between LR regimes across three discrete-control environments.
Result: OUI measured early discriminates LR regimes effectively. Critic networks achieve highest return in intermediate OUI band (avoiding saturation), while actor networks achieve highest return with comparatively high OUI values. OUI provides strongest early screening signal compared to other criteria.
Conclusion: OUI enables aggressive pruning of unpromising training runs early without requiring full training. Combining early return with OUI yields highest precision in best-performing screening regimes.
Abstract: Deep Reinforcement Learning systems are highly sensitive to the learning rate (LR), and selecting stable and performant training runs often requires extensive hyperparameter search. In Proximal Policy Optimization (PPO) actor–critic methods, small LR values lead to slow convergence, whereas large LR values may induce instability or collapse. We analyse this phenomenon from the behavior of the hidden neurons in the network using the Overfitting-Underfitting Indicator (OUI), a metric that quantifies the balance of binary activation patterns over a fixed probe batch. We introduce an efficient batch-based formulation of OUI and derive a theoretical connection between LR and activation sign changes, clarifying how a correct evolution of the neuron’s inner structure depends on the step size. Empirically, across three discrete-control environments and multiple seeds, we show that OUI measured at only 10% of training already discriminates between LR regimes. We observe a consistent asymmetry: critic networks achieving highest return operate in an intermediate OUI band (avoiding saturation), whereas actor networks achieving highest return exhibit comparatively high OUI values. We then compare OUI-based screening rules against early return, clip-based, divergence-based, and flip-based criteria under matched recall over successful runs. In this setting, OUI provides the strongest early screening signal: OUI alone achieves the best precision at broader recall, while combining early return with OUI yields the highest precision in best-performing screening regimes, enabling aggressive pruning of unpromising runs without requiring full training.
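The kind of quantity OUI measures can be sketched from the abstract's description: per hidden unit, how balanced its on/off pattern is over a fixed probe batch. The `4p(1-p)` balance term below is an illustrative stand-in; the paper's exact formulation may differ:

```python
import numpy as np

def oui_sketch(pre_activations):
    """Illustrative stand-in for the Overfitting-Underfitting Indicator:
    for each hidden unit, take the fraction p of probe inputs on which
    it is active (pre-activation > 0), then average 4*p*(1-p) over
    units. 1.0 = perfectly balanced activations, 0.0 = fully saturated
    (always on or always off)."""
    active = (pre_activations > 0).mean(axis=0)   # p per unit
    return float(np.mean(4.0 * active * (1.0 - active)))
```

Under this reading, a collapsing critic (units saturated by a too-large LR) drives the indicator toward 0, while a healthy intermediate-LR run keeps it in a middle band, matching the screening behavior the paper reports.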
[531] Towards a Neural Debugger for Python
Maximilian Beck, Jonas Gehring, Jannik Kossen, Gabriel Synnaeve
Main category: cs.LG
TL;DR: Neural debuggers are language models that emulate traditional debuggers, supporting interactive debugging operations like stepping through code and setting breakpoints, enabling both forward and inverse execution prediction.
Details
Motivation: Current neural interpreters lack interactive control like real debuggers, which developers use to stop at breakpoints and step through relevant portions while inspecting variables. The goal is to create models that can emulate traditional debugging operations.
Method: Develop neural debuggers by fine-tuning large LLMs or pre-training smaller models from scratch to support debugger operations (stepping into/over/out of functions, setting breakpoints). Models learn to predict both forward execution (future states/outputs) and inverse execution (prior states/inputs) conditioned on debugger actions.
Result: Models achieve strong performance on CruxEval benchmark for both output and input prediction tasks, demonstrating robust conditional execution modeling. They can reliably model both forward and inverse execution conditioned on debugger actions.
Conclusion: Neural debuggers represent first steps toward agentic coding systems where they serve as world models for simulated debugging environments, providing execution feedback and enabling interaction with real debugging tools, laying foundation for better code generation, program understanding, and automated debugging.
Abstract: Training large language models (LLMs) on Python execution traces grounds them in code execution and enables the line-by-line execution prediction of whole Python programs, effectively turning them into neural interpreters (FAIR CodeGen Team et al., 2025). However, developers rarely execute programs step by step; instead, they use debuggers to stop execution at certain breakpoints and step through relevant portions only while inspecting or modifying program variables. Existing neural interpreter approaches lack such interactive control. To address this limitation, we introduce neural debuggers: language models that emulate traditional debuggers, supporting operations such as stepping into, over, or out of functions, as well as setting breakpoints at specific source lines. We show that neural debuggers – obtained via fine-tuning large LLMs or pre-training smaller models from scratch – can reliably model both forward execution (predicting future states and outputs) and inverse execution (inferring prior states or inputs) conditioned on debugger actions. Evaluated on CruxEval, our models achieve strong performance on both output and input prediction tasks, demonstrating robust conditional execution modeling. Our work takes first steps towards future agentic coding systems in which neural debuggers serve as a world model for simulated debugging environments, providing execution feedback or enabling agents to interact with real debugging tools. This capability lays the foundation for more powerful code generation, program understanding, and automated debugging.
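The abstract grounds these models in Python execution traces. As a minimal illustration of what a line-by-line trace looks like (the paper's actual trace format and debugger-action encoding are not specified here; `collect_trace` and `demo` are hypothetical), the standard `sys.settrace` hook can record the state at each executed line:

```python
import sys

def collect_trace(fn, *args):
    """Record a line-by-line execution trace of fn(*args) as
    (line offset within fn, snapshot of locals) pairs.

    A sketch of the kind of supervision a neural debugger could be
    trained on, not the paper's actual pipeline.
    """
    trace = []

    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is fn.__code__:
            trace.append((frame.f_lineno - fn.__code__.co_firstlineno,
                          dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        result = fn(*args)
    finally:
        sys.settrace(None)  # always detach the tracer
    return result, trace

def demo(x):
    y = x + 1
    z = y * 2
    return z

result, trace = collect_trace(demo, 3)
print(result)      # 8
print(len(trace))  # one entry per executed line of demo
```

Conditioning on debugger actions would then amount to interleaving such state snapshots with tokens for step/breakpoint commands in the training sequence.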
[532] On the Width Scaling of Neural Optimizers Under Matrix Operator Norms I: Row/Column Normalization and Hyperparameter Transfer
Ruihan Xu, Jiajin Li, Yiping Lu
Main category: cs.LG
TL;DR: The paper introduces MOGA, a width-aware optimizer based on matrix operator norms that enables stable learning-rate transfer across different model widths, addressing optimization stability in deep neural networks as width increases.
Details
Motivation: Modern deep learning needs optimizers that remain stable as network width increases. Current optimizers like AdamW and Muon lack width-independent guarantees, leading to instability in training large models.
Method: Interpret optimizers as steepest descent under matrix operator norms, introduce mean-normalized operator norms (pmean → qmean) for layerwise composability, and propose the MOGA optimizer with row/column-wise normalization for width-independent smoothness bounds.
Result: MOGA enables stable learning-rate transfer across widths, recovers μP scaling as special case, shows Muon can suffer O(√w) smoothness growth while row-normalized optimizers achieve width-independent guarantees. Large-scale pre-training on GPT-2 and LLaMA shows MOGA competitive with Muon and faster in large-token/low-loss regimes.
Conclusion: Matrix operator norm perspective provides principled framework for width-aware optimization. MOGA with row normalization offers practical solution for stable training across model widths, advancing optimization theory for large-scale models.
Abstract: A central question in modern deep learning is how to design optimizers whose behavior remains stable as the network width $w$ increases. We address this question by interpreting several widely used neural-network optimizers, including AdamW and Muon, as instances of steepest descent under matrix operator norms. This perspective links optimizer geometry with the Lipschitz structure of the network forward map, and enables width-independent control of both Lipschitz and smoothness constants. However, steepest-descent rules induced by standard $p \to q$ operator norms lack layerwise composability and therefore cannot provide width-independent bounds in deep architectures. We overcome this limitation by introducing a family of mean-normalized operator norms, denoted $p_{\mathrm{mean}} \to q_{\mathrm{mean}}$, that admit layerwise composability, yield width-independent smoothness bounds, and give rise to practical optimizers such as rescaled AdamW, row normalization, and column normalization. The resulting width-aware learning-rate scaling rules recover $\mu$P scaling (Yang et al., 2021) as a special case and provide a principled mechanism for cross-width learning-rate transfer across a broad class of optimizers. We further show that Muon can suffer an $\mathcal{O}(\sqrt{w})$ worst-case growth in the smoothness constant, whereas a new family of row-normalized optimizers we propose achieves width-independent smoothness guarantees. Based on these observations, we propose MOGA (Matrix Operator Geometry Aware), a width-aware optimizer based only on row/column-wise normalization that enables stable learning-rate transfer across model widths. Large-scale pre-training on GPT-2 and LLaMA shows that MOGA, especially with row normalization, is competitive with Muon while being notably faster in large-token and low-loss regimes.
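To convey the intuition behind row-wise normalization, here is a hedged sketch (not the exact MOGA update, whose precise form is not given in this summary): rescaling each gradient row to unit norm before the step makes the per-row update size independent of the layer width, which is the kind of invariance that supports cross-width learning-rate transfer.

```python
import numpy as np

def row_normalized_update(W, G, lr):
    """Illustrative row-wise normalized gradient step: each row of the
    gradient G is rescaled to unit Euclidean norm, so every row of W
    moves by exactly lr regardless of the fan-in / width."""
    row_norms = np.linalg.norm(G, axis=1, keepdims=True)
    row_norms = np.maximum(row_norms, 1e-12)  # guard against zero rows
    return W - lr * (G / row_norms)

rng = np.random.default_rng(0)
for width in (64, 256, 1024):
    W = rng.normal(size=(8, width))
    G = rng.normal(size=(8, width))
    W_new = row_normalized_update(W, G, lr=0.1)
    step = np.linalg.norm(W_new - W, axis=1)
    print(width, np.allclose(step, 0.1))  # per-row step size stays 0.1
```

With a plain SGD step the row displacement would instead grow with width (a random gradient row has norm of order the square root of the width), which is the instability the width-aware scaling rules are designed to remove.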
[533] From Data Statistics to Feature Geometry: How Correlations Shape Superposition
Lucas Prieto, Edward Stevinson, Melih Barsbey, Tolga Birdal, Pedro A. M. Mediano
Main category: cs.LG
TL;DR: The paper introduces Bag-of-Words Superposition (BOWS) to study how neural networks represent correlated features in superposition, showing that interference can be constructive rather than just noise when features are arranged by co-activation patterns.
Details
Motivation: Current understanding of superposition in mechanistic interpretability focuses on sparse, uncorrelated features, where interference is treated as noise to be filtered out. This paper studies how neural networks handle correlated features in realistic data settings such as internet text.
Method: Introduces Bag-of-Words Superposition (BOWS), a controlled setting for encoding binary bag-of-words representations of internet text in superposition. Analyzes how features are arranged according to co-activation patterns and how ReLUs handle interference.
Result: Found that when features are correlated, interference can be constructive rather than just noise. Features arranged by co-activation patterns make interference between active features constructive while using ReLUs to avoid false positives. This arrangement is more prevalent with weight decay and naturally creates semantic clusters and cyclical structures observed in real language models.
Conclusion: The standard picture of superposition as introducing interference to be minimized is incomplete for realistic data. Correlated features can be arranged to make interference constructive, explaining semantic clusters and cyclical structures in language models that weren’t explained by previous superposition theories.
Abstract: A central idea in mechanistic interpretability is that neural networks represent more features than they have dimensions, arranging them in superposition to form an over-complete basis. This framing has been influential, motivating dictionary learning approaches such as sparse autoencoders. However, superposition has mostly been studied in idealized settings where features are sparse and uncorrelated. In these settings, superposition is typically understood as introducing interference that must be minimized geometrically and filtered out by non-linearities such as ReLUs, yielding local structures like regular polytopes. We show that this account is incomplete for realistic data by introducing Bag-of-Words Superposition (BOWS), a controlled setting to encode binary bag-of-words representations of internet text in superposition. Using BOWS, we find that when features are correlated, interference can be constructive rather than just noise to be filtered out. This is achieved by arranging features according to their co-activation patterns, making interference between active features constructive, while still using ReLUs to avoid false positives. We show that this kind of arrangement is more prevalent in models trained with weight decay and naturally gives rise to semantic clusters and cyclical structures which have been observed in real language models yet were not explained by the standard picture of superposition. Code for this paper can be found at https://github.com/LucasPrietoAl/correlations-feature-geometry.
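A toy numerical illustration of constructive interference (not the paper's BOWS setup; the embedding matrix `E` and the threshold are invented for this sketch): if two features that always co-occur are given positively overlapping directions, their interference adds to each other's readout, while a ReLU clips the small negative interference that leaks onto inactive features.

```python
import numpy as np

# 3 binary features compressed into 2 dimensions (superposition).
E = np.array([[1.0, 0.0],     # feature 0
              [0.6, 0.8],     # feature 1: overlaps feature 0 (co-occurs with it)
              [-0.6, 0.8]])   # feature 2: rarely co-active with feature 0

def encode(x):
    return x @ E                            # superposed hidden state

def decode(h):
    return np.maximum(h @ E.T - 0.5, 0.0)   # thresholded ReLU readout

x = np.array([1.0, 1.0, 0.0])               # features 0 and 1 fire together
out = decode(encode(x))
print(out)  # active features reinforce each other; feature 2 is clipped to 0
```

Each active feature alone would read out at 1.0; together they read out at 1.6 before thresholding, so the overlap is constructive rather than noise, and the ReLU removes the negative interference on feature 2. This is the qualitative effect the paper attributes to co-activation-aware feature arrangements.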
[534] Task-Aware Modulation Using Representation Learning for Upscaling of Terrestrial Carbon Fluxes
Aleksei Rozanov, Arvind Renganathan, Vipin Kumar
Main category: cs.LG
TL;DR: TAM-RL framework improves global carbon flux estimation by combining spatio-temporal representation learning with physics-guided constraints, achieving 8-9.6% RMSE reduction and increased explained variance across diverse biomes.
Details
Motivation: Current data-driven upscaling methods for terrestrial carbon fluxes generalize poorly beyond observed domains, leading to systematic regional biases and high predictive uncertainty due to sparse and biased ground measurements.
Method: Task-Aware Modulation with Representation Learning (TAM-RL) couples spatio-temporal representation learning with a knowledge-guided encoder-decoder architecture and a loss function derived from the carbon balance equation.
Result: Across 150+ flux tower sites representing diverse biomes and climate regimes, TAM-RL reduces RMSE by 8-9.6% and increases explained variance (R²) from 19.4% to 43.8% compared to state-of-the-art datasets.
Conclusion: Integrating physically grounded constraints with adaptive representation learning can substantially enhance the robustness and transferability of global carbon flux estimates.
Abstract: Accurately upscaling terrestrial carbon fluxes is central to estimating the global carbon budget, yet remains challenging due to the sparse and regionally biased distribution of ground measurements. Existing data-driven upscaling products often fail to generalize beyond observed domains, leading to systematic regional biases and high predictive uncertainty. We introduce Task-Aware Modulation with Representation Learning (TAM-RL), a framework that couples spatio-temporal representation learning with knowledge-guided encoder-decoder architecture and loss function derived from the carbon balance equation. Across 150+ flux tower sites representing diverse biomes and climate regimes, TAM-RL improves predictive performance relative to existing state-of-the-art datasets, reducing RMSE by 8-9.6% and increasing explained variance ($R^2$) from 19.4% to 43.8%, depending on the target flux. These results demonstrate that integrating physically grounded constraints with adaptive representation learning can substantially enhance the robustness and transferability of global carbon flux estimates.
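The loss derived from the carbon balance equation is not spelled out in this summary; a minimal sketch of a physics-guided loss of this kind, assuming the standard flux identity NEE = Reco - GPP and a hypothetical weighting `lam`, could look like:

```python
import numpy as np

def carbon_balance_loss(gpp, reco, nee, obs, lam=1.0):
    """Hedged sketch of a knowledge-guided loss (not the paper's exact
    formulation): a data-fit term on the predicted net flux plus a penalty
    on violations of the carbon balance identity NEE = Reco - GPP."""
    data_fit = np.mean((nee - obs) ** 2)          # match observed net flux
    balance = np.mean((nee - (reco - gpp)) ** 2)  # enforce flux consistency
    return data_fit + lam * balance

# Predictions that are both accurate and physically consistent incur no loss.
gpp = np.array([5.0])
reco = np.array([3.0])
nee = np.array([-2.0])   # -2.0 == reco - gpp, consistent with the identity
obs = np.array([-2.0])
print(carbon_balance_loss(gpp, reco, nee, obs))  # 0.0
```

Constraining the component fluxes jointly in this way is one concrete mechanism by which physically grounded terms can curb extrapolation error outside the observed domain.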
[535] Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation
Helena Casademunt, Bartosz Cywiński, Khoi Tran, Arya Jakkli, Samuel Marks, Neel Nanda
Main category: cs.LG
TL;DR: The paper studies honesty elicitation and lie detection techniques on Chinese LLMs trained to censor politically sensitive topics, finding that certain prompting and fine-tuning methods improve truthfulness but don’t fully eliminate false responses.
Details
Motivation: Previous work evaluates honesty techniques on models artificially trained to lie, which may not reflect naturally occurring dishonesty. The authors instead study Chinese LLMs trained to censor politically sensitive topics; these produce falsehoods while possessing suppressed knowledge, providing a more realistic testbed.
Method: Use open-weights LLMs from Chinese developers (Qwen3 models) trained to censor topics such as Falun Gong and the Tiananmen protests. Evaluate honesty elicitation techniques (sampling without a chat template, few-shot prompting, fine-tuning on generic honesty data) and lie detection methods (prompting the censored model to classify its own responses, linear probes trained on unrelated data). Test transferability to frontier models such as DeepSeek R1.
Result: For honesty elicitation: sampling without chat template, few-shot prompting, and fine-tuning on generic honesty data most reliably increase truthful responses. For lie detection: prompting the censored model to classify its own responses performs near an uncensored-model upper bound, and linear probes on unrelated data offer cheaper alternatives. Techniques transfer to frontier models but no method fully eliminates false responses.
Conclusion: The study provides a realistic testbed for evaluating honesty techniques using naturally-occurring censorship in LLMs. While certain methods improve truthfulness, complete elimination of false responses remains challenging, highlighting limitations of current approaches to model honesty.
Abstract: Large language models sometimes produce false or misleading responses. Two approaches to this problem are honesty elicitation – modifying prompts or weights so that the model answers truthfully – and lie detection – classifying whether a given response is false. Prior work evaluates such methods on models specifically trained to lie or conceal information, but these artificial constructions may not resemble naturally-occurring dishonesty. We instead study open-weights LLMs from Chinese developers, which are trained to censor politically sensitive topics: Qwen3 models frequently produce falsehoods about subjects like Falun Gong or the Tiananmen protests while occasionally answering correctly, indicating they possess knowledge they are trained to suppress. Using this as a testbed, we evaluate a suite of elicitation and lie detection techniques. For honesty elicitation, sampling without a chat template, few-shot prompting, and fine-tuning on generic honesty data most reliably increase truthful responses. For lie detection, prompting the censored model to classify its own responses performs near an uncensored-model upper bound, and linear probes trained on unrelated data offer a cheaper alternative. The strongest honesty elicitation techniques also transfer to frontier open-weights models including DeepSeek R1. Notably, no technique fully eliminates false responses. We release all prompts, code, and transcripts.
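The cheap lie detector mentioned above is a linear probe on hidden activations. A minimal sketch on synthetic stand-in data (real probes would be fit on actual model activations; a least-squares fit stands in for logistic regression here for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for hidden activations: truthful vs. false responses
# are separated along a hidden "honesty" direction, plus noise.
d, n = 32, 200
direction = rng.normal(size=d)
labels = rng.integers(0, 2, size=n)             # 1 = truthful, 0 = false
acts = rng.normal(size=(n, d)) + 2.0 * np.outer(2 * labels - 1, direction)

# Fit a linear probe by least squares on +/-1 targets.
X = np.hstack([acts, np.ones((n, 1))])          # append a bias column
w, *_ = np.linalg.lstsq(X, 2.0 * labels - 1.0, rcond=None)
preds = (X @ w > 0).astype(int)
accuracy = (preds == labels).mean()
print(accuracy)  # well above chance on this synthetic data
```

The appeal of such probes in the paper's setting is that they can be trained on unrelated data and still transfer, making them far cheaper than querying an uncensored reference model.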
[536] XConv: Low-memory stochastic backpropagation for convolutional layers
Anirudh Thatipelli, Jeffrey Sam, Mathias Louboutin, Ali Siahkoohi, Rongrong Wang, Felix J. Herrmann
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request for 2106.06998 was rate-limited (HTTP 429).
Details
Abstract: Not fetched for 2106.06998 (HTTP 429 from export.arxiv.org).
[537] A Survey on Decentralized Federated Learning
Edoardo Gabrielli, Anthony Di Pietro, Dario Fenoglio, Giovanni Pica, Gabriele Tolomei
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request for 2308.04604 was rate-limited (HTTP 429).
Details
Abstract: Not fetched for 2308.04604 (HTTP 429 from export.arxiv.org).
[538] Polynomially Over-Parameterized Convolutional Neural Networks Contain Structured Strong Winning Lottery Tickets
Arthur da Cunha, Francesco d’Amore, Emanuele Natale
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request for 2311.09858 was rate-limited (HTTP 429).
Details
Abstract: Not fetched for 2311.09858 (HTTP 429 from export.arxiv.org).
[539] Provable Filter for Real-world Graph Clustering
Xuanting Xie, Erlin Pan, Zhao Kang, Wenyu Chen, Bingheng Li
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request for 2403.03666 was rate-limited (HTTP 429).
Details
Abstract: Not fetched for 2403.03666 (HTTP 429 from export.arxiv.org).
[540] HYGENE: A Diffusion-based Hypergraph Generation Method
Dorian Gailhard, Enzo Tartaglione, Lirida Naviner, Jhony H. Giraldo
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request for 2408.16457 was rate-limited (HTTP 429).
Details
Abstract: Not fetched for 2408.16457 (HTTP 429 from export.arxiv.org).
[541] ARLBench: Flexible and Efficient Benchmarking for Hyperparameter Optimization in Reinforcement Learning
Jannis Becktepe, Julian Dierkes, Carolin Benjamins, Aditya Mohan, David Salinas, Raghu Rajan, Frank Hutter, Holger Hoos, Marius Lindauer, Theresa Eimer
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request for 2409.18827 was rate-limited (HTTP 429).
Details
Abstract: Not fetched for 2409.18827 (HTTP 429 from export.arxiv.org).
[542] Scalable Message Passing Neural Networks: No Need for Attention in Large Graph Representation Learning
Haitz Sáez de Ocáriz Borde, Artem Lukoianov, Anastasis Kratsios, Michael Bronstein, Xiaowen Dong
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request for 2411.00835 was rate-limited (HTTP 429).
Details
Abstract: Not fetched for 2411.00835 (HTTP 429 from export.arxiv.org).
[543] When Machine Learning Gets Personal: Evaluating Prediction and Explanation
Louisa Cornelis, Guillermo Bernárdez, Haewon Jeong, Nina Miolane
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request for 2502.02786 was rate-limited (HTTP 429).
Details
Abstract: Not fetched for 2502.02786 (HTTP 429 from export.arxiv.org).
[544] Improving clustering quality evaluation in noisy Gaussian mixtures
Renato Cordeiro de Amorim, Vladimir Makarenkov
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request for 2503.00379 was rate-limited (HTTP 429).
Details
Abstract: Not fetched for 2503.00379 (HTTP 429 from export.arxiv.org).
[545] Experiments with Optimal Model Trees
Sabino Francesco Roselli, Eibe Frank
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request for 2503.12902 was rate-limited (HTTP 429).
Details
Abstract: Not fetched for 2503.12902 (HTTP 429 from export.arxiv.org).
[546] The Gaussian-Multinoulli Restricted Boltzmann Machine: A Potts Model Extension of the GRBM
Nikhil Kapasi, Mohamed Elfouly, William Whitehead, Luke Theogarajan
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request for 2505.11635 was rate-limited (HTTP 429).
Details
Abstract: Not fetched for 2505.11635 (HTTP 429 from export.arxiv.org).
[547] JULI: Jailbreak Large Language Models by Self-Introspection
Jesson Wang, Zhanhao Hu, David Wagner
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request for 2505.11790 was rate-limited (HTTP 429).
Details
Abstract: Not fetched for 2505.11790 (HTTP 429 from export.arxiv.org).
[548] Discovering Symbolic Differential Equations with Symmetry Invariants
Jianke Yang, Manu Bhat, Bryan Hu, Yadi Cao, Nima Dehmamy, Robin Walters, Rose Yu
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request for 2505.12083 was rate-limited (HTTP 429).
Details
Abstract: Not fetched for 2505.12083 (HTTP 429 from export.arxiv.org).
[549] A Systematic Evaluation of On-Device LLMs: Quantization, Performance, and Resources
Qingyu Song, Rui Liu, Wei Lin, Peiyu Liao, Wenqian Zhao, Yiwen Wang, Shoubo Hu, Yining Jiang, Mochun Long, Hui-Ling Zhen, Ning Jiang, Mingxuan Yuan, Qiao Xiang, Hong Xu
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request for 2505.15030 was rate-limited (HTTP 429).
Details
Abstract: Not fetched for 2505.15030 (HTTP 429 from export.arxiv.org).
[550] FrontierCO: Real-World and Large-Scale Evaluation of Machine Learning Solvers for Combinatorial Optimization
Shengyu Feng, Weiwei Sun, Shanda Li, Ameet Talwalkar, Yiming Yang
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request for 2505.16952 was rate-limited (HTTP 429).
Details
Abstract: Not fetched for 2505.16952 (HTTP 429 from export.arxiv.org).
[551] Semi-Supervised Conformal Prediction With Unlabeled Nonconformity Score
Xuanning Zhou, Zihao Shi, Hao Zeng, Xiaobo Xia, Bingyi Jing, Hongxin Wei
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request for 2505.21147 was rate-limited (HTTP 429).
Details
Abstract: Not fetched for 2505.21147 (HTTP 429 from export.arxiv.org).
[552] Pure Exploration with Infinite Answers
Riccardo Poiani, Martino Bernasconi, Andrea Celli
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request for 2505.22473 was rate-limited (HTTP 429).
Details
Abstract: Not fetched for 2505.22473 (HTTP 429 from export.arxiv.org).
[553] Operator Learning for Consolidation: An Architectural Comparison for DeepONet Variants
Yongjin Choi, Chenying Liu, Jorge Macedo
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request for 2507.10368 was rate-limited (HTTP 429).
Details
Abstract: Not fetched for 2507.10368 (HTTP 429 from export.arxiv.org).
[554] Langevin Flows for Modeling Neural Latent Dynamics
Yue Song, T. Anderson Keller, Yisong Yue, Pietro Perona, Max Welling
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request for 2507.11531 was rate-limited (HTTP 429).
Details
Abstract: Not fetched for 2507.11531 (HTTP 429 from export.arxiv.org).
[555] Multimodal LLM-assisted Evolutionary Search for Programmatic Control Policies
Qinglong Hu, Xialiang Tong, Mingxuan Yuan, Fei Liu, Zhichao Lu, Qingfu Zhang
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request for 2508.05433 was rate-limited (HTTP 429).
Details
Abstract: Not fetched for 2508.05433 (HTTP 429 from export.arxiv.org).
[556] CTRL Your Shift: Clustered Transfer Residual Learning for Many Small Datasets
Gauri Jain, Dominik Rothenhäusler, Kirk Bansak, Elisabeth Paulson
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request for 2508.11144 was rate-limited (HTTP 429).
Details
Abstract: Not fetched for 2508.11144 (HTTP 429 from export.arxiv.org).
[557] RF-Informed Graph Neural Networks for Accurate and Data-Efficient Circuit Performance Prediction
Anahita Asadi, Leonid Popryho, Inna Partin-Vaisband
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request for 2508.16403 was rate-limited (HTTP 429).
Details
Abstract: Not fetched for 2508.16403 (HTTP 429 from export.arxiv.org).
[558] Iterative In-Context Learning to Enhance LLMs Abstract Reasoning: The Case-Study of Algebraic Tasks
Stefano Fioravanti, Matteo Zavatteri, Roberto Confalonieri, Kamyar Zeinalipour, Paolo Frazzetto, Alessandro Sperduti, Nicolò Navarin
Main category: cs.LG
Abstract: Unavailable; the arXiv API request for 2509.01267 returned HTTP 429 (rate limited).
[559] A Surrogate model for High Temperature Superconducting Magnets to Predict Current Distribution with Neural Network
Mianjun Xiao, Peng Song, Yulong Liu, Cedric Korte, Ziyang Xu, Jiale Gao, Jiaqi Lu, Haoyang Nie, Qiantong Deng, Timing Qu
Main category: cs.LG
Abstract: Unavailable; the arXiv API request for 2509.06067 returned HTTP 429 (rate limited).
[560] ZeroSiam: An Efficient Asymmetry for Test-Time Entropy Optimization without Collapse
Guohao Chen, Shuaicheng Niu, Deyu Chen, Jiahao Yang, Zitian Zhang, Mingkui Tan, Pengcheng Wu, Zhiqi Shen
Main category: cs.LG
Abstract: Unavailable; the arXiv API request for 2509.23183 returned HTTP 429 (rate limited).
[561] Improved Robustness of Deep Reinforcement Learning for Control of Time-Varying Systems by Bounded Extremum Seeking
Shaifalee Saxena, Alan Williams, Rafael Fierro, Alexander Scheinker
Main category: cs.LG
Abstract: Unavailable; the arXiv API request for 2510.02490 returned HTTP 429 (rate limited).
[562] Bradley-Terry Policy Optimization for Generative Preference Modeling
Shengyu Feng, Yun He, Shuang Ma, Beibin Li, Yuanhao Xiong, Songlin Li, Karishma Mandyam, Julian Katz-Samuels, Shengjie Bi, Licheng Yu, Hejia Zhang, Karthik Abinav Sankararaman, Han Fang, Yiming Yang, Manaal Faruqui
Main category: cs.LG
Abstract: Unavailable; the arXiv API request for 2510.15242 returned HTTP 429 (rate limited).
[563] SA$^{2}$GFM: Enhancing Robust Graph Foundation Models with Structure-Aware Semantic Augmentation
Junhua Shi, Qingyun Sun, Haonan Yuan, Xingcheng Fu
Main category: cs.LG
Abstract: Unavailable; the arXiv API request for 2512.07857 returned HTTP 429 (rate limited).
[564] The Affine Divergence: Aligning Activation Updates Beyond Normalisation
George Bird
Main category: cs.LG
Abstract: Unavailable; the arXiv API request for 2512.22247 returned HTTP 429 (rate limited).
[565] MolCrystalFlow: Molecular Crystal Structure Prediction via Flow Matching
Cheng Zeng, Harry W. Sullivan, Thomas Egg, Maya M. Martirossyan, Philipp Höllmer, Jirui Jin, Richard G. Hennig, Adrian Roitberg, Stefano Martiniani, Ellad B. Tadmor, Mingjie Liu
Main category: cs.LG
Abstract: Unavailable; the arXiv API request for 2602.16020 returned HTTP 429 (rate limited).
[566] Detecting Transportation Mode Using Dense Smartphone GPS Trajectories and Transformer Models
Yuandong Zhang, Othmane Echchabi, Tianshu Feng, Wenyi Zhang, Hsuai-Kai Liao, Charles Chang
Main category: cs.LG
Abstract: Unavailable; the arXiv API request for 2603.00340 returned HTTP 429 (rate limited).
[567] DUEL: Exact Likelihood for Masked Diffusion via Deterministic Unmasking
Gilad Turok, Chris De Sa, Volodymyr Kuleshov
Main category: cs.LG
Abstract: Unavailable; the arXiv API request for 2603.01367 returned HTTP 429 (rate limited).
[568] Omni-Masked Gradient Descent: Memory-Efficient Optimization via Mask Traversal with Improved Convergence
Hui Yang, Tao Ren, Jinyang Jiang, Wan Tian, Yijie Peng
Main category: cs.LG
Abstract: Unavailable; the arXiv API request for 2603.05960 returned HTTP 429 (rate limited).
[569] Khatri-Rao Clustering for Data Summarization
Martino Ciaperoni, Collin Leiber, Aristides Gionis, Heikki Mannila
Main category: cs.LG
Abstract: Unavailable; the arXiv API request for 2603.06602 returned HTTP 429 (rate limited).
[570] FedPrism: Adaptive Personalized Federated Learning under Non-IID Data
Prakash Kumbhakar, Shrey Srivastava, Haroon R Lone
Main category: cs.LG
Abstract: Unavailable; the arXiv API request for 2603.08252 returned HTTP 429 (rate limited).
[571] MUSA-PINN: Multi-scale Weak-form Physics-Informed Neural Networks for Fluid Flow in Complex Geometries
Weizheng Zhang, Xunjie Xie, Hao Pan, Xiaowei Duan, Bingteng Sun, Qiang Du, Lin Lu
Main category: cs.LG
Abstract: Unavailable; the arXiv API request for 2603.08465 returned HTTP 429 (rate limited).
[572] Impermanent: A Live Benchmark for Temporal Generalization in Time Series Forecasting
Azul Garza, Renée Rosillo, Rodrigo Mendoza-Smith, David Salinas, Andrew Robert Williams, Arjun Ashok, Mononito Goswami, José Martín Juárez
Main category: cs.LG
Abstract: Unavailable; the arXiv API request for 2603.08707 returned HTTP 429 (rate limited).
[573] Enhancing Computational Efficiency in Multiscale Systems Using Deep Learning of Coordinates and Flow Maps
Asif Hamid, Danish Rafiq, Shahkar Ahmad Nahvi, Mohammad Abid Bazaz
Main category: cs.LG
Abstract: Unavailable; the arXiv API request for 2407.00011 returned HTTP 429 (rate limited).
[574] Calabi-Yau metrics through Grassmannian learning and Donaldson’s algorithm
Carl Henrik Ek, Oisin Kim, Challenger Mishra
Main category: cs.LG
Abstract: Unavailable; the arXiv API request for 2410.11284 returned HTTP 429 (rate limited).
[575] Adaptive and Stratified Subsampling for High-Dimensional Robust Estimation
Prateek Mittal, Joohi Chauhan
Main category: cs.LG
Abstract: Unavailable; the arXiv API request for 2410.12367 returned HTTP 429 (rate limited).
[576] SPDIM: Source-Free Unsupervised Conditional and Label Shift Adaptation in EEG
Shanglin Li, Motoaki Kawanabe, Reinmar J. Kobler
Main category: cs.LG
Abstract: Unavailable; the arXiv API request for 2411.07249 returned HTTP 429 (rate limited).
[577] Prognostics for Autonomous Deep-Space Habitat Health Management under Multiple Unknown Failure Modes
Benjamin Peters, Ayush Mohanty, Xiaolei Fang, Stephen K. Robinson, Nagi Gebraeel
Main category: cs.LG
Abstract: Unavailable; the arXiv API request for 2411.12159 returned HTTP 429 (rate limited).
[578] Morphological-Symmetry-Equivariant Heterogeneous Graph Neural Network for Robotic Dynamics Learning
Fengze Xie, Sizhe Wei, Yue Song, Yisong Yue, Lu Gan
Main category: cs.LG
Abstract: Unavailable; the arXiv API request for 2412.01297 returned HTTP 429 (rate limited).
[579] Molecular Fingerprints Are Strong Models for Peptide Function Prediction
Jakub Adamczyk, Piotr Ludynia, Wojciech Czech
Main category: cs.LG
Abstract: Unavailable; the arXiv API request for 2501.17901 returned HTTP 429 (rate limited).
[580] A Distributional Treatment of Real2Sim2Real for Object-Centric Agent Adaptation in Vision-Driven Deformable Linear Object Manipulation
Georgios Kamaras, Subramanian Ramamoorthy
Main category: cs.LG
Abstract: Unavailable; the arXiv API request for 2502.18615 returned HTTP 429 (rate limited).
[581] Regret-Optimal Q-Learning with Low Cost for Single-Agent and Federated Reinforcement Learning
Haochen Zhang, Zhong Zheng, Lingzhou Xue
Main category: cs.LG
Abstract: Unavailable; the arXiv API request for 2506.04626 returned HTTP 429 (rate limited).
[582] Uncovering Social Network Activity Using Joint User and Topic Interaction
Gaspard Abel, Argyris Kalogeratos, Jean-Pierre Nadal, Julien Randon-Furling
Main category: cs.LG
Abstract: Unavailable; the arXiv API request for 2506.12842 returned HTTP 429 (rate limited).
[583] Global Convergence of Iteratively Reweighted Least Squares for Robust Subspace Recovery
Gilad Lerman, Kang Li, Tyler Maunu, Teng Zhang
Main category: cs.LG
Abstract: Unavailable; the arXiv API request for 2506.20533 returned HTTP 429 (rate limited).
[584] Convergence Rate for the Last Iterate of Stochastic Gradient Descent Schemes
Marcel Hudiani
Main category: cs.LG
Abstract: Unavailable; the arXiv API request for 2507.07281 returned HTTP 429 (rate limited).
[585] Repulsive Monte Carlo on the sphere for the sliced Wasserstein distance
Vladimir Petrovic, Rémi Bardenet, Agnès Desolneux
Main category: cs.LG
Abstract: Unavailable; the arXiv API request for 2509.10166 returned HTTP 429 (rate limited).
[586] Robot Control Stack: A Lean Ecosystem for Robot Learning at Scale
Tobias Jülg, Pierre Krack, Seongjin Bien, Yannik Blei, Khaled Gamal, Ken Nakahara, Johannes Hechtl, Roberto Calandra, Wolfram Burgard, Florian Walter
Main category: cs.LG
Abstract: Unavailable; the arXiv API request for 2509.14932 returned HTTP 429 (rate limited).
[587] Compose Your Policies! Improving Diffusion-based or Flow-based Robot Policies via Test-time Distribution-level Composition
Jiahang Cao, Yize Huang, Hanzhong Guo, Rui Zhang, Mu Nan, Weijian Mai, Jiaxu Wang, Hao Cheng, Jingkai Sun, Gang Han, Wen Zhao, Qiang Zhang, Yijie Guo, Qihao Zheng, Chunfeng Song, Xiao Li, Ping Luo, Andrew F. Luo
Main category: cs.LG
Abstract: Unavailable; the arXiv API request for 2510.01068 returned HTTP 429 (rate limited).
[588] An Interpretable Operator-Learning Model for Electric Field Profile Reconstruction in Discharges Based on the EFISH Method
Zhijian Yang, Edwin Setiadi Sugeng, Mhedine Alicherif, Tat Loon Chng
Main category: cs.LG
Abstract: Unavailable; the arXiv API request for 2512.00359 returned HTTP 429 (rate limited).
[589] Do Spatial Descriptors Improve Multi-DoF Finger Movement Decoding from HD sEMG?
Ricardo Gonçalves Molinari, Leonardo Abdala Elias
Main category: cs.LG
Abstract: Unavailable; the arXiv API request for 2512.13870 returned HTTP 429 (rate limited).
[590] Enhancing Reconstruction Capability of Wavelet Transform Amorphous Radial Distribution Function via Machine Learning Assisted Parameter Tuning
Deriyan Senjaya, Stephen Ekaputra Limantoro
Main category: cs.LG
Abstract: Unavailable; the arXiv API request for 2512.17245 returned HTTP 429 (rate limited).
[591] Provable Acceleration of Distributed Optimization with Local Updates
Zuang Wang, Yongqiang Wang
Main category: cs.LG
Abstract: Unavailable; the arXiv API request for 2601.03442 returned HTTP 429 (rate limited).
[592] Robust Assortment Optimization from Observational Data
Miao Lu, Yuxuan Han, Han Zhong, Zhengyuan Zhou, Jose Blanchet
Main category: cs.LG
Abstract: Unavailable; the arXiv API request for 2602.10696 returned HTTP 429 (rate limited).
[593] Non-Rectangular Average-Reward Robust MDPs: Optimal Policies and Their Transient Values
Shengbo Wang, Nian Si
Main category: cs.LG
Abstract: Unavailable; the arXiv API request for 2603.00945 returned HTTP 429 (rate limited).
[594] PolyBlocks: A Compiler Infrastructure for AI Chips and Programming Frameworks
Uday Bondhugula, Akshay Baviskar, Navdeep Katel, Vimal Patel, Anoop JS, Arnab Dutta
Main category: cs.LG
Abstract: Unavailable; the arXiv API request for 2603.06731 returned HTTP 429 (rate limited).
[595] VLN-Cache: Enabling Token Caching for VLN Models with Visual/Semantic Dynamics Awareness
Zihao Zheng, Zhihao Mao, Xingyue Zhou, Jiayu Chen, Maoliang Li, Xinhao Sun, Hailong Zou, Zhaobo Zhang, Xuanzhe Liu, Donggang Cao, Hong Mei, Xiang Chen
Main category: cs.LG
Abstract: Unavailable; the arXiv API request for 2603.07080 returned HTTP 429 (rate limited).
cs.MA
[596] The Bureaucracy of Speed: Structural Equivalence Between Memory Consistency Models and Multi-Agent Authorization Revocation
Vladyslav Parakhin
Main category: cs.MA
TL;DR: A capability coherence system (CCS) that maps cache coherence states to authorization states to prevent unauthorized API calls in agentic systems, achieving 120-184x reduction in unauthorized operations compared to time-based approaches.
Details
Motivation: Traditional identity and access management fails in agentic execution regimes where temporal assumptions collapse: 60-second revocation windows allow thousands of unauthorized API calls at scale, creating a coherence problem rather than merely a latency problem.
Method: Define a Capability Coherence System (CCS) with a state-mapping φ from MESI cache coherence states to authorization states, preserving transition structure under bounded-staleness semantics. Evaluate the Release Consistency-directed Coherence (RCC) strategy with tick-based discrete event simulation across business scenarios.
Result: RCC achieves 120x reduction vs TTL-based lease in high-velocity scenario (50 vs 6,000 unauthorized operations), 184x under anomaly-triggered revocation. Zero bound violations across all 120 runs confirm per-capability safety guarantee.
Conclusion: The coherence-based approach provides qualitative improvement over time-bounded strategies, bounding unauthorized operations independent of agent velocity rather than scaling linearly with velocity and TTL.
Abstract: The temporal assumptions underpinning conventional Identity and Access Management collapse under agentic execution regimes. A sixty-second revocation window permits on the order of $6 \times 10^3$ unauthorized API calls at 100 ops/tick; at AWS Lambda scale, the figure approaches $6 \times 10^5$. This is a coherence problem, not merely a latency problem. We define a Capability Coherence System (CCS) and construct a state-mapping $\varphi : \Sigma_{\rm MESI} \to \Sigma_{\rm auth}$ preserving transition structure under bounded-staleness semantics. A safety theorem bounds unauthorized operations for the execution-count Release Consistency-directed Coherence (RCC) strategy at $D_{\rm rcc} \leq n$, independent of agent velocity $v$ – a qualitative departure from the $O(v \cdot \mathrm{TTL})$ scaling of time-bounded strategies. Tick-based discrete event simulation across three business-contextualised scenarios (four strategies, ten deterministic seeds each) confirms: RCC achieves a $120\times$ reduction versus TTL-based lease in the high-velocity scenario (50 vs. 6,000 unauthorized operations), and $184\times$ under anomaly-triggered revocation. Zero bound violations across all 120 runs confirm the per-capability safety guarantee. Simulation code: https://github.com/hipvlady/prizm
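The velocity argument above lends itself to a toy calculation. The sketch below illustrates a MESI-to-authorization state mapping and the two staleness bounds; the mapping, function names, and numbers are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of a phi state-mapping and the two staleness bounds.
# Illustrative only; this is not the paper's code.

# phi: MESI cache-coherence states -> authorization states (assumed mapping).
PHI = {
    "M": "granted-exclusive",  # Modified  -> capability held, locally mutated
    "E": "granted-exclusive",  # Exclusive -> capability held, unshared
    "S": "granted-shared",     # Shared    -> capability held, read-only
    "I": "revoked",            # Invalid   -> capability revoked
}

def unauthorized_ops_ttl(velocity: int, ttl_ticks: int) -> int:
    """Time-bounded lease: the stale window scales as O(v * TTL)."""
    return velocity * ttl_ticks

def unauthorized_ops_rcc(n_inflight: int) -> int:
    """Coherence-bounded (RCC-style): at most the n operations already in
    flight when the invalidation lands, independent of agent velocity."""
    return n_inflight

# 100 ops/tick over a 60-tick revocation window vs. a coherence bound of 50:
print(unauthorized_ops_ttl(100, 60))  # 6000 stale calls under a TTL lease
print(unauthorized_ops_rcc(50))       # 50
```

The 120x figure in the TL;DR corresponds to this 6,000 vs. 50 arithmetic: the TTL bound grows with agent velocity, while the coherence bound does not.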
[597] Emotional Modulation in Swarm Decision Dynamics
David Freire-Obregón
Main category: cs.MA
TL;DR: Extends bee equation to agent-based model with emotional valence/arousal modulating interaction rates, studying emotional contagion in consensus formation through three scenarios.
Details
Motivation: To bridge swarm decision theory with affective modeling by incorporating emotional states (valence and arousal) as modulators of interaction rates in collective decision-making, enabling study of emotional contagion effects on consensus formation.
Method: Extends the bee equation to an agent-based model where emotional valence (positive-negative) and arousal (low-high) modulate recruitment and cross-inhibition parameters. Agents display simulated facial expressions mapped from valence-arousal states to study emotional contagion. Three scenarios explored: (1) joint effect of valence/arousal on consensus, (2) arousal’s role in breaking ties, (3) “snowball effect” acceleration after threshold crossing.
Result: Emotional modulation biases decision outcomes and alters convergence times by shifting effective recruitment/inhibition rates. Intrinsic non-linear amplification produces decisive wins even in symmetric emotional conditions. Arousal can break ties when valence is matched, and snowball effects accelerate consensus after surpassing intermediate thresholds.
Conclusion: Links classical swarm decision theory with affective/social modeling, showing how emotional asymmetries and structural tipping points shape collective outcomes. Provides flexible framework for studying emotional dimensions of collective choice in natural/artificial systems.
Abstract: Collective decision-making in biological and human groups often emerges from simple interaction rules that amplify minor differences into consensus. The bee equation, developed initially to describe nest-site selection in honeybee swarms, captures this dynamic through recruitment and inhibition processes. Here, we extend the bee equation into an agent-based model in which emotional valence (positive-negative) and arousal (low-high) act as modulators of interaction rates, effectively altering the recruitment and cross-inhibition parameters. Agents display simulated facial expressions mapped from their valence-arousal states, allowing the study of emotional contagion in consensus formation. Three scenarios are explored: (1) the joint effect of valence and arousal on consensus outcomes and speed, (2) the role of arousal in breaking ties when valence is matched, and (3) the “snowball effect” in which consensus accelerates after surpassing intermediate support thresholds. Results show that emotional modulation can bias decision outcomes and alter convergence times by shifting effective recruitment and inhibition rates. At the same time, intrinsic non-linear amplification can produce decisive wins even in fully symmetric emotional conditions. These findings link classical swarm decision theory with affective and social modelling, highlighting how both emotional asymmetries and structural tipping points shape collective outcomes. The proposed framework offers a flexible tool for studying the emotional dimensions of collective choice in both natural and artificial systems.
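The recruitment and cross-inhibition dynamics can be sketched as a minimal Euler-integrated bee equation in which emotional state rescales the recruitment rate. The modulation rule (`modulated_rate`) and all constants below are illustrative assumptions, not the paper's calibrated model.

```python
# Minimal bee-equation sketch with emotional modulation of recruitment.
# xa, xb: fractions committed to options A and B; u: uncommitted fraction.

def step(xa, xb, rho_a, rho_b, sigma, dt=0.01):
    u = 1.0 - xa - xb                       # uncommitted fraction
    dxa = rho_a * xa * u - sigma * xa * xb  # recruitment minus cross-inhibition
    dxb = rho_b * xb * u - sigma * xa * xb
    return xa + dt * dxa, xb + dt * dxb

def modulated_rate(base, valence, arousal):
    # Assumed rule: positive valence boosts recruitment, arousal amplifies it.
    return base * (1.0 + 0.5 * valence) * (1.0 + 0.5 * arousal)

xa, xb = 0.10, 0.10  # symmetric initial support
rho_a = modulated_rate(1.0, valence=+0.4, arousal=0.8)  # positive, aroused camp
rho_b = modulated_rate(1.0, valence=-0.4, arousal=0.2)
for _ in range(5000):
    xa, xb = step(xa, xb, rho_a, rho_b, sigma=2.0)
print(xa > xb)  # True: the emotionally favoured option wins consensus
```

Even a modest valence/arousal asymmetry shifts the effective recruitment rates enough for the non-linear amplification to produce a decisive win, matching the paper's qualitative finding.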
[598] Enhancing Heterogeneous Multi-Agent Cooperation in Decentralized MARL via GNN-driven Intrinsic Rewards
Jahir Sadik Monon, Deeparghya Dutta Barua, Md. Mosaddek Khan
Main category: cs.MA
TL;DR: CoHet algorithm uses GNN-based intrinsic motivation for decentralized training of heterogeneous multi-agent systems under partial observability and sparse rewards.
Details
Motivation: Real-world multi-agent systems require decentralized training, handle diverse agents, and learn from sparse rewards, but existing methods assume centralized training, parameter sharing, and agent indexing when dealing with heterogeneity.
Method: Proposes CoHet algorithm with Graph Neural Network-based intrinsic motivation to learn heterogeneous agent policies in decentralized settings under partial observability and reward sparsity.
Result: CoHet demonstrates superior performance compared to state-of-the-art methods in Multi-agent Particle Environment and Vectorized Multi-Agent Simulator benchmarks across cooperative scenarios.
Conclusion: CoHet effectively addresses challenges of decentralized heterogeneous multi-agent learning with sparse rewards through GNN-based intrinsic motivation, showing robustness with increasing agent numbers.
Abstract: Multi-agent Reinforcement Learning (MARL) is emerging as a key framework for various sequential decision-making and control tasks. Unlike their single-agent counterparts, multi-agent systems necessitate successful cooperation among the agents. The deployment of these systems in real-world scenarios often requires decentralized training, a diverse set of agents, and learning from infrequent environmental reward signals. These challenges become more pronounced under partial observability and the lack of prior knowledge about agent heterogeneity. While notable studies use intrinsic motivation (IM) to address reward sparsity or cooperation in decentralized settings, those dealing with heterogeneity typically assume centralized training, parameter sharing, and agent indexing. To overcome these limitations, we propose the CoHet algorithm, which utilizes a novel Graph Neural Network (GNN) based intrinsic motivation to facilitate the learning of heterogeneous agent policies in decentralized settings, under the challenges of partial observability and reward sparsity. Evaluation of CoHet in the Multi-agent Particle Environment (MPE) and Vectorized Multi-Agent Simulator (VMAS) benchmarks demonstrates superior performance compared to the state-of-the-art in a range of cooperative multi-agent scenarios. Our research is supplemented by an analysis of the impact of the agent dynamics model on the intrinsic motivation module, insights into the performance of different CoHet variants, and its robustness to an increasing number of heterogeneous agents.
[599] Cooperative Game-Theoretic Credit Assignment for Multi-Agent Policy Gradients via the Core
Mengda Ji, Genjiu Xu, Keke Jia, Zekun Duan, Yong Qiu, Jianjun Ge, Mingqiang Li
Main category: cs.MA
TL;DR: CORA: A cooperative game-theoretic advantage allocation method for multi-agent reinforcement learning that addresses credit assignment by evaluating coalitional contributions rather than individual agent contributions.
Details
Motivation: The paper addresses the credit assignment problem in cooperative multi-agent reinforcement learning, where sharing global advantage among agents often leads to insufficient policy optimization because it fails to capture the coalitional contributions of different agents working together.
Method: CORA revisits policy update from a coalitional perspective, using cooperative game-theoretic core allocation to evaluate marginal contributions of different coalitions. It combines clipped double Q-learning to mitigate overestimation bias and estimates coalition-wise advantages. Random coalition sampling is employed to reduce computational overhead.
Result: Experiments on matrix games, differential games, and multi-agent collaboration benchmarks demonstrate that CORA outperforms baseline methods, showing improved performance in multi-agent coordination tasks.
Conclusion: The findings highlight the importance of coalition-level credit assignment and cooperative game theory for advancing multi-agent learning, providing a more effective way to attribute global advantage to different coalition strategies and promote coordinated optimal behavior.
Abstract: This work focuses on the credit assignment problem in cooperative multi-agent reinforcement learning (MARL). Sharing the global advantage among agents often leads to insufficient policy optimization, as it fails to capture the coalitional contributions of different agents. In this work, we revisit the policy update process from a coalitional perspective and propose CORA, an advantage allocation method guided by a cooperative game-theoretic core allocation. By evaluating the marginal contributions of different coalitions and combining clipped double Q-learning to mitigate overestimation bias, CORA estimates coalition-wise advantages. The core formulation enforces coalition-wise lower bounds on allocated credits, so that coalitions with higher advantages receive stronger total incentives for their participating agents, enabling the global advantage to be attributed to different coalition strategies and promoting coordinated optimal behavior. To reduce computational overhead, we employ random coalition sampling to approximate the core allocation efficiently. Experiments on matrix games, differential games, and multi-agent collaboration benchmarks demonstrate that our method outperforms baselines. These findings highlight the importance of coalition-level credit assignment and cooperative games for advancing multi-agent learning.
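The random coalition sampling step can be sketched as follows. Here `coalition_value` stands in for the paper's critic-based coalition evaluation (replaced by a toy diminishing-returns value), and all names are illustrative assumptions rather than CORA's actual estimator.

```python
import random

# Sketch of random coalition sampling to approximate coalition-wise
# advantages, in the spirit of CORA's efficient core approximation.

def coalition_value(coalition: frozenset) -> float:
    # Toy stand-in for a learned critic: diminishing returns in coalition size.
    return len(coalition) ** 0.5

def sample_coalition_advantages(agents, num_samples=1000, seed=0):
    rng = random.Random(seed)
    totals, counts = {}, {}
    for _ in range(num_samples):
        k = rng.randint(1, len(agents))
        c = frozenset(rng.sample(agents, k))
        # Advantage of coalition c relative to the empty coalition.
        adv = coalition_value(c) - coalition_value(frozenset())
        totals[c] = totals.get(c, 0.0) + adv
        counts[c] = counts.get(c, 0) + 1
    return {c: totals[c] / counts[c] for c in totals}

advs = sample_coalition_advantages([0, 1, 2, 3])
grand = frozenset([0, 1, 2, 3])
print(advs[grand])  # 2.0, i.e. sqrt(4): the grand coalition's estimated advantage
```

In CORA these sampled coalition advantages feed the core's lower-bound constraints, so that higher-advantage coalitions receive stronger total incentives; the sampling avoids enumerating all $2^n$ coalitions.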
[600] Polynomial-time Configuration Generator for Connected Unlabeled Multi-Agent Pathfinding
Takahiro Suzuki, Keisuke Okumura
Main category: cs.MA
TL;DR: CUMAPF: Multi-agent pathfinding with connectivity constraints for swarm robotics, solved via ILP (optimal but slow) and PULL algorithm (suboptimal but fast).
Details
Motivation: Standard MAPF doesn't guarantee connectivity, which is crucial for swarm robotics applications like self-reconfiguration and marching where agents must stay connected at all times.
Method: Two approaches: 1) Integer Linear Programming (ILP) reduction for makespan-optimal plans, and 2) PULL algorithm - a rule-based one-step function that computes subsequent configurations preserving connectivity while advancing toward targets.
Result: ILP provides optimal solutions but scales poorly. PULL runs in O(n²) time per step, can handle hundreds of agents quickly, and substantially outperforms naive approaches.
Conclusion: PULL offers a practical, scalable solution for CUMAPF that enables real-time planning for swarm robotics applications requiring connectivity constraints.
Abstract: We consider Connected Unlabeled Multi-Agent Pathfinding (CUMAPF), a variant of MAPF where interchangeable agents must be connected at all times. This problem is fundamental to swarm robotics applications such as self-reconfiguration and marching, where standard MAPF is insufficient as it does not guarantee the connectivity constraint. Despite its simple structure, CUMAPF remains understudied and lacks practical algorithms. We first develop an Integer Linear Programming (ILP) reduction to solve CUMAPF. Although this formulation provides a makespan-optimal plan, it is severely limited in terms of scalability and real-time responsiveness due to the large number of variables. We therefore propose a suboptimal but complete algorithm named PULL. It is based on a rule-based one-step function that computes a subsequent configuration that preserves connectivity and advances towards the target configuration. PULL is lightweight, and runs in $O(n^2)$ time per step in a 2D grid, where $n$ is the number of agents. Empirically, PULL can quickly solve randomly generated instances containing hundreds of agents, which ILP cannot handle. Furthermore, PULL’s solution substantially improves upon a naive approach to CUMAPF.
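The connectivity invariant at the heart of CUMAPF is easy to make concrete: a configuration (set of occupied grid cells) is valid only if the agents form one 4-connected component, and a one-step planner like PULL may only emit successors that preserve it. The helpers below are a generic BFS-based sketch of that invariant, not the paper's PULL rule itself.

```python
from collections import deque

def is_connected(cells: set) -> bool:
    """BFS over occupied cells: one 4-connected component? Runs in O(n)."""
    if not cells:
        return True
    start = next(iter(cells))
    seen, queue = {start}, deque([start])
    while queue:
        x, y = queue.popleft()
        for nb in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nb in cells and nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return len(seen) == len(cells)

def move_keeps_connectivity(cells, src, dst):
    """Would moving one agent from src to dst keep the swarm connected?"""
    return dst not in cells and is_connected((cells - {src}) | {dst})

line = {(0, 0), (1, 0), (2, 0)}
print(move_keeps_connectivity(line, (2, 0), (1, 1)))  # True: still connected
print(move_keeps_connectivity(line, (1, 0), (1, 1)))  # False: splits the line
```

Checking one candidate move costs one $O(n)$ BFS, so validating a move for each of $n$ agents gives the $O(n^2)$ per-step flavour reported for PULL.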
[601] Algorithmic Collusion at Test Time: A Meta-game Design and Evaluation
Yuhong Luo, Daniel Schoepflin, Xintong Wang
Main category: cs.MA
TL;DR: This paper studies algorithmic collusion risk in pricing games using a meta-game framework with pretrained policies and in-game adaptation, evaluating RL, UCB, and LLM-based strategies under symmetric/asymmetric cost settings.
Details
Motivation: The paper addresses the ongoing debate about algorithmic collusion risk and regulatory intervention, noting limitations in existing evaluations that rely on long learning horizons, assumptions about counterparty rationality, and symmetry in hyperparameters/economic settings among players.
Method: Introduces a meta-game design where agents have pretrained policies with distinct strategic characteristics (competitive, naively cooperative, or robustly collusive). Formulates the problem as selecting a meta-strategy combining pretrained initial policy with in-game adaptation rule. Samples normal-form empirical games over meta-strategy profiles, computes game statistics (payoffs, regret), and constructs empirical best-response graphs to uncover strategic relationships.
Result: Evaluates reinforcement-learning, UCB, and LLM-based strategies in repeated pricing games under symmetric and asymmetric cost settings, presenting findings on the feasibility of algorithmic collusion and effectiveness of pricing strategies in practical “test-time” environments.
Conclusion: Provides insights into algorithmic collusion risk under rational choices and how agents co-adapt toward cooperation or competition, with implications for regulatory considerations in algorithmic pricing environments.
Abstract: The threat of algorithmic collusion, and whether it merits regulatory intervention, remains debated, as existing evaluations of its emergence often rely on long learning horizons, assumptions about counterparty rationality in adopting collusive strategies, and symmetry in hyperparameters and economic settings among players. To study collusion risk, we introduce a meta-game design for analyzing algorithmic behavior under test-time constraints. We model agents as possessing pretrained policies with distinct strategic characteristics (e.g., competitive, naively cooperative, or robustly collusive), and formulate the problem as selecting a meta-strategy that combines a pretrained, initial policy with an in-game adaptation rule. We seek to examine whether collusion can emerge under rational choices and how agents co-adapt toward cooperation or competition. To this end, we sample normal-form empirical games over meta-strategy profiles, compute relevant game statistics (e.g., payoffs against individuals and regret against an equilibrium mixture of opponents), and construct empirical best-response graphs to uncover strategic relationships. We evaluate reinforcement-learning, UCB, and LLM-based strategies in repeated pricing games under symmetric and asymmetric cost settings, and present findings on the feasibility of algorithmic collusion and the effectiveness of pricing strategies in practical "test-time" environments. The source code is available at: https://github.com/chailab-rutgers/CollusionMetagame.
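An empirical best-response graph is built by tabulating payoffs over meta-strategy profiles and drawing an edge from each opponent strategy to the strategy that maximizes payoff against it. The 3-strategy payoff table below is a made-up toy, not data from the paper; it only illustrates the construction.

```python
# Sketch of building an empirical best-response graph from a payoff table.
strategies = ["competitive", "naive_coop", "robust_collusive"]
# payoff[i][j]: row player's payoff when playing strategy i against j (toy numbers).
payoff = [
    [2, 5, 3],  # competitive
    [1, 4, 4],  # naively cooperative
    [3, 6, 6],  # robustly collusive
]

def best_response(opponent: int) -> int:
    """Index of the row strategy with highest empirical payoff vs `opponent`."""
    return max(range(len(strategies)), key=lambda i: payoff[i][opponent])

edges = {strategies[j]: strategies[best_response(j)]
         for j in range(len(strategies))}
print(edges)
# In this toy table every opponent's best response is 'robust_collusive',
# so collusion is the sink of the best-response graph.
```

A strategy that is the best-response sink of such a graph is a pure equilibrium of the empirical meta-game, which is how the graph exposes whether rational meta-strategy choice drifts toward cooperation or competition.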
[602] Boltzmann-based Exploration for Robust Decentralized Multi-Agent Planning (Extended Version)
Nhat D. A. Nguyen, Duong D. Nguyen, Gianluca Rizzo, Hung X. Nguyen
Main category: cs.MA
TL;DR: CB-MCTS introduces coordinated Boltzmann exploration to multi-agent MCTS for better performance in deceptive/sparse reward environments.
Details
Motivation: Decentralized MCTS struggles with sparse or skewed reward environments in cooperative multi-agent planning, needing better exploration strategies.
Method: Replaces deterministic UCT with stochastic Boltzmann policy and decaying entropy bonus for sustained yet focused exploration in multi-agent systems.
Result: Outperforms Dec-MCTS in deceptive scenarios and remains competitive on standard benchmarks for multi-agent planning.
Conclusion: CB-MCTS provides a robust solution for multi-agent planning with improved exploration in challenging reward environments.
Abstract: Decentralized Monte Carlo Tree Search (Dec-MCTS) is widely used for cooperative multi-agent planning but struggles in sparse or skewed reward environments. We introduce Coordinated Boltzmann MCTS (CB-MCTS), which replaces deterministic UCT with a stochastic Boltzmann policy and a decaying entropy bonus for sustained yet focused exploration. While Boltzmann exploration has been studied in single-agent MCTS, applying it in multi-agent systems poses unique challenges. CB-MCTS is the first to address this. We analyze CB-MCTS in the simple-regret setting and show in simulations that it outperforms Dec-MCTS in deceptive scenarios and remains competitive on standard benchmarks, providing a robust solution for multi-agent planning.
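The selection rule CB-MCTS swaps in for UCT can be sketched as softmax (Boltzmann) sampling over child value estimates with an exploration bonus that decays as visits accumulate. The temperature and decay constants below are illustrative assumptions, not the paper's settings.

```python
import math
import random

def boltzmann_select(values, visits, total_visits, tau=1.0, beta0=1.0, rng=random):
    """Sample a child index from a Boltzmann policy with a decaying bonus."""
    beta = beta0 / math.sqrt(1 + total_visits)  # bonus shrinks as search matures
    scores = [v + beta / math.sqrt(1 + n) for v, n in zip(values, visits)]
    m = max(scores)                             # shift for numerical stability
    weights = [math.exp((s - m) / tau) for s in scores]
    z = sum(weights)
    r, acc = rng.random() * z, 0.0
    for i, w in enumerate(weights):
        acc += w
        if r <= acc:
            return i
    return len(weights) - 1

random.seed(0)
picks = [boltzmann_select([0.2, 0.8, 0.5], [10, 10, 10], 30) for _ in range(2000)]
print(picks.count(1) > picks.count(0))  # higher-value child sampled more often
```

Unlike deterministic UCT, every child keeps nonzero selection probability, which is what sustains exploration in deceptive or sparse-reward landscapes while the shrinking bonus keeps the search focused.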
[603] Characterizations of voting rules based on majority margins
Yifeng Ding, Wesley H. Holliday, Eric Pacuit
Main category: cs.MA
TL;DR: Summary unavailable: the arXiv API request for paper 2501.08595 was rate-limited (HTTP 429).
Details
Abstract: Not retrieved; see arXiv:2501.08595.
cs.MM
[604] TPIFM: A Task-Aware Model for Evaluating Perceptual Interaction Fluency in Remote AR Collaboration
Jiarun Song, Ninghao Wan, Fuzheng Yang, Weisi Lin
Main category: cs.MM
TL;DR: This paper proposes a Task-Aware Perceptual Interaction Fluency Model (TPIFM) for Remote Collaborative Augmented Reality (RCAR) that accounts for task-specific temporal sensitivity to network impairments like delay and stalling.
Details
Motivation: RCAR systems suffer from network impairments (delay/stalling) that degrade perceptual interaction fluency (PIF), but different tasks have different temporal sensitivity thresholds (JNDs). Current approaches don't account for task-specific tolerance to impairments.
Method: Classify RCAR tasks by their just-noticeable difference (JND) thresholds, conduct controlled subjective experiments under delay/stalling/hybrid conditions, and develop TPIFM based on Free Energy Principle to model task-specific PIF.
Result: TPIFM accurately assesses PIF under network impairments, showing tasks with lower JNDs (stricter temporal demands) are more vulnerable to impairments, while higher JND tasks are more tolerant.
Conclusion: Task-aware modeling of perceptual interaction fluency enables adaptive RCAR design and user experience optimization under network constraints by accounting for task-specific temporal sensitivity.
Abstract: Remote Collaborative Augmented Reality (RCAR) enables geographically distributed users to collaborate by integrating virtual and physical environments. However, because RCAR relies on real-time transmission, it is susceptible to delay and stalling impairments under constrained network conditions. Perceptual interaction fluency (PIF), defined as the perceived pace and responsiveness of collaboration, is influenced not only by physical network impairments but also by intrinsic task characteristics. These characteristics can be interpreted as the task-specific just-noticeable difference (JND), i.e., the maximal tolerable temporal responsiveness before PIF degrades. When the average response time (ART), measured as the mean time per operation from receiving collaborator feedback to initiating the next action, falls within the JND, PIF is generally sustained, whereas values exceeding it indicate disruption. Tasks differ in their JNDs, reflecting distinct temporal responsiveness demands and sensitivities to impairments. From the perspective of the Free Energy Principle (FEP), tasks with lower JNDs impose stricter temporal prediction demands, making PIF more vulnerable to impairments, whereas higher JNDs allow greater tolerance. On this basis, we classify RCAR tasks by JND and evaluate their PIF through controlled subjective experiments under delay, stalling, and hybrid conditions. Building on these findings, we propose the Task-Aware Perceptual Interaction Fluency Model (TPIFM). Experimental results show that TPIFM accurately assesses PIF under network impairments, providing guidance for adaptive RCAR design and user experience optimization under network constraints.
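The quantitative core of the setup is the comparison between the average response time (ART) and a task-specific JND: fluency is sustained while ART stays within the JND. The threshold values below are illustrative assumptions, not the thresholds measured in the paper.

```python
# Sketch of the ART-vs-JND relation underlying TPIFM.
TASK_JND_MS = {
    "precision_assembly": 150,  # low JND: strict temporal demands
    "freeform_discussion": 600,  # high JND: tolerant of impairments
}

def pif_sustained(task: str, art_ms: float) -> bool:
    """PIF holds when mean response time per operation is within the task JND."""
    return art_ms <= TASK_JND_MS[task]

art = 300.0  # e.g. network delay plus the collaborator's intrinsic response time
print(pif_sustained("precision_assembly", art))   # False: fluency disrupted
print(pif_sustained("freeform_discussion", art))  # True: within tolerance
```

The same impairment thus disrupts a low-JND task while leaving a high-JND task unaffected, which is why TPIFM must be task-aware.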
[605] Latency Effects on Multi-Dimensional QoE in Networked VR Whiteboards
Jiarun Song, Yongkang Hou, Fuzheng Yang
Main category: cs.MM
TL;DR: Study examines how latency affects Quality of Experience in networked VR whiteboards across collaboration modes and platform types
Details
Motivation: Network latency in VR whiteboards degrades collaborative experience; need systematic understanding of how latency impacts different QoE dimensions across collaboration modes and platforms.
Method: Classified QoE into pragmatic/hedonic dimensions, conducted controlled experiments to identify latency-sensitive sub-dimensions, compared sequential vs free collaboration modes, VR with/without avatars, and PC baseline
Result: Identified specific QoE sub-dimensions most affected by latency, found variations across collaboration modes and platform types, with VR+ (with avatars) showing different sensitivity patterns
Conclusion: Provides comprehensive framework for understanding latency impact on VR whiteboard QoE, offers practical guidance for system optimization under real-world constraints
Abstract: Networked virtual reality (NVR) whiteboards are increasingly important for enabling geographically dispersed users to engage in real-time idea sharing, collaborative design, and discussion. However, latency caused by network limitations, rendering delays, or synchronization issues can significantly degrade the Quality of Experience (QoE) in whiteboard collaboration. To systematically investigate the impact of latency, this study classified QoE into pragmatic and hedonic aspects, each comprising multiple sub-dimensions. Controlled experiments were conducted to identify the sub-dimensions most affected by latency, which were then adopted as the primary QoE indicators, with the aim of uncovering the processes and mechanisms through which latency shapes QoE. Building on this, we further examined how these impacts vary across different collaboration modes, namely sequential collaboration (SC) for structured design workflows and free collaboration (FC) for open discussion. We also compared two VR whiteboard types, one with avatars (VR+) and the other without avatars (VR), and included a traditional PC-based whiteboard as a baseline. This multi-dimensional design enables a comprehensive evaluation of latency’s impact on QoE across collaboration modes and platforms, providing practical guidance for optimizing NVR whiteboard systems under real-world network and system constraints.
[606] MORE-R1: Guiding LVLM for Multimodal Object-Entity Relation Extraction via Stepwise Reasoning with Reinforcement Learning
Xiang Yuan, Xu Chu, Xinrong Chen, Haochen Li, Zonghong Dai, Hongcheng Fan, Xiaoyue Yuan, Weiping Li, Tong Mo
Main category: cs.MM
TL;DR: MORE-R1 introduces explicit stepwise reasoning with Reinforcement Learning to enable Large Vision-Language Models to effectively perform Multimodal Object-Entity Relation Extraction, achieving state-of-the-art performance.
Details
Motivation: Existing methods for Multimodal Object-Entity Relation Extraction (MORE) struggle with complex extraction scenarios, limited scalability, and lack of intermediate reasoning transparency. Current approaches are mainly classification-based or generation-based without explicit reasoning capabilities.
Method: Proposes MORE-R1 with a two-stage training process: 1) Initial cold-start training with Supervised Fine-Tuning using automatically constructed dataset with fine-grained stepwise reasoning, 2) Reinforcement Learning stage using Group Relative Policy Optimization with Progressive Sample-Mixing Strategy to enhance reasoning on hard samples.
Result: Comprehensive experiments on the MORE benchmark demonstrate state-of-the-art performance with significant improvement over baselines.
Conclusion: MORE-R1 effectively addresses the challenges of MORE task by introducing explicit stepwise reasoning with RL, improving both performance and reasoning transparency.
Abstract: Multimodal Object-Entity Relation Extraction (MORE) is a challenging task in information extraction research. It aims to identify relations between visual objects and textual entities, requiring complex multimodal understanding and cross-modal reasoning abilities. Existing methods, mainly classification-based or generation-based without reasoning, struggle to handle complex extraction scenarios in the MORE task and suffer from limited scalability and intermediate reasoning transparency. To address these challenges, we propose MORE-R1, a novel model that introduces explicit stepwise reasoning with Reinforcement Learning (RL) to enable Large Vision-Language Model (LVLM) to address the MORE task effectively. MORE-R1 integrates a two-stage training process, including an initial cold-start training stage with Supervised Fine-Tuning (SFT) and a subsequent RL stage for reasoning ability optimization. In the initial stage, we design an efficient way to automatically construct a high-quality SFT dataset containing fine-grained stepwise reasoning tailored to the MORE task, enabling the model to learn an effective reasoning paradigm. In the subsequent stage, we employ the Group Relative Policy Optimization (GRPO) RL algorithm with a Progressive Sample-Mixing Strategy to stabilize training and further enhance the model's reasoning ability on hard samples. Comprehensive experiments on the MORE benchmark demonstrate that MORE-R1 achieves state-of-the-art performance with significant improvement over baselines.
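The group-relative advantage at the heart of GRPO, the RL algorithm used in MORE-R1's second stage, normalizes the rewards of a group of sampled completions against the group's own mean and standard deviation, removing the need for a learned value baseline. The reward values in this sketch are made up.

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantage: (r - group mean) / group std."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against all-equal groups
    return [(r - mu) / sigma for r in rewards]

# One group of G=4 completions for the same image-text input, scored by a
# rule-based reward (e.g. correct relation extracted -> 1.0, else 0.0).
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advs)  # [1.0, -1.0, -1.0, 1.0]
```

Completions that beat their group's average get positive advantage and are reinforced; the others are pushed down, which is what lets GRPO sharpen reasoning on hard samples without a critic network.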
[607] Concept Drift Guided LayerNorm Tuning for Efficient Multimodal Metaphor Identification
Wenhao Qian, Zhenzhen Hu, Zijie Song, Jia Li
Main category: cs.MM
TL;DR: CDGLT is a training-efficient framework for multimodal metaphor identification using concept drift via SLERP interpolation and adapted LN tuning, achieving SOTA on MET-Meme benchmark with reduced computational costs.
Details
Motivation: Multimodal metaphors in internet memes present unique challenges due to unconventional expressions and implied meanings. Existing methods struggle with bridging literal-figurative gaps, and generative approaches have high computational costs.
Method: CDGLT uses Concept Drift via SLERP interpolation of CLIP embeddings to generate divergent concept embeddings, plus prompt construction and adapted LayerNorm tuning for efficient multimodal metaphor identification.
Result: Achieves state-of-the-art performance on MET-Meme benchmark while significantly reducing training costs compared to existing generative methods.
Conclusion: CDGLT represents a significant step toward efficient and accurate multimodal metaphor understanding, demonstrating effectiveness of both Concept Drift and adapted LN tuning.
Abstract: Metaphorical imagination, the ability to connect seemingly unrelated concepts, is fundamental to human cognition and communication. While understanding linguistic metaphors has advanced significantly, grasping multimodal metaphors, such as those found in internet memes, presents unique challenges due to their unconventional expressions and implied meanings. Existing methods for multimodal metaphor identification often struggle to bridge the gap between literal and figurative interpretations. Additionally, generative approaches that utilize large language models or text-to-image models, while promising, suffer from high computational costs. This paper introduces \textbf{C}oncept \textbf{D}rift \textbf{G}uided \textbf{L}ayerNorm \textbf{T}uning (\textbf{CDGLT}), a novel and training-efficient framework for multimodal metaphor identification. CDGLT incorporates two key innovations: (1) Concept Drift, a mechanism that leverages Spherical Linear Interpolation (SLERP) of cross-modal embeddings from a CLIP encoder to generate a new, divergent concept embedding. This drifted concept helps to alleviate the gap between literal features and the figurative task. (2) A prompt construction strategy, that adapts the method of feature extraction and fusion using pre-trained language models for the multimodal metaphor identification task. CDGLT achieves state-of-the-art performance on the MET-Meme benchmark while significantly reducing training costs compared to existing generative methods. Ablation studies demonstrate the effectiveness of both Concept Drift and our adapted LN Tuning approach. Our method represents a significant step towards efficient and accurate multimodal metaphor understanding. The code is available: \href{https://github.com/Qianvenh/CDGLT}{https://github.com/Qianvenh/CDGLT}.
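The Concept Drift step rests on spherical linear interpolation (SLERP) between two embeddings, which interpolates along the great circle rather than the chord. The sketch below uses plain Python on toy 2-D vectors for clarity; the drift weight `t` and the choice of inputs are assumptions, and CDGLT applies this to CLIP's cross-modal embeddings.

```python
import math

def slerp(a, b, t):
    """Spherical linear interpolation between vectors a and b at weight t."""
    norm = lambda v: math.sqrt(sum(x * x for x in v))
    dot = sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))
    theta = math.acos(max(-1.0, min(1.0, dot)))  # angle between embeddings
    if theta < 1e-6:                             # nearly parallel: plain lerp
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    s = math.sin(theta)
    wa, wb = math.sin((1 - t) * theta) / s, math.sin(t * theta) / s
    return [wa * x + wb * y for x, y in zip(a, b)]

img_emb, txt_emb = [1.0, 0.0], [0.0, 1.0]  # toy unit embeddings
drifted = slerp(img_emb, txt_emb, t=0.5)
print(drifted)  # [0.707..., 0.707...]: midpoint along the unit circle
```

Because CLIP embeddings are compared by cosine similarity, interpolating on the sphere keeps the drifted concept at unit-consistent scale, whereas a linear mix would shrink toward the origin.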
[608] Audio-Visual World Models: Towards Multisensory Imagination in Sight and Sound
Jiahua Wang, Leqi Zheng, Jialong Wu, Yaoxin Mao
Main category: cs.MM
TL;DR: First formal framework for Audio-Visual World Models (AVWM) that jointly captures binaural spatial audio and visual dynamics with action control, using a novel Conditional Diffusion Transformer architecture and a new 30-hour dataset.
Details
Motivation: Existing world models focus primarily on visual observations, but real-world perception involves multiple sensory modalities. Audio provides crucial spatial and temporal cues (sound source localization, acoustic scene properties) that are largely unexplored in world modeling. No prior work has formally defined audio-visual world models or how to jointly capture binaural spatial audio and visual dynamics under precise action control.
Method: 1) Formal framework for Audio-Visual World Models as partially observable Markov decision process with synchronized audio-visual observations. 2) Constructed AVW-4k dataset with 30 hours of binaural audio-visual trajectories with action annotations across 76 indoor environments. 3) Proposed AV-CDiT (Audio-Visual Conditional Diffusion Transformer) with novel modality expert architecture balancing visual and auditory learning, optimized through three-stage training strategy for effective multimodal integration.
Result: AV-CDiT achieves high-fidelity multimodal prediction across both visual and auditory modalities. Practical validation in continuous audio-visual navigation tasks shows AVWM significantly enhances agent performance.
Conclusion: This work presents the first formal framework for Audio-Visual World Models, demonstrating the importance of integrating audio with visual observations for more comprehensive environment simulation and improved agent performance in multimodal tasks.
Abstract: World models simulate environmental dynamics to enable agents to plan and reason about future states. While existing approaches have primarily focused on visual observations, real-world perception inherently involves multiple sensory modalities. Audio provides crucial spatial and temporal cues such as sound source localization and acoustic scene properties, yet its integration into world models remains largely unexplored. No prior work has formally defined what constitutes an audio-visual world model or how to jointly capture binaural spatial audio and visual dynamics under precise action control. This work presents the first formal framework for Audio-Visual World Models (AVWM), formulating multimodal environment simulation as a partially observable Markov decision process with synchronized audio-visual observations. To address the lack of suitable training data, we construct AVW-4k, a dataset comprising 30 hours of binaural audio-visual trajectories with action annotations across 76 indoor environments. We propose AV-CDiT, an Audio-Visual Conditional Diffusion Transformer with a novel modality expert architecture that balances visual and auditory learning, optimized through a three-stage training strategy for effective multimodal integration. Extensive experiments demonstrate that AV-CDiT achieves high-fidelity multimodal prediction across visual and auditory modalities. Furthermore, we validate its practical utility in continuous audio-visual navigation tasks, where AVWM significantly enhances the agent’s performance.
eess.AS
[609] Universal Speech Content Factorization
Henry Li Xinyuan, Zexin Cai, Lin Zhang, Leibny Paola García-Perera, Berrak Sisman, Sanjeev Khudanpur, Nicholas Andrews, Matthew Wiesner
Main category: eess.AS
TL;DR: USCF is a linear method for extracting low-rank speech representations that suppress speaker timbre while preserving phonetic content, enabling zero-shot voice conversion and timbre-disentangled speech features for text-to-speech models.
Details
Motivation: The paper aims to develop a universal method for disentangling speaker timbre from phonetic content in speech, enabling applications like zero-shot voice conversion and timbre-prompted text-to-speech without requiring extensive target-speaker data or additional neural training.
Method: USCF extends Speech Content Factorization to an open-set setting using least-squares optimization to learn a universal speech-to-content mapping. It derives speaker-specific transformations from only a few seconds of target speech, creating a simple and invertible linear method for extracting low-rank speech representations.
Result: USCF effectively removes speaker-dependent variation according to embedding analysis. As a zero-shot VC system, it achieves competitive intelligibility, naturalness, and speaker similarity compared to methods requiring more target-speaker data. USCF features also work well as acoustic representations for training timbre-prompted text-to-speech models.
Conclusion: USCF provides an effective linear method for speech content factorization that enables practical zero-shot voice conversion and serves as useful timbre-disentangled features for speech generation tasks, with advantages in data efficiency and computational simplicity.
Abstract: We propose Universal Speech Content Factorization (USCF), a simple and invertible linear method for extracting a low-rank speech representation in which speaker timbre is suppressed while phonetic content is preserved. USCF extends Speech Content Factorization, a closed-set voice conversion (VC) method, to an open-set setting by learning a universal speech-to-content mapping via least-squares optimization and deriving speaker-specific transformations from only a few seconds of target speech. We show through embedding analysis that USCF effectively removes speaker-dependent variation. As a zero-shot VC system, USCF achieves competitive intelligibility, naturalness, and speaker similarity compared to methods that require substantially more target-speaker data or additional neural training. Finally, we demonstrate that as a training-efficient timbre-disentangled speech feature, USCF features can serve as the acoustic representation for training timbre-prompted text-to-speech models. Speech samples and code are publicly available.
[610] Trade-offs Between Capacity and Robustness in Neural Audio Codecs for Adversarially Robust Speech Recognition
Jordan Prescott, Thanathai Lertpetchpun, Shrikanth Narayanan
Main category: eess.AS
TL;DR: Adversarial robustness in ASR systems varies non-monotonically with audio codec quantization depth: intermediate depths best balance content preservation and perturbation suppression.
Details
Motivation: Adversarial perturbations exploit ASR vulnerabilities while preserving human-perceived content. Neural audio codecs with discrete bottlenecks can suppress adversarial noise, but the optimal quantization granularity for balancing robustness and content preservation is unclear.
Method: Examine how residual vector quantization (RVQ) depth in neural audio codecs shapes adversarial robustness. Study trade-offs under gradient-based attacks by varying quantization granularity and analyzing adversarial effects on discrete codebook tokens.
Result: Shallow quantization suppresses adversarial perturbations but degrades speech content, while deeper quantization preserves both content and perturbations. Intermediate depths balance these effects and minimize transcription error. Adversarially induced changes in codebook tokens strongly correlate with transcription error.
Conclusion: Neural audio codec configurations, particularly at intermediate quantization depths, provide effective defense against adversarial attacks on ASR systems, outperforming traditional compression methods even under adaptive attacks.
Abstract: Adversarial perturbations exploit vulnerabilities in automatic speech recognition (ASR) systems while preserving human perceived linguistic content. Neural audio codecs impose a discrete bottleneck that can suppress fine-grained signal variations associated with adversarial noise. We examine how the granularity of this bottleneck, controlled by residual vector quantization (RVQ) depth, shapes adversarial robustness. We observe a non-monotonic trade-off under gradient-based attacks: shallow quantization suppresses adversarial perturbations but degrades speech content, while deeper quantization preserves both content and perturbations. Intermediate depths balance these effects and minimize transcription error. We further show that adversarially induced changes in discrete codebook tokens strongly correlate with transcription error. These gains persist under adaptive attacks, where neural codec configurations outperform traditional compression defenses.
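To make the depth knob concrete, here is a toy residual vector quantizer in NumPy. The codebooks are random with an all-zero codeword appended so a stage can never worsen the residual; a real neural codec learns its codebooks, so treat this purely as an illustration of how depth trades reconstruction fidelity.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy residual vector quantizer (RVQ): each stage quantizes the residual
# left by the previous stages, so the number of stages (the "depth")
# controls reconstruction granularity.
def rvq_reconstruct(x, codebooks):
    residual = x.copy()
    recon = np.zeros_like(x)
    for cb in codebooks:
        d = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        idx = d.argmin(axis=1)          # nearest codeword per frame
        recon += cb[idx]
        residual -= cb[idx]
    return recon

dim, n_frames, n_codes, depth = 8, 200, 64, 8
x = rng.normal(size=(n_frames, dim))
# Random codebooks plus a zero codeword so a stage may pass the residual
# through unchanged (a learned codec would train these instead).
codebooks = [
    np.vstack([rng.normal(scale=0.5, size=(n_codes, dim)), np.zeros((1, dim))])
    for _ in range(depth)
]

# Reconstruction error shrinks (or at worst stays flat) as depth grows:
# shallow depth discards detail, deep depth preserves it, including, in
# the adversarial setting the paper studies, the perturbation itself.
errs = [np.mean((x - rvq_reconstruct(x, codebooks[:d])) ** 2) for d in (1, 4, 8)]
print(errs)
```

This is exactly the tension behind the paper's non-monotonic finding: the same fidelity that reconstructs speech content at deep RVQ depths also reconstructs the adversarial perturbation.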
[611] Emotion-Aware Prefix: Towards Explicit Emotion Control in Voice Conversion Models
Haoyuan Yang, Mu Yang, Jiamin Xie, Szu-Jui Chen, John H. L. Hansen
Main category: eess.AS
TL;DR: Emotion-Aware Prefix improves zero-shot voice conversion with explicit emotion control, achieving 85.5% emotion conversion accuracy while preserving linguistic integrity and speaker identity.
Details
Motivation: Current zero-shot voice conversion methods have limited expressive capacity for emotion control, resulting in suboptimal or inconsistent performance. There's a need for better emotion control while maintaining linguistic integrity and speaker identity.
Method: Proposes Emotion-Aware Prefix for explicit emotion control in a two-stage voice conversion backbone. Uses joint control of both sequence modulation and acoustic realization to synthesize distinct emotions.
Result: Doubles baseline Emotion Conversion Accuracy from 42.40% to 85.50% while maintaining linguistic integrity and speech quality without compromising speaker identity. Shows generalizability through comparative analysis.
Conclusion: Joint control of sequence modulation and acoustic realization is essential for emotion synthesis. The method provides insights on acoustic decoupling’s role in maintaining speaker identity during emotion conversion.
Abstract: Recent advances in zero-shot voice conversion have exhibited potential in emotion control, yet the performance is suboptimal or inconsistent due to their limited expressive capacity. We propose Emotion-Aware Prefix for explicit emotion control in a two-stage voice conversion backbone. We significantly improve emotion conversion performance, doubling the baseline Emotion Conversion Accuracy (ECA) from 42.40% to 85.50% while maintaining linguistic integrity and speech quality, without compromising speaker identity. Our ablation study suggests that joint control of both sequence modulation and acoustic realization is essential to synthesize distinct emotions. Furthermore, comparative analysis verifies the generalizability of the proposed method, while providing insights on the role of acoustic decoupling in maintaining speaker identity.
[612] Acoustic and Semantic Modeling of Emotion in Spoken Language
Soumya Dutta
Main category: eess.AS
TL;DR: Thesis on multimodal emotion understanding and generation from speech, focusing on acoustic-semantic joint modeling for emotion recognition and speech-to-speech emotion style transfer.
Details
Motivation: Emotions are crucial for human communication and AI integration into daily life. While emotional expression is multimodal, this work focuses on spoken language to address the challenge of enabling AI systems to reliably understand and generate human emotions through acoustic and semantic information.
Method: Three-part approach: 1) Emotion-aware representation learning through pre-training with acoustic and semantic supervision, including speech-driven supervised pre-training for large-scale emotion-aware text modeling; 2) Hierarchical architectures with cross-modal attention and mixture-of-experts fusion for emotion recognition in conversations; 3) Textless, non-parallel speech-to-speech framework for emotion style transfer with controllable transformations while preserving speaker identity and linguistic content.
Result: Demonstrated improved emotion transfer and showed that style-transferred speech can be effectively used for data augmentation to improve emotion recognition performance.
Conclusion: The thesis advances multimodal emotion understanding and generation from speech through joint acoustic-semantic modeling, with applications in emotion recognition and controllable speech-to-speech emotion style transfer, contributing to more emotionally intelligent AI systems.
Abstract: Emotions play a central role in human communication, shaping trust, engagement, and social interaction. As artificial intelligence systems powered by large language models become increasingly integrated into everyday life, enabling them to reliably understand and generate human emotions remains an important challenge. While emotional expression is inherently multimodal, this thesis focuses on emotions conveyed through spoken language and investigates how acoustic and semantic information can be jointly modeled to advance both emotion understanding and emotion synthesis from speech. The first part of the thesis studies emotion-aware representation learning through pre-training. We propose strategies that incorporate acoustic and semantic supervision to learn representations that better capture affective cues in speech. A speech-driven supervised pre-training framework is also introduced to enable large-scale emotion-aware text modeling without requiring manually annotated text corpora. The second part addresses emotion recognition in conversational settings. Hierarchical architectures combining cross-modal attention and mixture-of-experts fusion are developed to integrate acoustic and semantic information across conversational turns. Finally, the thesis introduces a textless and non-parallel speech-to-speech framework for emotion style transfer that enables controllable emotional transformations while preserving speaker identity and linguistic content. The results demonstrate improved emotion transfer and show that style-transferred speech can be used for data augmentation to improve emotion recognition.
[613] StuPASE: Towards Low-Hallucination Studio-Quality Generative Speech Enhancement
Xiaobin Rong, Jun Gao, Zheng Wang, Mansur Yesilbursa, Kamil Wojcicki, Jing Lu
Main category: eess.AS
TL;DR: StuPASE improves generative speech enhancement by achieving studio-level quality while maintaining low hallucination through dry target finetuning and flow-matching replacement.
Details
Motivation: Current generative speech enhancement methods face a trade-off between perceptual quality and hallucination. PASE is robust to hallucination but has limited perceptual quality under adverse conditions, while other methods may produce hallucinations.
Method: 1) Finetune PASE with dry targets instead of targets containing simulated early reflections to improve dereverberation. 2) Replace the GAN-based generative module in PASE with a flow-matching module to handle strong additive noise and enable studio-quality generation.
Result: StuPASE consistently produces perceptually high-quality speech while maintaining low hallucination, outperforming state-of-the-art speech enhancement methods.
Conclusion: The proposed StuPASE successfully addresses the quality-hallucination trade-off in generative speech enhancement, achieving studio-level quality while retaining low hallucination properties.
Abstract: Achieving high perceptual quality without hallucination remains a challenge in generative speech enhancement (SE). A representative approach, PASE, is robust to hallucination but has limited perceptual quality under adverse conditions. We propose StuPASE, built upon PASE to achieve studio-level quality while retaining its low-hallucination property. First, we show that finetuning PASE with dry targets rather than targets containing simulated early reflections substantially improves dereverberation. Second, to address performance limitations under strong additive noise, we replace the GAN-based generative module in PASE with a flow-matching module, enabling studio-quality generation even under highly challenging conditions. Experiments demonstrate that StuPASE consistently produces perceptually high-quality speech while maintaining low hallucination, outperforming state-of-the-art SE methods. Audio demos are available at: https://xiaobin-rong.github.io/stupase_demo/.
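The flow-matching substitution can be sketched independently of the speech model. Under the common linear-path formulation (an assumption here; the paper's exact parameterization may differ), the network regresses a velocity field whose training target is simply the difference between the endpoint samples:

```python
import numpy as np

rng = np.random.default_rng(5)

# Minimal flow-matching training pair, illustrative only: with a linear
# probability path x_t = (1 - t) * x0 + t * x1, the regression target
# for the velocity field is x1 - x0.
def flow_matching_pair(x0, x1, rng):
    t = rng.uniform(size=(x0.shape[0], 1))   # one random time per sample
    x_t = (1 - t) * x0 + t * x1              # point on the linear path
    v_target = x1 - x0                       # velocity the network regresses
    return t, x_t, v_target

batch, dim = 16, 32
x0 = rng.normal(size=(batch, dim))   # noise sample
x1 = rng.normal(size=(batch, dim))   # stand-in for a clean-speech feature
t, x_t, v = flow_matching_pair(x0, x1, rng)
print(x_t.shape, v.shape)
```

A trained velocity network is then integrated from t=0 to t=1 at inference, which is what allows few-step, deterministic generation compared to a GAN's adversarial training.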
[614] End-to-End Direction-Aware Keyword Spotting with Spatial Priors in Noisy Environments
Rui Wang, Zhifei Zhang, Yu Gao, Xiaofeng Mou, Yi Xu
Main category: eess.AS
TL;DR: End-to-end multi-channel keyword spotting framework using spatial cues for noise robustness, combining spatial encoder with directional priors for improved performance in noisy environments.
Details
Motivation: Robust keyword spotting in noisy environments is challenging with conventional single-channel systems using cascaded pipelines that prevent joint optimization, limiting performance.
Method: Proposes an end-to-end multi-channel KWS framework with spatial encoder learning inter-channel features, spatial embedding injecting directional priors, and fused representation processed by streaming backbone.
Result: Experiments in simulated noisy conditions across multiple SNRs show spatial modeling and directional priors each yield clear gains over baselines, with their combination achieving best results.
Conclusion: Validates end-to-end multi-channel spatial modeling, indicating strong potential for target-speaker-aware detection in complex acoustic scenarios.
Abstract: Keyword spotting (KWS) is crucial for many speech-driven applications, but robust KWS in noisy environments remains challenging. Conventional systems often rely on single-channel inputs and a cascaded pipeline separating front-end enhancement from KWS. This precludes joint optimization, inherently limiting performance. We present an end-to-end multi-channel KWS framework that exploits spatial cues to improve noise robustness. A spatial encoder learns inter-channel features, while a spatial embedding injects directional priors; the fused representation is processed by a streaming backbone. Experiments in simulated noisy conditions across multiple signal-to-noise ratios (SNRs) show that spatial modeling and directional priors each yield clear gains over baselines, with their combination achieving the best results. These findings validate end-to-end multi-channel spatial modeling, indicating strong potential for target-speaker-aware detection in complex acoustic scenarios.
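A minimal example of the kind of spatial cue such a front end can exploit: with two microphones, the inter-channel time delay encodes direction of arrival. The toy below recovers an integer-sample delay by cross-correlation; the signals and geometry are illustrative, not the paper's encoder.

```python
import numpy as np

rng = np.random.default_rng(6)

# Two-microphone toy: mic2 hears the same far-field source `delay`
# samples after mic1. Cross-correlation recovers that delay, the raw
# spatial cue behind inter-channel features and directional priors.
n, delay = 1024, 3
s = rng.normal(size=n + delay)
mic1 = s[delay:delay + n]   # leads by `delay` samples
mic2 = s[:n]                # lags behind mic1

# NumPy convention: correlate(a, v)[k] = sum_n a[n + k] * v[n], so the
# peak of correlate(mic2, mic1) sits at k = +delay.
xcorr = np.correlate(mic2, mic1, mode="full")
lag = int(np.argmax(xcorr)) - (n - 1)
print(lag)
```

An end-to-end spatial encoder learns features of this kind implicitly rather than estimating explicit delays, which is what enables joint optimization with the KWS backbone.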
[615] A Fast Solver for Interpolating Stochastic Differential Equation Diffusion Models for Speech Restoration
Bunlong Lay, Timo Gerkmann
Main category: eess.AS
TL;DR: A fast sampling solver for interpolating Stochastic Differential Equations (iSDEs) that enables efficient speech restoration with as few as 10 neural network evaluations.
Details
Motivation: Existing fast sampling solvers for diffusion models like DPMs don't work for conditional diffusion models like SGMSE+ due to different diffusion processes. SGMSE+ interpolates between target distribution and noisy observation rather than transforming to Gaussian, requiring new fast sampling methods.
Method: Develops a formalism of interpolating Stochastic Differential Equations (iSDEs) that includes SGMSE+, then proposes a specialized solver for iSDEs that enables fast sampling with minimal neural network evaluations.
Result: The proposed solver achieves fast sampling with as few as 10 neural network evaluations across multiple speech restoration tasks, significantly improving computational efficiency.
Conclusion: The iSDE formalism and proposed solver enable efficient fast sampling for conditional diffusion models like SGMSE+, making them more practical for real-world speech restoration applications.
Abstract: Diffusion Probabilistic Models (DPMs) are a well-established class of diffusion models for unconditional image generation, while SGMSE+ is a well-established conditional diffusion model for speech enhancement. One of the downsides of diffusion models is that solving the reverse process requires many evaluations of a large Neural Network. Although advanced fast sampling solvers have been developed for DPMs, they are not directly applicable to models such as SGMSE+ due to differences in their diffusion processes. Specifically, DPMs transform between the data distribution and a standard Gaussian distribution, whereas SGMSE+ interpolates between the target distribution and a noisy observation. This work first develops a formalism of interpolating Stochastic Differential Equations (iSDEs) that includes SGMSE+, and second proposes a solver for iSDEs. The proposed solver enables fast sampling with as few as 10 Neural Network evaluations across multiple speech restoration tasks.
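The interpolating-process idea can be sketched directly. In an SGMSE+-style forward SDE, dx = gamma (y - x) dt + g dW, the mean decays from the clean signal x0 toward the noisy observation y rather than toward pure Gaussian noise. The constants below are illustrative, not the paper's schedule.

```python
import numpy as np

rng = np.random.default_rng(2)

# Euler-Maruyama simulation of a forward interpolating SDE:
#   dx = gamma * (y - x) dt + g dW
gamma, g, T, n_steps, dim = 1.5, 0.1, 1.0, 1000, 4
dt = T / n_steps

x0 = np.ones(dim)    # stand-in "clean" signal
y = -np.ones(dim)    # stand-in "noisy" observation

x = x0.copy()
for _ in range(n_steps):
    x += gamma * (y - x) * dt + g * np.sqrt(dt) * rng.normal(size=dim)

# Closed-form mean at time T interpolates between x0 and y:
#   mu(T) = exp(-gamma*T) * x0 + (1 - exp(-gamma*T)) * y
mean_T = np.exp(-gamma * T) * x0 + (1 - np.exp(-gamma * T)) * y
print(x, mean_T)  # the sample should land near the interpolated mean
```

Because the terminal distribution is centered on the noisy observation instead of a standard Gaussian, DPM-style fast solvers built around the Gaussian endpoint do not apply, which is the gap the iSDE solver fills.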
[616] Speech-Omni-Lite: Portable Speech Interfaces for Vision-Language Models
Dehua Tao, Xuan Luo, Daxin Tan, Kai Chen, Lanqing Hong, Jing Li, Ruifeng Xu, Xiao Chen
Main category: eess.AS
TL;DR: Speech-Omni-Lite: A lightweight framework that adds speech understanding and generation to frozen visual-language backbones using plug-and-play modules, achieving strong performance with minimal speech data.
Details
Motivation: Existing multimodal models require massive data and computational resources. There's a need for cost-efficient methods to extend visual-language models with speech capabilities without compromising their original performance or requiring extensive retraining.
Method: Uses frozen visual-language backbone with two lightweight trainable modules: speech projector and speech token generator. Introduces QTATS data construction strategy to generate Question-Text Answer-Text-Speech data from existing ASR pairs for speech generation training.
Result: Achieves excellent spoken QA performance comparable to models trained on millions of hours of speech data, using only thousands of hours. Speech modules show strong transferability across different VL backbones.
Conclusion: Speech-Omni-Lite provides an efficient, cost-effective approach to extend multimodal models with speech capabilities while preserving original vision-language performance and requiring minimal additional training data.
Abstract: While large-scale omni-models have demonstrated impressive capabilities across various modalities, their strong performance heavily relies on massive multimodal data and incurs substantial computational costs. This work introduces Speech-Omni-Lite, a cost-efficient framework for extending pre-trained Visual-Language (VL) backbones with speech understanding and generation capabilities, while fully preserving the backbones’ vision-language performance. Specifically, the VL backbone is equipped with two lightweight, trainable plug-and-play modules, a speech projector and a speech token generator, while keeping the VL backbone fully frozen. To mitigate the scarcity of spoken QA corpora, a low-cost data construction strategy is proposed to generate Question-Text Answer-Text-Speech (QTATS) data from existing ASR speech-text pairs, facilitating effective speech generation training. Experimental results show that, even with only thousands of hours of speech training data, Speech-Omni-Lite achieves excellent spoken QA performance, which is comparable to omni-models trained on millions of hours of speech data. Furthermore, the learned speech modules exhibit strong transferability across VL backbones.
[617] Finetuning a Text-to-Audio Model for Room Impulse Response Generation
Kirak Kim, Sungyoung Kim
Main category: eess.AS
TL;DR: Fine-tuning text-to-audio models for Room Impulse Response generation using vision-language models to create text-RIR pairs and enabling free-form user prompts through in-context learning.
Details
Motivation: Real-world RIR acquisition is labor-intensive, creating data scarcity for data-driven RIR generation approaches, while existing methods lack the ability to leverage large-scale generative audio priors.
Method: Fine-tune pre-trained text-to-audio models for RIR generation; use vision-language models to extract acoustic descriptions from image-RIR datasets to create text-RIR pairs; implement in-context learning for free-form user prompts during inference.
Result: Model generates plausible RIRs as shown by MUSHRA listening tests and effectively serves as a speech data augmentation tool, improving downstream ASR performance.
Conclusion: Large-scale generative audio priors can be effectively leveraged for RIR generation, and vision-language models provide a viable solution for creating text-RIR paired data where it’s lacking.
Abstract: Room Impulse Responses (RIRs) enable realistic acoustic simulation, with applications ranging from multimedia production to speech data augmentation. However, acquiring high-quality real-world RIRs is labor-intensive, and data scarcity remains a challenge for data-driven RIR generation approaches. In this paper, we propose a novel approach to RIR generation by fine-tuning a pre-trained text-to-audio model, demonstrating for the first time that large-scale generative audio priors can be effectively leveraged for the task. To address the lack of text-RIR paired data, we establish a labeling pipeline utilizing vision-language models to extract acoustic descriptions from existing image-RIR datasets. We introduce an in-context learning strategy to accommodate free-form user prompts during inference. Evaluations involving MUSHRA listening tests and downstream ASR performance demonstrate that our model generates plausible RIRs and serves as an effective tool for speech data augmentation.
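For context on the downstream use: a generated RIR augments speech simply by convolution. The decaying noise burst below is a crude synthetic stand-in for an RIR, not output of the paper's model.

```python
import numpy as np

rng = np.random.default_rng(3)

# Speech data augmentation with an RIR: convolve dry speech with the
# impulse response to simulate the room's acoustics.
sr = 16000
speech = rng.normal(size=sr)                 # 1 s of stand-in "speech"
t = np.arange(int(0.3 * sr)) / sr            # 300 ms impulse response
rir = rng.normal(size=t.size) * np.exp(-t / 0.05)   # exponential decay
rir /= np.abs(rir).max()                     # normalize peak amplitude

reverberant = np.convolve(speech, rir)       # full linear convolution
print(reverberant.shape)
```

Training ASR on such reverberant copies is the augmentation pathway the paper uses to evaluate downstream usefulness of its generated RIRs.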
[618] A Semi-spontaneous Dutch Speech Dataset for Speech Enhancement and Speech Recognition
Dimme de Groot, Yuanyuan Zhang, Jorge Martinez, Odette Scharenborg
Main category: eess.AS
TL;DR: DRES is a 1.5-hour Dutch realistic speech dataset recorded in noisy public environments, used to evaluate speech enhancement and ASR models in real-world conditions.
Details
Motivation: To create a realistic test set for evaluating speech enhancement and automatic speech recognition models in challenging real-world scenarios with background noise and talkers, addressing the gap between controlled lab conditions and actual deployment environments.
Method: Collected 1.5 hours of Dutch semi-spontaneous speech from 80 speakers in noisy public indoor environments using a four-channel linear microphone array, then evaluated five single-channel speech enhancement algorithms and eight state-of-the-art ASR models on this dataset.
Result: Five out of eight ASR models achieved WERs below 22% on the challenging DRES dataset, but contrary to recent work, modern single-channel speech enhancement did not improve ASR performance, highlighting the importance of realistic evaluation conditions.
Conclusion: Realistic evaluation datasets like DRES are crucial for proper assessment of speech technologies, as they reveal that speech enhancement algorithms that work well in controlled conditions may not benefit ASR performance in real-world noisy environments.
Abstract: We present DRES: a 1.5-hour Dutch realistic elicited (semi-spontaneous) speech dataset from 80 speakers recorded in noisy, public indoor environments. DRES was designed as a test set for the evaluation of state-of-the-art (SOTA) automatic speech recognition (ASR) and speech enhancement (SE) models in a real-world scenario: a person speaking in a public indoor space with background talkers and noise. The speech was recorded with a four-channel linear microphone array. In this work we evaluate the speech quality of five well-known single-channel SE algorithms and the recognition performance of eight SOTA off-the-shelf ASR models before and after applying SE on the speech of DRES. We found that five out of the eight ASR models have WERs lower than 22% on DRES, despite the challenging conditions. In contrast to recent work, we did not find a positive effect of modern single-channel SE on ASR performance, emphasizing the importance of evaluating in realistic conditions.
[619] Distributed Multichannel Wiener Filtering for Wireless Acoustic Sensor Networks
Paul Didier, Toon van Waterschoot, Simon Doclo, Jörg Bitzer, Pourya Behmandpoor, Henri Gode, Marc Moonen
Main category: eess.AS
TL;DR: Proposes a distributed multichannel Wiener filter (dMWF) for wireless acoustic sensor networks that is non-iterative and optimal even when nodes observe different sets of sources, outperforming existing iterative methods like DANSE.
Details
Motivation: Existing distributed algorithms for speech signal estimation in wireless acoustic sensor networks are iterative (slow and impractical) and assume all nodes observe the same sources, which is often not the case in practice. A non-iterative, optimal solution that handles differing source observations is needed.
Method: Proposes dMWF algorithm where nodes exchange neighbor-pair-specific, low-dimensional fused signals estimating the contribution of sources observed by both nodes in the pair. Non-iterative approach for fully connected networks.
Result: Formally proves optimality of dMWF and demonstrates in simulated speech enhancement experiments that it outperforms DANSE in objective metrics after short operation times, highlighting benefits of iterationless design.
Conclusion: dMWF provides optimal distributed speech signal estimation without iterations, handles different source observations across nodes, and reduces communication bandwidth while matching centralized performance.
Abstract: In a wireless acoustic sensor network (WASN), devices (i.e., nodes) can collaborate through distributed algorithms to collectively perform audio signal processing tasks. This paper focuses on the distributed estimation of node-specific desired speech signals using network-wide Wiener filtering. The objective is to match the performance of a centralized system that would have access to all microphone signals, while reducing the communication bandwidth usage of the algorithm. Existing solutions, such as the distributed adaptive node-specific signal estimation (DANSE) algorithm, converge towards the multichannel Wiener filter (MWF) which solves a centralized linear minimum mean square error (LMMSE) signal estimation problem. However, they do so iteratively, which can be slow and impractical. Many solutions also assume that all nodes observe the same set of sources of interest, which is often not the case in practice. To overcome these limitations, we propose the distributed multichannel Wiener filter (dMWF) for fully connected WASNs. The dMWF is non-iterative and optimal even when nodes observe different sets of sources. In this algorithm, nodes exchange neighbor-pair-specific, low-dimensional (fused) signals estimating the contribution of sources observed by both nodes in the pair. We formally prove the optimality of dMWF and demonstrate its performance in simulated speech enhancement experiments. The proposed algorithm is shown to outperform DANSE in terms of objective metrics after short operation times, highlighting the benefit of its iterationless design.
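The centralized target that dMWF matches is the classic multichannel Wiener filter, w = Ryy^{-1} r_yd, the LMMSE estimator of the desired signal from the stacked microphone signals. Below is a toy NumPy version with scalar gains and white noise; all values are illustrative, and the distributed exchange of fused signals is not modeled.

```python
import numpy as np

rng = np.random.default_rng(4)

# Centralized multichannel Wiener filter (MWF): the LMMSE estimate of
# the desired signal d from all microphone signals y.
n_mics, n_samples = 4, 20000
d = rng.normal(size=n_samples)                  # desired speech signal
a = rng.normal(size=n_mics)                     # acoustic transfer gains
noise = 0.3 * rng.normal(size=(n_mics, n_samples))
y = a[:, None] * d + noise                      # microphone signals

Ryy = y @ y.T / n_samples                       # spatial covariance
ryd = y @ d / n_samples                         # cross-correlation with d
w = np.linalg.solve(Ryy, ryd)                   # MWF coefficients

d_hat = w @ y                                   # LMMSE estimate of d
err = np.mean((d - d_hat) ** 2)
print(err)  # should be well below the variance of d
```

A distributed algorithm like dMWF reproduces this estimate without any node seeing all microphone signals, by exchanging only low-dimensional fused signals, which is where the bandwidth saving comes from.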
[620] Textless and Non-Parallel Speech-to-Speech Emotion Style Transfer
Soumya Dutta, Avni Jain, Sriram Ganapathy
Main category: eess.AS
TL;DR: S2S-ZEST: A zero-shot speech-to-speech emotion style transfer framework that preserves source content/speaker while transferring reference emotion characteristics using analysis-synthesis pipeline with semantic tokens, speaker representations, and emotion embeddings.
Details
Motivation: Speech-to-speech emotion style transfer requires generating output speech that mimics reference emotion while preserving source content and speaker identity. Existing methods may not effectively handle zero-shot scenarios or maintain proper separation of these attributes.
Method: Analysis-synthesis pipeline: analysis module extracts semantic tokens (content), speaker representations, and emotion embeddings; pitch contour estimator and duration predictor are learned; synthesis module generates speech from these representations; trained with auto-encoding objective for efficient resynthesis.
Result: Improved emotion style transfer performance over prior methods in textless and non-parallel settings; effective content and speaker preservation from source; successful emotion transfer from reference; demonstrated application for data augmentation in emotion recognition tasks.
Conclusion: S2S-ZEST enables effective zero-shot speech emotion style transfer while preserving content and speaker identity, with applications in speech generation and emotion recognition data augmentation.
Abstract: Given a pair of source and reference speech recordings, speech-to-speech (S2S) emotion style transfer involves the generation of an output speech that mimics the emotion characteristics of the reference while preserving the content and speaker attributes of the source. In this paper, we propose a speech-to-speech zero-shot emotion style transfer framework, termed S2S Zero-shot Emotion Style Transfer (S2S-ZEST), that enables the transfer of emotional attributes from the reference to the source while retaining the speaker identity and speech content. The S2S-ZEST framework consists of an analysis-synthesis pipeline in which the analysis module extracts semantic tokens, speaker representations, and emotion embeddings from speech. Using these representations, a pitch contour estimator and a duration predictor are learned. Further, a synthesis module is designed to generate speech based on the input representations and the derived factors. The analysis-synthesis pipeline is trained using an auto-encoding objective to enable efficient resynthesis during inference. For S2S emotion style transfer, the emotion embedding extracted from the reference speech along with the remaining representations from the source speech are used in the synthesis module to generate the style-transferred speech. In our experiments, we evaluate the converted speech on content and speaker preservation (with respect to the source) as well as on the effectiveness of the emotion style transfer (with respect to the reference). The proposed framework demonstrates improved emotion style transfer performance over prior methods in a textless and non-parallel setting. We also illustrate the application of the proposed work for data augmentation in emotion recognition tasks.
[621] Fast-Converging Distributed Signal Estimation in Topology-Unconstrained Wireless Acoustic Sensor Networks
Paul Didier, Toon van Waterschoot, Simon Doclo, Jörg Bitzer, Marc Moonen
Main category: eess.AS
TL;DR: TI-DANSE+ improves distributed signal estimation in wireless acoustic sensor networks by using partial neighbor sums for faster convergence while maintaining topology independence and reducing bandwidth usage.
Details
Motivation: Existing TI-DANSE algorithm for distributed signal estimation in wireless acoustic sensor networks suffers from slow convergence due to limited access to only the in-network sum of all fused signals, limiting its real-world applicability.
Method: Proposes TI-DANSE+ where updating nodes separately use partial in-network sums of fused signals from each neighbor, combined with tree-pruning strategy to maximize available neighbors and degrees of freedom for faster convergence.
Result: TI-DANSE+ achieves convergence speed comparable to original DANSE in fully connected networks while using peer-to-peer transmission instead of broadcasting, preserves convergence under link failures, and reduces communication bandwidth usage.
Conclusion: TI-DANSE+ serves as an all-round alternative that merges advantages of DANSE and TI-DANSE, reconciles their differences, and offers additional benefits in communication efficiency for distributed signal estimation in wireless acoustic sensor networks.
Abstract: This paper focuses on distributed signal estimation in topology-unconstrained wireless acoustic sensor networks (WASNs) where sensor nodes only transmit fused versions of their local sensor signals. For this task, the topology-independent (TI) distributed adaptive node-specific signal estimation (DANSE) algorithm (TI-DANSE) has previously been proposed. It converges towards the centralized signal estimation solution in non-fully connected and time-varying network topologies. However, the applicability of TI-DANSE in real-world scenarios is limited due to its slow convergence. The latter results from the fact that, in TI-DANSE, nodes only have access to the in-network sum of all fused signals in the WASN. We address this low convergence speed by introducing an improved TI-DANSE algorithm, referred to as TI-DANSE+, in which updating nodes separately use the partial in-network sums of fused signals coming from each of their neighbors. Nodes can maximize the number of available degrees of freedom in their local optimization problem, leading to faster convergence. This is further exploited by combining TI-DANSE+ with a tree-pruning strategy that maximizes the number of neighbors at the updating node. In fully connected WASNs, TI-DANSE+ converges as fast as the original DANSE algorithm (the latter only defined for fully connected WASNs) while using peer-to-peer data transmission instead of broadcasting and thus saving communication bandwidth. If link failures occur, the convergence of TI-DANSE+ towards the centralized solution is preserved without any change in its formulation. Altogether, the proposed TI-DANSE+ algorithm can be viewed as an all-round alternative to DANSE and TI-DANSE which (i) merges the advantages of both, (ii) reconciles their differences into a single formulation, and (iii) shows advantages of its own in terms of communication bandwidth usage.
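The convergence argument hinges on what an updating node can see. A toy tree example contrasting the single in-network sum available in TI-DANSE with the per-neighbor partial sums of TI-DANSE+ (scalars stand in for fused sensor signals):

```python
# Toy tree topology: node 0 updates. TI-DANSE gives it one number (the total
# in-network sum); TI-DANSE+ gives it one partial sum per neighbor, i.e. the
# sum over each neighbor's branch, which is where the extra degrees of
# freedom come from.

tree = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}   # adjacency list of a tree
fused = {0: 1.0, 1: 2.0, 2: 4.0, 3: 8.0}        # each node's fused signal

def branch_sum(root, branch_head):
    """Sum fused signals over the branch entered via `branch_head` from `root`."""
    total, stack, seen = 0.0, [branch_head], {root}
    while stack:
        n = stack.pop()
        if n in seen:
            continue
        seen.add(n)
        total += fused[n]
        stack.extend(tree[n])
    return total

node = 0
# TI-DANSE: a single aggregate is all the updating node receives
total_sum = fused[node] + sum(branch_sum(node, nb) for nb in tree[node])
# TI-DANSE+: one partial in-network sum per neighbor
partial_sums = {nb: branch_sum(node, nb) for nb in tree[node]}
```

Tree pruning then reshapes the topology so the updating node has as many neighbors (hence as many partial sums) as possible.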
[622] Human-CLAP: Human-perception-based contrastive language-audio pretraining
Taisei Takano, Yuki Okamoto, Yusuke Kanamori, Yuki Saito, Ryotaro Nagase, Hiroshi Saruwatari
Main category: eess.AS
TL;DR: Human-CLAP improves audio-text relevance evaluation by training CLAP on human subjective scores, achieving 0.25+ SRCC improvement over conventional CLAP
Details
Motivation: CLAPScore is widely used for evaluating text-to-audio relevance but has low correlation with human subjective evaluation scores, limiting its effectiveness as a perceptual metric.
Method: Proposes Human-CLAP by training a contrastive language-audio model using human subjective evaluation scores to align the embedding space with human perception.
Result: Human-CLAP improves Spearman’s rank correlation coefficient between CLAPScore and subjective evaluation scores by more than 0.25 compared to conventional CLAP
Conclusion: Training CLAP on human perceptual data significantly improves its correlation with human judgment, making it a more reliable metric for audio-text relevance evaluation
Abstract: Contrastive language-audio pretraining (CLAP) is widely used for audio generation and recognition tasks. For example, CLAPScore, which utilizes the similarity of CLAP embeddings, has been a major metric for the evaluation of the relevance between audio and text in text-to-audio. However, the relationship between CLAPScore and human subjective evaluation scores remains unclear. We show that CLAPScore has a low correlation with human subjective evaluation scores. Additionally, we propose a human-perception-based CLAP called Human-CLAP by training a contrastive language-audio model using the subjective evaluation score. In our experiments, the results indicate that our Human-CLAP improved the Spearman’s rank correlation coefficient (SRCC) between the CLAPScore and the subjective evaluation scores by more than 0.25 compared with the conventional CLAP.
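The reported gain is in Spearman's rank correlation between CLAPScore and subjective ratings. A minimal pure-Python SRCC (no-ties formula) of the kind one would use to reproduce such a number, with made-up scores:

```python
# Spearman's rank correlation via the no-ties formula:
# rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), where d_i is the rank difference.

def rankdata(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0] * len(xs)
    for r, i in enumerate(order):
        ranks[i] = r + 1
    return ranks

def srcc(xs, ys):
    n = len(xs)
    rx, ry = rankdata(xs), rankdata(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

clap_scores = [0.31, 0.55, 0.12, 0.78, 0.40]  # hypothetical CLAPScores
human_mos = [2.0, 3.5, 1.0, 4.5, 3.0]         # hypothetical subjective scores
rho = srcc(clap_scores, human_mos)            # identical rank orders -> 1.0
```

With tied scores a proper implementation (e.g. `scipy.stats.spearmanr`) averages tied ranks; the short formula above assumes no ties.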
[623] Benchmarking Humans and Machines on Complex Multilingual Speech Understanding Tasks
Sai Samrat Kankanala, Ram Chandra, Sriram Ganapathy
Main category: eess.AS
TL;DR: Humans show better selective auditory attention in their native language than second language, while speech-based LLMs excel in clean speech but struggle with selective attention in multi-speaker settings like humans do.
Details
Motivation: To understand how humans and machines process speech in complex acoustic scenes (cocktail party settings) in multilingual contexts, particularly comparing native vs. second language performance and machine vs. human capabilities.
Method: Proposed a systematic paradigm for studying humans and machines in speech question-answering tasks with clean and mixed-channel speech in multilingual settings, comparing native language (L1) vs. second language (L2) performance.
Result: Human listeners showed significantly better selective attention to target speakers in L1 than L2. Speech-based LLMs matched or exceeded human performance in clean, single-speaker conditions but struggled with selective attention in two-speaker settings.
Conclusion: Humans rely on language-specific attentional cues that are more efficient in native language, while LLMs use parallel information extraction that works well for clean speech but lacks human-like selective attention capabilities in complex acoustic scenes.
Abstract: Auditory attention and selective phase-locking are central to human speech understanding in complex acoustic scenes and cocktail party settings, yet these capabilities in multilingual subjects remain poorly understood. While machine understanding of natural speech has advanced in recent years, questions persist about comprehension of overlapped and mixed-channel speech. We propose a systematic paradigm for studying humans and machines in speech question-answering tasks in multilingual settings with clean and mixed-channel speech. For human listeners, selective attention to a target speaker was significantly better in their native language (L1) than in their second language (L2). For machine listening, speech-based large language models (LLMs) match or exceed human performance in clean, single-speaker conditions but often struggle to selectively attend in two-speaker settings. These results reveal a key divergence: humans rely on attentional cues that are more streamlined in their native language, whereas LLMs default to parallel information extraction, which exceeds human skills.
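The mixed-channel condition in such benchmarks is typically constructed by summing a target and an interfering utterance at a controlled signal-to-interference ratio (SIR). A small sketch with sine tones standing in for speech; the paper's exact mixing protocol is not specified here:

```python
import numpy as np

# Mix a target and an interfering "utterance" at a chosen SIR (in dB) by
# scaling the interferer's power relative to the target's.

def mix_at_sir(target, interferer, sir_db):
    pt = np.mean(target ** 2)
    pi = np.mean(interferer ** 2)
    scale = np.sqrt(pt / (pi * 10 ** (sir_db / 10)))
    scaled = scale * interferer
    return target + scaled, scaled

t = np.linspace(0, 1, 8000)
target = np.sin(2 * np.pi * 220 * t)        # stand-in for the target speaker
interferer = np.sin(2 * np.pi * 330 * t)    # stand-in for the competing speaker
mixture, scaled_int = mix_at_sir(target, interferer, sir_db=5.0)
achieved = 10 * np.log10(np.mean(target ** 2) / np.mean(scaled_int ** 2))
```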
[624] VSSFlow: Unifying Video-conditioned Sound and Speech Generation via Joint Learning
Xin Cheng, Yuyue Wang, Xihua Wang, Yihan Wu, Kaisi Guan, Yijing Chen, Peng Zhang, Xiaojiang Liu, Meng Cao, Ruihua Song
Main category: eess.AS
TL;DR: VSSFlow is a unified flow-matching framework that solves both Video-to-Sound (V2S) and Visual Text-to-Speech (VisualTTS) generation using a Diffusion Transformer with disentangled condition aggregation.
Details
Motivation: Video-conditioned audio generation has been treated as separate tasks (V2S and VisualTTS), leaving the potential for a unified generative framework largely unexplored. The paper aims to bridge this gap.
Method: Proposes VSSFlow, a unified flow-matching framework using a Diffusion Transformer (DiT) architecture with disentangled condition aggregation: cross-attention for semantic conditions and self-attention for temporally-intensive conditions. Also uses feature-level data synthesis for joint sound and speech generation.
Result: Extensive experiments show VSSFlow effectively unifies V2S and VisualTTS tasks, surpasses state-of-the-art domain-specific baselines, and maintains superior performance during end-to-end joint learning, contrary to prevailing beliefs about performance degradation.
Conclusion: VSSFlow demonstrates the critical potential of unified generative models for video-conditioned audio generation, providing a robust foundation that adapts to joint sound and speech generation using synthetic data.
Abstract: Video-conditioned audio generation, including Video-to-Sound (V2S) and Visual Text-to-Speech (VisualTTS), has traditionally been treated as distinct tasks, leaving the potential for a unified generative framework largely underexplored. In this paper, we bridge this gap with VSSFlow, a unified flow-matching framework that seamlessly solves both problems. To effectively handle multiple input signals within a Diffusion Transformer (DiT) architecture, we propose a disentangled condition aggregation mechanism leveraging distinct intrinsic properties of attention layers: cross-attention for semantic conditions, and self-attention for temporally-intensive conditions. Moreover, contrary to the prevailing belief that joint training for the two tasks leads to performance degradation, we demonstrate that VSSFlow maintains superior performance during the end-to-end joint learning process. Furthermore, we use a straightforward feature-level data synthesis method, demonstrating that our framework provides a robust foundation that easily adapts to joint sound and speech generation using synthetic data. Extensive experiments on V2S, VisualTTS and joint generation benchmarks show that VSSFlow effectively unifies these tasks and surpasses state-of-the-art domain-specific baselines, underscoring the critical potential of unified generative models. Project page: https://vasflow1.github.io/vasflow/
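The disentangled condition aggregation can be pictured with toy single-head attention: semantic condition tokens enter through cross-attention, while temporally aligned condition tokens are concatenated with the latents and mixed by self-attention. A numpy sketch with assumed shapes, not the paper's DiT implementation:

```python
import numpy as np

# Two aggregation routes for conditions in a toy transformer block:
# - cross-attention: latents query a short sequence of semantic tokens;
# - self-attention: latents and frame-aligned tokens are concatenated and
#   attend jointly, preserving temporal correspondence.

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

rng = np.random.default_rng(0)
d = 16
latents = rng.normal(size=(10, d))    # audio latent tokens
semantic = rng.normal(size=(4, d))    # global semantic condition (e.g. clip-level video)
temporal = rng.normal(size=(10, d))   # frame-aligned condition (e.g. lip motion)

x = latents + attention(latents, semantic, semantic)  # cross-attention route
seq = np.concatenate([x, temporal], axis=0)           # self-attention route
seq = seq + attention(seq, seq, seq)
out = seq[: latents.shape[0]]         # updated latent tokens only
```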
[625] Evaluating pretrained speech embedding systems for dysarthria detection across heterogenous datasets
Lovisa Wihlborg, Jemima Goodall, David Wheatley, Jacob J. Webber, Johnny Tam, Christine Weaver, Suvankar Pal, Siddharthan Chandran, Sohan Seth, Oliver Watts, Cassia Valentini-Botinhao
Main category: eess.AS
TL;DR: Evaluation of 17 pretrained speech embedding systems for dysarthric speech detection across 6 datasets, analyzing within-dataset and cross-dataset performance with statistical validation against chance.
Details
Motivation: Dysarthric speech datasets are often small, imbalanced, and suffer from recording biases, making reliable detection challenging. There's a need to evaluate how well pretrained speech embeddings generalize across different datasets and conditions for clinical validity.
Method: Evaluated 17 publicly available speech embedding systems across 6 dysarthric speech datasets using cross-validation. Implemented statistical validation against a carefully crafted null hypothesis to ensure results are above chance. Analyzed both within-dataset performance and cross-dataset generalization (training on one dataset, testing on another).
Result: Within-dataset results varied considerably depending on dataset regardless of embedding used. Cross-dataset accuracy was lower than within-dataset, highlighting generalization challenges. Findings raise questions about which datasets should be used for benchmarking and have implications for clinical validity.
Conclusion: Pretrained speech embeddings show variable performance for dysarthric speech detection, with significant dataset dependence and limited cross-dataset generalization. This questions the clinical validity of systems trained and tested on the same dataset alone.
Abstract: We present a comprehensive evaluation of pretrained speech embedding systems for the detection of dysarthric speech using existing accessible data. Dysarthric speech datasets are often small and can suffer from recording biases as well as data imbalance. To address these issues, we selected a range of datasets covering related conditions and adopted several cross-validation runs to estimate the chance level. To certify that results are above chance, we compare the distribution of scores across these runs against the distribution of scores of a carefully crafted null hypothesis. In this manner, we evaluate 17 publicly available speech embedding systems across 6 different datasets, reporting the cross-validation performance on each. We also report cross-dataset results derived when training with one particular dataset and testing with another. We observed that within-dataset results vary considerably depending on the dataset, regardless of the embedding used, raising questions about which datasets should be used for benchmarking. We found that cross-dataset accuracy is, as expected, lower than within-dataset, highlighting challenges in the generalization of the systems. These findings have important implications for the clinical validity of systems trained and tested on the same dataset.
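The chance-level check described above can be sketched as a label-shuffling permutation test: compare the observed accuracy against a null distribution obtained by rerunning the evaluation with shuffled labels. A toy version with a trivial threshold classifier (illustrative only; the paper's null construction may differ):

```python
import random

# On a small, imbalanced dataset a single accuracy number is easy to
# misread; comparing it against a label-shuffled null distribution shows
# whether it is actually above chance.

random.seed(0)
labels = [0] * 30 + [1] * 10                     # imbalanced toy dataset
feats = [l + random.gauss(0, 0.4) for l in labels]

def accuracy(feats, labels):
    preds = [1 if f > 0.5 else 0 for f in feats]
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

observed = accuracy(feats, labels)

null = []
for _ in range(200):                             # null runs: shuffled labels
    shuffled = labels[:]
    random.shuffle(shuffled)
    null.append(accuracy(feats, shuffled))

# empirical p-value: how often a null run matches or beats the observed score
p_value = sum(a >= observed for a in null) / len(null)
```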
[626] WhisperVC: Decoupled Cross-Domain Alignment and Speech Generation for Low-Resource Whisper-to-Normal Conversion
Dong Liu, Juan Liu, Wei Ju, Yao Tian, Ming Li
Main category: eess.AS
TL;DR: Summary unavailable: the arXiv metadata request for 2511.01056 was rate-limited (HTTP 429).
[627] Multiplexing Neural Audio Watermarks
Zheqi Yuan, Yucheng Huang, Guangzhi Sun, Zengrui Jin, Chao Zhang
Main category: eess.AS
TL;DR: Summary unavailable: the arXiv metadata request for 2511.02278 was rate-limited (HTTP 429).
eess.IV
[628] Robust Wildfire Forecasting under Partial Observability: From Reconstruction to Prediction
Chen Yang, Mehdi Zafari, Ziheng Duan, A. Lee Swindlehurst
Main category: eess.IV
TL;DR: Two-stage probabilistic framework for wildfire forecasting under partial observability: Stage-I reconstructs plausible fire maps from corrupted satellite observations using various inpainting models, Stage-II models wildfire dynamics on recovered sequences for spatiotemporal prediction.
Details
Motivation: Satellite-derived fire observations are incomplete due to cloud cover, smoke obscuration, and sensor artifacts, creating a domain gap between clean training data and degraded deployment inputs that leads to unreliable wildfire predictions.
Method: Two-stage probabilistic framework: Stage-I uses conditional inpainting with four architectures (Residual U-Net, Conditional VAE, cross-attention Vision Transformer, discrete diffusion model) to reconstruct fire maps from corrupted observations. Stage-II uses a spatiotemporal forecasting network on the recovered sequences.
Result: Learning-based recovery models substantially outperform non-learning baselines, with MaskCVAE and MaskUNet achieving strongest performance. Reconstruction stage before forecasting significantly mitigates domain gap, restoring next-day prediction accuracy to near-clean-input levels even under severe information loss (10%-80% corruption).
Conclusion: Two-stage approach effectively addresses partial observability in wildfire forecasting by decoupling observation recovery from spatiotemporal prediction, with reconstruction stage crucial for bridging domain gap between clean training data and degraded deployment inputs.
Abstract: Satellite-derived fire observations are the primary input for learning-based wildfire spread prediction, yet they are inherently incomplete due to cloud cover, smoke obscuration, and sensor artifacts. This partial observability introduces a domain gap between the clean data used to train forecasting models and the degraded inputs encountered during deployment, often leading to unreliable predictions. To address this challenge, we formulate wildfire forecasting under partial observability using a two-stage probabilistic framework that decouples observation recovery from spatiotemporal prediction. Stage-I reconstructs plausible fire maps from corrupted observations via conditional inpainting, while Stage-II models wildfire dynamics on the recovered sequences using a spatiotemporal forecasting network. We consider four network architectures for the reconstruction module: a Residual U-Net (MaskUNet), a Conditional VAE (MaskCVAE), a cross-attention Vision Transformer (MaskViT), and a discrete diffusion model (MaskD3PM), spanning CNN-based, latent-variable, attention-based, and diffusion-based approaches. We evaluate the performance of the two-stage approach on the WildfireSpreadTS (WSTS) dataset under various settings, including pixel-wise and block-wise masking, eight corruption levels (10%-80%), four fire scenarios, and leave-one-year-out cross-validation. Results show that all learning-based recovery models substantially outperform non-learning baselines, with MaskCVAE and MaskUNet achieving the strongest overall performance. Importantly, inserting the reconstruction stage before forecasting significantly mitigates the domain gap, restoring next-day prediction accuracy to near-clean-input levels even under severe information loss.
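The decoupling can be summarized as: inpaint the corrupted fire map, then forecast on the recovered map. A deliberately simple numpy stand-in, where neighbour-mean inpainting and a one-step dilation "spread" model replace the paper's learned Stage-I and Stage-II networks:

```python
import numpy as np

# Stage-I: fill unobserved (e.g. cloud-masked) pixels of a binary fire map.
# Stage-II: forecast the next day from the recovered map.

def inpaint(fmap, observed):
    """Fill unobserved pixels with the mean of observed 3x3 neighbours."""
    out = fmap.astype(float).copy()
    h, w = fmap.shape
    for i in range(h):
        for j in range(w):
            if not observed[i, j]:
                vals = [fmap[a, b]
                        for a in range(max(0, i - 1), min(h, i + 2))
                        for b in range(max(0, j - 1), min(w, j + 2))
                        if observed[a, b]]
                out[i, j] = float(np.mean(vals)) if vals else 0.0
    return (out > 0.5).astype(int)

def forecast(fmap):
    """Toy next-day spread: fire expands to 4-connected neighbours."""
    out = fmap.copy()
    out[1:, :] |= fmap[:-1, :]
    out[:-1, :] |= fmap[1:, :]
    out[:, 1:] |= fmap[:, :-1]
    out[:, :-1] |= fmap[:, 1:]
    return out

fire = np.zeros((5, 5), dtype=int)
fire[1:4, 1:4] = 1                      # a 3x3 burning patch
observed = np.ones((5, 5), dtype=bool)
observed[2, 2] = False                  # one cloud-masked pixel
corrupted = fire * observed             # what the satellite reports
recovered = inpaint(corrupted, observed)
next_day = forecast(recovered)          # forecast on the recovered map
```

Forecasting directly on `corrupted` would propagate the missing pixel's error; recovering first is the point of the two-stage design.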
[629] M2Diff: Multi-Modality Multi-Task Enhanced Diffusion Model for MRI-Guided Low-Dose PET Enhancement
Ghulam Nabi Ahmad Hassan Yar, Himashi Peiris, Victoria Mar, Cameron Dennis Pain, Zhaolin Chen
Main category: eess.IV
TL;DR: M2Diff: A multi-modality multi-task diffusion model that separately processes MRI and low-dose PET scans to learn modality-specific features, then fuses them hierarchically to reconstruct standard-dose PET with improved fidelity.
Details
Motivation: Low-dose PET reduces radiation exposure but yields diminished quality. Existing methods use single-task models with multi-modal conditioning (PET/CT or PET/MRI), which may limit extraction of modality-specific features and cause early feature dilution. Challenges remain in effectively leveraging multi-modality inputs for reconstructing diverse features in heterogeneous patient populations.
Method: Proposes M2Diff, a multi-modality multi-task diffusion model that processes MRI and LD PET scans separately to learn modality-specific features, then fuses them via hierarchical feature fusion to reconstruct SD PET. This enables effective integration of complementary structural (MRI) and functional (PET) information.
Result: M2Diff achieves superior qualitative and quantitative performance on both healthy and Alzheimer’s disease brain datasets, demonstrating improved reconstruction fidelity compared to previous approaches.
Conclusion: The proposed multi-modality multi-task diffusion model effectively integrates complementary structural and functional information from separate modality processing, leading to improved SD PET reconstruction from LD PET and MRI inputs.
Abstract: Positron emission tomography (PET) scans expose patients to radiation, which can be mitigated by reducing the dose, albeit at the cost of diminished quality. This makes low-dose (LD) PET recovery an active research area. Previous studies have focused on standard-dose (SD) PET recovery from LD PET scans and/or multi-modal scans, e.g., PET/CT or PET/MRI, using deep learning. While these studies incorporate multi-modal information through conditioning in a single-task model, such approaches may limit the capacity to extract modality-specific features, potentially leading to early feature dilution. Although recent studies have begun incorporating pathology-rich data, challenges remain in effectively leveraging multi-modality inputs for reconstructing diverse features, particularly in heterogeneous patient populations. To address these limitations, we introduce a multi-modality multi-task diffusion model (M2Diff) that processes MRI and LD PET scans separately to learn modality-specific features and fuse them via hierarchical feature fusion to reconstruct SD PET. This design enables effective integration of complementary structural and functional information, leading to improved reconstruction fidelity. We have validated the effectiveness of our model on both healthy and Alzheimer’s disease brain datasets. The M2Diff achieves superior qualitative and quantitative performance on both datasets.
[630] DFPF-Net: Dynamically Focused Progressive Fusion Network for Remote Sensing Change Detection
Chengming Wang, Peng Duan, Jinjiang Li
Main category: eess.IV
TL;DR: DFPF-Net is a change detection method for bi-temporal remote sensing images that combines pyramid vision transformer with dynamic attention mechanisms to handle both global and local noise in change detection.
Details
Motivation: Existing CNN-based methods struggle with pseudo changes across global scales, while transformers handle long-range dependencies but are sensitive to localized noise like building shadows under varying lighting conditions. Need to address both global and local noise influences simultaneously.
Method: Proposes DFPF-Net with two main components: 1) Uses a pyramid vision transformer (PVT) as a weight-shared siamese network with a residual-based progressive enhanced fusion module (PEFM) for multi-level feature fusion, and 2) Dynamic change focus module (DCFM) using attention mechanisms and edge detection to mitigate noise across varying ranges.
Result: Extensive experiments on four datasets show DFPF-Net outperforms mainstream change detection methods.
Conclusion: DFPF-Net effectively addresses both global and local noise challenges in change detection through its transformer-based architecture with dynamic attention mechanisms.
Abstract: Change detection (CD) has extensive applications and is a crucial method for identifying and localizing target changes. In recent years, various CD methods, represented by convolutional neural networks (CNNs) and transformers, have achieved significant success in effectively detecting difference areas in bi-temporal remote sensing images. However, CNNs still exhibit limitations in local feature extraction when confronted with pseudo changes caused by different object types across global scales. Although transformers can effectively detect true change regions due to their long-range dependencies, the shadows cast by buildings under varying lighting conditions can introduce localized noise in these areas. To address these challenges, we propose the dynamically focused progressive fusion network (DFPF-Net) to simultaneously tackle global and local noise influences. On one hand, we utilize a pyramid vision transformer (PVT) as a weight-shared siamese network to implement change detection, efficiently fusing multi-level features extracted from the pyramid structure through a residual-based progressive enhanced fusion module (PEFM). On the other hand, we propose the dynamic change focus module (DCFM), which employs attention mechanisms and edge detection algorithms to mitigate noise interference across varying ranges. Extensive experiments on four datasets demonstrate that DFPF-Net outperforms mainstream CD methods.
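Weight sharing in a siamese backbone means the same encoder is applied to both temporal images, so change evidence can be read off the difference of their embeddings. A minimal numpy sketch with a random linear map in place of the PVT backbone:

```python
import numpy as np

# The SAME projection W encodes both temporal images (weight sharing), so
# identical regions map to identical features and cancel in the difference.
# A random linear map replaces the PVT backbone; image rows act as tokens.

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 32))           # shared encoder weights

def encode(img):
    return img @ W                     # identical weights for both dates

img_t1 = rng.normal(size=(8, 8))
img_t2 = img_t1.copy()
img_t2[2:4, 2:4] += 5.0                # a localized change between the dates

f1, f2 = encode(img_t1), encode(img_t2)
change = np.abs(f1 - f2).sum(axis=1)   # per-row change magnitude
changed_rows = sorted(np.nonzero(change > 1e-9)[0].tolist())
```

Unchanged rows produce exactly zero difference because the encoder is shared; with two independently trained encoders that cancellation would not hold.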
[631] MetaSpectra+: A Compact Broadband Metasurface Camera for Snapshot Hyperspectral+ Imaging
Yuxuan Liu, Wei Xu, Qi Guo
Main category: eess.IV
TL;DR: MetaSpectra+ is a compact multifunctional camera using metasurface-refractive assembly for snapshot HDR+hyperspectral or polarization+hyperspectral imaging across 250nm visible spectrum.
Details
Motivation: To overcome limitations of prior multifunctional metasurface imagers restricted to narrow bands (10-100nm) and develop a compact camera supporting multiple imaging modalities in a single snapshot.
Method: Utilizes a novel metasurface-refractive assembly that splits the incident beam into multiple channels and independently controls each channel’s dispersion, exposure, and polarization for two operating modes.
Result: Achieves shortest total track length and highest reconstruction accuracy on benchmark datasets compared to snapshot hyperspectral imagers; prototype reconstructs high-quality hyperspectral datacubes plus either HDR images or two orthogonal polarization channels.
Conclusion: MetaSpectra+ demonstrates a compact multifunctional imaging system with broad spectral coverage and multiple operating modes, advancing snapshot hyperspectral imaging capabilities.
Abstract: We present MetaSpectra+, a compact multifunctional camera that supports two operating modes: (1) snapshot HDR + hyperspectral or (2) snapshot polarization + hyperspectral imaging. It utilizes a novel metasurface-refractive assembly that splits the incident beam into multiple channels and independently controls each channel’s dispersion, exposure, and polarization. Unlike prior multifunctional metasurface imagers restricted to narrow (10-100 nm) bands, MetaSpectra+ operates over nearly the entire visible spectrum (250 nm). Relative to snapshot hyperspectral imagers, it achieves the shortest total track length and the highest reconstruction accuracy on benchmark datasets. The demonstrated prototype reconstructs high-quality hyperspectral datacubes and either an HDR image or two orthogonal polarization channels from a single snapshot.
[632] CycleULM: A unified label-free deep learning framework for ultrasound localisation microscopy
Su Yan, Clara Rodrigo Gonzalez, Vincent C. H. Leung, Herman Verinaz-Jadan, Jiakang Chen, Matthieu Toulemonde, Kai Riemer, Jipeng Yan, Clotilde Vié, Qingyuan Tan, Peter D. Weinberg, Pier Luigi Dragotti, Kevin G. Murphy, Meng-Xing Tang
Main category: eess.IV
TL;DR: CycleULM is a label-free deep learning framework for ultrasound localization microscopy that uses CycleGAN to translate between real contrast-enhanced ultrasound data and microbubble-only domains, enabling improved localization performance and real-time processing without requiring paired ground truth data.
Details
Motivation: Current ultrasound localization microscopy (ULM) faces challenges in localization performance, data acquisition/processing time, and dependence on scarce in vivo labels or simulators with domain gaps. There's a need for a unified, label-free approach that can work with real data without paired ground truth.
Method: CycleULM uses CycleGAN to learn a physics-emulating translation between real contrast-enhanced ultrasound (CEUS) data and a simplified microbubble-only domain. This label-free approach doesn’t require paired ground truth data and can be deployed as modular components or an end-to-end framework within existing ULM pipelines.
Result: CycleULM improves image contrast by up to 15.3 dB, sharpens resolution with 2.5× reduction in PSF width, improves microbubble localization (+40% recall, +46% precision, -14.0 μm error), and achieves real-time processing at 18.3 FPS with up to 14.5× speed-up.
Conclusion: CycleULM provides a practical pathway toward robust, real-time ultrasound localization microscopy by combining label-free learning, performance enhancement, and computational efficiency, accelerating clinical translation of super-resolution ultrasound.
Abstract: Super-resolution ultrasound via microbubble (MB) localisation and tracking, also known as ultrasound localisation microscopy (ULM), can resolve microvasculature beyond the acoustic diffraction limit. However, significant challenges remain in localisation performance and data acquisition and processing time. Deep learning methods for ULM have shown promise to address these challenges, however, they remain limited by in vivo label scarcity and the simulation-to-reality domain gap. We present CycleULM, the first unified label-free deep learning framework for ULM. CycleULM learns a physics-emulating translation between the real contrast-enhanced ultrasound (CEUS) data domain and a simplified MB-only domain, leveraging the power of CycleGAN without requiring paired ground truth data. With this translation, CycleULM removes dependence on high-fidelity simulators or labelled data, and makes MB localisation and tracking substantially easier. Deployed as modular plug-and-play components within existing pipelines or as an end-to-end processing framework, CycleULM delivers substantial performance gains across both in silico and in vivo datasets. Specifically, CycleULM improves image contrast (contrast-to-noise ratio) by up to 15.3 dB and sharpens CEUS resolution with a 2.5× reduction in the full width at half maximum of the point spread function. CycleULM also improves MB localisation performance, with up to +40% recall, +46% precision, and a -14.0 μm mean localisation error, yielding more faithful vascular reconstructions. Importantly, CycleULM achieves real-time processing throughput at 18.3 frames per second with order-of-magnitude speed-ups (up to ~14.5×). By combining label-free learning, performance enhancement, and computational efficiency, CycleULM provides a practical pathway toward robust, real-time ULM and accelerates its translation to clinical applications.
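The label-free ingredient is CycleGAN-style cycle consistency: mappings G (CEUS to MB-only) and F (MB-only to CEUS) are trained so that F(G(x)) ≈ x without any paired data. A toy linear version of the cycle-consistency check, not the paper's networks:

```python
import numpy as np

# Cycle consistency without paired data: F(G(x)) should return x. A toy
# linear G with its exact inverse as F makes the cycle loss vanish, while a
# perturbed backward map does not.

rng = np.random.default_rng(1)
G = rng.normal(size=(4, 4))            # "CEUS -> MB-only" (toy, linear)
F = np.linalg.inv(G)                   # "MB-only -> CEUS", ideal inverse

x = rng.normal(size=(5, 4))            # unpaired samples from the CEUS domain
cycle_loss = float(np.abs((x @ G) @ F - x).mean())

F_bad = F + 0.5                        # a mismatched backward map
bad_loss = float(np.abs((x @ G) @ F_bad - x).mean())
```

In the actual method both mappings are networks trained jointly with adversarial losses plus this cycle term; the sketch only shows why the cycle term constrains the pair without labels.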
[633] Image Compression Using Novel View Synthesis Priors
Luyuan Peng, Mandar Chitre, Hari Vishnu, Yuen Min Too, Bharath Kalyan, Rajat Mishra, Soo Pieng Tan
Main category: eess.IV
TL;DR: A model-based image compression technique for underwater ROVs using novel view synthesis and gradient descent optimization to achieve high compression ratios while maintaining image quality.
Details
Motivation: Underwater acoustic communication has limited bandwidth, making real-time video transmission impractical for tetherless ROV control during inspection tasks. Existing methods cannot provide sufficient compression while maintaining the visual quality needed for remote operation.
Method: Uses trained machine-learning-based novel view synthesis models to render expected views, then applies gradient descent optimization to refine latent representations that generate compressible differences between actual camera images and rendered images.
Result: Demonstrated superior compression ratios and image quality over existing techniques on an artificial ocean basin dataset. Method shows robustness to introduction of new objects in the scene.
Conclusion: The proposed model-based compression technique enables practical real-time visual feedback for tetherless underwater ROV operations by leveraging prior mission information and novel view synthesis.
Abstract: Real-time visual feedback is essential for tetherless control of remotely operated vehicles, particularly during inspection and manipulation tasks. Though acoustic communication is the preferred choice for medium-range communication underwater, its limited bandwidth renders it impractical to transmit images or videos in real-time. To address this, we propose a model-based image compression technique that leverages prior mission information. Our approach employs trained machine-learning based novel view synthesis models, and uses gradient descent optimization to refine latent representations to help generate compressible differences between camera images and rendered images. We evaluate the proposed compression technique using a dataset from an artificial ocean basin, demonstrating superior compression ratios and image quality over existing techniques. Moreover, our method exhibits robustness to introduction of new objects within the scene, highlighting its potential for advancing tetherless remotely operated vehicle operations.
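The mechanism in the abstract can be sketched in three steps: a renderer shared by both sides, gradient-descent refinement of a latent so the rendered view matches the camera image, and transmission of only the quantized residual. A linear toy renderer stands in for the learned novel-view-synthesis model:

```python
import numpy as np

# Refine a latent z so the shared renderer explains the camera image, then
# quantize and "transmit" only the residual (plus z), which is cheap when the
# renderer's prior fits the scene.

rng = np.random.default_rng(0)
W = rng.normal(size=(100, 8)) / 10     # shared renderer: latent -> image pixels
z_true = rng.normal(size=8)
image = W @ z_true                     # camera image the renderer can explain

z = np.zeros(8)
for _ in range(2000):                  # gradient descent on ||W z - image||^2
    grad = 2 * W.T @ (W @ z - image)
    z -= 0.05 * grad

residual = image - W @ z               # near zero here -> cheap to encode
q = np.round(residual * 50) / 50       # coarse residual quantization
reconstruction = W @ z + q             # what the receiver displays
```

Robustness to new objects follows the same pattern: the unexplained part of the scene simply lands in the residual instead of breaking the pipeline.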
[634] Entropy-and-Channel-Aware Adaptive-Rate Semantic Communication with MLLM-Aided Feature Compensation
Weixuan Chen, Qianqian Yang, Yuhao Chen, Chongwen Huang, Qian Wang, Zehui Xiong, Zhaoyang Zhang
Main category: eess.IV
TL;DR: A semantic communication framework with adaptive rate control using MLLMs for visual feature compensation over MIMO channels
Details
Motivation: Existing semantic communication schemes operate at fixed rates regardless of channel conditions, wasting resources in good channels and degrading performance in poor channels.
Method: Proposes channel-aware semantic coding/decoding with CSI+SNR embedding, uses policy networks for selective feature/symbol transmission, and leverages InternVL3.5’s visual encoder with LoRA fine-tuning for feature compensation.
Result: The system automatically allocates more resources in poor channels to enhance performance and reduces usage in favorable channels, while maintaining high task performance.
Conclusion: The framework achieves finer-grained adaptive rate control than existing methods by combining channel-aware semantic communication with MLLM-based feature compensation
Abstract: Despite the transmission efficiency gains of semantic communication (SemCom) over traditional methods, most existing SemCom schemes still operate at a fixed transmission rate regardless of channel conditions and transmitted content, resulting in wasted resources in favorable channels and degraded performance in harsh channels. To address this issue, we propose a novel SemCom framework that incorporates an entropy-and-channel-aware adaptive rate control mechanism over MIMO Rayleigh fading channels. Specifically, we embed a joint representation of the channel state information (CSI) and the signal-to-noise ratio (SNR) into both the semantic encoder and decoder, thereby realizing channel-aware semantic coding and decoding. Moreover, the proposed method jointly exploits the CSI, the SNR, the feature maps, and their 2D entropy via two policy networks to selectively transmit only a subset of feature maps and, within each selected feature map, only a subset of symbols. Thereby, it achieves finer-grained adaptive rate control than existing methods. At the receiver, leveraging the strong visual understanding capability of multimodal large language models (MLLMs), we deploy the lightweight visual encoder (InternViT-300M) of the pre-trained InternVL3.5 model to compensate for discarded feature maps and symbols, and we fine-tune InternViT using low-rank adaptation (LoRA) for parameter-efficient training. Experimental results show that, with a carefully designed channel-aware loss function, our system automatically allocates more communication resources under poor channels to enhance task performance while reducing resource usage under favorable channels and maintaining high task performance.
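The entropy-and-SNR-driven selection described above can be sketched as a simple heuristic. This is not the paper's learned policy networks: the histogram entropy is a stand-in for the 2D entropy they compute, and the budget schedule (30% of maps at high SNR, 90% at low SNR) is invented to mirror the reported behavior of spending more resources on poor channels:

```python
import numpy as np

def entropy_2d(fmap: np.ndarray, bins: int = 16) -> float:
    """Shannon entropy of a feature map's value histogram."""
    hist, _ = np.histogram(fmap, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def select_feature_maps(feature_maps, snr_db, snr_lo=0.0, snr_hi=20.0):
    """Keep the highest-entropy maps; the budget grows as the channel worsens."""
    # Map SNR to [0, 1]: t = 0 at a clean channel, t = 1 at a harsh one.
    t = np.clip((snr_hi - snr_db) / (snr_hi - snr_lo), 0.0, 1.0)
    budget = int(round((0.3 + 0.6 * t) * len(feature_maps)))
    scores = [entropy_2d(f) for f in feature_maps]
    # Transmit the most informative maps; the MLLM-based decoder is expected
    # to compensate for the dropped, low-entropy ones.
    keep = np.argsort(scores)[::-1][:budget]
    return sorted(keep.tolist())
```

In the actual framework both the map-level and symbol-level decisions are made by trained policy networks conditioned on CSI, SNR, and entropy jointly, rather than by a fixed schedule like this one.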
[635] Exploiting Completeness Perception with Diffusion Transformer for Unified 3D MRI Synthesis
Junkai Liu, Nay Aung, Theodoros N. Arvanitis, Joao A. C. Lima, Steffen E. Petersen, Le Zhang
Main category: eess.IV
TL;DR: CoPeDiT: A latent diffusion model with completeness perception for unified 3D MRI synthesis that can infer missing states without external guidance, addressing missing modalities in multi-modal brain MRI and missing slices in cardiac MRI.
Details
Motivation: Existing methods for missing MRI data synthesis rely on external guidance (manual indicators) which are often unavailable or unreliable in real clinical environments. These explicit masks lack sufficient semantic information to guide consistent synthesis, especially for capturing subtle anatomical and pathological variations.
Method: Proposes CoPeDiT, a general-purpose latent diffusion model with completeness perception. Uses the CoPeVAE tokenizer with dedicated pretext tasks to learn completeness-aware discriminative prompts, and MDiT3D (a specialized diffusion transformer architecture) for 3D MRI synthesis that effectively uses the learned prompts as guidance.
Result: Comprehensive evaluations on three large-scale MRI datasets show CoPeDiT significantly outperforms state-of-the-art methods, achieving superior robustness and yielding high-fidelity, structurally consistent synthesis across diverse missing patterns.
Conclusion: CoPeDiT enables generative models to infer missing states in a self-perceptive manner, better capturing subtle anatomical variations without relying on external guidance, making it more practical for real-world clinical scenarios.
Abstract: Missing data problems, such as missing modalities in multi-modal brain MRI and missing slices in cardiac MRI, pose significant challenges in clinical practice. Existing methods rely on external guidance to supply detailed missing state for instructing generative models to synthesize missing MRIs. However, manual indicators are not always available or reliable in real-world scenarios due to the unpredictable nature of clinical environments. Moreover, these explicit masks are not informative enough to provide guidance for improving semantic consistency. In this work, we argue that generative models should infer and recognize missing states in a self-perceptive manner, enabling them to better capture subtle anatomical and pathological variations. Towards this goal, we propose CoPeDiT, a general-purpose latent diffusion model equipped with completeness perception for unified synthesis of 3D MRIs. Specifically, we incorporate dedicated pretext tasks into our tokenizer, CoPeVAE, empowering it to learn completeness-aware discriminative prompts, and design MDiT3D, a specialized diffusion transformer architecture for 3D MRI synthesis that effectively uses the learned prompts as guidance to enhance semantic consistency in 3D space. Comprehensive evaluations on three large-scale MRI datasets demonstrate that CoPeDiT significantly outperforms state-of-the-art methods, achieving superior robustness and yielding high-fidelity, structurally consistent synthesis across diverse missing patterns.
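One way to read the pretext-task idea is as self-supervised data construction: modalities are dropped at random, and the tokenizer must recover the completeness state from the masked input itself rather than from an external mask. The sketch below shows only that data-construction step, with hypothetical names; the paper does not spell out this exact procedure:

```python
import numpy as np

def make_completeness_pretext_sample(volumes, rng):
    """Randomly drop modalities from a multi-modal sample, returning the
    masked input and a completeness label for the tokenizer to predict."""
    n = len(volumes)
    present = np.zeros(n, dtype=bool)
    # Keep at least one modality so the sample remains informative.
    present[rng.choice(n, size=rng.integers(1, n + 1), replace=False)] = True
    masked = [v if keep else np.zeros_like(v) for v, keep in zip(volumes, present)]
    # `present` is the self-supervised target: the model never sees it as
    # input, which is what makes the perception "self-perceptive".
    return masked, present
```

The same recipe extends to missing slices in cardiac MRI by masking along the slice axis instead of across modalities.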