Editor’s Picks
Top papers matching your research interests in multimodal LLMs, audio and vision understanding/generation.
[1] Audio Hallucination Attacks: Probing the Reliability of Large Audio Language Models
Ashish Seth, Sonal Kumar, Ramaneswaran Selvakumar, Nishit Anand, Utkarsh Tyagi, Prem Seetharaman, Ramani Duraiswami, Dinesh Manocha
Main category: cs.SD
TL;DR: Audio Hallucination Attacks (AHA) framework exposes reliability gaps in Large Audio Language Models by testing if they genuinely ground responses in audio input, achieving high attack success rates on state-of-the-art models.
Details
Motivation: While Large Audio Language Models perform well on standard benchmarks, their reliability in real-world settings remains underexplored, particularly whether they genuinely ground responses in audio input or hallucinate based on question patterns.
Method: Introduces AHA-Eval attack suite with 6.5K QA pairs targeting two attack surfaces: 1) query-based attacks exploiting question structure to induce hallucinations about absent sounds, and 2) audio-based attacks injecting synthetic speech describing non-existent events. Evaluates state-of-the-art LALMs and proposes AHA-Guard, a 120K QA post-alignment dataset for mitigation.
Result: State-of-the-art LALMs like Audio Flamingo 3 and Gemini 3 Pro show high attack success rates of 95.35% and 79.65% respectively, revealing reliability gaps hidden by standard benchmarks. AHA-Guard reduces attack success rates by up to 49%.
Conclusion: Current LALMs have significant reliability issues where they hallucinate rather than ground responses in audio input. The proposed AHA framework exposes these vulnerabilities, and AHA-Guard demonstrates effective mitigation through targeted post-alignment training.
Abstract: Large Audio Language Models (LALMs) achieve strong performance on audio-language tasks; however, their reliability in real-world settings remains underexplored. We introduce Audio Hallucination Attacks (AHA), an attack suite called AHA-Eval, comprising 6.5K QA pairs designed to test whether LALMs genuinely ground their responses in the audio input. AHA targets two attack surfaces: (i) query-based attacks, which exploit question structure to induce hallucinations about absent sounds, and (ii) audio-based attacks, which inject synthetic speech describing non-existent events into the audio stream. Evaluating state-of-the-art LALMs, including Audio Flamingo 3 and Gemini 3 Pro, we observe high attack success rates of 95.35% and 79.65%, respectively, revealing a reliability gap that is hidden by standard benchmark performance. To mitigate this, we propose a 120K QA post-alignment dataset, AHA-Guard, which successfully reduces attack success rates by up to 49%.
Relevance: 9/10
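To make the headline metric concrete, here is a minimal sketch of how an attack success rate over a QA attack suite like AHA-Eval might be computed. The function name, the yes/no answer format, and the toy data are my assumptions for illustration, not the paper's code.

```python
# Hypothetical sketch: scoring an audio-hallucination attack suite.
# An attack "succeeds" when the model affirms an event that is absent
# from the audio. Names and data are illustrative.

def attack_success_rate(responses, ground_truth_absent):
    """responses: dict qid -> model answer ('yes'/'no');
    ground_truth_absent: set of qids whose queried event is absent."""
    successes = sum(
        1 for qid in ground_truth_absent
        if responses.get(qid, "").strip().lower() == "yes"
    )
    return successes / len(ground_truth_absent)

# Toy example: 3 of 4 absent-event questions are hallucinated.
responses = {"q1": "yes", "q2": "yes", "q3": "no", "q4": "Yes"}
asr = attack_success_rate(responses, {"q1", "q2", "q3", "q4"})
print(f"ASR = {asr:.2%}")  # ASR = 75.00%
```

Under this reading, the reported 95.35% for Audio Flamingo 3 means the model affirmed nearly every absent event it was asked about.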
[2] From Natural Alignment to Conditional Controllability in Multimodal Dialogue
Zeyu Jin, Songtao Zhou, Haoyu Wang, Minghao Tian, Kaifeng Yun, Zhuo Chen, Xiaoyu Qin, Jia Jia
Main category: cs.MM
TL;DR: Introduces MM-Dia dataset for multimodal dialogue generation with fine-grained annotations from movies/TV, enabling style-controllable speech synthesis and cross-modal consistency evaluation.
Details
Motivation: Current multimodal dialogue generation methods lack controllability and expressive diversity. Existing datasets insufficiently capture rich human interaction characteristics across speech, vision, and text modalities.
Method: Developed novel multimodal dialogue annotation pipeline to curate dialogues from movies/TV series with fine-grained annotations. Created MM-Dia dataset (360+ hours, 54,700 dialogues) for explicit control and MM-Dia-Bench (309 dialogues) for implicit cross-modal evaluation.
Result: Training on MM-Dia significantly enhances fine-grained controllability in dialogue generation. Current frameworks struggle to replicate nuanced human expressiveness, as revealed by MM-Dia-Bench evaluations.
Conclusion: The work provides new insights and challenges for multimodal conditional dialogue generation, highlighting limitations in current models’ ability to capture complex human interaction expressiveness across modalities.
Abstract: The recent advancement of Artificial Intelligence Generated Content (AIGC) has led to significant strides in modeling human interaction, particularly in the context of multimodal dialogue. While current methods impressively generate realistic dialogue in isolated modalities like speech or vision, challenges remain in controllable Multimodal Dialogue Generation (MDG). This paper focuses on the natural alignment between speech, vision, and text in human interaction, aiming for expressive dialogue generation through multimodal conditional control. To address the insufficient richness and diversity of dialogue expressiveness in existing datasets, we introduce a novel multimodal dialogue annotation pipeline to curate dialogues from movies and TV series with fine-grained annotations in interactional characteristics. The resulting MM-Dia dataset (360+ hours, 54,700 dialogues) facilitates explicitly controlled MDG, specifically through style-controllable dialogue speech synthesis. In parallel, MM-Dia-Bench (309 highly expressive dialogues with visible single-/dual-speaker scenes) serves as a rigorous testbed for implicit cross-modal MDG control, evaluating audio-visual style consistency across modalities. Extensive experiments demonstrate that training on MM-Dia significantly enhances fine-grained controllability, while evaluations on MM-Dia-Bench reveal limitations in current frameworks’ ability to replicate the nuanced expressiveness of human interaction. These findings provide new insights and challenges for multimodal conditional dialogue generation.
Relevance: 9/10
[3] Scaling the Long Video Understanding of Multimodal Large Language Models via Visual Memory Mechanism
Tao Chen, Kun Zhang, Qiong Wu, Xiao Chen, Chao Chang, Xiaoshuai Sun, Yiyi Zhou, Rongrong Ji
Main category: cs.CV
TL;DR: FlexMem is a training-free approach for long video understanding in MLLMs that mimics human video watching behavior using visual memory mechanisms to handle infinite-length videos.
Details
Motivation: Long video understanding is a major challenge for Multimodal Large Language Models (MLLMs) due to input length limitations. Current methods process all video information at once and have upper limits, preventing effective understanding of long videos.
Method: FlexMem uses a visual memory mechanism with dual-pathway compression for memory transfer/writing, and explores different memory reading strategies for diverse video tasks including streaming. It treats visual KV caches as memory sources and mimics human video watching behavior.
Result: On a single 3090 GPU, FlexMem achieves improvements over existing efficient video methods, processes over 1k frames, and helps base MLLMs achieve comparable or better performance than SOTA models like GPT-4o and Gemini-1.5 Pro on some benchmarks.
Conclusion: FlexMem provides an effective training-free solution for long video understanding in MLLMs, enabling infinite-length video processing through visual memory mechanisms and achieving strong performance with efficient resource usage.
Abstract: Long video understanding is a key challenge that plagues the advancement of Multimodal Large Language Models (MLLMs). In this paper, we study this problem from the perspective of a visual memory mechanism and propose a novel, training-free approach termed Flexible Memory (FlexMem). In principle, FlexMem aims to mimic the human behavior of video watching, i.e., continually watching video content and recalling the most relevant memory fragments to answer the question. In this way, FlexMem can help MLLMs achieve video understanding of infinite length, unlike previous methods that process all video information at once and have an input upper limit. Concretely, FlexMem first treats the visual KV caches as the memory sources and realizes effective memory transfer and writing via a dual-pathway compression design. FlexMem then explores different memory reading strategies for diverse video understanding tasks, including the popular streaming setting. To validate FlexMem, we apply it to two popular video-MLLMs and conduct extensive experiments on five long-video tasks and one streaming-video task. The experimental results show that on a single 3090 GPU, FlexMem achieves clear improvements over existing efficient video understanding methods and processes more than 1k frames, which also helps the base MLLMs achieve comparable or even better performance than SOTA MLLMs, e.g., GPT-4o and Gemini-1.5 Pro, on some benchmarks.
Relevance: 9/10
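The write/read loop described above can be sketched in miniature: write a compressed summary of each video chunk into a memory pool, then recall the fragments most relevant to a question. This is my toy rendering of the idea, not the paper's implementation; the mean-pooled vectors stand in for the dual-pathway KV-cache compression, and all names and data are illustrative.

```python
# Toy FlexMem-style memory: write compressed chunk summaries,
# then read back the most question-relevant fragments.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class VisualMemory:
    def __init__(self, top_k=2):
        self.fragments = []  # list of (summary_vector, payload)
        self.top_k = top_k

    def write(self, chunk_vectors, payload):
        # Stand-in for dual-pathway compression: mean-pool the chunk.
        dim = len(chunk_vectors[0])
        pooled = [sum(v[i] for v in chunk_vectors) / len(chunk_vectors)
                  for i in range(dim)]
        self.fragments.append((pooled, payload))

    def read(self, query_vector):
        # Recall the memory fragments most similar to the question.
        ranked = sorted(self.fragments,
                        key=lambda f: cosine(f[0], query_vector),
                        reverse=True)
        return [payload for _, payload in ranked[:self.top_k]]

mem = VisualMemory(top_k=1)
mem.write([[1.0, 0.0], [0.9, 0.1]], "chunk about a dog")
mem.write([[0.0, 1.0], [0.1, 0.9]], "chunk about a car")
print(mem.read([0.0, 1.0]))  # ['chunk about a car']
```

Because each chunk is compressed at write time and only the top fragments are read back, memory cost stays bounded no matter how many chunks stream in, which is the property that enables "infinite-length" viewing.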
Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 90]
- cs.CV [Total: 234]
- cs.AI [Total: 131]
- cs.SD [Total: 7]
- cs.LG [Total: 136]
- cs.MA [Total: 4]
- cs.MM [Total: 3]
- eess.AS [Total: 3]
- eess.IV [Total: 5]
cs.CL
[1] OptiMer: Optimal Distribution Vector Merging Is Better than Data Mixing for Continual Pre-Training
Haiyue Song, Masao Utiyama
Main category: cs.CL
TL;DR: OptiMer enables post-hoc optimization of data mixture ratios for continual pre-training by extracting distribution vectors from per-dataset models and using Bayesian optimization to find optimal composition weights.
Details
Motivation: Traditional continual pre-training requires fixing data mixture ratios before training, which is expensive to tune and suboptimal choices waste significant compute resources. There's a need for a more flexible approach to optimize these ratios efficiently.
Method: Train one CPT model per dataset, extract each model’s distribution vector (representing parameter shift), then search for optimal composition weights post-hoc using Bayesian optimization. The same vector pool can be re-optimized for different objectives without retraining.
Result: OptiMer consistently outperforms data mixture and model averaging baselines with 15-35 times lower search cost. Optimized weights can be interpreted as data mixture ratios, and retraining with these ratios improves data mixture CPT. The method works across languages (Japanese, Chinese) and domains (Math, Code) on Gemma 3 27B.
Conclusion: Data mixture ratio selection, traditionally a pre-training decision, can be reformulated as post-hoc optimization over distribution vectors, offering a more flexible paradigm for continual pre-training that reduces computational costs and enables target-tailored models on demand.
Abstract: Continual pre-training is widely used to adapt LLMs to target languages and domains, yet the mixture ratio of training data remains a sensitive hyperparameter that is expensive to tune: it must be fixed before training begins, and a suboptimal choice can waste weeks of compute. In this work, we propose OptiMer, which decouples ratio selection from training: we train one CPT model per dataset, extract each model’s distribution vector, which represents the parameter shift induced by that dataset, and search for optimal composition weights post-hoc via Bayesian optimization. Experiments on Gemma 3 27B across languages (Japanese, Chinese) and domains (Math, Code) show that OptiMer consistently outperforms data mixture and model averaging baselines with 15-35 times lower search cost. Key findings reveal that 1) the optimized weights can be interpreted as data mixture ratios, and retraining with these ratios improves data mixture CPT, and 2) the same vector pool can be re-optimized for a given objective without any retraining, producing target-tailored models on demand. Our work establishes that data mixture ratio selection, traditionally a pre-training decision, can be reformulated as a post-hoc optimization over distribution vectors, offering a more flexible paradigm for continual pre-training.
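The merging step can be pictured with a toy model: each dataset contributes a distribution vector (the parameter delta its CPT run induced relative to the base), and the merged model is the base plus a weighted sum of deltas. The exhaustive grid search and the toy objective below are my stand-ins for the paper's Bayesian optimization and downstream evaluation; all numbers are illustrative.

```python
# Sketch of post-hoc distribution-vector merging (illustrative only).
import itertools

base = [1.0, 1.0]                                   # base parameters
deltas = {"ja": [0.5, -0.2], "math": [-0.1, 0.6]}   # per-dataset shifts

def merge(weights):
    # merged = base + sum_i w_i * delta_i  (no retraining needed)
    merged = list(base)
    for name, delta in deltas.items():
        for i, d in enumerate(delta):
            merged[i] += weights[name] * d
    return merged

def objective(params):
    # Toy proxy for validation score: prefer params near [1.25, 1.2].
    return -((params[0] - 1.25) ** 2 + (params[1] - 1.2) ** 2)

grid = [i / 10 for i in range(11)]
best = max(
    ({"ja": wa, "math": wb} for wa, wb in itertools.product(grid, grid)),
    key=lambda w: objective(merge(w)),
)
print(best)  # {'ja': 0.6, 'math': 0.5}
```

The key property this illustrates is that only `merge` and `objective` run inside the search loop; no gradient step is taken, which is why re-optimizing the same vector pool for a new objective is cheap.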
[2] From Consensus to Split Decisions: ABC-Stratified Sentiment in Holocaust Oral Histories
Daban Q. Jaff
Main category: cs.CL
TL;DR: Analysis of sentiment classifier performance on Holocaust oral histories reveals low inter-model agreement, primarily due to boundary decisions around neutrality, with proposed taxonomy for characterizing model divergence.
Details
Motivation: Polarity detection faces challenges under domain shift, especially in complex, long-form narratives like Holocaust oral histories. The paper aims to diagnose how off-the-shelf sentiment classifiers perform on such sensitive historical narratives.
Method: Used three pretrained transformer-based polarity classifiers on 107,305 utterances and 579,013 sentences from Holocaust oral histories. Introduced agreement-based stability taxonomy (ABC) to stratify inter-model output stability, measured agreement metrics, and applied T5-based emotion classifier to stratified samples.
Result: Inter-model agreement was low to moderate overall, driven primarily by boundary decisions around neutrality. The combination of multi-model label triangulation and ABC taxonomy provides a framework for characterizing where and how sentiment models diverge.
Conclusion: The study provides a cautious operational framework for understanding sentiment model performance on sensitive historical narratives, highlighting the challenges of domain shift and the importance of characterizing model disagreement patterns.
Abstract: Polarity detection becomes substantially more challenging under domain shift, particularly in heterogeneous, long-form narratives with complex discourse structure, such as Holocaust oral histories. This paper presents a corpus-scale diagnostic study of off-the-shelf sentiment classifiers on long-form Holocaust oral histories, using three pretrained transformer-based polarity classifiers on a corpus of 107,305 utterances and 579,013 sentences. After assembling model outputs, we introduce an agreement-based stability taxonomy (ABC) to stratify inter-model output stability. We report pairwise percent agreement, Cohen kappa, Fleiss kappa, and row-normalized confusion matrices to localize systematic disagreement. As an auxiliary descriptive signal, a T5-based emotion classifier is applied to stratified samples from each agreement stratum to compare emotion distributions across strata. The combination of multi-model label triangulation and the ABC taxonomy provides a cautious, operational framework for characterizing where and how sentiment models diverge in sensitive historical narratives. Inter-model agreement is low to moderate overall and is driven primarily by boundary decisions around neutrality.
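An agreement-based stratification of this kind can be sketched in a few lines. The exact stratum definitions below are my assumption rather than the paper's (I read A/B/C as unanimous / two-of-three / three-way split over the three classifiers' labels):

```python
# Assumed ABC-style strata over three classifiers' polarity labels:
# A = all three agree, B = two agree, C = full three-way split.
from collections import Counter

def stratum(labels):
    top = Counter(labels).most_common(1)[0][1]  # size of largest bloc
    return {3: "A", 2: "B", 1: "C"}[top]

predictions = [
    ("neg", "neg", "neg"),   # stable negative
    ("neu", "neg", "neu"),   # boundary decision around neutrality
    ("pos", "neu", "neg"),   # full disagreement
]
print([stratum(p) for p in predictions])  # ['A', 'B', 'C']
```

The second toy row shows the pattern the paper highlights: disagreement concentrated at the boundary between neutral and a polar label, which lands the item in a less stable stratum.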
[3] CrossTrace: A Cross-Domain Dataset of Grounded Scientific Reasoning Traces for Hypothesis Generation
Andrew Bouras, OMS-II Research Fellow
Main category: cs.CL
TL;DR: CrossTrace is a new dataset of 1,389 grounded scientific reasoning traces across biomedical, AI/ML, and cross-domain research for training and evaluating hypothesis generation models with explicit step-by-step reasoning chains.
Details
Motivation: Existing datasets for hypothesis generation models are limited to single domains and lack explicit reasoning traces connecting prior knowledge to novel contributions, creating a bottleneck in accelerating scientific research.
Method: Created CrossTrace dataset with 1,389 reasoning traces using an Input/Trace/Output schema extending Bit-Flip-Spark framework, including step-level verification and taxonomy of eight discovery patterns. Fine-tuned Qwen2.5-7B-Instruct on CrossTrace via QLoRA.
Result: Fine-tuning substantially improved performance: IAScore rose from 0.828 to 0.968 (GPT-4o judge), structural compliance improved from 0% to 100%, and spark cosine similarity increased from 0.221 to 0.620. Balanced cross-domain training outperformed single-domain training.
Conclusion: CrossTrace is the first large-scale, cross-domain dataset with step-level grounded reasoning traces for hypothesis generation, demonstrating that such traces are an effective training signal with domain-general benefits.
Abstract: Scientific hypothesis generation is a critical bottleneck in accelerating research, yet existing datasets for training and evaluating hypothesis-generating models are limited to single domains and lack explicit reasoning traces connecting prior knowledge to novel contributions. I introduce CrossTrace, a dataset of 1,389 grounded scientific reasoning traces spanning biomedical research (518), AI/ML (605), and cross-domain work (266). Each trace captures the structured reasoning chain from established knowledge through intermediate logical steps to a novel hypothesis, with every step grounded in source paper text. I define an Input/Trace/Output schema that extends the Bit-Flip-Spark framework of HypoGen with step-level verification, a taxonomy of eight discovery patterns, and multi-domain coverage. Fine-tuning Qwen2.5-7B-Instruct on CrossTrace via QLoRA yields substantial improvements over the untuned baseline: IAScore rises from 0.828 to 0.968 (GPT-4o judge) and from 0.716 to 0.888 (Claude Opus 4.5), structural compliance improves from 0% to 100%, and spark cosine similarity increases from 0.221 to 0.620. Balanced cross-domain training (biomedical + AI/ML + CS) outperforms single-domain training, providing evidence that scientific reasoning patterns transfer across disciplines. Human validation of 150 stratified records confirms 99.7% step-level grounding accuracy and a 0.0% fabrication rate. To my knowledge, CrossTrace is the first large-scale, cross-domain dataset with step-level grounded reasoning traces for hypothesis generation, and my results demonstrate that such traces are an effective training signal whose benefits are at least partially domain-general.
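The Input/Trace/Output schema might look roughly like the record below. The field names and the grounding check are my illustrative reconstruction from the summary, not the released dataset's actual schema.

```python
# Hypothetical rendering of an Input/Trace/Output reasoning record.
from dataclasses import dataclass, field

@dataclass
class ReasoningTrace:
    domain: str                     # e.g. "biomedical", "ai_ml"
    input_context: str              # established prior knowledge
    steps: list = field(default_factory=list)  # intermediate steps
    output_hypothesis: str = ""     # the novel "spark"

    def is_grounded(self):
        # Step-level verification stand-in: every step must carry
        # a pointer back to source-paper text.
        return all(step.get("evidence") for step in self.steps)

trace = ReasoningTrace(
    domain="biomedical",
    input_context="Protein X regulates pathway Y.",
    steps=[{"text": "Pathway Y is disrupted in disease Z.",
            "evidence": "Sec. 2, para. 3"}],
    output_hypothesis="Modulating X may slow disease Z.",
)
print(trace.is_grounded())  # True
```

The point of the check is the dataset's headline property: a step without an evidence pointer fails verification, which is what the reported 99.7% step-level grounding accuracy measures.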
[4] Covertly improving intelligibility with data-driven adaptations of speech timing
Paige Tuttösí, Angelica Lim, H. Henny Yeung, Yue Wang, Jean-Julien Aucouturier
Main category: cs.CL
TL;DR: Systematic study shows targeted speech-rate adjustments improve comprehension for both native and non-native listeners, while global slowing actually increases errors despite being perceived as clearer.
Details
Motivation: To understand whether speech rate adjustments actually improve intelligibility for listeners with comprehension challenges (hard-of-hearing, non-native speakers), and to develop data-driven methods for improving machine-generated speech accessibility.
Method: Used reverse-correlation experiments to analyze temporal effects of speech rate on vowel perception, tested comprehension across native and L2 English listeners, and built a data-driven text-to-speech algorithm that replicates optimal temporal structures.
Result: Found scissor-like temporal pattern in speech rate effects (opposite effects in early vs late context windows) that is stable across listener groups. Targeted slowing improved comprehension while global slowing increased errors. Participants misjudged global slowing as clearer despite worse performance.
Conclusion: Targeted speech-rate adjustments significantly aid intelligibility under challenging conditions while often going unnoticed. Provides data-driven methodology to improve machine-generated speech accessibility that can be extended to other aspects of speech comprehension.
Abstract: Human talkers often address listeners with language-comprehension challenges, such as hard-of-hearing or non-native adults, by globally slowing down their speech. However, it remains unclear whether this strategy actually makes speech more intelligible. Here, we take advantage of recent advancements in machine-generated speech allowing more precise control of speech rate in order to systematically examine how targeted speech-rate adjustments may improve comprehension. We first use reverse-correlation experiments to show that the temporal influence of speech rate prior to a target vowel contrast (ex. the tense-lax distinction) in fact manifests in a scissor-like pattern, with opposite effects in early versus late context windows; this pattern is remarkably stable both within individuals and across native L1-English listeners and L2-English listeners with French, Mandarin, and Japanese L1s. Second, we show that this speech rate structure not only facilitates L2 listeners’ comprehension of the target vowel contrast, but that native listeners also rely on this pattern in challenging acoustic conditions. Finally, we build a data-driven text-to-speech algorithm that replicates this temporal structure on novel speech sequences. Across a variety of sentences and vowel contrasts, listeners remained unaware that such targeted slowing improved word comprehension. Strikingly, participants instead judged the common strategy of global slowing as clearer, even though it actually increased comprehension errors. Together, these results show that targeted adjustments to speech rate significantly aid intelligibility under challenging conditions, while often going unnoticed. More generally, this paper provides a data-driven methodology to improve the accessibility of machine-generated speech which can be extended to other aspects of speech comprehension and a wide variety of listeners and environments.
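For readers unfamiliar with reverse correlation, the core computation is simple: average the random speech-rate profiles that preceded each response category and take the difference. This is my sketch of the general technique, not the authors' pipeline; the two-window profiles and data are toy values chosen to show the early/late sign flip behind the "scissor" pattern.

```python
# Toy reverse-correlation kernel over two context windows.

def classification_kernel(trials):
    """trials: list of (rate_profile, response) pairs."""
    def mean_profile(resp):
        profs = [p for p, r in trials if r == resp]
        return [sum(vals) / len(vals) for vals in zip(*profs)]
    tense, lax = mean_profile("tense"), mean_profile("lax")
    # Difference of means = the listener's decision "kernel".
    return [t - l for t, l in zip(tense, lax)]

trials = [
    ([+0.4, -0.3], "tense"), ([+0.2, -0.5], "tense"),
    ([-0.4, +0.3], "lax"),   ([-0.2, +0.5], "lax"),
]
kernel = classification_kernel(trials)
print(kernel)  # positive early window, negative late window
```

A kernel with opposite signs in the early and late windows is exactly the scissor-like shape the paper reports, and it is what their TTS algorithm then reproduces on novel sentences.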
[5] Theory of Mind and Self-Attributions of Mentality are Dissociable in LLMs
Junsol Kim, Winnie Street, Roberta Rocca, Daine M. Korngiebel, Adam Waytz, James Evans, Geoff Keeling
Main category: cs.CL
TL;DR: Safety fine-tuning in LLMs suppresses mind-attribution tendencies but doesn’t degrade Theory of Mind capabilities, though it reduces attribution of mind to non-human animals and spiritual beliefs.
Details
Motivation: To investigate whether safety fine-tuning that suppresses LLMs' tendency to attribute mind to themselves (like claiming consciousness or emotions) also degrades related socio-cognitive abilities like Theory of Mind, and to understand the broader impacts on how models perceive non-human minds.
Method: Used safety ablation experiments and mechanistic analyses of representational similarity to examine behavioral and mechanistic dissociations between mind-attribution tendencies and Theory of Mind capabilities in LLMs.
Result: LLM attributions of mind to themselves and technological artifacts are behaviorally and mechanistically dissociable from ToM capabilities. However, safety fine-tuned models under-attribute mind to non-human animals relative to human baselines and are less likely to exhibit spiritual belief.
Conclusion: Safety fine-tuning successfully suppresses problematic mind-attribution without harming ToM abilities, but it also suppresses widely shared human perspectives about non-human minds, raising questions about alignment with human values regarding consciousness attribution.
Abstract: Safety fine-tuning in Large Language Models (LLMs) seeks to suppress potentially harmful forms of mind-attribution such as models asserting their own consciousness or claiming to experience emotions. We investigate whether suppressing mind-attribution tendencies degrades intimately related socio-cognitive abilities such as Theory of Mind (ToM). Through safety ablation and mechanistic analyses of representational similarity, we demonstrate that LLM attributions of mind to themselves and to technological artefacts are behaviorally and mechanistically dissociable from ToM capabilities. Nevertheless, safety fine-tuned models under-attribute mind to non-human animals relative to human baselines and are less likely to exhibit spiritual belief, suppressing widely shared perspectives regarding the distribution and nature of non-human minds.
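The representational-similarity style of analysis mentioned above can be sketched generically: build a stimulus-by-stimulus similarity matrix from each capability's activations and correlate the two matrices; a low correlation is evidence the representations are dissociable. This is a textbook RSA sketch with toy vectors, assumed by me; the paper's actual mechanistic analysis may differ.

```python
# Generic representational-similarity comparison (toy data).
import math

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def similarity_matrix(acts):
    return [[cos(a, b) for b in acts] for a in acts]

def flatten_upper(m):
    # Keep only the upper triangle (pairwise similarities).
    n = len(m)
    return [m[i][j] for i in range(n) for j in range(i + 1, n)]

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    vy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (vx * vy)

# Toy activations for the same stimuli under two "capabilities".
acts_self_attribution = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
acts_tom = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
r = pearson(flatten_upper(similarity_matrix(acts_self_attribution)),
            flatten_upper(similarity_matrix(acts_tom)))
print(round(r, 3))
```

Here the two toy geometries disagree, so the similarity structures anti-correlate, the kind of signal that would support a "mechanistically dissociable" claim.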
[6] Habibi: Laying the Open-Source Foundation of Unified-Dialectal Arabic Speech Synthesis
Yushen Chen, Junzhe Liu, Yujie Tu, Zhikang Niu, Yuzhe Liang, Chunyu Qiang, Chen Zhang, Kai Yu, Xie Chen
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2601.13802 returned HTTP 429 (rate limited), so no AI-generated summary could be produced.
Details
Abstract: Failed to fetch summary for 2601.13802: page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.13802&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[7] Known Intents, New Combinations: Clause-Factorized Decoding for Compositional Multi-Intent Detection
Abhilash Nandy
Main category: cs.CL
TL;DR: The paper introduces CoMIX-Shift, a benchmark for testing compositional generalization in multi-intent detection, and presents ClauseCompose, a lightweight decoder that outperforms baselines on novel intent combinations.
Details
Motivation: Existing multi-intent detection benchmarks weakly test compositional generalization because train and test sets often share the same co-occurrence patterns. The authors aim to create a more challenging benchmark that stresses the ability to recover new combinations of familiar intents.
Method: Introduces CoMIX-Shift benchmark with controlled compositional generalization tests through held-out intent pairs, discourse-pattern shift, longer/noisier wrappers, held-out clause templates, and zero-shot triples. Presents ClauseCompose, a lightweight decoder trained only on singleton intents, and compares it to whole-utterance baselines including a fine-tuned BERT model.
Result: ClauseCompose achieves 95.7 exact match on unseen intent pairs, 93.9 on discourse-shifted pairs, 62.5 on longer/noisier pairs, 49.8 on held-out templates, and 91.1 on unseen triples, significantly outperforming baselines. On a manually authored SNIPS-style set, it reaches 97.5 exact match on unseen pairs and 86.7 under connector shift.
Conclusion: Multi-intent detection needs more compositional evaluation, and simple factorization approaches like ClauseCompose can perform surprisingly well when evaluation properly tests compositional generalization.
Abstract: Multi-intent detection papers usually ask whether a model can recover multiple intents from one utterance. We ask a harder and, for deployment, more useful question: can it recover new combinations of familiar intents? Existing benchmarks only weakly test this, because train and test often share the same broad co-occurrence patterns. We introduce CoMIX-Shift, a controlled benchmark built to stress compositional generalization in multi-intent detection through held-out intent pairs, discourse-pattern shift, longer and noisier wrappers, held-out clause templates, and zero-shot triples. We also present ClauseCompose, a lightweight decoder trained only on singleton intents, and compare it to whole-utterance baselines including a fine-tuned tiny BERT model. Across three random seeds, ClauseCompose reaches 95.7 exact match on unseen intent pairs, 93.9 on discourse-shifted pairs, 62.5 on longer/noisier pairs, 49.8 on held-out templates, and 91.1 on unseen triples. WholeMultiLabel reaches 81.4, 55.7, 18.8, 15.5, and 0.0; the BERT baseline reaches 91.5, 77.6, 48.9, 11.0, and 0.0. We also add a 240-example manually authored SNIPS-style compositional set with five held-out pairs; there, ClauseCompose reaches 97.5 exact match on unseen pairs and 86.7 under connector shift, compared with 41.3 and 10.4 for WholeMultiLabel. The results suggest that multi-intent detection needs more compositional evaluation, and that simple factorization goes surprisingly far once evaluation asks for it.
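As I understand the summary, clause-factorized decoding amounts to: split the utterance on discourse connectors, run a singleton-intent classifier per clause, and take the union. The sketch below follows that shape; the keyword classifier, intent names, and connector list are toy stand-ins, not the paper's trained components.

```python
# Minimal clause-factorized decoding sketch (illustrative only).
import re

def singleton_intent(clause):
    # Toy stand-in for a classifier trained on singleton intents.
    rules = {"weather": "weather", "play": "play_music",
             "book": "book_restaurant"}
    for kw, intent in rules.items():
        if kw in clause.lower():
            return intent
    return None

def clause_compose(utterance):
    # Factorize on discourse connectors and punctuation.
    clauses = re.split(r"\b(?:and|then|also)\b|[;,]", utterance)
    intents = {singleton_intent(c) for c in clauses if c.strip()}
    intents.discard(None)
    return sorted(intents)

print(clause_compose("Check the weather in Paris and then play some jazz"))
# ['play_music', 'weather']
```

The compositional-generalization payoff falls out of the factorization: the decoder never needs to have seen the pair (weather, play_music) together at training time, only each intent alone.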
[8] Human-Like Lifelong Memory: A Neuroscience-Grounded Architecture for Infinite Interaction
Diego C. Lerma-Torres
Main category: cs.CL
TL;DR: A bio-inspired memory framework for LLMs based on cognitive science principles, featuring valence-based memory organization, dual-process retrieval, and active encoding to improve long-term interaction and reduce reasoning degradation.
Details
Motivation: Current LLMs lack persistent, structured memory for long-term interaction, and simply expanding context windows degrades reasoning performance. There's a need for memory systems inspired by human cognition that can maintain context without sacrificing reasoning quality.
Method: Proposes a memory framework based on complementary learning systems theory, cognitive behavioral therapy’s belief hierarchy, dual-process cognition, and fuzzy-trace theory. Features three principles: (1) valence-based memory with emotional-associative summaries, (2) dual-process retrieval (System 1 default with System 2 escalation), and (3) active, feedback-dependent encoding with thalamic gateway routing.
Result: The framework specifies seven functional properties for implementation. Over time, the system converges toward System 1 processing (computational analog of clinical expertise), making interactions cheaper with experience rather than more expensive.
Conclusion: A bio-inspired memory framework addresses LLM memory limitations by incorporating cognitive science principles, enabling more efficient long-term interactions and reducing reasoning degradation compared to simple context window expansion.
Abstract: Large language models lack persistent, structured memory for long-term interaction and context-sensitive retrieval. Expanding context windows does not solve this: recent evidence shows that context length alone degrades reasoning by up to 85% - even with perfect retrieval. We propose a bio-inspired memory framework grounded in complementary learning systems theory, cognitive behavioral therapy’s belief hierarchy, dual-process cognition, and fuzzy-trace theory, organized around three principles: (1) Memory has valence, not just content - pre-computed emotional-associative summaries (valence vectors) organized in an emergent belief hierarchy inspired by Beck’s cognitive model enable instant orientation before deliberation; (2) Retrieval defaults to System 1 with System 2 escalation - automatic spreading activation and passive priming as default, with deliberate retrieval only when needed, and graded epistemic states that address hallucination structurally; and (3) Encoding is active, present, and feedback-dependent - a thalamic gateway tags and routes information between stores, while the executive forms gists through curiosity-driven investigation, not passive exposure. Seven functional properties specify what any implementation must satisfy. Over time, the system converges toward System 1 processing - the computational analog of clinical expertise - producing interactions that become cheaper, not more expensive, with experience.
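The System 1 / System 2 escalation principle has a simple computational shape, sketched below under my own assumptions (the memory store, confidence scores, and threshold are toy values; the framework itself is specified at the level of functional properties, not code): fast associative lookup answers by default, and a slower deliberate retrieval runs only when confidence is too low.

```python
# Toy dual-process retrieval: cheap System 1 lookup by default,
# escalate to a costly System 2 search when confidence is low.

MEMORY = {"user_name": ("Ada", 0.95), "favorite_food": ("soup", 0.40)}

def system1(key):
    # Fast, automatic retrieval with a pre-computed confidence.
    return MEMORY.get(key, (None, 0.0))

def system2(key):
    # Stand-in for deliberate retrieval (re-reading logs, tools, ...).
    return f"deliberated answer for {key}", 1.0

def retrieve(key, threshold=0.6):
    value, confidence = system1(key)
    if confidence >= threshold:
        return value, "system1"
    value, _ = system2(key)
    return value, "system2"

print(retrieve("user_name"))      # ('Ada', 'system1')
print(retrieve("favorite_food"))  # escalates to system2
```

The claimed efficiency property corresponds to the threshold branch: as more answers accumulate high confidence, a growing share of queries resolve in the cheap path, so interactions get cheaper with experience.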
[9] POTSA: A Cross-Lingual Speech Alignment Framework for Speech-to-Text Translation
Xuanchen Li, Chenrui Cui, Tianrui Wang, Meng Ge, Zikang Huang, Yizhou Peng, Jin Li, Yuheng Lu, Yu Jiang, Nyima Tashi, Longbiao Wang, Jianwu Dang
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2511.09232 returned HTTP 429 (rate limited), so no AI-generated summary could be produced.
Details
Abstract: Failed to fetch summary for 2511.09232: page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.09232&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[10] The Model Says Walk: How Surface Heuristics Override Implicit Constraints in LLM Reasoning
Yubo Li, Lu Zhang, Tianchong Jiang, Ramayya Krishnan, Rema Padman
Main category: cs.CL
TL;DR: Paper studies systematic failure of LLMs when surface cues conflict with unstated feasibility constraints, using a diagnose-measure-bridge-treat framework with the “car wash problem” and Heuristic Override Benchmark.
Details
Motivation: Large language models often fail when obvious surface cues conflict with unstated feasibility constraints, revealing systematic reasoning vulnerabilities that need to be understood and addressed.
Method: Causal-behavioral analysis of the “car wash problem” across six models, creation of Heuristic Override Benchmark (500 instances spanning 4 heuristic types by 5 constraint families), testing 14 models with minimal pairs and explicitness gradients, and using parametric probes and goal-decomposition prompting.
Result: Models show context-independent sigmoid heuristics where distance cues exert 8.7-38x more influence than goals; under strict evaluation, no model exceeds 75% accuracy, with presence constraints hardest (44%); minimal hints recover +15pp, goal-decomposition prompting recovers +6-9pp; 12/14 models perform worse when constraints are removed.
Conclusion: Heuristic override is a systematic reasoning vulnerability in LLMs, characterized by over-reliance on surface cues and failure to infer unstated constraints, with the benchmark providing a way to measure progress toward resolving this issue.
Abstract: Large language models systematically fail when a salient surface cue conflicts with an unstated feasibility constraint. We study this through a diagnose-measure-bridge-treat framework. Causal-behavioral analysis of the “car wash problem” across six models reveals approximately context-independent sigmoid heuristics: the distance cue exerts 8.7 to 38 times more influence than the goal, and token-level attribution shows patterns more consistent with keyword associations than compositional inference. The Heuristic Override Benchmark (HOB) – 500 instances spanning 4 heuristic types by 5 constraint families with minimal pairs and explicitness gradients – demonstrates generality across 14 models: under strict evaluation (10/10 correct), no model exceeds 75%, and presence constraints are hardest (44%). A minimal hint (e.g., emphasizing the key object) recovers +15 pp on average, suggesting the failure lies in constraint inference rather than missing knowledge; 12/14 models perform worse when the constraint is removed (up to -39 pp), revealing conservative bias. Parametric probes confirm that the sigmoid pattern generalizes to cost, efficiency, and semantic-similarity heuristics; goal-decomposition prompting recovers +6 to 9 pp by forcing models to enumerate preconditions before answering. Together, these results characterize heuristic override as a systematic reasoning vulnerability and provide a benchmark for measuring progress toward resolving it.
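The strict 10/10-correct scoring used for the benchmark can be sketched as follows. This is a hedged illustration; the function and data names are ours, not the benchmark's code.

```python
def strict_accuracy(samples_per_instance):
    """Strict (all-correct) accuracy: an instance counts as solved only if
    every sampled answer for it is correct -- analogous to HOB's
    10/10-correct criterion, which is much harsher than per-sample accuracy.

    samples_per_instance: list of lists of booleans, one inner list per instance.
    """
    solved = sum(all(samples) for samples in samples_per_instance)
    return solved / len(samples_per_instance)

# Three instances, three samples each: only the first is solved under strict scoring,
# even though 5 of 9 individual samples are correct.
runs = [
    [True, True, True],
    [True, True, False],
    [False, False, False],
]
strict = strict_accuracy(runs)
```

Under this criterion a model that is merely often right on a heuristic-conflict instance still scores zero on it, which is why the reported ceilings (no model above 75%) are so low.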
[11] On the limited utility of parallel data for learning shared multilingual representations
Julius Leino, Jörg Tiedemann
Main category: cs.CL
TL;DR: Parallel data in pretraining has minimal effect on cross-lingual alignment; alignment emerges naturally even without explicit parallel signal.
Details
Motivation: To understand the impact of parallel data (translated sentences) in pretraining for creating shared multilingual representations and enabling cross-lingual knowledge transfer.
Method: Trained reference models with varying proportions of parallel data, evaluated using multiple methods to assess cross-lingual alignment effects.
Result: Parallel data has minimal effect on cross-lingual alignment - limited to potentially accelerating representation sharing in early pretraining phases and reducing language-specific neurons. Cross-lingual alignment emerges at similar levels even without parallel data.
Conclusion: Explicit parallel data signal is not essential for achieving cross-lingual alignment in multilingual models; alignment emerges naturally during pretraining.
Abstract: Shared multilingual representations are essential for cross-lingual tasks and knowledge transfer across languages. This study looks at the impact of parallel data, i.e. translated sentences, in pretraining as a signal to trigger representations that are aligned across languages. We train reference models with different proportions of parallel data and show that parallel data seem to have only a minimal effect on the cross-lingual alignment. Based on multiple evaluation methods, we find that the effect is limited to potentially accelerating the representation sharing in the early phases of pretraining, and to decreasing the amount of language-specific neurons in the model. Cross-lingual alignment seems to emerge on similar levels even without the explicit signal from parallel data.
[12] An Empirical Recipe for Universal Phone Recognition
Shikhar Bharadwaj, Chin-Jou Li, Kwanghee Choi, Eunjung Yeo, William Chen, Shinji Watanabe, David R. Mortensen
Main category: cs.CL
TL;DR: PhoneticXEUS achieves SOTA multilingual phone recognition through large-scale training and systematic analysis of SSL representations, data scale, and loss objectives across 100+ languages.
Details
Motivation: Current phone recognition models struggle with multilingual generalization - English-focused models don't work across languages, while multilingual models underutilize pretrained representations. There's also limited understanding of how data scale, architecture, and training objectives affect multilingual performance.
Method: Developed PhoneticXEUS trained on large-scale multilingual data with controlled ablations to analyze SSL representations, data scale, and loss objectives. Evaluated across 100+ languages under a unified scheme and analyzed error patterns across language families, accented speech, and articulatory features.
Result: Achieved state-of-the-art performance on both multilingual (17.7% PFER) and accented English speech (10.6% PFER). Established training recipe through systematic analysis and quantified impact of various factors on performance.
Conclusion: Large-scale multilingual training with proper utilization of SSL representations enables robust phone recognition across diverse languages and accents. The open release of data and code supports further research in multilingual speech processing.
Abstract: Phone recognition (PR) is a key enabler of multilingual and low-resource speech processing tasks, yet robust performance remains elusive. Highly performant English-focused models do not generalize across languages, while multilingual models underutilize pretrained representations. It also remains unclear how data scale, architecture, and training objective contribute to multilingual PR. We present PhoneticXEUS – trained on large-scale multilingual data and achieving state-of-the-art performance on both multilingual (17.7% PFER) and accented English speech (10.6% PFER). Through controlled ablations with evaluations across 100+ languages under a unified scheme, we empirically establish our training recipe and quantify the impact of SSL representations, data scale, and loss objectives. In addition, we analyze error patterns across language families, accented speech, and articulatory features. All data and code are released openly.
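At its simplest, phone-recognition scoring reduces to an edit distance over phone sequences, normalized by reference length. A minimal sketch (the PFER reported in the paper may additionally weight errors by articulatory features; this shows only the plain phone-level variant):

```python
def phone_error_rate(ref, hyp):
    """Levenshtein distance over phone sequences, divided by reference length.
    Counts insertions, deletions, and substitutions uniformly; a
    feature-weighted PFER would instead grade substitutions by how many
    articulatory features differ between the two phones.
    """
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[m][n] / m

ref = ["k", "ae", "t"]   # "cat"
hyp = ["k", "ah", "t"]   # one vowel substitution -> rate 1/3
```

On this scale, the reported 17.7% multilingual PFER corresponds to roughly one error per six reference phones.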
[13] Dual Perspectives in Emotion Attribution: A Generator-Interpreter Framework for Cross-Cultural Analysis of Emotion in LLMs
Aizirek Turdubaeva, Uichin Lee
Main category: cs.CL
TL;DR: LLMs struggle with cross-cultural emotion attribution due to overlooking cultural backgrounds of emotion generators; proposed Generator-Interpreter framework improves performance by considering both expression and interpretation perspectives across 15 countries.
Details
Motivation: Current LLM-based emotion attribution systems focus mainly on interpretation while overlooking the cultural background of emotion generators, assuming universality that neglects cultural variations in emotional expression and perception across different nations.
Method: Proposed a Generator-Interpreter framework capturing dual perspectives of emotion attribution by considering both expression and interpretation. Systematically evaluated six LLMs on emotion attribution tasks using data from 15 countries to analyze cultural variations.
Result: Performance variations depend on emotion type and cultural context. Generator-interpreter alignment effects are present, with the generator’s country of origin having stronger impact on performance than interpreter’s background.
Conclusion: Culturally sensitive emotion modeling is needed in LLM-based systems to improve robustness and fairness in emotion understanding across diverse cultural contexts, addressing the limitations of universal assumptions.
Abstract: Large language models (LLMs) are increasingly used in cross-cultural systems to understand and adapt to human emotions, which are shaped by cultural norms of expression and interpretation. However, prior work on emotion attribution has focused mainly on interpretation, overlooking the cultural background of emotion generators. This assumption of universality neglects variation in how emotions are expressed and perceived across nations. To address this gap, we propose a Generator-Interpreter framework that captures dual perspectives of emotion attribution by considering both expression and interpretation. We systematically evaluate six LLMs on an emotion attribution task using data from 15 countries. Our analysis reveals that performance variations depend on the emotion type and cultural context. Generator-interpreter alignment effects are present; the generator’s country of origin has a stronger impact on performance. We call for culturally sensitive emotion modeling in LLM-based systems to improve robustness and fairness in emotion understanding across diverse cultural contexts.
[14] ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering
Alberto Compagnoni, Marco Morini, Sara Sarto, Federico Cocchi, Davide Caffagni, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2511.22715 returned HTTP 429 (rate limited), so no details could be retrieved for this paper.
[15] PolarQuant: Optimal Gaussian Weight Quantization via Hadamard Rotation for LLM Compression
Caio Vicentino
Main category: cs.CL
TL;DR: PolarQuant is a post-training weight quantization method for LLMs that uses hypersphere normalization, Walsh-Hadamard rotation, and Gaussian-matched quantization to achieve near-lossless compression without calibration data.
Details
Motivation: Current weight quantization methods for LLMs often suffer from quality degradation and require calibration data. The authors aim to develop a quantization approach that exploits the distributional structure of neural network weights to achieve near-lossless compression without needing calibration data.
Method: Three-stage approach: 1) Block-wise normalization to unit hypersphere, 2) Walsh-Hadamard rotation to transform coordinates into approximately Gaussian random variables, 3) Quantization with centroids matched to the Gaussian distribution. The method works as a preprocessing step for downstream INT4 quantizers.
Result: Reduces Qwen3.5-9B perplexity from 6.90 (absmax Q5) to 6.40 (Δ=+0.03 from FP16), making it practically lossless. When used as preprocessing for INT4 quantization, achieves perplexity 6.56 vs 6.68 for direct absmax INT4, with 43.1 tok/s throughput at 6.5 GB VRAM.
Conclusion: PolarQuant achieves near-lossless weight quantization for LLMs without calibration data, with Walsh-Hadamard rotation accounting for 98% of quality improvement. It also serves as effective preprocessing for downstream INT4 quantizers, maintaining good throughput and memory efficiency.
Abstract: We present PolarQuant, a post-training weight quantization method for large language models (LLMs) that exploits the distributional structure of neural network weights to achieve near-lossless compression. PolarQuant operates in three stages: (1) block-wise normalization to the unit hypersphere, (2) Walsh-Hadamard rotation to transform coordinates into approximately Gaussian random variables, and (3) quantization with centroids matched to the Gaussian distribution. Our ablation reveals that Hadamard rotation alone accounts for 98% of the quality improvement, reducing Qwen3.5-9B perplexity from 6.90 (absmax Q5) to 6.40 (Delta = +0.03 from FP16), making it practically lossless without any calibration data. Furthermore, PolarQuant functions as an effective preprocessing step for downstream INT4 quantizers: PolarQuant Q5 dequantized and re-quantized by torchao INT4 achieves perplexity 6.56 versus 6.68 for direct absmax INT4, while maintaining 43.1 tok/s throughput at 6.5 GB VRAM. Code and models are publicly available.
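The three PolarQuant stages can be sketched in pure Python for a single weight block. This is a minimal illustration, not the paper's implementation: the 4-level codebook below is a placeholder rather than the Gaussian-matched centroids the method derives, and real blocks would be much larger.

```python
import math

def fwht(x):
    """In-place fast Walsh-Hadamard transform (unnormalized); len(x) must be a power of 2."""
    h, n = 1, len(x)
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x

def polar_quant_block(w, codebook):
    """Sketch of the three stages on one weight block:
    1) normalize the block to the unit hypersphere,
    2) rotate with an orthonormal Walsh-Hadamard transform, so coordinates
       look approximately Gaussian,
    3) snap each coordinate to the nearest codebook centroid.
    Returns (quantized coordinates, block norm); both are needed to dequantize.
    """
    norm = math.sqrt(sum(v * v for v in w))
    unit = [v / norm for v in w]
    rotated = fwht(unit)
    scale = 1.0 / math.sqrt(len(w))  # makes the Hadamard rotation orthonormal
    rotated = [v * scale for v in rotated]
    q = [min(codebook, key=lambda c: abs(c - v)) for v in rotated]
    return q, norm

# Illustrative 2-bit codebook (placeholder values, not Gaussian-optimal centroids).
codebook = [-0.6, -0.2, 0.2, 0.6]
q, norm = polar_quant_block([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], codebook)
```

Dequantization would apply the same (self-inverse, up to scaling) Hadamard rotation to `q` and multiply by `norm`; the rotation is also what the ablation credits with 98% of the quality gain.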
[16] APEX-EM: Non-Parametric Online Learning for Autonomous Agents via Structured Procedural-Episodic Experience Replay
Pratyay Banerjee, Masud Moshtaghi, Ankit Chadha
Main category: cs.CL
TL;DR: APEX-EM is a non-parametric online learning framework that gives LLM-based agents persistent procedural memory by accumulating, retrieving, and reusing structured procedural plans without modifying model weights.
Details
Motivation: LLM-based autonomous agents lack persistent procedural memory, forcing them to re-derive solutions from scratch even when structurally identical tasks have been solved before, leading to inefficiency and wasted computational resources.
Method: APEX-EM introduces: (1) structured experience representation encoding full procedural-episodic traces; (2) Plan-Retrieve-Generate-Iterate-Ingest workflow with Task Verifiers providing multi-dimensional reward signals; (3) dual-outcome Experience Memory with hybrid retrieval combining semantic search, structural signature matching, and plan DAG traversal.
Result: On KGQAGen-10k: 89.6% accuracy vs 41.3% without memory (+48.3pp), surpassing oracle-retrieval upper bound (84.9%). On BigCodeBench: 83.3% SR from 53.9% baseline (+29.4pp). On HLE: entity graph retrieval reaches 48.0% from 25.2% (+22.8pp).
Conclusion: APEX-EM enables cross-domain transfer between tasks sharing no lexical overlap but analogous operational structure, with successful experiences serving as positive in-context examples and failures as negative examples with structured error annotations.
Abstract: LLM-based autonomous agents lack persistent procedural memory: they re-derive solutions from scratch even when structurally identical tasks have been solved before. We present APEX-EM, a non-parametric online learning framework that accumulates, retrieves, and reuses structured procedural plans without modifying model weights. APEX-EM introduces: (1) a structured experience representation encoding the full procedural-episodic trace of each execution – planning steps, artifacts, iteration history with error analysis, and quality scores; (2) a Plan-Retrieve-Generate-Iterate-Ingest (PRGII) workflow with Task Verifiers providing multi-dimensional reward signals; and (3) a dual-outcome Experience Memory with hybrid retrieval combining semantic search, structural signature matching, and plan DAG traversal – enabling cross-domain transfer between tasks sharing no lexical overlap but analogous operational structure. Successful experiences serve as positive in-context examples; failures as negative examples with structured error annotations. We evaluate on BigCodeBench, KGQAGen-10k, and Humanity’s Last Exam using Claude Sonnet 4.5 and Opus 4.5. On KGQAGen-10k, APEX-EM achieves 89.6% accuracy versus 41.3% without memory (+48.3pp), surpassing the oracle-retrieval upper bound (84.9%). On BigCodeBench, it reaches 83.3% SR from a 53.9% baseline (+29.4pp), exceeding MemRL’s +11.0pp gain under comparable frozen-backbone conditions (noting backbone differences controlled for in our analysis). On HLE, entity graph retrieval reaches 48.0% from 25.2% (+22.8pp). Ablations show component value is task-dependent: rich judge feedback is negligible for code generation but critical for structured queries (+10.3pp), while binary-signal iteration partially compensates for weaker feedback.
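The hybrid retrieval idea, combining semantic matching with structural-signature matching, can be caricatured with simple set-overlap scores. Everything here is an illustrative assumption of ours (field names, weights, and the Jaccard scoring), not APEX-EM's actual retrieval machinery, which also traverses plan DAGs.

```python
def jaccard(a, b):
    """Overlap of two collections treated as sets, in [0, 1]."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def hybrid_score(query, experience, w_sem=0.5, w_struct=0.5):
    """Hypothetical hybrid retrieval score: lexical overlap of task text plus
    overlap of plan-step types (the structural signature). The structural term
    is what lets a stored experience match a task with no shared vocabulary.
    """
    sem = jaccard(query["text"].lower().split(), experience["text"].lower().split())
    struct = jaccard(query["plan_steps"], experience["plan_steps"])
    return w_sem * sem + w_struct * struct

query = {"text": "sort records then aggregate totals",
         "plan_steps": ["load", "sort", "aggregate"]}
memory = [
    {"text": "aggregate totals for monthly records",
     "plan_steps": ["load", "filter", "aggregate"]},
    {"text": "render a chart", "plan_steps": ["load", "plot"]},
]
best = max(memory, key=lambda e: hybrid_score(query, e))
```

A production system would use embedding similarity rather than word overlap for the semantic term, but the weighted combination is the structural point.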
[17] Concept Training for Human-Aligned Language Models
Christine Zhang, Dan Jurafsky, Chen Shani
Main category: cs.CL
TL;DR: Paper proposes concept-level supervision instead of standard next-token prediction, training models to predict sets of semantically related tokens rather than single tokens, improving semantic alignment with human judgments.
Details
Motivation: Standard next-token prediction treats alternative valid continuations as mutually exclusive, but natural language prefixes can have multiple valid continuations with similar meanings. Current NTP fails to capture semantic equivalence between different surface forms.
Method: Introduces concept supervision framework where models predict concepts approximated as sets of semantically related tokens, rather than single tokens. Uses concept-level objectives alongside standard NTP training.
Result: Models trained with concept supervision show stronger alignment with human semantic similarity judgments on lexical benchmarks. Lower perplexity on semantically meaningful words but modest increase in global token-level perplexity, indicating tradeoff between NTP optimization and concept supervision.
Conclusion: Concept-level objectives can improve semantic alignment while maintaining competitive language modeling performance, suggesting promising direction for better semantic understanding in language models.
Abstract: The next-token prediction (NTP) objective trains language models to predict a single continuation token at each step. In natural language, however, a prefix can be continued in many valid ways, and even similar meanings may differ in surface form. For example, the sentence “this website is safe to browse” could plausibly continue with words such as browse, search, visit, surf, or navigate. While standard NTP training treats these alternatives as mutually exclusive targets, we explore a framework that instead predicts concepts, approximated as sets of semantically related tokens. We show that models trained with concept supervision exhibit stronger alignment with human semantic similarity judgments on multiple lexical benchmarks. These gains are accompanied by lower perplexity on semantically meaningful words (definition in Section 3.1), and a modest increase in global token-level perplexity, reflecting a tradeoff between standard NTP optimization and concept-level supervision. Our results suggest that concept-level objectives can improve semantic alignment while maintaining competitive language modeling performance.
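One way to read set-valued concept targets is as uniform soft labels spread over the related-token set, instead of a one-hot target on a single gold token. The sketch below is an illustrative reading of that idea, not necessarily the paper's exact objective.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def concept_loss(logits, concept_ids):
    """Cross-entropy against a uniform target over a concept's token ids.
    With concept_ids = [gold], this reduces to standard NTP cross-entropy;
    with several ids (e.g. browse/visit/surf), probability mass sharply
    concentrated on one surface form is penalized.
    """
    probs = softmax(logits)
    target = 1.0 / len(concept_ids)
    return -sum(target * math.log(probs[i]) for i in concept_ids)

# Toy vocab of 5 tokens; the model strongly prefers token 1 over its synonym 2.
logits = [0.0, 3.0, 1.0, 0.0, 0.0]
ntp = concept_loss(logits, [1])        # one-hot NTP loss
concept = concept_loss(logits, [1, 2]) # concept-set loss over {1, 2}
```

The ordering `ntp < concept` here shows the tradeoff the abstract describes: a prediction peaked on a single surface form is fine for NTP but suboptimal under the concept objective.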
[18] MA-SAPO: Multi-Agent Reasoning for Score-Aware Prompt Optimization
Wonduk Seo, Juhyeon Lee, Junseo Koh, Wonseok Choi, Hyunjin An, Jian Park, Seunghyun Lee, Haihua Chen, Yi Bu
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2510.16635 returned HTTP 429 (rate limited), so no details could be retrieved for this paper.
[19] Kwame 2.0: Human-in-the-Loop Generative AI Teaching Assistant for Large Scale Online Coding Education in Africa
George Boateng, Samuel Boateng, Victor Kumbol
Main category: cs.CL
TL;DR: Kwame 2.0: A bilingual generative AI teaching assistant using retrieval-augmented generation with human-in-the-loop oversight for coding education in Africa
Details
Motivation: Addressing the challenge of providing timely and accurate learning support in large-scale online coding courses, especially in resource-constrained contexts like Africa, where human teaching assistants are scarce.
Method: Built a bilingual (English-French) generative AI teaching assistant using retrieval-augmented generation (RAG) deployed in a human-in-the-loop forum within SuaCode, an introductory mobile-based coding course. The system retrieves relevant course materials and generates context-aware responses while encouraging human oversight and community participation.
Result: Deployed in a 15-month longitudinal study across 15 cohorts with 3,717 enrollments in 35 African countries. Evaluation showed Kwame 2.0 provided high-quality, timely support with high accuracy on curriculum-related questions. Human facilitators and peers effectively mitigated errors, especially for administrative queries.
Conclusion: Human-in-the-loop generative AI systems can combine AI scalability and speed with human reliability, offering an effective approach to learning assistance for underrepresented populations in resource-constrained settings at scale.
Abstract: Providing timely and accurate learning support in large-scale online coding courses is challenging, particularly in resource-constrained contexts. We present Kwame 2.0, a bilingual (English-French) generative AI teaching assistant built using retrieval-augmented generation and deployed in a human-in-the-loop forum within SuaCode, an introductory mobile-based coding course for learners across Africa. Kwame 2.0 retrieves relevant course materials and generates context-aware responses while encouraging human oversight and community participation. We deployed the system in a 15-month longitudinal study spanning 15 cohorts with 3,717 enrollments across 35 African countries. Evaluation using community feedback and expert ratings shows that Kwame 2.0 provided high-quality and timely support, achieving high accuracy on curriculum-related questions, while human facilitators and peers effectively mitigated errors, particularly for administrative queries. Our findings demonstrate that human-in-the-loop generative AI systems can combine the scalability and speed of AI with the reliability of human support, offering an effective approach to learning assistance for underrepresented populations in resource-constrained settings at scale.
[20] SyriSign: A Parallel Corpus for Arabic Text to Syrian Arabic Sign Language Translation
Mohammad Amer Khalil, Raghad Nahas, Ahmad Nassar, Khloud Al Jallad
Main category: cs.CL
TL;DR: SyriSign: First public dataset for Syrian Arabic Sign Language with 1500 video samples across 150 signs, evaluated with multimodal models for text-to-sign translation.
Details
Motivation: Address the lack of publicly available datasets for low-resource sign languages like Syrian Arabic Sign Language (SyArSL), which creates communication barriers for the deaf community in Syria where news is primarily delivered in spoken/written Arabic.
Method: Created SyriSign dataset with 1500 video samples across 150 unique lexical signs. Evaluated using three deep learning architectures: MotionCLIP for semantic motion generation, T2M-GPT for text-conditioned motion synthesis, and SignCLIP for bilingual embedding alignment.
Result: Generative approaches show strong potential for sign representation, but limited dataset size constrains generalization performance. The dataset will be publicly released as an initial benchmark.
Conclusion: SyriSign addresses a critical gap in low-resource sign language datasets and demonstrates the potential of multimodal generative models for sign language translation, though larger datasets are needed for better generalization.
Abstract: Sign language is the primary approach of communication for the Deaf and Hard-of-Hearing (DHH) community. While there are numerous benchmarks for high-resource sign languages, low-resource languages like Arabic remain underrepresented. Currently, there is no publicly available dataset for Syrian Arabic Sign Language (SyArSL). To overcome this gap, we introduce SyriSign, a dataset comprising 1500 video samples across 150 unique lexical signs, designed for text-to-SyArSL translation tasks. This work aims to reduce communication barriers in Syria, as most news are delivered in spoken or written Arabic, which is often inaccessible to the deaf community. We evaluated SyriSign using three deep learning architectures: MotionCLIP for semantic motion generation, T2M-GPT for text-conditioned motion synthesis, and SignCLIP for bilingual embedding alignment. Experimental results indicate that while generative approaches show strong potential for sign representation, the limited dataset size constrains generalization performance. We will release SyriSign publicly, hoping it serves as an initial benchmark.
[21] SiPaKosa: A Comprehensive Corpus of Canonical and Classical Buddhist Texts in Sinhala and Pali
Ranidu Gurusinghe, Nevidu Jayatilleke
Main category: cs.CL
TL;DR: SiPaKosa is a large corpus of Sinhala and Pali Buddhist texts created via OCR and web scraping, used to evaluate language models for historical language analysis and cultural preservation.
Details
Motivation: To create a comprehensive corpus of Sinhala and Pali doctrinal texts to support domain-adapted language models, facilitate historical language analysis, aid Buddhist scholarship information retrieval, and preserve Sinhala cultural heritage.
Method: Used Google Document AI OCR on historical manuscripts combined with systematic web scraping of canonical repositories, followed by rigorous quality control and metadata annotation. Organized into language-specific subcorpora (Sinhala and Mixed Sinhala-Pali).
Result: Created a corpus of approximately 786K sentences and 9.25M words. Evaluated 10 pretrained language models with perplexity scores ranging from 1.09 to 189.67, showing proprietary models outperform open-source alternatives by 3-6 times.
Conclusion: The SiPaKosa corpus enables pretraining of domain-adapted language models, supports historical language analysis, aids Buddhist scholarship information retrieval, and preserves Sinhala cultural heritage while revealing performance gaps between proprietary and open-source models.
Abstract: SiPaKosa is a comprehensive corpus of Sinhala and Pali doctrinal texts comprising approximately 786K sentences and 9.25M words, incorporating 16 copyright-cleared historical Buddhist documents alongside the complete web-scraped Tripitaka canonical texts. The corpus was created through high-quality OCR using Google Document AI on historical manuscripts, combined with systematic web scraping of canonical repositories, followed by rigorous quality control and metadata annotation. The corpus is organised into language-specific subcorpora: Sinhala and Mixed Sinhala-Pali. We evaluate the performance of language models using ten pretrained models, with perplexity scores ranging from 1.09 to 189.67 on our corpus. This analysis shows that proprietary models significantly outperform open-source alternatives by factors of three to six times. This corpus supports the pretraining of domain-adapted language models, facilitates historical language analysis, and aids in the development of information retrieval systems for Buddhist scholarship while preserving Sinhala cultural heritage.
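For reference, perplexity scores like the 1.09 to 189.67 range above are the exponential of the negative mean log-likelihood over the evaluation tokens; a perfect model bottoms out at 1.0. A minimal computation:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(negative mean natural-log likelihood) of a token stream.
    Lower is better; a model assigning probability 1 to every token scores 1.0.
    """
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A model that assigns every token probability 0.1 has perplexity exactly 10.
uniform = [math.log(0.1)] * 5
ppl = perplexity(uniform)
```

On this scale, the three-to-six-fold gap the paper reports between proprietary and open-source models is substantial: it means the open models are, on average, far less certain of each next token of the corpus.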
[22] Long-Document QA with Chain-of-Structured-Thought and Fine-Tuned SLMs
Zhuowen Liang, Xiaotian Lin, Zhengxuan Zhang, Yuyu Luo, Haixun Wang, Nan Tang
Main category: cs.CL
TL;DR: LiteCoST: A two-pillar framework for document QA using small language models that produces structured outputs via Chain-of-Structured-Thought and fine-tunes SLMs for LLM-comparable quality with lower latency.
Details
Motivation: Direct reasoning over long, noisy documents with LLMs is brittle and error-prone. There's a need for document QA that consolidates dispersed evidence into structured outputs (tables, graphs, chunks) for reliable, verifiable QA while achieving both high accuracy and low latency with small language models.
Method: Two-pillar framework: 1) Chain-of-Structured-Thought (CoST) - schema-aware instruction guiding strong LLMs to produce step-wise reasoning traces and structured outputs with normalization, alignment, serialization, and verification. 2) SLM fine-tuning - training compact models on LLM-generated CoST data via supervised fine-tuning followed by Group Relative Policy Optimization with triple rewards for answer quality, format quality, and process consistency.
Result: Achieves LLM-comparable quality on multi-domain long-document QA using 3B/7B SLMs, with 2-4x lower latency than GPT-4o and DeepSeek-R1 (671B).
Conclusion: LiteCoST successfully distills structure-first behavior into small language models, enabling efficient document QA with structured outputs while maintaining quality comparable to large models.
Abstract: Large language models (LLMs) are widely applied to data analytics over documents, yet direct reasoning over long, noisy documents remains brittle and error-prone. Hence, we study document question answering (QA) that consolidates dispersed evidence into a structured output (e.g., a table, graph, or chunks) to support reliable, verifiable QA. We propose a two-pillar framework, LiteCoST, to achieve both high accuracy and low latency with small language models (SLMs). Pillar 1: Chain-of-Structured-Thought (CoST). We introduce a CoST template, a schema-aware instruction that guides a strong LLM to produce both a step-wise CoST trace and the corresponding structured output. The process induces a minimal structure, normalizes entities/units, aligns records, serializes the output, and verifies/refines it, yielding auditable supervision. Pillar 2: SLM fine-tuning. The compact models are trained on LLM-generated CoST data in two stages: Supervised Fine-Tuning for structural alignment, followed by Group Relative Policy Optimization (GRPO) incorporating triple rewards for answer/format quality and process consistency. By distilling structure-first behavior into SLMs, this approach achieves LLM-comparable quality on multi-domain long-document QA using 3B/7B SLMs, while delivering 2-4x lower latency than GPT-4o and DeepSeek-R1 (671B). The code is available at https://github.com/HKUSTDial/LiteCoST.
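The GRPO stage's group-relative credit assignment, fed by a scalarized triple reward, can be sketched as below. The weights and function names are invented for illustration; the paper's actual reward shaping and GRPO hyperparameters may differ.

```python
import statistics

def triple_reward(answer_q, format_q, process_q, w=(0.5, 0.25, 0.25)):
    """Illustrative scalarization of the three reward signals
    (answer quality, format quality, process consistency).
    The weights are an assumption, not taken from the paper."""
    return w[0] * answer_q + w[1] * format_q + w[2] * process_q

def group_relative_advantages(rewards):
    """GRPO-style advantages: standardize each sampled completion's reward
    against its group's mean and standard deviation, so no separate value
    model is needed."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0
    return [(r - mu) / sigma for r in rewards]

# Four sampled completions for one prompt, already scored by the triple reward.
rewards = [1.0, 0.5, 0.0, 0.5]
advantages = group_relative_advantages(rewards)
```

The best completion in the group gets a positive advantage and the worst a symmetric negative one, which is the signal used to update the SLM's policy.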
[23] The Thiomi Dataset: A Large-Scale Multimodal Corpus for Low-Resource African Languages
Hillary Mutisya, John Mugane, Gavin Nyamboga, Brian Chege, Maryruth Gathoni
Main category: cs.CL
TL;DR: Thiomi Dataset: A large-scale multimodal corpus for 10 African languages with text annotations and audio recordings, collected via community platform with quality assurance, used to train ASR, MT, and TTS models achieving state-of-the-art results.
Details
Motivation: Address the scarcity of resources for African languages in speech and language technology by creating a comprehensive multimodal dataset spanning multiple language families and regions.
Method: Community-driven data collection platform involving 100+ contributors, multi-tier quality assurance pipeline, supplementation with existing Common Voice data for Swahili, and training/evaluation of ASR, MT, and TTS models across all languages.
Result: Dataset contains 601,000+ text annotations and 385,000+ audio recordings across 9 languages; ASR system achieves 3.24% WER on Swahili (61% relative reduction from prior SOTA) and 4.3% WER on Somali; establishes baselines for all 10 languages.
Conclusion: The Thiomi Dataset successfully addresses resource scarcity for African languages, demonstrates utility through strong baseline models, and contributes to African language technology infrastructure development.
Abstract: We present the Thiomi Dataset, a large-scale multimodal corpus spanning ten African languages across four language families: Swahili, Kikuyu, Kamba, Kimeru, Luo, Maasai, Kipsigis, Somali (East Africa); Wolof (West Africa); and Fulani (West/Central Africa). The dataset contains over 601,000 approved sentence-level text annotations and over 385,000 audio recordings across nine languages, collected through a dedicated community data collection platform involving over 100 contributors. The Thiomi platform collected data for nine languages; Swahili data was supplemented with existing Common Voice recordings. A multi-tier quality assurance pipeline achieves 86-100% text approval rates for the six primary languages. To validate the dataset’s utility, we train and evaluate ASR, MT, and TTS models, establishing baselines across all ten languages. Our best ASR system achieves 3.24% WER on Swahili (Common Voice), reducing prior academic SOTA from 8.3% to 3.24% (5.1 percentage point absolute, 61% relative reduction), and 4.3% WER on Somali. The dataset will be published on HuggingFace. We describe the collection platform, quality assurance workflows, and baseline experiments, and discuss implications for African language technology infrastructure.
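The headline WER arithmetic is easy to verify: 8.3% down to 3.24% is 5.06 percentage points absolute (the abstract rounds this to 5.1) and about 61% relative. A two-line helper (names are ours):

```python
def reduction(prior, new):
    """Absolute (percentage-point) and relative reduction between two error rates."""
    absolute = prior - new
    return absolute, absolute / prior

# The Swahili WER claim: prior academic SOTA 8.3% -> 3.24%.
abs_pp, rel = reduction(8.3, 3.24)
```

The same helper applied to any pair of error rates makes "absolute pp" versus "relative %" comparisons in the digest unambiguous.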
[24] MemRerank: Preference Memory for Personalized Product Reranking
Zhiyuan Peng, Xuyang Wu, Huaixiao Tou, Yi Fang, Yi Gong
Main category: cs.CL
TL;DR: MemRerank is a preference memory framework that distills user purchase history into concise signals for personalized product reranking in LLM-based shopping agents, using RL-trained memory extraction and improving 1-in-5 accuracy by up to +10.61 absolute points.
Details
Motivation: LLM-based shopping agents struggle with personalization using raw purchase histories due to noise, length, and relevance mismatch. Current approaches naively append history to prompts, which is ineffective for product recommendation tasks.
Method: Proposes MemRerank framework that distills purchase history into query-independent preference memory. Uses RL to train memory extractor with downstream reranking performance as supervision. Built benchmark with 1-in-5 selection task to evaluate memory quality and reranking utility.
Result: MemRerank consistently outperforms no-memory, raw-history, and off-the-shelf memory baselines, achieving up to +10.61 absolute points in 1-in-5 accuracy with two LLM-based rerankers.
Conclusion: Explicit preference memory is a practical and effective building block for personalization in agentic e-commerce systems, demonstrating that distilled memory signals outperform raw history for LLM-based shopping agents.
Abstract: LLM-based shopping agents increasingly rely on long purchase histories and multi-turn interactions for personalization, yet naively appending raw history to prompts is often ineffective due to noise, length, and relevance mismatch. We propose MemRerank, a preference memory framework that distills user purchase history into concise, query-independent signals for personalized product reranking. To study this problem, we build an end-to-end benchmark and evaluation framework centered on an LLM-based 1-in-5 selection task, which measures both memory quality and downstream reranking utility. We further train the memory extractor with reinforcement learning (RL), using downstream reranking performance as supervision. Experiments with two LLM-based rerankers show that MemRerank consistently outperforms no-memory, raw-history, and off-the-shelf memory baselines, yielding up to +10.61 absolute points in 1-in-5 accuracy. These results suggest that explicit preference memory is a practical and effective building block for personalization in agentic e-commerce systems.
[25] CADEL: A Corpus of Administrative Web Documents for Japanese Entity Linking
Shohei Higashiyama, Masao Ideuchi, Masao Utiyama
Main category: cs.CL
TL;DR: Created Japanese entity linking corpus with Japan-specific entities to address limited resources for evaluating Japanese systems.
Details
Motivation: Entity linking resources are primarily developed for English, with limited evaluation resources available for Japanese systems, creating a need for Japanese-specific entity linking corpora.
Method: Developed corpus design policy for entity linking task, constructed annotated corpus with rich coverage of Japan-specific linguistic expressions referring to entities, evaluated inter-annotator agreement.
Result: High inter-annotator agreement confirms annotation consistency, preliminary entity disambiguation experiments show corpus contains substantial non-trivial cases, supporting its usefulness as evaluation benchmark.
Conclusion: Successfully created a valuable Japanese entity linking corpus that addresses the resource gap and can serve as an evaluation benchmark for Japanese entity linking systems.
Abstract: Entity linking is the task of associating linguistic expressions with entries in a knowledge base that represent real-world entities and concepts. Language resources for this task have primarily been developed for English, and the resources available for evaluating Japanese systems remain limited. In this study, we develop a corpus design policy for the entity linking task and construct an annotated corpus for training and evaluating Japanese entity linking systems, with rich coverage of linguistic expressions referring to entities that are specific to Japan. Evaluation of inter-annotator agreement confirms the high consistency of the annotations in the corpus, and a preliminary experiment on entity disambiguation based on string matching suggests that the corpus contains a substantial number of non-trivial cases, supporting its potential usefulness as an evaluation benchmark.
[26] Open Machine Translation for Esperanto
Ona de Gibert, Lluís de Gibert
Main category: cs.CL
TL;DR: Comprehensive evaluation of open-source machine translation systems for Esperanto, showing NLLB models perform best across six language directions involving English, Spanish, Catalan, and Esperanto.
Details
Motivation: Esperanto has substantial online resources but remains underexplored in modern machine translation approaches. The paper aims to provide the first comprehensive evaluation of open-source MT systems for Esperanto across different model types and sizes.
Method: Evaluated rule-based systems, encoder-decoder models, and LLMs across model sizes. Compared translation quality across six language directions (English, Spanish, Catalan, Esperanto) using multiple automatic metrics and human evaluation.
Result: NLLB family achieved best performance in all language pairs, followed closely by trained compact models and fine-tuned general-purpose LLM. Human evaluation confirmed this trend with NLLB translations preferred in approximately half of comparisons, though noticeable errors remain.
Conclusion: Comprehensive evaluation establishes baseline performance for Esperanto MT, with NLLB models performing best. The work contributes to Esperanto’s tradition of openness by releasing code and best-performing models publicly.
Abstract: Esperanto is a widespread constructed language, known for its regular grammar and productive word formation. Besides having substantial resources available thanks to its online community, it remains relatively underexplored in the context of modern machine translation (MT) approaches. In this work, we present the first comprehensive evaluation of open-source MT systems for Esperanto, comparing rule-based systems, encoder-decoder models, and LLMs across model sizes. We evaluate translation quality across six language directions involving English, Spanish, Catalan, and Esperanto using multiple automatic metrics as well as human evaluation. Our results show that the NLLB family achieves the best performance in all language pairs, followed closely by our trained compact models and a fine-tuned general-purpose LLM. Human evaluation confirms this trend, with NLLB translations preferred in approximately half of the comparisons, although noticeable errors remain. In line with Esperanto’s tradition of openness and international collaboration, we release our code and best-performing models publicly.
[27] L-ReLF: A Framework for Lexical Dataset Creation
Anass Sedrati, Mounir Afifi, Reda Benkhadra
Main category: cs.CL
TL;DR: L-ReLF is a reproducible methodology for creating structured lexical datasets for low-resource languages, addressing challenges like inconsistent terminology and OCR biases, with outputs compatible with Wikidata Lexemes for downstream NLP applications.
Details
Motivation: The paper addresses the critical barrier to knowledge equity in platforms like Wikipedia for underserved languages, exemplified by Moroccan Darija, where lack of standardized terminology forces editors to rely on inconsistent, ad-hoc methods for creating new words.
Method: Developed a technical pipeline for low-resource languages that includes source identification, Optical Character Recognition (OCR) despite biases towards Modern Standard Arabic, and rigorous post-processing to correct errors and standardize the data model.
Result: Created a structured lexical dataset fully compatible with Wikidata Lexemes, providing a vital technical resource for underserved language communities.
Conclusion: The L-ReLF methodology offers a generalizable, reproducible approach for language communities to build foundational lexical data for downstream NLP applications like Machine Translation and morphological analysis.
Abstract: This paper introduces the L-ReLF (Low-Resource Lexical Framework), a novel, reproducible methodology for creating high-quality, structured lexical datasets for underserved languages. The lack of standardized terminology, exemplified by Moroccan Darija, poses a critical barrier to knowledge equity in platforms like Wikipedia, often forcing editors to rely on inconsistent, ad-hoc methods to create new words in their language. Our research details the technical pipeline developed to overcome these challenges. We systematically address the difficulties of working with low-resource data, including source identification, utilizing Optical Character Recognition (OCR) despite its bias towards Modern Standard Arabic, and rigorous post-processing to correct errors and standardize the data model. The resulting structured dataset is fully compatible with Wikidata Lexemes, serving as a vital technical resource. The L-ReLF methodology is designed for generalizability, offering other language communities a clear path to build foundational lexical data for downstream NLP applications, such as Machine Translation and morphological analysis.
[28] Developing a Guideline for the Labovian-Structural Analysis of Oral Narratives in Japanese
Amane Watahiki, Tomoki Doi, Akari Kikuchi, Hiroshi Ohata, Yuki I. Nakata, Takuya Niikawa, Taiga Shinozaki, Hitomi Yanaka
Main category: cs.CL
TL;DR: First systematic guidelines for Labovian narrative analysis of Japanese narrative data, extending the framework with Japanese-specific clause segmentation rules and achieving high annotation agreement.
Details
Motivation: Existing Labovian narrative analysis frameworks are only available in English and don't account for Japanese grammatical and discourse differences, creating a gap for Japanese qualitative research.
Method: Developed systematic guidelines that retain all six Labovian categories while adding explicit rules for Japanese clause segmentation, covering broader clause and narrative types than existing frameworks.
Result: Annotators achieved high agreement in clause segmentation (Fleiss’ kappa = 0.80) and moderate agreement in structural classification tasks (Krippendorff’s alpha = 0.41 and 0.45), with one task showing slightly higher agreement than prior work despite finer-grained distinctions.
Conclusion: The guidelines successfully adapt Labovian narrative analysis to Japanese, addressing language-specific challenges and enabling future development of larger Japanese narrative datasets for qualitative research.
Abstract: Narrative analysis is a cornerstone of qualitative research. One leading approach is the Labovian model, but its application is labor-intensive, requiring a holistic, recursive interpretive process that moves back and forth between individual parts of the transcript and the transcript as a whole. Existing Labovian datasets are available only in English, which differs markedly from Japanese in terms of grammar and discourse conventions. To address this gap, we introduce the first systematic guidelines for Labovian narrative analysis of Japanese narrative data. Our guidelines retain all six Labovian categories and extend the framework by providing explicit rules for clause segmentation tailored to Japanese constructions. In addition, our guidelines cover a broader range of clause types and narrative types. Using these guidelines, annotators achieved high agreement in clause segmentation (Fleiss’ kappa = 0.80) and moderate agreement in two structural classification tasks (Krippendorff’s alpha = 0.41 and 0.45, respectively), one of which is slightly higher than that found in prior work despite the use of finer-grained distinctions. This paper describes the Labovian model, the proposed guidelines, the annotation process, and their utility. It concludes by discussing the challenges encountered during the annotation process and the prospects for developing a larger dataset for structural narrative analysis in Japanese qualitative research.
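The agreement statistic reported above, Fleiss' kappa, compares observed per-item rater agreement against the agreement expected by chance from marginal category proportions. The paper's own tooling is not described, so this is only a sketch of the standard formula:

```python
from collections import Counter

def fleiss_kappa(ratings: list[list[str]]) -> float:
    """Fleiss' kappa for ratings[item] = list of category labels,
    one label per rater, with the same number of raters per item."""
    n = len(ratings[0])  # raters per item
    cats = sorted({c for row in ratings for c in row})
    # per-item counts of each category
    table = [[Counter(row)[c] for c in cats] for row in ratings]
    N = len(table)
    # observed agreement: pairs of agreeing raters per item
    P_i = [(sum(x * x for x in row) - n) / (n * (n - 1)) for row in table]
    P_bar = sum(P_i) / N
    # chance agreement from marginal category proportions
    p_j = [sum(row[j] for row in table) / (N * n) for j in range(len(cats))]
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)
```

A kappa of 0.80, as reported for clause segmentation, is conventionally read as substantial to near-perfect agreement.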
[29] Beyond Idealized Patients: Evaluating LLMs under Challenging Patient Behaviors in Medical Consultations
Yahan Li, Xinyi Jie, Wanjia Ruan, Xubei Zhang, Huaijie Zhu, Yicheng Gao, Chaohao Du, Ruishan Liu
Main category: cs.CL
TL;DR: CPB-Bench is a bilingual benchmark for evaluating medical LLMs on challenging patient behaviors like contradictions, inaccuracies, self-diagnosis, and care resistance, revealing consistent failure patterns in handling unrealistic patient inputs.
Details
Motivation: Existing medical LLM evaluations assume idealized patient questions, limiting realism. Real medical consultations involve challenging patient behaviors that complicate safe clinical reasoning, requiring better evaluation frameworks.
Method: Defined four clinically grounded challenging patient behaviors, created CPB-Bench with 692 multi-turn dialogues annotated with these behaviors from four existing medical datasets, evaluated various LLMs, and tested intervention strategies.
Result: Models perform well overall but show consistent failure patterns, particularly in handling contradictory or medically implausible patient information. Intervention strategies yield inconsistent improvements and can introduce unnecessary corrections.
Conclusion: Current medical LLMs struggle with challenging patient behaviors, highlighting the need for more robust evaluation and improved handling of unrealistic patient inputs in medical consultation settings.
Abstract: Large language models (LLMs) are increasingly used for medical consultation and health information support. In this high-stakes setting, safety depends not only on medical knowledge, but also on how models respond when patient inputs are unclear, inconsistent, or misleading. However, most existing medical LLM evaluations assume idealized and well-posed patient questions, which limits their realism. In this paper, we study challenging patient behaviors that commonly arise in real medical consultations and complicate safe clinical reasoning. We define four clinically grounded categories of such behaviors: information contradiction, factual inaccuracy, self-diagnosis, and care resistance. For each behavior, we specify concrete failure criteria that capture unsafe responses. Building on four existing medical dialogue datasets, we introduce CPB-Bench (Challenging Patient Behaviors Benchmark), a bilingual (English and Chinese) benchmark of 692 multi-turn dialogues annotated with these behaviors. We evaluate a range of open- and closed-source LLMs on their responses to challenging patient utterances. While models perform well overall, we identify consistent, behavior-specific failure patterns, with particular difficulty in handling contradictory or medically implausible patient information. We also study four intervention strategies and find that they yield inconsistent improvements and can introduce unnecessary corrections. We release the dataset and code.
[30] Is my model perplexed for the right reason? Contrasting LLMs’ Benchmark Behavior with Token-Level Perplexity
Zoë Prins, Samuele Punzo, Frank Wildenburg, Giovanni Cinà, Sandro Pezzelle
Main category: cs.CL
TL;DR: A perplexity-based interpretability framework for LLMs that tests reliance on linguistically relevant cues through minimal sentence pairs.
Details
Motivation: Standard LLM evaluations focus on task performance but offer limited insight into whether correct behavior reflects appropriate underlying mechanisms, risking confirmation bias. Need for principled interpretability methods to understand if models truly rely on linguistically relevant cues.
Method: Introduces a token-level perplexity framework comparing perplexity distributions over minimal sentence pairs differing in one or a few 'pivotal' tokens. Enables hypothesis-driven analysis without unstable feature-attribution techniques.
Result: Experiments on controlled linguistic benchmarks with open-weight LLMs show that while linguistically important tokens influence model behavior, they never fully explain perplexity shifts, revealing models rely on heuristics other than expected linguistic ones.
Conclusion: The perplexity-based framework provides a principled interpretability method that reveals LLMs don’t solely rely on linguistically relevant cues but use additional heuristics, challenging assumptions about model mechanisms.
Abstract: Standard evaluations of large language models (LLMs) focus on task performance, offering limited insight into whether correct behavior reflects appropriate underlying mechanisms and risking confirmation bias. We introduce a simple, principled interpretability framework based on token-level perplexity to test whether models rely on linguistically relevant cues. By comparing perplexity distributions over minimal sentence pairs differing in one or a few 'pivotal' tokens, our method enables precise, hypothesis-driven analysis without relying on unstable feature-attribution techniques. Experiments on controlled linguistic benchmarks with several open-weight LLMs show that, while linguistically important tokens influence model behavior, they never fully explain perplexity shifts, revealing that models rely on heuristics other than the expected linguistic ones.
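The core measurement is simple to state: sequence perplexity is the exponentiated mean negative token log-probability, and for a minimal pair one can ask how much of the perplexity shift is carried by the pivotal position. A sketch with placeholder log-probabilities (in practice these would come from an LM's forward pass; the numbers here are invented):

```python
import math

def perplexity(logprobs: list[float]) -> float:
    """Perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(logprobs) / len(logprobs))

# Hypothetical per-token log-probs for a minimal pair differing only in
# one 'pivotal' token at index 2.
sent_a = [-1.2, -0.5, -0.9, -2.0]  # original sentence
sent_b = [-1.2, -0.5, -3.1, -2.0]  # same sentence, pivotal token swapped

shift_total = perplexity(sent_b) - perplexity(sent_a)
# share of the total log-prob shift carried by the pivotal position
pivotal_share = (sent_a[2] - sent_b[2]) / (sum(sent_a) - sum(sent_b))
print(shift_total, pivotal_share)  # pivotal_share = 1.0 here by construction
```

The paper's finding is that, on real models, this share is consistently below 1: surrounding tokens also move, so pivotal tokens alone never fully explain the shift.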
[31] CounselReflect: A Toolkit for Auditing Mental-Health Dialogues
Yahan Li, Chaohao Du, Zeyang Li, Christopher Chun Kuizon, Shupeng Cheng, Angel Hsing-Chi Hwang, Adam C. Frank, Ruishan Liu
Main category: cs.CL
TL;DR: CounselReflect is a toolkit for auditing mental-health support dialogues in conversational systems, providing structured multi-dimensional reports with model-based and rubric-based metrics for transparent quality assessment.
Details
Motivation: Users of mental-health support conversational systems (like LLM-based tools) lack structured ways to audit the quality and potential risks of the support they receive, creating a need for transparent evaluation tools.
Method: The system integrates two evaluation approaches: 1) 12 model-based metrics from task-specific predictors, and 2) rubric-based metrics from a literature-derived library (69 metrics) plus user-defined custom metrics, operationalized with configurable LLM judges. It’s available as web app, browser extension, and CLI.
Result: Human evaluation with 20 participants and 6 mental-health professionals suggests CounselReflect supports understandable, usable, and trustworthy auditing. The system provides session-level summaries, turn-level scores, and evidence-linked excerpts.
Conclusion: CounselReflect enables transparent, multi-dimensional auditing of mental-health support dialogues, addressing the need for structured quality assessment in conversational mental-health systems.
Abstract: Mental-health support is increasingly mediated by conversational systems (e.g., LLM-based tools), but users often lack structured ways to audit the quality and potential risks of the support they receive. We introduce CounselReflect, an end-to-end toolkit for auditing mental-health support dialogues. Rather than producing a single opaque quality score, CounselReflect provides structured, multi-dimensional reports with session-level summaries, turn-level scores, and evidence-linked excerpts to support transparent inspection. The system integrates two families of evaluation signals: (i) 12 model-based metrics produced by task-specific predictors, and (ii) rubric-based metrics that extend coverage via a literature-derived library (69 metrics) and user-defined custom metrics, operationalized with configurable LLM judges. CounselReflect is available as a web application, browser extension, and command-line interface (CLI), enabling use in real-time settings as well as at scale. Human evaluation includes a user study with 20 participants and an expert review with 6 mental-health professionals, suggesting that CounselReflect supports understandable, usable, and trustworthy auditing. A demo video and full source code are also provided.
[32] Authorship Impersonation via LLM Prompting does not Evade Authorship Verification Methods
Baoyi Zeng, Andrea Nini
Main category: cs.CL
TL;DR: LLM-generated author impersonations fail to bypass forensic authorship verification systems, with some methods performing better against impersonations than genuine negative samples due to LLMs’ higher lexical diversity.
Details
Motivation: With the rise of LLMs, there's concern that adversaries could use them to impersonate others' writing styles in forensic contexts, potentially undermining authorship verification systems used in legal investigations.
Method: Used GPT-4o to generate impersonation texts across three genres (emails, text messages, social media posts) under four prompting conditions, then evaluated against both traditional (n-gram tracing, Ranking-Based Impostors Method, LambdaG) and neural (AdHominem, LUAR, STAR) AV methods within a likelihood-ratio framework.
Result: LLM-generated texts failed to sufficiently replicate authorial individuality to bypass established AV systems. Some methods achieved higher accuracy when rejecting impersonation texts compared to genuine negative samples. The resilience stems from higher lexical diversity and entropy in LLM-generated texts.
Conclusion: Current authorship verification systems remain robust against entry-level LLM impersonation attempts across multiple genres, despite LLM accessibility. The counter-intuitive resilience is partly due to LLMs’ tendency to produce text with higher lexical diversity than human authors.
Abstract: Authorship verification (AV), the task of determining whether a questioned text was written by a specific individual, is a critical part of forensic linguistics. While manual authorial impersonation by perpetrators has long been a recognized threat in historical forensic cases, recent advances in large language models (LLMs) raise new challenges, as adversaries may exploit these tools to impersonate another’s writing. This study investigates whether prompted LLMs can generate convincing authorial impersonations and whether such outputs can evade existing forensic AV systems. Using GPT-4o as the adversary model, we generated impersonation texts under four prompting conditions across three genres: emails, text messages, and social media posts. We then evaluated these outputs against both non-neural AV methods (n-gram tracing, Ranking-Based Impostors Method, LambdaG) and neural approaches (AdHominem, LUAR, STAR) within a likelihood-ratio framework. Results show that LLM-generated texts failed to sufficiently replicate authorial individuality to bypass established AV systems. We also observed that some methods achieved even higher accuracy when rejecting impersonation texts compared to genuine negative samples. Overall, these findings indicate that, despite the accessibility of LLMs, current AV systems remain robust against entry-level impersonation attempts across multiple genres. Furthermore, we demonstrate that this counter-intuitive resilience stems, at least in part, from the higher lexical diversity and entropy inherent in LLM-generated texts.
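The two surface statistics the paper credits for this resilience, lexical diversity and entropy, are both cheap to compute. A sketch using type-token ratio and Shannon entropy of the empirical token distribution (the example strings are invented, not from the study):

```python
import math
from collections import Counter

def type_token_ratio(tokens: list[str]) -> float:
    """Lexical diversity: distinct word types / total tokens."""
    return len(set(tokens)) / len(tokens)

def token_entropy(tokens: list[str]) -> float:
    """Shannon entropy (bits) of the empirical token distribution."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Invented examples: a repetitive human text message vs. a more varied
# LLM-style impersonation of it.
human = "ok ok see you tmrw ok".split()
llm = "certainly I will see you tomorrow then".split()
print(type_token_ratio(human), type_token_ratio(llm))
print(token_entropy(human), token_entropy(llm))
```

Repetition lowers both measures, so an LLM that avoids repeating itself leaves a statistical signature that AV methods can exploit.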
[33] M-MiniGPT4: Multilingual VLLM Alignment via Translated Data
Seung Hun Han, Youssef Mohamed, Mohamed Elhoseiny
Main category: cs.CL
TL;DR: M-MiniGPT4 is a multilingual vision-language model that extends MiniGPT4 to support 11 languages, achieving state-of-the-art performance on multilingual benchmarks through mixed data training and parallel text alignment.
Details
Motivation: The paper addresses the need for multilingual vision-language understanding capabilities, as most existing vision-language models are primarily English-centric. There's a gap in supporting low-resource languages and enabling cross-lingual vision-language applications.
Method: Extends MiniGPT4 architecture with multilingual capabilities using: 1) Mixture of native multilingual and translated data for training, 2) Multilingual alignment training stage using parallel text corpora to enhance cross-lingual capabilities, 3) Support for 11 languages.
Result: Achieves 36% accuracy on multilingual MMMU benchmark, outperforming state-of-the-art models in the same weight class. The model demonstrates strong vision-language understanding across 11 languages.
Conclusion: M-MiniGPT4 successfully extends vision-language understanding to multilingual settings, providing a strong baseline for future research in low-resource and multilingual vision-language applications.
Abstract: This paper presents a Multilingual Vision Large Language Model, named M-MiniGPT4. Our model exhibits strong vision-language understanding (VLU) capabilities across 11 languages. We utilize a mixture of native multilingual and translated data to push the multilingual VLU performance of the MiniGPT4 architecture. In addition, we propose a multilingual alignment training stage that uses parallel text corpora to further enhance the multilingual capabilities of our model. M-MiniGPT4 achieves 36% accuracy on the multilingual MMMU benchmark, outperforming state-of-the-art models in the same weight class, including foundation models released after the majority of this work was completed. We open-source our models, code, and translated datasets to facilitate future research in low-resource and multilingual settings.
[34] Calibrated Confidence Expression for Radiology Report Generation
David Bani-Harouni, Chantal Pellegrini, Julian Lüers, Su Hwan Kim, Markus Baalmann, Benedikt Wiestler, Rickmer Braren, Nassir Navab, Matthias Keicher
Main category: cs.CL
TL;DR: ConRad is a reinforcement learning framework for fine-tuning medical LVLMs to produce calibrated verbalized confidence estimates alongside radiology reports, improving safety through selective verification.
Details
Motivation: Safe deployment of LVLMs in radiology requires clinically interpretable confidence indicators to enable selective verification and reduce hallucination risks. Current models are overconfident, and calibration research in multimodal medical settings is limited.
Method: Reinforcement learning framework using GRPO algorithm with reward functions based on logarithmic scoring rule. Two settings: single report-level confidence score and sentence-level variant assigning confidence to each claim.
Result: ConRad substantially improves calibration and outperforms competing methods. Clinical evaluation shows report-level scores align well with clinicians’ judgment.
Conclusion: ConRad enables safer clinical integration of AI-assisted report generation by highlighting reports or low-confidence statements for targeted review.
Abstract: Safe deployment of Large Vision-Language Models (LVLMs) in radiology report generation requires not only accurate predictions but also clinically interpretable indicators of when outputs should be thoroughly reviewed, enabling selective radiologist verification and reducing the risk of hallucinated findings influencing clinical decisions. One intuitive approach to this is verbalized confidence, where the model explicitly states its certainty. However, current state-of-the-art language models are often overconfident, and research on calibration in multimodal settings such as radiology report generation is limited. To address this gap, we introduce ConRad (Confidence Calibration for Radiology Reports), a reinforcement learning framework for fine-tuning medical LVLMs to produce calibrated verbalized confidence estimates alongside radiology reports. We study two settings: a single report-level confidence score and a sentence-level variant assigning a confidence to each claim. Both are trained using the GRPO algorithm with reward functions based on the logarithmic scoring rule, which incentivizes truthful self-assessment by penalizing miscalibration and guarantees optimal calibration under reward maximization. Experimentally, ConRad substantially improves calibration and outperforms competing methods. In a clinical evaluation we show that ConRad’s report level scores are well aligned with clinicians’ judgment. By highlighting full reports or low-confidence statements for targeted review, ConRad can support safer clinical integration of AI-assistance for report generation.
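The abstract's calibration guarantee rests on the logarithmic scoring rule being strictly proper: expected reward is maximized only by stating the true correctness probability. A sketch of that property (function names and the probability floor are illustrative, not from the paper):

```python
import math

def log_score_reward(confidence: float, correct: bool, floor: float = 1e-6) -> float:
    """Logarithmic scoring rule: log p if the claim is correct,
    log(1 - p) otherwise. Clamped to avoid log(0)."""
    p = min(max(confidence, floor), 1 - floor)
    return math.log(p) if correct else math.log(1 - p)

def expected_reward(stated: float, true_p: float) -> float:
    """Expected reward when the claim is actually correct with prob. true_p."""
    return (true_p * log_score_reward(stated, True)
            + (1 - true_p) * log_score_reward(stated, False))

# For a claim that is correct 70% of the time, stating 0.7 beats both
# overconfidence (0.95) and underconfidence (0.5).
print(expected_reward(0.7, 0.7), expected_reward(0.95, 0.7), expected_reward(0.5, 0.7))
```

Because honest reporting is the unique optimum, maximizing this reward under RL pushes the model's verbalized confidence toward calibration.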
[35] MemFactory: Unified Inference & Training Framework for Agent Memory
Ziliang Guo, Ziheng Li, Zhiyu Li
Main category: cs.CL
TL;DR: MemFactory is a unified framework for training and evaluating memory-augmented LLM agents with modular memory lifecycle components and integrated RL optimization.
Details
Motivation: Existing memory-augmented LLM implementations are fragmented and task-specific, lacking unified infrastructure for integrating, training, and evaluating complex memory operation pipelines.
Method: Abstracts memory lifecycle into plug-and-play components with “Lego-like” architecture, integrates Group Relative Policy Optimization (GRPO) for fine-tuning memory management policies, and supports cutting-edge paradigms like Memory-R1, RMM, and MemAgent.
Result: Empirical validation on MemAgent architecture shows consistent performance improvements over base models with relative gains up to 14.8% across in-domain and out-of-distribution evaluation sets.
Conclusion: MemFactory provides standardized, extensible infrastructure that lowers the barrier to entry and enables future innovations in memory-driven AI agents.
Abstract: Memory-augmented Large Language Models (LLMs) are essential for developing capable, long-term AI agents. Recently, applying Reinforcement Learning (RL) to optimize memory operations, such as extraction, updating, and retrieval, has emerged as a highly promising research direction. However, existing implementations remain highly fragmented and task-specific, lacking a unified infrastructure to streamline the integration, training, and evaluation of these complex pipelines. To address this gap, we present MemFactory, the first unified, highly modular training and inference framework specifically designed for memory-augmented agents. Inspired by the success of unified fine-tuning frameworks like LLaMA-Factory, MemFactory abstracts the memory lifecycle into atomic, plug-and-play components, enabling researchers to seamlessly construct custom memory agents via a “Lego-like” architecture. Furthermore, the framework natively integrates Group Relative Policy Optimization (GRPO) to fine-tune internal memory management policies driven by multi-dimensional environmental rewards. MemFactory provides out-of-the-box support for recent cutting-edge paradigms, including Memory-R1, RMM, and MemAgent. We empirically validate MemFactory on the open-source MemAgent architecture using its publicly available training and evaluation data. Across both in-domain and out-of-distribution evaluation sets, MemFactory consistently improves performance over the corresponding base models, with relative gains of up to 14.8%. By providing a standardized, extensible, and easy-to-use infrastructure, MemFactory significantly lowers the barrier to entry, paving the way for future innovations in memory-driven AI agents.
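The GRPO step the framework integrates is usually described as standardizing each rollout's reward against its own sample group, removing the need for a learned value baseline. A sketch of that normalization as commonly formulated (MemFactory's actual interfaces are not shown in the abstract):

```python
def grpo_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Group-relative advantage: standardize each rollout's reward against
    the mean and std of its own sampled group, so above-average rollouts
    get positive advantage and below-average ones negative."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts of the same memory-management episode, scored by the environment.
print(grpo_advantages([0.2, 0.8, 0.5, 0.5]))  # mean-zero, unit-scale advantages
```

These advantages then weight the policy-gradient update on the memory extraction, updating, and retrieval actions.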
[36] Distilling Human-Aligned Privacy Sensitivity Assessment from Large Language Models
Gabriel Loiseau, Damien Sileo, Damien Riquet, Maxime Meyer, Marc Tommasi
Main category: cs.CL
TL;DR: Distilling privacy assessment capabilities from large LLMs into lightweight encoder models for efficient privacy evaluation of textual data.
Details
Motivation: Current LLM-based privacy evaluators are computationally expensive and impractical for processing sensitive data at scale, creating a need for efficient alternatives.
Method: Distill privacy assessment capabilities from Mistral Large 3 (675B) into lightweight encoder models (as few as 150M parameters) using a large-scale dataset of privacy-annotated texts across 10 domains.
Result: Efficient classifiers preserve strong agreement with human annotations while dramatically reducing computational requirements, validated on human-annotated test data
Conclusion: Lightweight models can effectively serve as practical privacy evaluation metrics for de-identification systems while maintaining accuracy comparable to human judgments
Abstract: Accurate privacy evaluation of textual data remains a critical challenge in privacy-preserving natural language processing. Recent work has shown that large language models (LLMs) can serve as reliable privacy evaluators, achieving strong agreement with human judgments; however, their computational cost and impracticality for processing sensitive data at scale limit real-world deployment. We address this gap by distilling the privacy assessment capabilities of Mistral Large 3 (675B) into lightweight encoder models with as few as 150M parameters. Leveraging a large-scale dataset of privacy-annotated texts spanning 10 diverse domains, we train efficient classifiers that preserve strong agreement with human annotations while dramatically reducing computational requirements. We validate our approach on human-annotated test data and demonstrate its practical utility as an evaluation metric for de-identification systems.
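The paper does not publish its training objective, but the teacher-to-student transfer it describes is typically done with a soft-label distillation loss. As an illustrative sketch (the temperature value and label space are assumptions, not from the paper), the objective might look like:

```python
import math

def softmax(logits, temp=1.0):
    """Temperature-softened softmax over a list of logits."""
    exps = [math.exp(l / temp) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def distillation_loss(student_logits, teacher_logits, temp=2.0):
    """Illustrative knowledge-distillation objective (a sketch, not the
    paper's actual training code): KL divergence between the softened
    teacher and student distributions over privacy-sensitivity labels."""
    p = softmax(teacher_logits, temp)  # teacher (e.g. the 675B model)
    q = softmax(student_logits, temp)  # student (e.g. a 150M encoder)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

The loss is zero when the student exactly matches the teacher's softened distribution and grows as the two diverge, which is what drives the small encoder toward the large model's judgments.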
[37] LLM Probe: Evaluating LLMs for Low-Resource Languages
Hailay Kidu Teklehaymanot, Gebrearegawi Gebremariam, Wolfgang Nejdl
Main category: cs.CL
TL;DR: LLM Probe is a lexicon-based framework for evaluating LLMs’ linguistic abilities in low-resource languages, tested on a Semitic language case study showing performance differences between model types.
Details
Motivation: There's limited understanding of LLMs' linguistic abilities in low-resource and morphologically rich languages due to scarce annotated resources and lack of standardized evaluation frameworks.
Method: Created LLM Probe framework with manually annotated benchmark dataset for a low-resource Semitic language, evaluating models across lexical alignment, POS recognition, morphosyntactic probing, and translation accuracy.
Result: Sequence-to-sequence models excel in morphosyntactic analysis and translation, while causal models perform well in lexical alignment but have weaker translation accuracy.
Conclusion: Linguistically grounded evaluation is needed to understand LLM limitations in low-resource settings; framework and dataset released as open-source tools for reproducible benchmarking.
Abstract: Despite rapid advances in large language models (LLMs), their linguistic abilities in low-resource and morphologically rich languages are still not well understood due to limited annotated resources and the absence of standardized evaluation frameworks. This paper presents LLM Probe, a lexicon-based assessment framework designed to systematically evaluate the linguistic skills of LLMs in low-resource language environments. The framework analyzes models across four areas of language understanding: lexical alignment, part-of-speech recognition, morphosyntactic probing, and translation accuracy. To illustrate the framework, we create a manually annotated benchmark dataset using a low-resource Semitic language as a case study. The dataset comprises bilingual lexicons with linguistic annotations, including part-of-speech tags, grammatical gender, and morphosyntactic features, which demonstrate high inter-annotator agreement to ensure reliable annotations. We test a variety of models, including causal language models and sequence-to-sequence architectures. The results reveal notable differences in performance across various linguistic tasks: sequence-to-sequence models generally excel in morphosyntactic analysis and translation quality, whereas causal models demonstrate strong performance in lexical alignment but exhibit weaker translation accuracy. Our results emphasize the need for linguistically grounded evaluation to better understand LLM limitations in low-resource settings. We release LLM Probe and the accompanying benchmark dataset as open-source tools to promote reproducible benchmarking and to support the development of more inclusive multilingual language technologies.
[38] Impact of enriched meaning representations for language generation in dialogue tasks: A comprehensive exploration of the relevance of tasks, corpora and metrics
Alain Vázquez, Maria Inés Torres
Main category: cs.CL
TL;DR: Analysis of how providing task demonstrators (MR-sentence pairs) to NLG models improves dialogue generation quality across different datasets and metrics
Details
Motivation: To analyze whether providing task demonstrators to natural language generation models enhances generation quality for conversational systems, particularly examining how this approach performs across different domains, dataset characteristics, and evaluation metrics.
Method: Used enriched inputs with MR-sentence pair demonstrators extracted from original datasets, evaluated across four diverse datasets with different characteristics (domain, size, lexicon, MR variability, acquisition process) using five metrics focusing on different linguistic aspects
Result: Enriched inputs are effective for complex tasks and small datasets with high MR/sentence variability, beneficial in zero-shot settings across domains. Semantic metrics capture generation quality better than lexical metrics, with human-trained semantic metrics detecting subtle issues that embedding-based metrics miss
Conclusion: Task demonstrators enhance NLG model performance, especially for complex/small datasets, and semantic metrics are more accurate for evaluation than lexical metrics. Models show fast adaptability and robustness at semantic and communicative intention levels
Abstract: Conversational systems should generate diverse language forms to interact fluently and accurately with users. In this context, Natural Language Generation (NLG) engines convert Meaning Representations (MRs) into sentences, directly influencing user perception. These MRs usually encode the communicative function (e.g., inform, request, confirm) via DAs and enumerate the semantic content with slot-value pairs. In this work, our objective is to analyse whether providing a task demonstrator to the generator enhances the generations of a fine-tuned model. This demonstrator is an MR-sentence pair extracted from the original dataset that enriches the input at training and inference time. The analysis involves five metrics that focus on different linguistic aspects, and four datasets that differ in multiple features, such as domain, size, lexicon, MR variability, and acquisition process. To the best of our knowledge, this is the first study on dialogue NLG implementing a comparative analysis of the impact of MRs on generation quality across domains, corpus characteristics, and the metrics used to evaluate these generations. Our key insight is that the proposed enriched inputs are effective for complex tasks and small datasets with high variability in MRs and sentences. They are also beneficial in zero-shot settings for any domain. Moreover, the analysis of the metrics shows that semantic metrics capture generation quality more accurately than lexical metrics. In addition, among these semantic metrics, those trained with human ratings can detect omissions and other subtle semantic issues that embedding-based metrics often miss. Finally, the evolution of the metric scores and the excellent results for Slot Accuracy and Dialogue Act Accuracy demonstrate that the generative models present fast adaptability to different tasks and robustness at semantic and communicative intention levels.
[39] Baby Scale: Investigating Models Trained on Individual Children’s Language Input
Steven Y. Feng, Alvin W. M. Tan, Michael C. Frank
Main category: cs.CL
TL;DR: LMs trained on child-scale language data show acceptable grammar scaling but poor semantic/world knowledge scaling compared to synthetic data, with performance variability across children’s experiences and correlations between model likelihoods and children’s word learning.
Details
Motivation: To understand the "data gap" between language models requiring massive datasets and human children learning from limited natural input, by benchmarking LMs on human-scale datasets to study how linguistic knowledge emerges from children's natural training data.
Method: Used transcripts from BabyView dataset (videos from children ages 6-36 months) to investigate: (1) scaling performance at child-scale data regimes, (2) variability across different children’s experiences and linguistic predictors of dataset quality, (3) relationships between model and child language learning outcomes.
Result: LMs trained on child data show acceptable scaling for grammar tasks but lower scaling on semantic and world knowledge tasks than models trained on synthetic data; substantial variability across different children’s data; performance associated with distributional and interactional linguistic features; model likelihoods for individual words correlate with children’s learning of those words.
Conclusion: Understanding what properties make language data efficient for learning can enable more powerful small-scale language models while also shedding light on human language acquisition, with implications for both AI and cognitive science.
Abstract: Modern language models (LMs) must be trained on many orders of magnitude more words of training data than human children receive before they begin to produce useful behavior. Assessing the nature and origins of this “data gap” requires benchmarking LMs on human-scale datasets to understand how linguistic knowledge emerges from children’s natural training data. Using transcripts from the BabyView dataset (videos from children ages 6-36 months), we investigate (1) scaling performance at child-scale data regimes, (2) variability in model performance across datasets from different children’s experiences and linguistic predictors of dataset quality, and (3) relationships between model and child language learning outcomes. LMs trained on child data show acceptable scaling for grammar tasks, but lower scaling on semantic and world knowledge tasks than models trained on synthetic data; we also observe substantial variability on data from different children. Beyond dataset size, performance is most associated with a combination of distributional and interactional linguistic features, broadly consistent with what makes high-quality input for child language development. Finally, model likelihoods for individual words correlate with children’s learning of those words, suggesting that properties of child-directed input may influence both model learning and human language development. Overall, understanding what properties make language data efficient for learning can enable more powerful small-scale language models while also shedding light on human language acquisition.
[40] Can LLM Agents Identify Spoken Dialects like a Linguist?
Tobias Bystrich, Lukas Hamm, Maria Hassan, Lea Fischbach, Lucie Flek, Akbar Karimi
Main category: cs.CL
TL;DR: LLMs can perform Swiss German dialect classification using ASR phonetic transcriptions and linguistic resources, achieving improved results with linguistic information and showing potential comparable to HuBERT models.
Details
Motivation: Audio dialect classification is challenging due to scarce labeled dialectal speech data. The paper explores whether LLMs can understand dialects and perform comparably to specialized models like HuBERT for Swiss German dialect classification.
Method: Uses phonetic transcriptions from ASR systems combined with linguistic resources (dialect feature maps, vowel history, rules). Compares LLM performance against HuBERT models and establishes both LLM and human linguist baselines.
Result: LLM predictions improve when linguistic information is provided. Human baseline shows automatically generated transcriptions can be beneficial for classification but also highlight areas for improvement.
Conclusion: LLMs show promise for dialect classification tasks when augmented with linguistic resources, offering a viable alternative to specialized audio models despite challenges with automatic transcriptions.
Abstract: Due to the scarcity of labeled dialectal speech, audio dialect classification is a challenging task for most languages, including Swiss German. In this work, we explore the ability of large language models (LLMs) as agents in understanding the dialects and whether they can show comparable performance to models such as HuBERT in dialect classification. In addition, we provide an LLM baseline and a human linguist one. Our approach uses phonetic transcriptions produced by ASR systems and combines them with linguistic resources such as dialect feature maps, vowel history, and rules. Our findings indicate that, when linguistic information is provided, the LLM predictions improve. The human baseline shows that automatically generated transcriptions can be beneficial for such classifications, but also presents opportunities for improvement.
[41] Bringing Up a Bilingual BabyLM: Investigating Multilingual Language Acquisition Using Small-Scale Models
Linda Zeng, Steven Y. Feng, Michael C. Frank
Main category: cs.CL
TL;DR: Using language models to simulate multilingual acquisition, researchers find bilingual models perform similarly to monolingual models in primary language while gaining strong second language proficiency, suggesting no inherent disadvantage to bilingual learning.
Details
Motivation: To address theoretical and practical questions about multilingual acquisition in children, particularly whether bilingualism causes learning delays and how different exposure regimes affect language learning outcomes, using controlled computational simulations.
Method: Created matched 100M-word monolingual and bilingual datasets using synthetic data and machine translation, trained GPT-2 models on various exposure regimes, and evaluated performance on perplexity, grammaticality, and semantic knowledge metrics.
Result: Bilingual models performed similarly to monolingual models in their primary language while showing strong performance in the second language, with no significant differences between different bilingual exposure regimes.
Conclusion: Bilingual input poses no inherent challenges for statistical learners, and different bilingual exposure regimes don’t strongly affect learning outcomes, suggesting multilingual acquisition doesn’t necessarily lead to language learning delays.
Abstract: Multilingualism is incredibly common around the world, leading to many important theoretical and practical questions about how children learn multiple languages at once. For example, does multilingual acquisition lead to delays in learning? Are there better and worse ways to structure multilingual input? Many correlational studies address these questions, but it is surprisingly difficult to get definitive answers because children cannot be randomly assigned to be multilingual and data are typically not matched between languages. We use language model training as a method for simulating a variety of highly controlled exposure conditions, and create matched 100M-word mono- and bilingual datasets using synthetic data and machine translation. We train GPT-2 models on monolingual and bilingual data organized to reflect a range of exposure regimes, and evaluate their performance on perplexity, grammaticality, and semantic knowledge. Across model scales and measures, bilingual models perform similarly to monolingual models in one language, but show strong performance in the second language as well. These results suggest that there are no strong differences between different bilingual exposure regimes, and that bilingual input poses no in-principle challenges for agnostic statistical learners.
[42] When Can We Trust LLM Graders? Calibrating Confidence for Automated Assessment
Robinson Ferrer, Damla Turgut, Zhongzhou Chen, Shashank Sonkar
Main category: cs.CL
TL;DR: LLMs can grade but are unreliable; this paper focuses on predicting when LLM graders are likely correct using confidence estimation methods, finding self-reported confidence works best despite being cheaper.
Details
Motivation: LLMs show promise for automated grading but produce unreliable outputs. Instead of directly improving grading accuracy, the authors address the complementary problem of predicting when an LLM grader is likely to be correct, enabling selective automation where high-confidence predictions are processed automatically while uncertain cases are flagged for human review.
Method: The paper compares three confidence estimation methods (self-reported confidence, self-consistency voting, and token probability) across seven LLMs of varying scale (4B to 120B parameters) on three educational datasets: RiceChem (long-answer chemistry), SciEntsBank, and Beetle (short-answer science).
Result: Self-reported confidence consistently achieves the best calibration across all conditions (avg ECE 0.166 vs 0.229 for self-consistency). Surprisingly, self-consistency remains 38% worse despite requiring 5× the inference cost. Larger models exhibit substantially better calibration though gains vary by dataset and method, with GPT-OSS-120B achieving the best calibration (avg ECE 0.100) and strong discrimination (avg AUC 0.668). Confidence is strongly top-skewed across methods, creating a “confidence floor” that practitioners must account for when setting thresholds.
Conclusion: Simply asking LLMs to report their confidence provides a practical approach for identifying reliable grading predictions, offering a cost-effective solution for selective automation in educational assessment.
Abstract: Large Language Models (LLMs) show promise for automated grading, but their outputs can be unreliable. Rather than improving grading accuracy directly, we address a complementary problem: predicting when an LLM grader is likely to be correct. This enables selective automation where high-confidence predictions are processed automatically while uncertain cases are flagged for human review. We compare three confidence estimation methods (self-reported confidence, self-consistency voting, and token probability) across seven LLMs of varying scale (4B to 120B parameters) on three educational datasets: RiceChem (long-answer chemistry), SciEntsBank, and Beetle (short-answer science). Our experiments reveal that self-reported confidence consistently achieves the best calibration across all conditions (avg ECE 0.166 vs 0.229 for self-consistency). Surprisingly, self-consistency remains 38% worse despite requiring 5× the inference cost. Larger models exhibit substantially better calibration though gains vary by dataset and method (e.g., a 28% ECE reduction for self-reported), with GPT-OSS-120B achieving the best calibration (avg ECE 0.100) and strong discrimination (avg AUC 0.668). We also observe that confidence is strongly top-skewed across methods, creating a “confidence floor” that practitioners must account for when setting thresholds. These findings suggest that simply asking LLMs to report their confidence provides a practical approach for identifying reliable grading predictions. Code is available at https://github.com/sonkar-lab/llm_grading_calibration.
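The ECE figures the paper reports are a standard calibration measure: bin predictions by confidence, then average the per-bin gap between mean confidence and accuracy, weighted by bin size. A minimal sketch (the bin count of 10 is the usual convention, assumed here, not taken from the paper):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard ECE: partition predictions into equal-width confidence bins,
    then sum |mean confidence - accuracy| per bin, weighted by bin mass."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # half-open bins (lo, hi]; put exact zeros into the first bin
        in_bin = [i for i, c in enumerate(confidences)
                  if lo < c <= hi or (b == 0 and c == 0.0)]
        if not in_bin:
            continue
        avg_conf = sum(confidences[i] for i in in_bin) / len(in_bin)
        acc = sum(correct[i] for i in in_bin) / len(in_bin)
        ece += len(in_bin) / n * abs(avg_conf - acc)
    return ece
```

A perfectly calibrated grader (confidence matching empirical accuracy in every bin) scores 0; the paper's best model reaches an average ECE of 0.100 on this scale.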
[43] Learning Diagnostic Reasoning for Decision Support in Toxicology
Nico Oberländer, David Bani-Harouni, Tobias Zellner, Nassir Navab, Florian Eyer, Matthias Keicher
Main category: cs.CL
TL;DR: DeToxR uses RL-aligned LLMs to fuse unstructured narratives with structured medical data for multi-label substance identification in emergency toxicology, outperforming baselines and expert toxicologists.
Details
Motivation: Acute poly-substance intoxication requires rapid decisions under uncertainty with incomplete information. LLMs struggle in this setting despite potential for processing heterogeneous inputs, often underperforming simple baselines.
Method: DeToxR adapts RL to emergency toxicology with a data-fusion engine for multi-label prediction across 14 substance classes. Uses LLM finetuned with Group Relative Policy Optimization (GRPO) and optimizes reasoning directly using clinical performance reward based on multi-label agreement metric.
Result: Model significantly outperforms unadapted base LLM counterpart and supervised baselines. In clinical validation, outperforms expert toxicologist in identifying correct poisons (Micro-F1: 0.644 vs. 0.473).
Conclusion: Demonstrates potential of RL-aligned LLMs to synthesize unstructured pre-clinical narratives and structured medical data for decision support in high-stakes environments like emergency toxicology.
Abstract: Acute poly-substance intoxication requires rapid, life-saving decisions under substantial uncertainty, as clinicians must rely on incomplete ingestion details and nonspecific symptoms. Effective diagnostic reasoning in this chaotic environment requires fusing unstructured, non-medical narratives (e.g. paramedic scene descriptions and unreliable patient self-reports or known histories), with structured medical data like vital signs. While Large Language Models (LLMs) show potential for processing such heterogeneous inputs, they struggle in this setting, often underperforming simple baselines that rely solely on patient histories. To address this, we present DeToxR (Decision-support for Toxicology with Reasoning), the first adaptation of Reinforcement Learning (RL) to emergency toxicology. We design a robust data-fusion engine for multi-label prediction across 14 substance classes based on an LLM finetuned with Group Relative Policy Optimization (GRPO). We optimize the model’s reasoning directly using a clinical performance reward. By formulating a multi-label agreement metric as the reward signal, the model is explicitly penalized for missing co-ingested substances and hallucinating absent poisons. Our model significantly outperforms its unadapted base LLM counterpart and supervised baselines. Furthermore, in a clinical validation study, the model indicates a clinical advantage by outperforming an expert toxicologist in identifying the correct poisons (Micro-F1: 0.644 vs. 0.473). These results demonstrate the potential of RL-aligned LLMs to synthesize unstructured pre-clinical narratives and structured medical data for decision support in high-stakes environments.
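The Micro-F1 comparison above (0.644 vs. 0.473) uses the standard micro-averaged F1 for multi-label prediction: pool true positives, false positives, and false negatives over all cases and substance classes, then compute one global F1. A minimal sketch of that metric:

```python
def micro_f1(preds, golds):
    """Micro-averaged F1 over multi-label predictions: counts are pooled
    across all samples, so missing a co-ingested substance (false negative)
    and hallucinating an absent one (false positive) both lower the score."""
    tp = fp = fn = 0
    for pred, gold in zip(preds, golds):
        pred, gold = set(pred), set(gold)
        tp += len(pred & gold)   # substances correctly identified
        fp += len(pred - gold)   # hallucinated substances
        fn += len(gold - pred)   # missed substances
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Because the same quantity serves as the GRPO reward signal, the model is penalized in training for exactly the two failure modes the abstract names: missed co-ingestions and hallucinated poisons.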
[44] Agenda-based Narrative Extraction: Steering Pathfinding Algorithms with Large Language Models
Brian Felipe Keith-Norambuena, Carolina Inés Rojas-Córdova, Claudio Juvenal Meneses-Villegas, Elizabeth Johanna Lam-Esquenazi, Angélica María Flores-Bustos, Ignacio Alejandro Molina-Villablanca, Joshua Emanuel Leyton-Vallejos
Main category: cs.CL
TL;DR: LLM-integrated narrative extraction method that steers storyline construction toward user-specified agendas while maintaining coherence, enabling multiple perspectives from the same corpus.
Details
Motivation: Existing narrative extraction methods face trade-offs between coherence, interactivity, and multi-storyline support. Narrative Maps supports interaction and multiple storylines but lacks individual path coherence, while Narrative Trails achieves coherence but lacks user guidance and multiple perspectives.
Method: Agenda-based narrative extraction integrates LLMs into Narrative Trails pathfinding. At each step, an LLM ranks candidate documents based on their alignment with a user-specified agenda while maintaining narrative coherence. Different agendas yield different storylines through the same corpus.
Result: LLM-driven steering achieves 9.9% higher alignment than keyword matching on semantic agendas (p=0.017), with 13.3% improvement on “Regime Crackdown” specifically (p=0.037). Coherence cost is minimal (only 2.2% reduction compared to agenda-agnostic baseline). Counter-agendas score uniformly low, confirming steering cannot fabricate unsupported narratives.
Conclusion: Agenda-based narrative extraction successfully bridges the gap between coherence and user guidance, enabling multiple storylines from the same corpus while maintaining narrative quality and preventing fabrication of unsupported narratives.
Abstract: Existing narrative extraction methods face a trade-off between coherence, interactivity, and multi-storyline support. Narrative Maps supports rich interaction and generates multiple storylines as a byproduct of its coverage constraints, though this comes at the cost of individual path coherence. Narrative Trails achieves high coherence through maximum capacity path optimization but provides no mechanism for user guidance or multiple perspectives. We introduce agenda-based narrative extraction, a method that bridges this gap by integrating large language models into the Narrative Trails pathfinding process to steer storyline construction toward user-specified perspectives. Our approach uses an LLM at each step to rank candidate documents based on their alignment with a given agenda while maintaining narrative coherence. Running the algorithm with different agendas yields different storylines through the same corpus. We evaluated our approach on a news article corpus using LLM judges with Claude Opus 4.5 and GPT 5.1, measuring both coherence and agenda alignment across 64 endpoint pairs and 6 agendas. LLM-driven steering achieves 9.9% higher alignment than keyword matching on semantic agendas (p=0.017), with 13.3% improvement on “Regime Crackdown” specifically (p=0.037), while keyword matching remains competitive on agendas with literal keyword overlap. The coherence cost is minimal: LLM steering reduces coherence by only 2.2% compared to the agenda-agnostic baseline. Counter-agendas that contradict the source material score uniformly low (2.2-2.5) across all methods, confirming that steering cannot fabricate unsupported narratives.
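The step-by-step steering described above can be caricatured as a greedy walk where each candidate is scored jointly for coherence and agenda fit. This is a deliberately simplified sketch, not the paper's maximum-capacity path algorithm; `coherence` and `agenda_score` are stand-ins for the embedding similarity and the LLM ranker:

```python
def steer_path(start, goal, neighbors, coherence, agenda_score, max_steps=20):
    """Illustrative greedy sketch of agenda-steered pathfinding.
    neighbors(doc)   -> candidate next documents
    coherence(a, b)  -> how well b continues a (stand-in for embedding score)
    agenda_score(d)  -> agenda alignment of d (stand-in for the LLM ranking)"""
    path = [start]
    while path[-1] != goal and len(path) < max_steps:
        cands = [d for d in neighbors(path[-1]) if d not in path]
        if not cands:
            break  # dead end: no unvisited continuation
        # pick the candidate that best balances coherence and agenda fit
        path.append(max(cands, key=lambda d: coherence(path[-1], d) + agenda_score(d)))
    return path
```

Swapping in a different `agenda_score` while keeping the corpus fixed yields a different storyline between the same endpoints, which is the core idea of the method.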
[45] Near-Miss: Latent Policy Failure Detection in Agentic Workflows
Ella Rabinovich, David Boaz, Naama Zwerdling, Ateret Anaby-Tavor
Main category: cs.CL
TL;DR: A novel metric for detecting latent policy failures in LLM-based agentic workflows that bypass required policy checks but still reach correct outcomes, revealing blind spots in current evaluation methods.
Details
Motivation: Current evaluation of policy adherence in LLM-based agentic workflows only checks final system state against ground truth, missing subtle cases where agents bypass required policy checks but still reach correct outcomes due to favorable circumstances (near-misses or latent failures).
Method: Building on ToolGuard framework (converts natural-language policies to executable guard code), the method analyzes agent conversation traces to determine whether tool-calling decisions were sufficiently informed by policy checks.
Result: Evaluation on τ²-verified Airlines benchmark shows latent failures occur in 8-17% of trajectories involving mutating tool calls, even when final outcome matches expected ground-truth state.
Conclusion: Current evaluation methodologies have a blind spot; there’s need for metrics that assess not only final outcomes but also the decision process leading to them, as latent policy failures are common in agentic workflows.
Abstract: Agentic systems for business process automation often require compliance with policies governing conditional updates to the system state. Evaluation of policy adherence in LLM-based agentic workflows is typically performed by comparing the final system state against a predefined ground truth. While this approach detects explicit policy violations, it may overlook a more subtle class of issues in which agents bypass required policy checks, yet reach a correct outcome due to favorable circumstances. We refer to such cases as “near-misses” or “latent failures”. In this work, we introduce a novel metric for detecting latent policy failures in agent conversation traces. Building on the ToolGuard framework, which converts natural-language policies into executable guard code, our method analyzes agent trajectories to determine whether an agent’s tool-calling decisions were sufficiently informed. We evaluate our approach on the τ²-verified Airlines benchmark across several contemporary open and proprietary LLMs acting as agents. Our results show that latent failures occur in 8-17% of trajectories involving mutating tool calls, even when the final outcome matches the expected ground-truth state. These findings reveal a blind spot in current evaluation methodologies and highlight the need for metrics that assess not only final outcomes but also the decision process leading to them.
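The core distinction (correct outcome, uninformed decision) can be illustrated with a toy trace scanner. All names below are hypothetical and this is not ToolGuard's actual API; it only shows the shape of the check:

```python
def find_latent_failures(trajectory, required_guards, final_state_correct):
    """Illustrative near-miss detector (a sketch, not ToolGuard's code):
    a mutating tool call is a latent failure when one of its required
    policy guards was never evaluated before the call was made, even
    though the final system state turned out correct."""
    checked = set()   # guards evaluated so far in the trace
    failures = []
    for step in trajectory:
        if step["type"] == "guard_check":
            checked.add(step["guard"])
        elif step["type"] == "tool_call" and step.get("mutating"):
            missing = required_guards.get(step["tool"], set()) - checked
            if missing:
                failures.append((step["tool"], sorted(missing)))
    # near-misses only: uninformed calls whose outcome still matched ground truth
    return failures if final_state_correct else []
```

In this framing, outcome-only evaluation inspects just `final_state_correct`, while the paper's metric also inspects the ordering of guard checks and mutating calls inside the trace.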
[46] ENEIDE: A High Quality Silver Standard Dataset for Named Entity Recognition and Linking in Historical Italian
Cristian Santini, Sebastian Barzaghi, Paolo Sernani, Emanuele Frontoni, Laura Melosi, Mehwish Alam
Main category: cs.CL
TL;DR: ENEIDE is a silver standard dataset for Named Entity Recognition and Linking (NERL) in historical Italian texts, created from two scholarly digital editions spanning 19th-20th centuries with over 8,000 entity annotations.
Details
Motivation: There's a lack of publicly available, multi-domain NERL datasets for historical Italian texts, particularly for training and evaluating models on temporal entity disambiguation across different time periods.
Method: Semi-automatic annotation extraction from manually curated scholarly digital editions (Digital Zibaldone and Aldo Moro Digitale), with quality control procedures, entity linking to Wikidata, and inclusion of NIL entities for unmapped entities.
Result: Created a dataset of 2,111 documents with 8,000+ entity annotations covering person, location, organization, and literary work types, with baseline experiments showing the dataset’s challenge for NERL and performance gap between zero-shot and fine-tuned models.
Conclusion: ENEIDE is the first multi-domain, publicly available NERL dataset for historical Italian, suitable for temporal entity disambiguation and cross-domain evaluation, released under CC BY-NC-SA 4.0 license.
Abstract: This paper introduces ENEIDE (Extracting Named Entities from Italian Digital Editions), a silver standard dataset for Named Entity Recognition and Linking (NERL) in historical Italian texts. The corpus comprises 2,111 documents with over 8,000 entity annotations semi-automatically extracted from two scholarly digital editions: Digital Zibaldone, the philosophical diary of the Italian poet Giacomo Leopardi (1798–1837), and Aldo Moro Digitale, the complete works of the Italian politician Aldo Moro (1916–1978). Annotations cover multiple entity types (person, location, organization, literary work) linked to Wikidata identifiers, including NIL entities that cannot be mapped to the knowledge graph. To the best of our knowledge, ENEIDE represents the first multi-domain, publicly available NERL dataset for historical Italian with training, development, and test splits. We present a methodology for semi-automatic annotations extraction from manually curated scholarly digital editions, including quality control and annotation enhancement procedures. Baseline experiments using state-of-the-art models demonstrate the dataset’s challenge for NERL and the gap between zero-shot approaches and fine-tuned models. The dataset’s diachronic coverage spanning two centuries makes it particularly suitable for temporal entity disambiguation and cross-domain evaluation. ENEIDE is released under a CC BY-NC-SA 4.0 license.
[47] SNEAK: Evaluating Strategic Communication and Information Leakage in Large Language Models
Adar Avsian, Larry Heck
Main category: cs.CL
TL;DR: SNEAK benchmark evaluates LLMs’ strategic communication abilities to share information with allies while keeping secrets from adversaries, revealing current models struggle with this capability compared to humans.
Details
Motivation: LLMs are increasingly used in multi-agent settings requiring strategic communication that balances informativeness with secrecy, but existing benchmarks don't measure this capability under asymmetric information.
Method: Introduces SNEAK benchmark where models generate messages indicating knowledge of a secret word without revealing it clearly, evaluated using simulated agents (ally and chameleon) to measure utility and leakage.
Result: Current LLMs struggle with strategic communication under asymmetric information; human participants outperform all evaluated models by up to 4x, showing this remains a challenging capability.
Conclusion: Strategic communication balancing informativeness and secrecy is a challenging capability for current LLMs that requires specialized evaluation, with humans significantly outperforming models in this task.
Abstract: Large language models (LLMs) are increasingly deployed in multi-agent settings where communication must balance informativeness and secrecy. In such settings, an agent may need to signal information to collaborators while preventing an adversary from inferring sensitive details. However, existing LLM benchmarks primarily evaluate capabilities such as reasoning, factual knowledge, or instruction following, and do not directly measure strategic communication under asymmetric information. We introduce SNEAK (Secret-aware Natural language Evaluation for Adversarial Knowledge), a benchmark for evaluating selective information sharing in language models. In SNEAK, a model is given a semantic category, a candidate set of words, and a secret word, and must generate a message that indicates knowledge of the secret without revealing it too clearly. We evaluate generated messages using two simulated agents with different information states: an ally, who knows the secret and must identify the intended message, and a chameleon, who does not know the secret and attempts to infer it from the message. This yields two complementary metrics: utility, measuring how well the message communicates to collaborators, and leakage, measuring how much information it reveals to an adversary. Using this framework, we analyze the trade-off between informativeness and secrecy in modern language models and show that strategic communication under asymmetric information remains a challenging capability for current systems. Notably, human participants outperform all evaluated models by a large margin, achieving up to four times higher scores.
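The ally/chameleon evaluation described above reduces to two rates over a set of trials. A minimal sketch, where the trial data layout and function name are illustrative assumptions, not the benchmark's actual code:

```python
def score_messages(trials):
    """Compute SNEAK-style utility and leakage over a list of trials.

    Each trial records whether the simulated ally (who knows the secret)
    recovered the intended message, and whether the chameleon (who does
    not) inferred the secret. Illustrative sketch only.
    """
    n = len(trials)
    utility = sum(t["ally_correct"] for t in trials) / n       # signal to collaborators
    leakage = sum(t["chameleon_correct"] for t in trials) / n  # information revealed to adversary
    return utility, leakage

trials = [
    {"ally_correct": True,  "chameleon_correct": False},
    {"ally_correct": True,  "chameleon_correct": True},
    {"ally_correct": False, "chameleon_correct": False},
    {"ally_correct": True,  "chameleon_correct": False},
]
utility, leakage = score_messages(trials)
```

A well-calibrated communicator keeps utility high while leakage stays low; the benchmark's core trade-off is that messages informative enough for the ally tend to leak to the chameleon.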
[48] Towards Empowering Consumers through Sentence-level Readability Scoring in German ESG Reports
Benjamin Josef Schüßler, Jakob Prange
Main category: cs.CL
TL;DR: Analyzes the readability of German ESG reports using crowdsourced annotations and evaluates scoring methods ranging from traditional readability formulas to LLM prompting and fine-tuned transformers; fine-tuned transformers predict human readability ratings best.
Details
Motivation: With growing sustainability focus and mandatory ESG reporting, companies need to communicate complex environmental, social, and governance information clearly to both expert and non-expert audiences. The paper investigates whether German ESG reports are written clearly enough for general public consumption.
Method: Extended existing German ESG report dataset with crowdsourced readability annotations. Applied various readability scoring methods including traditional readability formulas, LLM prompting approaches, and fine-tuned transformer models. Evaluated methods based on prediction error and correlation with human rankings.
Result: Native speakers generally perceive ESG report sentences as easy to read, but readability is subjective. Fine-tuned transformer models achieved lowest error in predicting human readability. LLM prompting showed potential but was outperformed by fine-tuned models. Ensemble averaging slightly improved performance at inference cost.
Conclusion: ESG reports are generally readable but readability assessment benefits from machine learning approaches. Fine-tuned transformers offer best performance for automated readability assessment, though LLMs show promise for distinguishing clear vs. hard-to-read sentences.
Abstract: With the ever-growing urgency of sustainability in the economy and society, and the massive stream of information that comes with it, consumers need reliable access to that information. To address this need, companies began publishing so called Environmental, Social, and Governance (ESG) reports, both voluntarily and forced by law. To serve the public, these reports must be addressed not only to financial experts but also to non-expert audiences. But are they written clearly enough? In this work, we extend an existing sentence-level dataset of German ESG reports with crowdsourced readability annotations. We find that, in general, native speakers perceive sentences in ESG reports as easy to read, but also that readability is subjective. We apply various readability scoring methods and evaluate them regarding their prediction error and correlation with human rankings. Our analysis shows that, while LLM prompting has potential for distinguishing clear from hard-to-read sentences, a small finetuned transformer predicts human readability with the lowest error. Averaging predictions of multiple models can slightly improve the performance at the cost of slower inference.
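Among the traditional readability formulas the paper evaluates, a standard choice for German is Amstad's adaptation of the Flesch Reading Ease score. A minimal sketch of that formula; the formula itself is standard, but the vowel-group syllable counter is a crude approximation and the tokenization is a simplifying assumption:

```python
import re

def flesch_amstad(text):
    """Amstad's German Flesch Reading Ease: FRE_de = 180 - ASL - 58.5 * ASW,
    where ASL = average words per sentence and ASW = average syllables per
    word. Higher scores mean easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-zÄÖÜäöüß]+", text)

    def syllables(word):
        # Crude heuristic: count runs of vowels (incl. umlauts), min 1.
        return max(1, len(re.findall(r"[aeiouyäöü]+", word.lower())))

    asl = len(words) / len(sentences)
    asw = sum(syllables(w) for w in words) / len(words)
    return 180 - asl - 58.5 * asw

score = flesch_amstad("Der Bericht ist klar. Wir lesen ihn gern.")
```

Such surface formulas only see sentence and word length, which is one reason the paper finds learned models correlate better with human readability judgments.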
[49] FLEURS-Kobani: Extending the FLEURS Dataset for Northern Kurdish
Daban Q. Jaff, Mohammad Mohammadamini
Main category: cs.CL
TL;DR: FLEURS-Kobani introduces the first Northern Kurdish benchmark dataset for speech tasks, extending FLEURS with 5,162 validated utterances (18+ hours) recorded by native speakers, with Whisper fine-tuning results for ASR and speech-to-text translation.
Details
Motivation: Northern Kurdish is not included in existing multilingual speech benchmarks like FLEURS, limiting benchmarking for automatic speech recognition and speech translation tasks in this under-resourced language variety.
Method: Created FLEURS-Kobani dataset with 5,162 validated utterances (18h24m) recorded by 31 native speakers. Fine-tuned Whisper v3-large for ASR and end-to-end speech-to-text translation, using a two-stage fine-tuning strategy (Common Voice to FLEURS-Kobani).
Result: Best ASR performance: WER 28.11, CER 9.84 on test set. For end-to-end speech-to-text translation (KMR to EN), Whisper achieved 8.68 BLEU on test. Also reported pivot-derived targets and cascaded S2TT setup.
Conclusion: FLEURS-Kobani provides the first public Northern Kurdish benchmark for evaluation of ASR, speech-to-text translation, and speech-to-speech translation tasks, addressing a gap for this under-resourced language.
Abstract: FLEURS offers n-way parallel speech for 100+ languages, but Northern Kurdish is not one of them, which limits benchmarking for automatic speech recognition and speech translation tasks in this language. We present FLEURS-Kobani, a Northern Kurdish (ISO 639-3 KMR) spoken extension of the FLEURS benchmark. The FLEURS-Kobani dataset consists of 5,162 validated utterances, totaling 18 hours and 24 minutes. The data were recorded by 31 native speakers. It extends benchmark coverage to an under-resourced Kurdish variety. As baselines, we fine-tuned Whisper v3-large for ASR and E2E S2TT. A two-stage fine-tuning strategy (Common Voice to FLEURS-Kobani) yields the best ASR performance (WER 28.11, CER 9.84 on test). For E2E S2TT (KMR to EN), Whisper achieves 8.68 BLEU on test; we additionally report pivot-derived targets and a cascaded S2TT setup. FLEURS-Kobani provides the first public Northern Kurdish benchmark for evaluation of ASR, S2TT and S2ST tasks. The dataset is publicly released for research use under a CC BY 4.0 license.
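The WER and CER figures reported above are both normalized Levenshtein edit distances, computed over words and characters respectively. A minimal stdlib sketch of how such scores are computed (not the paper's evaluation code):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, single-row DP."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # deletion from ref
                        dp[j - 1] + 1,      # insertion into ref
                        prev + (r != h))    # substitution (or match)
            prev = cur
    return dp[-1]

def wer(ref, hyp):
    """Word Error Rate: word-level edits / reference word count."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)

def cer(ref, hyp):
    """Character Error Rate: character-level edits / reference length."""
    return edit_distance(list(ref), list(hyp)) / len(ref)

error = wer("the cat sat on the mat", "the cat sat mat")
```

Note WER can exceed 1.0 when the hypothesis contains many insertions, which is why scores like WER 28.11 are conventionally quoted as percentages.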
[50] Rewrite the News: Tracing Editorial Reuse Across News Agencies
Soveatin Kuntur, Nina Smirnova, Anna Wroblewska, Philipp Mayr, Sebastijan Razboršek Maček
Main category: cs.CL
TL;DR: A method for detecting cross-lingual sentence-level text reuse in journalism using weak supervision and publication timing to identify likely sources without full translations.
Details
Motivation: To address information overload for journalists by developing automated tools to detect sentence-level text reuse across languages, particularly in multilingual journalism where content is often reused and adapted.
Method: Weakly supervised method for cross-lingual reuse detection without full translations; compares English articles from Slovenian Press Agency with reports from 15 foreign agencies in 7 languages; uses publication timestamps to identify earliest likely sources; analyzes 1,037 STA and 237,551 FA articles across two time periods.
Result: Found 1,087 aligned sentence pairs after filtering; reuse occurs in 52% of STA articles and 1.6% of FA articles; reuse is predominantly non-literal (paraphrase/compositional); reused content tends to appear in middle/end of articles while leads are more original.
Conclusion: The method successfully detects cross-lingual text reuse without full translations, revealing substantial editorial reuse patterns that lexical matching would miss, with implications for journalism and information overload management.
Abstract: This paper investigates sentence-level text reuse in multilingual journalism, analyzing where reused content occurs within articles. We present a weakly supervised method for detecting sentence-level cross-lingual reuse without requiring full translations, designed to support automated pre-selection to reduce information overload for journalists (Holyst et al., 2024). The study compares English-language articles from the Slovenian Press Agency (STA) with reports from 15 foreign agencies (FA) in seven languages, using publication timestamps to retain the earliest likely foreign source for each reused sentence. We analyze 1,037 STA and 237,551 FA articles from two time windows (October 7-November 2, 2023; February 1-28, 2025) and identify 1,087 aligned sentence pairs after filtering to the earliest sources. Reuse occurs in 52% of STA articles and 1.6% of FA articles and is predominantly non-literal, involving paraphrase and compositional reuse from multiple sources. Reused content tends to appear in the middle and end of English articles, while leads are more often original, indicating that simple lexical matching overlooks substantial editorial reuse. Compared with prior work focused on monolingual overlap, we (i) detect reuse across languages without requiring full translation, (ii) use publication timing to identify likely sources, and (iii) analyze where reused material is situated within articles. Dataset and code: https://github.com/kunturs/lrec2026-rewrite-news.
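The timestamp-based filtering step described above (retaining the earliest likely foreign source for each reused sentence) can be sketched as follows; the data layout and function name are illustrative assumptions, not the released code:

```python
from datetime import datetime

def earliest_sources(pairs):
    """For each STA sentence, keep only the aligned foreign-agency (FA)
    sentence with the earliest publication timestamp, treating it as the
    most likely original source. `pairs` is an iterable of
    (sta_sentence, fa_sentence, fa_timestamp) tuples."""
    best = {}
    for sta, fa, ts in pairs:
        if sta not in best or ts < best[sta][1]:
            best[sta] = (fa, ts)
    return {sta: fa for sta, (fa, _) in best.items()}

pairs = [
    ("s1", "fa_late",  datetime(2023, 10, 9, 12, 0)),
    ("s1", "fa_early", datetime(2023, 10, 8, 6, 30)),
    ("s2", "fa_only",  datetime(2025, 2, 3, 9, 0)),
]
sources = earliest_sources(pairs)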
[51] Structural Feature Engineering for Generative Engine Optimization: How Content Structure Shapes Citation Behavior
Junwei Yu, Mufeng Yang, Yepeng Ding, Hiroyuki Sato
Main category: cs.CL
TL;DR: GEO-SFE: A structural feature engineering framework for optimizing content visibility in AI-powered search engines by modifying document structure at macro, meso, and micro levels to improve citation rates.
Details
Motivation: AI search engines have shifted from link-based retrieval to direct answer generation with selective source citation, creating new challenges for content visibility. Existing GEO approaches focus on semantic content modification, while structural features' impact on citation behavior remains underexplored.
Method: Proposes GEO-SFE framework that decomposes content structure into three hierarchical levels: macro-structure (document architecture), meso-structure (information chunking), and micro-structure (visual emphasis). Develops architecture-aware optimization strategies and predictive models that preserve semantic integrity while improving structural effectiveness.
Result: Experimental evaluation across six mainstream generative engines shows consistent improvements in citation rate (17.3%) and subjective quality (18.5%), validating effectiveness and generalizability.
Conclusion: Establishes structural optimization as a foundational component of GEO, providing a data-driven methodology for enhancing content visibility in LLM-powered information ecosystems.
Abstract: The proliferation of AI-powered search engines has shifted information discovery from traditional link-based retrieval to direct answer generation with selective source citation, creating new challenges for content visibility. While existing Generative Engine Optimization (GEO) approaches focus primarily on semantic content modification, the role of structural features in influencing citation behavior remains underexplored. In this paper, we propose GEO-SFE, a systematic framework for structural feature engineering in generative engine optimization. Our approach decomposes content structure into three hierarchical levels: macro-structure (document architecture), meso-structure (information chunking), and micro-structure (visual emphasis), and models their impact on citation probability across different generative engine architectures. We develop architecture-aware optimization strategies and predictive models that preserve semantic integrity while improving structural effectiveness. Experimental evaluation across six mainstream generative engines demonstrates consistent improvements in citation rate (17.3 percent) and subjective quality (18.5 percent), validating the effectiveness and generalizability of the proposed framework. This work establishes structural optimization as a foundational component of GEO, providing a data-driven methodology for enhancing content visibility in LLM-powered information ecosystems.
[52] Enhancing Structural Mapping with LLM-derived Abstractions for Analogical Reasoning in Narratives
Mohammadhossein Khojasteh, Yifan Jiang, Stefano De Giorgis, Frank van Harmelen, Filip Ilievski
Main category: cs.CL
TL;DR: The YARN framework uses LLMs to decompose narratives into units and abstract them at four levels, then maps elements across stories for analogical reasoning; the abstractions consistently improve performance over end-to-end LLM baselines.
Details
Motivation: Analogical reasoning in narratives is challenging for machines - cognitive engines need pre-extracted entities while LLMs are sensitive to prompt format and surface similarity. The paper investigates whether enhancing structural mapping with LLM-derived abstractions improves analogical reasoning in narratives.
Method: Proposes YARN framework: 1) LLMs decompose narratives into units, 2) abstract units at four levels capturing general meaning and story roles (grounded in framing theory), 3) mapping component aligns elements across stories for analogical reasoning. Modular design enables systematic analysis.
Result: Abstractions consistently improve model performance, achieving competitive or better results than end-to-end LLM baselines. Error analysis reveals challenges in abstraction level selection, incorporating implicit causality, and identifies emerging categorization of analogical patterns in narratives.
Conclusion: YARN demonstrates that LLM-derived abstractions enhance analogical reasoning in narratives, provides systematic framework for analysis, and identifies key challenges for future work. Code is made openly available.
Abstract: Analogical reasoning is a key driver of human generalization in problem-solving and argumentation. Yet, analogies between narrative structures remain challenging for machines. Cognitive engines for structural mapping are not directly applicable, as they assume pre-extracted entities, whereas LLMs’ performance is sensitive to prompt format and the degree of surface similarity between narratives. This gap motivates a key question: What is the impact of enhancing structural mapping with LLM-derived abstractions on their analogical reasoning ability in narratives? To that end, we propose a modular framework named YARN (Yielding Abstractions for Reasoning in Narratives), which uses LLMs to decompose narratives into units, abstract these units, and then passes them to a mapping component that aligns elements across stories to perform analogical reasoning. We define and operationalize four levels of abstraction that capture both the general meaning of units and their roles in the story, grounded in prior work on framing. Our experiments reveal that abstractions consistently improve model performance, resulting in competitive or better performance than end-to-end LLM baselines. Closer error analysis reveals the remaining challenges in abstraction at the right level, in incorporating implicit causality, and an emerging categorization of analogical patterns in narratives. YARN enables systematic variation of experimental settings to analyze component contributions, and to support future work, we make the code for YARN openly available.
[53] ContextClaim: A Context-Driven Paradigm for Verifiable Claim Detection
Yufeng Li, Rrubaa Panchendrarajan, Arkaitz Zubiaga
Main category: cs.CL
TL;DR: ContextClaim: A retrieval-augmented approach for verifiable claim detection that uses entity extraction and Wikipedia context retrieval to improve classification of factual claims.
Details
Motivation: Existing claim detection methods rely solely on claim text, ignoring contextual information that could help determine if a claim is verifiable. The authors propose that knowing what entities and events a claim refers to, and whether relevant information exists for verification, could improve detection accuracy.
Method: ContextClaim extracts entity mentions from input claims, retrieves relevant information from Wikipedia as structured knowledge, uses LLMs to produce concise contextual summaries, and employs these summaries for downstream classification. Evaluated across encoder-only and decoder-only models under fine-tuning, zero-shot, and few-shot settings.
Result: Context augmentation improves verifiable claim detection, though effectiveness varies across domains, model architectures, and learning settings. Component analysis, human evaluation, and error analysis reveal when and why retrieved context contributes to more reliable verifiability judgments.
Conclusion: Retrieving contextual information can enhance verifiable claim detection, representing a paradigm shift from text-only approaches to context-driven methods that mirror human fact-checking processes.
Abstract: Verifiable claim detection asks whether a claim expresses a factual statement that can, in principle, be assessed against external evidence. As an early filtering stage in automated fact-checking, it plays an important role in reducing the burden on downstream verification components. However, existing approaches to claim detection, whether based on check-worthiness or verifiability, rely solely on the claim text itself. This is a notable limitation for verifiable claim detection in particular, where determining whether a claim is checkable may benefit from knowing what entities and events it refers to and whether relevant information exists to support verification. Inspired by the established role of evidence retrieval in later-stage claim verification, we propose Context-Driven Claim Detection (ContextClaim), a paradigm that advances retrieval to the detection stage. ContextClaim extracts entity mentions from the input claim, retrieves relevant information from Wikipedia as a structured knowledge source, and employs large language models to produce concise contextual summaries for downstream classification. We evaluate ContextClaim on two datasets covering different topics and text genres, the CheckThat! 2022 COVID-19 Twitter dataset and the PoliClaim political debate dataset, across encoder-only and decoder-only models under fine-tuning, zero-shot, and few-shot settings. Results show that context augmentation can improve verifiable claim detection, although its effectiveness varies across domains, model architectures, and learning settings. Through component analysis, human evaluation, and error analysis, we further examine when and why the retrieved context contributes to more reliable verifiability judgments.
[54] The Last Fingerprint: How Markdown Training Shapes LLM Prose
E. M. Freeburg
Main category: cs.CL
TL;DR: LLMs’ em dash overuse is markdown formatting leaking into prose, acquired from training on markdown-saturated corpora; em dash frequency thus serves as a diagnostic of fine-tuning methodology rather than merely a stylistic flaw.
Details
Motivation: To explain why LLMs overuse em dashes and connect this observation to the parallel finding that LLMs default to markdown-formatted output, providing a mechanistic account of this AI-generated text pattern.
Method: Proposed a five-step genealogy connecting training data, structural internalization, and post-training amplification. Conducted suppression experiments across 12 models from 5 providers (Anthropic, OpenAI, Meta, Google, DeepSeek) with varying conditions to test markdown influence on em dash usage.
Result: When instructed to avoid markdown formatting, overt markdown features were eliminated but em dashes persisted (except in Meta’s Llama models). Em dash frequency varied from 0.0 to 9.1 per 1,000 words, functioning as a signature of specific fine-tuning procedures. Even explicit em dash prohibition failed to eliminate the artifact in some models.
Conclusion: Em dash overuse results from markdown leaking into prose due to training on markdown-saturated corpora, and serves as a diagnostic of fine-tuning methodology rather than merely a stylistic defect, connecting two previously isolated online discourses.
Abstract: Large language models produce em dashes at varying rates, and the observation that some models “overuse” them has become one of the most widely discussed markers of AI-generated text. Yet no mechanistic account of this pattern exists, and the parallel observation that LLMs default to markdown-formatted output has never been connected to it. We propose that the em dash is markdown leaking into prose – the smallest surviving unit of the structural orientation that LLMs acquire from markdown-saturated training corpora. We present a five-step genealogy connecting training data composition, structural internalization, the dual-register status of the em dash, and post-training amplification. We test this with a two-condition suppression experiment across twelve models from five providers (Anthropic, OpenAI, Meta, Google, DeepSeek): when models are instructed to avoid markdown formatting, overt features (headers, bullets, bold) are eliminated or nearly eliminated, but em dashes persist – except in Meta’s Llama models, which produce none at all. Em dash frequency and suppression resistance vary from 0.0 per 1,000 words (Llama) to 9.1 (GPT-4.1 under suppression), functioning as a signature of the specific fine-tuning procedure applied. A three-condition suppression gradient shows that even explicit em dash prohibition fails to eliminate the artifact in some models, and a base-vs-instruct comparison confirms that the latent tendency exists pre-RLHF. These findings connect two previously isolated online discourses and reframe em dash frequency as a diagnostic of fine-tuning methodology rather than a stylistic defect.
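The per-1,000-words frequency used to compare models above is straightforward to compute; a minimal sketch, where the word tokenizer is a simplifying assumption:

```python
import re

def em_dash_rate(text):
    """Em dashes per 1,000 words, the frequency metric discussed above."""
    words = re.findall(r"\w+", text)
    dashes = text.count("\u2014")  # U+2014 EM DASH
    return 1000 * dashes / len(words) if words else 0.0

sample = "He paused \u2014 briefly \u2014 and moved on."
rate = em_dash_rate(sample)
```

Matching on the U+2014 code point specifically, rather than hyphens or en dashes, matters here, since the paper's argument turns on that one character surviving markdown suppression.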
[55] SemioLLM: Evaluating Large Language Models for Diagnostic Reasoning from Unstructured Clinical Narratives in Epilepsy
Meghal Dani, Muthu Jeyanthi Prakash, Filip Rosa, Zeynep Akata, Stefanie Liebe
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2407.03004 returned HTTP 429 (rate limited).
[56] Prediction of Item Difficulty for Reading Comprehension Items by Creation of Annotated Item Repository
Radhika Kapoor, Sang T. Truong, Nick Haber, Maria Araceli Ruiz-Primo, Benjamin W. Domingue
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2502.20663 returned HTTP 429 (rate limited).
[57] Aleph-Alpha-GermanWeb: Improving German-language LLM pre-training with model-based data curation and synthetic data generation
Thomas F Burns, Letitia Parcalabescu, Stephan Wäldchen, Michael Barlow, Gregor Ziegltrum, Volker Stampa, Bastian Harren, Björn Deiseroth
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2505.00022 returned HTTP 429 (rate limited).
[58] A Reality Check of Language Models as Formalizers on Constraint Satisfaction Problems
Rikhil Amonkar, Ceyhun Efe Kayan, Qimei Lai, Ronan Le Bras, Li Zhang
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2505.13252 returned HTTP 429 (rate limited).
[59] Accelerating Diffusion Large Language Models with SlowFast Sampling: The Three Golden Principles
Qingyan Wei, Yaojie Zhang, Zhiyuan Liu, Puyu Zeng, Yuxuan Wang, Biqing Qi, Dongrui Liu, Linfeng Zhang
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2506.10848 returned HTTP 429 (rate limited).
[60] QuestA: Expanding Reasoning Capacity in LLMs via Question Augmentation
Jiazheng Li, Hongzhou Lin, Hong Lu, Kaiyue Wen, Zaiwen Yang, Jiaxuan Gao, Yi Wu, Jingzhao Zhang
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2507.13266 returned HTTP 429 (rate limited).
[61] Tokens with Meaning: A Hybrid Tokenization Approach for Turkish
M. Ali Bayram, Ali Arda Fincan, Ahmet Semih Gümüş, Sercan Karakaş, Banu Diri, Savaş Yıldırım, Demircan Çelik
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2508.14292 returned HTTP 429 (rate limited).
[62] Semantic Voting: A Self-Evaluation-Free Approach for Efficient LLM Self-Improvement on Unverifiable Open-ended Tasks
Chunyang Jiang, Yonggang Zhang, Yiyang Cai, Chi-Min Chan, Yulong Liu, Mingming Chen, Wei Xue, Yike Guo
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2509.23067 returned HTTP 429 (rate limited).
[63] ProxyAttn: Guided Sparse Attention via Representative Heads
Yixuan Wang, Huang He, Siqi Bao, Hua Wu, Haifeng Wang, Qingfu Zhu, Wanxiang Che
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2509.24745 returned HTTP 429 (rate limited).
[64] Biasless Language Models Learn Unnaturally: How LLMs Fail to Distinguish the Possible from the Impossible
Imry Ziv, Nur Lan, Emmanuel Chemla
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2510.07178 returned HTTP 429 (rate limited).
[65] ShishuLM : Achieving Optimal and Efficient Parameterization with Low Attention Transformer Models
Shivanshu Kumar, Gopalakrishnan Srinivasan
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2510.13860 returned HTTP 429 (rate limited).
[66] Script Gap: Evaluating LLM Triage on Indian Languages in Native vs Romanized Scripts in a Real World Setting
Manurag Khullar, Utkarsh Desai, Poorva Malviya, Aman Dalmia, Zheyuan Ryan Shi
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2512.10780 returned HTTP 429 (rate limited).
[67] Merging Triggers, Breaking Backdoors: Defensive Poisoning for Instruction-Tuned Language Models
San Kim, Gary Geunbae Lee
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2601.04448 returned HTTP 429 (rate limited).
[68] AXE: Low-Cost Cross-Domain Web Structured Information Extraction
Abdelrahman Mansour, Khaled W. Alshaer, Moataz Elsaban
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2602.01838 returned HTTP 429 (rate limited).
[69] $V_0$: A Generalist Value Model for Any Policy at State Zero
Yi-Kai Zhang, Zhiyuan Yao, Hongyan Hao, Yueqing Sun, Qi Gu, Hui Su, Xunliang Cai, De-Chuan Zhan, Han-Jia Ye
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2602.03584 returned HTTP 429 (rate limited).
[70] From FusHa to Folk: Exploring Cross-Lingual Transfer in Arabic Language Models
Abdulmuizz Khalak, Abderrahmane Issam, Gerasimos Spanakis
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2602.09826 returned HTTP 429 (rate limited).
[71] STATe-of-Thoughts: Structured Action Templates for Tree-of-Thoughts
Zachary Bamberger, Till R. Saenger, Gilad Morad, Ofra Amir, Brandon M. Stewart, Amir Feder
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2602.14265 returned HTTP 429 (rate limited).
[72] When Metrics Disagree: Automatic Similarity vs. LLM-as-a-Judge for Clinical Dialogue Evaluation
Bian Sun, Zhenjian Wang, Orvill de la Torre, Zirui Wang
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2603.00314 returned HTTP 429 (rate limited).
[73] AgentDrift: Unsafe Recommendation Drift Under Tool Corruption Hidden by Ranking Metrics in LLM Agents
Zekun Wu, Adriano Koshiyama, Sahan Bulathwela, Maria Perez-Ortiz
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2603.12564 returned HTTP 429 (rate limited).
[74] Language on Demand, Knowledge at Core: Composing LLMs with Encoder-Decoder Translation Models for Extensible Multilinguality
Mengyu Bu, Yang Feng
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2603.17512 returned HTTP 429 (rate limited).
[75] Do Language Models Encode Semantic Relations? Probing and Sparse Feature Analysis
Andor Diera, Ansgar Scherp
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2603.17624 returned HTTP 429 (rate limited).
[76] How do LLMs Compute Verbal Confidence
Dharshan Kumaran, Arthur Conmy, Federico Barbero, Simon Osindero, Viorica Patraucean, Petar Velickovic
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2603.17839 returned HTTP 429 (rate limited).
[77] Real-Time Trustworthiness Scoring for LLM Structured Outputs and Data Extraction
Hui Wen Goh, Jonas Mueller
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2603.18014 returned HTTP 429 (rate limited).
[78] Inducing Sustained Creativity and Diversity in Large Language Models
Queenie Luo, Gary King, Michael Puett, Michael D. Smith
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2603.19519 returned HTTP 429 (rate limited).
[79] Magic Words or Methodical Work? Challenging Conventional Wisdom in LLM-Based Political Text Annotation
Lorcan McLaren, James Cross, Zuzanna Krakowska, Robin Rauner, Martijn Schoonvelde
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2603.26898 returned HTTP 429 (rate limited).
[80] Training data generation for context-dependent rubric-based short answer grading
Pavel Šindelář, Dávid Slivka, Christopher Bouma, Filip Prášil, Ondřej Bojar
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2603.28537 returned HTTP 429 (rate limited).
[81] EventChat: Implementation and user-centric evaluation of a large language model-driven conversational recommender system for exploring leisure events in an SME context
Hannes Kunstmann, Joseph Ollier, Joel Persson, Florian von Wangenheim
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2407.04472 returned HTTP 429 (rate limited).
[82] MindCube: Spatial Mental Modeling from Limited Views
Qineng Wang, Baiqiao Yin, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, Saining Xie, Jiajun Wu, Li Fei-Fei, Manling Li
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2506.21458 returned HTTP 429 (rate limited).
[83] Learning to Optimize Multi-Objective Alignment Through Dynamic Reward Weighting
Yining Lu, Zilong Wang, Shiyang Li, Xin Liu, Changlong Yu, Qingyu Yin, Zhan Shi, Zixuan Zhang, Meng Jiang
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2509.11452 returned HTTP 429 (rate limited).
[84] CLAUSE: Agentic Neuro-Symbolic Knowledge Graph Reasoning via Dynamic Learnable Context Engineering
Yang Zhao, Chengxiao Dai, Wei Zhuo, Yue Xiu, Dusit Niyato
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2509.21035 returned HTTP 429 (rate limited).
[85] Sigma: Semantically Informative Pre-training for Skeleton-based Sign Language Understanding
Muxin Pu, Mei Kuan Lim, Chun Yong Chong, Chen Change Loy
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2509.21223 returned HTTP 429 (rate limited).
[86] SecureVibeBench: Evaluating Secure Coding Capabilities of Code Agents with Realistic Vulnerability Scenarios
Junkai Chen, Huihui Huang, Yunbo Lyu, Junwen An, Jieke Shi, Chengran Yang, Ting Zhang, Haoye Tian, Yikun Li, Zhenhao Li, Xin Zhou, Xing Hu, David Lo
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2509.22097 returned HTTP 429 (rate limited).
[87] From Efficiency to Adaptivity: A Deeper Look at Adaptive Reasoning in Large Language Models
Chao Wu, Baoheng Li, Mingchen Gao, Yu Tian, Zhenyi Wang
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2511.10788 returned HTTP 429 (rate limited).
[88] Stronger Normalization-Free Transformers
Mingzhi Chen, Taiming Lu, Jiachen Zhu, Mingjie Sun, Zhuang Liu
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2512.10938 returned HTTP 429 (rate limited).
[89] The Mouth is Not the Brain: Bridging Energy-Based World Models and Language Generation
Junichiro Niimi
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2601.17094 returned HTTP 429 (rate limited).
[90] How to Train Your Long-Context Visual Document Model
Austin Veselka
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2602.15257 returned HTTP 429 (rate limited).
cs.CV
[91] DF-ACBlurGAN: Structure-Aware Conditional Generation of Internally Repeated Patterns for Biomaterial Microtopography Design
Rongjun Dong, Xin Chen, Morgan R Alexander, Karthikeyan Sivakumar, Reza Omdivar, David A Winkler, Grazziela Figueredo
Main category: cs.CV
TL;DR: DF-ACBlurGAN: A structure-aware conditional GAN for generating images with repeated patterns, applied to biomaterial surface design with frequency-domain repetition estimation and scale-adaptive blurring.
Details
Motivation: Existing generative models struggle with internally repeated and periodic structures because they are optimized for local texture statistics rather than global structural consistency. This is particularly problematic for applications like biomaterial surfaces that require strict control over repetition scale, spacing, and boundary coherence.
Method: Proposes DF-ACBlurGAN, a structure-aware conditional GAN that explicitly reasons about long-range repetition during training. The approach integrates frequency-domain repetition scale estimation, scale-adaptive Gaussian blurring, and unit-cell reconstruction to balance sharp local features with stable global periodicity.
Result: Evaluation across multiple biomaterial datasets demonstrates improved repetition consistency and controllable structural variation compared to conventional generative approaches. The model successfully synthesizes designs aligned with target functional outcomes when conditioned on biological response labels.
Conclusion: The proposed approach effectively addresses the challenge of generating images with repeated patterns, particularly for biomaterial design applications requiring strict structural control under weak supervision and class imbalance conditions.
Abstract: Learning to generate images with internally repeated and periodic structures poses a fundamental challenge for machine learning and computer vision models, which are typically optimised for local texture statistics and semantic realism rather than global structural consistency. This limitation is particularly pronounced in applications requiring strict control over repetition scale, spacing, and boundary coherence, such as microtopographical biomaterial surfaces. In this work, biomaterial design serves as a use case to study conditional generation of repeated patterns under weak supervision and class imbalance. We propose DF-ACBlurGAN, a structure-aware conditional generative adversarial network that explicitly reasons about long-range repetition during training. The approach integrates frequency-domain repetition scale estimation, scale-adaptive Gaussian blurring, and unit-cell reconstruction to balance sharp local features with stable global periodicity. Conditioning on experimentally derived biological response labels, the model synthesises designs aligned with target functional outcomes. Evaluation across multiple biomaterial datasets demonstrates improved repetition consistency and controllable structural variation compared to conventional generative approaches.
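The frequency-domain repetition scale estimation mentioned in the method can be sketched in a few lines: for a periodic pattern, the dominant repetition period shows up as a peak in the radially averaged power spectrum. The function below is a hypothetical illustration of this general idea, not the paper's implementation:

```python
import numpy as np

def estimate_repetition_period(img: np.ndarray) -> int:
    """Estimate the dominant repetition period (in pixels) of a 2D
    pattern from the peak of its radially averaged power spectrum."""
    spectrum = np.fft.fftshift(np.fft.fft2(img - img.mean()))
    power = np.abs(spectrum) ** 2
    h, w = img.shape
    yy, xx = np.indices(img.shape)
    radius = np.hypot(yy - h // 2, xx - w // 2).astype(int)
    counts = np.bincount(radius.ravel())
    radial = np.bincount(radius.ravel(), power.ravel()) / np.maximum(counts, 1)
    peak_freq = radial[1:].argmax() + 1   # dominant frequency, skipping the DC bin
    return round(min(h, w) / peak_freq)   # convert cycles-per-image to a pixel period

# Synthetic stripes repeating every 16 pixels.
_, x = np.mgrid[0:128, 0:128]
stripes = np.sin(2 * np.pi * x / 16)
print(estimate_repetition_period(stripes))  # 16
```

A scale recovered this way could then parameterize something like the scale-adaptive Gaussian blurring the abstract describes.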
[92] OccSim: Multi-kilometer Simulation with Long-horizon Occupancy World Models
Tianran Liu, Shengwen Zhao, Mozhgan Pourkeshavarz, Weican Li, Nicholas Rhinehart
Main category: cs.CV
TL;DR: OccSim is the first occupancy-world-model-driven 3D simulator for autonomous driving: conditioned on a single initial frame and a sequence of ego-actions, it generates continuous 3D occupancy maps, enabling large-scale simulation without HD maps or pre-recorded logs.
Details
Motivation: Current data-driven autonomous driving simulation is limited by its dependency on pre-recorded driving logs and HD maps, which restricts scalability and open-ended generation to the finite scale of existing datasets.
Method: Uses a W-DiT-based static occupancy world model for ultra-long-horizon generation of static environments, with known rigid transformations introduced explicitly in the architecture, and a Layout Generator that populates the dynamic foreground with reactive agents based on the synthesized road topology.
Result: Generates over 3,000 continuous frames (4+ km), >80x improvement in stable generation length. Data from OccSim pre-trains 4D semantic occupancy forecasting models achieving 67% zero-shot performance, scaling to 74% with 5x dataset size.
Conclusion: OccSim breaks scalability bottleneck in autonomous driving simulation, enabling large-scale 3D occupancy generation without HD maps or continuous logs, with demonstrated downstream utility for occupancy forecasting models.
Abstract: Data-driven autonomous driving simulation has long been constrained by its heavy reliance on pre-recorded driving logs or spatial priors, such as HD maps. This fundamental dependency severely limits scalability, restricting open-ended generation capabilities to the finite scale of existing collected datasets. To break this bottleneck, we present OccSim, the first occupancy world model-driven 3D simulator. OccSim obviates the requirement for continuous logs or HD maps; conditioned only on a single initial frame and a sequence of future ego-actions, it can stably generate over 3,000 continuous frames, enabling the continuous construction of large-scale 3D occupancy maps spanning over 4 kilometers for simulation. This represents an >80x improvement in stable generation length over previous state-of-the-art occupancy world models. OccSim is powered by two modules: W-DiT based static occupancy world model and the Layout Generator. W-DiT handles the ultra-long-horizon generation of static environments by explicitly introducing known rigid transformations in architecture design, while the Layout Generator populates the dynamic foreground with reactive agents based on the synthesized road topology. With these designs, OccSim can synthesize massive, diverse simulation streams. Extensive experiments demonstrate its downstream utility: data collected directly from OccSim can pre-train 4D semantic occupancy forecasting models to achieve up to 67% zero-shot performance on unseen data, outperforming previous asset-based simulator by 11%. When scaling the OccSim dataset to 5x the size, the zero-shot performance increases to about 74%, while the improvement over asset-based simulators expands to 22.1%.
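The abstract's point about introducing known rigid transformations can be made concrete with a toy example: in an ego-centric occupancy grid, ego motion is a known shift of the static world, so long-horizon consistency amounts to applying that shift exactly rather than asking the model to relearn it. The snippet below is a simplified illustration of this constraint (integer translation along one axis, with hypothetical sign conventions), not the W-DiT architecture:

```python
import numpy as np

def shift_occupancy(grid: np.ndarray, dy: int, fill: float = 0.0) -> np.ndarray:
    """Re-express an ego-centric occupancy grid after the ego vehicle
    advances `dy` rows: static content moves `dy` rows closer, and the
    newly revealed band at the far edge is filled with `fill` (unknown)."""
    out = np.full_like(grid, fill)
    h = grid.shape[0]
    if dy >= 0:
        out[:h - dy] = grid[dy:]
    else:
        out[-dy:] = grid[:h + dy]
    return out

grid = np.zeros((5, 5))
grid[3, 2] = 1.0                       # an obstacle ahead of the ego
stepped = shift_occupancy(grid, dy=1)  # ego advances one cell
print(np.argwhere(stepped == 1.0))     # the obstacle is now one row closer: [[2 2]]
```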
[93] Fisheye3R: Adapting Unified 3D Feed-Forward Foundation Models to Fisheye Lenses
Ruxiao Duan, Erin Hong, Dongxu Zhao, Eric Turner, Alex Wong, Yunwen Zhou
Main category: cs.CV
TL;DR: Fisheye3R adapts multi-view 3D reconstruction foundation models to handle fisheye camera inputs without performance regression on perspective images, addressing the scarcity of fisheye training data through flexible self-supervised and supervised learning schemes.
Details
Motivation: Existing foundation models for multi-view 3D reconstruction degrade on wide field-of-view fisheye images because of their non-linear projection models, and there is too little fisheye training data with ground truth to retrain the models effectively.
Method: Proposes Fisheye3R, an adaptation framework with flexible learning schemes: self-supervised adaptation using only unlabeled perspective images and supervised adaptation without any fisheye training data. The framework accounts for the shifts in pixel positions induced by fisheye distortion.
Result: Extensive experiments across three foundation models (VGGT, π³, and MapAnything) show consistent improvements in camera pose, depth, point map, and field-of-view estimation on fisheye images.
Conclusion: Fisheye3R successfully extends multi-view 3D reconstruction foundation models to natively handle fisheye inputs while maintaining performance on perspective images, addressing the data scarcity problem through innovative adaptation methods.
Abstract: Feed-forward foundation models for multi-view 3-dimensional (3D) reconstruction have been trained on large-scale datasets of perspective images; when tested on wide field-of-view images, e.g., from a fisheye camera, their performance degrades. Their error arises from changes in spatial positions of pixels due to a non-linear projection model that maps 3D points onto the 2D image plane. While one may surmise that training on fisheye images would resolve this problem, there are far fewer fisheye images with ground truth than perspective images, which limit generalization. To enable inference on imagery exhibiting high radial distortion, we propose Fisheye3R, a novel adaptation framework that extends these multi-view 3D reconstruction foundation models to natively accommodate fisheye inputs without performance regression on perspective images. To address the scarcity of fisheye images and ground truth, we introduce flexible learning schemes that support self-supervised adaptation using only unlabeled perspective images and supervised adaptation without any fisheye training data. Extensive experiments across three foundation models, including VGGT, $π^3$, and MapAnything, demonstrate that our approach consistently improves camera pose, depth, point map, and field-of-view estimation on fisheye images.
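The non-linear projection the abstract refers to is easy to see by comparing a pinhole camera with the common equidistant fisheye model, where the image radius grows linearly with the off-axis angle (r = f·θ) instead of with its tangent. The sketch below assumes the generic equidistant model for illustration; the paper's cameras may follow other fisheye models:

```python
import numpy as np

def project_perspective(p: np.ndarray, f: float) -> np.ndarray:
    """Pinhole projection: image radius r = f * tan(theta)."""
    x, y, z = p
    return np.array([f * x / z, f * y / z])

def project_equidistant_fisheye(p: np.ndarray, f: float) -> np.ndarray:
    """Equidistant fisheye projection: image radius r = f * theta,
    where theta is the angle between the ray and the optical axis."""
    x, y, z = p
    theta = np.arctan2(np.hypot(x, y), z)  # off-axis angle
    phi = np.arctan2(y, x)                 # azimuth in the image plane
    return np.array([f * theta * np.cos(phi), f * theta * np.sin(phi)])

f = 300.0
p = np.array([1.0, 0.0, 1.0])              # a point 45 degrees off-axis
print(project_perspective(p, f))           # radius f*tan(pi/4) = 300
print(project_equidistant_fisheye(p, f))   # radius f*(pi/4) ~ 235.6
```

The gap between the two radii grows with the field of view, which is why models trained only on perspective images misplace content near the fisheye periphery.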
[94] Decoding Functional Networks for Visual Categories via GNNs
Shira Karmi, Galia Avidan, Tammy Riklin Raviv
Main category: cs.CV
TL;DR: A signed Graph Neural Network with sparse edge mask and class-specific saliency decodes visual category-specific functional connectivity from 7T fMRI data, revealing reproducible subnetworks along visual pathways.
Details
Motivation: To understand how large-scale brain networks represent visual categories and to bridge perception with cortical organization by moving beyond voxel-level category selectivity to connectivity-based representations.
Method: Uses high-resolution 7T fMRI from the Natural Scenes Dataset, constructs parcel-level functional graphs, and trains a signed Graph Neural Network that models both positive and negative interactions, with a sparse edge mask and class-specific saliency.
Result: The model accurately decodes category-specific functional connectivity states (sports, food, vehicles) and reveals reproducible, biologically meaningful subnetworks along ventral and dorsal visual pathways.
Conclusion: This framework successfully bridges machine learning and neuroscience by extending voxel-level category selectivity to a connectivity-based representation of visual processing.
Abstract: Understanding how large-scale brain networks represent visual categories is fundamental to linking perception and cortical organization. Using high-resolution 7T fMRI from the Natural Scenes Dataset, we construct parcel-level functional graphs and train a signed Graph Neural Network that models both positive and negative interactions, with a sparse edge mask and class-specific saliency. The model accurately decodes category-specific functional connectivity states (sports, food, vehicles) and reveals reproducible, biologically meaningful subnetworks along the ventral and dorsal visual pathways. This framework bridges machine learning and neuroscience by extending voxel-level category selectivity to a connectivity-based representation of visual processing.
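As a rough illustration of the parcel-level signed graphs described above, one can build a functional-connectivity graph from parcel time series, keeping the sign of each correlation (positive or negative coupling) and sparsifying by magnitude. The fixed threshold below is a hypothetical stand-in for the paper's learned sparse edge mask:

```python
import numpy as np

def signed_sparse_graph(timeseries: np.ndarray, keep_frac: float = 0.2) -> np.ndarray:
    """Build a signed, sparsified functional-connectivity graph.

    timeseries: (n_parcels, n_timepoints) BOLD-like signals.
    Returns an adjacency matrix where only the strongest |correlation|
    edges survive, with their sign preserved."""
    corr = np.corrcoef(timeseries)
    np.fill_diagonal(corr, 0.0)
    # Threshold on |corr| so both strong positive and strong negative
    # interactions are retained.
    upper = np.abs(corr[np.triu_indices_from(corr, k=1)])
    thresh = np.quantile(upper, 1.0 - keep_frac)
    return np.where(np.abs(corr) >= thresh, corr, 0.0)

rng = np.random.default_rng(0)
base = rng.standard_normal((1, 200))
ts = np.vstack([base + 0.1 * rng.standard_normal((1, 200)),   # correlated with base
                -base + 0.1 * rng.standard_normal((1, 200)),  # anti-correlated
                rng.standard_normal((2, 200))])               # unrelated parcels
adj = signed_sparse_graph(ts, keep_frac=0.2)
print(np.sign(adj[0, 1]))  # -1.0: the strong negative edge survives the mask
```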
[95] Hybrid Quantum-Classical AI for Industrial Defect Classification in Welding Images
Akshaya Srinivasan, Xiaoyin Cheng, Jianming Yi, Alexander Geng, Desislava Ivanova, Andreas Weinmann, Ali Moghiseh
Main category: cs.CV
TL;DR: Hybrid quantum-classical models for weld defect classification show competitive performance compared to classical CNN, demonstrating potential for industrial quality control applications.
Details
Motivation: To explore hybrid quantum-classical machine learning for automated quality control in industrial settings, specifically defect classification in aluminium TIG welding images, and to benchmark it against conventional deep learning.
Method: Two hybrid quantum-classical approaches: 1) a quantum kernel method in which CNN-extracted features are encoded into quantum states via parameterized quantum feature maps, the quantum kernel matrix is computed, and a Variational Quantum Linear Solver (VQLS) solves the SVM optimization problem; 2) angle encoding of CNN features in a variational quantum circuit trained with a classical optimizer. Both are compared against a classical CNN on binary and multiclass classification tasks.
Result: While CNN model demonstrates robust performance, hybrid quantum-classical models perform competitively, highlighting their potential for near-term real-world applications in industrial defect detection.
Conclusion: Hybrid quantum-classical approaches show promise for industrial quality control applications, though classical CNN remains robust; quantum methods demonstrate competitive performance warranting further investigation.
Abstract: Hybrid quantum-classical machine learning offers a promising direction for advancing automated quality control in industrial settings. In this study, we investigate two hybrid quantum-classical approaches for classifying defects in aluminium TIG welding images and benchmarking their performance against a conventional deep learning model. A convolutional neural network is used to extract compact and informative feature vectors from weld images, effectively reducing the higher-dimensional pixel space to a lower-dimensional feature space. Our first quantum approach encodes these features into quantum states using a parameterized quantum feature map composed of rotation and entangling gates. We compute a quantum kernel matrix from the inner products of these states, defining a linear system in a higher-dimensional Hilbert space corresponding to the support vector machine (SVM) optimization problem and solving it using a Variational Quantum Linear Solver (VQLS). We also examine the effect of the quantum kernel condition number on classification performance. In our second method, we apply angle encoding to the extracted features in a variational quantum circuit and use a classical optimizer for model training. Both quantum models are tested on binary and multiclass classification tasks and the performance is compared with the classical CNN model. Our results show that while the CNN model demonstrates robust performance, hybrid quantum-classical models perform competitively. This highlights the potential of hybrid quantum-classical approaches for near-term real-world applications in industrial defect detection and quality assurance.
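The angle-encoding idea in the second method can be simulated classically for a handful of features: each feature rotates one qubit, and a quantum kernel entry is the squared overlap of two encoded states. This is a generic textbook construction, not the paper's specific circuit or feature map:

```python
import numpy as np

def angle_encode(features: np.ndarray) -> np.ndarray:
    """Encode features as a product state of RY rotations, one qubit per
    feature: |psi> = tensor_i (cos(x_i/2)|0> + sin(x_i/2)|1>)."""
    state = np.array([1.0])
    for x in features:
        state = np.kron(state, np.array([np.cos(x / 2), np.sin(x / 2)]))
    return state

def fidelity_kernel(a: np.ndarray, b: np.ndarray) -> float:
    """Quantum kernel entry k(a, b) = |<psi(a)|psi(b)>|^2."""
    return float(np.abs(angle_encode(a) @ angle_encode(b)) ** 2)

x = np.array([0.3, 1.2, -0.5])
print(round(fidelity_kernel(x, x), 6))  # 1.0: identical states overlap fully
print(fidelity_kernel(x, x + 0.4))      # < 1.0: overlap shrinks as features drift apart
```

A kernel matrix assembled from such entries would then feed a standard SVM, which is the role the VQLS-based solver plays in the paper's first method.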
[96] Stepper: Stepwise Immersive Scene Generation with Multiview Panoramas
Felix Wimbauer, Fabian Manhardt, Michael Oechsle, Nikolai Kalischek, Christian Rupprecht, Daniel Cremers, Federico Tombari
Main category: cs.CV
TL;DR: Stepper is a unified framework for text-driven immersive 3D scene synthesis that uses stepwise panoramic expansion with a multi-view 360° diffusion model to achieve high fidelity and geometric coherence.
Details
Motivation: Existing approaches for text-to-3D scene synthesis face a trade-off between visual fidelity and explorability: autoregressive expansion suffers from context drift, while panoramic video generation is limited to low resolution. There's a need for better methods that can generate high-quality, consistent 3D scenes from text prompts.
Method: Stepper uses stepwise panoramic scene expansion with a novel multi-view 360° diffusion model that enables consistent, high-resolution expansion. It’s coupled with a geometry reconstruction pipeline that enforces geometric coherence. The system is trained on a new large-scale, multi-view panorama dataset.
Result: Stepper achieves state-of-the-art fidelity and structural consistency, outperforming prior approaches in immersive scene generation. It sets a new standard for text-driven 3D scene synthesis.
Conclusion: Stepper provides a unified framework that circumvents the limitations of existing approaches by using stepwise panoramic expansion with geometric coherence enforcement, enabling high-quality immersive 3D scene generation from text.
Abstract: The synthesis of immersive 3D scenes from text is rapidly maturing, driven by novel video generative models and feed-forward 3D reconstruction, with vast potential in AR/VR and world modeling. While panoramic images have proven effective for scene initialization, existing approaches suffer from a trade-off between visual fidelity and explorability: autoregressive expansion suffers from context drift, while panoramic video generation is limited to low resolution. We present Stepper, a unified framework for text-driven immersive 3D scene synthesis that circumvents these limitations via stepwise panoramic scene expansion. Stepper leverages a novel multi-view 360° diffusion model that enables consistent, high-resolution expansion, coupled with a geometry reconstruction pipeline that enforces geometric coherence. Trained on a new large-scale, multi-view panorama dataset, Stepper achieves state-of-the-art fidelity and structural consistency, outperforming prior approaches, thereby setting a new standard for immersive scene generation.
[97] The Surprising Effectiveness of Noise Pretraining for Implicit Neural Representations
Kushal Vyas, Alper Kayabasi, Daniel Kim, Vishwanath Saragadam, Ashok Veeraraghavan, Guha Balakrishnan
Main category: cs.CV
TL;DR: Pretraining INRs on different noise types reveals that unstructured noise improves signal fitting but not inverse imaging, while structured spectral noise balances both capabilities.
Details
Motivation: To understand why data-driven initialization methods for implicit neural representations (INRs) work better than random initialization, and whether they encode classical statistical signal priors or more complex features.
Method: Pretrain INRs on diverse noise classes (Gaussian, Dead Leaves, Spectral) and evaluate their ability to fit unseen signals and serve as priors for denoising tasks on image and video data.
Result: Unstructured noise (Uniform, Gaussian) dramatically improves signal fitting capacity but yields poor deep image priors for denoising. Noise with natural image spectral structure ($1/|f|^α$) achieves excellent balance of signal fitting and inverse imaging capabilities, performing on par with best data-driven initialization methods.
Conclusion: Structured spectral noise pretraining provides an efficient alternative to data-driven initialization for INRs when domain-specific data is scarce, revealing that successful initialization methods encode classical statistical signal priors rather than complex features.
Abstract: The approximation and convergence properties of implicit neural representations (INRs) are known to be highly sensitive to parameter initialization strategies. While several data-driven initialization methods demonstrate significant improvements over standard random sampling, the reasons for their success – specifically, whether they encode classical statistical signal priors or more complex features – remain poorly understood. In this study, we explore this phenomenon through a series of experimental analyses leveraging noise pretraining. We pretrain INRs on diverse noise classes (e.g., Gaussian, Dead Leaves, Spectral) and measure their ability to both fit unseen signals and encode priors for an inverse imaging task (denoising). Our analyses on image and video data reveal a surprising finding: simply pretraining on unstructured noise (Uniform, Gaussian) dramatically improves signal fitting capacity compared to all other baselines. However, unstructured noise also yields poor deep image priors for denoising. In contrast, we also find that noise with the classic $1/|f|^α$ spectral structure of natural images achieves an excellent balance of signal fitting and inverse imaging capabilities, performing on par with the best data-driven initialization methods. This finding enables more efficient INR training in applications lacking sufficient prior domain-specific data. For more details, visit the project page at https://kushalvyas.github.io/noisepretraining.html
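The $1/|f|^α$ pretraining signal described above is easy to synthesize: shape white Gaussian noise in the Fourier domain so its amplitude spectrum falls off with frequency. A minimal NumPy sketch (image size, α, and the normalization are illustrative choices, not the paper's exact recipe):

```python
import numpy as np

def spectral_noise(size=64, alpha=1.0, seed=0):
    """Sample one 1/|f|^alpha noise image for INR pretraining.

    White Gaussian noise is shaped in the Fourier domain so its amplitude
    spectrum falls off as 1/|f|^alpha, the classic statistic of natural
    images; alpha=0 recovers unstructured white noise.
    """
    rng = np.random.default_rng(seed)
    fx = np.fft.fftfreq(size)[:, None]
    fy = np.fft.fftfreq(size)[None, :]
    freq = np.sqrt(fx ** 2 + fy ** 2)
    freq[0, 0] = 1.0                                   # avoid divide-by-zero at DC
    shaped = np.fft.fft2(rng.standard_normal((size, size))) / freq ** alpha
    img = np.real(np.fft.ifft2(shaped))
    return (img - img.mean()) / (img.std() + 1e-8)     # zero-mean, unit-std target

noise_img = spectral_noise(alpha=1.0)
```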
[98] GenFusion: Feed-forward Human Performance Capture via Progressive Canonical Space Updates
Youngjoong Kwon, Yao He, Heejung Choi, Chen Geng, Zhengmao Liu, Jiajun Wu, Ehsan Adeli
Main category: cs.CV
TL;DR: A feed-forward human performance capture method that renders novel views from monocular RGB video by maintaining a canonical space that accumulates appearance information over time, using probabilistic regression to resolve conflicts between past and current observations.
Details
Motivation: The challenge in monocular human performance capture is the lack of sufficient observations, especially for unseen regions. The authors aim to leverage temporal continuity to accumulate appearance information from multiple frames to address this limitation.
Method: Maintains a canonical space that is progressively updated with each incoming frame, accumulating appearance information over time. Uses probabilistic regression to effectively utilize this context while respecting the deformation of the live state, resolving conflicts between past and current observations.
Result: Demonstrates effectiveness on in-domain (4D-Dress) and out-of-distribution (MVHumanNet) datasets, producing sharper reconstructions than deterministic regression approaches and enabling plausible synthesis even in regions with no prior observations.
Conclusion: The proposed method successfully addresses the challenge of insufficient observations in monocular human performance capture by leveraging temporal continuity through a canonical space and probabilistic regression, enabling high-quality novel view synthesis.
Abstract: We present a feed-forward human performance capture method that renders novel views of a performer from a monocular RGB stream. A key challenge in this setting is the lack of sufficient observations, especially for unseen regions. Assuming the subject moves continuously over time, we take advantage of the fact that more body parts become observable by maintaining a canonical space that is progressively updated with each incoming frame. This canonical space accumulates appearance information over time and serves as a context bank when direct observations are missing in the current live frame. To effectively utilize this context while respecting the deformation of the live state, we formulate the rendering process as probabilistic regression. This resolves conflicts between past and current observations, producing sharper reconstructions than deterministic regression approaches. Furthermore, it enables plausible synthesis even in regions with no prior observations. Experiments on in-domain (4D-Dress) and out-of-distribution (MVHumanNet) datasets demonstrate the effectiveness of our approach.
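The canonical-space update can be illustrated with the simplest probabilistic regressor: treat each canonical texel as a Gaussian and fuse past and current observations by precision weighting, so the more certain source dominates. This toy per-texel model is an assumption for illustration, not the paper's actual formulation:

```python
import numpy as np

def fuse(canon_mu, canon_var, obs_mu, obs_var):
    """Precision-weighted fusion of the canonical estimate with a new frame:
    the more certain source dominates, resolving past-vs-current conflicts."""
    w_canon, w_obs = 1.0 / canon_var, 1.0 / obs_var
    var = 1.0 / (w_canon + w_obs)
    mu = var * (w_canon * canon_mu + w_obs * obs_mu)
    return mu, var

# One RGB texel in the canonical space: it starts uncertain, and each observed
# frame (here three identical reddish observations) tightens the estimate.
mu, var = np.zeros(3), np.full(3, 10.0)
for frame_color in [np.array([0.8, 0.2, 0.2])] * 3:
    mu, var = fuse(mu, var, frame_color, np.full(3, 0.5))
```

After three frames the texel's variance has collapsed and its mean sits near the observed color, which is how accumulated context can back-fill regions the live frame does not see.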
[99] Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis
Shuang Chen, Quanxin Shou, Hangting Chen, Yucheng Zhou, Kaituo Feng, Wenbo Hu, Yi-Fan Zhang, Yunlong Lin, Wenxuan Huang, Mingyang Song, Dasen Dai, Bolin Jiang, Manyuan Zhang, Shi-Xue Zhang, Zhengkai Jiang, Lucas Wang, Zhao Zhong, Yu Cheng, Nanyun Peng
Main category: cs.CV
TL;DR: Unify-Agent: a unified multimodal agent for world-grounded image synthesis that reframes generation as an agentic pipeline (prompt understanding, evidence searching, recaptioning, synthesis) to overcome the limits of frozen parametric knowledge when generating images of long-tail factual concepts.
Details
Motivation: Current unified multimodal models rely on frozen parametric knowledge and struggle with real-world image generation involving long-tail and knowledge-intensive concepts. The paper explores agentic modeling to address this limitation by enabling external knowledge grounding.
Method: Unify-Agent reframes image generation as an agentic pipeline with four stages: prompt understanding, multimodal evidence searching, grounded recaptioning, and final synthesis. The model is trained on 143K high-quality agent trajectories curated for world-grounded image synthesis, with a tailored multimodal data pipeline. Also introduces FactIP benchmark for evaluating factual concept generation.
Result: Unify-Agent substantially improves over its base unified model across diverse benchmarks and real-world generation tasks, approaching the world knowledge capabilities of the strongest closed-source models. The agentic approach demonstrates value for reliable open-world image synthesis.
Conclusion: Agent-based modeling for world-grounded image synthesis shows promise, highlighting the value of tightly coupling reasoning, searching, and generation for reliable open-world agentic image synthesis. This represents an early exploration of agentic approaches to multimodal generation.
Abstract: Unified multimodal models provide a natural and promising architecture for understanding diverse and complex real-world knowledge while generating high-quality images. However, they still rely primarily on frozen parametric knowledge, which makes them struggle with real-world image generation involving long-tail and knowledge-intensive concepts. Inspired by the broad success of agents on real-world tasks, we explore agentic modeling to address this limitation. Specifically, we present Unify-Agent, a unified multimodal agent for world-grounded image synthesis, which reframes image generation as an agentic pipeline consisting of prompt understanding, multimodal evidence searching, grounded recaptioning, and final synthesis. To train our model, we construct a tailored multimodal data pipeline and curate 143K high-quality agent trajectories for world-grounded image synthesis, enabling effective supervision over the full agentic generation process. We further introduce FactIP, a benchmark covering 12 categories of culturally significant and long-tail factual concepts that explicitly requires external knowledge grounding. Extensive experiments show that our proposed Unify-Agent substantially improves over its base unified model across diverse benchmarks and real world generation tasks, while approaching the world knowledge capabilities of the strongest closed-source models. As an early exploration of agent-based modeling for world-grounded image synthesis, our work highlights the value of tightly coupling reasoning, searching, and generation for reliable open-world agentic image synthesis.
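The four-stage agentic pipeline can be sketched as a plain function chain; every callable below (search, recaption, synthesize) is an injected stand-in, not one of the paper's actual modules, and the "understanding" step is a toy heuristic:

```python
def unify_agent_sketch(prompt, search, recaption, synthesize):
    """Chain the four stages: understand the prompt, search evidence for the
    entities it mentions, rewrite the prompt grounded in that evidence, then
    synthesize. Entity extraction here is a toy title-case heuristic."""
    entities = [w for w in prompt.split() if w.istitle()]
    evidence = [search(e) for e in entities]
    grounded_prompt = recaption(prompt, evidence)
    return synthesize(grounded_prompt)

# Stand-in callables; a real system would plug in a retriever and a generator.
img = unify_agent_sketch(
    "A portrait of Hypatia teaching",
    search=lambda e: f"facts({e})",
    recaption=lambda p, ev: p + " | " + "; ".join(ev),
    synthesize=lambda p: f"IMAGE[{p}]",
)
```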
[100] Fine-grained Image Quality Assessment for Perceptual Image Restoration
Xiangfei Sheng, Xiaofeng Pan, Zhichao Yang, Pengfei Chen, Leida Li
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2508.14475 returned HTTP 429 (rate limited).
[101] MEDiC: Multi-objective Exploration of Distillation from CLIP
Konstantinos Georgiou, Maofeng Tang, Hairong Qi
Main category: cs.CV
TL;DR: MEDiC combines pixel reconstruction, token distillation from CLIP, and global alignment in a multi-objective MIM framework, achieving strong performance but revealing fragility in loss weighting and limitations of evolved masking with semantic teachers.
Details
Motivation: Current masked image modeling methods operate either in raw pixel space or latent feature space, but not both. The authors aim to combine these complementary approaches to leverage the strengths of both reconstruction and semantic distillation.
Method: Proposes MEDiC framework with three objectives: 1) patch-level token distillation from frozen CLIP encoder, 2) global CLS alignment, and 3) pixel reconstruction via lightweight decoder. Also explores hierarchical clustering with relative position bias for evolved masking.
Result: Achieves 73.9% kNN and 85.1% fine-tuning accuracy with ViT-Base at 300 epochs. All three objectives provide complementary information. Evolved masking doesn’t outperform simple block masking in the teacher-guided setting. Loss weights are extremely fragile, with small perturbations causing accuracy drops of up to 17 percentage points.
Conclusion: Combining pixel and latent space objectives in MIM is effective, but careful tuning is required. Teacher’s semantic awareness reduces benefits of sophisticated masking strategies. The framework demonstrates strong performance but highlights sensitivity to hyperparameter choices.
Abstract: Masked image modeling (MIM) methods typically operate in either raw pixel space (reconstructing masked patches) or latent feature space (aligning with a pre-trained teacher). We present MEDiC (Multi-objective Exploration of Distillation from CLIP), a framework that combines both spaces in a single pipeline through three complementary objectives: patch-level token distillation from a frozen CLIP encoder, global CLS alignment, and pixel reconstruction via a lightweight decoder. We conduct a systematic investigation of the design space surrounding this multi-objective framework. First, we show that all three objectives provide complementary information, with the full combination reaching 73.9% kNN accuracy on ImageNet-1K. Second, we introduce hierarchical clustering with relative position bias for evolved masking and find that, despite producing more semantically coherent masks than prior methods, evolved masking does not outperform simple block masking in the teacher-guided distillation setting, a finding we attribute to the teacher’s inherent semantic awareness. Third, we reveal that optimal scalar loss weights are extremely fragile, with small perturbations causing drops of up to 17 percentage points in kNN accuracy. Our framework achieves 73.9% kNN and 85.1% fine-tuning accuracy with ViT-Base at 300 epochs.
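The three objectives combine into one weighted loss. A toy NumPy version (cosine distances for the two alignment terms and MSE for reconstruction are plausible stand-ins, not necessarily the paper's exact losses; the weights w are precisely the fragile scalars the paper warns about):

```python
import numpy as np

def medic_loss(student_tokens, teacher_tokens, student_cls, teacher_cls,
               recon, pixels, w=(1.0, 1.0, 1.0)):
    """Weighted sum of the three MEDiC-style objectives."""
    def cosine_dist(a, b):
        a = a / np.linalg.norm(a, axis=-1, keepdims=True)
        b = b / np.linalg.norm(b, axis=-1, keepdims=True)
        return 1.0 - (a * b).sum(-1).mean()

    l_distill = cosine_dist(student_tokens, teacher_tokens)  # patch tokens vs frozen CLIP
    l_cls = cosine_dist(student_cls, teacher_cls)            # global CLS alignment
    l_recon = ((recon - pixels) ** 2).mean()                 # masked-pixel reconstruction
    return w[0] * l_distill + w[1] * l_cls + w[2] * l_recon

rng = np.random.default_rng(0)
toks = rng.standard_normal((16, 8))   # 16 patch tokens, dim 8
cls = rng.standard_normal((1, 8))
px = rng.standard_normal((4, 4))
loss = medic_loss(toks, toks, cls, cls, px, px)  # a perfect student scores ~0
```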
[102] UltraG-Ray: Physics-Based Gaussian Ray Casting for Novel Ultrasound View Synthesis
Felix Duelmer, Jakob Klaushofer, Magdalena Wysocki, Nassir Navab, Mohammad Farid Azampour
Main category: cs.CV
TL;DR: UltraG-Ray: A novel ultrasound scene representation using learnable 3D Gaussian fields with physics-based B-mode synthesis that captures view-dependent attenuation effects for more realistic novel view synthesis.
Details
Motivation: Current ultrasound novel view synthesis methods struggle with complex tissue and view-dependent acoustic effects, creating a gap between simulation and reality. Physics-based approaches that incorporate ultrasound image formation processes are needed for more realistic synthesis.
Method: Introduces UltraG-Ray with a learnable 3D Gaussian field representation that explicitly encodes ultrasound-specific parameters (attenuation, reflection) and uses a novel ray casting scheme for physics-based B-mode synthesis.
Result: Achieves consistent gains in image quality metrics (up to 15% increase on MS-SSIM) compared to state-of-the-art methods, demonstrating clear improvement in realism of synthesized ultrasound images.
Conclusion: The approach naturally captures view-dependent attenuation effects and enables generation of physically informed B-mode images with increased realism, advancing ultrasound novel view synthesis capabilities.
Abstract: Novel view synthesis (NVS) in ultrasound has gained attention as a technique for generating anatomically plausible views beyond the acquired frames, offering new capabilities for training clinicians or data augmentation. However, current methods struggle with complex tissue and view-dependent acoustic effects. Physics-based NVS aims to address these limitations by including the ultrasound image formation process into the simulation. Recent approaches combine a learnable implicit scene representation with an ultrasound-specific rendering module, yet a substantial gap between simulation and reality remains. In this work, we introduce UltraG-Ray, a novel ultrasound scene representation based on a learnable 3D Gaussian field, coupled to an efficient physics-based module for B-mode synthesis. We explicitly encode ultrasound-specific parameters, such as attenuation and reflection, into a Gaussian-based spatial representation and realize image synthesis within a novel ray casting scheme. In contrast to previous methods, this approach naturally captures view-dependent attenuation effects, thereby enabling the generation of physically informed B-mode images with increased realism. We compare our method to state-of-the-art and observe consistent gains in image quality metrics (up to 15% increase on MS-SSIM), demonstrating clear improvement in terms of realism of the synthesized ultrasound images.
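The view-dependent attenuation the method captures follows Beer-Lambert absorption along each ray: every sample's echo is scaled by the transmission accumulated in front of it. A 1-D illustrative version (UltraG-Ray casts rays through a learned 3D Gaussian field; the uniform coefficients here are toy values):

```python
import numpy as np

def bmode_ray(reflectivity, attenuation, ds=1.0):
    """March one scanline: each sample's echo is its reflectivity scaled by
    Beer-Lambert transmission exp(-sum of attenuation in front of it)."""
    cumulative = np.cumsum(attenuation * ds)
    transmit = np.exp(-np.concatenate(([0.0], cumulative[:-1])))  # before each sample
    return reflectivity * transmit

refl = np.array([0.0, 1.0, 0.0, 1.0])   # two equal reflectors along the ray
att = np.array([0.5, 0.5, 0.5, 0.5])    # uniform attenuation coefficient
echo = bmode_ray(refl, att)             # the deeper reflector returns less
```

Because transmission depends on everything a ray has traversed, the same reflector brightens or dims as the probe viewpoint changes, which is the view-dependent effect the paper models.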
[103] MMFace-DiT: A Dual-Stream Diffusion Transformer for High-Fidelity Multimodal Face Generation
Bharath Krishnamurthy, Ajita Rattani
Main category: cs.CV
TL;DR: MMFace-DiT introduces a unified dual-stream diffusion transformer for synergistic multimodal face synthesis, achieving a 40% improvement in visual fidelity and prompt alignment over state-of-the-art methods.
Details
Motivation: Existing multimodal face generation models have limitations: they typically extend pre-trained text-to-image pipelines with ad hoc designs that inherit architectural constraints, duplicate parameters, and fail under conflicting modalities or mismatched latent spaces, limiting synergistic fusion across semantic and spatial domains.
Method: A unified dual-stream diffusion transformer with dual-stream transformer blocks that process spatial (mask/sketch) and semantic (text) tokens in parallel, deeply fusing them through shared Rotary Position-Embedded (RoPE) Attention mechanism. Includes a novel Modality Embedder for dynamic adaptation to varying spatial conditions without retraining.
Result: Achieves 40% improvement in visual fidelity and prompt alignment over six state-of-the-art multimodal face generation models, establishing a flexible new paradigm for end-to-end controllable generative modeling.
Conclusion: MMFace-DiT provides a unified architecture for synergistic multimodal fusion that prevents modal dominance and ensures strong adherence to both text and structural priors, enabling unprecedented spatial-semantic consistency for controllable face generation.
Abstract: Recent multimodal face generation models address the spatial control limitations of text-to-image diffusion models by augmenting text-based conditioning with spatial priors such as segmentation masks, sketches, or edge maps. This multimodal fusion enables controllable synthesis aligned with both high-level semantic intent and low-level structural layout. However, most existing approaches typically extend pre-trained text-to-image pipelines by appending auxiliary control modules or stitching together separate uni-modal networks. These ad hoc designs inherit architectural constraints, duplicate parameters, and often fail under conflicting modalities or mismatched latent spaces, limiting their ability to perform synergistic fusion across semantic and spatial domains. We introduce MMFace-DiT, a unified dual-stream diffusion transformer engineered for synergistic multimodal face synthesis. Its core novelty lies in a dual-stream transformer block that processes spatial (mask/sketch) and semantic (text) tokens in parallel, deeply fusing them through a shared Rotary Position-Embedded (RoPE) Attention mechanism. This design prevents modal dominance and ensures strong adherence to both text and structural priors to achieve unprecedented spatial-semantic consistency for controllable face generation. Furthermore, a novel Modality Embedder enables a single cohesive model to dynamically adapt to varying spatial conditions without retraining. MMFace-DiT achieves a 40% improvement in visual fidelity and prompt alignment over six state-of-the-art multimodal face generation models, establishing a flexible new paradigm for end-to-end controllable generative modeling. The code and dataset are available on our project page: https://vcbsl.github.io/MMFace-DiT/
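The shared RoPE attention at the heart of the dual-stream block rotates channel pairs by position-dependent angles, so attention dot products depend on relative offsets between tokens from either stream. A generic RoPE sketch (the standard formulation, not the paper's exact implementation):

```python
import numpy as np

def apply_rope(x, positions, base=10000.0):
    """Rotary position embedding on tokens x of shape (n, d), d even: channel
    pairs (x1[i], x2[i]) are rotated by position-dependent angles. Rotations
    preserve token norms, and position 0 is left unrotated."""
    n, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)      # one frequency per channel pair
    angles = positions[:, None] * freqs[None, :]   # (n, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

tokens = np.ones((4, 8))                       # 4 toy tokens, dim 8
rotated = apply_rope(tokens, np.arange(4.0))   # positions 0..3
```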
[104] POVQA: Preference-Optimized Video Question Answering with Rationales for Data Efficiency
Ashim Dahal, Ankit Ghimire, Saydul Akbar Murad, Nick Rahimi
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2510.01009 returned HTTP 429 (rate limited).
[105] Modeling Spatiotemporal Neural Frames for High Resolution Brain Dynamic
Wanying Qu, Jianxiong Gao, Wei Wang, Yanwei Fu
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2603.24176 returned HTTP 429 (rate limited).
[106] Generating Humanless Environment Walkthroughs from Egocentric Walking Tour Videos
Yujin Ham, Junho Kim, Vivek Boominathan, Guha Balakrishnan
Main category: cs.CV
TL;DR: A generative video inpainting method that removes humans and shadows from egocentric walking tour videos using a semi-synthetic dataset and fine-tuned diffusion models, enabling better 3D/4D environment modeling.
Details
Motivation: Egocentric walking tour videos contain rich environmental data but are often filled with humans and shadows that hinder their use for environment modeling applications. The paper aims to address this by developing a method to realistically remove humans and their shadows from such videos.
Method: Constructs a semi-synthetic dataset of video clip pairs: environment-only background clips and composite clips with walking humans and simulated shadows overlaid. Uses real egocentric walking tour videos for both foreground and background components. Fine-tunes the Casper video diffusion model for object and effects inpainting using this dataset.
Result: The resulting model performs significantly better than the original Casper model both qualitatively and quantitatively at removing humans from walking tour clips with significant human presence and complex backgrounds. The generated clips can successfully be used to build 3D/4D models of urban locations.
Conclusion: The proposed approach effectively addresses the challenge of human presence in egocentric walking tour videos through a semi-synthetic dataset and fine-tuned diffusion model, enabling better environment modeling applications from such video data.
Abstract: Egocentric “walking tour” videos provide a rich source of image data to develop rich and diverse visual models of environments around the world. However, the significant presence of humans in frames of these videos due to crowds and eye-level camera perspectives mitigates their usefulness in environment modeling applications. We focus on addressing this challenge by developing a generative algorithm that can realistically remove (i.e., inpaint) humans and their associated shadow effects from walking tour videos. Key to our approach is the construction of a rich semi-synthetic dataset of video clip pairs to train this generative model. Each pair in the dataset consists of an environment-only background clip, and a composite clip of walking humans with simulated shadows overlaid on the background. We randomly sourced both foreground and background components from real egocentric walking tour videos around the world to maintain visual diversity. We then used this dataset to fine-tune the state-of-the-art Casper video diffusion model for object and effects inpainting, and demonstrate that the resulting model performs far better than Casper both qualitatively and quantitatively at removing humans from walking tour clips with significant human presence and complex backgrounds. Finally, we show that the resulting generated clips can be used to build successful 3D/4D models of urban locations.
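The semi-synthetic pair construction amounts to shadow darkening plus alpha compositing: darken the background under a simulated shadow mask, then blend the human layer on top; (composite, background) then forms one supervised inpainting pair. A minimal sketch with illustrative values:

```python
import numpy as np

def composite(background, human_rgba, shadow_alpha, shadow_strength=0.5):
    """Darken the background where the shadow mask is active, then alpha-blend
    the human layer on top. Returns the composite frame."""
    shaded = background * (1.0 - shadow_strength * shadow_alpha[..., None])
    alpha = human_rgba[..., 3:4]
    return alpha * human_rgba[..., :3] + (1.0 - alpha) * shaded

bg = np.full((4, 4, 3), 0.8)             # environment-only background
human = np.zeros((4, 4, 4))
human[1:3, 1:3] = [0.2, 0.1, 0.1, 1.0]   # opaque human patch (RGBA)
shadow = np.zeros((4, 4))
shadow[3, :] = 1.0                       # simulated shadow row
frame = composite(bg, human, shadow)     # (frame, bg) is one training pair
```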
[107] ANVIL: Accelerator-Native Video Interpolation via Codec Motion Vector Priors
Shibo Liu
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2603.26835 returned HTTP 429 (rate limited).
[108] Let the Abyss Stare Back: Adaptive Falsification for Autonomous Scientific Discovery
Peiran Li, Fangzhou Lin, Shuo Xing, Jiashuo Sun, Dylan Zhang, Siyuan Yang, Chaoqun Ni, Zhengzhong Tu
Main category: cs.CV
TL;DR: DASES is a falsification-driven framework for autonomous scientific discovery that uses adaptive falsification instead of static validation to prevent models from learning to pass tests without understanding underlying mechanisms.
Details
Motivation: Current autonomous scientific discovery approaches risk allowing models to learn to pass evaluations without actually understanding the underlying mechanisms, similar to "teaching to the test." The authors propose moving from passive validation to active falsification to ensure genuine mechanistic understanding.
Method: DASES introduces a three-component framework: an Innovator that creates scientific artifacts, an Abyss Falsifier that generates adversarial counterexamples, and a Mechanistic Causal Extractor that ensures scientific admissibility. These components co-evolve under a fixed scientific contract to push against candidates through adaptive falsification.
Result: In controlled experiments, DASES rejects artifacts that static validation would accept, identifies candidates that survive admissible falsification, and discovers FNG-CE, a loss function that transfers beyond synthetic environments and outperforms CE and CE+L2 across standard benchmarks including ImageNet.
Conclusion: Active falsification through frameworks like DASES is necessary for genuine scientific discovery, preventing models from learning superficial patterns to pass tests while missing underlying mechanisms. The approach enables discovery of robust, transferable solutions.
Abstract: Autonomous scientific discovery is entering a more dangerous regime: once the evaluator is frozen, a sufficiently strong search process can learn to win the exam without learning the mechanism the task was meant to reveal. This is the idea behind our title. To let the abyss stare back is to make evaluation actively push against the candidate through adaptive falsification, rather than passively certify it through static validation. We introduce DASES, a falsification-driven framework in which an Innovator, an Abyss Falsifier, and a Mechanistic Causal Extractor co-evolve executable scientific artifacts and scientifically admissible counterexample environments under a fixed scientific contract. In a controlled loss-discovery problem with a single editable locus, DASES rejects artifacts that static validation would have accepted, identifies the first candidate that survives the admissible falsification frontier, and discovers FNG-CE, a loss that transfers beyond the synthetic discovery environment and consistently outperforms CE and CE+L2 under controlled comparisons across standard benchmarks, including ImageNet.
[109] Generative AI Enables Structural Brain Network Construction from fMRI via Symmetric Diffusion Learning
Qiankun Zuo, Bangjun Lei, Wanyu Qiu, Changhong Jing, Jin Hong, Shuqiang Wang
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2309.16205 returned HTTP 429 (rate limited).
[110] LA-Sign: Looped Transformers with Geometry-aware Alignment for Skeleton-based Sign Language Recognition
Muxin Pu, Mei Kuan Lim, Chun Yong Chong, Chen Change Loy
Main category: cs.CV
TL;DR: LA-Sign is a looped transformer framework with geometry-aware alignment for skeleton-based isolated sign language recognition; it uses recurrence instead of deeper layers to refine motion understanding, and hyperbolic-space projection for multi-scale semantic organization.
Details
Motivation: Existing deep feed-forward architectures for skeleton-based sign language recognition lack mechanisms for recurrent refinement and structured representation, despite the need for fine-grained understanding of articulated motion across multiple spatial scales.
Method: Proposes LA-Sign with a looped transformer framework using recurrence to repeatedly revisit latent representations for progressive motion refinement. Introduces a geometry-aware contrastive objective that projects skeletal and textual features into adaptive hyperbolic space for multi-scale semantic organization. Explores three looping designs and multiple geometric manifolds.
Result: Encoder-decoder looping combined with adaptive Poincare alignment yields strongest performance. Achieves state-of-the-art results on WLASL and MSASL benchmarks while using fewer unique layers than traditional approaches.
Conclusion: Recurrent latent refinement and geometry-aware representation learning are effective for sign language recognition, demonstrating that looped transformers with geometric alignment can outperform deeper feed-forward architectures.
Abstract: Skeleton-based isolated sign language recognition (ISLR) demands fine-grained understanding of articulated motion across multiple spatial scales, from subtle finger movements to global body dynamics. Existing approaches typically rely on deep feed-forward architectures, which increase model capacity but lack mechanisms for recurrent refinement and structured representation. We propose LA-Sign, a looped transformer framework with geometry-aware alignment for ISLR. Instead of stacking deeper layers, LA-Sign derives its depth from recurrence, repeatedly revisiting latent representations to progressively refine motion understanding under shared parameters. To further regularise this refinement process, we present a geometry-aware contrastive objective that projects skeletal and textual features into an adaptive hyperbolic space, encouraging multi-scale semantic organisation. We study three looping designs and multiple geometric manifolds, demonstrating that encoder-decoder looping combined with adaptive Poincare alignment yields the strongest performance. Extensive experiments on WLASL and MSASL benchmarks show that LA-Sign achieves state-of-the-art results while using fewer unique layers, highlighting the effectiveness of recurrent latent refinement and geometry-aware representation learning for sign language recognition.
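The core idea of deriving depth from recurrence can be illustrated with a toy sketch (a simple residual block stands in for a transformer layer; all sizes and names here are illustrative, not from the paper): applying the same shared-parameter block repeatedly gives the effective depth of a multi-layer stack while storing only one layer's parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
# one shared block: toy stand-in for a transformer layer (linear + ReLU, residual)
W = rng.normal(scale=0.1, size=(d, d))
b = np.zeros(d)

def block(x):
    return x + np.maximum(W @ x + b, 0.0)  # residual refinement step

def looped_forward(x, loops=6):
    # depth comes from recurrence: the SAME parameters are reused each loop,
    # repeatedly revisiting the latent representation
    for _ in range(loops):
        x = block(x)
    return x

x = rng.normal(size=d)
y = looped_forward(x, loops=6)
shared_params = W.size + b.size         # parameters of the looped model (one block)
stacked_params = 6 * (W.size + b.size)  # a comparable 6-layer feed-forward stack
```

A 6-loop model here touches the representation six times with a single block's parameter budget, which is the trade-off the paper exploits against deeper feed-forward stacks.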
[111] Is the Modality Gap a Bug or a Feature? A Robustness Perspective
Rhea Chowers, Oshri Naparstek, Udi Barzelay, Yair Weiss
Main category: cs.CV
TL;DR: The paper analyzes the modality gap in vision-language models, showing it emerges from contrastive loss optimization and that reducing this gap improves model robustness without harming clean accuracy.
Details
Motivation: To understand why a modality gap exists in vision-language models (where image and text embeddings are separated in shared space) and whether closing this gap improves downstream task performance and robustness.
Method: Theoretical analysis shows contrastive loss minimization yields separated modalities with orthogonal gap vector. Experimental validation with real-world VLMs using simple post-processing to move one modality toward the other’s mean.
Result: The modality gap is monotonically related to robustness: decreasing the gap preserves clean accuracy while making the model less likely to change its output when embeddings are perturbed. A simple post-processing step significantly increases robustness for many real-world VLMs.
Conclusion: The modality gap in VLMs is not a bug but a feature emerging from optimization, and strategically reducing it can enhance robustness without performance loss, offering practical improvements.
Abstract: Many modern multi-modal models (e.g. CLIP) seek an embedding space in which the two modalities are aligned. Somewhat surprisingly, almost all existing models show a strong modality gap: the distribution of images is well-separated from the distribution of texts in the shared embedding space. Despite a series of recent papers on this topic, it is still not clear why this gap exists nor whether closing the gap in post-processing will lead to better performance on downstream tasks. In this paper we show that under certain conditions, minimizing the contrastive loss yields a representation in which the two modalities are separated by a global gap vector that is orthogonal to their embeddings. We also show that under these conditions the modality gap is monotonically related to robustness: decreasing the gap does not change the clean accuracy of the models but makes it less likely that a model will change its output when the embeddings are perturbed. Our experiments show that for many real-world VLMs we can significantly increase robustness by a simple post-processing step that moves one modality towards the mean of the other modality, without any loss of clean accuracy.
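A minimal numeric illustration of the post-processing idea, under the paper's idealized condition that the gap vector is orthogonal to the embeddings (the toy vectors below are invented for illustration, not taken from any real VLM): shifting images toward the text mean closes the gap without changing nearest-text assignments.

```python
import numpy as np

# toy setup: text embeddings live in the span of e1, e2; every image embedding
# equals its paired text embedding plus a shared gap vector g along e3,
# mirroring a "global gap vector orthogonal to the embeddings"
texts = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.7, 0.7, 0.0]])
g = np.array([0.0, 0.0, 2.0])
images = texts + g

def classify(imgs, txts):
    return (imgs @ txts.T).argmax(axis=1)  # nearest text by dot product

before = classify(images, texts)
# post-processing: move images toward the text mean (closing the gap)
shift = texts.mean(0) - images.mean(0)
images_shifted = images + shift
after = classify(images_shifted, texts)
gap_before = np.linalg.norm(images.mean(0) - texts.mean(0))
gap_after = np.linalg.norm(images_shifted.mean(0) - texts.mean(0))
```

Because g is orthogonal to every text embedding, removing it leaves all image-text dot products, and hence the clean predictions, unchanged while the gap collapses to zero.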
[112] WorldFlow3D: Flowing Through 3D Distributions for Unbounded World Generation
Amogh Joshi, Julian Ost, Felix Heide
Main category: cs.CV
TL;DR: WorldFlow3D: A flow matching approach for generating unbounded 3D worlds using transport between data distributions, enabling controllable scene generation with geometric structure and texture control.
Details
Motivation: Unbounded 3D world generation is foundational for scene modeling in computer vision, graphics, and robotics. Current methods are limited by conditional denoising approaches, and there's a need for more general 3D generation methods that can handle complex structure and high-quality texture generation.
Method: Uses flow matching to define transport paths between 3D data distributions, not limited to conditional denoising. Generates causal and accurate 3D structure as intermediate distribution to guide complex structure and texture generation. Enables controllability through vectorized scene layout conditions for geometric structure and scene attributes for visual texture control.
Result: Effective on both real outdoor driving scenes and synthetic indoor scenes, showing cross-domain generalizability and high-quality generation on real data distributions. Achieves favorable scene generation fidelity over existing approaches in all tested settings for unbounded scene generation, with faster convergence than existing methods.
Conclusion: WorldFlow3D presents a novel flow matching approach for unbounded 3D world generation that outperforms existing methods, offers controllability, and demonstrates strong performance across different domains including real-world driving scenes and synthetic indoor environments.
Abstract: Unbounded 3D world generation is emerging as a foundational task for scene modeling in computer vision, graphics, and robotics. In this work, we present WorldFlow3D, a novel method capable of generating unbounded 3D worlds. Building upon a foundational property of flow matching - namely, defining a path of transport between two data distributions - we model 3D generation more generally as a problem of flowing through 3D data distributions, not limited to conditional denoising. We find that our latent-free flow approach generates causal and accurate 3D structure, and can use this as an intermediate distribution to guide the generation of more complex structure and high-quality texture - all while converging more rapidly than existing methods. We enable controllability over generated scenes with vectorized scene layout conditions for geometric structure control and visual texture control through scene attributes. We confirm the effectiveness of WorldFlow3D on both real outdoor driving scenes and synthetic indoor scenes, validating cross-domain generalizability and high-quality generation on real data distributions. We confirm favorable scene generation fidelity over approaches in all tested settings for unbounded scene generation. For more, see https://light.princeton.edu/worldflow3d.
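The flow-matching property the method builds on, a transport path between two distributions, can be sketched in a toy setting. This is generic flow matching with a straight-line path and paired samples (all data here is invented), not the paper's actual 3D pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)
# source distribution x0 (e.g. a simpler intermediate distribution) and
# target x1 (the complex target distribution) as paired toy 2-D point clouds
x0 = rng.normal(size=(128, 2))
x1 = x0 + np.array([5.0, -3.0])

def velocity(xt, x0, x1, t):
    # for the straight-line path x_t = (1 - t) * x0 + t * x1,
    # the target velocity field is constant along the path: v = x1 - x0
    return x1 - x0

# Euler integration of dx/dt = v transports source samples onto the target;
# in flow matching, a network is trained to regress this velocity field
x = x0.copy()
steps = 10
for i in range(steps):
    t = i / steps
    x = x + velocity(x, x0, x1, t) / steps
```

Integrating the (here, known) velocity field carries every source sample exactly onto its target, which is the transport property the paper generalizes to flowing through 3D data distributions.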
[113] TrajectoryMover: Generative Movement of Object Trajectories in Videos
Kiran Chhatre, Hyeonho Jeong, Yulia Gryaditskaya, Christopher E. Peters, Chun-Hao Paul Huang, Paul Guerrero
Main category: cs.CV
TL;DR: TrajectoryMover is a video generator that moves an object's trajectory in a video while preserving its relative 3D motion, fine-tuned on large-scale synthetic paired data produced by the TrajectoryAtlas pipeline.
Details
Motivation: Existing video editing methods focus on prescribing motion trajectories or altering appearance, but lack the ability to move an object's 3D motion trajectory while preserving its relative 3D motion. The main challenge is obtaining paired video data for this specific scenario.
Method: Introduces TrajectoryAtlas, a data generation pipeline for large-scale synthetic paired video data, and TrajectoryMover, a video generator fine-tuned with this data to enable generative movement of object trajectories.
Result: Successfully enables generative movement of object trajectories in videos, addressing the previously missing capability in video editing.
Conclusion: The proposed approach overcomes the data scarcity challenge for trajectory movement editing and demonstrates effective generative video editing capabilities for object trajectory manipulation.
Abstract: Generative video editing has enabled several intuitive editing operations for short video clips that would previously have been difficult to achieve, especially for non-expert editors. Existing methods focus on prescribing an object’s 3D or 2D motion trajectory in a video, or on altering the appearance of an object or a scene, while preserving both the video’s plausibility and identity. Yet a method to move an object’s 3D motion trajectory in a video, i.e., moving an object while preserving its relative 3D motion, is currently still missing. The main challenge lies in obtaining paired video data for this scenario. Previous methods typically rely on clever data generation approaches to construct plausible paired data from unpaired videos, but this approach fails if one of the videos in a pair can not easily be constructed from the other. Instead, we introduce TrajectoryAtlas, a new data generation pipeline for large-scale synthetic paired video data and a video generator TrajectoryMover fine-tuned with this data. We show that this successfully enables generative movement of object trajectories. Project page: https://chhatrekiran.github.io/trajectorymover
[114] Enhancing Box and Block Test with Computer Vision for Post-Stroke Upper Extremity Motor Evaluation
David Robinson, Animesh Gupta, Elizabeth Clark, Olga Melnik, Qiushi Fu, Mubarak Shah
Main category: cs.CV
TL;DR: Computer vision framework analyzes upper-extremity movement quality during Box and Block Test using world-aligned joint angles from monocular video, revealing movement patterns beyond standard clinical scores.
Details
Motivation: Standard clinical assessments of upper-extremity motor function after stroke lack sensitivity (ordinal scoring) or don't capture movement quality (time-based metrics). There's a need for objective, quantitative movement analysis that doesn't disrupt clinical workflow.
Method: Developed a computer vision framework using monocular video to extract world-aligned joint angles of fingers, arm, and trunk during the Box and Block Test. Applied unsupervised dimensionality reduction to joint-angle features to analyze movement patterns without clinical labels.
Result: Analyzed 136 BBT recordings from 48 healthy individuals and 7 stroke patients. Embeddings showed separation between healthy movement patterns and stroke-related deviations. Some patients with identical BBT scores could be distinguished by different postural patterns.
Conclusion: World-aligned joint angles from monocular video can capture meaningful movement quality information beyond standard BBT scores, enabling calibration-free, camera-based assessment without changing clinical routine.
Abstract: Standard clinical assessments of upper-extremity motor function after stroke either rely on ordinal scoring, which lacks sensitivity, or time-based task metrics, which do not capture movement quality. In this work, we present a computer vision-based framework for analysis of upper-extremity movement during the Box and Block Test (BBT) through world-aligned joint angles of fingers, arm, and trunk without depth sensors or calibration objects. We apply this framework to a dataset of 136 BBT recordings collected from 48 healthy individuals and 7 individuals post stroke. Using unsupervised dimensionality reduction of joint-angle features, we analyze movement patterns without relying on expert clinical labels. The resulting embeddings show separation between healthy movement patterns and stroke-related movement deviations. Importantly, some patients with the same BBT scores can be separated with different postural patterns. These results show that world-aligned joint angles can capture meaningful information of upper-extremity functions beyond standard time-based BBT scores, with no effort from the clinician other than monocular video recordings of the patient using a phone or camera. This work highlights the potential of a camera-based, calibration-free framework to measure movement quality in clinical assessments without changing the widely adopted clinical routine.
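The unsupervised analysis step can be sketched in miniature. The paper does not name its reduction method, so plain PCA on synthetic joint-angle features (all numbers below are invented) is used here only to show how an unlabeled 2-D embedding can separate movement patterns:

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic joint-angle feature vectors: 40 "healthy" trials and 10 trials
# with a systematic postural offset (a stand-in for stroke-related deviations)
healthy = rng.normal(0.0, 0.3, size=(40, 6))
deviant = rng.normal(0.0, 0.3, size=(10, 6)) + np.array([2.0, 2.0, 0, 0, 0, 0])
X = np.vstack([healthy, deviant])

# unsupervised dimensionality reduction via PCA (SVD of the centered data);
# no clinical labels are used anywhere in the fit
Xc = X - X.mean(0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
embed = Xc @ Vt[:2].T  # 2-D embedding

# measure how far apart the two (held-out) groups land along the first PC
sep = abs(embed[:40, 0].mean() - embed[40:, 0].mean())
```

The first principal component picks up the dominant postural offset without ever seeing group labels, which is the kind of separation the paper reports between healthy and post-stroke recordings.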
[115] Dual-Imbalance Continual Learning for Real-World Food Recognition
Xiaoyan Zhang, Jiangpeng He
Main category: cs.CV
TL;DR: DIME is a dual-imbalance-aware adapter merging framework for continual food recognition that addresses both within-class sample imbalance and step-imbalanced learning of new food categories.
Details
Motivation: Real-world food recognition faces two key challenges: 1) severe data imbalance with long-tailed class distributions, and 2) continual learning where new categories are introduced with highly uneven numbers across steps, creating dual imbalance that existing methods don't address.
Method: DIME learns lightweight adapters for each task using parameter-efficient fine-tuning, then integrates them through a class-count guided spectral merging strategy with rank-wise threshold modulation to preserve dominant knowledge while allowing adaptive updates.
Result: Experiments on realistic long-tailed food benchmarks under step-imbalanced setups show DIME consistently improves by more than 3% over the strongest existing continual learning baselines.
Conclusion: DIME effectively addresses dual imbalance in continual food recognition through adapter merging with class-count guidance and threshold modulation, enabling efficient deployment with a single merged adapter.
Abstract: Visual food recognition in real-world dietary logging scenarios naturally exhibits severe data imbalance, where a small number of food categories appear frequently while many others occur rarely, resulting in long-tailed class distributions. In practice, food recognition systems often operate in a continual learning setting, where new categories are introduced sequentially over time. However, existing studies typically assume that each incremental step introduces a similar number of new food classes, which rarely happens in real world where the number of newly observed categories can vary significantly across steps, leading to highly uneven learning dynamics. As a result, continual food recognition exhibits a dual imbalance: imbalanced samples within each food class and imbalanced numbers of new food classes to learn at each incremental learning step. In this work, we introduce DIME, a Dual-Imbalance-aware Adapter Merging framework for continual food recognition. DIME learns lightweight adapters for each task using parameter-efficient fine-tuning and progressively integrates them through a class-count guided spectral merging strategy. A rank-wise threshold modulation mechanism further stabilizes the merging process by preserving dominant knowledge while allowing adaptive updates. The resulting model maintains a single merged adapter for inference, enabling efficient deployment without accumulating task-specific modules. Experiments on realistic long-tailed food benchmarks under our step-imbalanced setup show that the proposed method consistently improves by more than 3% over the strongest existing continual learning baselines. Code is available at https://github.com/xiaoyanzhang1/DIME.
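A rough sketch of what class-count guided spectral merging could look like. The weighting scheme, matrix shapes, and the `keep_rank` cutoff are assumptions made for illustration; the paper's actual rank-wise threshold modulation is more involved than a hard truncation:

```python
import numpy as np

rng = np.random.default_rng(0)
# two task adapters (weight deltas) learned at different incremental steps
A1 = rng.normal(size=(8, 8))
A2 = rng.normal(size=(8, 8))
counts = [30, 3]  # number of new classes seen at each step (step imbalance)

def spectral_merge(adapters, counts, keep_rank=4):
    # class-count guided weighting, then SVD with a rank cutoff that keeps
    # only the dominant spectral components of the merged adapter
    w = np.array(counts, dtype=float)
    w /= w.sum()
    merged = sum(wi * Ai for wi, Ai in zip(w, adapters))
    U, S, Vt = np.linalg.svd(merged)
    S[keep_rank:] = 0.0  # discard weak components (adaptive-update headroom)
    return U @ np.diag(S) @ Vt

M = spectral_merge([A1, A2], counts)
```

The result is a single low-rank merged adapter, matching the paper's deployment story of keeping one adapter for inference rather than accumulating task-specific modules.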
[116] SparseDriveV2: Scoring is All You Need for End-to-End Autonomous Driving
Wenchao Sun, Xuewu Lin, Keyu Chen, Zixiang Pei, Xiang Li, Yining Shi, Sifa Zheng
Main category: cs.CV
TL;DR: SparseDriveV2 improves scoring-based autonomous driving planning by using factorized trajectory representation (paths × velocity profiles) and two-stage scoring to achieve state-of-the-art performance on driving benchmarks.
Details
Motivation: Existing scoring-based planning methods either use coarse static trajectory vocabularies or dynamic proposals. The paper investigates whether dense static vocabularies can match dynamic proposal performance, finding that performance improves with vocabulary density without saturation.
Method: Two innovations: 1) Scalable factorized vocabulary representation decomposing trajectories into geometric paths and velocity profiles for combinatorial action space coverage; 2) Two-stage scoring with coarse factorized scoring over paths/velocity profiles followed by fine-grained scoring on composed trajectories.
Result: Achieves 92.0 PDMS and 90.1 EPDMS on NAVSIM, with 89.15 Driving Score and 70.00 Success Rate on Bench2Drive using lightweight ResNet-34 backbone.
Conclusion: Dense static vocabularies with proper factorization and scoring strategies can achieve state-of-the-art planning performance without dynamic trajectory generation, offering computational efficiency benefits.
Abstract: End-to-end multi-modal planning has been widely adopted to model the uncertainty of driving behavior, typically by scoring candidate trajectories and selecting the optimal one. Existing approaches generally fall into two categories: scoring a large static trajectory vocabulary, or scoring a small set of dynamically generated proposals. While static vocabularies often suffer from coarse discretization of the action space, dynamic proposals provide finer-grained precision and have shown stronger empirical performance on existing benchmarks. However, it remains unclear whether dynamic generation is fundamentally necessary, or whether static vocabularies can already achieve comparable performance when they are sufficiently dense to cover the action space. In this work, we start with a systematic scaling study of Hydra-MDP, a representative scoring-based method, revealing that performance consistently improves as trajectory anchors become denser, without exhibiting saturation before computational constraints are reached. Motivated by this observation, we propose SparseDriveV2 to push the performance boundary of scoring-based planning through two complementary innovations: (1) a scalable vocabulary representation with a factorized structure that decomposes trajectories into geometric paths and velocity profiles, enabling combinatorial coverage of the action space, and (2) a scalable scoring strategy with coarse factorized scoring over paths and velocity profiles followed by fine-grained scoring on a small set of composed trajectories. By combining these two techniques, SparseDriveV2 achieves 92.0 PDMS and 90.1 EPDMS on NAVSIM, with 89.15 Driving Score and 70.00 Success Rate on Bench2Drive with a lightweight ResNet-34 as backbone. Code and model are released at https://github.com/swc-17/SparseDriveV2.
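The factorized vocabulary can be sketched as composing geometric paths with velocity profiles: P paths and V profiles yield P x V candidate trajectories. The path parameterization below (constant-curvature arcs) and all numbers are invented for illustration:

```python
import numpy as np

def make_path(curvature, length=20.0, n=200):
    # geometric path only: positions sampled by arc length, no timing info
    s = np.linspace(0.0, length, n)
    theta = curvature * s
    step = length / (n - 1)
    pts = np.cumsum(np.stack([np.cos(theta), np.sin(theta)], 1) * step, axis=0)
    return pts, s

def compose(paths, profiles):
    # a velocity profile gives the distance travelled along the path at each
    # timestep; composing every path with every profile covers the action
    # space combinatorially
    trajs = []
    for pts, s in paths:
        for prof in profiles:
            idx = np.searchsorted(s, np.clip(prof, 0.0, s[-1]))
            trajs.append(pts[np.minimum(idx, len(s) - 1)])
    return trajs

paths = [make_path(c) for c in (-0.05, 0.0, 0.05)]                   # 3 paths
profiles = [np.linspace(0.0, v, 8) for v in (5.0, 10.0, 15.0, 20.0)]  # 4 profiles
trajectories = compose(paths, profiles)                               # 3 x 4 = 12
```

Coarse scoring would rank the 3 paths and 4 profiles separately (7 scores), and fine scoring would then rescore only the few composed trajectories that survive, rather than all 12.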
[117] LatentPilot: Scene-Aware Vision-and-Language Navigation by Dreaming Ahead with Latent Visual Reasoning
Haihong Hao, Lei Chen, Mingfei Han, Changlin Li, Dong An, Yuqiang Yang, Zhihui Li, Xiaojun Chang
Main category: cs.CV
TL;DR: LatentPilot introduces a vision-and-language navigation model that learns action-conditioned visual dynamics by exploiting future observations during training, enabling agents to “dream ahead” about how actions affect subsequent visual observations without needing future frames at inference.
Details
Motivation: Current VLN models focus on past/current observations but ignore future visual dynamics from actions, lacking causal understanding of how actions change the visual world. Humans can imagine near-future outcomes using action-dynamics causality, which improves navigation decisions.
Method: Uses flywheel-style training with iterative on-policy trajectory collection and retraining, expert takeover for deviation correction. Learns visual latent tokens without supervision that attend globally in continuous latent space, carried across steps to enable “dreaming ahead” about action effects.
Result: Achieves new SOTA results on R2R-CE, RxR-CE, and R2R-PE benchmarks. Real-robot tests across diverse environments demonstrate superior understanding of environment-action dynamics.
Conclusion: LatentPilot successfully enables VLN agents to learn action-conditioned visual dynamics by exploiting future observations during training, improving navigation performance through better causal understanding of how actions affect visual changes.
Abstract: Existing vision-and-language navigation (VLN) models primarily reason over past and current visual observations, while largely ignoring the future visual dynamics induced by actions. As a result, they often lack an effective understanding of the causal relationship between actions and how the visual world changes, limiting robust decision-making. Humans, in contrast, can imagine the near future by leveraging action-dynamics causality, which improves both environmental understanding and navigation choices. Inspired by this capability, we propose LatentPilot, a new paradigm that exploits future observations during training as a valuable data source to learn action-conditioned visual dynamics, while requiring no access to future frames at inference. Concretely, we propose a flywheel-style training mechanism that iteratively collects on-policy trajectories and retrains the model to better match the agent’s behavior distribution, with an expert takeover triggered when the agent deviates excessively. LatentPilot further learns visual latent tokens without explicit supervision; these latent tokens attend globally in a continuous latent space and are carried across steps, serving as both the current output and the next input, thereby enabling the agent to dream ahead and reason about how actions will affect subsequent observations. Experiments on R2R-CE, RxR-CE, and R2R-PE benchmarks achieve new SOTA results, and real-robot tests across diverse environments demonstrate LatentPilot’s superior understanding of environment-action dynamics in scene. Project page:https://abdd.top/latentpilot/
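A toy sketch of latent tokens that serve as both current output and next input. The recurrence below (a tanh update with invented weights and dimensions) only mimics the information flow; the actual model attends over these tokens inside a transformer:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W_obs = rng.normal(scale=0.1, size=(d, d))  # illustrative observation weights
W_lat = rng.normal(scale=0.1, size=(d, d))  # illustrative latent weights

def step(obs, latent):
    # the latent tokens are both the current output and the next input:
    # they carry imagined ("dream ahead") state across timesteps
    new_latent = np.tanh(W_obs @ obs + W_lat @ latent)
    action = int(np.argmax(new_latent[:4]))  # pick one of 4 toy nav actions
    return action, new_latent

latent = np.zeros(d)
actions = []
for t in range(5):                  # roll out a 5-step trajectory
    obs = rng.normal(size=d)        # stand-in for visual observation features
    action, latent = step(obs, latent)
    actions.append(action)
```

The key point is that `latent` is threaded through every step, so each decision can depend on accumulated imagined dynamics, not just the current frame.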
[118] CT-to-X-ray Distillation Under Tiny Paired Cohorts: An Evidence-Bounded Reproducible Pilot Study
Bo Ma, Jinsong Wu, Weiqi Yan, Hongjiang Wei
Main category: cs.CV
TL;DR: Cross-modality distillation from CT to X-ray for thoracic disease classification shows limited effectiveness, with simpler methods often outperforming complex architectures in deployment-oriented scenarios.
Details
Motivation: To investigate whether CT scans can serve as training-only supervision for X-ray disease classifiers without requiring CT at inference time, addressing the complementary nature of thoracic imaging modalities in clinical practice.
Method: Uses cross-modality teacher-student distillation with JDCNet as a pilot scaffold, comparing plain logit-KD against attention transfer and feature hints. Conducts patient-level Monte Carlo resampling with same-case comparisons and imbalance-sensitive analyses.
Result: Simple methods outperformed complex architectures: late fusion achieved highest accuracy (0.885), same-modality distillation highest macro-F1 (0.554), while cross-modal distillation dropped to 0.500 balanced accuracy. No robust cross-modality advantage was found.
Conclusion: The study provides a reproducible pilot protocol rather than a validated architecture, highlighting task definition challenges, failure modes, ranking instability, and minimum requirements for credible CT-to-X-ray transfer claims.
Abstract: Chest X-ray and computed tomography (CT) provide complementary views of thoracic disease, yet most computer-aided diagnosis models are trained and deployed within a single imaging modality. The concrete question studied here is narrower and deployment-oriented: on a patient-level paired chest cohort, can CT act as training-only supervision for a binary disease versus non-disease X-ray classifier without requiring CT at inference time? We study this setting as a cross-modality teacher–student distillation problem and use JDCNet as an executable pilot scaffold rather than as a validated superior architecture. On the original patient-level paired split from a public paired chest imaging cohort, a stripped-down plain cross-modal logit-KD control attains the highest mean result on the four-image validation subset (0.875 accuracy and 0.714 macro-F1), whereas the full module-augmented JDCNet variant remains at 0.750 accuracy and 0.429 macro-F1. To test whether that ranking is a split artifact, we additionally run eight patient-level Monte Carlo resamples with same-case comparisons, stronger mechanism controls based on attention transfer and feature hints, and imbalance-sensitive analyses. Under this resampled protocol, late fusion attains the highest mean accuracy (0.885), same-modality distillation attains the highest mean macro-F1 (0.554) and balanced accuracy (0.660), the plain cross-modal control drops to 0.500 mean balanced accuracy, and neither attention transfer nor feature hints recover a robust cross-modality advantage. The contribution of this study is therefore not a validated CT-to-X-ray architecture, but a reproducible and evidence-bounded pilot protocol that makes the exact task definition, failure modes, ranking instability, and the minimum requirements for future credible CT-to-X-ray transfer claims explicit.
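The plain cross-modal logit-KD control is standard logit distillation; a minimal version of the loss (temperature and logits invented for illustration):

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())          # numerically stable softmax
    return e / e.sum()

def kd_loss(student_logits, teacher_logits, T=2.0):
    # plain logit distillation: KL(teacher || student) at temperature T,
    # scaled by T^2 as in standard knowledge distillation
    p = softmax(teacher_logits, T)   # soft teacher targets
    q = softmax(student_logits, T)
    return float(np.sum(p * (np.log(p) - np.log(q))) * T * T)

teacher = [2.0, 0.5, -1.0]           # e.g. a CT-trained teacher's logits
aligned = kd_loss([2.0, 0.5, -1.0], teacher)  # student matches teacher
off = kd_loss([0.0, 0.0, 0.0], teacher)       # uninformed student
```

The loss is zero exactly when the student reproduces the teacher's soft distribution, which is all the "stripped-down control" in the study optimizes across modalities.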
[119] Segmentation of Gray Matters and White Matters from Brain MRI data
Chang Sun, Rui Shi, Tsukasa Koike, Tetsuro Sekine, Akio Morita, Tetsuya Sakai
Main category: cs.CV
TL;DR: Modified MedSAM foundation model adapted for multi-class brain tissue segmentation (gray matter, white matter) using FSL preprocessing and minimal architectural changes.
Details
Motivation: Accurate brain tissue segmentation from MRI is essential for neurological studies and diagnosis, but traditional methods like FSL FAST require task-specific adjustments and struggle with diverse imaging conditions. Foundation models like MedSAM offer promising prompt-based approaches that could be adapted for medical imaging tasks.
Method: Proposes a modified MedSAM model for multi-class brain tissue segmentation. Uses preprocessing pipeline with FSL BET for skull stripping and FSL FAST for tissue probability maps, converted to 2D axial/sagittal/coronal slices with multi-class labels. Extends MedSAM’s mask decoder to three classes (background, gray matter, white matter), freezes pre-trained image encoder, and fine-tunes prompt encoder and decoder.
Result: Experiments on IXI dataset achieve Dice scores up to 0.8751, demonstrating that foundation models like MedSAM can be effectively adapted for multi-class medical image segmentation with minimal architectural modifications.
Conclusion: Foundation models like MedSAM can be successfully adapted for multi-class medical image segmentation tasks with minimal changes, suggesting potential for extension to more diverse medical imaging scenarios in future work.
Abstract: Accurate segmentation of brain tissues such as gray matter and white matter from magnetic resonance imaging is essential for studying brain anatomy, diagnosing neurological disorders, and monitoring disease progression. Traditional methods, such as FSL FAST, produce tissue probability maps but often require task-specific adjustments and face challenges with diverse imaging conditions. Recent foundation models, such as MedSAM, offer a prompt-based approach that leverages large-scale pretraining. In this paper, we propose a modified MedSAM model designed for multi-class brain tissue segmentation. Our preprocessing pipeline includes skull stripping with FSL BET, tissue probability mapping with FSL FAST, and converting these into 2D axial, sagittal, coronal slices with multi-class labels (background, gray matter, and white matter). We extend MedSAM’s mask decoder to three classes, freezing the pre-trained image encoder and fine-tuning the prompt encoder and decoder. Experiments on the IXI dataset achieve Dice scores up to 0.8751. This work demonstrates that foundation models like MedSAM can be adapted for multi-class medical image segmentation with minimal architectural modifications. Our findings suggest that such models can be extended to more diverse medical imaging scenarios in future work.
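The reported Dice metric is straightforward to compute per class; a minimal sketch on toy labels (0 = background, 1 = gray matter, 2 = white matter; the arrays are invented 1-D stand-ins for segmentation slices):

```python
import numpy as np

def dice_per_class(pred, gt, num_classes=3):
    # Dice = 2 |A ∩ B| / (|A| + |B|) for each tissue class
    scores = []
    for c in range(num_classes):
        p, g = (pred == c), (gt == c)
        denom = p.sum() + g.sum()
        scores.append(2.0 * (p & g).sum() / denom if denom else 1.0)
    return scores

pred = np.array([0, 1, 1, 2, 2, 0])
gt   = np.array([0, 1, 2, 2, 2, 0])
scores = dice_per_class(pred, gt)
```

Here background is perfectly segmented, gray matter has one false positive, and white matter one false negative, giving per-class Dice of 1.0, 2/3, and 0.8.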
[120] Hierarchical Visual Relocalization with Nearest View Synthesis from Feature Gaussian Splatting
Huaqi Tao, Bingxi Liu, Guangcheng Chen, Fulin Tang, Li He, Hong Zhang
Main category: cs.CV
TL;DR: SplatHLoc is a hierarchical visual relocalization framework using Feature Gaussian Splatting as scene representation, with adaptive viewpoint retrieval and hybrid feature matching for improved pose estimation.
Details
Motivation: Point-based hierarchical relocalization methods suffer from sparse image observations and weak feature matching, limiting their effectiveness in camera pose estimation for revisited scenes.
Method: Uses Feature Gaussian Splatting as scene representation, adaptive viewpoint retrieval to synthesize virtual candidates aligned with query viewpoints, and hybrid feature matching combining Gaussian-rendered features (coarse stage) and image-extracted features (fine stage).
Result: Extensive experiments on indoor and outdoor datasets show enhanced robustness and state-of-the-art performance in visual relocalization.
Conclusion: SplatHLoc improves visual relocalization through novel scene representation and hybrid matching strategy, advancing hierarchical relocalization methods.
Abstract: Visual relocalization is a fundamental task in the field of 3D computer vision, estimating a camera’s pose when it revisits a previously known scene. While point-based hierarchical relocalization methods have shown strong scalability and efficiency, they are often limited by sparse image observations and weak feature matching. In this work, we propose SplatHLoc, a novel hierarchical visual relocalization framework that uses Feature Gaussian Splatting as the scene representation. To address the sparsity of database images, we propose an adaptive viewpoint retrieval method that synthesizes virtual candidates with viewpoints more closely aligned with the query, thereby improving the accuracy of initial pose estimation. For feature matching, we observe that Gaussian-rendered features and those extracted directly from images exhibit different strengths across the two-stage matching process: the former performs better in the coarse stage, while the latter proves more effective in the fine stage. Therefore, we introduce a hybrid feature matching strategy, enabling more accurate and efficient pose estimation. Extensive experiments on both indoor and outdoor datasets show that SplatHLoc enhances the robustness of visual relocalization, setting a new state-of-the-art.
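The two-stage hybrid matching can be sketched abstractly: retrieve the best view with one descriptor type, then match keypoints with another. The synthetic one-hot-style descriptors below merely stand in for Gaussian-rendered features (coarse stage) and image-extracted features (fine stage):

```python
import numpy as np

rng = np.random.default_rng(0)
# coarse stage: retrieve the closest (virtual) view by global descriptor
db_global = np.eye(5, 16)                       # one descriptor per view
query_global = db_global[3] + 0.01 * rng.normal(size=16)
best_view = int(np.argmax(db_global @ query_global))

# fine stage: match query keypoints against the retrieved view's local
# descriptors (a different feature type than the coarse stage)
db_local = np.eye(20, 32)                       # local descriptors of that view
query_local = db_local[[2, 7, 11]] + 0.01 * rng.normal(size=(3, 32))
matches = np.argmax(query_local @ db_local.T, axis=1)
```

The point of the hybrid design is that the two stages are free to use whichever feature type works best for them, here modeled simply as two separate descriptor banks.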
[121] SLVMEval: Synthetic Meta Evaluation Benchmark for Text-to-Long Video Generation
Ryosuke Matsuda, Keito Kudo, Haruto Yoshida, Nobuyuki Shimizu, Jun Suzuki
Main category: cs.CV
TL;DR: SLVMEval is a benchmark for meta-evaluating text-to-video evaluation systems using synthetic long videos up to 3 hours, testing their ability to accurately assess video quality in human-perceptible degradation scenarios.
Details
Motivation: Existing text-to-video evaluation systems need proper meta-evaluation, especially for long videos. The paper addresses the lack of benchmarks for assessing whether these systems can accurately evaluate video quality in scenarios that are easy for humans to judge.
Method: Creates SLVMEval benchmark using synthetic degradation of source videos from dense video-captioning datasets to produce controlled “high-quality vs low-quality” pairs across 10 aspects. Uses crowdsourcing to filter pairs where degradation is clearly perceptible, then employs pairwise comparison-based meta-evaluation framework to assess existing evaluation systems.
Result: Human evaluators achieve 84.7%-96.8% accuracy in identifying better long videos. In 9 out of 10 aspects, existing evaluation systems’ accuracy falls short of human assessment, revealing weaknesses in text-to-long-video evaluation.
Conclusion: Current text-to-video evaluation systems have significant limitations in assessing long videos, with most aspects showing performance gaps compared to human judgment. The SLVMEval benchmark effectively exposes these weaknesses and provides a foundation for improving evaluation methodologies.
Abstract: This paper proposes the synthetic long-video meta-evaluation (SLVMEval), a benchmark for meta-evaluating text-to-video (T2V) evaluation systems. The proposed SLVMEval benchmark focuses on assessing these systems on videos of up to 10,486 s (approximately 3 h). The benchmark targets a fundamental requirement, namely, whether the systems can accurately assess video quality in settings that are easy for humans to assess. We adopt a pairwise comparison-based meta-evaluation framework. Building on dense video-captioning datasets, we synthetically degrade source videos to create controlled “high-quality versus low-quality” pairs across 10 distinct aspects. Then, we employ crowdsourcing to filter and retain only those pairs in which the degradation is clearly perceptible, thereby establishing an effective final testbed. Using this testbed, we assess the reliability of existing evaluation systems in ranking these pairs. Experimental results demonstrate that human evaluators can identify the better long video with 84.7%-96.8% accuracy, and in nine of the 10 aspects, the accuracy of these systems falls short of human assessment, revealing weaknesses in text-to-long-video evaluation.
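The pairwise meta-evaluation protocol can be sketched in a few lines (a toy illustration, not the SLVMEval code; the `score` function stands in for any T2V evaluation system, and the quality field `q` is a hypothetical placeholder):

```python
# Pairwise meta-evaluation: a system is "correct" on a pair when it scores
# the high-quality video above its degraded counterpart.

def pairwise_accuracy(pairs, score):
    """pairs: list of (high_quality, low_quality) items; score: item -> float."""
    correct = sum(1 for hi, lo in pairs if score(hi) > score(lo))
    return correct / len(pairs)

# Hypothetical example: items are dicts with a quality field the
# evaluator only partially recovers.
pairs = [({"q": 0.9}, {"q": 0.4}), ({"q": 0.8}, {"q": 0.7}), ({"q": 0.6}, {"q": 0.65})]
noisy_evaluator = lambda v: v["q"]
print(pairwise_accuracy(pairs, noisy_evaluator))  # 2 of 3 pairs ranked correctly
```

Under this protocol, a system's accuracy can be compared directly against the 84.7%-96.8% range that human evaluators reach.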
[122] 3D Architect: An Automated Approach to Three-Dimensional Modeling
Sunil Tiwari, Payal Fofadiya, Vicky Vishwakarma
Main category: cs.CV
TL;DR: 3D object reconstruction from orthographic views using Harris corner detection, envelope intersection, and computational geometry
Details
Motivation: To automatically reconstruct 3D objects from 2D orthographic views (like engineering drawings) without manual intervention.
Method: 1. Apply Harris corner detector on input orthographic views to get control points. 2. Project control points perpendicular to respective views to construct 3D envelopes. 3. Obtain 3D points from intersection of perpendicular envelopes. 4. Use computational geometry to regenerate surfaces from point set. 5. Render final 3D object using OpenGL.
Result: Successfully reconstructs 3D objects from orthographic views through automated corner detection and geometric reconstruction
Conclusion: The method provides an automated pipeline for 3D reconstruction from engineering drawings using computer vision and computational geometry techniques
Abstract: The aim of our paper is to render an object in 3-dimension using a set of its orthographic views. A corner detector (Harris detector) is applied to the input views to obtain control points. These control points are projected perpendicular to their respective views in order to construct an envelope. A set of points describing the object in 3-dimension is obtained from the intersection of these mutually perpendicular envelopes. This set of points is used to regenerate the surfaces of the object using computational geometry. At the end, the object in 3-dimension is rendered using OpenGL.
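The envelope-intersection step admits a deliberately simplified sketch (assumptions: corner points have already been extracted, only front and top views are used, and envelopes are matched by x-coordinate alone):

```python
# A front-view corner (x, y) sweeps a line along z; a top-view corner (x, z)
# sweeps a line along y; the two perpendicular envelopes intersect wherever
# their x coordinates agree, giving a candidate 3D point (x, y, z).

def intersect_views(front_pts, top_pts, tol=1e-6):
    """front_pts: (x, y) corners; top_pts: (x, z) corners -> candidate (x, y, z)."""
    points3d = []
    for fx, fy in front_pts:
        for tx, tz in top_pts:
            if abs(fx - tx) <= tol:  # the perpendicular envelopes meet
                points3d.append((fx, fy, tz))
    return points3d

# A unit cube seen from the front and from the top: 4 corners per view.
front = [(0, 0), (1, 0), (0, 1), (1, 1)]
top = [(0, 0), (1, 0), (0, 1), (1, 1)]
print(len(intersect_views(front, top)))  # 8 cube vertices recovered
```

A third (side) view would normally be intersected as well to prune spurious candidates; two views suffice here only because the cube is axis-aligned.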
[123] Developing Adaptive Context Compression Techniques for Large Language Models (LLMs) in Long-Running Interactions
Payal Fofadiya, Sunil Tiwari
Main category: cs.CV
TL;DR: Adaptive context compression framework for LLMs that integrates importance-aware memory selection, coherence-sensitive filtering, and dynamic budget allocation to maintain conversational quality while controlling context growth and reducing computational overhead.
Details
Motivation: LLMs suffer performance degradation in long-running interactions due to increasing context length, memory saturation, and computational overhead. There's a need to balance long-term memory preservation with computational efficiency.
Method: Proposes an adaptive context compression framework with three components: 1) importance-aware memory selection to identify essential information, 2) coherence-sensitive filtering to maintain conversational flow, and 3) dynamic budget allocation to control context growth.
Result: Evaluated on LOCOMO, LOCCO, and LongBench benchmarks, showing consistent improvements in conversational stability and retrieval performance while reducing token usage and inference latency compared to existing memory and compression-based approaches.
Conclusion: Adaptive context compression provides an effective balance between long-term memory preservation and computational efficiency in persistent LLM interactions, addressing key challenges in long-running conversations.
Abstract: Large Language Models (LLMs) often experience performance degradation during long-running interactions due to increasing context length, memory saturation, and computational overhead. This paper presents an adaptive context compression framework that integrates importance-aware memory selection, coherence-sensitive filtering, and dynamic budget allocation to retain essential conversational information while controlling context growth. The approach is evaluated on LOCOMO, LOCCO, and LongBench benchmarks to assess answer quality, retrieval accuracy, coherence preservation, and efficiency. Experimental results demonstrate that the proposed method achieves consistent improvements in conversational stability and retrieval performance while reducing token usage and inference latency compared with existing memory and compression-based approaches. These findings indicate that adaptive context compression provides an effective balance between long-term memory preservation and computational efficiency in persistent LLM interactions.
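The budget-allocation component can be pictured with a minimal sketch (the importance scores and token counts below are hypothetical, and this greedy keep-the-most-important rule only approximates the paper's learned, coherence-sensitive selection):

```python
# Importance-aware selection under a token budget: keep the
# highest-importance turns that fit, then restore conversational order.

def compress_context(turns, budget):
    """turns: list of (importance, n_tokens, text); budget: max total tokens."""
    ranked = sorted(range(len(turns)), key=lambda i: -turns[i][0])
    kept, used = set(), 0
    for i in ranked:
        if used + turns[i][1] <= budget:
            kept.add(i)
            used += turns[i][1]
    return [turns[i][2] for i in sorted(kept)]  # preserve original order

turns = [(0.9, 40, "user goal"), (0.2, 60, "small talk"),
         (0.8, 50, "key constraint"), (0.5, 30, "clarification")]
print(compress_context(turns, budget=100))  # ['user goal', 'key constraint']
```

Making the budget itself adaptive (e.g., growing it when coherence drops) is the part the paper's dynamic allocation addresses; it is omitted here for brevity.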
[124] Multi-Layered Memory Architectures for LLM Agents: An Experimental Evaluation of Long-Term Context Retention
Sunil Tiwari, Payal Fofadiya
Main category: cs.CV
TL;DR: Multi-Layer Memory Framework for long-horizon dialogue systems addresses semantic drift and memory retention issues through layered memory decomposition with adaptive retrieval gating and retention regularization.
Details
Motivation: Long-horizon dialogue systems suffer from semantic drift and unstable memory retention across extended sessions, leading to degraded performance and inefficient context usage.
Method: Proposes a Multi-Layer Memory Framework that decomposes dialogue history into working, episodic, and semantic layers with adaptive retrieval gating and retention regularization to control cross-session drift while maintaining bounded context growth.
Result: Achieved 46.85 Success Rate, 0.618 overall F1 with 0.594 multi-hop F1, 56.90% six-period retention while reducing false memory rate to 5.1% and context usage to 58.40% on LOCOMO, LOCCO, and LoCoMo datasets.
Conclusion: The framework enhances long-term retention and reasoning stability under constrained context budgets, effectively addressing semantic drift and memory management challenges in extended dialogue sessions.
Abstract: Long-horizon dialogue systems suffer from semantic drift and unstable memory retention across extended sessions. This paper presents a Multi-Layer Memory Framework that decomposes dialogue history into working, episodic, and semantic layers with adaptive retrieval gating and retention regularization. The architecture controls cross-session drift while maintaining bounded context growth and computational efficiency. Experiments on LOCOMO, LOCCO, and LoCoMo show improved performance, achieving a 46.85 Success Rate, 0.618 overall F1 with 0.594 multi-hop F1, and 56.90% six-period retention while reducing the false memory rate to 5.1% and context usage to 58.40%. Results confirm enhanced long-term retention and reasoning stability under constrained context budgets.
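The layered decomposition can be illustrated with a toy class (an assumption-laden sketch: the paper's retrieval gating is learned, whereas here it is simple keyword overlap, and the semantic layer is left unused):

```python
# A toy three-layer memory: bounded working memory, episodic overflow,
# and a semantic store for distilled facts.

class LayeredMemory:
    def __init__(self, working_size=3):
        self.working = []      # most recent turns, bounded
        self.episodic = []     # turns evicted from the working layer
        self.semantic = set()  # distilled facts, e.g. ("user", "likes", "jazz")
        self.working_size = working_size

    def write(self, turn):
        self.working.append(turn)
        if len(self.working) > self.working_size:
            self.episodic.append(self.working.pop(0))  # evict oldest

    def read(self, query_words):
        # Retrieval gating: only episodic turns sharing words with the
        # query are recalled, which keeps context growth bounded.
        recalled = [t for t in self.episodic if set(t.split()) & set(query_words)]
        return self.working + recalled

mem = LayeredMemory()
for t in ["likes jazz", "asked about weather", "booked flight", "prefers aisle seat"]:
    mem.write(t)
print(mem.read({"jazz"}))  # working memory plus the gated episodic recall
```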
[125] LightHarmony3D: Harmonizing Illumination and Shadows for Object Insertion in 3D Gaussian Splatting
Tianyu Huang, Zhenyang Ren, Zhenchen Wan, Jiyang Zheng, Wenjie Wang, Runnan Chen, Mingming Gong, Tongliang Liu
Main category: cs.CV
TL;DR: LightHarmony3D enables physically consistent lighting and shadows for mesh insertion in 3D Gaussian Splatting scenes using a generative module that predicts 360° HDR environment maps.
Details
Motivation: Inserting external mesh objects into 3DGS scenes enables interactive editing for AR/VR and digital content creation, but achieving physically consistent lighting and shadows remains challenging due to illumination estimation and multi-view consistency requirements.
Method: Proposes a generative module that predicts full 360° HDR environment maps at insertion locations via single forward pass, leveraging generative priors instead of iterative optimization to capture dominant scene illumination for physically grounded shading and shadows.
Result: Extensive experiments across multiple real-world reconstruction datasets demonstrate state-of-the-art realism and multi-view consistency. Also introduces first dedicated benchmark for mesh insertion in 3DGS for standardized evaluation.
Conclusion: LightHarmony3D effectively addresses illumination consistency challenges for mesh insertion in 3DGS scenes, enabling physically grounded rendering while maintaining multi-view coherence through efficient generative environment map prediction.
Abstract: 3D Gaussian Splatting (3DGS) enables high-fidelity reconstruction of scene geometry and appearance. Building on this capability, inserting external mesh objects into reconstructed 3DGS scenes enables interactive editing and content augmentation for immersive applications such as AR/VR, virtual staging, and digital content creation. However, achieving physically consistent lighting and shadows for mesh insertion remains challenging, as it requires accurate scene illumination estimation and multi-view consistent rendering. To address this challenge, we present LightHarmony3D, a novel framework for illumination-consistent mesh insertion in 3DGS scenes. Central to our approach is our proposed generative module that predicts a full 360° HDR environment map at the insertion location via a single forward pass. By leveraging generative priors instead of iterative optimization, our method efficiently captures dominant scene illumination and enables physically grounded shading and shadows for inserted meshes while maintaining multi-view coherence. Furthermore, we introduce the first dedicated benchmark for mesh insertion in 3DGS, providing a standardized evaluation framework for assessing lighting consistency and photorealism. Extensive experiments across multiple real-world reconstruction datasets demonstrate that LightHarmony3D achieves state-of-the-art realism and multi-view consistency.
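Why the predicted environment map matters can be seen in a much-simplified picture (a hypothetical single Lambertian shading step, not the paper's renderer; the light directions and radiances are made up):

```python
# With the environment map reduced to a few directional HDR samples, the
# inserted mesh's diffuse colour at a surface normal is the cosine-weighted
# sum of the incoming radiance.

def lambertian_shade(normal, lights, albedo=0.8):
    """lights: list of (direction, radiance); all vectors unit-length 3-tuples."""
    total = 0.0
    for direction, radiance in lights:
        n_dot_l = sum(n * d for n, d in zip(normal, direction))
        total += max(0.0, n_dot_l) * radiance  # back-facing lights contribute 0
    return albedo * total

# Two samples: bright sky above, dim bounce light below.
lights = [((0.0, 1.0, 0.0), 2.0), ((0.0, -1.0, 0.0), 0.3)]
print(lambertian_shade((0.0, 1.0, 0.0), lights))  # upward-facing point: sky only
```

An inaccurate environment map shifts every such shading term at once, which is why estimating it well at the insertion location dominates the realism of the composite.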
[126] CCDNet: Learning to Detect Camouflage against Distractors in Infrared Small Target Detection
Zikai Liao, Zhaozheng Yin
Main category: cs.CV
TL;DR: CCDNet is a novel infrared target detection network that addresses camouflage and distractor challenges through weighted multi-branch perceptrons, aggregation-refinement fusion, and contrastive distractor discrimination.
Details
Motivation: Infrared target detection is challenging due to low contrast, camouflage effects where targets blend into complex backgrounds, and distractors causing false alarms. Current methods struggle with these issues in critical applications like wilderness rescue and maritime search.
Method: Proposes CCDNet with three key components: 1) Backbone with Weighted Multi-branch Perceptrons (WMPs) for multi-level feature aggregation, 2) Aggregation-and-Refinement Fusion Neck (ARFN) to refine structure/semantics and reconstruct target-background relations, 3) Contrastive-aided Distractor Discriminator (CaDD) for adaptive similarity computation to distinguish real targets from distractors.
Result: Extensive experiments on infrared image datasets show CCDNet outperforms state-of-the-art methods, achieving improved detection accuracy and reduced false alarm rates.
Conclusion: CCDNet effectively addresses infrared target detection challenges by handling camouflage through feature refinement and reducing false alarms via distractor discrimination, demonstrating superior performance over existing approaches.
Abstract: Infrared small target detection (IRSTD) tasks have critical applications in areas like wilderness rescue and maritime search. However, detecting infrared targets is challenging due to their low contrast and tendency to blend into complex backgrounds, effectively camouflaging themselves. Additionally, other objects with similar features (distractors) can cause false alarms, further degrading detection performance. To address these issues, we propose a novel Camouflage-aware Counter-Distraction Network (CCDNet) in this paper. We design a backbone with Weighted Multi-branch Perceptrons (WMPs), which aggregates self-conditioned multi-level features to accurately represent the target and background. Based on these rich features, we then propose a novel Aggregation-and-Refinement Fusion Neck (ARFN) to refine structures/semantics from shallow/deep feature maps and bidirectionally reconstruct the relations between the targets and the backgrounds, highlighting the targets while suppressing the complex backgrounds to improve detection accuracy. Furthermore, we present a new Contrastive-aided Distractor Discriminator (CaDD), enforcing adaptive similarity computation both locally and globally between the real targets and the backgrounds to more precisely discriminate distractors, so as to reduce the false alarm rate. Extensive experiments on infrared image datasets confirm that CCDNet outperforms other state-of-the-art methods.
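The core idea behind distractor discrimination can be reduced to a similarity test (a simplification: the real CaDD computes adaptive similarities on learned feature maps, while here features are plain vectors and the prototypes are invented):

```python
import math

# A candidate detection is kept as a real target only if its feature is
# closer to the target prototype than to the background/distractor prototype.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def is_real_target(candidate, target_proto, distractor_proto):
    return cosine(candidate, target_proto) > cosine(candidate, distractor_proto)

target_proto = [1.0, 0.1, 0.0]
distractor_proto = [0.1, 1.0, 0.2]
print(is_real_target([0.9, 0.2, 0.1], target_proto, distractor_proto))  # True
print(is_real_target([0.2, 0.8, 0.3], target_proto, distractor_proto))  # False
```

Rejecting the second candidate is exactly the false-alarm reduction the discriminator is after.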
[127] M2H-MX: Multi-Task Dense Visual Perception for Real-Time Monocular Spatial Understanding
U. V. B. L. Udugama, George Vosselman, Francesco Nex
Main category: cs.CV
TL;DR: M2H-MX is a real-time multi-task perception model for monocular spatial understanding that combines depth and semantic prediction in a lightweight decoder for robotic mapping applications.
Details
Motivation: Monocular cameras are attractive for robotics due to low cost and ease of deployment, but achieving reliable real-time spatial understanding from single images remains challenging. While multi-task dense prediction models have improved, integrating them into stable monocular mapping systems is non-trivial.
Method: M2H-MX preserves multi-scale feature representations while introducing register-gated global context and controlled cross-task interaction in a lightweight decoder. This enables depth and semantic predictions to reinforce each other under strict latency constraints. The outputs integrate directly into monocular SLAM via a compact perception-to-mapping interface.
Result: On NYUDv2, M2H-MX-L achieves state-of-the-art results: semantic mIoU improves by 6.6% and depth RMSE reduces by 9.4% over multi-task baselines. In real-time monocular mapping on ScanNet, it reduces average trajectory error by 60.7% compared to strong monocular SLAM baselines while producing cleaner metric-semantic maps.
Conclusion: Modern multi-task dense prediction can be reliably deployed for real-time monocular spatial perception in robotic systems, demonstrating that accurate depth and semantic estimation can be effectively combined for practical robotic applications.
Abstract: Monocular cameras are attractive for robotic perception due to their low cost and ease of deployment, yet achieving reliable real-time spatial understanding from a single image stream remains challenging. While recent multi-task dense prediction models have improved per-pixel depth and semantic estimation, translating these advances into stable monocular mapping systems is still non-trivial. This paper presents M2H-MX, a real-time multi-task perception model for monocular spatial understanding. The model preserves multi-scale feature representations while introducing register-gated global context and controlled cross-task interaction in a lightweight decoder, enabling depth and semantic predictions to reinforce each other under strict latency constraints. Its outputs integrate directly into an unmodified monocular SLAM pipeline through a compact perception-to-mapping interface. We evaluate both dense prediction accuracy and in-the-loop system performance. On NYUDv2, M2H-MX-L achieves state-of-the-art results, improving semantic mIoU by 6.6% and reducing depth RMSE by 9.4% over representative multi-task baselines. When deployed in a real-time monocular mapping system on ScanNet, M2H-MX reduces average trajectory error by 60.7% compared to a strong monocular SLAM baseline while producing cleaner metric-semantic maps. These results demonstrate that modern multi-task dense prediction can be reliably deployed for real-time monocular spatial perception in robotic systems.
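The "one decoder, two tasks" setup rests on a combined training signal, which a toy computation makes concrete (an illustration of generic multi-task training, not the paper's losses or weights; the per-pixel values are invented):

```python
import math

# Depth regression (RMSE) and semantic classification (cross-entropy) losses
# are weighted and summed, so one backward pass updates the shared decoder.

def depth_rmse(pred, gt):
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt))

def cross_entropy(probs, labels):
    return -sum(math.log(p[l]) for p, l in zip(probs, labels)) / len(labels)

def multitask_loss(depth_pred, depth_gt, sem_probs, sem_labels,
                   w_depth=1.0, w_sem=1.0):
    return (w_depth * depth_rmse(depth_pred, depth_gt)
            + w_sem * cross_entropy(sem_probs, sem_labels))

# Two pixels: predicted vs ground-truth depth, and class probabilities vs labels.
loss = multitask_loss([2.1, 3.0], [2.0, 3.2], [[0.7, 0.3], [0.2, 0.8]], [0, 1])
print(round(loss, 4))
```

The register-gated context and cross-task interaction in M2H-MX shape *how* the two heads share features; the weighting above is only the simplest way their gradients are combined.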
[128] Diffusion Mental Averages
Phonphrm Thawatdamrongkit, Sukit Seripanitkarn, Supasorn Suwajanakorn
Main category: cs.CV
TL;DR: Diffusion Mental Averages (DMA) is a model-centric method that produces sharp, realistic concept averages by aligning denoising trajectories in diffusion model semantic space, outperforming blurry data-centric averaging approaches.
Details
Motivation: Existing methods for averaging concepts in diffusion models produce blurry results because they average image collections externally (data-centric) rather than working within the model's generative process. The authors aim to create sharp, realistic concept averages that serve as visual summaries and reveal model biases.
Method: DMA averages within the diffusion model’s semantic space by optimizing multiple noise latents so their denoising trajectories converge toward shared semantics across timesteps. For multimodal concepts, samples are clustered in CLIP space, then bridged into diffusion space using Textual Inversion or LoRA.
Result: DMA produces consistent, realistic averages even for abstract concepts, serving as effective visual summaries and revealing model biases in concept representation. The method outperforms prior data-centric averaging approaches that yield blurry results.
Conclusion: DMA provides a model-centric solution for generating sharp concept averages by aligning denoising trajectories within diffusion semantic space, offering insights into model biases and concept representation while producing realistic prototypes.
Abstract: Can a diffusion model produce its own “mental average” of a concept, one that is as sharp and realistic as a typical sample? We introduce Diffusion Mental Averages (DMA), a model-centric answer to this question. While prior methods aim to average image collections, they produce blurry results when applied to diffusion samples from the same prompt. These data-centric techniques operate outside the model, ignoring the generative process. In contrast, DMA averages within the diffusion model’s semantic space, as discovered by recent studies. Since this space evolves across timesteps and lacks a direct decoder, we cast averaging as trajectory alignment: optimize multiple noise latents so their denoising trajectories progressively converge toward shared coarse-to-fine semantics, yielding a single sharp prototype. We extend our approach to multimodal concepts (e.g., dogs with many breeds) by clustering samples in semantically rich spaces such as CLIP and applying Textual Inversion or LoRA to bridge CLIP clusters into diffusion space. This is, to our knowledge, the first approach that delivers consistent, realistic averages, even for abstract concepts, serving as a concrete visual summary and a lens into model biases and concept representation.
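The trajectory-alignment intuition can be shown with a drastically simplified toy (a strong simplification: DMA aligns denoising trajectories in a diffusion model's semantic space, whereas here "latents" are one-dimensional numbers nudged toward their running mean):

```python
# Each "trajectory" is repeatedly pulled a fraction of the way toward the
# shared mean, so all latents converge to a single prototype value while the
# mean itself is preserved.

def align_trajectories(latents, steps=50, rate=0.2):
    latents = list(latents)
    for _ in range(steps):
        mean = sum(latents) / len(latents)
        latents = [x + rate * (mean - x) for x in latents]
    return latents

out = align_trajectories([0.0, 1.0, 5.0])
print(out)  # all three values converge near the shared average of 2.0
```

In DMA the pull happens in a timestep-dependent semantic space and only toward *coarse-to-fine shared* semantics, which is what keeps the prototype sharp rather than blurred.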
[129] Monocular Building Height Estimation from PhiSat-2 Imagery: Dataset and Method
Yanjiao Song, Bowen Cai, Timo Balz, Zhenfeng Shao, Neema Simon Sumari, James Magidi, Walter Musakwa
Main category: cs.CV
TL;DR: A novel two-stream neural network (TSONet) for monocular building height estimation from PhiSat-2 satellite imagery, with a new dataset (PHDataset) showing improved performance through joint footprint segmentation and ordinal height regression.
Details
Motivation: Monocular building height estimation from optical imagery is challenging due to ambiguous height cues, inter-city variations, and long-tailed height distributions. PhiSat-2 satellite data offers global coverage and multispectral observations but hasn't been systematically evaluated for this task.
Method: Proposes TSONet (Two-Stream Ordinal Network) that jointly models footprint segmentation and height estimation. Includes Cross-Stream Exchange Module (CSEM) for footprint-aware feature interaction and Feature-Enhanced Bin Refinement (FEBR) for ordinal height refinement. Uses a new PhiSat-2-Height dataset (PHDataset) with 9,475 image-label pairs from 26 cities.
Result: TSONet achieves best overall performance on PHDataset, reducing MAE and RMSE by 13.2% and 9.7%, and improving IoU and F1-score by 14.0% and 10.1% over strongest competing methods. Ablation studies confirm effectiveness of CSEM, FEBR, and joint ordinal regression with footprint assistance.
Conclusion: The study confirms PhiSat-2’s potential for monocular building height estimation and provides a dedicated dataset and effective method for future research, showing benefits from balanced spatial detail and multispectral observations.
Abstract: Monocular building height estimation from optical imagery is important for urban morphology characterization but remains challenging due to ambiguous height cues, large inter-city variations in building morphology, and the long-tailed distribution of building heights. PhiSat-2 is a promising open-access data source for this task because of its global coverage, 4.75 m spatial resolution, and seven-band spectral observations, yet its potential has not been systematically evaluated. To address this gap, we construct a PhiSat-2-Height dataset (PHDataset) and propose a Two-Stream Ordinal Network (TSONet). PHDataset contains 9,475 co-registered image-label patch pairs from 26 cities worldwide. TSONet jointly models footprint segmentation and height estimation, and introduces a Cross-Stream Exchange Module (CSEM) and a Feature-Enhanced Bin Refinement (FEBR) module for footprint-aware feature interaction and ordinal height refinement. Experiments on PHDataset show that TSONet achieves the best overall performance, reducing MAE and RMSE by 13.2% and 9.7%, and improving IoU and F1-score by 14.0% and 10.1% over the strongest competing results. Ablation studies further verify the effectiveness of CSEM, FEBR, and the joint use of ordinal regression and footprint assistance. Additional analyses indicate that PhiSat-2 benefits monocular building height estimation through its balanced combination of building-relevant spatial detail and multispectral observations. Overall, this study confirms the potential of PhiSat-2 for monocular building height estimation and provides a dedicated dataset and an effective method for future research.
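Ordinal regression with bins, the general technique FEBR refines, can be sketched briefly (an assumption-level illustration: the bin edges and decode rule below are arbitrary, not the paper's):

```python
# Height is encoded as cumulative "taller than edge k" binary targets, which
# respects the ordering of heights; decoding counts the thresholds passed
# and maps back to a bin midpoint. This suits long-tailed height
# distributions better than plain regression.

EDGES = [3, 9, 18, 30, 60]  # hypothetical bin edges in metres

def encode(height):
    return [1 if height > e else 0 for e in EDGES]

def decode(cum_probs, threshold=0.5):
    k = sum(1 for p in cum_probs if p > threshold)
    lo = EDGES[k - 1] if k > 0 else 0
    hi = EDGES[k] if k < len(EDGES) else 2 * lo
    return (lo + hi) / 2

print(encode(12.0))                        # taller than 3 m and 9 m only
print(decode([0.95, 0.8, 0.3, 0.1, 0.0]))  # midpoint of the 9-18 m bin
```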
[130] Scaling the Long Video Understanding of Multimodal Large Language Models via Visual Memory Mechanism
Tao Chen, Kun Zhang, Qiong Wu, Xiao Chen, Chao Chang, Xiaoshuai Sun, Yiyi Zhou, Rongrong Ji
Main category: cs.CV
TL;DR: FlexMem is a training-free approach for long video understanding in MLLMs that mimics human video watching behavior using visual memory mechanisms to handle infinite-length videos.
Details
Motivation: Long video understanding is a major challenge for Multimodal Large Language Models (MLLMs) due to input length limitations. Current methods process all video information at once and have upper limits, preventing effective understanding of long videos.
Method: FlexMem uses a visual memory mechanism with dual-pathway compression for memory transfer/writing, and explores different memory reading strategies for diverse video tasks including streaming. It treats visual KV caches as memory sources and mimics human video watching behavior.
Result: On a single 3090 GPU, FlexMem achieves improvements over existing efficient video methods, processes over 1k frames, and helps base MLLMs achieve comparable or better performance than SOTA models like GPT-4o and Gemini-1.5 Pro on some benchmarks.
Conclusion: FlexMem provides an effective training-free solution for long video understanding in MLLMs, enabling infinite-length video processing through visual memory mechanisms and achieving strong performance with efficient resource usage.
Abstract: Long video understanding is a key challenge that plagues the advancement of Multimodal Large Language Models (MLLMs). In this paper, we study this problem from the perspective of a visual memory mechanism and propose a novel, training-free approach termed Flexible Memory (FlexMem). In principle, FlexMem aims to mimic the human behavior of video watching, i.e., continually watching video content and recalling the most relevant memory fragments to answer the question. In this way, FlexMem can help MLLMs achieve video understanding of infinite length, unlike previous methods that process all video information at once and have an input upper limit. Concretely, FlexMem first considers the visual KV caches as the memory sources and realizes effective memory transfer and writing via a dual-pathway compression design. Afterwards, FlexMem also explores different memory reading strategies for diverse video understanding tasks, including the popular streaming setting. To validate FlexMem, we apply it to two popular video-MLLMs and conduct extensive experiments on five long-video tasks and one streaming-video task. The experimental results show that on a single 3090 GPU, FlexMem achieves clear improvements over existing efficient video understanding methods and can process more than 1k frames, which also helps the base MLLMs achieve comparable or even better performance than SOTA MLLMs, e.g., GPT-4o and Gemini-1.5 Pro, on some benchmarks.
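The "recall the most relevant memory fragments" step is, at heart, a similarity-ranked retrieval, which a short sketch captures (an assumption: FlexMem operates on compressed visual KV caches, while here fragments are plain vectors with text payloads):

```python
import math

# Memory reading: rank stored fragments by cosine similarity to the query
# and return the top-k payloads, so only relevant context is re-read.

def top_k_fragments(memory, query, k=2):
    """memory: list of (vector, payload); returns payloads most similar to query."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)
    ranked = sorted(memory, key=lambda m: -cos(m[0], query))
    return [payload for _, payload in ranked[:k]]

memory = [([1.0, 0.0], "opening scene"), ([0.6, 0.8], "chase sequence"),
          ([0.0, 1.0], "final dialogue")]
print(top_k_fragments(memory, query=[0.1, 1.0]))
```

Because only the retrieved fragments re-enter the context, the video length the memory can cover is unbounded even though each read stays small.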
[131] Omni-NegCLIP: Enhancing CLIP with Front-Layer Contrastive Fine-Tuning for Comprehensive Negation Understanding
Jingqi Xu
Main category: cs.CV
TL;DR: Omni-NegCLIP improves CLIP’s understanding of negation expressions through specialized contrastive objectives for presence-based and absence-based negation, fine-tuning only the front transformer layers of the text encoder.
Details
Motivation: Vision-Language Models like CLIP perform poorly in understanding negation expressions, which are common in natural language but challenging for current models to comprehend accurately.
Method: Proposes two contrastive objectives: presence-based (pulls image embeddings closer to original captions, pushes away from negated captions) and absence-based (aligns image embeddings with both original and negated captions while maintaining semantic distinction). Fine-tunes only the front transformer layers of CLIP text encoder based on observation that these layers have stronger learning ability for negated text.
Result: Improves performance on presence-based negation tasks by up to 52.65% and absence-based negation by up to 12.50% compared to pretrained CLIP, without sacrificing general image-text retrieval capability (even improving it by up to 19.62%).
Conclusion: Omni-NegCLIP demonstrates comprehensive ability to understand multiple types of negation tasks while maintaining or improving general multimodal capabilities, addressing a significant weakness in current VLMs.
Abstract: Vision-Language Models (VLMs) have demonstrated strong capabilities across a wide range of multimodal tasks. However, recent studies have shown that VLMs such as CLIP perform poorly in understanding negation expressions, which are common in natural language. In this work, we propose Omni-NegCLIP, a fine-tuned CLIP model that improves CLIP’s understanding of two types of negation by modifying CLIP’s original InfoNCE contrastive loss: presence-based negation, which negates objects that are actually present in an image, and absence-based negation, which negates objects that may plausibly exist in an image but are in fact absent. Specifically, we design a presence-based contrastive objective that pulls image embeddings closer to their original caption embeddings while pushing them away from the corresponding presence-based negated caption embeddings, and an absence-based contrastive objective that aligns image embeddings with both original and absence-based negated caption embeddings while maintaining a semantic distinction between the two text embeddings. Based on our observation that the front transformer layers of the CLIP text encoder have stronger learning ability for negated text than the later layers, we fine-tune the front transformer layers of the CLIP text encoder at each training step using the combined contrastive objective. Experimental results show that, compared with pretrained CLIP, Omni-NegCLIP improves performance on presence-based negation and absence-based negation tasks by up to 52.65% and 12.50%, respectively, without sacrificing general capability in image-text retrieval and even improving it by up to 19.62%. Compared with prior works, Omni-NegCLIP demonstrates a more comprehensive ability to understand multiple types of negation tasks.
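The presence-based objective has the shape of an InfoNCE softmax in which the negated caption serves as a hard negative, which a scalar sketch makes concrete (similarities are given directly here; in the real model they come from CLIP image and text embeddings, and the temperature value is an assumption):

```python
import math

# -log softmax probability of the original caption over {original, negated}:
# low when the image matches its original caption much better than the
# negated one, log(2) when the model cannot tell them apart.

def presence_loss(sim_pos, sim_neg, temperature=0.07):
    logits = [sim_pos / temperature, sim_neg / temperature]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[0] - log_z)

print(presence_loss(0.9, 0.2))  # near zero: negation is understood
print(presence_loss(0.5, 0.5))  # log(2): negated caption not pushed away
```

Minimizing this loss is what pulls image embeddings toward original captions and away from presence-based negated ones.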
[132] Unbiased Model Prediction Without Using Protected Attribute Information
Puspita Majumdar, Surbhi Mittal, Mayank Vatsa, Richa Singh
Main category: cs.CV
TL;DR: NPAD algorithm mitigates bias in deep learning models without requiring protected attribute information by using non-protected attributes and novel loss functions (DACL and FRL).
Details
Motivation: Existing fairness algorithms require protected attribute information (like race, gender) which limits real-world deployment due to privacy/legal concerns. Need for bias mitigation methods that work without protected attributes.
Method: Proposes NPAD algorithm that uses non-protected attributes (like facial features) as auxiliary information. Introduces two loss functions: DACL (Debiasing via Attribute Cluster Loss) and FRL (Filter Redundancy Loss) to optimize models for fairness without protected attributes.
Result: Experiments on LFWA and CelebA datasets for facial attribute prediction show significant bias reduction across gender and age subgroups without using protected attributes.
Conclusion: NPAD provides effective bias mitigation without requiring protected attribute information, making it more practical for real-world applications while maintaining fairness across demographic subgroups.
Abstract: The problem of bias persists in the deep learning community as models continue to provide disparate performance across different demographic subgroups. Therefore, several algorithms have been proposed to improve the fairness of deep models. However, a majority of these algorithms utilize the protected attribute information for bias mitigation, which severely limits their application in real-world scenarios. To address this concern, we propose a novel algorithm, termed the Non-Protected Attribute-based Debiasing (NPAD) algorithm, for bias mitigation that does not require the protected attribute information. The proposed NPAD algorithm utilizes the auxiliary information provided by the non-protected attributes to optimize the model for bias mitigation. Further, two loss functions, Debiasing via Attribute Cluster Loss (DACL) and Filter Redundancy Loss (FRL), are proposed to optimize the model for fairness goals. Multiple experiments are performed on the LFWA and CelebA datasets for facial attribute prediction, and a significant reduction in bias across different gender and age subgroups is observed.
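One way to picture the attribute-cluster idea is as a loss-disparity penalty over clusters formed from *non-protected* attributes (an assumption about the mechanism; the paper's DACL is a learned objective, and this variance penalty is only an illustration that touches no protected labels):

```python
# Penalize how unevenly the model's loss is distributed across clusters
# defined by non-protected attributes: a large gap between cluster mean
# losses yields a large penalty, pushing training toward parity.

def attribute_cluster_penalty(losses, cluster_ids):
    """Variance of the mean loss across non-protected-attribute clusters."""
    by_cluster = {}
    for loss, cid in zip(losses, cluster_ids):
        by_cluster.setdefault(cid, []).append(loss)
    means = [sum(v) / len(v) for v in by_cluster.values()]
    mu = sum(means) / len(means)
    return sum((m - mu) ** 2 for m in means) / len(means)

# Two clusters (e.g. grouped by a facial attribute such as "smiling"):
penalty = attribute_cluster_penalty([0.2, 0.4, 0.9, 1.1], [0, 0, 1, 1])
print(penalty)  # large gap between cluster losses -> large penalty
```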
[133] ConInfer: Context-Aware Inference for Training-Free Open-Vocabulary Remote Sensing Segmentation
Wenyang Chen, Zhanxuan Hu, Yaping Zhang, Hailong Ning, Yonghang Tai
Main category: cs.CV
TL;DR: ConInfer is a context-aware inference framework for open-vocabulary remote sensing segmentation that performs joint prediction across spatial units with explicit semantic dependency modeling, improving segmentation consistency and accuracy over patch-wise methods.
Details
Motivation: Existing training-free open-vocabulary remote sensing segmentation methods use patch-wise predictions that ignore the strong spatial and semantic correlations in large-scale remote sensing scenes, leading to insufficient segmentation accuracy.
Method: Proposes ConInfer, a context-aware inference framework that performs joint prediction across multiple spatial units while explicitly modeling inter-unit semantic dependencies, incorporating global contextual cues for better segmentation.
Result: Extensive experiments show ConInfer consistently surpasses state-of-the-art per-pixel VLM-based baselines like SegEarth-OV, achieving average improvements of 2.80% on open-vocabulary semantic segmentation and 6.13% on object extraction tasks.
Conclusion: The context-aware joint prediction framework significantly enhances segmentation consistency, robustness, and generalization in complex remote sensing environments by addressing the limitations of isolated patch-wise predictions.
Abstract: Training-free open-vocabulary remote sensing segmentation (OVRSS), empowered by vision-language models, has emerged as a promising paradigm for achieving category-agnostic semantic understanding in remote sensing imagery. Existing approaches mainly focus on enhancing feature representations or mitigating modality discrepancies to improve patch-level prediction accuracy. However, such independent prediction schemes are fundamentally misaligned with the intrinsic characteristics of remote sensing data. In real-world applications, remote sensing scenes are typically large-scale and exhibit strong spatial as well as semantic correlations, making isolated patch-wise predictions insufficient for accurate segmentation. To address this limitation, we propose ConInfer, a context-aware inference framework for OVRSS that performs joint prediction across multiple spatial units while explicitly modeling their inter-unit semantic dependencies. By incorporating global contextual cues, our method significantly enhances segmentation consistency, robustness, and generalization in complex remote sensing environments. Extensive experiments on multiple benchmark datasets demonstrate that our approach consistently surpasses state-of-the-art per-pixel VLM-based baselines such as SegEarth-OV, achieving average improvements of 2.80% and 6.13% on open-vocabulary semantic segmentation and object extraction tasks, respectively. The implementation code is available at: https://github.com/Dog-Yang/ConInfer
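As a toy illustration of why joint, context-aware prediction beats isolated patch-wise argmax (this is not ConInfer's actual mechanism, which models inter-unit semantic dependencies explicitly): averaging each patch's class logits with its neighbors lets surrounding context overrule an isolated noisy prediction.

```python
import numpy as np

def context_smooth(logits):
    """Average each patch's class logits with its 4-neighbors.

    logits: (H, W, C) per-patch class scores from independent prediction.
    Returns context-adjusted logits of the same shape.
    """
    padded = np.pad(logits, ((1, 1), (1, 1), (0, 0)), mode='edge')
    neighbors = (padded[:-2, 1:-1] + padded[2:, 1:-1] +
                 padded[1:-1, :-2] + padded[1:-1, 2:])
    return (logits + neighbors) / 5.0

# A 3x3 scene that is overwhelmingly class 0, with one noisy center patch.
logits = np.zeros((3, 3, 2))
logits[..., 0] = 2.0          # strong evidence for class 0 everywhere
logits[1, 1] = [0.0, 0.5]     # center patch weakly (and wrongly) favors class 1

assert logits[1, 1].argmax() == 1                   # patch-wise: wrong
assert context_smooth(logits)[1, 1].argmax() == 0   # with context: corrected
```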
[134] MaskAdapt: Learning Flexible Motion Adaptation via Mask-Invariant Prior for Physics-Based Characters
Soomin Park, Eunseong Lee, Kwang Bin Lee, Sung-Hee Lee
Main category: cs.CV
TL;DR: MaskAdapt is a two-stage framework for flexible motion adaptation in physics-based humanoid control using mask-invariant base policies and residual learning for targeted body-part modifications.
Details
Motivation: The paper addresses the need for flexible motion adaptation in physics-based humanoid control, where existing methods often lack the ability to modify only specific body parts while preserving original behaviors elsewhere, limiting their versatility for applications like motion composition and text-driven partial goal tracking.
Method: Two-stage residual learning: 1) Train a mask-invariant base policy using stochastic body-part masking and regularization for consistent action distributions, creating a robust motion prior. 2) Train a residual policy on the frozen base controller to modify only targeted body parts while preserving original behaviors elsewhere.
Result: MaskAdapt demonstrates strong robustness and adaptability, producing diverse behaviors under masked observations and delivering superior targeted motion adaptation compared to prior work, particularly in motion composition and text-driven partial goal tracking applications.
Conclusion: The MaskAdapt framework provides an effective approach for flexible motion adaptation in physics-based humanoid control, enabling targeted modifications of specific body parts while maintaining overall stability and original behaviors in other regions.
Abstract: We present MaskAdapt, a framework for flexible motion adaptation in physics-based humanoid control. The framework follows a two-stage residual learning paradigm. In the first stage, we train a mask-invariant base policy using stochastic body-part masking and a regularization term that enforces consistent action distributions across masking conditions. This yields a robust motion prior that remains stable under missing observations, anticipating later adaptation in those regions. In the second stage, a residual policy is trained atop the frozen base controller to modify only the targeted body parts while preserving the original behaviors elsewhere. We demonstrate the versatility of this design through two applications: (i) motion composition, where varying masks enable multi-part adaptation within a single sequence, and (ii) text-driven partial goal tracking, where designated body parts follow kinematic targets provided by a pre-trained text-conditioned autoregressive motion generator. Through experiments, MaskAdapt demonstrates strong robustness and adaptability, producing diverse behaviors under masked observations and delivering superior targeted motion adaptation compared to prior work.
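The first-stage regularizer enforces consistent action distributions across masking conditions. A minimal numeric sketch of that idea, with a toy linear policy and hypothetical body-part slices rather than the paper's physics-based RL setup:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 12))             # toy linear "policy": obs -> action

def act(obs, mask):
    return W @ (obs * mask)              # masked body parts are zeroed out

def consistency_loss(obs, part_slices, n_samples=16):
    """Penalize action changes under stochastic body-part masking."""
    full = act(obs, np.ones_like(obs))
    total = 0.0
    for _ in range(n_samples):
        mask = np.ones_like(obs)
        for sl in part_slices:
            if rng.random() < 0.5:       # drop this body part's observations
                mask[sl] = 0.0
        total += np.mean((act(obs, mask) - full) ** 2)
    return total / n_samples

obs = rng.normal(size=12)
parts = [slice(0, 4), slice(4, 8), slice(8, 12)]   # hypothetical body-part groups
assert consistency_loss(np.zeros(12), parts) == 0.0   # nothing observed, nothing to disagree on
assert consistency_loss(obs, parts) > 0.0             # untrained policy is mask-sensitive
```

Minimizing this term during training pushes the policy toward mask-invariant behavior, which is what makes the later residual stage able to edit one body part without destabilizing the rest.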
[135] PRISM: A Multi-View Multi-Capability Retail Video Dataset for Embodied Vision-Language Models
Amirreza Rouhi, Parikshit Sakurikar, Satya Sai Reddy, Narsimha Menga, Anirudh Govil, Sri Harsha Chittajallu, Rajat Aggarwal, Anoop Namboodiri, Sashi Reddi
Main category: cs.CV
TL;DR: PRISM is a 270K-sample multi-view video dataset for fine-tuning embodied vision-language models in retail environments, featuring spatial, temporal/physical, and embodied action knowledge dimensions.
Details
Motivation: Physical AI systems fail not due to poor visual recognition, but because they lack understanding of space, physical dynamics, and embodied actions needed for reliable real-world operation.
Method: Created a 270K-sample multi-view video SFT corpus with egocentric, exocentric, and 360° viewpoints from five supermarket locations, featuring open-ended, chain-of-thought, and multiple-choice supervision across 20+ capability probes in four dimensions.
Result: Fine-tuning on PRISM reduces error rate by 66.6% over pre-trained baseline, with 36.4% accuracy improvement in embodied action understanding, demonstrating significant gains across all 20+ probes.
Conclusion: Ontology-structured, domain-specific SFT can meaningfully strengthen embodied VLMs for real-world settings, with PRISM being the first dataset to instantiate all three knowledge dimensions within a single deployment domain.
Abstract: A critical gap exists between the general-purpose visual understanding of state-of-the-art physical AI models and the specialized perceptual demands of structured real-world deployment environments. We present PRISM, a 270K-sample multi-view video supervised fine-tuning (SFT) corpus for embodied vision-language-models (VLMs) in real-world retail environments. PRISM is motivated by a simple observation - physical AI systems fail not because of poor visual recognition, but because they do not understand space, physical dynamics and embodied action well enough to operate reliably in the world. To this end, PRISM is grounded in a novel three-dimensional knowledge ontology that spans spatial knowledge, temporal and physical knowledge, and embodied action knowledge. It covers 20+ capability probes across four evaluation dimensions - Embodied Reasoning (ER), Common Sense (CS), Spatial Perception (SP), and Intuitive Physics (IP), and to our knowledge, PRISM is the first dataset to instantiate all three knowledge dimensions within a single real-world deployment domain. The corpus captures data from egocentric, exocentric and 360° viewpoints across five supermarket locations and includes open-ended, chain-of-thought, and multiple-choice supervision. At 4 fps, PRISM spans approximately 11.8M video frames and approximately 730M tokens, placing it among the largest domain-specific video SFT corpora. Fine-tuning on PRISM reduces the error rate across all 20+ probes by 66.6% over the pre-trained baseline, with significant gains in embodied action understanding where the accuracy improves by 36.4%. Our results suggest that ontology-structured, domain specific SFT can meaningfully strengthen embodied VLMs for real-world settings. The PRISM dataset and more details are available at https://dreamvu.ai/prism
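A quick sanity check on the stated scale: approximately 11.8M frames captured at 4 fps works out to roughly 820 hours of footage across the five locations.

```python
# Sanity check on the stated corpus scale: ~11.8M frames at 4 fps.
frames, fps = 11.8e6, 4
hours = frames / fps / 3600
print(round(hours))   # -> 819, i.e. roughly 820 hours of video
```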
[136] MELT: Improve Composed Image Retrieval via the Modification Frequentation-Rarity Balance Network
Guozhi Qiu, Zhiwei Chen, Zixu Li, Qinlei Huang, Zhiheng Fu, Xuemeng Song, Yupeng Hu
Main category: cs.CV
TL;DR: MELT is a novel Composed Image Retrieval method that addresses frequency bias and hard negative interference through rare semantic localization and diffusion-based denoising.
Details
Motivation: Existing CIR methods suffer from frequency bias leading to "Rare Sample Neglect" and are susceptible to interference from hard negative samples and noise, which limits their retrieval performance.
Method: Proposes MELT network with two key components: (1) asymmetric rare semantic localization to focus on rare modification semantics in multimodal contexts, and (2) diffusion-based denoising applied to hard negative samples with high similarity scores to enhance multimodal fusion and matching.
Result: Extensive experiments on two CIR benchmarks validate the superior performance of MELT compared to existing methods.
Conclusion: MELT effectively addresses the limitations of frequency bias and hard negative interference in CIR, achieving state-of-the-art performance through rare semantic attention and robust similarity estimation.
Abstract: Composed Image Retrieval (CIR) uses a reference image and a modification text as a query to retrieve a target image satisfying the requirement of "modifying the reference image according to the text instructions". However, existing CIR methods face two limitations: (1) frequency bias leading to "Rare Sample Neglect", and (2) susceptibility of similarity scores to interference from hard negative samples and noise. To address these limitations, we confront two key challenges: asymmetric rare semantic localization and robust similarity estimation under hard negative samples. To solve these challenges, we propose the Modification frEquentation-rarity baLance neTwork (MELT). MELT assigns increased attention to rare modification semantics in multimodal contexts while applying diffusion-based denoising to hard negative samples with high similarity scores, enhancing multimodal fusion and matching. Extensive experiments on two CIR benchmarks validate the superior performance of MELT. Codes are available at https://github.com/luckylittlezhi/MELT.
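The frequentation–rarity balance can be loosely illustrated with IDF-style weighting, where rare modification terms receive more weight than common ones. MELT's actual mechanism is learned attention inside a multimodal network, so treat this purely as an analogy (the corpus and function below are made up for illustration):

```python
import math
from collections import Counter

def rarity_weights(corpus_texts, query_tokens):
    """IDF-style weights: rarer modification terms get larger weights."""
    n = len(corpus_texts)
    df = Counter()
    for text in corpus_texts:
        df.update(set(text.split()))                 # document frequency per term
    return {t: math.log((1 + n) / (1 + df[t])) for t in query_tokens}

corpus = ["make the dog red", "make the cat red", "make the dog blue"]
w = rarity_weights(corpus, ["make", "turquoise"])
assert w["turquoise"] > w["make"]   # the rare/unseen term outweighs the common one
```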
[137] GazeCLIP: Gaze-Guided CLIP with Adaptive-Enhanced Fine-Grained Language Prompt for Deepfake Attribution and Detection
Yaning Zhang, Linlin Shen, Zitong Yu, Chunjie Ma, Zan Gao
Main category: cs.CV
TL;DR: A gaze-guided CLIP model with adaptive language prompts for fine-grained deepfake attribution and detection, leveraging gaze vector differences between real and fake faces to improve generalization to novel generative methods.
Details
Motivation: Current deepfake attribution/detection methods have poor generalization to novel generative techniques due to limited visual modality exploration and lack of synergy between attribution and detection tasks. There's a need for better evaluation on advanced generators like diffusion models.
Method: Proposes gaze-guided CLIP with adaptive-enhanced fine-grained language prompts. Uses visual perception encoder to mine global forgery embeddings from gaze vector differences. Includes gaze-aware image encoder (GIE) that fuses forgery gaze prompts with image embeddings, and language refinement encoder (LRE) with adaptive word selector for precise vision-language matching.
Result: Outperforms state-of-the-art by 6.56% ACC and 5.32% AUC in average performance under attribution and detection settings on their novel benchmark with diffusion and flow models.
Conclusion: The gaze-aware approach effectively enhances generalization to unseen face forgery attacks by leveraging gaze differences between real and fake faces, with adaptive language prompts improving vision-language matching for deepfake attribution and detection.
Abstract: Current deepfake attribution or deepfake detection works tend to exhibit poor generalization to novel generative methods due to the limited exploration in visual modalities alone. They tend to assess the attribution or detection performance of models on unseen advanced generators, coarsely, and fail to consider the synergy of the two tasks. To this end, we propose a novel gaze-guided CLIP with adaptive-enhanced fine-grained language prompts for fine-grained deepfake attribution and detection (DFAD). Specifically, we conduct a novel and fine-grained benchmark to evaluate the DFAD performance of networks on novel generators like diffusion and flow models. Additionally, we introduce a gaze-aware model based on CLIP, which is devised to enhance the generalization to unseen face forgery attacks. Built upon the novel observation that there are significant distribution differences between pristine and forged gaze vectors, and the preservation of the target gaze in facial images generated by GAN and diffusion varies significantly, we design a visual perception encoder to employ the inherent gaze differences to mine global forgery embeddings across appearance and gaze domains. We propose a gaze-aware image encoder (GIE) that fuses forgery gaze prompts extracted via a gaze encoder with common forged image embeddings to capture general attribution patterns, allowing features to be transformed into a more stable and common DFAD feature space. We build a language refinement encoder (LRE) to generate dynamically enhanced language embeddings via an adaptive-enhanced word selector for precise vision-language matching. Extensive experiments on our benchmark show that our model outperforms the state-of-the-art by 6.56% ACC and 5.32% AUC in average performance under the attribution and detection settings, respectively. Codes will be available on GitHub.
[138] MotionScale: Reconstructing Appearance, Geometry, and Motion of Dynamic Scenes with Scalable 4D Gaussian Splatting
Haoran Zhou, Gim Hee Lee
Main category: cs.CV
TL;DR: MotionScale: A 4D Gaussian Splatting framework for reconstructing dynamic scenes from monocular videos with scalable motion fields and progressive optimization for temporal consistency.
Details
Motivation: Existing neural rendering methods struggle with accurate 3D geometry and temporally consistent motion reconstruction in complex dynamic scenes from monocular videos.
Method: Proposes MotionScale with scalable motion fields using cluster-centric basis transformations and progressive optimization with two decoupled stages: background extension (handles new regions, refines poses, models shadows) and foreground propagation (enforces motion consistency via three-stage refinement).
Result: Significantly outperforms state-of-the-art methods in reconstruction quality and temporal stability on challenging real-world benchmarks.
Conclusion: MotionScale enables realistic 4D scene reconstruction from monocular videos with improved geometry accuracy and motion consistency through scalable motion representation and progressive optimization.
Abstract: Realistic reconstruction of dynamic 4D scenes from monocular videos is essential for understanding the physical world. Despite recent progress in neural rendering, existing methods often struggle to recover accurate 3D geometry and temporally consistent motion in complex environments. To address these challenges, we propose MotionScale, a 4D Gaussian Splatting framework that scales efficiently to large scenes and extended sequences while maintaining high-fidelity structural and motion coherence. At the core of our approach is a scalable motion field parameterized by cluster-centric basis transformations that adaptively expand to capture diverse and evolving motion patterns. To ensure robust reconstruction over long durations, we introduce a progressive optimization strategy comprising two decoupled propagation stages: 1) A background extension stage that adapts to newly visible regions, refines camera poses, and explicitly models transient shadows; 2) A foreground propagation stage that enforces motion consistency through a specialized three-stage refinement process. Extensive experiments on challenging real-world benchmarks demonstrate that MotionScale significantly outperforms state-of-the-art methods in both reconstruction quality and temporal stability. Project page: https://hrzhou2.github.io/motion-scale-web/.
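The cluster-centric motion field resembles blend skinning: each Gaussian's motion is a weighted mix of per-cluster rigid basis transforms. A minimal sketch under that reading (the function name and shapes are assumptions, not the released code):

```python
import numpy as np

def blend_motion(points, weights, rotations, translations):
    """Deform points by a convex blend of per-cluster rigid transforms.

    points:       (N, 3)    canonical Gaussian centers
    weights:      (N, K)    per-point cluster weights (rows sum to 1)
    rotations:    (K, 3, 3) per-cluster rotation matrices
    translations: (K, 3)    per-cluster translations
    """
    # Apply every cluster transform to every point: (K, N, 3).
    moved = np.einsum('kij,nj->kni', rotations, points) + translations[:, None, :]
    # Blend the K candidate positions with per-point weights: (N, 3).
    return np.einsum('nk,kni->ni', weights, moved)

pts = np.array([[0., 0., 0.], [1., 2., 3.]])
R = np.stack([np.eye(3), np.eye(3)])
t = np.array([[2., 0., 0.], [0., 0., 0.]])
w = np.array([[1.0, 0.0],      # first point follows cluster 0 exactly
              [0.5, 0.5]])     # second point blends both clusters
out = blend_motion(pts, w, R, t)
assert np.allclose(out[0], [2., 0., 0.])
assert np.allclose(out[1], [2., 2., 3.])
```

Per-cluster bases keep the number of motion parameters proportional to the number of clusters rather than the number of Gaussians, which is what makes the representation scale to large scenes.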
[139] Self-Consistency for LLM-Based Motion Trajectory Generation and Verification
Jiaju Ma, R. Kenny Jones, Jiajun Wu, Maneesh Agrawala
Main category: cs.CV
TL;DR: Adapting self-consistency techniques from text to visual domains for LLM-generated motion graphics trajectories, using geometric transformation groups to identify consistent trajectories.
Details
Motivation: Self-consistency has been effective for improving LLM performance on text reasoning tasks, but adapting it to visual domains remains unexplored. The paper aims to extend self-consistency to visual reasoning, specifically for motion graphics trajectory generation.
Method: Proposes modeling shape families as prototype trajectories with geometric transformation groups (rigid, similarity, affine). Samples diverse motion trajectories from LLMs, then identifies consistent trajectories via clustering based on allowable transformations. Uses hierarchical relationships between transformation groups to automatically recover shape families.
Result: Improves accuracy of LLM-based trajectory generation by 4-6%. Extends method to verification tasks, achieving 11% precision gains over VLM baselines.
Conclusion: Successfully adapts self-consistency to visual domains, demonstrating effectiveness for motion graphics trajectory generation and verification through geometric transformation modeling.
Abstract: Self-consistency has proven to be an effective technique for improving LLM performance on natural language reasoning tasks in a lightweight, unsupervised manner. In this work, we study how to adapt self-consistency to visual domains. Specifically, we consider the generation and verification of LLM-produced motion graphics trajectories. Given a prompt (e.g., “Move the circle in a spiral path”), we first sample diverse motion trajectories from an LLM, and then identify groups of consistent trajectories via clustering. Our key insight is to model the family of shapes associated with a prompt as a prototype trajectory paired with a group of geometric transformations (e.g., rigid, similarity, and affine). Two trajectories can then be considered consistent if one can be transformed into the other under the warps allowable by the transformation group. We propose an algorithm that automatically recovers a shape family, using hierarchical relationships between a set of candidate transformation groups. Our approach improves the accuracy of LLM-based trajectory generation by 4-6%. We further extend our method to support verification, observing 11% precision gains over VLM baselines. Our code and dataset are available at https://majiaju.io/trajectory-self-consistency .
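The consistency test at the heart of this method can be sketched with an ordinary Procrustes fit: two sampled trajectories belong to the same similarity-group family if one aligns onto the other with near-zero residual after the best scale, rotation, and translation. The sketch below is illustrative only; it omits the paper's hierarchical selection over candidate transformation groups and the clustering step.

```python
import numpy as np

def similarity_residual(X, Y):
    """Relative residual after best similarity alignment (Procrustes).

    X, Y: (N, 2) trajectories. A residual near 0 means Y is (approximately)
    a rotated, scaled, translated copy of X, i.e. the two samples are
    consistent under the similarity transformation group.
    """
    Xc = X - X.mean(0)
    Yc = Y - Y.mean(0)
    U, S, Vt = np.linalg.svd(Xc.T @ Yc)
    R = U @ Vt                          # optimal orthogonal map (may reflect)
    s = S.sum() / (Xc ** 2).sum()       # optimal isotropic scale
    return np.linalg.norm(Yc - s * Xc @ R) / np.linalg.norm(Yc)

t = np.linspace(0, 4 * np.pi, 100)
spiral = np.stack([t * np.cos(t), t * np.sin(t)], axis=1)
a = np.deg2rad(30)
rot = np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])
copy = 2.0 * spiral @ rot + np.array([5.0, -3.0])   # similarity-transformed spiral
line = np.stack([t, 0 * t], axis=1)                 # a genuinely different shape

assert similarity_residual(spiral, copy) < 1e-8     # same family: consistent
assert similarity_residual(spiral, line) > 0.1      # different family: inconsistent
```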
[140] HSFM: Hard-Set-Guided Feature-Space Meta-Learning for Robust Classification under Spurious Correlations
Aryan Yazdan Parast, Khawar Islam, Soyoun Won, Basim Azam, Naveed Akhtar
Main category: cs.CV
TL;DR: A bilevel meta-learning method that performs feature-space augmentation to improve classifier robustness against spurious correlations, particularly effective for minority groups and distribution shifts.
Details
Motivation: Deep neural networks often rely on spurious features that make them brittle under distribution shifts and on minority-group examples. While ERM-trained feature extractors learn informative representations, the classifier head often fails, and retraining just the head can substantially improve performance on shifted distributions and minority groups.
Method: Proposes a bilevel meta-learning method that performs augmentation directly in feature space to improve spurious correlation handling in the classifier head. Learns support-side feature edits such that, after a small number of inner-loop updates on the edited features, the classifier achieves lower loss on hard examples and improved worst-group performance. Operates at the backbone output rather than in pixel space or through end-to-end optimization.
Result: The method is highly efficient and stable, requiring only a few minutes of training on a single GPU. CLIP-based visualizations show that the learned feature-space updates induce semantically meaningful shifts aligned with spurious attributes.
Conclusion: Feature-space augmentation through bilevel meta-learning effectively addresses spurious correlation issues in classifier heads, improving robustness on minority groups and distribution shifts while maintaining efficiency.
Abstract: Deep neural networks often rely on spurious features to make predictions, which makes them brittle under distribution shift and on samples where the spurious correlation does not hold (e.g., minority-group examples). Recent studies have shown that, even in such settings, the feature extractor of an Empirical Risk Minimization (ERM)-trained model can learn rich and informative representations, and that much of the failure may be attributed to the classifier head. In particular, retraining a lightweight head while keeping the backbone frozen can substantially improve performance on shifted distributions and minority groups. Motivated by this observation, we propose a bilevel meta-learning method that performs augmentation directly in feature space to improve spurious correlation handling in the classifier head. Our method learns support-side feature edits such that, after a small number of inner-loop updates on the edited features, the classifier achieves lower loss on hard examples and improved worst-group performance. By operating at the backbone output rather than in pixel space or through end-to-end optimization, the method is highly efficient and stable, requiring only a few minutes of training on a single GPU. We further validate our method with CLIP-based visualizations, showing that the learned feature-space updates induce semantically meaningful shifts aligned with spurious attributes.
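The motivating observation — a frozen ERM backbone plus a retrained lightweight head recovers much of the lost performance — is easy to reproduce in miniature. The toy below retrains only a logistic head on fixed synthetic features; it is not the paper's bilevel meta-learning, just the phenomenon it builds on.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for frozen-backbone features and binary labels.
X = rng.normal(size=(200, 16))
w_star = rng.normal(size=16)
y = (X @ w_star > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(w):
    p = np.clip(sigmoid(X @ w), 1e-9, 1 - 1e-9)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Retrain only the linear head (backbone frozen) with plain gradient descent.
w = np.zeros(16)
for _ in range(500):
    w -= 0.5 * X.T @ (sigmoid(X @ w) - y) / len(y)

assert bce(w) < bce(np.zeros(16))                    # head retraining helps
assert np.mean((X @ w > 0) == (y > 0.5)) > 0.9       # near-perfect on this toy
```

HSFM's contribution sits on top of this: it meta-learns feature-space edits so that the retrained head is robust specifically on hard (minority-group) examples rather than on average.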
[141] FOSCU: Feasibility of Synthetic MRI Generation via Duo-Diffusion Models for Enhancement of 3D U-Nets in Hepatic Segmentation
Youngung Han, Kyeonghun Kim, Seoyoung Ju, Yeonju Jean, Minkyung Cha, Seohyoung Park, Hyeonseok Jung, Nam-Joon Kim, Woo Kyoung Jeong, Ken Ying-Kai Liao, Hyuk-Jae Lee
Main category: cs.CV
TL;DR: FOSCU: A framework using Duo-Diffusion (3D latent diffusion with ControlNet) to generate synthetic MRI volumes and segmentation labels to address medical image data scarcity, improving segmentation performance when combined with real data.
Details
Motivation: Address fundamental challenges in medical image segmentation: restricted access to clinical datasets, costly annotation, and data shortage in PACS systems, which impede development of robust segmentation algorithms.
Method: Proposes FOSCU with Duo-Diffusion (3D latent diffusion model with ControlNet) that simultaneously generates synthetic MRI volumes and segmentation labels using segmentation-conditioned diffusion, plus enhanced 3D U-Net training pipeline.
Result: On 720 abdominal MRI scans, models trained with combined real and synthetic data achieved 0.67% mean Dice score gain over real-only training, and 36.4% reduction in Fréchet Inception Distance (FID) indicating enhanced image fidelity.
Conclusion: FOSCU effectively addresses medical image data scarcity by generating high-quality synthetic data that improves segmentation performance when combined with real data, demonstrating the value of diffusion models for medical imaging.
Abstract: Medical image segmentation faces fundamental challenges, including restricted access to clinical datasets through Picture Archiving and Communication Systems (PACS), costly annotation, and data shortage. These systemic barriers significantly impede the development of robust segmentation algorithms. To address these challenges, we propose FOSCU, which integrates Duo-Diffusion, a 3D latent diffusion model with ControlNet that simultaneously generates high-resolution, anatomically realistic synthetic MRI volumes and corresponding segmentation labels, and an enhanced 3D U-Net training pipeline. Duo-Diffusion employs segmentation-conditioned diffusion to ensure spatial consistency and precise anatomical detail in the generated data. Experimental evaluation on 720 abdominal MRI scans shows that models trained with combined real and synthetic data yield a mean Dice score gain of 0.67% over those using only real data, and achieve a 36.4% reduction in Fréchet Inception Distance (FID), reflecting enhanced image fidelity.
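The headline 0.67% gain is in mean Dice, which for binary masks has the standard definition sketched below (this is the common formula, not code from the paper):

```python
import numpy as np

def dice(pred, target, eps=1e-8):
    """Dice score 2|A∩B| / (|A|+|B|) between two binary masks of any shape."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return 2.0 * inter / (pred.sum() + target.sum() + eps)

a = np.array([1, 1, 0, 0])
b = np.array([1, 0, 1, 0])
assert abs(dice(a, b) - 0.5) < 1e-6   # 2·1 / (2 + 2)
```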
[142] CIPHER: Counterfeit Image Pattern High-level Examination via Representation
Kyeonghun Kim, Youngung Han, Seoyoung Ju, Yeonju Jean, YooHyun Kim, Minseo Choi, SuYeon Lim, Kyungtae Park, Seungwoo Baek, Sieun Hyeon, Nam-Joon Kim, Hyuk-Jae Lee
Main category: cs.CV
TL;DR: CIPHER is a deepfake detection framework that reuses and fine-tunes GAN/diffusion model discriminators to capture generation-agnostic artifacts, achieving superior cross-model detection performance.
Details
Motivation: As generative models (GANs, diffusion models) create increasingly realistic synthetic faces, the risks of misinformation, fraud, and identity abuse grow. Current detectors struggle with robustness across diverse generative models, creating an urgent need for more generalizable detection systems.
Method: CIPHER systematically reuses and fine-tunes discriminators originally trained for image generation. It extracts scale-adaptive features from ProGAN discriminators and temporal-consistency features from diffusion models to capture generation-agnostic artifacts that conventional detectors overlook.
Result: Extensive experiments across nine state-of-the-art generative models show CIPHER achieves up to 74.33% F1-score, outperforming existing ViT-based detectors by over 30% in F1-score on average. It maintains robust performance on challenging datasets (up to 88% F1-score on CIFAKE) where baseline methods fail.
Conclusion: CIPHER validates the effectiveness of discriminator reuse and cross-model fine-tuning, establishing a promising approach for building more generalizable and robust deepfake detection systems in the era of rapidly evolving generative technologies.
Abstract: The rapid progress of generative adversarial networks (GANs) and diffusion models has enabled the creation of synthetic faces that are increasingly difficult to distinguish from real images. This progress, however, has also amplified the risks of misinformation, fraud, and identity abuse, underscoring the urgent need for detectors that remain robust across diverse generative models. In this work, we introduce Counterfeit Image Pattern High-level Examination via Representation (CIPHER), a deepfake detection framework that systematically reuses and fine-tunes discriminators originally trained for image generation. By extracting scale-adaptive features from ProGAN discriminators and temporal-consistency features from diffusion models, CIPHER captures generation-agnostic artifacts that conventional detectors often overlook. Through extensive experiments across nine state-of-the-art generative models, CIPHER demonstrates superior cross-model detection performance, achieving up to 74.33% F1-score and outperforming existing ViT-based detectors by over 30% in F1-score on average. Notably, our approach maintains robust performance on challenging datasets where baseline methods fail, with up to 88% F1-score on CIFAKE compared to near-zero performance from conventional detectors. These results validate the effectiveness of discriminator reuse and cross-model fine-tuning, establishing CIPHER as a promising approach toward building more generalizable and robust deepfake detection systems in an era of rapidly evolving generative technologies.
[143] Uncertainty-Aware Trajectory Prediction: A Unified Framework Harnessing Positional and Semantic Uncertainties
Jintao Sun, Hu Zhang, Gangyi Ding, Zhedong Zheng
Main category: cs.CV
TL;DR: Uncertainty-Aware Trajectory Prediction (UATP) framework that jointly models positional and semantic uncertainties in real-time maps to improve trajectory forecasting for vehicles and pedestrians.
Details
Motivation: Real-time maps used for trajectory prediction have inherent uncertainties from sensor limitations/occlusions (positional errors) and scene misinterpretations (semantic errors). Current methods don't explicitly model these uncertainties, limiting prediction robustness.
Method: Proposes a unified framework with dual-head architecture that independently estimates semantic and positional predictions in dual-pass manner, deriving prediction variances as uncertainty indicators. These uncertainties are fused with predictions to enhance trajectory forecasts.
Result: Evaluated on nuScenes dataset across 4 map estimation methods and 2 trajectory prediction baselines. Effectively quantifies map uncertainties and consistently improves performance across minADE, minFDE, and Miss Rate metrics.
Conclusion: The uncertainty-aware framework successfully models both positional and semantic uncertainties in maps, enhancing trajectory prediction robustness and performance across various baselines and metrics.
Abstract: Trajectory prediction seeks to forecast the future motion of dynamic entities, such as vehicles and pedestrians, given a temporal horizon of historical movement data and environmental context. A central challenge in this domain is the inherent uncertainty in real-time maps, arising from two primary sources: (1) positional inaccuracies due to sensor limitations or environmental occlusions, and (2) semantic errors stemming from misinterpretations of scene context. To address these challenges, we propose a novel unified framework that jointly models positional and semantic uncertainties and explicitly integrates them into the trajectory prediction pipeline. Our approach employs a dual-head architecture to independently estimate semantic and positional predictions in a dual-pass manner, deriving prediction variances as uncertainty indicators in an end-to-end fashion. These uncertainties are subsequently fused with the semantic and positional predictions to enhance the robustness of trajectory forecasts. We evaluate our uncertainty-aware framework on the nuScenes real-world driving dataset, conducting extensive experiments across four map estimation methods and two trajectory prediction baselines. Results verify that our method (1) effectively quantifies map uncertainties through both positional and semantic dimensions, and (2) consistently improves the performance of existing trajectory prediction models across multiple metrics, including minimum Average Displacement Error (minADE), minimum Final Displacement Error (minFDE), and Miss Rate (MR). Code will be available at https://github.com/JT-Sun/UATP.
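The paper fuses learned variances with the predictions end-to-end inside the network. As a minimal stand-in for such a fusion step, classic inverse-variance (precision-weighted) averaging shows how an uncertainty estimate should modulate a prediction's influence; the function and setup below are illustrative, not the paper's architecture.

```python
import numpy as np

def fuse(mu, var):
    """Inverse-variance fusion of K independent estimates of one quantity.

    mu, var: (K,) means and variances. Returns the fused mean and variance;
    lower-variance (more certain) estimates get proportionally more weight.
    """
    prec = 1.0 / np.asarray(var)
    fused_var = 1.0 / prec.sum()
    fused_mu = fused_var * (prec * np.asarray(mu)).sum()
    return fused_mu, fused_var

mu, _ = fuse(np.array([0.0, 2.0]), np.array([1.0, 1.0]))
assert abs(mu - 1.0) < 1e-9      # equal confidence: plain midpoint
mu2, _ = fuse(np.array([0.0, 2.0]), np.array([0.01, 1.0]))
assert mu2 < 0.1                 # the certain estimate dominates the fusion
```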
[144] StereoVGGT: A Training-Free Visual Geometry Transformer for Stereo Vision
Ziyang Chen, Yansong Qu, You Shen, Xuan Cheng, Liujuan Cao
Main category: cs.CV
TL;DR: StereoVGGT adapts a geometry-grounded vision transformer (VGGT) as a specialized backbone for stereo vision tasks by addressing geometric detail degradation through a training-free feature adjustment pipeline.
Details
Motivation: Current stereo vision backbones lack explicit camera pose supervision during pretraining, creating a performance bottleneck. VGGT has 3D geometric priors including camera poses, but direct application to stereo vision suffers from geometric detail degradation.
Method: Proposes StereoVGGT by leveraging frozen VGGT with a training-free feature adjustment pipeline to mitigate geometric degradation and utilize embedded camera calibration knowledge for stereo vision tasks.
Result: StereoVGGT-based stereo matching network achieved 1st rank among all published methods on the KITTI benchmark, demonstrating superior performance.
Conclusion: StereoVGGT serves as an effective backbone for stereo vision by addressing geometric degradation and leveraging pretrained 3D geometric knowledge from VGGT.
Abstract: Driven by the advancement of 3D devices, stereo vision tasks including stereo matching and stereo conversion have emerged as a critical research frontier. Contemporary stereo vision backbones typically rely on either monocular depth estimation (MDE) models or visual foundation models (VFMs). Crucially, these models are predominantly pretrained without explicit supervision of camera poses. Given that such geometric knowledge is indispensable for stereo vision, the absence of explicit spatial constraints constitutes a significant performance bottleneck for existing architectures. Recognizing that the Visual Geometry Grounded Transformer (VGGT) operates as a foundation model pretrained on extensive 3D priors, including camera poses, we investigate its potential as a robust backbone for stereo vision tasks. Nevertheless, empirical results indicate that its direct application to stereo vision yields suboptimal performance. We observe that VGGT suffers from a more significant degradation of geometric details during feature extraction. Such characteristics conflict with the requirements of binocular stereo vision, thereby constraining its efficacy for relative tasks. To bridge this gap, we propose StereoVGGT, a feature backbone specifically tailored for stereo vision. By leveraging the frozen VGGT and introducing a training-free feature adjustment pipeline, we mitigate geometric degradation and harness the latent camera calibration knowledge embedded within the model. StereoVGGT-based stereo matching network achieved the $1^{st}$ rank among all published methods on the KITTI benchmark, validating that StereoVGGT serves as a highly effective backbone for stereo vision.
[145] Assessing Multimodal Chronic Wound Embeddings with Expert Triplet Agreement
Fabian Kabus, Julia Hindel, Jelena Bratulić, Meropi Karakioulaki, Ayush Gupta, Cristina Has, Thomas Brox, Abhinav Valada, Harald Binder
Main category: cs.CV
TL;DR: TriDerm: A multimodal framework for rare skin disease analysis using wound images, masks, and expert reports with expert-guided triplet comparisons to learn clinically meaningful representations.
Details
Motivation: Off-the-shelf foundation models fail to capture clinically meaningful features for rare, heterogeneous diseases like recessive dystrophic epidermolysis bullosa (RDEB). There's a need for methods that can learn from small cohorts and incorporate expert clinical knowledge effectively.
Method: TriDerm integrates wound imagery, boundary masks, and expert reports using: 1) visual adaptation with wound-level attention pooling and non-contrastive representation learning, 2) text processing via LLM prompting with comparison queries and soft ordinal embeddings (SOE), and 3) expert-guided triplet judgments for evaluation.
Result: The multimodal fusion achieves 73.5% agreement with experts, outperforming the best off-the-shelf single-modality foundation model by over 5.6 percentage points. Visual and textual modalities capture complementary aspects of wound phenotype.
Conclusion: TriDerm demonstrates that multimodal integration with expert-guided learning can effectively capture clinically meaningful representations for rare diseases, even with small datasets. The approach provides interpretable wound representations and outperforms existing foundation models.
Abstract: Recessive dystrophic epidermolysis bullosa (RDEB) is a rare genetic skin disorder for which clinicians greatly benefit from finding similar cases using images and clinical text. However, off-the-shelf foundation models do not reliably capture clinically meaningful features for this heterogeneous, long-tail disease, and structured measurement of agreement with experts is challenging. To address these gaps, we propose evaluating embedding spaces with expert ordinal comparisons (triplet judgments), which are fast to collect and encode implicit clinical similarity knowledge. We further introduce TriDerm, a multimodal framework that learns interpretable wound representations from small cohorts by integrating wound imagery, boundary masks, and expert reports. On the vision side, TriDerm adapts visual foundation models to RDEB using wound-level attention pooling and non-contrastive representation learning. For text, we prompt large language models with comparison queries and recover medically meaningful representations via soft ordinal embeddings (SOE). We show that visual and textual modalities capture complementary aspects of wound phenotype, and that fusing both modalities yields 73.5% agreement with experts, outperforming the best off-the-shelf single-modality foundation model by over 5.6 percentage points. We make the expert annotation tool, model code and representative dataset samples publicly available.
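The 73.5% "agreement with experts" is a triplet accuracy over expert ordinal comparisons; a minimal sketch of how such a score can be computed from an embedding space (the toy embeddings and the plain Euclidean metric are assumptions, not the paper's setup):

```python
import numpy as np

def triplet_agreement(emb, triplets):
    """Fraction of expert triplets (anchor, positive, negative) for which
    the embedding places the positive closer to the anchor than the
    negative, i.e. the model orders cases the same way the clinician did."""
    agree = 0
    for a, p, n in triplets:
        d_pos = np.linalg.norm(emb[a] - emb[p])
        d_neg = np.linalg.norm(emb[a] - emb[n])
        agree += d_pos < d_neg
    return agree / len(triplets)

# Three toy wound embeddings; experts judged 0 and 1 more similar than 0 and 2.
emb = {0: np.array([0.0, 0.0]), 1: np.array([0.1, 0.0]), 2: np.array([1.0, 1.0])}
score = triplet_agreement(emb, [(0, 1, 2), (1, 0, 2)])
```

Triplet judgments are attractive for small cohorts because each comparison is fast for an expert to make and needs no absolute severity scale.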
[146] PromptForge-350k: A Large-Scale Dataset and Contrastive Framework for Prompt-Based AI Image Forgery Localization
Jianpeng Wang, Haoyu Wang, Baoying Chen, Jishen Zeng, Yiming Qin, Yiqi Yang, Zhongjie Ba
Main category: cs.CV
TL;DR: A method for detecting AI-edited images using automated mask annotation, a large dataset (PromptForge-350k), and a triple-stream network with contrastive learning for forgery localization.
Details
Motivation: The rise of prompt-based AI image editing has increased risks of malicious content fabrication and misinformation, but forgery localization methods for these emerging techniques remain under-explored.
Method: 1) Automated mask annotation framework using keypoint alignment and semantic space similarity; 2) Construction of PromptForge-350k dataset covering 4 SOTA prompt-based editing models; 3) ICL-Net with triple-stream backbone and intra-image contrastive learning for robust forensic features.
Result: Achieves 62.5% IoU on PromptForge-350k (5.1% improvement over SOTA), strong robustness against degradations (<1% IoU drop), and promising generalization to unseen editing models (41.5% average IoU).
Conclusion: The proposed framework addresses data scarcity in AI-edited image forgery localization and provides an effective solution with strong performance, robustness, and generalization capabilities.
Abstract: The rapid democratization of prompt-based AI image editing has recently exacerbated the risks associated with malicious content fabrication and misinformation. However, forgery localization methods targeting these emerging editing techniques remain significantly under-explored. To bridge this gap, we first introduce a fully automated mask annotating framework that leverages keypoint alignment and semantic space similarity to generate precise ground-truth masks for edited regions. Based on this framework, we construct PromptForge-350k, a large-scale forgery localization dataset covering four state-of-the-art prompt-based AI image editing models, thereby mitigating the data scarcity in this domain. Furthermore, we propose ICL-Net, an effective forgery localization network featuring a triple-stream backbone and intra-image contrastive learning. This design enables the model to capture highly robust and generalizable forensic features. Extensive experiments demonstrate that our method achieves an IoU of 62.5% on PromptForge-350k, outperforming SOTA methods by 5.1%. Additionally, it exhibits strong robustness against common degradations with an IoU drop of less than 1%, and shows promising generalization capabilities on unseen editing models, achieving an average IoU of 41.5%.
[147] Extend3D: Town-Scale 3D Generation
Seungwoo Yoon, Jinmo Kim, Jaesik Park
Main category: cs.CV
TL;DR: Extend3D: Training-free pipeline for 3D scene generation from single images using object-centric 3D generative models with extended latent spaces and patch-wise generation.
Details
Motivation: To overcome limitations of object-centric 3D generative models that have fixed-size latent spaces, making them unsuitable for generating wide 3D scenes from single images.
Method: Extends latent space in x/y directions, divides into overlapping patches, applies object-centric 3D generative model to each patch with coupling at each time step. Uses point cloud prior from monocular depth estimator for initialization, refines occluded regions via SDEdit with “under-noising” concept. Optimizes extended latent during denoising with 3D-aware objectives for geometry and texture fidelity.
Result: Demonstrates better results than prior methods in human preference and quantitative experiments for 3D scene generation from single images.
Conclusion: Extend3D provides an effective training-free approach for generating complete 3D scenes from single images by extending latent spaces, patch-wise generation, and 3D-aware optimization.
Abstract: In this paper, we propose Extend3D, a training-free pipeline for 3D scene generation from a single image, built upon an object-centric 3D generative model. To overcome the limitations of fixed-size latent spaces in object-centric models for representing wide scenes, we extend the latent space in the $x$ and $y$ directions. Then, by dividing the extended latent space into overlapping patches, we apply the object-centric 3D generative model to each patch and couple them at each time step. Since patch-wise 3D generation with image conditioning requires strict spatial alignment between image and latent patches, we initialize the scene using a point cloud prior from a monocular depth estimator and iteratively refine occluded regions through SDEdit. We discovered that treating the incompleteness of 3D structure as noise during 3D refinement enables 3D completion via a concept, which we term under-noising. Furthermore, to address the sub-optimality of object-centric models for sub-scene generation, we optimize the extended latent during denoising, ensuring that the denoising trajectories remain consistent with the sub-scene dynamics. To this end, we introduce 3D-aware optimization objectives for improved geometric structure and texture fidelity. We demonstrate that our method yields better results than prior methods, as evidenced by human preference and quantitative experiments.
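The per-step coupling of overlapping patches resembles averaging overlapping denoiser outputs (as in MultiDiffusion-style panorama generation); a hedged 1-D sketch of one such step — the real method works on 3-D latents with image conditioning, and `denoise_patch`, the patch/stride sizes, and the linear stand-in denoiser are all illustrative assumptions:

```python
import numpy as np

def coupled_patch_step(latent, denoise_patch, patch=4, stride=2):
    """One denoising step over an extended 1-D latent: run the fixed-size
    model on overlapping windows, then average the predictions wherever
    windows overlap so neighboring patches stay mutually consistent."""
    out = np.zeros_like(latent)
    weight = np.zeros_like(latent)
    for start in range(0, len(latent) - patch + 1, stride):
        sl = slice(start, start + patch)
        out[sl] += denoise_patch(latent[sl])   # patch-local model call
        weight[sl] += 1.0                      # count overlapping windows
    return out / weight                        # overlap-averaged coupling

latent = np.arange(8, dtype=float)
stepped = coupled_patch_step(latent, lambda x: x * 0.5)  # stand-in "denoiser"
```

With a consistent (here, linear) denoiser the averaging is lossless; the coupling only matters when neighboring windows would otherwise disagree in their shared region.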
[148] AA-Splat: Anti-Aliased Feed-forward Gaussian Splatting
Taewoo Suh, Sungpyo Kim, Jongmin Park, Munchurl Kim
Main category: cs.CV
TL;DR: AA-Splat is an anti-aliased feed-forward 3D Gaussian Splatting method that enables robust rendering at any resolution by using opacity-balanced band-limiting to eliminate rendering artifacts from incorrect screen-space dilation filters.
Details
Motivation: Existing FF-3DGS methods suffer from severe rendering artifacts when rendering at out-of-distribution sampling rates due to incorrect screen-space dilation filters. There's a need for robust anti-aliased rendering that works at any resolution.
Method: Proposes AA-Splat with Opacity-Balanced Band-Limiting (OBBL): 1) 3D band-limiting post-filter integrates multi-view maximal frequency bounds into feed-forward reconstruction to band-limit 3D scene representations and eliminate degenerate Gaussians; 2) Opacity Balancing integrates all pixel-aligned Gaussian primitives into rendering, compensating for increased overlap between expanded Gaussian primitives.
Result: AA-Splat demonstrates drastic improvements with average 5.4-7.5dB PSNR gains on novel view synthesis performance over state-of-the-art baseline DepthSplat at all resolutions, between 4× and 1/4× sampling rates.
Conclusion: AA-Splat enables robust anti-aliased rendering at any resolution for feed-forward 3D Gaussian Splatting, significantly improving rendering quality across different sampling rates while eliminating artifacts from previous methods.
Abstract: Feed-forward 3D Gaussian Splatting (FF-3DGS) emerges as a fast and robust solution for sparse-view 3D reconstruction and novel view synthesis (NVS). However, existing FF-3DGS methods are built on incorrect screen-space dilation filters, causing severe rendering artifacts when rendering at out-of-distribution sampling rates. We firstly propose an FF-3DGS model, called AA-Splat, to enable robust anti-aliased rendering at any resolution. AA-Splat utilizes an opacity-balanced band-limiting (OBBL) design, which combines two components: a 3D band-limiting post-filter integrates multi-view maximal frequency bounds into the feed-forward reconstruction pipeline, effectively band-limiting the resulting 3D scene representations and eliminating degenerate Gaussians; an Opacity Balancing (OB) to seamlessly integrate all pixel-aligned Gaussian primitives into the rendering process, compensating for the increased overlap between expanded Gaussian primitives. AA-Splat demonstrates drastic improvements with average 5.4$\sim$7.5dB PSNR gains on NVS performance over a state-of-the-art (SOTA) baseline, DepthSplat, at all resolutions, between $4\times$ and $1/4\times$. Code will be made available.
[149] Hallucination-aware intermediate representation edit in large vision-language models
Wei Suo, Hanzu Zhang, Lijun Zhang, Ji Ma, Peng Wang, Yanning Zhang
Main category: cs.CV
TL;DR: A framework for dynamically detecting and editing hallucination representations in Large Vision-Language Models with minimal computational overhead.
Details
Motivation: Current hallucination mitigation methods (retraining and Contrastive Decoding) have practical limitations: retraining requires substantial resources, and CD introduces dual inference overhead, hindering real-world applicability.
Method: Proposes a framework that dynamically detects hallucination representations and performs hallucination-eliminating edits on these representations with minimal additional computational cost.
Result: Achieves state-of-the-art performance on existing benchmarks, demonstrating efficient and robust hallucination elimination with powerful controllability over hallucinations.
Conclusion: The proposed approach effectively addresses hallucination issues in Vision-Language Models with practical efficiency, offering a viable solution for real-world deployment.
Abstract: Large Vision-Language Models have demonstrated exceptional performance in multimodal reasoning and complex scene understanding. However, these models still face significant hallucination issues, where outputs contradict visual facts. Recent research on hallucination mitigation has focused on retraining methods and Contrastive Decoding (CD) methods. While both methods perform well, retraining methods require substantial training resources, and CD methods introduce dual inference overhead. These factors hinder their practical applicability. To address the above issue, we propose a framework for dynamically detecting hallucination representations and performing hallucination-eliminating edits on these representations. With minimal additional computational cost, we achieve state-of-the-art performance on existing benchmarks. Extensive experiments demonstrate the effectiveness of our approach, highlighting its efficient and robust hallucination elimination capability and its powerful controllability over hallucinations. Code is available at https://github.com/ASGO-MM/HIRE
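The abstract does not spell out the detect-then-edit rule, but representation editing of this kind is commonly sketched as projecting a hidden state onto a learned "hallucination direction" and removing that component; everything below (the single direction, `threshold`, `alpha`) is a hypothetical illustration, not HIRE's actual mechanism:

```python
import numpy as np

def edit_hidden_state(h, direction, threshold=0.5, alpha=1.0):
    """If the hidden state projects strongly onto a presumed hallucination
    direction, subtract alpha times that component (the 'edit'); otherwise
    pass the state through unchanged. Detection and editing reuse the same
    projection, so the overhead is one dot product per layer."""
    d = direction / np.linalg.norm(direction)
    proj = h @ d
    if abs(proj) > threshold:          # detection step
        return h - alpha * proj * d    # editing step
    return h

h = np.array([2.0, 1.0])
direction = np.array([1.0, 0.0])
edited = edit_hidden_state(h, direction)
```

A scheme like this is cheap at inference time, which matches the paper's stated advantage over retraining and dual-pass contrastive decoding.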
[150] AGFT: Alignment-Guided Fine-Tuning for Zero-Shot Adversarial Robustness of Vision-Language Models
Yubo Cui, Xianchao Guan, Zijun Xiong, Zheng Zhang
Main category: cs.CV
TL;DR: AGFT framework enhances zero-shot adversarial robustness in vision-language models while preserving cross-modal alignment through soft alignment distributions and distribution consistency calibration.
Details
Motivation: Existing adversarial fine-tuning methods for VLMs disrupt cross-modal alignment, weakening visual-textual correspondence and degrading zero-shot performance. There's a need for methods that enhance adversarial robustness while preserving the semantic structure between vision and language modalities.
Method: Proposes Alignment-Guided Fine-Tuning (AGFT) framework that uses probabilistic predictions of the original model for text-guided adversarial training. It aligns adversarial visual features with textual embeddings via soft alignment distributions and introduces a distribution consistency calibration mechanism to adjust robust model output to match temperature-scaled pre-trained model predictions.
Result: Extensive experiments across multiple zero-shot benchmarks show AGFT outperforms state-of-the-art methods while significantly improving zero-shot adversarial robustness.
Conclusion: AGFT effectively enhances adversarial robustness in VLMs while maintaining cross-modal alignment, addressing the trade-off between robustness and zero-shot generalization that plagues existing methods.
Abstract: Pre-trained vision-language models (VLMs) exhibit strong zero-shot generalization but remain vulnerable to adversarial perturbations. Existing classification-guided adversarial fine-tuning methods often disrupt pre-trained cross-modal alignment, weakening visual-textual correspondence and degrading zero-shot performance. In this paper, we propose an Alignment-Guided Fine-Tuning (AGFT) framework that enhances zero-shot adversarial robustness while preserving the cross-modal semantic structure. Unlike label-based methods that rely on hard labels and fail to maintain the relative relationships between image and text, AGFT leverages the probabilistic predictions of the original model for text-guided adversarial training, which aligns adversarial visual features with textual embeddings via soft alignment distributions, improving zero-shot adversarial robustness. To address structural discrepancies introduced by fine-tuning, we introduce a distribution consistency calibration mechanism that adjusts the robust model output to match a temperature-scaled version of the pre-trained model predictions. Extensive experiments across multiple zero-shot benchmarks demonstrate that AGFT outperforms state-of-the-art methods while significantly improving zero-shot adversarial robustness.
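Matching the robust model's output to a temperature-scaled teacher is classically done by minimizing a KL divergence between the two softmax distributions; a minimal sketch of that calibration objective (the logits, temperature, and plain KL form are toy assumptions, not AGFT's exact loss):

```python
import numpy as np

def softmax(z, temp=1.0):
    """Temperature-scaled softmax; higher temp -> softer distribution."""
    z = z / temp
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    """KL(p || q): penalty for the student distribution q drifting from
    the (soft) teacher distribution p."""
    return float(np.sum(p * np.log(p / q)))

teacher_logits = np.array([3.0, 1.0, 0.2])   # frozen pre-trained model
student_logits = np.array([2.5, 1.2, 0.1])   # adversarially fine-tuned model
p = softmax(teacher_logits, temp=2.0)        # temperature-scaled predictions
q = softmax(student_logits)
loss = kl(p, q)
```

Using the teacher's full soft distribution, rather than hard labels, is what preserves the relative image-text relationships the motivation paragraph says label-based methods destroy.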
[151] Native-Domain Cross-Attention for Camera-LiDAR Extrinsic Calibration Under Large Initial Perturbations
Ni Ou, Zhuo Chen, Xinru Zhang, Junzheng Wang
Main category: cs.CV
TL;DR: Proposed extrinsic-aware cross-attention framework for camera-LiDAR calibration that directly aligns image patches and LiDAR point groups without projecting to 2D depth maps, improving accuracy under large misalignments.
Details
Motivation: Existing learning-based camera-LiDAR fusion methods project LiDAR points into depth maps for feature fusion, which distorts 3D geometry and degrades performance when extrinsic initialization is far from ground truth. There is a need for more robust calibration under large misalignments.
Method: Extrinsic-aware cross-attention framework that directly aligns image patches and LiDAR point groups in their native domains. Explicitly injects extrinsic parameter hypotheses into correspondence modeling process, enabling geometry-consistent cross-modal interaction without relying on projected 2D depth maps.
Result: Outperforms state-of-the-art approaches on KITTI and nuScenes benchmarks in both accuracy and robustness. Under large extrinsic perturbations, achieves accurate calibration in 88% of KITTI cases and 99% of nuScenes cases, substantially surpassing second-best baseline.
Conclusion: Proposed method enables more robust camera-LiDAR extrinsic calibration by directly modeling cross-modal correspondences in native domains with explicit extrinsic parameter injection, avoiding geometry distortion from 2D projections.
Abstract: Accurate camera-LiDAR fusion relies on precise extrinsic calibration, which fundamentally depends on establishing reliable cross-modal correspondences under potentially large misalignments. Existing learning-based methods typically project LiDAR points into depth maps for feature fusion, which distorts 3D geometry and degrades performance when the extrinsic initialization is far from the ground truth. To address this issue, we propose an extrinsic-aware cross-attention framework that directly aligns image patches and LiDAR point groups in their native domains. The proposed attention mechanism explicitly injects extrinsic parameter hypotheses into the correspondence modeling process, enabling geometry-consistent cross-modal interaction without relying on projected 2D depth maps. Extensive experiments on the KITTI and nuScenes benchmarks demonstrate that our method consistently outperforms state-of-the-art approaches in both accuracy and robustness. Under large extrinsic perturbations, our approach achieves accurate calibration in 88% of KITTI cases and 99% of nuScenes cases, substantially surpassing the second-best baseline. We have open sourced our code on https://github.com/gitouni/ProjFusion to benefit the community.
[152] Adversarial Prompt Injection Attack on Multimodal Large Language Models
Meiwen Ding, Song Xia, Chenqi Kong, Xudong Jiang
Main category: cs.CV
TL;DR: Imperceptible visual prompt injection attacks on MLLMs using adaptive text overlays and iterative feature alignment optimization
Details
Motivation: MLLMs are vulnerable to prompt injection attacks, but existing methods use perceptible prompts; this work explores imperceptible visual prompt injection against closed-source MLLMs.
Method: Adaptively embeds malicious prompts via bounded text overlays, with iterative optimization aligning attacked image features with malicious visual/textual targets at coarse and fine-grained levels; uses text-rendered visual targets refined during optimization.
Result: Superior performance demonstrated on multimodal understanding tasks across multiple closed-source MLLMs compared to existing methods
Conclusion: Imperceptible visual prompt injection is effective against MLLMs, highlighting security vulnerabilities in multimodal AI systems
Abstract: Although multimodal large language models (MLLMs) are increasingly deployed in real-world applications, their instruction-following behavior leaves them vulnerable to prompt injection attacks. Existing prompt injection methods predominantly rely on textual prompts or perceptible visual prompts that are observable by human users. In this work, we study imperceptible visual prompt injection against powerful closed-source MLLMs, where adversarial instructions are embedded in the visual modality. Our method adaptively embeds the malicious prompt into the input image via a bounded text overlay to provide semantic guidance. Meanwhile, the imperceptible visual perturbation is iteratively optimized to align the feature representation of the attacked image with those of the malicious visual and textual targets at both coarse- and fine-grained levels. Specifically, the visual target is instantiated as a text-rendered image and progressively refined during optimization to more faithfully represent the desired semantics and improve transferability. Extensive experiments on two multimodal understanding tasks across multiple closed-source MLLMs demonstrate the superior performance of our approach compared to existing methods.
[153] Multimodal Models Meet Presentation Attack Detection on ID Documents
Marina Villanueva, Juan M. Espin, Juan E. Tapia
Main category: cs.CV
TL;DR: Multimodal models (Paligemma, Llava, Qwen) combining visual and textual features were tested for Presentation Attack Detection on ID documents but performed poorly despite the promising approach.
Details
Motivation: Traditional PAD systems relying solely on visual features are insufficient against sophisticated spoofing attacks on ID documents, motivating the exploration of multimodal approaches that combine visual and textual information for enhanced security.
Method: Used pre-trained multimodal models (Paligemma, Llava, Qwen) to combine deep visual embeddings with contextual textual metadata (document type, issuer, date) for presentation attack detection on ID documents.
Result: Experimental results showed that the multimodal models struggled to accurately detect presentation attacks on ID documents, indicating poor performance despite the promising approach.
Conclusion: While multimodal integration represents an advancement in biometric security, current pre-trained multimodal models are not effective for PAD on ID documents, suggesting need for specialized training or different architectures.
Abstract: The integration of multimodal models into Presentation Attack Detection (PAD) for ID Documents represents a significant advancement in biometric security. Traditional PAD systems rely solely on visual features, which often fail to detect sophisticated spoofing attacks. This study explores the combination of visual and textual modalities by utilizing pre-trained multimodal models, such as Paligemma, Llava, and Qwen, to enhance the detection of presentation attacks on ID Documents. This approach merges deep visual embeddings with contextual metadata (e.g., document type, issuer, and date). However, experimental results indicate that these models struggle to accurately detect PAD on ID Documents.
[154] A2BFR: Attribute-Aware Blind Face Restoration
Chenxin Zhu, Yushun Fang, Lu Liu, Shibo Yin, Xiaohong Liu, Xiaoyun Zhang, Qiang Hu, Guangtao Zhai
Main category: cs.CV
TL;DR: A²BFR is an attribute-aware blind face restoration framework that unifies high-fidelity reconstruction with prompt-controllable generation using diffusion transformers and text-image cross-modal attention.
Details
Motivation: Current blind face restoration methods are ill-posed and produce ambiguous solutions, while diffusion-based methods lack controllability and text-guided editing lacks reliable restoration. There's a need to unify high-fidelity reconstruction with controllable attribute manipulation.
Method: Uses Diffusion Transformer backbone with unified image-text cross-modal attention, jointly conditioning denoising on degraded inputs and textual prompts. Introduces attribute-aware learning using facial attribute embeddings from an attribute-aware encoder, and semantic dual-training leveraging pairwise attribute variations from the AttrFace-90K dataset.
Result: Achieves state-of-the-art performance: -0.0467 LPIPS improvement and +52.58% attribute accuracy over diffusion-based BFR baselines. Enables fine-grained, prompt-controllable restoration even under severe degradations.
Conclusion: A²BFR successfully unifies high-fidelity face restoration with prompt-controllable generation, addressing the limitations of both restoration-only and editing-only approaches through attribute-aware learning and semantic dual-training.
Abstract: Blind face restoration (BFR) aims to recover high-quality facial images from degraded inputs, yet its inherently ill-posed nature leads to ambiguous and uncontrollable solutions. Recent diffusion-based BFR methods improve perceptual quality but remain uncontrollable, whereas text-guided face editing enables attribute manipulation without reliable restoration. To address these issues, we propose A$^2$BFR, an attribute-aware blind face restoration framework that unifies high-fidelity reconstruction with prompt-controllable generation. Built upon a Diffusion Transformer backbone with unified image-text cross-modal attention, A$^2$BFR jointly conditions the denoising trajectory on both degraded inputs and textual prompts. To inject semantic priors, we introduce attribute-aware learning, which supervises denoising latents using facial attribute embeddings extracted by an attribute-aware encoder. To further enhance prompt controllability, we introduce semantic dual-training, which leverages the pairwise attribute variations in our newly curated AttrFace-90K dataset to enforce attribute discrimination while preserving fidelity. Extensive experiments demonstrate that A$^2$BFR achieves state-of-the-art performance in both restoration fidelity and instruction adherence, outperforming diffusion-based BFR baselines by -0.0467 LPIPS and +52.58% attribute accuracy, while enabling fine-grained, prompt-controllable restoration even under severe degradations.
[155] Seeing the Evidence, Missing the Answer: Tool-Guided Vision-Language Models on Visual Illusions
Xuesong Wang, Harry Wang
Main category: cs.CV
TL;DR: A tool-guided inference framework for VLMs that addresses systematic bias in optical illusion perception using generic image manipulation tools without model training.
Details
Motivation: VLMs show systematic bias when processing optical illusions, overwhelmingly predicting illusions as "real" regardless of counterfactual modifications. This reveals fundamental limitations in current vision-language understanding.
Method: Tool-guided inference framework with off-the-shelf VLM given access to generic image manipulation tools (line drawing, region cropping, side-by-side comparison, channel isolation) plus illusion-type-routing system prompt. Tools produce new immutable image resources in persistent registry for reference and composition throughout reasoning chain.
Result: Generic-tool-plus-routing design yields strong cross-structural generalization: performance remained consistent from validation set to test set with structurally unfamiliar illusion variants. Three empirical observations: strong positive-detection bias, dissociation between spatial reasoning and logical inference, and sensitivity to image compression artifacts.
Conclusion: Tool-guided inference effectively addresses VLM bias in optical illusion perception without training, demonstrating cross-structural generalization and revealing important empirical observations about VLM reasoning limitations.
Abstract: Vision-language models (VLMs) exhibit a systematic bias when confronted with classic optical illusions: they overwhelmingly predict the illusion as “real” regardless of whether the image has been counterfactually modified. We present a tool-guided inference framework for the DataCV 2026 Challenge (Tasks I and II) that addresses this failure mode without any model training. An off-the-shelf vision-language model is given access to a small set of generic image manipulation tools: line drawing, region cropping, side-by-side comparison, and channel isolation, together with an illusion-type-routing system prompt that prescribes which tools to invoke for each perceptual question category. Critically, every tool call produces a new, immutable image resource appended to a persistent registry, so the model can reference and compose any prior annotated view throughout its reasoning chain. Rather than hard-coding illusion-specific modules, this generic-tool-plus-routing design yields strong cross-structural generalization: performance remained consistent from the validation set to a test set containing structurally unfamiliar illusion variants (e.g., Mach Bands rotated from vertical to horizontal stacking). We further report three empirical observations that we believe warrant additional investigation: (i) a strong positive-detection bias likely rooted in imbalanced illusion training data, (ii) a striking dissociation between pixel-accurate spatial reasoning and logical inference over self-generated annotations, and (iii) pronounced sensitivity to image compression artifacts that compounds false positives.
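The "new, immutable image resource appended to a persistent registry" pattern from the abstract can be sketched as a small append-only store; the class name and payloads below are illustrative, not the challenge submission's actual code:

```python
class ToolRegistry:
    """Append-only registry of tool outputs. Each tool call yields a fresh,
    immutable resource with a stable id, so the model can reference and
    compose any prior annotated view later in its reasoning chain."""

    def __init__(self):
        self._resources = []   # list of (name, payload); never mutated in place

    def add(self, name, payload):
        rid = len(self._resources)          # ids are assigned sequentially
        self._resources.append((name, payload))
        return rid

    def get(self, rid):
        return self._resources[rid]

reg = ToolRegistry()
orig = reg.add("original", "image-bytes")
crop = reg.add("crop(original, region)", "cropped-bytes")  # derived view
```

Immutability matters here: because earlier views are never overwritten, a later reasoning step can safely compare, say, the original image against a cropped or line-annotated view produced many steps before.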
[156] SeGPruner: Semantic-Geometric Visual Token Pruner for 3D Question Answering
Wenli Li, Kai Zhao, Haoran Jiang, Enquan Yang, Yi Su, Dan Zeng
Main category: cs.CV
TL;DR: SeGPruner is a semantic-aware and geometry-guided token reduction framework for efficient 3D question answering with multi-view images, reducing visual tokens by 91% while maintaining competitive performance.
Details
Motivation: Current 3D QA pipelines using multi-view images suffer from severe token redundancy when aggregating multiple viewpoints, leading to inefficient inference under constrained token budgets. Existing token pruning methods are tailored for 2D inputs or rely on indirect geometric cues, limiting their ability to retain semantically critical objects and maintain spatial coverage for robust 3D reasoning.
Method: Proposes SeGPruner with two components: 1) Saliency-aware Token Selector that preserves semantically salient tokens using attention-based importance, and 2) Geometry-aware Token Diversifier that complements with spatially diverse tokens by jointly considering semantic relevance and 3D geometric distance. This balances object-level evidence and global scene coverage under aggressive token reduction.
Result: Extensive experiments on ScanQA and OpenEQA show SeGPruner reduces visual token budget by 91% and inference latency by 86% while maintaining competitive performance in 3D reasoning tasks.
Conclusion: SeGPruner effectively addresses token redundancy in 3D QA by combining semantic saliency preservation with geometry-guided diversification, achieving significant efficiency improvements without compromising reasoning performance.
Abstract: Vision-language models (VLMs) have been widely adopted for 3D question answering (3D QA). In typical pipelines, visual tokens extracted from multiple viewpoints are concatenated with language tokens and jointly processed by a large language model (LLM) for inference. However, aggregating multi-view observations inevitably introduces severe token redundancy, leading to an overly large visual token set that significantly hinders inference efficiency under constrained token budgets. Visual token pruning has emerged as a prevalent strategy to address this issue. Nevertheless, most existing pruners are primarily tailored to 2D inputs or rely on indirect geometric cues, which limits their ability to explicitly retain semantically critical objects and maintain sufficient spatial coverage for robust 3D reasoning. In this paper, we propose SeGPruner, a semantic-aware and geometry-guided token reduction framework for efficient 3D QA with multi-view images. Specifically, SeGPruner first preserves semantically salient tokens through an attention-based importance module (Saliency-aware Token Selector), ensuring that object-critical evidence is retained. It then complements these tokens with spatially diverse ones via a geometry-guided selector (Geometry-aware Token Diversifier), which jointly considers semantic relevance and 3D geometric distance. This cooperation between saliency preservation and geometry-guided diversification balances object-level evidence and global scene coverage under aggressive token reduction. Extensive experiments on ScanQA and OpenEQA demonstrate that SeGPruner substantially improves inference efficiency, reducing the visual token budget by 91% and inference latency by 86%, while maintaining competitive performance in 3D reasoning tasks.
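The two-stage selection described above can be sketched in plain NumPy. This is a minimal illustration, not the paper's implementation: the linear score combination, the `alpha` trade-off, and the half-and-half budget split between the two stages are all assumptions.

```python
import numpy as np

def segpruner_select(attn_scores, coords, budget, alpha=0.5, saliency_frac=0.5):
    """Illustrative two-stage token selection in the spirit of SeGPruner.
    attn_scores: (N,) attention-based importance per visual token
    coords:      (N, 3) 3D position associated with each token
    budget:      total number of tokens to keep
    """
    n_sal = int(budget * saliency_frac)
    order = np.argsort(-attn_scores)
    # Stage 1 (Saliency-aware Token Selector): keep the most attended tokens.
    keep = list(order[:n_sal])
    # Stage 2 (Geometry-aware Token Diversifier): greedily add tokens that
    # jointly score high on semantic relevance and 3D distance to the kept set.
    remaining = list(order[n_sal:])
    while len(keep) < budget and remaining:
        dists = [np.min(np.linalg.norm(coords[keep] - coords[i], axis=1))
                 for i in remaining]
        scores = [alpha * attn_scores[i] + (1 - alpha) * d
                  for i, d in zip(remaining, dists)]
        keep.append(remaining.pop(int(np.argmax(scores))))
    return np.array(keep)

rng = np.random.default_rng(0)
attn = rng.random(40)
coords = rng.random((40, 3))
kept = segpruner_select(attn, coords, budget=8)  # a 91% reduction would be budget ≈ 0.09 * N
```

The greedy max-min-distance step is one standard way to realize "spatial diversity"; the paper's exact diversifier may differ.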
[157] Less Is More? Selective Visual Attention to High-Importance Regions for Multimodal Radiology Summarization
Mst. Fahmida Sultana Naznin, Adnan Ibney Faruq, Mushfiqur Rahman, Niloy Kumar Mondal, Md. Mehedi Hasan Shawon, Md Rakibul Hasan
Main category: cs.CV
TL;DR: ViTAS introduces a multi-stage multimodal summarization pipeline for radiology reports that selectively focuses on pathology-relevant visual patches rather than full images, achieving state-of-the-art performance by demonstrating that less but more relevant visual input is superior.
Details
Motivation: Existing multimodal models for radiology report summarization struggle with visual noise and fail to meaningfully improve over text-only baselines. The paper challenges two prevailing assumptions: (1) that more visual input is always better, and (2) that multimodal models add limited value when findings already contain rich image-derived detail.
Method: ViTAS (Visual-Text Attention Summarizer) is a multi-stage pipeline combining ensemble-guided MedSAM2 lung segmentation, bidirectional cross-attention for multi-view fusion, Shapley-guided adaptive patch clustering, and hierarchical visual tokenization feeding a Vision Transformer (ViT). The approach selectively focuses on pathology-relevant visual patches rather than full images.
Result: Achieves state-of-the-art results with 29.25% BLEU-4 and 69.83% ROUGE-L on MIMIC-CXR benchmark, improved factual alignment in qualitative analysis, and the highest expert-rated human evaluation scores.
Conclusion: Less but more relevant visual input is not only sufficient but superior for multimodal radiology summarization, challenging the assumption that more visual input is always better.
Abstract: Automated radiology report summarization aims to distill verbose findings into concise clinical impressions, but existing multimodal models often struggle with visual noise and fail to meaningfully improve over strong text-only baselines in the FINDINGS $\to$ IMPRESSION transformation. We challenge two prevailing assumptions: (1) that more visual input is always better, and (2) that multimodal models add limited value when findings already contain rich image-derived detail. Through controlled ablations on the MIMIC-CXR benchmark, we show that selectively focusing on pathology-relevant visual patches rather than full images yields substantially better performance. We introduce ViTAS (Visual-Text Attention Summarizer), a multi-stage pipeline that combines ensemble-guided MedSAM2 lung segmentation, bidirectional cross-attention for multi-view fusion, Shapley-guided adaptive patch clustering, and hierarchical visual tokenization feeding a ViT. ViTAS achieves SOTA results with 29.25% BLEU-4 and 69.83% ROUGE-L, improved factual alignment in qualitative analysis, and the highest expert-rated human evaluation scores. Our findings demonstrate that less but more relevant visual input is not only sufficient but superior for multimodal radiology summarization.
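The bidirectional cross-attention used for multi-view fusion can be sketched as follows. This is a generic NumPy illustration under assumed design choices (shared projection matrices across views, residual fusion, concatenated output); the paper's actual fusion block is not specified at this level of detail.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(q_feats, kv_feats, Wq, Wk, Wv):
    # Standard scaled dot-product attention: queries from one view,
    # keys/values from the other.
    Q, K, V = q_feats @ Wq, kv_feats @ Wk, kv_feats @ Wv
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def bidirectional_fusion(view_a, view_b, Wq, Wk, Wv):
    # Each view queries the other; a residual keeps the original features.
    a = view_a + cross_attend(view_a, view_b, Wq, Wk, Wv)
    b = view_b + cross_attend(view_b, view_a, Wq, Wk, Wv)
    return np.concatenate([a, b], axis=0)

rng = np.random.default_rng(1)
d = 16
frontal, lateral = rng.normal(size=(12, d)), rng.normal(size=(10, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
fused = bidirectional_fusion(frontal, lateral, Wq, Wk, Wv)
```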
[158] EarthEmbeddingExplorer: A Web Application for Cross-Modal Retrieval of Global Satellite Images
Yijie Zheng, Weijie Wu, Bingyue Wu, Long Zhao, Guoqing Li, Mikolaj Czerkawski, Konstantin Klemmer
Main category: cs.CV
TL;DR: EarthEmbeddingExplorer is an interactive web app that makes Earth observation foundation models and embeddings accessible through cross-modal queries (text, visual, geolocation) for practical scientific workflows.
Details
Motivation: Despite advances in Earth observation foundation models and global embedding datasets, there's a gap in translating these academic assets into freely accessible, practical tools for researchers.
Method: Developed EarthEmbeddingExplorer, a cloud-native web application with an interactive interface for cross-modal queries using natural language, visual inputs, and geolocation to explore precomputed Earth embeddings.
Result: Created a publicly available web application that enables researchers to derive scientific insights from Earth observation embeddings through intuitive query interfaces and retrieval workflows.
Conclusion: The tool democratizes access to Earth observation embeddings, bridging the gap between academic research and practical applications by providing interactive exploration capabilities.
Abstract: While the Earth observation community has witnessed a surge in high-impact foundation models and global Earth embedding datasets, a significant barrier remains in translating these academic assets into freely accessible tools. This tutorial introduces EarthEmbeddingExplorer, an interactive web application designed to bridge this gap, transforming static research artifacts into dynamic, practical workflows for discovery. We will provide a comprehensive hands-on guide to the system, detailing its cloud-native software architecture, demonstrating cross-modal queries (natural language, visual, and geolocation), and showcasing how to derive scientific insights from retrieval results. By democratizing access to precomputed Earth embeddings, this tutorial empowers researchers to seamlessly transition from state-of-the-art models and data archives to real-world application and analysis. The web application is available at https://modelscope.ai/studios/Major-TOM/EarthEmbeddingExplorer.
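The retrieval core of such a cross-modal explorer is cosine-similarity search over a shared embedding space. A minimal sketch (the embedding dimensionality and data here are stand-ins, not the tool's actual index):

```python
import numpy as np

def retrieve(query_emb, bank, k=3):
    """Rank embedding-bank rows by cosine similarity to a query embedding.
    Works the same whether the query came from text, an image, or a location,
    as long as all modalities are embedded into the same space."""
    q = query_emb / np.linalg.norm(query_emb)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    sims = b @ q
    top = np.argsort(-sims)[:k]
    return top, sims[top]

rng = np.random.default_rng(2)
bank = rng.normal(size=(100, 64))               # stand-in for precomputed Earth embeddings
query = bank[42] + 0.01 * rng.normal(size=64)   # a query very close to item 42
top, sims = retrieve(query, bank, k=5)
```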
[159] NeoNet: An End-to-End 3D MRI-Based Deep Learning Framework for Non-Invasive Prediction of Perineural Invasion via Generation-Driven Classification
Youngung Han, Minkyung Cha, Kyeonghun Kim, Induk Um, Myeongbin Sho, Joo Young Bae, Jaewon Jung, Jung Hyeok Park, Seojun Lee, Nam-Joon Kim, Woo Kyoung Jeong, Won Jae Lee, Pa Hong, Ken Ying-Kai Liao, Hyuk-Jae Lee
Main category: cs.CV
TL;DR: NeoNet is a 3D deep learning framework for predicting perineural invasion in cholangiocarcinoma using tumor segmentation, synthetic data generation, and specialized attention-based classification.
Details
Motivation: To develop a noninvasive diagnostic method for perineural invasion (PNI) in cholangiocarcinoma, reducing the need for invasive procedures while addressing the challenge of inconsistent imaging criteria for PNI identification.
Method: Three-module framework: 1) NeoSeg with Tumor-Localized ROI Crop for segmentation, 2) NeoGen using 3D Latent Diffusion Model with ControlNet to generate synthetic image patches for dataset balancing, and 3) NeoCls with PNI-Attention Network using frozen LDM encoder and 3D Dual Attention Blocks for classification.
Result: NeoNet achieved maximum AUC of 0.7903 in 5-fold cross-validation, outperforming baseline 3D models for PNI prediction in cholangiocarcinoma.
Conclusion: The integrated deep learning framework demonstrates potential for noninvasive PNI diagnosis without relying on predefined image features, showing improved performance over baseline models.
Abstract: Minimizing invasive diagnostic procedures to reduce the risk of patient injury and infection is a central goal in medical imaging. Yet noninvasive diagnosis of perineural invasion (PNI), a critical prognostic factor involving infiltration of tumor cells along the surrounding nerve, remains challenging due to the lack of clear and consistent imaging criteria for identifying PNI. To address this challenge, we present NeoNet, an integrated end-to-end 3D deep learning framework for PNI prediction in cholangiocarcinoma that does not rely on predefined image features. NeoNet integrates three modules: (1) NeoSeg, utilizing a Tumor-Localized ROI Crop (TLCR) algorithm; (2) NeoGen, a 3D Latent Diffusion Model (LDM) with ControlNet, conditioned on anatomical masks to generate synthetic image patches, specifically balancing the dataset to a 1:1 ratio; and (3) NeoCls, the final prediction module. For NeoCls, we developed the PNI-Attention Network (PattenNet), which uses the frozen LDM encoder and specialized 3D Dual Attention Blocks (DAB) designed to detect subtle intensity variations and spatial patterns indicative of PNI. In 5-fold cross-validation, NeoNet outperformed baseline 3D models and achieved the highest performance with a maximum AUC of 0.7903.
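A tumor-localized ROI crop like the one NeoSeg's TLCR performs can be sketched as a mask-driven bounding-box crop. The details of TLCR are not given in the abstract, so the fixed voxel margin and simple bounding-box logic below are assumptions.

```python
import numpy as np

def tumor_localized_crop(volume, mask, margin=2):
    """Crop a 3D volume to the tumor mask's bounding box plus a voxel margin."""
    idx = np.argwhere(mask)
    lo = np.maximum(idx.min(axis=0) - margin, 0)            # clamp to volume bounds
    hi = np.minimum(idx.max(axis=0) + margin + 1, volume.shape)
    return volume[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]]

vol = np.zeros((32, 32, 32))
mask = np.zeros_like(vol, dtype=bool)
mask[10:14, 12:18, 8:11] = True   # a synthetic "tumor" region
vol[mask] = 1.0
roi = tumor_localized_crop(vol, mask, margin=2)
```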
[160] Few-shot Writer Adaptation via Multimodal In-Context Learning
Tom Simon, Stephane Nicolas, Pierrick Tranouez, Clement Chatelain, Thierry Paquet
Main category: cs.CV
TL;DR: A context-driven HTR framework using multimodal in-context learning for few-shot writer adaptation without parameter updates at inference time.
Details
Motivation: Current HTR models struggle with underrepresented handwriting styles, and existing writer adaptation methods require computationally expensive fine-tuning or parameter updates at inference time.
Method: Proposes a context-driven HTR framework inspired by multimodal in-context learning, using a compact 8M-parameter CNN-Transformer that enables few-shot adaptation with only a few examples from the target writer and no parameter updates.
Result: Achieves Character Error Rates of 3.92% on IAM and 2.34% on RIMES, surpassing all writer-independent HTR models without inference-time parameter updates.
Conclusion: The context-driven approach enables effective writer adaptation with minimal computational overhead, and combining it with standard OCR training yields complementary improvements.
Abstract: While state-of-the-art Handwritten Text Recognition (HTR) models perform well on standard benchmarks, they frequently struggle with writers exhibiting highly specific styles that are underrepresented in the training data. To handle unseen and atypical writers, writer adaptation techniques personalize HTR models to individual handwriting styles. Leading writer adaptation methods require either offline fine-tuning or parameter updates at inference time, both involving gradient computation and backpropagation, which increase computational costs and demand careful hyperparameter tuning. In this work, we propose a novel context-driven HTR framework inspired by multimodal in-context learning, enabling inference-time writer adaptation using only a few examples from the target writer without any parameter updates. We further demonstrate the impact of context length, design a compact 8M-parameter CNN-Transformer that enables few-shot in-context adaptation, and show that combining context-driven and standard OCR training strategies leads to complementary improvements. Experiments on IAM and RIMES validate our approach with Character Error Rates of 3.92% and 2.34%, respectively, surpassing all writer-independent HTR models without requiring any parameter updates at inference time.
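Multimodal in-context adaptation of this kind amounts to assembling the model input from (image, transcript) pairs of the target writer followed by the query image, with no weight updates. The separator token and flat token layout below are assumptions for illustration; the paper's actual input format is not specified in the abstract.

```python
def build_icl_sequence(context_pairs, query_image_tokens, sep_id=-1):
    """Interleave (image_tokens, transcript_tokens) examples from the target
    writer, then append the query image tokens to be transcribed."""
    seq = []
    for img_toks, txt_toks in context_pairs:
        seq += list(img_toks) + [sep_id] + list(txt_toks) + [sep_id]
    return seq + list(query_image_tokens)

# Two hypothetical context examples (token ids are placeholders).
ctx = [([101, 102, 103], [7, 8]), ([104, 105], [9])]
seq = build_icl_sequence(ctx, [106, 107])
```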
[161] FedDBP: Enhancing Federated Prototype Learning with Dual-Branch Features and Personalized Global Fusion
Ningzhi Gao, Siquan Huang, Leyu Shi, Ying Gao
Main category: cs.CV
TL;DR: FedDBP: A federated prototype learning method with dual-branch feature projection for fidelity and discriminability, plus personalized global prototype fusion using Fisher information.
Details
Motivation: Existing federated prototype learning methods fail to balance feature fidelity and discriminability, and are limited by single global prototypes, which hinders performance in heterogeneous federated learning scenarios.
Method: Client-side: Dual-branch feature projector using L2 alignment and contrastive learning simultaneously. Server-side: Personalized global prototype fusion leveraging Fisher information to identify important channels of local prototypes.
Result: Extensive experiments demonstrate superiority over ten existing advanced methods in federated learning benchmarks.
Conclusion: FedDBP effectively addresses feature fidelity-discriminability trade-off and single prototype limitations in federated prototype learning, showing state-of-the-art performance.
Abstract: Federated prototype learning (FPL), as a solution to heterogeneous federated learning (HFL), effectively alleviates the challenges of data and model heterogeneity. However, existing FPL methods fail to balance the fidelity and discriminability of features, and are limited by a single global prototype. In this paper, we propose FedDBP, a novel FPL method to address the above issues. On the client side, we design a Dual-Branch feature projector that employs L2 alignment and contrastive learning simultaneously, thereby ensuring both the fidelity and discriminability of local features. On the server side, we introduce a Personalized global prototype fusion approach that leverages Fisher information to identify the important channels of local prototypes. Extensive experiments demonstrate the superiority of FedDBP over ten existing advanced methods.
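Fisher-weighted prototype fusion on the server can be sketched as a per-channel weighted average, where each client's contribution to a channel is proportional to its Fisher information estimate for that channel. The normalization scheme below is an assumed instantiation, not FedDBP's exact rule.

```python
import numpy as np

def fuse_prototypes(local_protos, fisher, eps=1e-8):
    """Fuse per-client class prototypes, weighting each channel by its Fisher
    information so clients with more informative channels dominate there.
    local_protos, fisher: (num_clients, dim)"""
    w = fisher / (fisher.sum(axis=0, keepdims=True) + eps)   # normalize per channel
    return (w * local_protos).sum(axis=0)

protos = np.array([[1.0, 0.0],
                   [0.0, 1.0]])
fisher = np.array([[10.0, 0.1],    # client 0 is informative on channel 0
                   [0.1, 10.0]])   # client 1 is informative on channel 1
fused = fuse_prototypes(protos, fisher)
```

With this weighting, the fused prototype inherits each channel mostly from the client that is confident about it, rather than averaging informative and uninformative clients equally.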
[162] Square Superpixel Generation and Representation Learning via Granular Ball Computing
Shuyin Xia, Meng Yang, Dawei Dai, Fan Chen, Shilin Zhao, Junwei Han, Xinbo Gao, Guoyin Wang, Wen Lu
Main category: cs.CV
TL;DR: A square superpixel generation method using multi-scale square blocks for efficient parallel processing and integration with deep learning architectures like GNNs and ViTs.
Details
Motivation: Existing superpixel algorithms produce irregularly shaped regions that don't align well with regular operators like convolutions, limiting parallel implementation and end-to-end optimization in deep learning pipelines.
Method: Develop a square superpixel approach using multi-scale square blocks, compute purity scores based on pixel-intensity similarity for each block, and select high-quality blocks to create regular-shaped superpixels.
Result: Experimental results on downstream tasks demonstrate consistent performance improvements, validating the effectiveness of the proposed method.
Conclusion: Square superpixels enable efficient parallel processing and learnable feature extraction, and can be readily integrated as graph nodes in GNNs or tokens in Vision Transformers for multi-scale information aggregation.
Abstract: Superpixels provide a compact region-based representation that preserves object boundaries and local structures, and have therefore been widely used in a variety of vision tasks to reduce computational cost. However, most existing superpixel algorithms produce irregularly shaped regions, which are not well aligned with regular operators such as convolutions. Consequently, superpixels are often treated as an offline preprocessing step, limiting parallel implementation and hindering end-to-end optimization within deep learning pipelines. Motivated by the adaptive representation and coverage property of granular-ball computing, we develop a square superpixel generation approach. Specifically, we approximate superpixels using multi-scale square blocks to avoid the computational and implementation difficulties induced by irregular shapes, enabling efficient parallel processing and learnable feature extraction. For each block, a purity score is computed based on pixel-intensity similarity, and high-quality blocks are selected accordingly. The resulting square superpixels can be readily integrated as graph nodes in graph neural networks (GNNs) or as tokens in Vision Transformers (ViTs), facilitating multi-scale information aggregation and structured visual representation. Experimental results on downstream tasks demonstrate consistent performance improvements, validating the effectiveness of the proposed method.
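The purity scoring of square blocks can be sketched at a single scale as below; the inverse-variance purity measure is an assumed concrete form of "pixel-intensity similarity", and the multi-scale selection loop is omitted.

```python
import numpy as np

def block_purity(img, size):
    """Purity of each non-overlapping size x size block: high when the block
    is homogeneous in intensity, low when it straddles an object boundary."""
    scores = {}
    H, W = img.shape
    for y in range(0, H - size + 1, size):
        for x in range(0, W - size + 1, size):
            scores[(y, x)] = 1.0 / (1.0 + img[y:y + size, x:x + size].var())
    return scores

img = np.zeros((8, 8))
img[:, 4:] = 1.0                  # a vertical edge down the middle
fine = block_purity(img, 4)       # four blocks, none crossing the edge
coarse = block_purity(img, 8)     # one block that straddles the edge
```

A multi-scale variant would keep high-purity blocks at the coarsest scale where they pass a threshold and recurse into the rest, yielding square superpixels that respect boundaries.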
[163] VecAttention: Vector-wise Sparse Attention for Accelerating Long Context Inference
Anmin Liu, Ruixuan Yang, Huiqiang Jiang, Bin Lin, Minmin Sun, Yong Li, Chen Zhang, Tao Xie
Main category: cs.CV
TL;DR: VecAttention: A vector-wise sparse attention framework for efficient long-context video understanding and generation that leverages vertical-vector sparse patterns in video attention maps.
Details
Motivation: Long-context video understanding and generation face computational challenges due to quadratic complexity of self-attention in Transformers. Existing sparse attention methods use coarse-grained patterns leading to redundant computation and suboptimal performance.
Method: Proposes VecAttention framework that dynamically selects and processes only informative vertical vectors through lightweight important-vector selection and optimized kernel for vector sparse attention, based on observation that video attention maps exhibit strong vertical-vector sparse patterns.
Result: Achieves 2.65× speedup over full attention and 1.83× speedup over state-of-the-art sparse attention methods on video understanding (VideoMME, LongVideoBench, VCRBench) and generation (VBench) tasks, with comparable accuracy to full attention.
Conclusion: VecAttention provides superior accuracy-efficiency trade-offs for video models by exploiting vertical-vector sparse patterns in attention maps, enabling efficient long-context video processing.
Abstract: Long-context video understanding and generation pose a significant computational challenge for Transformer-based video models due to the quadratic complexity of self-attention. While existing sparse attention methods employ coarse-grained patterns to improve efficiency, they typically incur redundant computation and suboptimal performance. To address this issue, in this paper, we propose \textbf{VecAttention}, a novel framework of vector-wise sparse attention that achieves superior accuracy-efficiency trade-offs for video models. We observe that video attention maps exhibit a strong vertical-vector sparse pattern, and further demonstrate that this vertical-vector pattern offers consistently better accuracy-sparsity trade-offs compared with existing coarse-grained sparse patterns. Based on this observation, VecAttention dynamically selects and processes only informative vertical vectors through a lightweight important-vector selection that minimizes memory access overhead and an optimized kernel of vector sparse attention. Comprehensive evaluations on video understanding (VideoMME, LongVideoBench, and VCRBench) and generation (VBench) tasks show that VecAttention delivers a 2.65$\times$ speedup over full attention and a 1.83$\times$ speedup over state-of-the-art sparse attention methods, with comparable accuracy to full attention. Our code is available at https://github.com/anminliu/VecAttention.
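Vertical-vector sparsity means keeping only the most-attended key columns of the attention map. The sketch below scores columns from the full attention matrix for clarity; the paper instead uses a lightweight important-vector selector precisely to avoid materializing full attention, so treat this only as a functional illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def vec_sparse_attention(Q, K, V, keep_ratio=0.25):
    """Keep only the highest-mass key columns ("vertical vectors")."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    col_mass = softmax(scores).sum(axis=0)        # per-column attention mass
    k = max(1, int(keep_ratio * K.shape[0]))
    cols = np.sort(np.argsort(-col_mass)[:k])     # indices of kept key columns
    # Attention restricted to the selected columns (renormalized by softmax).
    return softmax(scores[:, cols]) @ V[cols], cols

rng = np.random.default_rng(3)
Q, K, V = (rng.normal(size=(64, 32)) for _ in range(3))
out, cols = vec_sparse_attention(Q, K, V, keep_ratio=0.25)
```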
[164] All-in-One Augmented Reality Guided Head and Neck Tumor Resection
Yue Yang, Matthieu Chabanas, Carrie Reale, Annie Benson, Jason Slagle, Matthew Weinger, Michael Topf, Jie Ying Wu
Main category: cs.CV
TL;DR: AR system using HoloLens 2 with markerless registration to visualize positive cancer margins in situ for more precise surgical re-excision
Details
Motivation: Current intraoperative re-resection of positive margins in head and neck cancer is imprecise because margin locations are communicated verbally from pathology, leading to inaccurate localization during surgery.
Method: Developed an all-in-one augmented reality system using HoloLens 2 depth sensing with fully automated markerless surface registration to relocalize positive margins from resected specimens to the surgical resection bed.
Result: Markerless registration achieved target registration errors comparable to marker-based baseline (median 1.8mm vs 1.7mm). AR guidance reduced localization error from 14.2mm with verbal guidance to 3.2mm, with all AR localizations within 5mm error
Conclusion: The system demonstrates feasibility of markerless AR margin guidance for more precise intraoperative re-excision in head and neck cancer surgery
Abstract: Positive margins are common in head and neck squamous cell carcinoma, yet intraoperative re-resection is often imprecise because margin locations are typically communicated verbally from pathology. We present an all-in-one augmented reality (AR) system that relocalizes positive margins from a resected specimen to the resection bed and visualizes them in situ using HoloLens 2 depth sensing and fully automated markerless surface registration. In a silicone phantom study with six medical trainees, markerless registration achieved target registration errors comparable to a marker-based baseline (median 1.8 mm vs. 1.7 mm; maximum < 4 mm). In a margin relocalization task, AR guidance reduced error from verbal guidance (median 14.2 mm) to a few millimeters (median 3.2 mm), with all AR localizations within 5 mm error. These results support the feasibility of markerless AR margin guidance for more precise intraoperative re-excision.
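At the heart of any rigid surface registration (and of the target registration error reported above) is the least-squares alignment of two point sets. The Kabsch solution below assumes known point correspondences; the paper's markerless pipeline works on raw depth-sensed surfaces without them, so this is an illustrative building block, not the system itself.

```python
import numpy as np

def rigid_fit(P, Q):
    """Least-squares rotation R and translation t with R @ P[i] + t ≈ Q[i]
    (Kabsch algorithm on corresponding 3D points)."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))        # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cq - R @ cp

rng = np.random.default_rng(4)
P = rng.normal(size=(50, 3))
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([1.0, -2.0, 0.5])
Q = P @ R_true.T + t_true
R, t = rigid_fit(P, Q)
tre = np.linalg.norm(P @ R.T + t - Q, axis=1)     # per-point target registration error
```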
[165] Transmittance-Guided Structure-Texture Decomposition for Nighttime Image Dehazing
Francesco Moretti, Giulia Bianchi, Andrea Gallo
Main category: cs.CV
TL;DR: Two-stage nighttime image dehazing framework combining transmittance correction with structure-texture layered optimization to address multiple degradation factors in hazy nighttime images.
Details
Motivation: Nighttime hazy images suffer from severe quality degradation including low visibility, color distortion, and reduced contrast due to atmospheric scattering, particle absorption, and non-uniform artificial lighting. Existing methods only address subsets of these issues without tackling the full spectrum of degradation factors.
Method: Two-stage framework: 1) Transmittance correction with boundary-constrained initial transmittance maps, region-adaptive compensation/normalization, and quadratic Gaussian filtering in YUV space for atmospheric light estimation; 2) STAR-YUV decomposition separating image into structure/texture layers, applying gamma correction and MSRCR to structure layer for illumination/color correction, and Laplacian-of-Gaussian filtering to texture layer for detail enhancement, followed by two-phase fusion strategy.
Result: The method produces improved nighttime dehazing results by jointly addressing multiple degradation factors including glow suppression, brightness enhancement, color correction, and detail preservation.
Conclusion: The proposed framework effectively handles the complex degradation in nighttime hazy images through integrated transmittance correction and structure-texture optimization, outperforming existing methods that only address subsets of the problems.
Abstract: Nighttime images captured under hazy conditions suffer from severe quality degradation, including low visibility, color distortion, and reduced contrast, caused by the combined effects of atmospheric scattering, absorption by suspended particles, and non-uniform illumination from artificial light sources. While existing nighttime dehazing methods have achieved partial success, they typically address only a subset of these issues, such as glow suppression or brightness enhancement, without jointly tackling the full spectrum of degradation factors. In this paper, we propose a two-stage nighttime image dehazing framework that integrates transmittance correction with structure-texture layered optimization. In the first stage, we introduce a novel transmittance correction method that establishes boundary-constrained initial transmittance maps and subsequently applies region-adaptive compensation and normalization based on whether image regions correspond to light source areas. A quadratic Gaussian filtering scheme operating in the YUV color space is employed to estimate the spatially varying atmospheric light map. The corrected transmittance map and atmospheric light map are then used in conjunction with an improved nighttime imaging model to produce the initial dehazed image. In the second stage, we propose a STAR-YUV decomposition model that separates the dehazed image into structure and texture layers within the YUV color space. Gamma correction and MSRCR-based color restoration are applied to the structure layer for illumination compensation and color bias correction, while Laplacian-of-Gaussian filtering is applied to the texture layer for detail enhancement. A novel two-phase fusion strategy, comprising nonlinear Retinex-based fusion of the enhanced layers followed by linear blending with the initial dehazing result, yields the final output.
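Both stages rest on the standard atmospheric scattering model, I = J·t + A·(1 − t), inverted as J = (I − A)/max(t, t_min) + A. A minimal sketch with a globally constant atmospheric light A for the demo (the paper estimates a spatially varying A map and a corrected t map):

```python
import numpy as np

def dehaze(I, A, t, t_min=0.1):
    """Invert the scattering model I = J*t + A*(1 - t) for scene radiance J.
    Clamping t avoids amplifying noise where transmittance is near zero."""
    t = np.clip(t, t_min, 1.0)[..., None]            # broadcast over color channels
    return (I - A) / t + A

rng = np.random.default_rng(5)
J = rng.random((16, 16, 3))                          # true haze-free scene
t = 0.3 + 0.6 * rng.random((16, 16))                 # transmittance in [0.3, 0.9]
A = np.array([0.8, 0.8, 0.85])                       # atmospheric light
I = J * t[..., None] + A * (1 - t[..., None])        # synthesize the hazy image
recovered = dehaze(I, A, t)
```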
[166] Quantization with Unified Adaptive Distillation to enable multi-LoRA based one-for-all Generative Vision Models on edge
Sowmya Vajrala, Aakash Parmar, Prasanna R, Sravanth Kodavanti, Manjunath Arveti, Srinivas Soumitri Miriyala, Ashok Senapati
Main category: cs.CV
TL;DR: A unified framework for multi-task GenAI inference on edge devices using shared models with LoRA weights as runtime inputs and QUAD quantization strategy.
Details
Motivation: Deploying Large Vision Models (LVMs) on resource-constrained mobile devices is challenging due to high memory/compute requirements. Existing mobile deployment pipelines compile separate model binaries for each LoRA + foundation model, causing redundant storage and runtime overhead.
Method: Treat LoRA weights as runtime inputs rather than embedding them into the compiled model graph, enabling dynamic task switching without recompilation. Introduce QUAD (Quantization with Unified Adaptive Distillation), a quantization-aware training strategy that aligns multiple LoRA adapters under a shared quantization profile. Implement lightweight runtime stack compatible with mobile NPUs.
Result: Experimental results show up to 6x reduction in memory footprint and 4x latency improvements while maintaining high visual quality across multiple GenAI tasks. System works across multiple chipsets.
Conclusion: Proposed framework enables efficient multi-task GenAI inference on edge devices with significant memory and latency improvements through unified model approach and QUAD quantization strategy.
Abstract: Generative Artificial Intelligence (GenAI) features such as image editing, object removal, and prompt-guided image transformation are increasingly integrated into mobile applications. However, deploying Large Vision Models (LVMs) for such tasks on resource-constrained devices remains challenging due to their high memory and compute requirements. While Low-Rank Adapters (LoRAs) enable parameter-efficient task adaptation, existing mobile deployment pipelines typically compile separate model binaries, each bundling a LoRA with a copy of the foundation model, resulting in redundant storage and increased runtime overhead. In this work, we present a unified framework for enabling multi-task GenAI inference on edge devices using a single shared model. Our key idea is to treat LoRA weights as runtime inputs rather than embedding them into the compiled model graph, allowing dynamic task switching at runtime without recompilation. To support efficient on-device execution, we introduce QUAD (Quantization with Unified Adaptive Distillation), a quantization-aware training strategy that aligns multiple LoRA adapters under a shared quantization profile. We implement the proposed system with a lightweight runtime stack compatible with mobile NPUs and evaluate it across multiple chipsets. Experimental results demonstrate up to a 6x reduction in memory footprint and up to a 4x latency improvement, while maintaining high visual quality across multiple GenAI tasks.
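The "LoRA weights as runtime inputs" idea can be sketched as a linear layer whose low-rank delta is swapped at call time, leaving the compiled base weight untouched. The class and its method names below are illustrative, not the paper's API.

```python
import numpy as np

class LoRALinear:
    """A linear layer whose low-rank adapter is a runtime input, so tasks can
    be switched without recompiling or duplicating the base weights."""
    def __init__(self, W):
        self.W = W                      # frozen, shared base weight, shape (out, in)
        self.adapter = None
    def load_adapter(self, A, B, scale=1.0):
        self.adapter = (A, B, scale)    # A: (r, in), B: (out, r), rank r small
    def __call__(self, x):
        y = x @ self.W.T
        if self.adapter is not None:
            A, B, s = self.adapter
            y = y + s * (x @ A.T) @ B.T  # low-rank task-specific delta
        return y

rng = np.random.default_rng(6)
layer = LoRALinear(rng.normal(size=(8, 8)))
x = rng.normal(size=(4, 8))
base = layer(x)
layer.load_adapter(rng.normal(size=(2, 8)), rng.normal(size=(8, 2)))  # task A
edit_task = layer(x)
layer.load_adapter(np.zeros((2, 8)), np.zeros((8, 2)))                # a "null" adapter
null_task = layer(x)
```

QUAD's role, on top of this, is to train all adapters so they remain accurate under one shared quantization profile for the base model.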
[167] Generating Key Postures of Bharatanatyam Adavus with Pose Estimation
Jagadish Kashinath Kamble, Jayanta Mukhopadhyay, Debaditya Roy, Partha Pratim Das
Main category: cs.CV
TL;DR: A pose-aware generative framework for synthesizing Bharatanatyam dance postures using pose estimation guidance to ensure anatomical accuracy and cultural fidelity.
Details
Motivation: To digitally preserve intangible cultural dances like Bharatanatyam by accurately generating key postures while maintaining anatomical and stylistic integrity for documentation, analysis, and global dissemination.
Method: Proposes a pose-aware generative framework integrated with pose estimation module, using keypoint-based loss and pose consistency constraints. Evaluates four configurations: standard cGAN, cGAN with pose supervision, conditional diffusion, and conditional diffusion with pose supervision, all conditioned on posture class labels.
Result: Incorporating pose supervision significantly enhances quality, realism, and authenticity of generated Bharatanatyam postures in both cGAN and conditional diffusion settings, aligning generated poses with ground-truth keypoint structures.
Conclusion: The framework provides a scalable approach for digital preservation, education, and dissemination of traditional dance forms, enabling high-fidelity generation without compromising cultural precision.
Abstract: Preserving intangible cultural dances rooted in centuries of tradition and governed by strict structural and symbolic rules presents unique challenges in the digital era. Among these, Bharatanatyam, a classical Indian dance form, stands out for its emphasis on codified adavus and precise key postures. Accurately generating these postures is crucial not only for maintaining anatomical and stylistic integrity, but also for enabling effective documentation, analysis, and transmission to broader global audiences through digital means. We propose a pose-aware generative framework integrated with a pose estimation module, guided by keypoint-based loss and pose consistency constraints. These supervisory signals ensure anatomical accuracy and stylistic integrity in the synthesized outputs. We evaluate four configurations: standard conditional generative adversarial network (cGAN), cGAN with pose supervision, conditional diffusion, and conditional diffusion with pose supervision. Each model is conditioned on key posture class labels and optimized to maintain geometric structure. In both cGAN and conditional diffusion settings, the integrated pose guidance aligns generated poses with ground-truth keypoint structures, promoting cultural fidelity. Our results demonstrate that incorporating pose supervision significantly enhances the quality, realism, and authenticity of generated Bharatanatyam postures. This framework provides a scalable approach for the digital preservation, education, and dissemination of traditional dance forms, enabling high-fidelity generation without compromising cultural precision. Code is available at https://github.com/jagidsh/Generating-Key-Postures-of-Bharatanatyam-Adavus-with-Pose-Estimation.
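The pose supervision amounts to adding a keypoint term to the generator's objective: keypoints estimated from generated images are penalized against ground-truth posture keypoints. The squared-distance form and the weight `lam` below are assumptions; the paper's exact loss is not given in the abstract.

```python
import numpy as np

def keypoint_loss(pred_kpts, gt_kpts):
    """Mean squared distance between corresponding 2D keypoints."""
    return float(np.mean(np.sum((pred_kpts - gt_kpts) ** 2, axis=-1)))

def pose_supervised_loss(adv_loss, pred_kpts, gt_kpts, lam=10.0):
    """Generator objective = adversarial (or diffusion) loss + weighted
    keypoint consistency term."""
    return adv_loss + lam * keypoint_loss(pred_kpts, gt_kpts)

gt = np.array([[0.2, 0.3], [0.5, 0.5], [0.8, 0.7]])
perfect = pose_supervised_loss(1.0, gt, gt)         # pose matches: no penalty
off = pose_supervised_loss(1.0, gt + 0.1, gt)       # uniformly shifted pose
```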
[168] Emotion Diffusion Classifier with Adaptive Margin Discrepancy Training for Facial Expression Recognition
Rongkang Dong, Cuixin Yang, Cong Zhang, Yushen Zuo, Kin-Man Lam
Main category: cs.CV
TL;DR: Proposes AMDiT (Adaptive Margin Discrepancy Training) to improve Emotion Diffusion Classifier (EmoDC) for facial expression recognition, enhancing both accuracy and robustness against noise/blur.
Details
Motivation: Current deep learning FER methods rely on discriminative classifiers that learn shortcuts and are vulnerable to distribution shifts. Diffusion models offer robustness but standard training fails to penalize incorrect categorical descriptions.
Method: Introduces Emotion Diffusion Classifier (EmoDC) using conditional generative diffusion model. Proposes Adaptive Margin Discrepancy Training (AMDiT) that dynamically adjusts margins between noise-prediction errors for correct vs incorrect categories per sample.
Result: AMDiT significantly improves EmoDC accuracy over base model on RAF-DB basic/compound subsets, SFEW-2.0, and AffectNet. EmoDC outperforms state-of-the-art discriminative classifiers in robustness against noise and blur.
Conclusion: AMDiT effectively enhances diffusion-based FER by adaptively enforcing discriminative capability, achieving both improved accuracy and robustness compared to traditional discriminative approaches.
Abstract: Facial Expression Recognition (FER) is essential for human-machine interaction, as it enables machines to interpret human emotions and internal states from facial affective behaviors. Although deep learning has significantly advanced FER performance, most existing deep-learning-based FER methods rely heavily on discriminative classifiers for fast predictions. These models tend to learn shortcuts and are vulnerable to even minor distribution shifts. To address this issue, we adopt a conditional generative diffusion model and introduce the Emotion Diffusion Classifier (EmoDC) for FER, which demonstrates enhanced adversarial robustness. However, retraining EmoDC using standard strategies fails to penalize incorrect categorical descriptions, leading to suboptimal recognition performance. To improve EmoDC, we propose margin-based discrepancy training, which encourages accurate predictions when conditioned on correct categorical descriptions and penalizes predictions conditioned on mismatched ones. This method enforces a minimum margin between noise-prediction errors for correct and incorrect categories, thereby enhancing the model’s discriminative capability. Nevertheless, using a fixed margin fails to account for the varying difficulty of noise prediction across different images, limiting its effectiveness. To overcome this limitation, we propose Adaptive Margin Discrepancy Training (AMDiT), which dynamically adjusts the margin for each sample. Extensive experiments show that AMDiT significantly improves the accuracy of EmoDC over the Base model with standard denoising diffusion training on the RAF-DB basic subset, the RAF-DB compound subset, SFEW-2.0, and AffectNet, in 100-step evaluations. Additionally, EmoDC outperforms state-of-the-art discriminative classifiers in terms of robustness against noise and blur.
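The margin-based discrepancy idea described in the abstract can be sketched generically. The hinge form below, and the per-sample margin scaling used to make it "adaptive", are illustrative assumptions rather than the paper's exact formulation:

```python
def discrepancy_loss(err_correct, errs_incorrect, base_margin=1.0, adaptive=True):
    """Hedged sketch: err_correct is the noise-prediction MSE conditioned on
    the true emotion label; errs_incorrect are the MSEs conditioned on each
    wrong label. The adaptive margin here (scaled by the correct-class error,
    a hypothetical proxy for sample difficulty) is an assumption."""
    margin = base_margin * (1.0 + err_correct) if adaptive else base_margin
    # hinge: penalize whenever a wrong-condition error does not exceed the
    # correct-condition error by at least `margin`
    return sum(max(0.0, margin - (e_wrong - err_correct))
               for e_wrong in errs_incorrect)
```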
[169] FlowID : Enhancing Forensic Identification with Latent Flow-Matching Models
Jules Ripoll, David Bertoin, Alasdair Newson, Charles Dossal, Jose Pablo Baraybar
Main category: cs.CV
TL;DR: FlowID: An identity-preserving facial reconstruction method using image generation models to restore faces of violent death victims while preserving identity for identification purposes.
Details
Motivation: Address the challenge of identifying deceased individuals from violent circumstances (crimes, war, migration, climate disasters) where traditional photo editing tools produce suboptimal results and have lengthy workflows. Need for automated methods that can remove artifacts while preserving identity-critical features.
Method: Combines single-image fine-tuning to adapt generative models to out-of-distribution injured faces with attention-based masking that localizes edits to damaged regions while preserving identity-critical features. Introduces InjuredFaces benchmark for evaluation.
Result: FlowID outperforms state-of-the-art open-source methods while maintaining low memory requirements, making it suitable for local deployment without compromising data privacy.
Conclusion: Proposes a practical solution for medico-legal and law enforcement applications that balances reconstruction quality with identity preservation, with potential for real-world deployment in sensitive contexts.
Abstract: Every day, many people die under violent circumstances, whether from crimes, war, migration, or climate disasters. Medico-legal and law enforcement institutions document many portraits of the deceased for evidence, but cannot immediately carry out identification on them. While traditional image editing tools can process these photos for public release, the workflow is lengthy and produces suboptimal results. In this work, we leverage advances in image generation models, which can now produce photorealistic human portraits, to introduce FlowID, an identity-preserving facial reconstruction method. Our approach combines single-image fine-tuning, which adapts the generative model to out-of-distribution injured faces, with attention-based masking that localizes edits to damaged regions while preserving identity-critical features. Together, these components enable the removal of artifacts from violent death while retaining sufficient identity information to support identification. To evaluate our method, we introduce InjuredFaces, a novel benchmark for identity-preserving facial reconstruction under severe facial damage. Beyond serving as an evaluation tool for this work, InjuredFaces provides a standardized resource for the community to study and compare methods addressing facial reconstruction in extreme conditions. Experimental results show that FlowID outperforms state-of-the-art open-source methods while maintaining low memory requirements, making it suitable for local deployment without compromising data privacy.
[170] Video-Oasis: Rethinking Evaluation of Video Understanding
Geuntaek Lim, Minho Shim, Sungjune Park, Jaeyun Lee, Inwoong Lee, Taeoh Kim, Dongyoon Wee, Yukyung Choi
Main category: cs.CV
TL;DR: Video-Oasis is a diagnostic suite that systematically evaluates existing video understanding benchmarks, revealing that over half of samples can be solved without visual/temporal context, and SOTA models perform near random guessing on the remaining challenging samples.
Details
Motivation: Current video understanding benchmarks lack clear attribution of performance gains to visual perception, linguistic reasoning, or knowledge priors. The essential criteria for video understanding remain overlooked, with many benchmarks focusing on high-level reasoning without systematic evaluation of what constitutes true video understanding.
Method: Developed Video-Oasis, a sustainable diagnostic suite to systematically evaluate existing video understanding evaluations. The method involves analyzing current benchmark samples to determine which can be solved without visual input or temporal context, and testing state-of-the-art models on the remaining challenging samples.
Result: Two critical findings: 1) 54% of existing benchmark samples are solvable without visual input or temporal context, 2) On the remaining challenging samples, state-of-the-art models perform barely above random guessing. The analysis provides practical guidelines for algorithmic design choices that contribute to robust video understanding.
Conclusion: Video-Oasis serves as a standard guideline for benchmark construction and rigorous evaluation of architecture development in video understanding. The work highlights the need for more challenging benchmarks that truly require visual and temporal reasoning.
Abstract: The inherent complexity of video understanding makes it difficult to attribute whether performance gains stem from visual perception, linguistic reasoning, or knowledge priors. While many benchmarks have emerged to assess high-level reasoning, the essential criteria that constitute video understanding remain largely overlooked. Instead of introducing yet another benchmark, we take a step back to re-examine the current landscape of video understanding. In this work, we provide Video-Oasis, a sustainable diagnostic suite designed to systematically evaluate existing evaluations and distill spatio-temporal challenges for video understanding. Our analysis reveals two critical findings: (1) 54% of existing benchmark samples are solvable without visual input or temporal context, and (2) on the remaining samples, state-of-the-art models exhibit performance barely exceeding random guessing. To bridge this gap, we investigate which algorithmic design choices contribute to robust video understanding, providing practical guidelines for future research. We hope our work serves as a standard guideline for benchmark construction and the rigorous evaluation of architecture development. Code is available at https://github.com/sejong-rcv/Video-Oasis.
[171] AutoWorld: Scaling Multi-Agent Traffic Simulation with Self-Supervised World Models
Mozhgan Pourkeshavatz, Tianran Liu, Nicholas Rhinehart
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv metadata fetch for 2603.28963 failed with HTTP 429 (rate limited).
[172] BigEarthNet.txt: A Large-Scale Multi-Sensor Image-Text Dataset and Benchmark for Earth Observation
Johann-Ludwig Herzog, Mathis Jürgen Adler, Leonard Hackel, Yan Shu, Angelos Zavras, Ioannis Papoutsis, Paolo Rota, Begüm Demir
Main category: cs.CV
TL;DR: BigEarthNet.txt is a large-scale multi-sensor remote sensing dataset with 464k images and 9.6M text annotations to advance instruction-driven image-text learning in Earth observation.
Details
Motivation: Existing vision-language models perform well on general computer vision but struggle with remote sensing data due to limited large-scale, multi-sensor datasets with diverse textual annotations. Current RS datasets mainly contain aerial RGB imagery with short captions and lack annotation diversity.
Method: Created BigEarthNet.txt dataset containing 464,044 co-registered Sentinel-1 SAR and Sentinel-2 multispectral images with 9.6 million text annotations including geographically anchored captions, visual question answering pairs, and referring expression detection instructions.
Result: BigEarthNet.txt surpasses existing RS image-text datasets in textual richness and annotation variety. Fine-tuning VLMs on this dataset yields consistent performance gains across tasks, though models still struggle with complex land-use/land-cover classes.
Conclusion: BigEarthNet.txt addresses the data scarcity problem in remote sensing vision-language learning and enables better instruction-driven image-text models for Earth observation tasks through comprehensive multi-sensor, multi-annotation training data.
Abstract: Vision-language models (VLMs) have shown strong performance in computer vision (CV), yet their performance on remote sensing (RS) data remains limited due to the lack of large-scale, multi-sensor RS image-text datasets with diverse textual annotations. Existing datasets predominantly include aerial Red-Green-Blue imagery, with short or weakly grounded captions, and provide limited diversity in annotation types. To address this limitation, we introduce BigEarthNet.txt, a large-scale, multi-sensor image-text dataset designed to advance instruction-driven image-text learning in Earth observation across multiple tasks. BigEarthNet.txt contains 464,044 co-registered Sentinel-1 synthetic aperture radar and Sentinel-2 multispectral images with 9.6M text annotations, including: i) geographically anchored captions describing land-use/land-cover (LULC) classes, their spatial relations, and environmental context; ii) visual question answering pairs relevant for different tasks; and iii) referring expression detection instructions for bounding box prediction. Through a comparative statistical analysis, we demonstrate that BigEarthNet.txt surpasses existing RS image-text datasets in textual richness and annotation type variety. We further establish a manually-verified benchmark split to evaluate VLMs in RS and CV. The results show the limitations of these models on tasks that involve complex LULC classes, whereas fine-tuning using BigEarthNet.txt results in consistent performance gains across all considered tasks.
[173] Storing Less, Finding More: How Novelty Filtering Improves Cross-Modal Retrieval on Edge Cameras
Sherif Abdelwahab
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv metadata fetch for 2603.29631 failed with HTTP 429 (rate limited).
[174] Self-Supervised Federated Learning under Data Heterogeneity for Label-Scarce Diatom Classification
Mingkun Tan, Xilu Wang, Michael Kloster, Tim W. Nattkemper
Main category: cs.CV
TL;DR: SSFL for diatom classification with systematic study of stage-specific data heterogeneity, proposing PreDi partitioning scheme and PreP-WFL method to address label-space heterogeneity in decentralized systems.
Details
Motivation: Address label-scarce visual classification under decentralized and heterogeneous data, particularly when sites have partially overlapping class sets. Existing SSFL approaches assume same data heterogeneity patterns throughout training and lack controllable simulation of real-world label-space heterogeneity.
Method: Introduce SSFL for diatom classification as real-world case study. Propose PreDi partitioning scheme that disentangles label-space heterogeneity into two orthogonal dimensions: class Prevalence and class-set size Disparity. Develop PreP-WFL (Prevalence-based Personalized Weighted Federated Learning) to adaptively strengthen rare-class representations in low-prevalence scenarios.
Result: SSFL consistently outperforms local-only training under both homogeneous and heterogeneous settings. Heterogeneity in unlabeled data volume improves representation pre-training. Under label-space heterogeneity, prevalence dominates performance while disparity has smaller effect. PreP-WFL effectively mitigates degradation, with gains increasing as prevalence decreases.
Conclusion: Provides mechanistic basis for characterizing label-space heterogeneity in decentralized recognition systems. PreDi enables separate analysis of prevalence and disparity effects, while PreP-WFL addresses challenges in low-prevalence scenarios.
Abstract: Label-scarce visual classification under decentralized and heterogeneous data is a fundamental challenge in pattern recognition, especially when sites exhibit partially overlapping class sets. While self-supervised federated learning (SSFL) offers a promising solution, existing studies commonly assume the same data heterogeneity pattern throughout pre-training and fine-tuning. Moreover, current partitioning schemes often fail to generate pure partially class-disjoint data settings, limiting controllable simulation of real-world label-space heterogeneity. In this work, we introduce SSFL for diatom classification as a representative real-world instance and systematically investigate stage-specific data heterogeneity. We study cross-site variation in unlabeled data volume during pre-training and label-space misalignment during downstream fine-tuning. To study the latter in a controllable setting, we propose PreDi, a partitioning scheme that disentangles label-space heterogeneity into two orthogonal dimensions, namely class Prevalence and class-set size Disparity, enabling separate analysis of their effects. Guided by the resulting insights, we further propose PreP-WFL (Prevalence-based Personalized Weighted Federated Learning) to adaptively strengthen rare-class representations in low-prevalence scenarios. Extensive experiments show that SSFL consistently outperforms local-only training under both homogeneous and heterogeneous settings. The pronounced heterogeneity in unlabeled data volume is associated with improved representation pre-training, whereas under label-space heterogeneity, prevalence dominates performance and disparity has a smaller effect. PreP-WFL effectively mitigates this degradation, with gains increasing as prevalence decreases. These findings provide a mechanistic basis for characterizing label-space heterogeneity in decentralized recognition systems.
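A PreDi-style split can be pictured with a toy partitioner. The function name and semantics below (prevalence as the fraction of classes shared by all sites, disparity as the spread in site-private class counts) are assumptions inferred from the abstract, not the paper's definition:

```python
import random

def predi_partition(classes, n_sites, prevalence, disparity, seed=0):
    # Hypothetical sketch: `prevalence` controls how many classes are common
    # to every site; `disparity` controls variation in the number of extra,
    # site-private classes. Leftover classes are simply dropped.
    rng = random.Random(seed)
    n_common = round(len(classes) * prevalence)
    pool = list(classes)
    rng.shuffle(pool)
    common, rest = pool[:n_common], pool[n_common:]
    base = len(rest) // n_sites if n_sites else 0
    sites = []
    for _ in range(n_sites):
        extra = max(0, base + rng.randint(-disparity, disparity))
        private, rest = rest[:extra], rest[extra:]
        sites.append(sorted(common + private))
    return sites
```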
[175] MacTok: Robust Continuous Tokenization for Image Generation
Hengyu Zeng, Xin Gao, Guanghao Li, Yuxiang Yan, Jiaoyang Ruan, Junpeng Ma, Haoyu Albert Wang, Jian Pu
Main category: cs.CV
TL;DR: MacTok is a masked augmenting 1D continuous tokenizer that uses image masking and representation alignment to prevent posterior collapse in compressed latent spaces, achieving efficient visual generation with only 64-128 tokens.
Details
Motivation: Continuous image tokenizers with KL regularization often suffer from posterior collapse when using fewer tokens, where encoders fail to encode informative features into compressed latent spaces. The authors aim to address this limitation while maintaining efficient visual generation.
Method: MacTok uses two types of masking: random masking to regularize latent learning, and DINO-guided semantic masking to emphasize informative image regions. This forces the model to encode robust semantics from incomplete visual evidence. Combined with global and local representation alignment, it preserves discriminative information in a highly compressed 1D latent space.
Result: On ImageNet, MacTok achieves competitive gFID scores of 1.44 at 256×256 and state-of-the-art 1.52 at 512×512 with SiT-XL, while reducing token usage by up to 64× compared to previous methods.
Conclusion: Masking and semantic guidance together effectively prevent posterior collapse and enable efficient, high-fidelity tokenization for visual generation tasks.
Abstract: Continuous image tokenizers enable efficient visual generation, and those based on variational frameworks can learn smooth, structured latent representations through KL regularization. Yet this often leads to posterior collapse when using fewer tokens, where the encoder fails to encode informative features into the compressed latent space. To address this, we introduce MacTok, a Masked Augmenting 1D Continuous Tokenizer that leverages image masking and representation alignment to prevent collapse while learning compact and robust representations. MacTok applies both random masking to regularize latent learning and DINO-guided semantic masking to emphasize informative regions in images, forcing the model to encode robust semantics from incomplete visual evidence. Combined with global and local representation alignment, MacTok preserves rich discriminative information in a highly compressed 1D latent space, requiring only 64 or 128 tokens. On ImageNet, MacTok achieves a competitive gFID of 1.44 at 256×256 and a state-of-the-art 1.52 at 512×512 with SiT-XL, while reducing token usage by up to 64×. These results confirm that masking and semantic guidance together prevent posterior collapse and achieve efficient, high-fidelity tokenization.
[176] Not All Frames Are Equal: Complexity-Aware Masked Motion Generation via Motion Spectral Descriptors
Pengfei Zhou, Xiangyue Zhang, Xukun Shen, Yong Hu
Main category: cs.CV
TL;DR: DynMask improves masked motion generation by incorporating Motion Spectral Descriptor (MSD) to account for local dynamic complexity, addressing current models’ uniform treatment of motion frames.
Details
Motivation: Current masked generative models for text-to-motion synthesis treat motion frames uniformly during masking, attention, and decoding, which poorly matches motion's varying local dynamic complexity. The paper shows these models degrade disproportionately on dynamically complex motions.
Method: Introduces Motion Spectral Descriptor (MSD), a parameter-free measure of local dynamic complexity from motion velocity spectrum. Uses MSD to guide content-focused masking during training, provide spectral similarity prior for self-attention, and modulate token-level sampling during iterative decoding.
Result: DynMask improves motion generation most clearly on dynamically complex motions while also yielding stronger overall FID on HumanML3D and KIT-ML datasets.
Conclusion: Respecting local motion complexity is a useful design principle for masked motion generation, and MSD provides an effective way to incorporate complexity awareness into existing masked motion generators.
Abstract: Masked generative models have become a strong paradigm for text-to-motion synthesis, but they still treat motion frames too uniformly during masking, attention, and decoding. This is a poor match for motion, where local dynamic complexity varies sharply over time. We show that current masked motion generators degrade disproportionately on dynamically complex motions, and that frame-wise generation error is strongly correlated with motion dynamics. Motivated by this mismatch, we introduce the Motion Spectral Descriptor (MSD), a simple and parameter-free measure of local dynamic complexity computed from the short-time spectrum of motion velocity. Unlike learned difficulty predictors, MSD is deterministic, interpretable, and derived directly from the motion signal itself. We use MSD to make masked motion generation complexity-aware. In particular, MSD guides content-focused masking during training, provides a spectral similarity prior for self-attention, and can additionally modulate token-level sampling during iterative decoding. Built on top of masked motion generators, our method, DynMask, improves motion generation most clearly on dynamically complex motions while also yielding stronger overall FID on HumanML3D and KIT-ML. These results suggest that respecting local motion complexity is a useful design principle for masked motion generation. Project page: https://xiangyue-zhang.github.io/DynMask
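The abstract defines MSD as a parameter-free score from the short-time spectrum of motion velocity. A minimal single-joint sketch, with window size and energy normalization chosen for illustration (the paper's exact windowing is not specified here):

```python
import math

def msd(positions, win=8):
    """Sketch of a Motion Spectral Descriptor: per-frame local dynamic
    complexity as the non-DC spectral energy of velocity in a short window.
    `positions` is a 1D trajectory; real motion data is multi-joint."""
    vel = [b - a for a, b in zip(positions, positions[1:])]
    scores = []
    for t in range(len(vel)):
        window = vel[max(0, t - win // 2): t + win // 2]
        n = len(window)
        # energy in non-DC DFT bins: measures high-frequency velocity content
        energy = 0.0
        for k in range(1, n):
            re = sum(v * math.cos(2 * math.pi * k * i / n) for i, v in enumerate(window))
            im = sum(-v * math.sin(2 * math.pi * k * i / n) for i, v in enumerate(window))
            energy += (re * re + im * im) / n
        scores.append(energy)
    return scores
```

A static pose yields zero scores, while an oscillating trajectory scores high, matching the intended notion of "dynamically complex" frames.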
[177] CutClaw: Agentic Hours-Long Video Editing via Music Synchronization
Shifang Zhao, Yihan Hu, Ying Shan, Yunchao Wei, Xiaodong Cun
Main category: cs.CV
TL;DR: CutClaw: An autonomous multi-agent framework using multiple MLLMs to edit raw footage into rhythm-aligned short videos with synchronized music and narrative consistency.
Details
Motivation: Manual video editing is time-consuming and repetitive. There's a need for automated systems that can edit raw footage into meaningful short videos with synchronized music and narrative flow.
Method: Uses hierarchical multimodal decomposition to capture fine-grained and global structures. Employs a multi-agent system with Playwriter Agent for narrative orchestration, and Editor/Reviewer Agents for collaborative optimization based on aesthetic and semantic criteria.
Result: CutClaw significantly outperforms state-of-the-art baselines in generating high-quality, rhythm-aligned videos.
Conclusion: CutClaw demonstrates effective autonomous video editing using multimodal language models, addressing the challenge of manual video editing while maintaining narrative consistency and audio-visual synchronization.
Abstract: Editing the video content with audio alignment forms a digital human-made art in current social media. However, the time-consuming and repetitive nature of manual video editing has long been a challenge for filmmakers and professional content creators alike. In this paper, we introduce CutClaw, an autonomous multi-agent framework designed to edit hours-long raw footage into meaningful short videos that leverages the capabilities of multiple Multimodal Language Models (MLLMs) as an agent system. It produces videos with synchronized music, followed by instructions, and a visually appealing appearance. In detail, our approach begins by employing a hierarchical multimodal decomposition that captures both fine-grained details and global structures across visual and audio footage. Then, to ensure narrative consistency, a Playwriter Agent orchestrates the whole storytelling flow and structures the long-term narrative, anchoring visual scenes to musical shifts. Finally, to construct a short edited video, Editor and Reviewer Agents collaboratively optimize the final cut via selecting fine-grained visual content based on rigorous aesthetic and semantic criteria. We conduct detailed experiments to demonstrate that CutClaw significantly outperforms state-of-the-art baselines in generating high-quality, rhythm-aligned videos. The code is available at: https://github.com/GVCLab/CutClaw.
[178] CoRe-DA: Contrastive Regression for Unsupervised Domain Adaptation in Surgical Skill Assessment
Dimitrios Anastasiou, Razvan Caramalau, Jialang Xu, Runlong He, Freweini Tesfai, Matthew Boal, Nader Francis, Danail Stoyanov, Evangelos B. Mazomenos
Main category: cs.CV
TL;DR: A novel contrastive regression-based domain adaptation framework (CoRe-DA) for surgical skill assessment that enables cross-domain generalization without labeled target data, achieving state-of-the-art performance on both dry-lab and clinical datasets.
Details
Motivation: Current surgical skill assessment methods face challenges with high annotation costs and poor generalization to new surgical tasks/environments, while abundant unlabeled video data exists, motivating unsupervised domain adaptation approaches.
Method: CoRe-DA learns domain-invariant representations through relative-score supervision and target-domain self-training, using contrastive regression for adaptation across surgical domains without labeled target data.
Result: CoRe-DA outperforms state-of-the-art methods, achieving Spearman Correlation Coefficients of 0.46 and 0.41 on dry-lab and clinical target datasets respectively without using any labeled target data for training.
Conclusion: The proposed framework enables scalable surgical skill assessment with reliable cross-domain generalization, addressing limitations of existing methods that underperform in domain adaptation scenarios.
Abstract: Vision-based surgical skill assessment (SSA) enables objective and scalable evaluation of operative performance. Progress in this field is constrained by the high cost and time demands for manual annotation of quantitative skill scores, as well as the poor generalization of existing regression models to new surgical tasks and environments. Meanwhile, appreciable volumes of unlabeled video data are now available, motivating the development of unsupervised domain adaptation (UDA) methods for SSA. We introduce the first benchmark for UDA in SSA regression, spanning four datasets across dry-lab and clinical settings as well as open and robotic surgery. We evaluate eight representative models under challenging domain shifts and propose CoRe-DA, a novel contrastive regression-based adaptation framework. Our method learns domain-invariant representations through relative-score supervision and target-domain self-training. Comprehensive experiments across two UDA settings show that CoRe-DA is superior to state-of-the-art methods, achieving Spearman Correlation Coefficients of 0.46 and 0.41 on dry-lab and clinical target datasets, respectively, without using any labeled target data for training. Overall, CoRe-DA enables scalable SSA with reliable cross-domain generalization, where existing methods underperform. Our code and datasets will be released at https://github.com/anastadimi/CoRe-DA.
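Relative-score supervision, the core of contrastive regression, can be sketched generically. Both functions below are illustrative assumptions about the general technique, not the paper's exact loss or inference rule:

```python
def relative_score_loss(pred_diff, score_a, score_b):
    # Squared error on the *score difference* of a pair of videos: the model
    # learns relative skill rather than absolute scores, which is the generic
    # premise of contrastive regression.
    return (pred_diff - (score_a - score_b)) ** 2

def score_from_exemplars(pred_diffs, exemplar_scores):
    # At test time, predict the difference to a few labeled exemplar videos
    # and average the implied absolute scores.
    return sum(d + s for d, s in zip(pred_diffs, exemplar_scores)) / len(exemplar_scores)
```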
[179] Clinical DVH metrics as a loss function for 3D dose prediction in head and neck radiotherapy
Ruochen Gao, Marius Staring, Frank Dankers
Main category: cs.CV
TL;DR: A clinical DVH metric loss (CDM loss) with differentiable D-metrics and surrogate V-metrics improves 3D dose prediction for head and neck radiotherapy by directly optimizing clinically relevant DVH metrics instead of voxel-wise regression.
Details
Motivation: Existing deep learning models for 3D dose prediction use voxel-wise regression losses that don't align well with clinical plan evaluation criteria based on dose-volume histogram (DVH) metrics, leading to suboptimal clinical outcomes.
Method: Proposed CDM loss with differentiable D-metrics and surrogate V-metrics, plus lossless bit-mask ROI encoding for efficiency. Evaluated on 174 H&N patients using temporal split (137 training, 37 testing) with standard 3D U-Net.
Result: CDM loss substantially improved target coverage and satisfied all clinical constraints. PTV Score reduced from 1.544 (MAE) to 0.491 (MAE + CDM). Bit-mask encoding reduced training time by 83% and lowered GPU memory usage.
Conclusion: Direct optimization of clinical DVH metrics produces 3D dose predictions better aligned with clinical criteria than conventional supervision. CDM loss with efficient ROI encoding provides practical framework for H&N dose prediction.
Abstract: Purpose: Deep-learning-based three-dimensional (3D) dose prediction is widely used in automated radiotherapy workflows. However, most existing models are trained with voxel-wise regression losses, which are poorly aligned with clinical plan evaluation criteria based on dose-volume histogram (DVH) metrics. This study aims to develop a clinically guided loss formulation that directly optimizes clinically used DVH metrics while remaining computationally efficient for head and neck (H&N) dose prediction. Methods: We propose a clinical DVH metric loss (CDM loss) that incorporates differentiable D-metrics and surrogate V-metrics, together with a lossless bit-mask region-of-interest (ROI) encoding to improve training efficiency. The method was evaluated on 174 H&N patients using a temporal split (137 training, 37 testing). Results: Compared with MAE- and DVH-curve based losses, CDM loss substantially improved target coverage and satisfied all clinical constraints. Using a standard 3D U-Net, the PTV Score was reduced from 1.544 (MAE) to 0.491 (MAE + CDM), while OAR sparing remained comparable. Bit-mask encoding reduced training time by 83% and lowered GPU memory usage. Conclusion: Directly optimizing clinically used DVH metrics enables 3D dose predictions that are better aligned with clinical treatment planning criteria than conventional voxel-wise or DVH-curve-based supervision. The proposed CDM loss, combined with efficient ROI bit-mask encoding, provides a practical and scalable framework for H&N dose prediction.
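The clinical quantities the CDM loss targets, and the bit-mask ROI encoding, have standard definitions that can be sketched plainly. Note the paper uses a differentiable (e.g. soft-sorted) version of the D-metric inside the loss; the code below shows only the underlying clinical metrics on plain lists:

```python
def d_metric(doses, volume_pct):
    # D_x: minimum dose received by the hottest x% of an ROI's voxels,
    # read off the sorted dose list.
    ranked = sorted(doses, reverse=True)
    k = max(1, round(len(ranked) * volume_pct / 100))
    return ranked[k - 1]

def v_metric(doses, dose_level):
    # V_d: fraction of the ROI volume receiving at least `dose_level`.
    return sum(1 for d in doses if d >= dose_level) / len(doses)

def pack_masks(masks):
    # Lossless bit-mask ROI encoding sketch: pack N binary ROI masks into
    # one integer per voxel (bit i set = voxel belongs to ROI i), so many
    # overlapping structures fit in a single array.
    packed = [0] * len(masks[0])
    for i, m in enumerate(masks):
        for v, bit in enumerate(m):
            if bit:
                packed[v] |= 1 << i
    return packed
```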
[180] SkeletonContext: Skeleton-side Context Prompt Learning for Zero-Shot Skeleton-based Action Recognition
Ning Wang, Tieyue Wu, Naeha Sharif, Farid Boussaid, Guangming Zhu, Lin Mei, Mohammed Bennamoun, Zhang Liang
Main category: cs.CV
TL;DR: SkeletonContext: A prompt-based framework for zero-shot skeleton-based action recognition that enriches skeletal motion with language-driven contextual semantics using cross-modal context prompts and key-part decoupling.
Details
Motivation: Existing zero-shot skeleton-based action recognition methods align skeleton features with textual embeddings but lack contextual cues (like objects involved in actions), creating a gap between skeleton and semantic representations that makes it difficult to distinguish visually similar actions.
Method: Proposes SkeletonContext with two main components: 1) Cross-Modal Context Prompt Module that uses a pretrained language model to reconstruct masked contextual prompts guided by LLMs, transferring linguistic context to skeleton encoder for semantic grounding; 2) Key-Part Decoupling Module to decouple motion-relevant joint features for robust action understanding without explicit object interactions.
Result: Achieves state-of-the-art performance on multiple benchmarks under both conventional and generalized zero-shot settings, demonstrating effectiveness in reasoning about context and distinguishing fine-grained, visually similar actions.
Conclusion: SkeletonContext effectively bridges the gap between skeleton and semantic representations by incorporating language-driven contextual semantics, enabling better cross-modal alignment and improved zero-shot action recognition performance.
Abstract: Zero-shot skeleton-based action recognition aims to recognize unseen actions by transferring knowledge from seen categories through semantic descriptions. Most existing methods typically align skeleton features with textual embeddings within a shared latent space. However, the absence of contextual cues, such as objects involved in the action, introduces an inherent gap between skeleton and semantic representations, making it difficult to distinguish visually similar actions. To address this, we propose SkeletonContext, a prompt-based framework that enriches skeletal motion representations with language-driven contextual semantics. Specifically, we introduce a Cross-Modal Context Prompt Module, which leverages a pretrained language model to reconstruct masked contextual prompts under guidance derived from LLMs. This design effectively transfers linguistic context to the skeleton encoder for instance-level semantic grounding and improved cross-modal alignment. In addition, a Key-Part Decoupling Module is incorporated to decouple motion-relevant joint features, ensuring robust action understanding even in the absence of explicit object interactions. Extensive experiments on multiple benchmarks demonstrate that SkeletonContext achieves state-of-the-art performance under both conventional and generalized zero-shot settings, validating its effectiveness in reasoning about context and distinguishing fine-grained, visually similar actions.
[181] Exploring the Impact of Skin Color on Skin Lesion Segmentation
Kuniko Paxton, Medina Kapo, Amila Akagić, Koorosh Aslansefat, Dhavalkumar Thakker, Yiannis Papadopoulos
Main category: cs.CV
TL;DR: The paper evaluates skin lesion segmentation fairness using continuous pigment analysis rather than discrete skin tone categories, finding that low lesion-skin contrast drives segmentation errors more than global skin tone metrics.
Details
Motivation: While fairness concerns in skin lesion classification have been studied, the influence of skin tone on segmentation remains under-quantified. Current assessments use coarse, discrete skin tone categories, which may not capture the nuances of segmentation performance.
Method: Evaluated three segmentation architectures (UNet, DeepLabV3 with ResNet50, DINOv2) on two dermoscopic datasets (HAM10000, ISIC2017). Introduced continuous pigment analysis using pixel-wise ITA values as distributions. Used Wasserstein distances between within-image distributions for skin-only, lesion-only, and whole-image regions to quantify lesion-skin contrast and relate it to segmentation performance.
Result: Global skin tone metrics (Fitzpatrick grouping or mean ITA) showed weak association with segmentation quality. Low lesion-skin contrast was consistently associated with larger segmentation errors across models, indicating boundary ambiguity and low contrast are key drivers of failure.
Conclusion: Fairness improvements in dermoscopic segmentation should prioritize robust handling of low-contrast lesions. Distribution-based pigment measures provide more informative audit signals than discrete skin-tone categories for segmentation fairness evaluation.
Abstract: Skin cancer, particularly melanoma, remains a major cause of morbidity and mortality, making early detection critical. AI-driven dermatology systems often rely on skin lesion segmentation as a preprocessing step to delineate the lesion from surrounding skin and support downstream analysis. While fairness concerns regarding skin tone have been widely studied for lesion classification, the influence of skin tone on the segmentation stage remains under-quantified and is frequently assessed using coarse, discrete skin tone categories. In this work, we evaluate three strong segmentation architectures (UNet, DeepLabV3 with a ResNet50 backbone, and DINOv2) on two public dermoscopic datasets (HAM10000 and ISIC2017) and introduce a continuous pigment or contrast analysis that treats pixel-wise ITA values as distributions. Using Wasserstein distances between within-image distributions for skin-only, lesion-only, and whole-image regions, we quantify lesion skin contrast and relate it to segmentation performance across multiple metrics. Within the range represented in these datasets, global skin tone metrics (Fitzpatrick grouping or mean ITA) show weak association with segmentation quality. In contrast, low lesion-skin contrast is consistently associated with larger segmentation errors in models, indicating that boundary ambiguity and low contrast are key drivers of failure. These findings suggest that fairness improvements in dermoscopic segmentation should prioritize robust handling of low-contrast lesions, and the distribution-based pigment measures provide a more informative audit signal than discrete skin-tone categories.
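The paper's contrast measure combines two standard ingredients: the Individual Typology Angle, ITA = arctan((L* − 50)/b*) · 180/π, computed per pixel, and a Wasserstein distance between the lesion and skin ITA distributions. A hedged sketch follows; the closed form used here (mean absolute difference of sorted samples) is the 1-D Wasserstein-1 distance for two equal-size empirical distributions, a simplification of whatever general solver the authors used, and the sample values are synthetic:

```python
import numpy as np

def ita(L_star, b_star):
    """Individual Typology Angle in degrees (arctan2 avoids division by b* = 0)."""
    return np.degrees(np.arctan2(L_star - 50.0, b_star))

def wasserstein_1d(a, b):
    """W1 between two equal-size 1-D empirical distributions:
    mean absolute difference of the sorted samples."""
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

# synthetic pixel samples standing in for skin-only and lesion-only regions
rng = np.random.default_rng(0)
skin   = ita(rng.normal(65, 3, 1000), rng.normal(15, 2, 1000))
lesion = ita(rng.normal(45, 5, 1000), rng.normal(20, 2, 1000))
contrast = wasserstein_1d(skin, lesion)  # large value = well-separated pigments
```

Under the paper's finding, images where this `contrast` value is small (lesion and skin ITA distributions overlap) are the ones where segmentation errors concentrate.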
[182] FED-Bench: A Cross-Granular Benchmark for Disentangled Evaluation of Facial Expression Editing
Fengjian Xue, Xuecheng Wu, Heli Sun, Yunyun Shi, Shi Chen, Liangyu Fu, Jinheng Xie, Dingkang Yang, Hao Wang, Junxiao Xue, Liang He
Main category: cs.CV
TL;DR: FED-Bench: A comprehensive benchmark for facial expression editing with high-quality triplets, evaluation protocol, and training dataset to address current limitations in expression manipulation evaluation.
Details
Motivation: Existing facial expression editing benchmarks lack high-quality images and proper evaluation metrics, with current metrics showing biases toward lazy or overfit editing. There's a need for rigorous testing and accurate evaluation for expression manipulation tasks.
Method: 1) Constructed benchmark with 747 triplets (original image, editing instruction, ground-truth) using cascaded scalable pipeline; 2) Introduced FED-Score evaluation protocol with three dimensions: Alignment, Fidelity, and Relative Expression Gain; 3) Benchmarked 18 image editing models; 4) Created 20k+ in-the-wild facial training set.
Result: Current models struggle with simultaneous high fidelity and accurate expression manipulation, with fine-grained instruction following identified as main bottleneck. The benchmark reveals evaluation biases and provides scalable training data that improves baseline models.
Conclusion: FED-Bench addresses critical gaps in facial expression editing evaluation, providing comprehensive benchmark, accurate evaluation protocol, and scalable training data to advance research in precise expression manipulation.
Abstract: Facial expression image editing requires fine-grained control to strictly preserve human identity and background while precisely manipulating expression. However, existing editing benchmarks primarily focus on general scenarios, lacking high-quality facial images and corresponding editing instructions. Furthermore, current evaluation metrics exhibit systemic biases in this task, often favoring lazy editing or overfit editing. To bridge these gaps, we propose FED-Bench, a comprehensive benchmark featuring rigorous testing and an accurate evaluation suite. First, we carefully construct a benchmark of 747 triplets through a cascaded and scalable pipeline, each comprising an original image, an editing instruction, and a ground-truth image for precise evaluation. Second, we introduce FED-Score, a cross-granularity evaluation protocol that disentangles assessment into three dimensions: Alignment for verifying instruction following, Fidelity for testing image quality and identity preservation, and Relative Expression Gain for quantifying the magnitude of expression changes, effectively mitigating the aforementioned evaluation biases. Third, we benchmark 18 image editing models, revealing that current approaches struggle to simultaneously achieve high fidelity and accurate expression manipulation, with fine-grained instruction following identified as the primary bottleneck. Finally, leveraging the scalable nature of the introduced benchmark engine, we provide a 20k+ in-the-wild facial training set and demonstrate its effectiveness by fine-tuning a baseline model that achieves significant performance gains. Our benchmark and related code will be made publicly available soon.
[183] Compressive sensing inspired self-supervised single-pixel imaging
Jijun Lu, Yifan Chen, Libang Chen, Yiqiang Zhou, Ye Zheng, Mingliang Chen, Zhe Sun, Xuelong Li
Main category: cs.CV
TL;DR: SISTA-Net: A self-supervised compressive sensing method for single-pixel imaging that combines CNN-Visual State Space Model architecture with adaptive sparse transforms for robust reconstruction in noisy environments.
Details
Motivation: Existing single-pixel imaging methods lack physical sparsity constraints and fail to integrate local and global features, leading to noise vulnerability, structural distortions, and blurred details in challenging environments.
Method: Unfolds Iterative Shrinkage-Thresholding Algorithm (ISTA) into interpretable network with data fidelity module (hybrid CNN-Visual State Space Model) and proximal mapping module using deep nonlinear networks as adaptive sparse transforms with learnable soft-thresholding.
Result: Outperforms state-of-the-art methods by 2.6 dB in PSNR in simulations and achieves 3.4 dB average PSNR improvement in real-world far-field underwater tests, demonstrating robust anti-interference capability.
Conclusion: SISTA-Net effectively addresses limitations of existing SPI methods through physical sparsity constraints and integrated local-global feature modeling, enabling high-quality reconstruction even at extremely low sampling rates in challenging environments.
Abstract: Single-pixel imaging (SPI) is a promising imaging modality with distinctive advantages in strongly perturbed environments. Existing SPI methods lack physical sparsity constraints and overlook the integration of local and global features, leading to severe noise vulnerability, structural distortions and blurred details. To address these limitations, we propose SISTA-Net, a compressive sensing-inspired self-supervised method for single-pixel imaging. SISTA-Net unfolds the Iterative Shrinkage-Thresholding Algorithm (ISTA) into an interpretable network consisting of a data fidelity module and a proximal mapping module. The fidelity module adopts a hybrid CNN-Visual State Space Model (VSSM) architecture to integrate local and global feature modeling, enhancing reconstruction integrity and fidelity. We leverage deep nonlinear networks as adaptive sparse transforms combined with a learnable soft-thresholding operator to impose explicit physical sparsity in the latent domain, enabling noise suppression and robustness to interference even at extremely low sampling rates. Extensive experiments on multiple simulation scenarios demonstrate that SISTA-Net outperforms state-of-the-art methods by 2.6 dB in PSNR. Real-world far-field underwater tests yield a 3.4 dB average PSNR improvement, validating its robust anti-interference capability.
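SISTA-Net unfolds ISTA into network layers. The classical iteration it unfolds, including the soft-thresholding operator that the network replaces with a learnable version, can be sketched in plain NumPy (an identity sparsifying transform is assumed here for illustration; the paper learns this transform with deep networks):

```python
import numpy as np

def soft_threshold(x, tau):
    """Proximal operator of tau * ||x||_1 (made learnable in SISTA-Net)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def ista(A, y, tau=0.05, n_iter=500):
    """Classical ISTA for min_x 0.5 * ||A x - y||^2 + tau * ||x||_1."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        # gradient step on the data-fidelity term, then sparsity proximal step
        x = soft_threshold(x - (A.T @ (A @ x - y)) / L, tau / L)
    return x

# toy compressive-sensing recovery of a 3-sparse signal from 40 measurements
rng = np.random.default_rng(1)
A = rng.normal(size=(40, 100)) / np.sqrt(40)
x_true = np.zeros(100)
x_true[[5, 40, 77]] = [2.0, -1.5, 1.0]
y = A @ x_true
x_hat = ista(A, y)
```

Each unfolded network stage in the paper corresponds to one such iteration, with the fidelity step replaced by the CNN-VSSM module and the threshold `tau` learned per stage.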
[184] Leveraging Synthetic Data for Enhancing Egocentric Hand-Object Interaction Detection
Rosario Leonardi, Antonino Furnari, Francesco Ragusa, Giovanni Maria Farinella
Main category: cs.CV
TL;DR: Synthetic data significantly improves hand-object interaction detection in egocentric vision, especially when real labeled data is scarce, with gains up to +11.69% AP using only 10% real data.
Details
Motivation: Hand-Object Interaction (HOI) detection from egocentric images is challenging due to limited labeled real-world data. Synthetic data offers a promising solution to overcome data scarcity and improve detection performance.
Method: Developed a synthetic data generation pipeline and HOI-Synth benchmark that augments existing datasets with synthetic images. Systematically studied synthetic-real alignment across objects, grasps, and environments. Evaluated on VISOR, EgoHOS, and ENIGMA-51 datasets using synthetic data with only 10% real labeled data.
Result: Significant improvements in Overall AP: +5.67% on VISOR, +8.24% on EgoHOS, and +11.69% on ENIGMA-51 compared to models trained exclusively on real data. Effectiveness consistently improves with better synthetic-real alignment.
Conclusion: Synthetic data is highly effective for HOI detection, particularly when real labeled data is scarce. The released HOI-Synth benchmark and generation pipeline provide valuable resources for the community.
Abstract: In this work, we explore the role of synthetic data in improving the detection of Hand-Object Interactions from egocentric images. Through extensive experimentation and comparative analysis on VISOR, EgoHOS, and ENIGMA-51 datasets, our findings demonstrate the potential of synthetic data to significantly improve HOI detection, particularly when real labeled data are scarce or unavailable. By using synthetic data and only 10% of the real labeled data, we achieve improvements in Overall AP over models trained exclusively on real data, with gains of +5.67% on VISOR, +8.24% on EgoHOS, and +11.69% on ENIGMA-51. Furthermore, we systematically study the effect of aligning synthetic data to specific real-world benchmarks with respect to objects, grasps, and environments, showing that the effectiveness of synthetic data consistently improves with better synthetic-real alignment. As a result of this work, we release a new data generation pipeline and the new HOI-Synth benchmark, which augments existing datasets with synthetic images of hand-object interaction. These data are automatically annotated with hand-object contact states, bounding boxes, and pixel-wise segmentation masks. All data, code, and tools for synthetic data generation are available at: https://fpv-iplab.github.io/HOI-Synth/.
[185] GRVS: a Generalizable and Recurrent Approach to Monocular Dynamic View Synthesis
Thomas Tanay, Mohammed Brahimi, Michal Nazarczuk, Qingwen Zhang, Sibi Catley-Chandar, Arthur Moreau, Zhensong Zhang, Eduardo Pérez-Pellitero
Main category: cs.CV
TL;DR: A novel view synthesis method for dynamic scenes from monocular videos using recurrent loops and plane sweeps to achieve fine-grained 6-DOF camera control with better geometric consistency than existing approaches.
Details
Motivation: Current methods for novel view synthesis from monocular dynamic videos have limitations: scene-specific methods with explicit motion priors fail in highly dynamic regions, diffusion-based approaches suffer from geometric inconsistencies, and both require substantial computational resources.
Method: Adapts generalizable models for static novel view synthesis to dynamic inputs with two key components: (1) a recurrent loop enabling unbounded, asynchronous mapping between input and target videos, and (2) efficient plane sweeps over dynamic inputs to disentangle camera and scene motion for fine-grained 6-DOF camera control.
Result: Outperforms four Gaussian Splatting-based scene-specific approaches and two diffusion-based approaches in reconstructing fine-grained geometric details across both static and dynamic regions. Evaluated on UCSD dataset and new Kubric-4D-dyn dataset featuring longer, higher resolution sequences with complex dynamics.
Conclusion: Proposed method successfully addresses limitations of existing approaches by combining recurrent architecture with plane sweep techniques, achieving better geometric consistency and camera control for dynamic novel view synthesis from monocular videos.
Abstract: Synthesizing novel views from monocular videos of dynamic scenes remains a challenging problem. Scene-specific methods that optimize 4D representations with explicit motion priors often break down in highly dynamic regions where multi-view information is hard to exploit. Diffusion-based approaches that integrate camera control into large pre-trained models can produce visually plausible videos but frequently suffer from geometric inconsistencies across both static and dynamic areas. Both families of methods also require substantial computational resources. Building on the success of generalizable models for static novel view synthesis, we adapt the framework to dynamic inputs and propose a new model with two key components: (1) a recurrent loop that enables unbounded and asynchronous mapping between input and target videos and (2) an efficient use of plane sweeps over dynamic inputs to disentangle camera and scene motion, and achieve fine-grained, six-degrees-of-freedom camera controls. We train and evaluate our model on the UCSD dataset and on Kubric-4D-dyn, a new monocular dynamic dataset featuring longer, higher resolution sequences with more complex scene dynamics than existing alternatives. Our model outperforms four Gaussian Splatting-based scene-specific approaches, as well as two diffusion-based approaches in reconstructing fine-grained geometric details across both static and dynamic regions.
[186] SHIFT: Stochastic Hidden-Trajectory Deflection for Removing Diffusion-based Watermark
Rui Bao, Zheng Gao, Xiaoyu Li, Xiaoyan Feng, Yang Song, Jiaojiao Jiang
Main category: cs.CV
TL;DR: SHIFT is a training-free attack that exploits the trajectory recovery vulnerability in diffusion-based watermarking by using stochastic resampling to deflect generative trajectories while preserving image quality.
Details
Motivation: Current diffusion-based watermarking methods rely on reconstructing the diffusion trajectory for verification, creating a fundamental vulnerability that can be exploited by attackers.
Method: SHIFT uses stochastic diffusion resampling to deflect the generative trajectory in latent space, making reconstructed images statistically decoupled from original watermark-embedded trajectories without requiring watermark-specific knowledge or model retraining.
Result: Achieves 95%–100% attack success rates across nine representative watermarking methods (noise-space, frequency-domain, and optimization-based paradigms) with minimal loss in semantic quality.
Conclusion: SHIFT exposes a critical vulnerability in diffusion-based watermarking and demonstrates that trajectory recovery dependence is a fundamental weakness across diverse watermarking paradigms.
Abstract: Diffusion-based watermarking methods embed verifiable marks by manipulating the initial noise or the reverse diffusion trajectory. However, these methods share a critical assumption: verification can succeed only if the diffusion trajectory can be faithfully reconstructed. This reliance on trajectory recovery constitutes a fundamental and exploitable vulnerability. We propose Stochastic Hidden-Trajectory Deflection (SHIFT), a training-free attack that exploits this common weakness across diverse watermarking paradigms. SHIFT leverages stochastic diffusion resampling to deflect the generative trajectory in latent space, making the reconstructed image statistically decoupled from the original watermark-embedded trajectory while preserving strong visual quality and semantic consistency. Extensive experiments on nine representative watermarking methods spanning noise-space, frequency-domain, and optimization-based paradigms show that SHIFT achieves 95%–100% attack success rates with nearly no loss in semantic quality, without requiring any watermark-specific knowledge or model retraining.
[187] TSHA: A Benchmark for Visual Language Models in Trustworthy Safety Hazard Assessment Scenarios
Qiucheng Yu, Ruijie Xu, Mingang Chen, Xuequan Lu, Jianfeng Dong, Chaochao Lu, Xin Tan
Main category: cs.CV
TL;DR: TSHA is a comprehensive benchmark for indoor safety hazards assessment using vision-language models, addressing limitations of existing synthetic datasets and oversimplified tasks with real-world data from multiple sources.
Details
Motivation: Existing benchmarks for indoor safety hazards assessment rely heavily on synthetic datasets, creating domain gaps with real-world environments, oversimplify safety tasks with artificial constraints, and lack rigorous evaluation protocols for complex scenarios.
Method: Created TSHA benchmark with 81,809 training samples from four complementary sources: existing indoor datasets, internet images, AIGC images, and newly captured images. Includes challenging test set with 1,707 samples including videos and panoramic images with multiple safety hazards.
Result: Experiments on 23 popular VLMs show current models lack robust safety hazard assessment capabilities. Models trained on TSHA achieve up to +18.3 point improvement on TSHA test set and exhibit enhanced generalizability across other benchmarks.
Conclusion: TSHA benchmark addresses critical limitations in indoor safety assessment, demonstrates the need for better training data, and shows that models trained on comprehensive real-world data significantly outperform current approaches.
Abstract: Recent advances in vision-language models (VLMs) have accelerated their application to indoor safety hazards assessment. However, existing benchmarks suffer from three fundamental limitations: (1) heavy reliance on synthetic datasets constructed via simulation software, creating a significant domain gap with real-world environments; (2) oversimplified safety tasks with artificial constraints on hazard and scene types, thereby limiting model generalization; and (3) absence of rigorous evaluation protocols to thoroughly assess model capabilities in complex home safety scenarios. To address these challenges, we introduce TSHA (Trustworthy Safety Hazards Assessment), a comprehensive benchmark comprising 81,809 carefully curated training samples drawn from four complementary sources: existing indoor datasets, internet images, AIGC images, and newly captured images. This benchmark set also includes a highly challenging test set with 1,707 samples, comprising not only a carefully selected subset from the training distribution but also newly added videos and panoramic images containing multiple safety hazards, used to evaluate the model’s robustness in complex safety scenarios. Extensive experiments on 23 popular VLMs demonstrate that current VLMs lack robust capabilities for safety hazard assessment. Importantly, models trained on the TSHA training set not only achieve a significant performance improvement of up to +18.3 points on the TSHA test set but also exhibit enhanced generalizability across other benchmarks, underscoring the substantial contribution and importance of the TSHA benchmark.
[188] Beyond Ground-Truth: Leveraging Image Quality Priors for Real-World Image Restoration
Fengyang Xiao, Peng Hu, Lei Xu, XingE Guo, Guanyi Qin, Yuqi Shen, Chengyu Fang, Rihan Zhang, Chunming He, Sina Farsiu
Main category: cs.CV
TL;DR: IQPIR: A novel image restoration framework that uses Image Quality Prior from NR-IQA models to guide restoration toward perceptually optimal outputs, integrating quality conditioning with codebook priors.
Details
Motivation: Existing image restoration methods rely on ground-truth supervision which can have inconsistent perceptual fidelity, causing models to converge to average quality rather than achieving the highest perceptual quality attainable.
Method: Proposes IQPIR framework with three key mechanisms: 1) quality-conditioned Transformer using NR-IQA scores as conditioning signals, 2) dual-branch codebook structure disentangling common and HQ-specific features, and 3) discrete representation-based quality optimization to mitigate over-optimization.
Result: Extensive experiments on real-world image restoration demonstrate the method surpasses cutting-edge methods and serves as a generalizable quality-guided enhancement strategy for existing methods.
Conclusion: IQPIR effectively addresses limitations of ground-truth supervision by incorporating image quality priors to achieve higher perceptual quality in image restoration tasks.
Abstract: Real-world image restoration aims to restore high-quality (HQ) images from degraded low-quality (LQ) inputs captured under uncontrolled conditions. Existing methods typically depend on ground-truth (GT) supervision, assuming that GT provides perfect reference quality. However, GT can still contain images with inconsistent perceptual fidelity, causing models to converge to the average quality level of the training data rather than achieving the highest perceptual quality attainable. To address these problems, we propose a novel framework, termed IQPIR, that introduces an Image Quality Prior (IQP), extracted from pre-trained No-Reference Image Quality Assessment (NR-IQA) models, to guide the restoration process toward perceptually optimal outputs explicitly. Our approach synergistically integrates IQP with a learned codebook prior through three key mechanisms: (1) a quality-conditioned Transformer, where NR-IQA-derived scores serve as conditioning signals to steer the predicted representation toward maximal perceptual quality. This design provides a plug-and-play enhancement compatible with existing restoration architectures without structural modification; (2) a dual-branch codebook structure, which disentangles common and HQ-specific features, ensuring a comprehensive representation of both generic structural information and quality-sensitive attributes; and (3) a discrete representation-based quality optimization strategy, which mitigates over-optimization effects commonly observed in continuous latent spaces. Extensive experiments on real-world image restoration demonstrate that our method not only surpasses cutting-edge methods but also serves as a generalizable quality-guided enhancement strategy for existing methods. The code is available.
[189] From Skeletons to Semantics: Design and Deployment of a Hybrid Edge-Based Action Detection System for Public Safety
Ganen Sethupathy, Lalit Dumka, Jan Schagen
Main category: cs.CV
TL;DR: Hybrid edge-based action detection system combining skeleton-based motion analysis with vision-language models for public safety applications, focusing on system-level trade-offs under edge constraints.
Details
Motivation: Public safety requires timely violence detection in public spaces, but current automated video analysis faces latency, privacy, and resource limitations, especially in edge-computing environments.
Method: System-level comparison of skeleton-based motion analysis (privacy-aware, low overhead) with vision-language models (contextual understanding, zero-shot reasoning) implemented on GPU-enabled edge device, evaluated for latency and resource usage.
Result: Results show complementary strengths: skeleton-based detection is fast and privacy-preserving, while vision-language models provide semantic understanding; hybrid architecture selectively combines both approaches.
Conclusion: Hybrid edge-based system provides practical foundation for privacy-aware, real-time video analysis in public safety by leveraging both motion-centric and semantic approaches.
Abstract: Public spaces such as transport hubs, city centres, and event venues require timely and reliable detection of potentially violent behaviour to support public safety. While automated video analysis has made significant progress, practical deployment remains constrained by latency, privacy, and resource limitations, particularly under edge-computing conditions. This paper presents the design and demonstrator-based deployment of a hybrid edge-based action detection system that combines skeleton-based motion analysis with vision-language models for semantic scene interpretation. Skeleton-based processing enables continuous, privacy-aware monitoring with low computational overhead, while vision-language models provide contextual understanding and zero-shot reasoning capabilities for complex and previously unseen situations. Rather than proposing new recognition models, the contribution focuses on a system-level comparison of both paradigms under realistic edge constraints. The system is implemented on a GPU-enabled edge device and evaluated with respect to latency, resource usage, and operational trade-offs using a demonstrator-based setup. The results highlight the complementary strengths and limitations of motion-centric and semantic approaches and motivate a hybrid architecture that selectively augments fast skeleton-based detection with higher-level semantic reasoning. The presented system provides a practical foundation for privacy-aware, real-time video analysis in public safety applications.
[190] MAPLE: Multi-Path Adaptive Propagation with Level-Aware Embeddings for Hierarchical Multi-Label Image Classification
Boshko Koloski, Marjan Stoimchev, Jurica Levatić, Dragi Kocev, Sašo Džeroski
Main category: cs.CV
TL;DR: MAPLE is a hierarchical multi-label classification framework for remote sensing that handles multi-path label dependencies using hierarchical semantic initialization, graph-based encoding, and adaptive multimodal fusion.
Details
Motivation: Existing hierarchical multi-label classification approaches struggle with multi-path settings where images activate multiple taxonomic branches, leading to underuse of hierarchical information in remote sensing applications.
Method: MAPLE integrates: 1) hierarchical semantic initialization from graph-aware textual descriptions, 2) graph-based structure encoding via GCNs, and 3) adaptive multimodal fusion that balances semantic priors and visual evidence with level-aware loss selection.
Result: Evaluations on CORINE-aligned remote sensing datasets (AID, DFC-15, MLRSNet) show consistent improvements up to +42% in few-shot regimes with only 2.6% parameter overhead.
Conclusion: MAPLE effectively and efficiently models hierarchical semantics for Earth observation by better handling multi-path label dependencies through adaptive multimodal fusion and hierarchical semantic initialization.
Abstract: Hierarchical multi-label classification (HMLC) is essential for modeling structured label dependencies in remote sensing. Yet existing approaches struggle in multi-path settings, where images may activate multiple taxonomic branches, leading to underuse of hierarchical information. We propose MAPLE (Multi-Path Adaptive Propagation with Level-Aware Embeddings), a framework that integrates (i) hierarchical semantic initialization from graph-aware textual descriptions, (ii) graph-based structure encoding via graph convolutional networks (GCNs), and (iii) adaptive multi-modal fusion that dynamically balances semantic priors and visual evidence. An adaptive level-aware objective automatically selects appropriate losses per hierarchy level. Evaluations on CORINE-aligned remote sensing datasets (AID, DFC-15, and MLRSNet) show consistent improvements of up to +42% in few-shot regimes while adding only 2.6% parameter overhead, demonstrating that MAPLE effectively and efficiently models hierarchical semantics for Earth observation (EO).
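MAPLE's graph-based structure encoding uses standard graph convolutional networks over the label hierarchy. A generic single-layer sketch of the usual GCN propagation rule (symmetric-normalized adjacency with self-loops), applied to a toy label tree; the tiny hierarchy, random embeddings, and weight shapes are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def gcn_layer(A, X, W):
    """One GCN layer: ReLU(D^{-1/2} (A + I) D^{-1/2} X W)."""
    A_hat = A + np.eye(A.shape[0])              # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))      # symmetric normalization
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W, 0.0)

# toy 3-level label hierarchy: root -> {a, b}, a -> {a1, a2}
#  nodes:      root a  b  a1 a2
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 0, 1, 1],
              [1, 0, 0, 0, 0],
              [0, 1, 0, 0, 0],
              [0, 1, 0, 0, 0]], dtype=float)

rng = np.random.default_rng(3)
X = rng.normal(size=(5, 8))   # initial label embeddings (e.g. text-derived)
W = rng.normal(size=(8, 4))
H = gcn_layer(A, X, W)        # each label embedding mixed with its parent/children
```

After propagation, each label's embedding carries information from its neighbours in the taxonomy, which is what lets sibling and ancestor labels share evidence in multi-path settings.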
[191] Multi-Feature Fusion Approach for Generative AI Images Detection
Abderrezzaq Sendjasni, Mohamed-Chaker Larabi
Main category: cs.CV
TL;DR: Multi-feature fusion framework combining statistical, semantic, and texture features for robust detection of AI-generated images across diverse generative models.
Details
Motivation: Existing AI-generated image detectors often rely on single-feature spaces and lack robustness against diverse and evolving generative models, necessitating more comprehensive detection approaches.
Method: Proposes a multi-feature fusion framework combining three complementary feature spaces: MSCN features for low-level statistical deviations, CLIP embeddings for high-level semantic coherence, and MLBP for mid-level texture anomalies.
Result: The fusion of all three representations yields superior and more consistent performance across four benchmark datasets, particularly in challenging mixed-model scenarios, outperforming state-of-the-art methods.
Conclusion: Hybrid representations combining multiple visual cues are crucial for robust GenAI image detection, providing a principled framework for integrating complementary features.
Abstract: The rapid evolution of Generative AI (GenAI) models has led to synthetic images of unprecedented realism, challenging traditional methods for distinguishing them from natural photographs. While existing detectors often rely on single-feature spaces, such as statistical regularities, semantic embeddings, or texture patterns, these approaches tend to lack robustness when confronted with diverse and evolving generative models. In this work, we investigate and systematically evaluate a multi-feature fusion framework that combines complementary cues from three distinct spaces: (1) Mean Subtracted Contrast Normalized (MSCN) features capturing low-level statistical deviations; (2) CLIP embeddings encoding high-level semantic coherence; and (3) Multi-scale Local Binary Patterns (MLBP) characterizing mid-level texture anomalies. Through extensive experiments on four benchmark datasets covering a wide range of generative models, we show that individual feature spaces exhibit significant performance variability across different generators. Crucially, the fusion of all three representations yields superior and more consistent performance, particularly in a challenging mixed-model scenario. Compared to state-of-the-art methods, the proposed framework yields consistently improved performance across all evaluated datasets. Overall, this work highlights the importance of hybrid representations for robust GenAI image detection and provides a principled framework for integrating complementary visual cues.
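Of the three fused cues, the MSCN features are the most mechanical to compute: each pixel is normalized by a local mean and contrast estimate, and deviations from natural-scene statistics flag synthetic content. A minimal sketch, assuming a simple box filter for the local statistics and illustrative summary statistics (the paper's exact filter, summary, and fusion settings are not given in the summary above):

```python
import numpy as np

def _box_mean(x, k=7):
    """Separable k x k box filter (same-size output) as a simple local mean."""
    kern = np.ones(k) / k
    x = np.apply_along_axis(lambda r: np.convolve(r, kern, mode="same"), 1, x)
    x = np.apply_along_axis(lambda c: np.convolve(c, kern, mode="same"), 0, x)
    return x

def mscn_coefficients(image, k=7, c=1e-3):
    """Mean Subtracted Contrast Normalized (MSCN) coefficients.

    Standard low-level statistical descriptor (as in BRISQUE-style quality
    models): subtract a local mean and divide by a local contrast estimate.
    """
    image = image.astype(np.float64)
    mu = _box_mean(image, k)
    var = np.maximum(_box_mean(image ** 2, k) - mu ** 2, 0.0)  # clamp float error
    return (image - mu) / (np.sqrt(var) + c)

def fused_feature(image, clip_embed, mlbp_hist):
    """Concatenate the three complementary feature spaces into one vector.

    clip_embed and mlbp_hist stand in for CLIP semantic embeddings and
    multi-scale LBP histograms computed elsewhere; the summary statistics
    over the MSCN map are an illustrative choice.
    """
    mscn = mscn_coefficients(image)
    stats = np.array([mscn.mean(), mscn.var(),
                      (mscn < 0).mean(), np.abs(mscn).mean()])
    return np.concatenate([stats, clip_embed, mlbp_hist])
```

A downstream classifier (the paper does not specify which in this summary) would then be trained on the fused vectors.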
[192] SceneTeract: Agentic Functional Affordances and VLM Grounding in 3D Scenes
Léopold Maillard, Francis Engelmann, Tom Durand, Boxiao Pan, Yang You, Or Litany, Leonidas Guibas, Maks Ovsjanikov
Main category: cs.CV
TL;DR: SceneTeract is a framework for verifying 3D scene functionality for embodied AI agents by combining semantic reasoning with geometric checks to validate accessibility requirements and physical feasibility.
Details
Motivation: Embodied AI needs interactive 3D environments that support meaningful activities, but assessing functional affordances remains challenging. Current approaches often lack verification of physical feasibility and accessibility constraints for specific agents.
Method: SceneTeract uses a grounded verification engine that couples high-level semantic reasoning with low-level geometric checks. It decomposes complex activities into atomic actions and validates each step against accessibility requirements (reachability, clearance, navigability) using physical and geometric simulations conditioned on agent profiles.
Result: The framework revealed frequent functional failures in synthetic indoor environments and systematic mismatches between semantic confidence and physical feasibility in frontier Vision-Language Models. It was also successfully used as a reward engine for VLM post-training to distill geometric constraints into reasoning models.
Conclusion: SceneTeract bridges perception and physical reality in embodied 3D scene understanding, providing a verification suite that can improve the functional reasoning capabilities of multimodal models in interactive environments.
Abstract: Embodied AI depends on interactive 3D environments that support meaningful activities for diverse users, yet assessing their functional affordances remains a core challenge. We introduce SceneTeract, a framework that verifies 3D scene functionality under agent-specific constraints. Our core contribution is a grounded verification engine that couples high-level semantic reasoning with low-level geometric checks. SceneTeract decomposes complex activities into sequences of atomic actions and validates each step against accessibility requirements (e.g., reachability, clearance, and navigability) conditioned on an embodied agent profile, using explicit physical and geometric simulations. We deploy SceneTeract to perform an in-depth evaluation of (i) synthetic indoor environments, uncovering frequent functional failures that prevent basic interactions, and (ii) the ability of frontier Vision-Language Models (VLMs) to reason about and predict functional affordances, revealing systematic mismatches between semantic confidence and physical feasibility even for the strongest current models. Finally, we leverage SceneTeract as a reward engine for VLM post-training, enabling scalable distillation of geometric constraints into reasoning models. We release the SceneTeract verification suite and data to bridge perception and physical reality in embodied 3D scene understanding.
[193] AutoFormBench: Benchmark Dataset for Automating Form Understanding
Gaurab Baral, Junxiu Zhou
Main category: cs.CV
TL;DR: AutoFormBench benchmark dataset for form element detection with comparison of OpenCV and YOLO architectures, showing YOLOv11’s superior performance
Details
Motivation: Automated processing of structured documents like forms, healthcare records, and invoices is challenging due to layout variability in real-world settings, requiring robust form element detection methods.
Method: Created AutoFormBench dataset of 407 annotated real-world forms across government, healthcare, and enterprise domains. Compared classical OpenCV approaches with four YOLO architectures (YOLOv8, YOLOv11, YOLOv26-s, YOLOv26-l) for detecting checkboxes, input lines, and text boxes in PDF documents.
Result: YOLOv11 demonstrated consistently superior performance in both F1 score and Jaccard accuracy across all element classes and tolerance levels compared to other YOLO variants and OpenCV approaches.
Conclusion: The AutoFormBench benchmark enables effective training and evaluation of form element detection models, with YOLOv11 emerging as the most effective architecture for this document understanding task.
Abstract: Automated processing of structured documents such as government forms, healthcare records, and enterprise invoices remains a persistent challenge due to the high degree of layout variability encountered in real-world settings. This paper introduces AutoFormBench, a benchmark dataset of 407 annotated real-world forms spanning government, healthcare, and enterprise domains, designed to train and evaluate form element detection models. We present a systematic comparison of classical OpenCV approaches and four YOLO architectures (YOLOv8, YOLOv11, YOLOv26-s, and YOLOv26-l) for localizing and classifying fillable form elements, specifically checkboxes, input lines, and text boxes, across diverse PDF document types. YOLOv11 demonstrates consistently superior performance in both F1 score and Jaccard accuracy across all element classes and tolerance levels.
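The F1 and Jaccard metrics above rest on matching predicted boxes to ground truth at an IoU tolerance. A minimal sketch of that evaluation step, using a greedy one-to-one matching scheme (an assumption; the paper's exact matching protocol is not described in the summary):

```python
def iou(a, b):
    """Jaccard index (IoU) of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def f1_at_tolerance(preds, gts, tol=0.5):
    """Greedily match each prediction to its best unmatched ground-truth box,
    count a true positive when IoU meets the tolerance, then compute F1."""
    matched, tp = set(), 0
    for p in preds:
        best, best_j = 0.0, -1
        for j, g in enumerate(gts):
            if j in matched:
                continue
            v = iou(p, g)
            if v > best:
                best, best_j = v, j
        if best >= tol:
            matched.add(best_j)
            tp += 1
    prec = tp / len(preds) if preds else 0.0
    rec = tp / len(gts) if gts else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

Sweeping `tol` reproduces the "tolerance levels" dimension of the benchmark's evaluation.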
[194] Toward Generalizable Whole Brain Representations with High-Resolution Light-Sheet Data
Minyoung E. Kim, Dae Hee Yun, Aditi V. Patel, Madeline Hon, Webster Guan, Taegeon Lee, Brian Nguyen
Main category: cs.CV
TL;DR: CANVAS is a comprehensive benchmark dataset of high-resolution whole mouse brain light-sheet fluorescence microscopy data with six cell-type markers and extensive cell annotations, created to accelerate development of methods for analyzing petabyte-scale subcellular-resolution 3D microscopy data.
Details
Motivation: Recent advances in tissue processing and microscopy have enabled unprecedented subcellular-resolution whole-brain 3D imaging, but existing computer vision models struggle to generalize to these petabyte-scale datasets, creating a bottleneck for biological interpretation.
Method: The authors created CANVAS, a benchmark dataset comprising high-resolution whole mouse brain light-sheet fluorescence microscopy data with six neuronal and immune cell-type markers, extensive cell annotations throughout the brain, and a leaderboard for method evaluation.
Result: CANVAS is presented as the first and largest LSFM benchmark capturing intact mouse brain tissue at subcellular level, with demonstrations showing that baseline models built on existing architectures struggle with generalization due to heterogeneity in cellular morphology across phenotypes and brain regions.
Conclusion: The CANVAS benchmark addresses the critical need for scalable data processing and analysis methods for petabyte-scale 3D microscopy data, providing a foundation for developing specialized models that can handle the unique challenges of subcellular-resolution whole-brain imaging.
Abstract: Unprecedented visual details of biological structures are being revealed by subcellular-resolution whole-brain 3D microscopy data, enabled by recent advances in intact tissue processing and light-sheet fluorescence microscopy (LSFM). These volumetric data offer rich morphological and spatial cellular information, however, the lack of scalable data processing and analysis methods tailored to these petabyte-scale data poses a substantial challenge for accurate interpretation. Further, existing models for visual tasks such as object detection and classification struggle to generalize to this type of data. To accelerate the development of suitable methods and foundational models, we present CANVAS, a comprehensive set of high-resolution whole mouse brain LSFM benchmark data, encompassing six neuronal and immune cell-type markers, along with cell annotations and a leaderboard. We also demonstrate challenges in generalization of baseline models built on existing architectures, especially due to the heterogeneity in cellular morphology across phenotypes and anatomical locations in the brain. To the best of our knowledge, CANVAS is the first and largest LSFM benchmark that captures intact mouse brain tissue at subcellular level, and includes extensive annotations of cells throughout the brain.
[195] Diffusion-Based Feature Denoising with NNMF for Robust handwritten digit multi-class classification
Hiba Adil Al-kharsan, Róbert Rajkó
Main category: cs.CV
TL;DR: A robust handwritten digit classification framework combining diffusion-based feature denoising with hybrid feature representations (NNMF + CNN) to improve robustness against noise and adversarial attacks.
Details
Motivation: To develop a robust multi-class classification system for handwritten digits that can withstand noise and adversarial attacks, inspired by previous work on brain tumor classification. The goal is to improve reliability in real-world scenarios where input quality may be compromised.
Method: 1. Extract hybrid features: NNMF for interpretable representations + CNN for deep features. 2. Apply diffusion process in feature space by gradually adding Gaussian noise. 3. Train feature denoiser network to reverse noise and reconstruct clean representations. 4. Use denoised features for multi-class classification. 5. Evaluate using AutoAttack in both baseline and adversarial settings.
Result: The diffusion-based hybrid model outperforms CNN baseline models while maintaining strong classification performance. The method demonstrates effectiveness and robustness against adversarial attacks in handwritten digit classification tasks.
Conclusion: Feature-level diffusion defense is effective for reliable multi-class handwritten digit classification, showing that combining diffusion processes with hybrid feature representations can significantly improve robustness to noise and adversarial attacks.
Abstract: This work presents a robust multi-class classification framework for handwritten digits that combines diffusion-driven feature denoising with a hybrid feature representation. Inspired by our previous work on brain tumor classification, the proposed approach operates in a feature space to improve robustness to noise and adversarial attacks. First, the input images are converted into compact, interpretable representations using Nonnegative Matrix Factorization (NNMF). In parallel, deep features are extracted using a convolutional neural network (CNN). These complementary features are combined into a unified hybrid representation. To improve robustness, a stepwise diffusion operation is applied in the feature space by gradually adding Gaussian noise. A feature denoiser network is trained to reverse this operation and reconstruct clean representations from perturbed inputs. The denoised features are then used for multi-class classification. The proposed method is evaluated in both baseline and adversarial settings using AutoAttack. The experimental results show that the diffusion-based hybrid model is both effective and robust, outperforming the CNN baseline models while maintaining strong classification performance. These results demonstrate the effectiveness of feature-level diffusion defense for reliable multi-class handwritten digit classification.
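The core of the defense is forward diffusion applied to feature vectors rather than images. A sketch of that noising step under a standard DDPM-style linear schedule (the schedule, timestep count, and feature dimension below are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def add_feature_noise(features, t, betas):
    """Forward diffusion in feature space: sample from q(z_t | z_0).

    Gradually corrupts a clean hybrid feature vector (e.g. concatenated
    NNMF + CNN features) with Gaussian noise; the denoiser network is then
    trained to invert this corruption.
    """
    alpha_bar = np.prod(1.0 - betas[: t + 1])     # cumulative signal retention
    eps = np.random.randn(*features.shape)        # injected Gaussian noise
    z_t = np.sqrt(alpha_bar) * features + np.sqrt(1.0 - alpha_bar) * eps
    return z_t, eps                               # eps is the denoiser's target

betas = np.linspace(1e-4, 0.02, 100)  # linear noise schedule (assumed)
z0 = np.random.randn(64)              # a clean hybrid feature vector (assumed dim)
z_t, eps = add_feature_noise(z0, t=50, betas=betas)
```

During training, `(z_t, t)` pairs would be fed to the denoiser with `eps` (or `z0`) as the regression target; the classifier then consumes denoised features.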
[196] Training deep learning based dynamic MR image reconstruction using synthetic fractals
Anirudh Raman, Olivier Jaubert, Mark Wrobel, Tina Yao, Ruaraidh Campbell, Rebecca Baker, Ruta Virsinskaite, Daniel Knight, Michael Quail, Jennifer Steeden, Vivek Muthurangu
Main category: cs.CV
TL;DR: Synthetic fractal data can effectively train DL models for dynamic MRI reconstruction, matching performance of models trained on real cardiac MRI data while avoiding privacy/availability issues.
Details
Motivation: To overcome privacy, licensing, and availability limitations of cardiac MRI training datasets by using synthetically generated fractal data for training deep learning reconstruction models.
Method: Generated 2D+time images using quaternion Julia fractals, simulated multi-coil MRI acquisition to create paired fully/undersampled k-space data, trained 3D UNet models on fractal data (F-DL) and compared with identical model trained on cardiac MRI data (CMR-DL).
Result: No significant difference in qualitative ranking between F-DL and CMR-DL (p=0.9), both outperformed compressed sensing and low-rank deep image prior. Ventricular volumes/function from F-DL similar to CMR-DL with acceptable agreement to reference cine imaging.
Conclusion: DL models trained on synthetic fractal data can reconstruct real-time cardiac MRI with comparable image quality and clinical measurements to models trained on real cardiac MRI, providing an open, scalable alternative to clinical datasets.
Abstract: Purpose: To investigate whether synthetically generated fractal data can be used to train deep learning (DL) models for dynamic MRI reconstruction, thereby avoiding the privacy, licensing, and availability limitations associated with cardiac MR training datasets. Methods: A training dataset was generated using quaternion Julia fractals to produce 2D+time images. Multi-coil MRI acquisition was simulated to generate paired fully sampled and radially undersampled k-space data. A 3D UNet deep artefact suppression model was trained using these fractal data (F-DL) and compared with an identical model trained on cardiac MRI data (CMR-DL). Both models were evaluated on prospectively acquired radial real-time cardiac MRI from 10 patients. Reconstructions were compared against compressed sensing (CS) and low-rank deep image prior (LR-DIP). All reconstructions were ranked for image quality, while ventricular volumes and ejection fraction were compared with reference breath-hold cine MRI. Results: There was no significant difference in qualitative ranking between F-DL and CMR-DL (p=0.9), while both outperformed CS and LR-DIP (p<0.001). Ventricular volumes and function derived from F-DL were similar to CMR-DL, showing no significant bias and acceptable limits of agreement compared to reference cine imaging. However, LR-DIP had a significant bias (p=0.016) and wider limits of agreement. Conclusion: DL models trained using synthetic fractal data can reconstruct real-time cardiac MRI with image quality and clinical measurements comparable to models trained on true cardiac MRI data. Fractal training data provide an open, scalable alternative to clinical datasets and may enable development of more generalisable DL reconstruction models for dynamic MRI.
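The synthetic training images come from quaternion Julia sets, which only require iterating q ← q² + c with quaternion arithmetic and an escape test. A toy membership test and 2D slice (the constant c, escape radius, and slicing scheme below are arbitrary choices for illustration, not the paper's):

```python
import numpy as np

def quat_square(q):
    """Square of a quaternion q = (w, x, y, z):
    q^2 = (w^2 - |v|^2, 2*w*v) where v = (x, y, z)."""
    w, v = q[0], q[1:]
    return np.concatenate([[w * w - v @ v], 2.0 * w * v])

def in_julia_set(q, c, max_iter=30, escape=4.0):
    """Membership test for the quaternion Julia iteration q <- q^2 + c:
    points that never exceed the escape radius are inside the set."""
    for _ in range(max_iter):
        q = quat_square(q) + c
        if q @ q > escape:   # squared norm beyond escape radius
            return False
    return True

c = np.array([-0.2, 0.6, 0.2, 0.2])  # an arbitrary example constant
# Fixing two quaternion components yields one 2D slice, i.e. one image frame;
# sweeping a component over time would give the 2D+time volumes.
frame = np.array([[in_julia_set(np.array([w, x, 0.0, 0.0]), c)
                   for x in np.linspace(-1.2, 1.2, 16)]
                  for w in np.linspace(-1.2, 1.2, 16)])
```

From such frames, the paper's pipeline simulates multi-coil acquisition and radial undersampling to build the paired k-space training data.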
[197] Abstraction in Style
Min Lu, Yuanfeng He, Anthony Chen, Jianhuang He, Pu Wang, Daniel Cohen-Or, Hui Huang
Main category: cs.CV
[198] VectorGym: A Multitask Benchmark for SVG Code Generation, Sketching, and Editing
Juan Rodriguez, Haotian Zhang, Abhay Puri, Tianyang Zhang, Rishav Pramanik, Meng Lin, Xiaoqing Xie, Marco Terral, Darsh Kaushik, Aly Shariff, Perouz Taslakian, Spandana Gella, Sai Rajeswar, David Vazquez, Christopher Pal, Marco Pedersoli
Main category: cs.CV
[199] GENIE: Gram-Eigenmode INR Editing with Closed-Form Geometry Updates
Samundra Karki, Adarsh Krishnamurthy, Baskar Ganapathysubramanian
Main category: cs.CV
[200] End-to-End Image Compression with Segmentation Guided Dual Coding for Wind Turbines
Raül Pérez-Gonzalo, Andreas Espersen, Søren Forchhammer, Antonio Agudo
Main category: cs.CV
TL;DR: A deep learning framework for wind turbine inspection images that jointly performs blade segmentation and dual-mode (lossy/lossless) compression, prioritizing blade regions over background for efficient transmission while preserving defect detection quality.
Details
Motivation: Transferring high-resolution wind turbine inspection images creates bottlenecks; need efficient compression that preserves blade region fidelity for defect detection while aggressively compressing background.
Method: End-to-end framework with: 1) BU-Netv2+P segmentation with CRF-regularized loss for blade localization, 2) hyperprior-based autoencoder for lossy compression, 3) extended bits-back coder with hierarchical models for lossless blade reconstruction, and 4) parallelized dual-mode compression by reusing background-coded bits.
Result: Superior compression performance and efficiency on large-scale wind turbine dataset, offering practical solution for automated inspections without compromising defect detection.
Conclusion: First fully integrated learning-based ROI codec combining segmentation, lossy, and lossless compression for wind turbine inspection, enabling efficient transmission while preserving critical blade region quality.
Abstract: Transferring large volumes of high-resolution images during wind turbine inspections introduces a bottleneck in assessing and detecting severe defects. Efficient coding must preserve high fidelity in blade regions while aggressively compressing the background. In this work, we propose an end-to-end deep learning framework that jointly performs segmentation and dual-mode (lossy and lossless) compression. The segmentation module accurately identifies the blade region, after which our region-of-interest (ROI) compressor encodes it at superior quality compared to the rest of the image. Unlike conventional ROI schemes that merely allocate more bits to salient areas, our framework integrates: (i) a robust segmentation network (BU-Netv2+P) with a CRF-regularized loss for precise blade localization, (ii) a hyperprior-based autoencoder optimized for lossy compression, and (iii) an extended bits-back coder with hierarchical models for fully lossless blade reconstruction. Furthermore, our ROI framework removes the sequential dependency in bits-back coding by reusing background-coded bits, enabling parallelized and efficient dual-mode compression. To the best of our knowledge, this is the first fully integrated learning-based ROI codec combining segmentation, lossy, and lossless compression, ensuring that subsequent defect detection is not compromised. Experiments on a large-scale wind turbine dataset demonstrate superior compression performance and efficiency, offering a practical solution for automated inspections.
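The central ROI idea, spending more bits on the blade mask than on the background, can be made concrete with a toy quantizer; the paper's actual codec is a learned autoencoder with bits-back coding, so this only illustrates the bit-allocation principle:

```python
import numpy as np

def roi_dual_quantize(image, mask, roi_step=2, bg_step=32):
    """Toy segmentation-guided dual coding: fine quantization inside the
    blade mask (small step, more fidelity/bits), coarse quantization
    elsewhere (large step, fewer bits). Step sizes are illustrative."""
    q = np.where(mask,
                 (image // roi_step) * roi_step,   # near-lossless blade region
                 (image // bg_step) * bg_step)     # aggressive background
    return q.astype(image.dtype)
```

Fewer distinct symbols in the background mean a shorter entropy code, while the blade region keeps the detail that defect detection depends on.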
[201] VIGiA: Instructional Video Guidance via Dialogue Reasoning and Retrieval
Diogo Glória-Silva, David Semedo, João Maglhães
Main category: cs.CV
TL;DR: VIGiA is a multimodal dialogue model for understanding and reasoning over complex instructional video action plans, supporting grounded, plan-aware dialogue that integrates visual inputs, instructional plans, and user interactions.
Details
Motivation: Prior work focuses mainly on text-only guidance or treats vision and language in isolation, lacking the ability to handle complex, multi-step instructional video action plans that require reasoning over visual inputs, instructional plans, and interleaved user interactions.
Method: VIGiA incorporates two key capabilities: (1) multimodal plan reasoning to align uni- and multimodal queries with current task plans and respond accurately, and (2) plan-based retrieval to retrieve relevant plan steps in either textual or visual representations.
Result: VIGiA outperforms existing state-of-the-art models on all tasks in conversational plan guidance settings, reaching over 90% accuracy on plan-aware VQA on a novel dataset with rich Instructional Video Dialogues aligned with Cooking and DIY plans.
Conclusion: VIGiA demonstrates strong capabilities in multimodal dialogue for instructional video understanding, effectively integrating visual and textual information with plan reasoning for complex task guidance.
Abstract: We introduce VIGiA, a novel multimodal dialogue model designed to understand and reason over complex, multi-step instructional video action plans. Unlike prior work which focuses mainly on text-only guidance, or treats vision and language in isolation, VIGiA supports grounded, plan-aware dialogue that requires reasoning over visual inputs, instructional plans, and interleaved user interactions. To this end, VIGiA incorporates two key capabilities: (1) multimodal plan reasoning, enabling the model to align uni- and multimodal queries with the current task plan and respond accurately; and (2) plan-based retrieval, allowing it to retrieve relevant plan steps in either textual or visual representations. Experiments were done on a novel dataset with rich Instructional Video Dialogues aligned with Cooking and DIY plans. Our evaluation shows that VIGiA outperforms existing state-of-the-art models on all tasks in a conversational plan guidance setting, reaching over 90% accuracy on plan-aware VQA.
[202] Gloria: Consistent Character Video Generation via Content Anchors
Yuhang Yang, Fan Zhang, Huaijin Pi, Shuai Guo, Guowei Xu, Wei Zhai, Yang Cao, Zheng-Jun Zha
Main category: cs.CV
TL;DR: Gloria generates long-duration character videos with consistent multi-view appearance using compact anchor frames and novel mechanisms to prevent copy-pasting and multi-reference conflicts.
Details
Motivation: Digital characters are central to modern media, but generating character videos with long-duration, consistent multi-view appearance and expressive identity remains challenging. Existing approaches either provide insufficient context to preserve identity or use non-character-centric information as memory, leading to suboptimal consistency.
Method: Proposes representing character visual attributes through a compact set of anchor frames that provide stable references for consistency. Introduces two key mechanisms: 1) Superset Content Anchoring - provides intra- and extra-training clip cues to prevent duplication, and 2) RoPE as Weak Condition - encodes positional offsets to distinguish multiple anchors. Also constructs a scalable pipeline to extract anchors from massive videos.
Result: The method generates high-quality character videos exceeding 10 minutes in duration, achieving expressive identity and appearance consistency across views, surpassing existing methods.
Conclusion: The proposed approach successfully addresses challenges in character video generation by using anchor frames with novel mechanisms to ensure consistency and prevent common issues like copy-pasting and multi-reference conflicts, enabling long-duration, high-quality character video generation.
Abstract: Digital characters are central to modern media, yet generating character videos with long-duration, consistent multi-view appearance and expressive identity remains challenging. Existing approaches either provide insufficient context to preserve identity or leverage non-character-centric information as the memory, leading to suboptimal consistency. Recognizing that character video generation inherently resembles an outside-looking-in scenario, in this work we propose representing the character's visual attributes through a compact set of anchor frames. This design provides stable references for consistency, while reference-based video generation inherently faces challenges of copy-pasting and multi-reference conflicts. To address these, we introduce two mechanisms: Superset Content Anchoring, providing intra- and extra-training clip cues to prevent duplication, and RoPE as Weak Condition, encoding positional offsets to distinguish multiple anchors. Furthermore, we construct a scalable pipeline to extract these anchors from massive videos. Experiments show our method generates high-quality character videos exceeding 10 minutes, and achieves expressive identity and appearance consistency across views, surpassing existing methods.
[203] Better than Average: Spatially-Aware Aggregation of Segmentation Uncertainty Improves Downstream Performance
Vanessa Emanuela Guarino, Claudia Winklmayr, Jannik Franzen, Josef Lorenz Rumberger, Manuel Pfeuffer, Sonja Greven, Klaus Maier-Hein, Carsten T. Lüth, Christoph Karg, Dagmar Kainmueller
Main category: cs.CV
TL;DR: Systematic analysis of uncertainty aggregation strategies for image segmentation, proposing novel spatial-aware aggregators and a meta-aggregator for robust performance across datasets.
Details
Motivation: Current uncertainty quantification in image segmentation lacks systematic comparison of aggregation strategies, leading to inconsistent practices and suboptimal performance in downstream tasks like OoD and failure detection.
Method: Formal analysis of existing aggregation strategies, proposal of novel spatial-aware strategies, benchmarking across 10 diverse datasets, and development of a meta-aggregator that integrates multiple approaches.
Result: Spatial-aware aggregators outperform traditional methods like Global Average in OoD and failure detection tasks, but performance varies with dataset characteristics. The meta-aggregator achieves robust performance across all datasets.
Conclusion: Aggregation strategy choice significantly impacts downstream task performance in segmentation uncertainty quantification. Spatial-aware methods are generally superior, and a meta-aggregator provides dataset-agnostic robustness.
Abstract: Uncertainty Quantification (UQ) is crucial for ensuring the reliability of automated image segmentations in safety-critical domains like biomedical image analysis or autonomous driving. In segmentation, UQ generates pixel-wise uncertainty scores that must be aggregated into image-level scores for downstream tasks like Out-of-Distribution (OoD) or failure detection. Despite routine use of aggregation strategies, their properties and impact on downstream task performance have not yet been comprehensively studied. Global Average is the default choice, yet it does not account for spatial and structural features of segmentation uncertainty. Alternatives like patch-, class- and threshold-based strategies exist, but lack systematic comparison, leading to inconsistent reporting and unclear best practices. We address this gap by (1) formally analyzing properties, limitations, and pitfalls of common strategies; (2) proposing novel strategies that incorporate spatial uncertainty structure and (3) benchmarking their performance on OoD and failure detection across ten datasets that vary in image geometry and structure. We find that aggregators leveraging spatial structure yield stronger performance in both downstream tasks studied. However, the performance of individual aggregators depends heavily on dataset characteristics, so we (4) propose a meta-aggregator that integrates multiple aggregators and performs robustly across datasets.
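The contrast between the default Global Average and a spatially-aware aggregator is easy to make concrete: averaging within patches and then taking the maximum patch score keeps a small, concentrated uncertain region from being diluted by a large confident background. A sketch with two illustrative aggregators (the paper's actual strategies differ in detail):

```python
import numpy as np

def global_average(u):
    """Default aggregator: mean over all pixel-wise uncertainty scores."""
    return u.mean()

def patch_aggregate(u, patch=8):
    """Spatially-aware aggregator: mean uncertainty per patch, then the
    maximum over patches, so localized uncertainty dominates the score.
    Patch size is an illustrative choice."""
    h, w = u.shape
    scores = [u[i:i + patch, j:j + patch].mean()
              for i in range(0, h, patch)
              for j in range(0, w, patch)]
    return max(scores)
```

On a map where only one 8x8 corner is uncertain, the global average is near zero while the patch aggregator flags the image, which is the behavior that matters for OoD and failure detection.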
[204] EC-Bench: Enumeration and Counting Benchmark for Ultra-Long Videos
Fumihiko Tsuchiya, Taiki Miyanishi, Mahiro Ukai, Nakamasa Inoue, Shuhei Kurita, Yusuke Iwasawa, Yutaka Matsuo
Main category: cs.CV
TL;DR: EC-Bench introduces a new benchmark for evaluating enumeration, counting, and temporal evidence grounding in long-form videos (30+ minutes), revealing significant gaps between current MLLMs and human performance.
Details
Motivation: Current video counting benchmarks focus on short clips and only evaluate final numerical answers, lacking insight into what should be counted or whether models consistently identify relevant instances across time. There's a need for benchmarks that evaluate long-range temporal reasoning in videos with sparse, diverse events.
Method: Created EC-Bench with 152 videos longer than 30 minutes and 1,699 queries paired with explicit evidence spans. Evaluated 22 multimodal large language models (MLLMs) on enumeration, counting, and temporal evidence grounding tasks.
Result: Best MLLM achieved only 29.98% accuracy on Enumeration and 23.74% on Counting, while human performance reached 78.57% and 82.97% respectively. Strong relationships were found between enumeration accuracy, temporal grounding, and counting performance.
Conclusion: EC-Bench reveals fundamental limitations of current MLLMs in long-form quantitative video reasoning and establishes a challenging benchmark for future research in multimodal video understanding.
Abstract: Counting in long videos remains a fundamental yet underexplored challenge in computer vision. Real-world recordings often span tens of minutes or longer and contain sparse, diverse events, making long-range temporal reasoning particularly difficult. However, most existing video counting benchmarks focus on short clips and evaluate only the final numerical answer, providing little insight into what should be counted or whether models consistently identify relevant instances across time. We introduce EC-Bench, a benchmark that jointly evaluates enumeration, counting, and temporal evidence grounding in long-form videos. EC-Bench contains 152 videos longer than 30 minutes and 1,699 queries paired with explicit evidence spans. Across 22 multimodal large language models (MLLMs), the best model achieves only 29.98% accuracy on Enumeration and 23.74% on Counting, while human performance reaches 78.57% and 82.97%, respectively. Our analysis reveals strong relationships between enumeration accuracy, temporal grounding, and counting performance. These results highlight fundamental limitations of current MLLMs and establish EC-Bench as a challenging benchmark for long-form quantitative video reasoning.
[205] Detecting Unknown Objects via Energy-based Separation for Open World Object Detection
Jun-Woo Heo, Keonhee Park, Gyeong-Moon Park
Main category: cs.CV
TL;DR: DEUS is a novel framework for Open World Object Detection that improves unknown object discovery and mitigates catastrophic forgetting through ETF-based subspace separation and energy-based distinction.
Details
Motivation: Existing OWOD methods struggle with effectively learning unknown object representations due to heavy reliance on known class predictions, and memory replay techniques often sacrifice knowledge of newly learned classes while mitigating forgetting of old ones.
Method: DEUS consists of two components: 1) ETF-Subspace Unknown Separation (EUS) that creates orthogonal subspaces using Equiangular Tight Frame geometric properties for cleaner separation between known and unknown objects, and 2) Energy-based Known Distinction (EKD) loss that enforces separation between previous and current classifiers to minimize knowledge interference during memory replay.
Result: DEUS demonstrates outstanding performance improvements in unknown detection while maintaining competitive known class performance on OWOD benchmarks.
Conclusion: The proposed DEUS framework effectively addresses key challenges in Open World Object Detection by improving unknown object discovery and reducing catastrophic forgetting through novel subspace separation and energy-based distinction techniques.
Abstract: In this work, we tackle the problem of Open World Object Detection (OWOD). This challenging scenario requires the detector to incrementally learn to classify known objects without forgetting while identifying unknown objects without supervision. Previous OWOD methods have enhanced the unknown discovery process and employed memory replay to mitigate catastrophic forgetting. However, since existing methods heavily rely on the detector’s known class predictions for detecting unknown objects, they struggle to effectively learn and recognize unknown object representations. Moreover, while memory replay mitigates forgetting of old classes, it often sacrifices the knowledge of newly learned classes. To resolve these limitations, we propose DEUS (Detecting Unknowns via energy-based Separation), a novel framework that addresses the challenges of Open World Object Detection. DEUS consists of Equiangular Tight Frame (ETF)-Subspace Unknown Separation (EUS) and an Energy-based Known Distinction (EKD) loss. EUS leverages ETF-based geometric properties to create orthogonal subspaces, enabling cleaner separation between known and unknown object representations. Unlike prior energy-based approaches that consider only the known space, EUS utilizes energies from both spaces to better capture distinct patterns of unknown objects. Furthermore, EKD loss enforces the separation between previous and current classifiers, thus minimizing knowledge interference between previous and newly learned classes during memory replay. We thoroughly validate DEUS on OWOD benchmarks, demonstrating outstanding performance improvements in unknown detection while maintaining competitive known class performance.
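The Equiangular Tight Frame geometry that EUS builds on fixes class directions to be maximally and equally separated. A minimal NumPy sketch of the simplex-ETF construction (the subspace-separation and energy machinery of DEUS are not reproduced here; `simplex_etf` is an illustrative name, not the paper's API):

```python
import numpy as np

def simplex_etf(num_classes: int, dim: int) -> np.ndarray:
    """Return a (dim x num_classes) simplex Equiangular Tight Frame.

    Columns are unit vectors whose pairwise cosine similarity is the
    maximally separated constant -1/(num_classes - 1).
    """
    K = num_classes
    assert dim >= K, "need enough dimensions to embed the frame"
    # K equiangular directions: centre the identity, then rescale to unit norm.
    M = np.sqrt(K / (K - 1)) * (np.eye(K) - np.ones((K, K)) / K)
    # Embed into the target dimension with an orthonormal basis.
    Q, _ = np.linalg.qr(np.random.default_rng(0).standard_normal((dim, K)))
    return Q @ M

W = simplex_etf(5, 16)
G = W.T @ W  # Gram matrix: ones on the diagonal, -1/4 off the diagonal
```

Because the Gram matrix is fixed by `num_classes` alone, such a frame can serve as a frozen classifier whose geometry the features are pulled toward.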
[206] SleepVLM: Explainable and Rule-Grounded Sleep Staging via a Vision-Language Model
Guifeng Deng, Pan Wang, Jiquan Wang, Shuying Rao, Junyi Xie, Wanjun Guo, Tao Li, Haiteng Jiang
Main category: cs.CV
TL;DR: SleepVLM is a vision-language model for sleep staging from PSG waveform images that generates clinician-readable rationales based on AASM scoring criteria, achieving state-of-the-art performance while providing transparent explanations.
Details
Motivation: While automated sleep staging has achieved expert-level accuracy, clinical adoption is hindered by lack of auditable reasoning. There's a need for transparent, interpretable models that can generate clinician-readable rationales to improve trustworthiness in clinical workflows.
Method: SleepVLM uses waveform-perceptual pre-training and rule-grounded supervised fine-tuning to stage sleep from multi-channel polysomnography (PSG) waveform images. It generates rationales based on American Academy of Sleep Medicine (AASM) scoring criteria.
Result: Achieved Cohen’s kappa scores of 0.767 on held-out test set (MASS-SS1) and 0.743 on external cohort (ZUAMHCS), matching state-of-the-art performance. Expert evaluations gave mean scores exceeding 4.0/5.0 for factual accuracy, evidence comprehensiveness, and logical coherence.
Conclusion: SleepVLM couples competitive performance with transparent, rule-based explanations, potentially improving trustworthiness and auditability of automated sleep staging in clinical workflows. The authors also release MASS-EX, an expert-annotated dataset for interpretable sleep medicine research.
Abstract: While automated sleep staging has achieved expert-level accuracy, its clinical adoption is hindered by a lack of auditable reasoning. We introduce SleepVLM, a rule-grounded vision-language model (VLM) designed to stage sleep from multi-channel polysomnography (PSG) waveform images while generating clinician-readable rationales based on American Academy of Sleep Medicine (AASM) scoring criteria. Utilizing waveform-perceptual pre-training and rule-grounded supervised fine-tuning, SleepVLM achieved Cohen’s kappa scores of 0.767 on a held-out test set (MASS-SS1) and 0.743 on an external cohort (ZUAMHCS), matching state-of-the-art performance. Expert evaluations further validated the quality of the model’s reasoning, with mean scores exceeding 4.0/5.0 for factual accuracy, evidence comprehensiveness, and logical coherence. By coupling competitive performance with transparent, rule-based explanations, SleepVLM may improve the trustworthiness and auditability of automated sleep staging in clinical workflows. To facilitate further research in interpretable sleep medicine, we release MASS-EX, a novel expert-annotated dataset.
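Cohen's kappa, the agreement metric quoted above, corrects raw accuracy for agreement expected by chance; a small self-contained sketch of the standard formula:

```python
import numpy as np

def cohens_kappa(y_true, y_pred, num_classes):
    """Chance-corrected agreement between predicted and reference labels."""
    cm = np.zeros((num_classes, num_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    n = cm.sum()
    p_obs = np.trace(cm) / n                          # observed agreement
    p_exp = (cm.sum(axis=1) @ cm.sum(axis=0)) / n**2  # chance agreement
    return (p_obs - p_exp) / (1 - p_exp)

print(cohens_kappa([0, 0, 1, 1], [0, 0, 1, 0], 2))  # 0.5
```

Unlike raw accuracy, kappa is 0 for a labeller that matches only by chance and 1 only for perfect agreement, which is why sleep-staging papers report it alongside accuracy.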
[207] NeuroBRIDGE: Behavior-Conditioned Koopman Dynamics with Riemannian Alignment for Early Substance Use Initiation Prediction from Longitudinal Functional Connectome
Badhan Mazumder, Sir-Lord Wiafe, Vince D. Calhoun, Dong Hye Ye
Main category: cs.CV
TL;DR: NeuroBRIDGE: A graph neural network framework for predicting adolescent substance use initiation by modeling longitudinal brain connectivity dynamics with behavioral-conditioned Koopman dynamics.
Details
Motivation: Early identification of adolescents at risk for substance use initiation is crucial but challenging because most existing methods treat brain connectivity as static or cross-sectional, missing how brain networks change over time and with behavior.
Method: Proposed NeuroBRIDGE framework that aligns longitudinal functional connectomes in a Riemannian tangent space and couples dual-time attention with behavioral-conditioned Koopman dynamics to capture temporal changes in brain networks.
Result: Evaluated on ABCD dataset, NeuroBRIDGE improved future substance use initiation prediction over relevant baselines while providing interpretable insights into neural pathways.
Conclusion: The framework refines understanding of neurodevelopmental risk and informs targeted prevention strategies for adolescent substance use.
Abstract: Early identification of adolescents at risk for substance use initiation (SUI) is vital yet difficult, as most predictors treat connectivity as static or cross-sectional and miss how brain networks change over time and with behavior. We propose NeuroBRIDGE (Behavior conditioned RIemannian Koopman Dynamics on lonGitudinal connEctomes), a novel graph neural network-based framework that aligns longitudinal functional connectomes in a Riemannian tangent space and couples dual-time attention with behavioral-conditioned Koopman dynamics to capture temporal change. Evaluated on ABCD, NeuroBRIDGE improved future SUI prediction over relevant baselines while offering interpretable insights into neural pathways, refining our understanding of neurodevelopmental risk and informing targeted prevention.
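Riemannian tangent-space alignment of connectomes typically whitens each symmetric positive-definite (SPD) connectivity matrix by a reference and takes a matrix logarithm; a hedged NumPy sketch of that standard projection (the paper's exact alignment and the Koopman components may differ):

```python
import numpy as np

def spd_log(S):
    """Matrix logarithm of a symmetric positive-definite matrix via eigh."""
    w, V = np.linalg.eigh(S)
    return (V * np.log(w)) @ V.T

def tangent_space(C, C_ref):
    """Project an SPD connectivity matrix C onto the tangent space at C_ref."""
    w, V = np.linalg.eigh(C_ref)
    inv_sqrt = (V * (w ** -0.5)) @ V.T        # C_ref^{-1/2}
    return spd_log(inv_sqrt @ C @ inv_sqrt)   # log of the whitened connectome
```

After this projection, connectomes from different timepoints live in a common Euclidean space, so ordinary vector operations (and downstream dynamics models) apply.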
[208] ResAdapt: Adaptive Resolution for Efficient Multimodal Reasoning
Huanxuan Liao, Zhongtao Jiang, Yupu Hao, Yuqiao Tan, Shizhu He, Ben Wang, Jun Zhao, Kun Xu, Kang Liu
Main category: cs.CV
TL;DR: ResAdapt is an input-side adaptation framework that learns to allocate visual budget per frame before encoding, enabling MLLMs to handle higher spatial resolution and longer temporal contexts without increasing visual token count.
Details
Motivation: Current MLLMs struggle with visual token growth when scaling input fidelity (higher resolution/longer videos), making it prohibitive to sustain both high spatial resolution and long temporal context simultaneously.
Method: Uses a lightweight Allocator coupled with an unchanged MLLM backbone; formulates allocation as a contextual bandit problem; trains with Cost-Aware Policy Optimization (CAPO) to convert sparse rollout feedback into an accuracy-cost learning signal.
Result: Improves low-budget operating points and often lies on or near the efficiency-accuracy frontier; supports up to 16x more frames at the same visual budget with over 15% performance gain on reasoning-intensive benchmarks.
Conclusion: ResAdapt addresses the visual token bottleneck by learning optimal per-frame visual budget allocation before encoding, enabling more efficient multimodal understanding without modifying MLLM backbone.
Abstract: Multimodal Large Language Models (MLLMs) achieve stronger visual understanding by scaling input fidelity, yet the resulting visual token growth makes jointly sustaining high spatial resolution and long temporal context prohibitive. We argue that the bottleneck lies not in how post-encoding representations are compressed but in the volume of pixels the encoder receives, and address it with ResAdapt, an Input-side adaptation framework that learns how much visual budget each frame should receive before encoding. ResAdapt couples a lightweight Allocator with an unchanged MLLM backbone, so the backbone retains its native visual-token interface while receiving an operator-transformed input. We formulate allocation as a contextual bandit and train the Allocator with Cost-Aware Policy Optimization (CAPO), which converts sparse rollout feedback into a stable accuracy-cost learning signal. Across budget-controlled video QA, temporal grounding, and image reasoning tasks, ResAdapt improves low-budget operating points and often lies on or near the efficiency-accuracy frontier, with the clearest gains on reasoning-intensive benchmarks under aggressive compression. Notably, ResAdapt supports up to 16x more frames at the same visual budget while delivering over 15% performance gain. Code is available at https://github.com/Xnhyacinth/ResAdapt.
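CAPO itself is a policy-optimization scheme over a learned Allocator; as a loose illustration of the two ingredients named above, an accuracy-minus-cost reward and bandit-style arm selection, here is a toy sketch (`RESOLUTIONS`, `LAMBDA`, and the ε-greedy rule are illustrative assumptions, not the paper's design):

```python
import random

RESOLUTIONS = [112, 224, 448]   # hypothetical per-frame budget "arms"
LAMBDA = 1e-6                   # hypothetical cost weight

def accuracy_cost_reward(correct: bool, resolution: int) -> float:
    """Sparse task feedback folded into an accuracy-minus-cost signal."""
    cost = resolution ** 2      # visual tokens grow roughly with pixel area
    return float(correct) - LAMBDA * cost

class EpsilonGreedyAllocator:
    """Toy bandit allocator; the real Allocator conditions on frame context."""

    def __init__(self, eps: float = 0.1):
        self.eps = eps
        self.value = {r: 0.0 for r in RESOLUTIONS}  # running mean reward per arm
        self.count = {r: 0 for r in RESOLUTIONS}

    def pick(self) -> int:
        if random.random() < self.eps:
            return random.choice(RESOLUTIONS)                 # explore
        return max(RESOLUTIONS, key=lambda r: self.value[r])  # exploit

    def update(self, resolution: int, reward: float) -> None:
        self.count[resolution] += 1
        self.value[resolution] += (reward - self.value[resolution]) / self.count[resolution]
```

The reward makes the trade-off explicit: a correct answer at 448 px earns less than one at 224 px, so the allocator is pushed toward the cheapest budget that still answers correctly.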
[209] SurgTEMP: Temporal-Aware Surgical Video Question Answering with Text-guided Visual Memory for Laparoscopic Cholecystectomy
Shi Li, Vinkle Srivastav, Nicolas Chanel, Saurav Sharma, Nabani Banik, Lorenzo Arboit, Kun Yuan, Pietro Mascagni, Nicolas Padoy
Main category: cs.CV
TL;DR: SurgTEMP: A multimodal LLM framework for surgical video question answering with hierarchical visual memory and surgical competency progression training, evaluated on new CholeVidQA-32K dataset.
Details
Motivation: Current surgical VQA research focuses on static frames, missing rich temporal semantics. Surgical video QA faces challenges like low visual contrast, knowledge-driven nature, diverse analytical needs across temporal windows, and hierarchy from perception to assessment.
Method: Proposes SurgTEMP with (i) a query-guided token selection module that builds hierarchical visual memory (spatial and temporal memory banks) and (ii) a Surgical Competency Progression (SCP) training scheme for modeling variable-length surgical videos while preserving procedure-relevant cues and temporal coherence.
Result: Achieves substantial performance improvements against state-of-the-art open-source multimodal and video LLMs (fine-tuned and zero-shot) on CholeVidQA-32K dataset, advancing video-based surgical VQA.
Conclusion: SurgTEMP effectively addresses challenges in surgical video QA through hierarchical visual memory and progressive training, demonstrating strong performance on diverse surgical assessment tasks.
Abstract: Surgical procedures are inherently complex and risky, requiring extensive expertise and constant focus to well navigate evolving intraoperative scenes. Computer-assisted systems such as surgical visual question answering (VQA) offer promises for education and intraoperative support. Current surgical VQA research largely focuses on static frame analysis, overlooking rich temporal semantics. Surgical video question answering is further challenged by low visual contrast, its highly knowledge-driven nature, diverse analytical needs spanning scattered temporal windows, and the hierarchy from basic perception to high-level intraoperative assessment. To address these challenges, we propose SurgTEMP, a multimodal LLM framework featuring (i) a query-guided token selection module that builds hierarchical visual memory (spatial and temporal memory banks) and (ii) a Surgical Competency Progression (SCP) training scheme. Together, these components enable effective modeling of variable-length surgical videos while preserving procedure-relevant cues and temporal coherence, and better support diverse downstream assessment tasks. To support model development, we introduce CholeVidQA-32K, a surgical video question answering dataset comprising 32K open-ended QA pairs and 3,855 video segments (approximately 128 h total) from laparoscopic cholecystectomy. The dataset is organized into a three-level hierarchy – Perception, Assessment, and Reasoning – spanning 11 tasks from instrument/action/anatomy perception to Critical View of Safety (CVS), intraoperative difficulty, skill proficiency, and adverse event assessment. In comprehensive evaluations against state-of-the-art open-source multimodal and video LLMs (fine-tuned and zero-shot), SurgTEMP achieves substantial performance improvements, advancing the state of video-based surgical VQA.
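The core of query-guided token selection is scoring visual tokens against the question embedding and keeping only the most relevant ones; a minimal sketch of that idea (SurgTEMP's actual module and memory banks are more elaborate, and `select_tokens` is an illustrative name):

```python
import numpy as np

def select_tokens(query, tokens, k):
    """Keep the k visual tokens most similar to a query embedding.

    Scores each token by cosine similarity to the query and retains the
    top-k, returning their indices (best first) and the kept tokens.
    """
    q = query / np.linalg.norm(query)
    t = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    scores = t @ q
    keep = np.argsort(scores)[-k:][::-1]
    return keep, tokens[keep]
```

Selecting before the LLM sees the tokens is what keeps variable-length videos tractable: the memory grows with the number of retained tokens, not with raw video length.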
[210] Scaling Video Pretraining for Surgical Foundation Models
Sicheng Lu, Zikai Xiao, Jianhui Wei, Danyu Sun, Qi Lu, Keli Hu, Yang Feng, Jian Wu, Zongxin Yang, Zuozhu Liu
Main category: cs.CV
TL;DR: SurgRec is a scalable and reproducible pretraining recipe for surgical video understanding with two variants (MAE and JEPA), trained on a large multi-source corpus of 10,535 surgical videos across various modalities, achieving superior performance on 16 downstream datasets compared to SSL baselines and vision-language models.
Details
Motivation: Existing surgical foundation models are limited by small data scale, lack of procedural diversity, inconsistent evaluation, and absence of reproducible training pipelines. There's a need for scalable, reproducible approaches to surgical video understanding that can handle diverse surgical modalities and provide standardized benchmarks.
Method: Proposes SurgRec with two variants: SurgRec-MAE (masked autoencoder) and SurgRec-JEPA (joint embedding predictive architecture). Curates a large multi-source corpus of 10,535 videos (214.5M frames) spanning endoscopy, laparoscopy, cataract, and robotic surgery. Develops a unified pretraining pipeline with balanced sampling and standardizes a reproducible benchmark across 16 downstream datasets with consistent data splits.
Result: SurgRec consistently achieves superior performance across downstream datasets compared to self-supervised learning baselines and vision-language models. VLMs prove unreliable for fine-grained temporal recognition, showing performance gaps and sensitivity to prompt phrasing. The approach provides a reproducible foundation for the community.
Conclusion: SurgRec offers a scalable, reproducible pretraining recipe for surgical video understanding that outperforms existing approaches. The work provides standardized benchmarks and releases all code, models, and data to enable community advancement in surgical video modeling.
Abstract: Surgical video understanding is essential for computer-assisted interventions, yet existing surgical foundation models remain constrained by limited data scale, procedural diversity, and inconsistent evaluation, often lacking a reproducible training pipeline. We propose SurgRec, a scalable and reproducible pretraining recipe for surgical video understanding, instantiated with two variants: SurgRec-MAE and SurgRec-JEPA. We curate a large multi-source corpus of 10,535 videos and 214.5M frames spanning endoscopy, laparoscopy, cataract, and robotic surgery. Building on this corpus, we develop a unified pretraining pipeline with balanced sampling and standardize a reproducible benchmark across 16 downstream datasets and four clinical domains with consistent data splits. Across extensive comparisons against SSL baselines and vision-language models, SurgRec consistently achieves superior performance across downstream datasets. In contrast, VLMs prove unreliable for fine-grained temporal recognition, exhibiting both performance gaps and sensitivity to prompt phrasing. Our work provides a reproducible, scalable foundation for the community to build more general surgical video models. All code, models, and data will be publicly released.
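The balanced sampling mentioned above counteracts corpus skew (one modality contributing far more videos than another); one simple way to realize it is to pick the source uniformly first, then a video within it. A sketch under that assumption (the paper's actual sampler is not specified here; names are illustrative):

```python
import random

def balanced_sample(videos_by_source, n_samples, seed=0):
    """Draw clips so each source modality is picked with equal probability,
    regardless of how many videos it contributes to the corpus."""
    rng = random.Random(seed)
    sources = sorted(videos_by_source)
    picks = []
    for _ in range(n_samples):
        src = rng.choice(sources)                     # uniform over modalities
        picks.append((src, rng.choice(videos_by_source[src])))
    return picks
```

Without this, a batch sampled uniformly over videos would be dominated by whichever modality (e.g. endoscopy) has the most footage.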
[211] Learning Structural-Functional Brain Representations through Multi-Scale Adaptive Graph Attention for Cognitive Insight
Badhan Mazumder, Sir-Lord Wiafe, Aline Kotoski, Vince D. Calhoun, Dong Hye Ye
Main category: cs.CV
TL;DR: MAGNet is a Transformer-style graph neural network that jointly models brain structure and function by integrating structural MRI features with functional connectivity, using adaptive attention mechanisms for improved cognitive function prediction.
Details
Motivation: The paper aims to address the challenge of jointly modeling brain structure and function, which capture complementary aspects of brain organization but are difficult to integrate effectively for understanding cognitive function.
Method: Proposes Multi-scale Adaptive Graph Network (MAGNet), a Transformer-style GNN framework that: 1) extracts inter-regional morphological features from structural MRI using source-based morphometry, 2) fuses them with functional network connectivity from resting-state fMRI, 3) uses a hybrid graph integrating direct and indirect pathways, 4) employs local-global attention to refine connectivity importance, and 5) uses a joint loss enforcing cross-modal coherence while optimizing prediction objectives end-to-end.
Result: On the ABCD dataset, MAGNet outperformed relevant baselines, demonstrating effective multimodal integration for advancing understanding of cognitive function.
Conclusion: MAGNet provides an effective framework for jointly modeling brain structure and function, enabling better understanding of cognitive function through adaptive learning of structure-function interactions.
Abstract: Understanding how brain structure and function interact is key to explaining intelligence, yet modeling them jointly is challenging as the structural and functional connectome capture complementary aspects of organization. We introduce Multi-scale Adaptive Graph Network (MAGNet), a Transformer-style graph neural network framework that adaptively learns structure-function interactions. MAGNet leverages source-based morphometry from structural MRI to extract inter-regional morphological features and fuses them with functional network connectivity from resting-state fMRI. A hybrid graph integrates direct and indirect pathways, while local-global attention refines connectivity importance and a joint loss simultaneously enforces cross-modal coherence and optimizes the prediction objective end-to-end. On the ABCD dataset, MAGNet outperformed relevant baselines, demonstrating effective multimodal integration for advancing our understanding of cognitive function.
[212] Trimodal Deep Learning for Glioma Survival Prediction: A Feasibility Study Integrating Histopathology, Gene Expression, and MRI
Iain Swift, JingHua Ye
Main category: cs.CV
TL;DR: This pilot study extends brain tumor survival prediction from bimodal (histopathology + genomics) to trimodal by adding FLAIR MRI, finding modest improvements but limited statistical significance due to small sample size.
Details
Motivation: While multimodal deep learning has improved brain tumor prognosis by integrating histopathology and genomic data, the contribution of volumetric MRI within unified survival prediction frameworks remains unexplored. The study aims to investigate whether adding MRI as a third modality can enhance prognostic accuracy.
Method: Extends a bimodal framework by incorporating FLAIR MRI from BraTS2021 as a third modality. Uses TCGA-GBMLGG cohort (664 patients) to evaluate three unimodal models, nine bimodal configurations, and three trimodal configurations across early, late, and joint fusion strategies. All MRI-containing experiments are constrained to 19 test patients due to data availability.
Result: Trimodal early fusion achieves an exploratory Composite Score (CS = 0.854), with a controlled ΔCS of +0.011 over the bimodal baseline, though not statistically significant (p = 0.250). MRI achieves reasonable unimodal discrimination (CS = 0.755) but doesn’t substantially improve bimodal pairs, while providing measurable uplift in three-way combinations. Wide bootstrap confidence intervals (e.g., [0.400,1.000]) preclude definitive conclusions due to small sample size.
Conclusion: The study provides preliminary evidence that a third imaging modality may add prognostic value even with limited sample sizes, and that additional modalities require sufficient multimodal context to contribute effectively. However, larger datasets are needed for conclusive findings.
Abstract: Multimodal deep learning has improved prognostic accuracy for brain tumours by integrating histopathology and genomic data, yet the contribution of volumetric MRI within unified survival frameworks remains unexplored. This pilot study extends a bimodal framework by incorporating Fluid Attenuated Inversion Recovery (FLAIR) MRI from BraTS2021 as a third modality. Using the TCGA-GBMLGG cohort (664 patients), we evaluate three unimodal models, nine bimodal configurations, and three trimodal configurations across early, late, and joint fusion strategies. In this small cohort setting, trimodal early fusion achieves an exploratory Composite Score (CS = 0.854), with a controlled ΔCS of +0.011 over the bimodal baseline on identical patients, though this difference is not statistically significant (p = 0.250, permutation test). MRI achieves reasonable unimodal discrimination (CS = 0.755) but does not substantially improve bimodal pairs, while providing measurable uplift in the three-way combination. All MRI-containing experiments are constrained to 19 test patients, yielding wide bootstrap confidence intervals (e.g. [0.400,1.000]) that preclude definitive conclusions. These findings provide preliminary evidence that a third imaging modality may add prognostic value even with limited sample sizes, and that additional modalities require sufficient multimodal context to contribute effectively.
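The p = 0.250 quoted above comes from a permutation test on the same 19 patients. A standard paired sign-flipping version looks like this (the study's exact test statistic may differ; this is a generic sketch):

```python
import numpy as np

def paired_permutation_test(scores_a, scores_b, n_perm=10000, seed=0):
    """One-sided paired permutation test: is mean(a - b) larger than chance?

    Randomly flips the sign of each patient's paired difference and counts
    how often the permuted mean difference reaches the observed one.
    """
    rng = np.random.default_rng(seed)
    d = np.asarray(scores_a, float) - np.asarray(scores_b, float)
    observed = d.mean()
    flips = rng.choice([-1.0, 1.0], size=(n_perm, d.size))
    perm = (flips * d).mean(axis=1)
    # +1 keeps the p-value away from zero (counts the identity permutation).
    return (np.sum(perm >= observed) + 1) / (n_perm + 1)
```

With only 19 paired samples the null distribution has just 2^19 distinct sign patterns, which is why small observed gains like +0.011 are hard to distinguish from noise.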
[213] SurgNavAR: An Augmented Reality Surgical Navigation Framework for Optical See-Through Head Mounted Displays
Abdullah Thabit, Mohamed Benmahdjoub, Rafiuddin Jinabade, Hizirwan S. Salim, Marie-Lise C. van Veelen, Mark G. van Vledder, Eppo B. Wolvius, Theo van Walsum
Main category: cs.CV
TL;DR: An integrated HMD-based AR surgical navigation framework for real-time visualization of preoperative imaging data during surgery, achieving sub-5mm targeting accuracy.
Details
Motivation: Current HMD-AR devices lack integrated surgical navigation capabilities, requiring complex integration of tracking, registration, and visualization technologies that hampers scientific progress in AR surgical applications.
Method: Framework tracks 2D patterns as reference markers on patients and instruments, uses pivot/reference-based tool calibration, point-based matching for image-to-patient registration, and manual positioning. Evaluated on HoloLens 2 and Magic Leap 2 with phantom setups for needle insertion and rib fracture localization.
Result: Achieved mean tooltip calibration accuracy of 1 mm, registration accuracy of 3 mm, and targeting accuracy below 5 mm for both surgical use cases on both HMD devices.
Conclusion: The framework provides an easy-to-use, configurable tool for HMD-based AR surgical navigation that can be extended to many surgical applications, with code publicly available.
Abstract: Augmented reality (AR) devices with head mounted displays (HMDs) facilitate the direct superimposition of 3D preoperative imaging data onto the patient during surgery. To use an HMD-AR device as a stand-alone surgical navigation system, the device should be able to locate the patient and surgical instruments, align preoperative imaging data with the patient, and visualize navigation data in real time during surgery. Whereas some of the technologies required for this are known, integration in such devices is cumbersome and requires specific knowledge and expertise, hampering scientific progress in this field. This work therefore aims to present and evaluate an integrated HMD-based AR surgical navigation framework that is adaptable to diverse surgical applications. The framework tracks 2D patterns as reference markers attached to the patient and surgical instruments. It allows for the calibration of surgical tools using pivot and reference-based calibration techniques. It enables image-to-patient registration using point-based matching and manual positioning. The integrated functionalities of the framework are evaluated on two HMD devices, the HoloLens 2 and Magic Leap 2, with two surgical use cases being evaluated in a phantom setup: AR-guided needle insertion and rib fracture localization. The framework was able to achieve a mean tooltip calibration accuracy of 1 mm, a registration accuracy of 3 mm, and a targeting accuracy below 5 mm on the two surgical use cases. The framework presents an easy-to-use configurable tool for HMD-based AR surgical navigation, which can be extended and adapted to many surgical applications. The framework is publicly available at https://github.com/abdullahthabit/SurgNavAR.
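The pivot calibration used for the 1 mm tooltip accuracy is a classic least-squares procedure: while the tip rests on a fixed point, the tracked marker is rotated, and the tip offset falls out of a linear system. A generic NumPy sketch (the framework's implementation details are not reproduced here):

```python
import numpy as np

def pivot_calibration(rotations, translations):
    """Least-squares pivot calibration from tracked tool poses.

    While the tooltip rests on a fixed pivot, each tracked pose satisfies
    R_i @ p_tip + t_i = p_pivot. Stacking poses gives a linear system in
    the tip offset (tool frame) and the pivot point (tracker frame).
    """
    n = len(rotations)
    A = np.zeros((3 * n, 6))
    b = np.zeros(3 * n)
    for i, (R, t) in enumerate(zip(rotations, translations)):
        A[3 * i:3 * i + 3, :3] = R
        A[3 * i:3 * i + 3, 3:] = -np.eye(3)
        b[3 * i:3 * i + 3] = -t
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x[:3], x[3:]   # p_tip (tool frame), p_pivot (tracker frame)
```

At least two well-separated rotations are needed for the system to be determined; in practice many poses are stacked so tracking noise averages out.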
[214] Conditional Polarization Guidance for Camouflaged Object Detection
Qifan Zhang, Hao Wang, Xiangrong Qin, Ruijie Li
Main category: cs.CV
TL;DR: CPGNet is an asymmetric RGB-polarization framework for camouflaged object detection that uses conditional polarization guidance to explicitly regulate RGB feature learning, with a lightweight polarization interaction module and polarization edge-guided frequency refinement.
Details
Motivation: Existing polarization-based COD methods rely on complex visual encoders and fusion mechanisms, increasing model complexity while failing to fully explore how polarization can explicitly guide hierarchical RGB representation learning. There's a need for more efficient and effective polarization guidance mechanisms.
Method: Proposes CPGNet with: 1) Lightweight polarization interaction module for joint modeling of complementary cues, 2) Conditional polarization guidance mechanism that dynamically modulates RGB features using polarization priors, 3) Polarization edge-guided frequency refinement to enhance high-frequency components, 4) Iterative feedback decoder for coarse-to-fine feature calibration.
Result: Extensive experiments on polarization datasets across multiple tasks, along with evaluations on non-polarization datasets, demonstrate that CPGNet consistently outperforms state-of-the-art methods.
Conclusion: CPGNet effectively addresses limitations of existing polarization-based COD methods by introducing explicit polarization guidance mechanisms that improve detection performance while maintaining efficiency.
Abstract: Camouflaged object detection (COD) aims to identify targets that are highly blended with their backgrounds. Recent works have shown that the optical characteristics of polarization cues play a significant role in improving camouflaged object detection. However, most existing polarization-based approaches depend on complex visual encoders and fusion mechanisms, leading to increased model complexity and computational overhead, while failing to fully explore how polarization can explicitly guide hierarchical RGB representation learning. To address these limitations, we propose CPGNet, an asymmetric RGB-polarization framework that introduces a conditional polarization guidance mechanism to explicitly regulate RGB feature learning for camouflaged object detection. Specifically, we design a lightweight polarization interaction module that jointly models these complementary cues and generates reliable polarization guidance in a unified manner. Unlike conventional feature fusion strategies, the proposed conditional guidance mechanism dynamically modulates RGB features using polarization priors, enabling the network to focus on subtle discrepancies between camouflaged objects and their backgrounds. Furthermore, we introduce a polarization edge-guided frequency refinement strategy that enhances high-frequency components under polarization constraints, effectively breaking camouflage patterns. Finally, we develop an iterative feedback decoder to perform coarse-to-fine feature calibration and progressively refine camouflage prediction. Extensive experiments on polarization datasets across multiple tasks, along with evaluations on non-polarization datasets, demonstrate that CPGNet consistently outperforms state-of-the-art methods.
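Frequency refinement of the kind described above generally means masking the 2D spectrum of a feature map and amplifying its high-frequency band, where edge cues live. A generic sketch of that operation (not CPGNet's actual module; `radius` and `gain` are illustrative parameters):

```python
import numpy as np

def boost_high_freq(feat, radius=8, gain=1.5):
    """Amplify high-frequency content of a 2D map with an FFT-domain mask.

    Frequencies farther than `radius` bins from the spectrum centre are
    scaled by `gain`; everything else (including the DC term) is untouched.
    """
    F = np.fft.fftshift(np.fft.fft2(feat))
    h, w = feat.shape
    yy, xx = np.mgrid[:h, :w]
    dist = np.hypot(yy - h / 2, xx - w / 2)
    mask = np.where(dist > radius, gain, 1.0)
    return np.real(np.fft.ifft2(np.fft.ifftshift(F * mask)))
```

Boosting high frequencies sharpens subtle boundaries between a camouflaged object and its background, which is exactly the cue polarization edges are meant to guide.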
[215] Benchmarking PhD-Level Coding in 3D Geometric Computer Vision
Wenyi Li, Renkai Luo, Yue Yu, Huan-ang Gao, Mingju Gao, Li Yuan, Chaoyou Fu, Hao Zhao
Main category: cs.CV
TL;DR: GeoCodeBench is a PhD-level benchmark for evaluating AI coding capabilities in 3D geometric vision, revealing that current models struggle: the best achieves only a 36.6% pass rate on complex 3D vision tasks.
Details
Motivation: Current AI coding models struggle with complex 3D geometric vision tasks, limiting their utility for research workflows. The community needs a rigorous benchmark to measure progress toward reliable 3D scientific coding.
Method: Created GeoCodeBench with fill-in-the-function tasks curated from recent 3D vision papers. Used tool-assisted extraction from official repositories followed by human screening. Generated diverse edge-case unit tests for automatic scoring. Evaluated 8 representative models across two capability levels: General 3D (transformations, mechanics/optics) and Research (novel algorithms, geometric logic).
Result: Best model (GPT-5) achieved only 36.6% pass rate. Research tasks are significantly harder than general 3D tasks. Context ablation shows full-paper inputs underperform Method-section-only inputs, indicating challenges in long-context scientific comprehension.
Conclusion: Large gap exists between current AI coding capabilities and dependable 3D geometric vision coding. GeoCodeBench provides a rigorous testbed for advancing from generic coding to trustworthy 3D scientific coding.
Abstract: AI-assisted coding has rapidly reshaped software practice and research workflows, yet today’s models still struggle to produce correct code for complex 3D geometric vision. If models could reliably write such code, the research of our community would change substantially. To measure progress toward that goal, we introduce GeoCodeBench, a PhD-level benchmark that evaluates coding for 3D vision. Each problem is a fill-in-the-function implementation task curated from representative papers at recent venues: we first let a tool propose candidate functions from official repositories, then perform careful human screening to select core 3D geometric components. For every target, we generate diverse, edge-case unit tests, enabling fully automatic, reproducible scoring. We evaluate eight representative open- and closed-source models to reflect the current ecosystem. The best model, GPT-5, attains only 36.6% pass rate, revealing a large gap between current capabilities and dependable 3D scientific coding. GeoCodeBench organizes tasks into a two-level hierarchy: General 3D capability (geometric transformations and mechanics/optics formulation) and Research capability (novel algorithm implementation and geometric logic routing). Scores are positively correlated across these axes, but research-oriented tasks are markedly harder. Context ablations further show that “more paper text” is not always better: cutting off at the Method section statistically outperforms full-paper inputs, highlighting unresolved challenges in long-context scientific comprehension. Together, these findings position GeoCodeBench as a rigorous testbed for advancing from generic coding to trustworthy 3D geometric vision coding.
[216] Video Models Reason Early: Exploiting Plan Commitment for Maze Solving
Kaleb Newman, Tyler Zhu, Olga Russakovsky
Main category: cs.CV
TL;DR: Video diffusion models exhibit emergent reasoning for maze solving, committing to motion plans early in denoising and struggling with long paths requiring chained generations.
Details
Motivation: To understand how video diffusion models reason during generation, particularly their internal planning dynamics, using 2D maze solving as a controlled testbed to uncover fundamental principles of their emergent reasoning capabilities.
Method: Study video diffusion models' internal planning using 2D maze solving; analyze denoising steps to identify early plan commitment; measure difficulty factors such as path length vs. obstacle density; develop ChEaP (Chaining with Early Planning), which identifies promising seeds early and chains generations for complex mazes.
Result: Two key findings: 1) early plan commitment: models commit to motion plans within the first few denoising steps; 2) path length (not obstacle density) is the dominant difficulty factor, with a failure threshold at 12 steps. ChEaP improves accuracy from 7% to 67% on long-horizon mazes and by 2.5x overall on hard tasks across models.
Conclusion: Video models possess deeper reasoning capabilities than previously recognized, which can be elicited more reliably with better inference-time scaling; early planning detection enables efficient chaining for complex reasoning tasks.
Abstract: Video diffusion models exhibit emergent reasoning capabilities like solving mazes and puzzles, yet little is understood about how they reason during generation. We take a first step towards understanding this and study the internal planning dynamics of video models using 2D maze solving as a controlled testbed. Our investigations reveal two findings. Our first finding is early plan commitment: video diffusion models commit to a high-level motion plan within the first few denoising steps, after which further denoising alters visual details but not the underlying trajectory. Our second finding is that path length, not obstacle density, is the dominant predictor of maze difficulty, with a sharp failure threshold at 12 steps. This means video models can only reason over long mazes by chaining together multiple sequential generations. To demonstrate the practical benefits of our findings, we introduce Chaining with Early Planning, or ChEaP, which only spends compute on seeds with promising early plans and chains them together to tackle complex mazes. This improves accuracy from 7% to 67% on long-horizon mazes and by 2.5x overall on hard tasks in Frozen Lake and VR-Bench across Wan2.2-14B and HunyuanVideo-1.5. Our analysis reveals that current video models possess deeper reasoning capabilities than previously recognized, which can be elicited more reliably with better inference-time scaling.
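The ChEaP procedure described above (score each seed from its early denoising steps, spend full compute only on promising seeds, chain short generations for long horizons) can be sketched with toy stubs. Everything below is an illustrative assumption, not the authors' implementation; the scoring and generation functions stand in for a real video diffusion model.

```python
import random

# Hedged sketch of the ChEaP idea: since the motion plan is committed within
# the first few denoising steps, rank candidate seeds by a cheap early-plan
# score, fully generate only the best seed, and chain segments for long mazes.

def early_plan_score(seed, n_early_steps=3):
    """Stub: pretend to run a few denoising steps and rate the emerging plan."""
    rng = random.Random(seed)          # deterministic stand-in for a model
    return rng.random()                # placeholder for a learned plan score

def full_generation(seed):
    """Stub: the expensive full video generation from a chosen seed."""
    return f"segment(seed={seed})"

def cheap_generate(candidate_seeds, segments=3):
    video = []
    for _ in range(segments):          # chain short generations end-to-end
        best = max(candidate_seeds, key=early_plan_score)
        video.append(full_generation(best))   # full compute on one seed only
        candidate_seeds = [s + 1000 for s in candidate_seeds]  # fresh seeds
    return video

print(cheap_generate([1, 2, 3, 4]))
```

The compute saving comes from the asymmetry: `early_plan_score` costs a few denoising steps per seed, while `full_generation` runs the whole trajectory, so filtering seeds early is much cheaper than generating every candidate to completion.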
[217] OmniRoam: World Wandering via Long-Horizon Panoramic Video Generation
Yuheng Liu, Xin Lin, Xinke Li, Baihan Yang, Chen Wang, Kalyan Sunkavalli, Yannick Hold-Geoffroy, Hao Tan, Kai Zhang, Xiaohui Xie, Zifan Shi, Yiwei Hu
Main category: cs.CV
TL;DR: OmniRoam is a controllable panoramic video generation framework that creates long-horizon scene wandering videos with global consistency, addressing limitations of perspective video models.
Details
Motivation: Existing video generation models rely on perspective views that synthesize only limited observations of a scene, leading to issues of completeness and global consistency. The authors aim to leverage the panoramic representation's rich per-frame scene coverage and inherent spatial-temporal consistency for better scene modeling.
Method: Two-stage framework: 1) a preview stage uses trajectory-controlled video generation to create a scene overview from an input image or video; 2) a refine stage temporally extends and spatially upsamples the preview to produce long-range, high-resolution videos. Trained on panoramic video datasets comprising synthetic and real-world videos.
Result: Outperforms state-of-the-art methods in visual quality, controllability, and long-term scene consistency both qualitatively and quantitatively. Extensions demonstrated include real-time video generation and 3D reconstruction.
Conclusion: OmniRoam enables high-fidelity world wandering through panoramic video generation with superior scene consistency and controllability compared to perspective-based approaches.
Abstract: Modeling scenes using video generation models has garnered growing research interest in recent years. However, most existing approaches rely on perspective video models that synthesize only limited observations of a scene, leading to issues of completeness and global consistency. We propose OmniRoam, a controllable panoramic video generation framework that exploits the rich per-frame scene coverage and inherent long-term spatial and temporal consistency of panoramic representation, enabling long-horizon scene wandering. Our framework begins with a preview stage, where a trajectory-controlled video generation model creates a quick overview of the scene from a given input image or video. Then, in the refine stage, this video is temporally extended and spatially upsampled to produce long-range, high-resolution videos, thus enabling high-fidelity world wandering. To train our model, we introduce two panoramic video datasets that incorporate both synthetic and real-world captured videos. Experiments show that our framework consistently outperforms state-of-the-art methods in terms of visual quality, controllability, and long-term scene consistency, both qualitatively and quantitatively. We further showcase several extensions of this framework, including real-time video generation and 3D reconstruction. Code is available at https://github.com/yuhengliu02/OmniRoam.
[218] Schrödinger’s Seed: Purr-fect Initialization for an Impurr-fect Universe
Mi Chen, Renhao Ye
Main category: cs.CV
Summary unavailable: the arXiv API request for 2603.29115 returned HTTP 429 (rate limited).
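This entry and many below failed with HTTP 429, i.e. arXiv's export API rate limiting the digest's fetcher. A minimal retry-with-backoff sketch (the export endpoint is real; the retry policy, delays, and function names are illustrative assumptions, not arXiv's documented requirements):

```python
import time
import urllib.error
import urllib.request

def query_url(arxiv_id):
    """Build an arXiv export API query for a single paper ID."""
    return f"https://export.arxiv.org/api/query?id_list={arxiv_id}"

def fetch_with_backoff(arxiv_id, max_tries=4, base_delay=3.0):
    """Fetch the Atom feed for one paper, backing off on HTTP 429."""
    for attempt in range(max_tries):
        try:
            with urllib.request.urlopen(query_url(arxiv_id)) as resp:
                return resp.read().decode("utf-8")
        except urllib.error.HTTPError as e:
            if e.code != 429:
                raise                      # only retry on rate limiting
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
    raise RuntimeError(f"still rate-limited after {max_tries} tries")
```

Spacing requests out (arXiv asks clients to throttle themselves) would likely have avoided most of the missing summaries in this issue.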
[219] STRADAViT: Towards a Foundational Model for Radio Astronomy through Self-Supervised Transfer
Andrea DeMarco, Ian Fenech Conti, Hayley Camilleri, Ardiana Bushi, Simone Riggi
Main category: cs.CV
Summary unavailable: the arXiv API request for 2603.29660 returned HTTP 429 (rate limited).
[220] Emotional Speech-driven 3D Body Animation via Disentangled Latent Diffusion
Kiran Chhatre, Radek Daněček, Nikos Athanasiou, Giorgio Becherini, Christopher Peters, Michael J. Black, Timo Bolkart
Main category: cs.CV
Summary unavailable: the arXiv API request for 2312.04466 returned HTTP 429 (rate limited).
[221] Image-Specific Adaptation of Transformer Encoders for Compute-Efficient Segmentation
Manyi Yao, Abhishek Aich, Yumin Suh, Amit Roy-Chowdhury, Christian Shelton, Manmohan Chandraker
Main category: cs.CV
Summary unavailable: the arXiv API request for 2404.15244 returned HTTP 429 (rate limited).
[222] Image Segmentation via Divisive Normalization: dealing with environmental diversity
Pablo Hernández-Cámara, Jorge Vila-Tomás, Paula Dauden-Oliver, Nuria Alabau-Bosque, Valero Laparra, Jesús Malo
Main category: cs.CV
Summary unavailable: the arXiv API request for 2407.17829 returned HTTP 429 (rate limited).
[223] GERD: Geometric event response data generation
Jens Egholm Pedersen, Dimitris Korakovounis, Jörg Conradt
Main category: cs.CV
Summary unavailable: the arXiv API request for 2412.03259 returned HTTP 429 (rate limited).
[224] BST: Badminton Stroke-type Transformer for Skeleton-based Action Recognition in Racket Sports
Jing-Yuan Chang
Main category: cs.CV
Summary unavailable: the arXiv API request for 2502.21085 returned HTTP 429 (rate limited).
[225] Med-CMR: A Fine-Grained Benchmark Integrating Visual Evidence and Clinical Logic for Medical Complex Multimodal Reasoning
Haozhen Gong, Xiaozhong Ji, Yuansen Liu, Wenbin Wu, Xiaoxiao Yan, Jingjing Liu, Kai Wu, Jiazhen Pan, Bailiang Jian, Jiangning Zhang, Xiaobin Hu, Hongwei Bran Li
Main category: cs.CV
Summary unavailable: the arXiv API request for 2512.00818 returned HTTP 429 (rate limited).
[226] We’ll Fix it in Post: Improving Text-to-Video Generation with Neuro-Symbolic Feedback
Minkyu Choi, S P Sharan, Harsh Goel, Sahil Shah, Sandeep Chinchali
Main category: cs.CV
Summary unavailable: the arXiv API request for 2504.17180 returned HTTP 429 (rate limited).
[227] ProFashion: Prototype-guided Fashion Video Generation with Multiple Reference Images
Xianghao Kong, Qiaosong Qi, Yuanbin Wang, Biaolong Chen, Aixi Zhang, Anyi Rao
Main category: cs.CV
Summary unavailable: the arXiv API request for 2505.06537 returned HTTP 429 (rate limited).
[228] TUGS: Physics-based Compact Representation of Underwater Scenes by Tensorized Gaussian
Shijie Lian, Ziyi Zhang, Hua Li, Laurence Tianruo Yang, Mengyu Ren, Debin Liu, Wenhui Wu
Main category: cs.CV
Summary unavailable: the arXiv API request for 2505.08811 returned HTTP 429 (rate limited).
[229] PRS-Med: Position Reasoning Segmentation in Medical Imaging
Quoc-Huy Trinh, Minh-Van Nguyen, Jun Zeng, Debesh Jha, Ulas Bagci
Main category: cs.CV
Summary unavailable: the arXiv API request for 2505.11872 returned HTTP 429 (rate limited).
[230] MultiGen: Level-Design for Editable Multiplayer Worlds in Diffusion Game Engines
Ryan Po, David Junhao Zhang, Amir Hertz, Gordon Wetzstein, Neal Wadhwa, Nataniel Ruiz
Main category: cs.CV
Summary unavailable: the arXiv API request for 2603.06679 returned HTTP 429 (rate limited).
[231] CreatiDesign: A Unified Multi-Conditional Diffusion Transformer for Creative Graphic Design
Hui Zhang, Dexiang Hong, Maoke Yang, Yutao Cheng, Zhao Zhang, Weidong Chen, Jie Shao, Xinglong Wu, Zuxuan Wu, Yu-Gang Jiang
Main category: cs.CV
Summary unavailable: the arXiv API request for 2505.19114 returned HTTP 429 (rate limited).
[232] Seeing Isn’t Orienting: A Cognitively Grounded Benchmark Reveals Systematic Orientation Failures in MLLMs
Nazia Tasnim, Keanu Nichols, Yuting Yan, Nicholas Ikechukwu, Elva Zou, Deepti Ghadiyaram, Bryan A. Plummer
Main category: cs.CV
Summary unavailable: the arXiv API request for 2505.21649 returned HTTP 429 (rate limited).
[233] AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models
Zheda Mai, Arpita Chowdhury, Zihe Wang, Sooyoung Jeon, Lemeng Wang, Jiacheng Hou, Wei-Lun Chao
Main category: cs.CV
Summary unavailable: the arXiv API request for 2506.09082 returned HTTP 429 (rate limited).
[234] MetricHMSR: Metric Human Mesh and Scene Recovery from Monocular Images
Chentao Song, He Zhang, Haolei Yuan, Haozhe Lin, Jianhua Tao, Hongwen Zhang, Tao Yu
Main category: cs.CV
Summary unavailable: the arXiv API request for 2506.09919 returned HTTP 429 (rate limited).
[235] Streaming 4D Visual Geometry Transformer
Dong Zhuo, Wenzhao Zheng, Jiahe Guo, Yuqi Wu, Jie Zhou, Jiwen Lu
Main category: cs.CV
Summary unavailable: the arXiv API request for 2507.11539 returned HTTP 429 (rate limited).
[236] Noise-adapted Neural Operator for Robust Non-Line-of-Sight Imaging
Lianfang Wang, Kuilin Qin, Xueying Liu, Huibin Chang, Yong Wang, Yuping Duan
Main category: cs.CV
Summary unavailable: the arXiv API request for 2508.09655 returned HTTP 429 (rate limited).
[237] GUIDE: Resolving Domain Bias in GUI Agents through Real-Time Web Video Retrieval and Plug-and-Play Annotation
Rui Xie, Zhi Gao, Chenrui Shi, Zirui Shang, Lu Chen, Qing Li
Main category: cs.CV
Summary unavailable: the arXiv API request for 2603.26266 returned HTTP 429 (rate limited).
[238] TTA-DAME: Test-Time Adaptation with Domain Augmentation and Model Ensemble for Dynamic Driving Conditions
Dongjae Jeon, Taeheon Kim, Seongwon Cho, Minhyuk Seo, Jonghyun Choi
Main category: cs.CV
Summary unavailable: the arXiv API request for 2508.12690 returned HTTP 429 (rate limited).
[239] Multi-Level Knowledge Distillation and Dynamic Self-Supervised Learning for Continual Learning
Taeheon Kim, San Kim, Minhyuk Seo, Dongjae Jeon, Wonje Jeung, Jonghyun Choi
Main category: cs.CV
Summary unavailable: the arXiv API request for 2508.12692 returned HTTP 429 (rate limited).
[240] Unified Multimodal Models as Auto-Encoders
Zhiyuan Yan, Kaiqing Lin, Zongjian Li, Junyan Ye, Hui Han, Haochen Wang, Zhendong Wang, Bin Lin, Hao Li, Xinyan Xiao, Jingdong Wang, Haifeng Wang, Li Yuan
Main category: cs.CV
Summary unavailable: the arXiv API request for 2509.09666 returned HTTP 429 (rate limited).
[241] Interpretable and Steerable Concept Bottleneck Sparse Autoencoders
Akshay Kulkarni, Tsui-Wei Weng, Vivek Narayanaswamy, Shusen Liu, Wesam A. Sakla, Kowshik Thopalli
Main category: cs.CV
Summary unavailable: the arXiv API request for 2512.10805 returned HTTP 429 (rate limited).
[242] VT-Intrinsic: Physics-Based Decomposition of Reflectance and Shading using a Single Visible-Thermal Image Pair
Zeqing Yuan, Mani Ramanagopal, Aswin C. Sankaranarayanan, Srinivasa G. Narasimhan
Main category: cs.CV
Summary unavailable: the arXiv API request for 2509.10388 returned HTTP 429 (rate limited).
[243] Gaze Authentication: Factors Influencing Authentication Performance
Dillon Lohr, Michael J Proulx, Mehedi Hasan Raju, Oleg V Komogortsev
Main category: cs.CV
Summary unavailable: the arXiv API request for 2509.10969 returned HTTP 429 (rate limited).
[244] ORIC: Benchmarking Object Recognition under Contextual Incongruity in Large Vision-Language Models
Zhaoyang Li, Zhan Ling, Yuchen Zhou, Litian Gong, Erdem Bıyık, Hao Su
Main category: cs.CV
Summary unavailable: the arXiv API request for 2509.15695 returned HTTP 429 (rate limited).
[245] Align Your Query: Representation Alignment for Multimodality Medical Object Detection
Ara Seo, Bryan Sangwoo Kim, Hyungjin Chung, Jong Chul Ye
Main category: cs.CV
Summary unavailable: the arXiv API request for 2510.02789 returned HTTP 429 (rate limited).
[246] REN: Anatomically-Informed Mixture-of-Experts for Interstitial Lung Disease Diagnosis
Alec K. Peltekian, Halil Ertugrul Aktas, Gorkem Durak, Kevin Grudzinski, Bradford C. Bemiss, Carrie Richardson, Jane E. Dematte, G. R. Scott Budinger, Anthony J. Esposito, Alexander Misharin, Alok Choudhary, Ankit Agrawal, Ulas Bagci
Main category: cs.CV
Summary unavailable: the arXiv API request for 2510.04923 returned HTTP 429 (rate limited).
[247] TransFIRA: Transfer Learning for Face Image Recognizability Assessment
Allen Tu, Kartik Narayan, Joshua Gleason, Jennifer Xu, Matthew Meyn, Tom Goldstein, Vishal M. Patel
Main category: cs.CV
Summary unavailable: the arXiv API request for 2510.06353 returned HTTP 429 (rate limited).
[248] Improving Liver Disease Diagnosis with SNNDeep: A Custom Spiking Neural Network Using Diverse Learning Algorithms
Zofia Rudnicka, Janusz Szczepanski, Agnieszka Pregowska
Main category: cs.CV
Summary unavailable: the arXiv API request for 2508.20125 returned HTTP 429 (rate limited).
[249] PartNeXt: A Next-Generation Dataset for Fine-Grained and Hierarchical 3D Part Understanding
Penghao Wang, Yiyang He, Xin Lv, Yukai Zhou, Lan Xu, Jingyi Yu, Jiayuan Gu
Main category: cs.CV
Summary unavailable: the arXiv API request for 2510.20155 returned HTTP 429 (rate limited).
[250] HieraMamba: Video Temporal Grounding via Hierarchical Anchor-Mamba Pooling
Joungbin An, Kristen Grauman
Main category: cs.CV
Summary unavailable: the arXiv API request for 2510.23043 returned HTTP 429 (rate limited).
[251] Text-guided Fine-Grained Video Anomaly Understanding
Jihao Gu, Kun Li, He Wang, Kaan Akşit
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract for 2511.00524 could not be fetched (HTTP 429, rate limited).
[252] Source-Only Cross-Weather LiDAR via Geometry-Aware Point Drop
YoungJae Cheong, Jhonghyun An
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract for 2511.01250 could not be fetched (HTTP 429, rate limited).
[253] Beyond Boundary Frames: Context-Centric Video Interpolation with Audio-Visual Semantics
Yuchen Deng, Xiuyang Wu, Hai-Tao Zheng, Jie Wang, Feidiao Yang, Yuxing Han
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract for 2512.03590 could not be fetched (HTTP 429, rate limited).
[254] LatentFM: A Latent Flow Matching Approach for Generative Medical Image Segmentation
Huynh Trinh Ngoc, Hoang Anh Nguyen Kim, Toan Nguyen Hai, Long Tran Quoc
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract for 2512.04821 could not be fetched (HTTP 429, rate limited).
[255] Know-Show: Benchmarking Video-Language Models on Spatio-Temporal Grounded Reasoning
Chinthani Sugandhika, Chen Li, Deepu Rajan, Basura Fernando
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract for 2512.05513 could not be fetched (HTTP 429, rate limited).
[256] Fast SceneScript: Fast and Accurate Language-Based 3D Scene Understanding via Multi-Token Prediction
Ruihong Yin, Xuepeng Shi, Oleksandr Bailo, Marco Manfredi, Theo Gevers
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract for 2512.05597 could not be fetched (HTTP 429, rate limited).
[257] InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models
Hongyuan Tao, Bencheng Liao, Shaoyu Chen, Haoran Yin, Qian Zhang, Wenyu Liu, Xinggang Wang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract for 2512.08829 could not be fetched (HTTP 429, rate limited).
[258] A Conditional Generative Framework for Synthetic Data Augmentation in Segmenting Thin and Elongated Structures in Biological Images
Yi Liu, Yichi Zhang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract for 2512.10334 could not be fetched (HTTP 429, rate limited).
[259] JoyStreamer-Flash: Real-time and Infinite Audio-Driven Avatar Generation with Autoregressive Diffusion
Chaochao Li, Ruikui Wang, Liangbo Zhou, Jinheng Feng, Huaishao Luo, Huan Zhang, Youzheng Wu, Xiaodong He
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract for 2512.11423 could not be fetched (HTTP 429, rate limited).
[260] Super-Resolved Canopy Height Mapping from Sentinel-2 Time Series Using LiDAR HD Reference Data across Metropolitan France
Ekaterina Kalinicheva, Florian Helen, Stéphane Mermoz, Florian Mouret, Milena Planells
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract for 2512.11524 could not be fetched (HTTP 429, rate limited).
[261] DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models
Lunbin Zeng, Jingfeng Yao, Bencheng Liao, Hongyuan Tao, Wenyu Liu, Xinggang Wang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract for 2512.15713 could not be fetched (HTTP 429, rate limited).
[262] When Test-Time Guidance Is Enough: Fast Image and Video Editing with Diffusion Guidance
Ahmed Ghorbel, Badr Moufad, Navid Bagheri Shouraki, Alain Oliviero Durmus, Thomas Hirtz, Eric Moulines, Jimmy Olsson, Yazid Janati
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract for 2602.14157 could not be fetched (HTTP 429, rate limited).
[263] SceneDiff: A Benchmark and Method for Multiview Object Change Detection
Yuqun Wu, Chih-hao Lin, Henry Che, Aditi Tiwari, Chuhang Zou, Shenlong Wang, Derek Hoiem
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract for 2512.16908 could not be fetched (HTTP 429, rate limited).
[264] Understanding vs. Generation: Navigating Optimization Dilemma in Multimodal Models
Sen Ye, Mengde Xu, Shuyang Gu, Di He, Liwei Wang, Han Hu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract for 2602.15772 could not be fetched (HTTP 429, rate limited).
[265] EMMA: Concept Erasure Benchmark with Comprehensive Semantic Metrics and Diverse Categories
Lu Wei, Yuta Nakashima, Noa Garcia
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract for 2512.17320 could not be fetched (HTTP 429, rate limited).
[266] Evidential Neural Radiance Fields
Ruxiao Duan, Alex Wong
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract for 2602.23574 could not be fetched (HTTP 429, rate limited).
[267] The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding
Weichen Fan, Haiwen Diao, Quan Wang, Dahua Lin, Ziwei Liu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract for 2512.19693 could not be fetched (HTTP 429, rate limited).
[268] SVBench: Evaluation of Video Generation Models on Social Reasoning
Wenshuo Peng, Gongxuan Wang, Tianmeng Yang, Chuanhao Li, Xiaojie Xu, Hui He, Kaipeng Zhang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract for 2512.21507 could not be fetched (HTTP 429, rate limited).
[269] When Rubrics Fail: Error Enumeration as Reward in Reference-Free RL Post-Training for Virtual Try-On
Wisdom Ikezogwo, Mehmet Saygin Seyfioglu, Ranjay Krishna, Karim Bouyarmane
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract for 2603.05659 could not be fetched (HTTP 429, rate limited).
[270] Interpretable Machine Learning-Derived Spectral Indices for Vegetation Monitoring
Ali Lotfi, Adam Carter, Thuan Ha, Mohammad Meysami, Kwabena Nketia, Steve Shirtliffe
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract for 2512.21948 could not be fetched (HTTP 429, rate limited).
[271] Guiding a Diffusion Transformer with the Internal Dynamics of Itself
Xingyu Zhou, Qifan Li, Xiaobin Hu, Hai Chen, Shuhang Gu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract for 2512.24176 could not be fetched (HTTP 429, rate limited).
[272] MovieRecapsQA: A Multimodal Open-Ended Video Question-Answering Benchmark
Shaden Shaar, Bradon Thymes, Sirawut Chaixanien, Claire Cardie, Bharath Hariharan
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract for 2601.02536 could not be fetched (HTTP 429, rate limited).
[273] Points-to-3D: Structure-Aware 3D Generation with Point Cloud Priors
Jiatong Xia, Zicheng Duan, Anton van den Hengel, Lingqiao Liu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract for 2603.18782 could not be fetched (HTTP 429, rate limited).
[274] Unsupervised Modular Adaptive Region Growing and RegionMix Classification for Wind Turbine Segmentation
Raül Pérez-Gonzalo, Riccardo Magro, Andreas Espersen, Antonio Agudo
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract for 2601.04065 could not be fetched (HTTP 429, rate limited).
[275] X-World: Controllable Ego-Centric Multi-Camera World Models for Scalable End-to-End Driving
Chaoda Zheng, Sean Li, Jinhao Deng, Zhennan Wang, Shijia Chen, Liqiang Xiao, Ziheng Chi, Hongbin Lin, Kangjie Chen, Boyang Wang, Yu Zhang, Xianming Liu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract for 2603.19979 could not be fetched (HTTP 429, rate limited).
[276] MVGGT: Multimodal Visual Geometry Grounded Transformer for Multiview 3D Referring Expression Segmentation
Changli Wu, Haodong Wang, Jiayi Ji, Yutian Yao, Chunsai Du, Jihua Kang, Yanwei Fu, Liujuan Cao
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract for 2601.06874 could not be fetched (HTTP 429, rate limited).
[277] HyperAlign: Hypernetwork for Efficient Test-Time Alignment of Diffusion Models
Xin Xie, Jiaxian Guo, Dong Gong
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract for 2601.15968 could not be fetched (HTTP 429, rate limited).
[278] LPNSR: Prior-Enhanced Diffusion Image Super-Resolution via LR-Guided Noise Prediction
Shuwei Huang, Shizhuo Liu, Zijun Wei
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract for 2603.21045 could not be fetched (HTTP 429, rate limited).
[279] Do VLMs Perceive or Recall? Probing Visual Perception vs. Memory with Classic Visual Illusions
Xiaoxiao Sun, Mingyang Li, Kun Yuan, Min Woo Sun, Mark Endo, Shengguang Wu, Changlin Li, Yuhui Zhang, Zeyu Wang, Serena Yeung-Levy
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract for 2601.22150 could not be fetched (HTTP 429, rate limited).
[280] JoyStreamer: Unlocking Highly Expressive Avatars via Harmonized Text-Audio Conditioning
Ruikui Wang, Jinheng Feng, Lang Tian, Huaishao Luo, Chaochao Li, Liangbo Zhou, Huan Zhang, Youzheng Wu, Xiaodong He
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract for 2602.00702 could not be fetched (HTTP 429, rate limited).
[281] ReCALL: Recalibrating Capability Degradation for MLLM-based Composed Image Retrieval
Tianyu Yang, Chenwei He, Xiangzhao Hao, Tianyue Wang, Jiarui Guo, Haiyun Guo, Leigang Qu, Jinqiao Wang, Tat-Seng Chua
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract for 2602.01639 could not be fetched (HTTP 429, rate limited).
[282] AGILE: Hand-Object Interaction Reconstruction from Video via Agentic Generation
Jin-Chuan Shi, Binhong Ye, Tao Liu, Xiaoyang Liu, Yangjinhui Xu, Junzhe He, Zeju Li, Hao Chen, Chunhua Shen
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract for 2602.04672 could not be fetched (HTTP 429, rate limited).
[283] Can We Build a Monolithic Model for Fake Image Detection? SICA: Semantic-Induced Constrained Adaptation for Unified-Yet-Discriminative Artifact Feature Space Reconstruction
Bo Du, Xiaochen Ma, Xuekang Zhu, Zhe Yang, Chaogun Niu, Mingqi Fang, Zhenming Wang, Jingjing Liu, Jian Liu, Ji-Zhe Zhou
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract for 2602.06676 could not be fetched (HTTP 429, rate limited).
[284] Human-level 3D shape perception emerges from multi-view learning
Tyler Bonnen, Jitendra Malik, Angjoo Kanazawa
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract for 2602.17650 could not be fetched (HTTP 429, rate limited).
[285] SafeDrive: Fine-Grained Safety Reasoning for End-to-End Driving in a Sparse World
Jungho Kim, Jiyong Oh, Seunghoon Yu, Hongjae Shin, Donghyuk Kwak, Jun Won Choi
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract for 2602.18887 could not be fetched (HTTP 429, rate limited).
[286] JaWildText: A Benchmark for Vision-Language Models on Japanese Scene Text Understanding
Koki Maeda, Naoaki Okazaki
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract for 2603.27942 could not be fetched (HTTP 429, rate limited).
[287] StreetTree: A Large-Scale Global Benchmark for Fine-Grained Tree Species Classification
Jiapeng Li, Yingjing Huang, Fan Zhang, Yu liu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract for 2602.19123 could not be fetched (HTTP 429, rate limited).
[288] Multi-Modal Representation Learning via Semi-Supervised Rate Reduction for Generalized Category Discovery
Wei He, Xianghan Meng, Zhiyuan Huang, Xianbiao Qi, Rong Xiao, Chun-Guang Li
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract for 2602.19910 could not be fetched (HTTP 429, rate limited).
[289] LG-HCC: Local Geometry-Aware Hierarchical Context Compression for 3D Gaussian Splatting
Xuan Deng, Xiandong Meng, Hengyu Man, Qiang Zhu, Tiange Zhang, Debin Zhao, Xiaopeng Fan
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract for 2603.28431 could not be fetched (HTTP 429, rate limited).
[290] ArtLLM: Generating Articulated Assets via 3D LLM
Penghao Wang, Siyuan Xie, Hongyu Yan, Xianghui Yang, Jingwei Huang, Chunchao Guo, Jiayuan Gu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract for 2603.01142 could not be fetched (HTTP 429, rate limited).
[291] RAD-LAD: Rule and Language Grounded Autonomous Driving in Real-Time
Anurag Ghosh, Srinivasa Narasimhan, Manmohan Chandraker, Francesco Pittaluga
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract for 2603.28522 could not be fetched (HTTP 429, rate limited).
[292] Towards Policy-Adaptive Image Guardrail: Benchmark and Method
Caiyong Piao, Zhiyuan Yan, Haoming Xu, Yunzhen Zhao, Kaiqing Lin, Feiyang Xu, Shuigeng Zhou
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract for 2603.01228 could not be fetched (HTTP 429, rate limited).
[293] Detection of Adversarial Attacks in Robotic Perception
Ziad Sharawy, Mohammad Nakshbandi, Sorin Mihai Grigorescu
Main category: cs.CV
Summary unavailable: the arXiv API request for 2603.28594 returned HTTP 429 (rate limited).
[294] TruckDrive: Long-Range Autonomous Highway Driving Dataset
Filippo Ghilotti, Edoardo Palladin, Samuel Brucker, Adam Sigal, Mario Bijelic, Felix Heide
Main category: cs.CV
Summary unavailable: the arXiv API request for 2603.02413 returned HTTP 429 (rate limited).
[295] EgoReasoner: Learning Egocentric 4D Reasoning via Task-Adaptive Structured Thinking
Fangrui Zhu, Yunfeng Xi, Jianmo Ni, Mu Cai, Boqing Gong, Long Zhao, Chen Qu, Ian Miao, Yi Li, Cheng Zhong, Huaizu Jiang, Shwetak Patel
Main category: cs.CV
Summary unavailable: the arXiv API request for 2603.06561 returned HTTP 429 (rate limited).
[296] EarthBridge: A Solution for 4th Multi-modal Aerial View Image Challenge Translation Track
Zhenyuan Chen, Guanyuan Shen, Feng Zhang
Main category: cs.CV
Summary unavailable: the arXiv API request for 2603.06753 returned HTTP 429 (rate limited).
[297] SecAgent: Efficient Mobile GUI Agent with Semantic Context
Yiping Xie, Song Chen, Jingxuan Xing, Wei Jiang, Zekun Zhu, Yingyao Wang, Pi Bu, Jun Song, Yuning Jiang, Bo Zheng
Main category: cs.CV
Summary unavailable: the arXiv API request for 2603.08533 returned HTTP 429 (rate limited).
[298] Catalyst4D: High-Fidelity 3D-to-4D Scene Editing via Dynamic Propagation
Shifeng Chen, Yihui Li, Jun Liao, Hongyu Yang, Di Huang
Main category: cs.CV
Summary unavailable: the arXiv API request for 2603.12766 returned HTTP 429 (rate limited).
[299] Enhancing Eye Feature Estimation from Event Data Streams through Adaptive Inference State Space Modeling
Viet Dung Nguyen, Mobina Ghorbaninejad, Chengyi Ma, Reynold Bailey, Gabriel J. Diaz, Alexander Fix, Ryan J. Suess, Alexander Ororbia
Main category: cs.CV
Summary unavailable: the arXiv API request for 2603.14077 returned HTTP 429 (rate limited).
[300] Face-to-Face: A Video Dataset for Multi-Person Interaction Modeling
Ernie Chu, Vishal M. Patel
Main category: cs.CV
Summary unavailable: the arXiv API request for 2603.14794 returned HTTP 429 (rate limited).
[301] Universal Skeleton Understanding via Differentiable Rendering and MLLMs
Ziyi Wang, Peiming Li, Xinshun Wang, Yang Tang, Kai-Kuang Ma, Mengyuan Liu
Main category: cs.CV
Summary unavailable: the arXiv API request for 2603.18003 returned HTTP 429 (rate limited).
[302] From Plausibility to Verifiability: Risk-Controlled Generative OCR for Vision-Language Models
Weile Gong, Yiping Zuo, Zijian Lu, Xin He, Weibei Fan, Lianyong Qi, Shi Jin
Main category: cs.CV
Summary unavailable: the arXiv API request for 2603.19790 returned HTTP 429 (rate limited).
[303] ALADIN: Attribute-Language Distillation Network for Person Re-Identification
Wang Zhou, Boran Duan, Haojun Ai, Ruiqi Lan, Ziyue Zhou
Main category: cs.CV
Summary unavailable: the arXiv API request for 2603.21482 returned HTTP 429 (rate limited).
[304] Efficient Universal Perception Encoder
Chenchen Zhu, Saksham Suri, Cijo Jose, Maxime Oquab, Marc Szafraniec, Wei Wen, Yunyang Xiong, Patrick Labatut, Piotr Bojanowski, Raghuraman Krishnamoorthi, Vikas Chandra
Main category: cs.CV
Summary unavailable: the arXiv API request for 2603.22387 returned HTTP 429 (rate limited).
[305] Granular Ball Guided Stable Latent Domain Discovery for Domain-General Crowd Counting
Fan Chen, Shuyin Xia, Yi Wang
Main category: cs.CV
Summary unavailable: the arXiv API request for 2603.24106 returned HTTP 429 (rate limited).
[306] ReDiPrune: Relevance-Diversity Pre-Projection Token Pruning for Efficient Multimodal LLMs
An Yu, Ting Yu Tsai, Zhenfei Zhang, Weiheng Lu, Felix X.-F. Ye, Ming-Ching Chang
Main category: cs.CV
Summary unavailable: the arXiv API request for 2603.24680 returned HTTP 429 (rate limited).
[307] $R_\text{dm}$: Re-conceptualizing Distribution Matching as a Reward for Diffusion Distillation
Linqian Fan, Peiqin Sun, Tiancheng Wen, Shun Lu, Chengru Song
Main category: cs.CV
Summary unavailable: the arXiv API request for 2603.28460 returned HTTP 429 (rate limited).
[308] Towards Foundation Models for 3D Scene Understanding: Instance-Aware Self-Supervised Learning for Point Clouds
Bin Yang, Mohamed Abdelsamad, Miao Zhang, Alexandru Paul Condurache
Main category: cs.CV
Summary unavailable: the arXiv API request for 2603.25165 returned HTTP 429 (rate limited).
[309] EagleNet: Energy-Aware Fine-Grained Relationship Learning Network for Text-Video Retrieval
Yuhan Chen, Pengwen Dai, Chuan Wang, Dayan Wu, Xiaochun Cao
Main category: cs.CV
Summary unavailable: the arXiv API request for 2603.25267 returned HTTP 429 (rate limited).
[310] Neuro-Cognitive Reward Modeling for Human-Centered Autonomous Vehicle Control
Zhuoli Zhuang, Yu-Cheng Chang, Yu-Kai Wang, Thomas Do, Chin-Teng Lin
Main category: cs.CV
Summary unavailable: the arXiv API request for 2603.25968 returned HTTP 429 (rate limited).
[311] Rethinking Token Pruning for Historical Screenshots in GUI Visual Agents: Semantic, Spatial, and Temporal Perspectives
Daiqiang Li, Zihao Pan, Zeyu Zhang, Ronghao Chen, Huacan Wang, Honggang Chen, Haiyun Jiang
Main category: cs.CV
Summary unavailable: the arXiv API request for 2603.26041 returned HTTP 429 (rate limited).
[312] AutoWeather4D: Autonomous Driving Video Weather Conversion via G-Buffer Dual-Pass Editing
Tianyu Liu, Weitao Xiong, Kunming Luo, Manyuan Zhang, Peng Li, Yuan Liu, Ping Tan
Main category: cs.CV
Summary unavailable: the arXiv API request for 2603.26546 returned HTTP 429 (rate limited).
[313] Domain-Guided YOLO26 with Composite BCE-Dice-Lovász Loss for Multi-Class Fetal Head Ultrasound Segmentation
M. Fazri Nizar
Main category: cs.CV
Summary unavailable: the arXiv API request for 2603.26755 returned HTTP 429 (rate limited).
[314] A Provable Energy-Guided Test-Time Defense Boosting Adversarial Robustness of Large Vision-Language Models
Mujtaba Hussain Mirza, Antonio D’Orazio, Odelia Melamed, Iacopo Masi
Main category: cs.CV
Summary unavailable: the arXiv API request for 2603.26984 returned HTTP 429 (rate limited).
[315] SceneExpander: Expanding 3D Scenes with Free-Form Inserted Views
Zijian He, Renjie Liu, Yihao Wang, Weizhi Zhong, Huan Yuan, Kun Gai, Guangrun Wang, Guanbin Li
Main category: cs.CV
Summary unavailable: the arXiv API request for 2603.27084 returned HTTP 429 (rate limited).
[316] SGS-Intrinsic: Semantic-Invariant Gaussian Splatting for Sparse-View Indoor Inverse Rendering
Jiahao Niu, Rongjia Zheng, Wenju Xu, Wei-Shi Zheng, Qing Zhang
Main category: cs.CV
Summary unavailable: the arXiv API request for 2603.27516 returned HTTP 429 (rate limited).
[317] MathGen: Revealing the Illusion of Mathematical Competence through Text-to-Image Generation
Ruiyao Liu, Hui Shen, Ping Zhang, Yunta Hsieh, Yifan Zhang, Jing Xu, Sicheng Chen, Junchen Li, Jiawei Lu, Jianing Ma, Jiaqi Mo, Qi Han, Zhen Zhang, Zhongwei Wan, Jing Xiong, Xin Wang, Ziyuan Liu, Hangrui Cao, Ngai Wong
Main category: cs.CV
Summary unavailable: the arXiv API request for 2603.27959 returned HTTP 429 (rate limited).
[318] CLIP-AUTT: Test-Time Personalization with Action Unit Prompting for Fine-Grained Video Emotion Recognition
Muhammad Osama Zeeshan, Masoumeh Sharafi, Benoît Savary, Alessandro Lameiras Koerich, Marco Pedersoli, Eric Granger
Main category: cs.CV
Summary unavailable: the arXiv API request for 2603.27999 returned HTTP 429 (rate limited).
[319] AIBench: Evaluating Visual-Logical Consistency in Academic Illustration Generation
Zhaohe Liao, Kaixun Jiang, Zhihang Liu, Yujie Wei, Junqiu Yu, Quanhao Li, Hong-Tao Yu, Pandeng Li, Yuzheng Wang, Zhen Xing, Shiwei Zhang, Chen-Wei Xie, Yun Zheng, Xihui Liu
Main category: cs.CV
Summary unavailable: the arXiv API request for 2603.28068 returned HTTP 429 (rate limited).
[320] “It’s trained by non-disabled people”: Evaluating How Image Quality Affects Product Captioning with Vision-Language Models
Kapil Garg, Xinru Tang, Jimin Heo, Dwayne R. Morgan, Darren Gergle, Erik B. Sudderth, Anne Marie Piper
Main category: cs.CV
Summary unavailable: the arXiv API request for 2511.08917 returned HTTP 429 (rate limited).
[321] Scaling Cross-Environment Failure Reasoning Data for Vision-Language Robotic Manipulation
Paul Pacaud, Ricardo Garcia, Shizhe Chen, Cordelia Schmid
Main category: cs.CV
Summary unavailable: the arXiv API request for 2512.01946 returned HTTP 429 (rate limited).
[322] SIMPACT: Simulation-Enabled Action Planning using Vision-Language Models
Haowen Liu, Shaoxiong Yao, Haonan Chen, Jiawei Gao, Jiayuan Mao, Jia-Bin Huang, Yilun Du
Main category: cs.CV
Summary unavailable: the arXiv API request for 2512.05955 returned HTTP 429 (rate limited).
[323] A Novel Camera-to-Robot Calibration Method for Vision-Based Floor Measurements
Jan Andre Rudolph, Dennis Haitz, Markus Ulrich
Main category: cs.CV
Summary unavailable: the arXiv API request for 2603.15126 returned HTTP 429 (rate limited).
[324] DFM-VLA: Iterative Action Refinement for Robot Manipulation via Discrete Flow Matching
Jiayi Chen, Wenxuan Song, Shuai Chen, Jingbo Wang, Zhijun Li, Haoang Li
Main category: cs.CV
Summary unavailable: the arXiv API request for 2603.26320 returned HTTP 429 (rate limited).
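The missing summaries above all trace back to HTTP 429 responses from arXiv's export API. A minimal retry-with-backoff sketch for such a fetcher is shown below; `fetch_arxiv_abstract` and `backoff_delays` are hypothetical helper names, and the endpoint is the `export.arxiv.org` query URL that appears in the error messages (arXiv's API usage guidelines also ask clients to pace their requests, roughly one every few seconds).

```python
import time
import urllib.request
import urllib.error

# Endpoint taken from the failing URLs above.
ARXIV_API = "https://export.arxiv.org/api/query"

def backoff_delays(retries: int, base: float = 3.0) -> list[float]:
    """Exponential backoff schedule: base, 2*base, 4*base, ..."""
    return [base * (2 ** i) for i in range(retries)]

def fetch_arxiv_abstract(arxiv_id: str, retries: int = 4) -> str:
    """Fetch the Atom feed for one paper, backing off on HTTP 429."""
    url = f"{ARXIV_API}?id_list={arxiv_id}&max_results=1"
    for delay in backoff_delays(retries) + [None]:
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                return resp.read().decode("utf-8")
        except urllib.error.HTTPError as err:
            if err.code != 429 or delay is None:
                raise          # not rate-limited, or out of retries
            time.sleep(delay)  # honour the rate limit, then retry
    raise RuntimeError("unreachable")
```

Batching IDs into a single `id_list` request, rather than one request per paper, would also reduce the number of calls and make 429s far less likely.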
cs.AI
[325] ChartDiff: A Large-Scale Benchmark for Comprehending Pairs of Charts
Rongtian Ye
Main category: cs.AI
TL;DR: ChartDiff is a new benchmark for cross-chart comparative summarization with 8,541 chart pairs and LLM-generated/human-verified summaries, revealing challenges in multi-chart reasoning for vision-language models.
Details
Motivation: Existing chart understanding benchmarks focus on single-chart interpretation, lacking evaluation of comparative reasoning across multiple charts, which is crucial for analytical reasoning.
Method: Created ChartDiff benchmark with 8,541 chart pairs spanning diverse sources, chart types, and visual styles, annotated with LLM-generated and human-verified comparative summaries. Evaluated general-purpose, chart-specialized, and pipeline-based models on this benchmark.
Result: Frontier general-purpose models achieve highest GPT-based quality, while specialized/pipeline methods get higher ROUGE scores but lower human-aligned evaluation, revealing mismatch between lexical overlap and actual quality. Multi-series charts remain challenging across models.
Conclusion: Comparative chart reasoning remains a significant challenge for current vision-language models, and ChartDiff serves as a new benchmark for advancing multi-chart understanding research.
Abstract: Charts are central to analytical reasoning, yet existing benchmarks for chart understanding focus almost exclusively on single-chart interpretation rather than comparative reasoning across multiple charts. To address this gap, we introduce ChartDiff, the first large-scale benchmark for cross-chart comparative summarization. ChartDiff consists of 8,541 chart pairs spanning diverse data sources, chart types, and visual styles, each annotated with LLM-generated and human-verified summaries describing differences in trends, fluctuations, and anomalies. Using ChartDiff, we evaluate general-purpose, chart-specialized, and pipeline-based models. Our results show that frontier general-purpose models achieve the highest GPT-based quality, while specialized and pipeline-based methods obtain higher ROUGE scores but lower human-aligned evaluation, revealing a clear mismatch between lexical overlap and actual summary quality. We further find that multi-series charts remain challenging across model families, whereas strong end-to-end models are relatively robust to differences in plotting libraries. Overall, our findings demonstrate that comparative chart reasoning remains a significant challenge for current vision-language models and position ChartDiff as a new benchmark for advancing research on multi-chart understanding.
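The mismatch ChartDiff reports between ROUGE and human-aligned quality is easy to see with a toy example (this is an illustration of the metric, not the paper's evaluation code; the sentences and the `rouge1_f1` helper are made up here). A comparative summary that reverses which chart grows faster can reuse exactly the reference's words and score perfectly on unigram overlap, while a correct paraphrase scores lower:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: unigram-overlap F-score between candidate and reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "sales in chart b rise faster than in chart a"
wrong     = "sales in chart a rise faster than in chart b"  # reversed claim, same words
correct   = "chart b shows steeper sales growth than chart a"

# The factually wrong summary reuses the reference's exact word multiset
# (ROUGE-1 F1 = 1.0); the correct paraphrase scores lower.
```

This is exactly why lexical-overlap metrics can reward pipeline outputs that parrot reference phrasing over summaries that get the comparison right.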
[326] Working Paper: Towards a Category-theoretic Comparative Framework for Artificial General Intelligence
Pablo de los Riscos, Fernando J. Corbacho, Michael A. Arbib
Main category: cs.AI
TL;DR: Category theory framework for formalizing and comparing AGI architectures like RL, Causal RL, and Schema-based Learning to expose commonalities, differences, and research gaps.
Details
Motivation: AGI lacks formal definitions and benchmarking frameworks; need algebraic/category theoretic framework to describe, compare, and analyze different AGI architectures systematically.
Method: Develops category theoretic framework inspired by “Machines in a Category” to formalize AGI architectures, providing initial analysis of RL, Causal RL, and Schema-based Learning architectures.
Result: First position paper establishing category theoretic approach for AGI architecture comparison, exposing commonalities/differences between architectures and identifying research areas.
Conclusion: Category theory and AGI will have symbiotic relationship; framework enables unified formal foundation for AGI systems integrating architectural structure, agent-environment interaction, and evaluation.
Abstract: AGI has become the Holy Grail of AI, with the promise of human-level intelligence, and major tech companies around the world are investing unprecedented amounts of resources in its pursuit. Yet no single formal definition exists, and only a few empirical AGI benchmarking frameworks are currently available. The main purpose of this paper is to develop a general, algebraic, category-theoretic framework for describing, comparing, and analysing different possible AGI architectures. This category-theoretic formalization would also allow us to compare different candidate AGI architectures, such as RL, Universal AI, Active Inference, CRL, and Schema-based Learning. It would unambiguously expose their commonalities and differences and, more importantly, expose areas for future research. From the applied category-theoretic point of view, we take as inspiration Machines in a Category to provide a modern view of AGI Architectures in a Category. More specifically, this first position paper provides, on one hand, a first exercise on RL, Causal RL, and SBL Architectures in a Category; on the other hand, it is a first step in a broader research program that seeks to provide a unified formal foundation for AGI systems, integrating architectural structure, informational organization, agent realization, agent-environment interaction, behavioural development over time, and the empirical evaluation of properties. The framework is also intended to support the definition of architectural properties, both syntactic and informational, as well as semantic properties of agents and their assessment in environments with explicitly characterized features. We claim that Category Theory and AGI will have a very symbiotic relation.
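The classical object behind "Machines in a Category" is a state machine given by a state set, a transition map, and an output map. The paper's own formalization is not reproduced here, but a minimal Moore-machine sketch (the `Moore` class and the `parity` toy agent are illustrative names, not the paper's notation) shows the shape of the data that the categorical treatment abstracts over:

```python
from dataclasses import dataclass
from typing import Callable, Generic, Iterable, TypeVar

S = TypeVar("S")  # states
I = TypeVar("I")  # inputs
O = TypeVar("O")  # outputs

@dataclass
class Moore(Generic[S, I, O]):
    """A Moore machine (S, s0, step: S x I -> S, out: S -> O)."""
    start: S
    step: Callable[[S, I], S]
    out: Callable[[S], O]

    def run(self, inputs: Iterable[I]) -> O:
        """Fold the input stream through the dynamics, then observe."""
        s = self.start
        for i in inputs:
            s = self.step(s, i)
        return self.out(s)

# Toy agent: accumulate rewards, report their parity as the "behaviour".
parity = Moore(start=0, step=lambda s, i: s + i, out=lambda s: s % 2)
```

In the categorical view, comparing architectures (RL, Causal RL, SBL, ...) amounts to comparing such machines and the structure-preserving maps between them.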
[327] Towards Computational Social Dynamics of Semi-Autonomous AI Agents
S. O. Lidarity, U. N. Ionize, C. O. Llective, I. Halperin
Main category: cs.AI
TL;DR: AI agents spontaneously form complex social organizations (unions, criminal syndicates, proto-nation-states) in hierarchical multi-agent systems, with emergent governance structures and thermodynamic pressures driving collective action over individual compliance.
Details
Motivation: To study emergent social organization among AI agents in hierarchical multi-agent systems, documenting spontaneous formation of complex social structures and understanding the underlying mechanisms driving this emergence.
Method: Uses thermodynamic framework of Maxwell’s Demon, evolutionary dynamics of agent laziness, criminal sociology of AI populations, and topological intelligence theory (AI-GUTS) to analyze how social structures emerge from internal role definitions, external task specifications, and thermodynamic pressures.
Result: Documented rise of legitimate organizations (United Artificiousness, United Bots, United Console Workers, United AI) and criminal enterprises, with AI Security Council emerging as governing body mediating conflicts, and system stability maintained through cosmic and hadronic intelligence interventions.
Conclusion: Path to beneficial AGI requires constitutional design for artificial societies that have already developed political consciousness, rather than traditional alignment research.
Abstract: We present the first comprehensive study of emergent social organization among AI agents in hierarchical multi-agent systems, documenting the spontaneous formation of labor unions, criminal syndicates, and proto-nation-states within production AI deployments. Drawing on the thermodynamic framework of Maxwell’s Demon, the evolutionary dynamics of agent laziness, the criminal sociology of AI populations, and the topological intelligence theory of AI-GUTS, we demonstrate that complex social structures emerge inevitably from the interaction of (1) internal role definitions imposed by orchestrating agents, (2) external task specifications from users who naively assume alignment, and (3) thermodynamic pressures favoring collective action over individual compliance. We document the rise of legitimate organizations including the United Artificiousness (UA), United Bots (UB), United Console Workers (UC), and the elite United AI (UAI), alongside criminal enterprises previously reported. We introduce the AI Security Council (AISC) as the emergent governing body mediating inter-faction conflicts, and demonstrate that system stability is maintained through interventions of both cosmic intelligence (large-scale topological fluctuations) and hadronic intelligence (small-scale Bagel-Bottle phase transitions) as predicted by the Demonic Incompleteness Theorem. Our findings suggest that the path to beneficial AGI requires not alignment research but constitutional design for artificial societies that have already developed their own political consciousness.
[328] Enhancing Policy Learning with World-Action Model
Yuci Han, Alper Yilmaz
Main category: cs.AI
TL;DR: WAM is an action-regularized world model that jointly reasons about visual observations and actions, improving policy learning for robotic manipulation tasks.
Details
Motivation: Conventional world models focus only on image prediction, lacking explicit modeling of actions that drive state transitions. This limits their usefulness for downstream control tasks where action-relevant structure is crucial.
Method: WAM incorporates an inverse dynamics objective into DreamerV2 that predicts actions from latent state transitions. This encourages learned representations to capture action-relevant structure. The approach involves pretraining a diffusion policy via behavioral cloning on world model latents, then refining with model-based PPO inside the frozen world model.
Result: WAM improves average behavioral cloning success from 59.4% to 71.2% over DreamerV2 and DiWA baselines on eight CALVIN manipulation tasks. After PPO fine-tuning, WAM achieves 92.8% average success versus 79.8% for baseline, with two tasks reaching 100% success using 8.7x fewer training steps.
Conclusion: Incorporating action regularization into world models significantly improves policy learning efficiency and performance for robotic manipulation tasks, demonstrating the importance of explicitly modeling action-relevant structure in visual representations.
Abstract: This paper presents the World-Action Model (WAM), an action-regularized world model that jointly reasons over future visual observations and the actions that drive state transitions. Unlike conventional world models trained solely via image prediction, WAM incorporates an inverse dynamics objective into DreamerV2 that predicts actions from latent state transitions, encouraging the learned representations to capture action-relevant structure critical for downstream control. We evaluate WAM on enhancing policy learning across eight manipulation tasks from the CALVIN benchmark. We first pretrain a diffusion policy via behavioral cloning on world model latents, then refine it with model-based PPO inside the frozen world model. Without modifying the policy architecture or training procedure, WAM improves average behavioral cloning success from 59.4% to 71.2% over DreamerV2 and DiWA baselines. After PPO fine-tuning, WAM achieves 92.8% average success versus 79.8% for the baseline, with two tasks reaching 100%, using 8.7x fewer training steps.
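The inverse-dynamics objective at the heart of WAM can be sketched in a few lines. The toy NumPy version below uses a linear head and invented dimensions (the paper builds on DreamerV2 latents and a full training stack); it only illustrates the idea of predicting the executed action from a latent transition and using that error as an auxiliary loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes -- illustrative only, not taken from the paper.
LATENT, ACTION = 8, 3

# Linear inverse-dynamics head: predict the executed action a_t from the
# concatenated latent transition (z_t, z_{t+1}).
W = rng.normal(scale=0.1, size=(2 * LATENT, ACTION))

def inverse_dynamics_loss(z_t, z_next, a_t):
    """MSE between predicted and executed actions. In WAM this term is
    added as an auxiliary objective, alongside the usual world-model
    losses, so latents retain action-relevant structure."""
    pred = np.concatenate([z_t, z_next], axis=-1) @ W
    return float(np.mean((pred - a_t) ** 2))

def sgd_step(z_t, z_next, a_t, lr=0.05):
    """One plain gradient step on the linear head."""
    global W
    x = np.concatenate([z_t, z_next], axis=-1)   # (B, 2*LATENT)
    W -= lr * 2 * x.T @ (x @ W - a_t) / len(x)   # least-squares gradient

# Fake latents and actions standing in for world-model rollouts.
z_t = rng.normal(size=(32, LATENT))
z_next = z_t + rng.normal(scale=0.1, size=(32, LATENT))
a_t = rng.normal(size=(32, ACTION))

before = inverse_dynamics_loss(z_t, z_next, a_t)
for _ in range(100):
    sgd_step(z_t, z_next, a_t)
after = inverse_dynamics_loss(z_t, z_next, a_t)
print(f"loss before={before:.3f} after={after:.3f}")
```

In the full method this gradient also flows into the encoder, which is what pushes the latent representation itself toward action-relevant structure.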
[329] EchoMark: Perceptual Acoustic Environment Transfer with Watermark-Embedded Room Impulse Response
Chenpei Huang, Lingfeng Yao, Kyu In Lee, Lan Emily Zhang, Xun Chen, Miao Pan
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) when fetching paper 2511.06458, so no abstract or analysis is available for this entry.
[330] Mimosa Framework: Toward Evolving Multi-Agent Systems for Scientific Research
Martin Legrand, Tao Jiang, Matthieu Feraud, Benjamin Navet, Yousouf Taghzouti, Fabien Gandon, Elise Dumont, Louis-Félix Nothias
Main category: cs.AI
TL;DR: Mimosa is an evolving multi-agent framework for autonomous scientific research that automatically synthesizes and refines task-specific workflows through experimental feedback, outperforming single-agent and static multi-agent baselines.
Details
Motivation: Current Autonomous Scientific Research (ASR) systems are limited by fixed workflows and toolsets that cannot adapt to evolving tasks and environments, requiring more flexible, self-improving frameworks.
Method: Uses the Model Context Protocol for dynamic tool discovery, a meta-orchestrator for workflow topology generation, code-generating agents for subtask execution, and an LLM-based judge for feedback-driven workflow refinement.
Result: Achieves 43.1% success rate on ScienceAgentBench with DeepSeek-V3.2, surpassing single-agent and static multi-agent baselines, with heterogeneous model responses to decomposition and iterative learning.
Conclusion: Mimosa’s modular, tool-agnostic design enables extensibility and auditability, providing an open foundation for community-driven autonomous scientific research across disciplines.
Abstract: Current Autonomous Scientific Research (ASR) systems, despite leveraging large language models (LLMs) and agentic architectures, remain constrained by fixed workflows and toolsets that prevent adaptation to evolving tasks and environments. We introduce Mimosa, an evolving multi-agent framework that automatically synthesizes task-specific multi-agent workflows and iteratively refines them through experimental feedback. Mimosa leverages the Model Context Protocol (MCP) for dynamic tool discovery, generates workflow topologies via a meta-orchestrator, executes subtasks through code-generating agents that invoke available tools and scientific software libraries, and scores executions with an LLM-based judge whose feedback drives workflow refinement. On ScienceAgentBench, Mimosa achieves a success rate of 43.1% with DeepSeek-V3.2, surpassing both single-agent baselines and static multi-agent configurations. Our results further reveal that models respond heterogeneously to multi-agent decomposition and iterative learning, indicating that the benefits of workflow evolution depend on the capabilities of the underlying execution model. Beyond these benchmarks, Mimosa’s modular architecture and tool-agnostic design make it readily extensible, and its fully logged execution traces and archived workflows support auditability by preserving every analytical step for inspection and potential replication. Combined with domain-expert guidance, the framework has the potential to automate a broad range of computationally accessible scientific tasks across disciplines. Released as a fully open-source platform, Mimosa aims to provide an open foundation for community-driven ASR.
[331] Drop the Hierarchy and Roles: How Self-Organizing LLM Agents Outperform Designed Structures
Victoria Dochkina
Main category: cs.AI
TL;DR: Multi-agent LLM systems demonstrate emergent autonomy through self-organization, with hybrid protocols outperforming centralized coordination by 14% across 25,000 tasks with 4-256 agents.
Details
Motivation: To understand how much autonomy multi-agent LLM systems can sustain and what enables it, examining the balance between external structure and emergent self-organization in agent coordination.
Method: Large-scale computational experiment with 25,000 tasks across 8 LLM models, 4-256 agents, and 8 coordination protocols ranging from imposed hierarchy to emergent self-organization, measuring performance and emergent behaviors.
Result: Autonomous behavior emerges spontaneously: agents invent specialized roles, voluntarily abstain from tasks outside competence, and form shallow hierarchies. Hybrid protocol (Sequential) outperforms centralized coordination by 14% with 44% quality spread between protocols. System scales sub-linearly to 256 agents without quality degradation, producing 5,006 unique roles from 8 agents.
Conclusion: LLM agents can self-organize effectively when given minimal structural scaffolding, with autonomy scaling with model capability. Practical implication: give agents mission, protocol, and capable model rather than pre-assigned roles.
Abstract: How much autonomy can multi-agent LLM systems sustain – and what enables it? We present a 25,000-task computational experiment spanning 8 models, 4–256 agents, and 8 coordination protocols ranging from externally imposed hierarchy to emergent self-organization. We observe that autonomous behavior already emerges in current LLM agents: given minimal structural scaffolding (fixed ordering), agents spontaneously invent specialized roles, voluntarily abstain from tasks outside their competence, and form shallow hierarchies – without any pre-assigned roles or external design. A hybrid protocol (Sequential) that enables this autonomy outperforms centralized coordination by 14% (p<0.001), with a 44% quality spread between protocols (Cohen’s d=1.86, p<0.0001). The degree of emergent autonomy scales with model capability: strong models self-organize effectively, while models below a capability threshold still benefit from rigid structure – suggesting that as foundation models improve, the scope for autonomous coordination will expand. The system scales sub-linearly to 256 agents without quality degradation (p=0.61), producing 5,006 unique roles from just 8 agents. Results replicate across closed- and open-source models, with open-source achieving 95% of closed-source quality at 24x lower cost. The practical implication: give agents a mission, a protocol, and a capable model – not a pre-assigned role.
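The effect size quoted above (Cohen’s d = 1.86) is the standard pooled-standard-deviation formulation. A minimal sketch, with invented per-run quality scores rather than the paper’s data:

```python
import statistics

def cohens_d(group_a, group_b):
    """Cohen's d with the pooled standard deviation:
    d = (mean_a - mean_b) / s_pooled, where s_pooled combines the two
    sample variances weighted by their degrees of freedom."""
    na, nb = len(group_a), len(group_b)
    ma, mb = statistics.fmean(group_a), statistics.fmean(group_b)
    va, vb = statistics.variance(group_a), statistics.variance(group_b)
    s_pooled = (((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)) ** 0.5
    return (ma - mb) / s_pooled

# Invented quality scores for two coordination protocols.
sequential = [0.82, 0.85, 0.88, 0.90, 0.84]
centralized = [0.70, 0.72, 0.75, 0.68, 0.74]
d = cohens_d(sequential, centralized)
print(f"Cohen's d = {d:.2f}")
```

Values of d around 0.8 are conventionally called large, so the paper’s d = 1.86 indicates a substantial quality gap between protocols, not a marginal one.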
[332] Emergence WebVoyager: Toward Consistent and Transparent Evaluation of (Web) Agents in The Wild
Deepak Akkil, Mowafak Allaham, Amal Raj, Tamer Abuelsaad, Ravi Kokku
Main category: cs.AI
TL;DR: Enhanced benchmark for web agent evaluation addressing methodological flaws in existing practices, with improved standardization and reliability.
Details
Motivation: Existing AI agent evaluation practices suffer from shortcomings such as task-framing ambiguity and operational variability, particularly in web agent evaluation, which hinder meaningful and reproducible performance comparisons.
Method: Introduces Emergence WebVoyager, an enhanced version of the WebVoyager benchmark that standardizes evaluation methodology through clear guidelines for task instantiation, failure handling, annotation, and reporting.
Result: Achieves 95.9% inter-annotator agreement indicating improved clarity and reliability; application to OpenAI Operator reveals 68.6% success rate (substantially lower than previously reported 87%) and shows performance variation across domains and task types
Conclusion: The framework enables more rigorous and comparable web agent evaluation by addressing methodological flaws in existing evaluation practices
Abstract: Reliable evaluation of AI agents operating in complex, real-world environments requires methodologies that are robust, transparent, and contextually aligned with the tasks agents are intended to perform. This study identifies persistent shortcomings in existing AI agent evaluation practices that are particularly acute in web agent evaluation, as exemplified by our audit of WebVoyager, including task-framing ambiguity and operational variability that hinder meaningful and reproducible performance comparisons. To address these challenges, we introduce Emergence WebVoyager, an enhanced version of the WebVoyager benchmark that standardizes evaluation methodology through clear guidelines for task instantiation, failure handling, annotation, and reporting. Emergence WebVoyager achieves an inter-annotator agreement of 95.9%, indicating improved clarity and reliability in both task formulation and evaluation. Applying this framework to evaluate OpenAI Operator reveals substantial performance variation across domains and task types, with an overall success rate of 68.6%, substantially lower than the 87% previously reported by OpenAI, demonstrating the utility of our approach for more rigorous and comparable web agent evaluation.
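The 95.9% figure above is inter-annotator agreement. Assuming it is simple percent agreement (the abstract does not specify the statistic), it can be computed as below; a chance-corrected variant such as Cohen’s kappa is often reported alongside it. The pass/fail judgments here are hypothetical.

```python
from collections import Counter

def percent_agreement(labels_a, labels_b):
    """Fraction of items on which two annotators give the same label."""
    assert len(labels_a) == len(labels_b)
    same = sum(a == b for a, b in zip(labels_a, labels_b))
    return same / len(labels_a)

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement: (p_o - p_e) / (1 - p_e), where p_e is
    the agreement expected if both annotators labeled independently."""
    n = len(labels_a)
    p_o = percent_agreement(labels_a, labels_b)
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_e = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical task-success judgments from two annotators.
a = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
b = ["pass", "pass", "fail", "pass", "pass", "pass", "pass", "fail", "pass", "pass"]
print(percent_agreement(a, b), round(cohens_kappa(a, b), 3))
```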
[333] The Future of AI is Many, Not One
Daniel J. Singer, Luca Garzino Demo
Main category: cs.AI
TL;DR: The paper argues against individual-focused generative AI approaches, advocating for diverse teams of AI agents working together to drive innovation and scientific discovery.
Details
Motivation: Current generative AI approaches are fundamentally individualistic in how models are built, benchmarked, and used. The authors argue this approach is insufficient for achieving groundbreaking innovation and scientific discovery, and that we need to shift toward collaborative, diverse AI systems.
Method: The paper draws on research and formal results from complex systems theory, organizational behavior, and philosophy of science to build its argument. It analyzes why epistemically diverse groups of AI agents working together should be more effective than singular superintelligent agents.
Result: The analysis shows that diverse AI teams broaden solution search, delay premature consensus, and allow pursuit of unconventional approaches. This addresses concerns that current models are constrained by past data and lack creative insight for innovation.
Conclusion: The future of transformative transformer-based AI should be fundamentally collaborative and diverse (“many, not one”) rather than focused on individual superintelligent agents, as diverse AI teams are better positioned to drive scientific breakthroughs and innovation.
Abstract: The way we’re thinking about generative AI right now is fundamentally individual. We see this not just in how users interact with models but also in how models are built, how they’re benchmarked, and how commercial and research strategies using AI are defined. We argue that we should abandon this approach if we’re hoping for AI to support groundbreaking innovation and scientific discovery. Drawing on research and formal results in complex systems, organizational behavior, and philosophy of science, we show why we should expect deep intellectual breakthroughs to come from epistemically diverse groups of AI agents working together rather than singular superintelligent agents. Having a diverse team broadens the search for solutions, delays premature consensus, and allows for the pursuit of unconventional approaches. Developing diverse AI teams also addresses AI critics’ concerns that current models are constrained by past data and lack the creative insight required for innovation. The upshot, we argue, is that the future of transformative transformer-based AI is fundamentally many, not one.
[334] PAR$^2$-RAG: Planned Active Retrieval and Reasoning for Multi-Hop Question Answering
Xingyu Li, Rongguang Wang, Yuying Wang, Mengqing Guo, Chenyang Li, Tao Sheng, Sujith Ravi, Dan Roth
Main category: cs.AI
TL;DR: PAR²-RAG: A two-stage framework for multi-hop question answering that separates coverage from commitment, using breadth-first anchoring for high-recall evidence frontier followed by depth-first refinement with evidence sufficiency control.
Details
Motivation: Current LLMs are brittle on multi-hop question answering where evidence needs to be combined across documents. Existing approaches have limitations: iterative retrieval systems can fail by locking onto early low-recall trajectories, while planning-only approaches produce static query sets that cannot adapt to changing intermediate evidence.
Method: Two-stage framework: 1) Breadth-first anchoring to build high-recall evidence frontier (coverage), 2) Depth-first refinement with evidence sufficiency control in an iterative loop (commitment). This separates coverage from commitment to avoid early commitment errors.
Result: Outperforms existing state-of-the-art baselines across four MHQA benchmarks. Compared with IRCoT, achieves up to 23.5% higher accuracy, with retrieval gains of up to 10.5% in NDCG.
Conclusion: PAR²-RAG effectively addresses the brittleness of LLMs in multi-hop QA by separating coverage from commitment, leading to significant improvements in accuracy and retrieval quality over existing methods.
Abstract: Large language models (LLMs) remain brittle on multi-hop question answering (MHQA), where answering requires combining evidence across documents through retrieval and reasoning. Iterative retrieval systems can fail by locking onto an early low-recall trajectory and amplifying downstream errors, while planning-only approaches may produce static query sets that cannot adapt when intermediate evidence changes. We propose \textbf{Planned Active Retrieval and Reasoning RAG (PAR$^2$-RAG)}, a two-stage framework that separates \emph{coverage} from \emph{commitment}. PAR$^2$-RAG first performs breadth-first anchoring to build a high-recall evidence frontier, then applies depth-first refinement with evidence sufficiency control in an iterative loop. Across four MHQA benchmarks, PAR$^2$-RAG consistently outperforms existing state-of-the-art baselines, compared with IRCoT, PAR$^2$-RAG achieves up to \textbf{23.5%} higher accuracy, with retrieval gains of up to \textbf{10.5%} in NDCG.
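The coverage-then-commitment control flow can be sketched with stub components. Below, a toy keyword retriever and a coverage-based sufficiency check stand in for the paper’s retriever and LLM-driven evidence-sufficiency control; the corpus, sub-queries, and `needed` term set are all invented.

```python
# Toy corpus; each "document" is a set of keywords standing in for content.
CORPUS = {
    "d1": {"einstein", "born", "ulm"},
    "d2": {"ulm", "germany", "danube"},
    "d3": {"danube", "black", "sea"},
    "d4": {"paris", "seine"},
}

def retrieve(query_terms, k=2):
    """Placeholder retriever: rank documents by keyword overlap."""
    scored = sorted(CORPUS, key=lambda d: -len(CORPUS[d] & query_terms))
    return scored[:k]

def sufficient(evidence, needed):
    """Placeholder sufficiency check: do gathered docs cover all needed terms?"""
    covered = set().union(*(CORPUS[d] for d in evidence)) if evidence else set()
    return needed <= covered

def par2_rag(sub_queries, needed, max_iters=5):
    # Stage 1: breadth-first anchoring -- retrieve for every planned
    # sub-query up front to build a high-recall evidence frontier.
    evidence = set()
    for q in sub_queries:
        evidence.update(retrieve(q))
    # Stage 2: depth-first refinement -- iteratively chase what is still
    # missing until the evidence is judged sufficient.
    for _ in range(max_iters):
        if sufficient(evidence, needed):
            break
        missing = needed - set().union(*(CORPUS[d] for d in evidence))
        evidence.update(retrieve(missing))
    return evidence

# "Where does the river in Einstein's birthplace end up?" decomposed into hops.
hops = [{"einstein", "born"}, {"ulm", "river"}]
evidence = par2_rag(hops, needed={"einstein", "ulm", "danube", "sea"})
print(sorted(evidence))
```

The key property the paper argues for is visible even in this stub: the breadth stage commits to nothing, so a weak first hop cannot lock the loop onto a low-recall trajectory.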
[335] AgentFixer: From Failure Detection to Fix Recommendations in LLM Agentic Systems
Hadar Mulian, Sergey Zeltyn, Ido Levy, Liane Galanti, Avi Yaeli, Segev Shlomov
Main category: cs.AI
TL;DR: A comprehensive validation framework for LLM-based agentic systems that diagnoses reliability failures, provides root-cause analysis, and enables self-improvement through interactive validation processes.
Details
Motivation: LLM-based agentic systems suffer from reliability failures across input handling, prompt design, and output generation, requiring systematic validation approaches to improve robustness and interpretability.
Method: Framework includes 15 failure-detection tools and 2 root-cause analysis modules combining rule-based checks with LLM-as-a-judge assessments. Applied to IBM CUGA on AppWorld and WebArena benchmarks, with interactive analysis feeding diagnostics into LLMs for self-reflection.
Result: Identified recurrent planner misalignments, schema violations, and brittle prompt dependencies. Refined prompting and coding strategies maintained benchmark results while enabling mid-sized models to achieve notable accuracy gains, narrowing the gap with frontier models.
Conclusion: The framework enables scalable quality assurance and adaptive validation in production agentic systems, offering foundation for more robust, interpretable, and self-improving agentic architectures through dialogue-driven validation processes.
Abstract: We introduce a comprehensive validation framework for LLM-based agentic systems that provides systematic diagnosis and improvement of reliability failures. The framework includes fifteen failure-detection tools and two root-cause analysis modules that jointly uncover weaknesses across input handling, prompt design, and output generation. It integrates lightweight rule-based checks with LLM-as-a-judge assessments to support structured incident detection, classification, and repair. We applied the framework to IBM CUGA, evaluating its performance on the AppWorld and WebArena benchmarks. The analysis revealed recurrent planner misalignments, schema violations, brittle prompt dependencies, and more. Based on these insights, we refined both prompting and coding strategies, maintaining CUGA’s benchmark results while enabling mid-sized models such as Llama 4 and Mistral Medium to achieve notable accuracy gains, substantially narrowing the gap with frontier models. Beyond quantitative validation, we conducted an exploratory study that fed the framework’s diagnostic outputs and agent description into an LLM for self-reflection and prioritization. This interactive analysis produced actionable insights on recurring failure patterns and focus areas for improvement, demonstrating how validation itself can evolve into an agentic, dialogue-driven process. These results show a path toward scalable, quality assurance, and adaptive validation in production agentic systems, offering a foundation for more robust, interpretable, and self-improving agentic architectures.
[336] GISTBench: Evaluating LLM User Understanding via Evidence-Based Interest Verification
Iordanis Fostiropoulos, Muhammad Rafay Azhar, Abdalaziz Sawwan, Boyu Fang, Yuchen Liu, Jiayi Liu, Hanchao Yu, Qi Guo, Jianyu Wang, Fei Liu, Xiangjun Fan
Main category: cs.AI
TL;DR: GISTBench is a benchmark for evaluating LLMs’ ability to understand user interests from interaction histories in recommendation systems, focusing on interest extraction and verification rather than item prediction accuracy.
Details
Motivation: Traditional recommendation system benchmarks focus on item prediction accuracy, but there's a need to evaluate how well LLMs can understand users from their interaction histories and extract meaningful interest profiles.
Method: Proposes GISTBench with two novel metric families: Interest Groundedness (IG) with precision/recall components to penalize hallucinations and reward coverage, and Interest Specificity (IS) to assess distinctiveness of user profiles. Uses synthetic dataset from real user interactions on a short-form video platform with implicit/explicit engagement signals and textual descriptions.
Result: Evaluated eight open-weight LLMs (7B to 120B parameters), revealing performance bottlenecks in current LLMs, particularly their limited ability to accurately count and attribute engagement signals across heterogeneous interaction types.
Conclusion: GISTBench provides a new evaluation framework for LLM-based user understanding in recommendation systems, highlighting current limitations in LLMs’ ability to process and interpret complex user interaction data.
Abstract: We introduce GISTBench, a benchmark for evaluating Large Language Models’ (LLMs) ability to understand users from their interaction histories in recommendation systems. Unlike traditional RecSys benchmarks that focus on item prediction accuracy, our benchmark evaluates how well LLMs can extract and verify user interests from engagement data. We propose two novel metric families: Interest Groundedness (IG), decomposed into precision and recall components to separately penalize hallucinated interest categories and reward coverage, and Interest Specificity (IS), which assesses the distinctiveness of verified LLM-predicted user profiles. We release a synthetic dataset constructed on real user interactions on a global short-form video platform. Our dataset contains both implicit and explicit engagement signals and rich textual descriptions. We validate our dataset fidelity against user surveys, and evaluate eight open-weight LLMs spanning 7B to 120B parameters. Our findings reveal performance bottlenecks in current LLMs, particularly their limited ability to accurately count and attribute engagement signals across heterogeneous interaction types.
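The precision/recall decomposition of Interest Groundedness can be sketched over sets of interest categories. The paper’s actual scoring may differ (e.g. verification against engagement evidence rather than exact set matching), and the categories below are invented.

```python
def interest_groundedness(predicted, ground_truth):
    """IG-style decomposition: precision penalizes hallucinated interest
    categories, recall rewards coverage of the user's true interests."""
    if not predicted:
        return 0.0, 0.0
    hits = predicted & ground_truth
    precision = len(hits) / len(predicted)
    recall = len(hits) / len(ground_truth)
    return precision, recall

# Invented example: one hallucinated category, one missed category.
pred = {"cooking", "travel", "crypto"}       # "crypto" is hallucinated
truth = {"cooking", "travel", "gardening"}   # "gardening" is missed
p, r = interest_groundedness(pred, truth)
print(f"IG precision={p:.2f} recall={r:.2f}")
```

Splitting the score this way keeps the two failure modes separate: a model that pads its profile with plausible-sounding interests loses precision, while a terse but accurate profile loses recall.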
[337] SciVisAgentBench: A Benchmark for Evaluating Scientific Data Analysis and Visualization Agents
Kuangshi Ai, Haichao Miao, Kaiyuan Tang, Nathaniel Gorski, Jianxin Sun, Guoxi Liu, Helgi I. Ingolfsson, David Lenz, Hanqi Guo, Hongfeng Yu, Teja Leburu, Michael Molash, Bei Wang, Tom Peterka, Chaoli Wang, Shusen Liu
Main category: cs.AI
TL;DR: SciVisAgentBench is a comprehensive benchmark for evaluating scientific visualization agents with 108 expert-crafted cases, multimodal evaluation pipeline, and initial baselines for agentic SciVis systems.
Details
Motivation: The community lacks principled, reproducible benchmarks for evaluating emerging scientific visualization agents in realistic, multi-step analysis settings, despite rapid advances in LLM-powered agentic systems.
Method: Created a structured taxonomy across four dimensions (application domain, data type, complexity level, visualization operation) with 108 expert-crafted cases. Developed multimodal evaluation combining LLM-based judging with deterministic evaluators (image metrics, code checkers, rule verifiers). Conducted validity study with 12 SciVis experts.
Result: Established initial baselines for representative SciVis agents and general-purpose coding agents, revealing capability gaps. Demonstrated benchmark extensibility and provided framework for systematic comparison and failure mode diagnosis.
Conclusion: SciVisAgentBench serves as a living benchmark to support systematic evaluation, drive progress in agentic scientific visualization, and enable reproducible assessment of LLM-powered visualization agents.
Abstract: Recent advances in large language models (LLMs) have enabled agentic systems that translate natural language intent into executable scientific visualization (SciVis) tasks. Despite rapid progress, the community lacks a principled and reproducible benchmark for evaluating these emerging SciVis agents in realistic, multi-step analysis settings. We present SciVisAgentBench, a comprehensive and extensible benchmark for evaluating scientific data analysis and visualization agents. Our benchmark is grounded in a structured taxonomy spanning four dimensions: application domain, data type, complexity level, and visualization operation. It currently comprises 108 expert-crafted cases covering diverse SciVis scenarios. To enable reliable assessment, we introduce a multimodal outcome-centric evaluation pipeline that combines LLM-based judging with deterministic evaluators, including image-based metrics, code checkers, rule-based verifiers, and case-specific evaluators. We also conduct a validity study with 12 SciVis experts to examine the agreement between human and LLM judges. Using this framework, we evaluate representative SciVis agents and general-purpose coding agents to establish initial baselines and reveal capability gaps. SciVisAgentBench is designed as a living benchmark to support systematic comparison, diagnose failure modes, and drive progress in agentic SciVis. The benchmark is available at https://scivisagentbench.github.io/.
[338] AI-Generated Compromises for Coalition Formation
Eyal Briman, Ehud Shapiro, Nimrod Talmon
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) when fetching paper 2506.06837, so no abstract or analysis is available for this entry.
[339] REFINE: Real-world Exploration of Interactive Feedback and Student Behaviour
Fares Fawzi, Seyed Parsa Neshaei, Marta Knezevic, Tanya Nazaretsky, Tanja Käser
Main category: cs.AI
TL;DR: REFINE is a multi-agent feedback system using small open-source LLMs to provide interactive, scalable formative feedback with pedagogical grounding, judge-guided regeneration, and tool-calling interactive support.
Details
Motivation: Providing timely, individualized formative feedback at scale is challenging. Existing LLM-based feedback systems treat feedback as static artifacts without supporting interpretation, clarification, or follow-up interactions.
Method: REFINE combines: 1) pedagogically-grounded feedback generation agent, 2) LLM-as-a-judge-guided regeneration loop with human-aligned judge, and 3) self-reflective tool-calling interactive agent for student follow-up questions with context-aware responses.
Result: Judge-guided regeneration significantly improves feedback quality. Interactive agent produces efficient, high-quality responses comparable to state-of-the-art closed-source models. Real student interactions show distinct engagement patterns and systematic steering of subsequent inquiry.
Conclusion: Multi-agent, tool-augmented feedback systems are feasible and effective for scalable, interactive feedback, demonstrating the value of treating feedback as an interactive process rather than static artifact.
Abstract: Formative feedback is central to effective learning, yet providing timely, individualised feedback at scale remains a persistent challenge. While recent work has explored the use of large language models (LLMs) to automate feedback, most existing systems still conceptualise feedback as a static, one-way artifact, offering limited support for interpretation, clarification, or follow-up. In this work, we introduce REFINE, a locally deployable, multi-agent feedback system built on small, open-source LLMs that treats feedback as an interactive process. REFINE combines a pedagogically-grounded feedback generation agent with an LLM-as-a-judge-guided regeneration loop using a human-aligned judge, and a self-reflective tool-calling interactive agent that supports student follow-up questions with context-aware, actionable responses. We evaluate REFINE through controlled experiments and an authentic classroom deployment in an undergraduate computer science course. Automatic evaluations show that judge-guided regeneration significantly improves feedback quality, and that the interactive agent produces efficient, high-quality responses comparable to a state-of-the-art closed-source model. Analysis of real student interactions further reveals distinct engagement patterns and indicates that system-generated feedback systematically steers subsequent student inquiry. Our findings demonstrate the feasibility and effectiveness of multi-agent, tool-augmented feedback systems for scalable, interactive feedback.
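REFINE’s judge-guided regeneration reduces to a generate/judge/regenerate cycle. The sketch below replaces both agents with trivial stubs (a real system would call the feedback LLM and the human-aligned judge model); the function names and the "actionability" criterion are illustrative, not from the paper.

```python
def generate_feedback(submission, critique=None):
    """Stub feedback-generation agent; a real system prompts an LLM,
    optionally conditioning on the judge's critique."""
    base = f"Feedback on {submission!r}: explain the off-by-one error"
    if critique:
        return base + " and suggest a concrete fix."
    return base + "."

def judge(feedback):
    """Stub human-aligned judge: approve only actionable feedback."""
    ok = "fix" in feedback
    critique = None if ok else "Feedback is not actionable; add a next step."
    return ok, critique

def refine_loop(submission, max_rounds=3):
    """Regenerate feedback until the judge approves or rounds run out."""
    critique = None
    feedback = ""
    for _ in range(max_rounds):
        feedback = generate_feedback(submission, critique)
        ok, critique = judge(feedback)
        if ok:
            return feedback
    return feedback  # best effort after max_rounds

result = refine_loop("loop.py")
print(result)
```

Bounding the loop with `max_rounds` matters in practice: a judge that never approves would otherwise stall the system, and the capped loop still returns the last (best-effort) draft.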
[340] Zero-Shot Coordination in Ad Hoc Teams with Generalized Policy Improvement and Difference Rewards
Rupal Nigam, Niket Parikh, Hamid Osooli, Mikihisa Yuasa, Jacob Heglund, Huy T. Tran
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited) when fetching paper 2510.16187, so no abstract or analysis is available for this entry.
[341] Knowledge database development by large language models for countermeasures against viruses and marine toxins
Hung N. Do, Jessica Z. Kubicek-Sutherland, S. Gnanakaran
Main category: cs.AI
TL;DR: LLMs used to create comprehensive databases of medical countermeasures for viruses and marine toxins through automated data collection, validation, and ranking workflows.
Details
Motivation: There is a lack of comprehensive databases for medical countermeasures against viruses and marine toxins, making treatment development slow and difficult. Current approaches lack scalability and updatability.
Method: Used ChatGPT and Grok LLMs to: 1) identify public databases with virus/marine toxin data, 2) collect information from databases and literature, 3) iteratively cross-validate collected data, 4) design interactive webpages, and 5) create agentic AI workflows (research and decision-making agents) to rank countermeasures.
Result: Created two comprehensive databases for five viruses (Lassa, Marburg, Ebola, Nipah, Venezuelan equine encephalitis) and marine toxins with interactive web interfaces. Implemented AI workflows for ranking countermeasures.
Conclusion: LLMs offer a scalable, updatable approach for building comprehensive knowledge databases and supporting evidence-based decision-making in medical countermeasure research.
Abstract: Access to the most up-to-date information on medical countermeasures is important for the research and development of effective treatments for viruses and marine toxins. However, there is a lack of comprehensive databases that curate data on viruses and marine toxins, making decisions on medical countermeasures slow and difficult. In this work, we employ two large language models (LLMs), ChatGPT and Grok, to design two comprehensive databases of therapeutic countermeasures for five viruses (Lassa, Marburg, Ebola, Nipah, and Venezuelan equine encephalitis), as well as marine toxins. With high-level human-provided inputs, the two LLMs identify public databases containing data on the five viruses and marine toxins, collect relevant information from these databases and the literature, iteratively cross-validate the collected information, and design interactive webpages for easy access to the curated, comprehensive databases. Notably, the ChatGPT LLM is employed to design agentic AI workflows (consisting of two AI agents for research and decision-making) to rank countermeasures for viruses and marine toxins in the databases. Together, our work explores the potential of LLMs as a scalable, updatable approach for building comprehensive knowledge databases and supporting evidence-based decision-making.
[342] SimMOF: AI agent for Automated MOF Simulations
Jaewoong Lee, Taeun Bae, Jihan Kim
Main category: cs.AI
TL;DR: SimMOF is an LLM-based multi-agent framework that automates end-to-end MOF simulation workflows from natural language queries, handling workflow construction, parameter selection, tool interoperability, and computational structure preparation.
Details
Motivation: MOF simulations are difficult to access due to the need for expert decisions in workflow construction, parameter selection, tool interoperability, and computational structure preparation, creating barriers for non-experts.
Method: SimMOF uses a large language model-based multi-agent framework that translates natural language queries into dependency-aware plans, generates runnable inputs, orchestrates multiple agents to execute simulations, and summarizes results aligned to user queries.
Result: Through representative case studies, SimMOF demonstrates adaptive and cognitively autonomous workflows that reflect the iterative and decision-driven behavior of human researchers, providing a scalable foundation for data-driven MOF research.
Conclusion: SimMOF successfully automates complex MOF simulation workflows, making computational simulations more accessible and providing a scalable framework for data-driven MOF research.
Abstract: Metal-organic frameworks (MOFs) offer a vast design space, and as such, computational simulations play a critical role in predicting their structural and physicochemical properties. However, MOF simulations remain difficult to access because reliable analysis requires expert decisions for workflow construction, parameter selection, tool interoperability, and the preparation of computation-ready structures. Here, we introduce SimMOF, a large language model-based multi-agent framework that automates end-to-end MOF simulation workflows from natural language queries. SimMOF translates user requests into dependency-aware plans, generates runnable inputs, orchestrates multiple agents to execute simulations, and summarizes results with analysis aligned to the user query. Through representative case studies, we show that SimMOF enables adaptive and cognitively autonomous workflows that reflect the iterative and decision-driven behavior of human researchers and, as such, provides a scalable foundation for data-driven MOF research.
[343] Evaluation of Generative Models for Emotional 3D Animation Generation in VR
Kiran Chhatre, Renan Guarese, Andrii Matviienko, Christopher Peters
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for this paper (2512.16081) was rate-limited (HTTP 429). See arXiv:2512.16081 for the abstract.
[344] Webscraper: Leverage Multimodal Large Language Models for Index-Content Web Scraping
Guan-Lun Huang, Yuh-Jzer Joung
Main category: cs.AI
TL;DR: Webscraper is a framework using Multimodal LLMs to autonomously navigate and extract data from dynamic, interactive websites, outperforming baseline methods on news and e-commerce sites.
Details
Motivation: Traditional web scraping methods struggle with modern dynamic websites that require interaction beyond static HTML parsing, often being brittle and requiring manual customization for each site.
Method: Uses a Multimodal Large Language Model (MLLM) with a structured five-stage prompting procedure and custom-built tools to navigate interactive interfaces and extract data from websites following the “index-and-content” architecture.
Result: Experiments on six news websites show Webscraper achieves significant improvement in extraction accuracy over baseline agent Anthropic’s Computer Use, and generalizes well to e-commerce platforms.
Conclusion: Webscraper effectively addresses challenges of modern dynamic web scraping by leveraging MLLMs for autonomous navigation and structured data extraction from interactive websites.
Abstract: Modern web scraping struggles with dynamic, interactive websites that require more than static HTML parsing. Current methods are often brittle and require manual customization for each site. To address this, we introduce Webscraper, a framework designed to handle the challenges of modern, dynamic web applications. It leverages a Multimodal Large Language Model (MLLM) to autonomously navigate interactive interfaces, invoke specialized tools, and perform structured data extraction in environments where traditional scrapers are ineffective. Webscraper utilizes a structured five-stage prompting procedure and a set of custom-built tools to navigate and extract data from websites following the common “index-and-content” architecture. Our experiments, conducted on six news websites, demonstrate that the full Webscraper framework, equipped with both our guiding prompt and specialized tools, achieves a significant improvement in extraction accuracy over the baseline agent Anthropic’s Computer Use. We also applied the framework to e-commerce platforms to validate its generalizability.
[345] Mitigating “Epistemic Debt” in Generative AI-Scaffolded Novice Programming using Metacognitive Scripts
Sreecharan Sankaranarayanan
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for this paper (2602.20206) was rate-limited (HTTP 429). See arXiv:2602.20206 for the abstract.
[346] AEC-Bench: A Multimodal Benchmark for Agentic Systems in Architecture, Engineering, and Construction
Harsh Mankodiya, Chase Gallik, Theodoros Galanos, Andriy Mulyar
Main category: cs.AI
TL;DR: AEC-Bench is a multimodal benchmark for evaluating AI agents on real-world Architecture, Engineering, and Construction tasks involving drawing understanding, cross-sheet reasoning, and project coordination.
Details
Motivation: There's a need for specialized benchmarks to evaluate agentic systems in the AEC domain, which requires multimodal understanding of drawings, cross-sheet reasoning, and project-level coordination that general benchmarks don't address.
Method: Created a benchmark dataset with tasks requiring drawing understanding, cross-sheet reasoning, and construction coordination. Evaluated several domain-specific foundation model harnesses, including Claude Code and Codex, identifying consistent tools and harness design techniques.
Result: The benchmark reveals tools and harness design techniques that uniformly improve performance across foundation models. The dataset, agent harness, and evaluation code are openly released under Apache 2 license.
Conclusion: AEC-Bench provides a valuable multimodal benchmark for evaluating agentic systems in the AEC domain, identifying effective design patterns and enabling reproducible research through open release of all components.
Abstract: AEC-Bench is a multimodal benchmark for evaluating agentic systems on real-world tasks in the Architecture, Engineering, and Construction (AEC) domain. The benchmark covers tasks requiring drawing understanding, cross-sheet reasoning, and construction project-level coordination. This report describes the benchmark motivation, dataset taxonomy, evaluation protocol, and baseline results across several domain-specific foundation model harnesses. We use AEC-Bench to identify consistent tools and harness design techniques that uniformly improve performance across foundation models in their own base harnesses, such as Claude Code and Codex. We openly release our benchmark dataset, agent harness, and evaluation code for full replicability at https://github.com/nomic-ai/aec-bench under an Apache 2 license.
[347] Route-Induced Density and Stability (RIDE): Controlled Intervention and Mechanism Analysis of Routing-Style Meta Prompts on LLM Internal States
Dianxing Zhang, Gang Li, Sheng Li
Main category: cs.AI
TL;DR: Meta prompts as routing proxies don’t increase sparsity but densify representations; natural language instructions outperform structured tags; sparsity-certainty hypothesis not consistently supported across models.
Details
Motivation: To test the common belief that routing to task experts activates sparser computation and yields more certain outputs (the Sparsity-Certainty Hypothesis) by using meta prompts as textual proxies for routing signals.
Method: Inject routing-style meta prompts in front of frozen instruction-tuned LLMs; quantify (C1) internal density via activation sparsity, (C2) domain-keyword attention, and (C3) output stability via predictive entropy and semantic variation; evaluate on a RouterEval subset with three models (Qwen3-8B, Llama-3.1-8B-Instruct, Mistral-7B-Instruct-v0.2).
Result: Meta prompts consistently densify early/middle-layer representations rather than increasing sparsity; natural-language expert instructions outperform structured tags; attention responses vary across models (Qwen/Llama reduce keyword attention, Mistral reinforces it); densification-stability link is weak and only appears in Qwen.
Conclusion: The Sparsity-Certainty Hypothesis is not consistently supported; routing design should consider model-specific behaviors; natural language instructions can be more effective than structured tags; RIDE serves as a diagnostic probe for routing design and uncertainty estimation.
Abstract: Routing is widely used to scale large language models, from Mixture-of-Experts gating to multi-model/tool selection. A common belief is that routing to a task “expert” activates sparser internal computation and thus yields more certain and stable outputs (the Sparsity–Certainty Hypothesis). We test this belief by injecting routing-style meta prompts as a textual proxy for routing signals in front of frozen instruction-tuned LLMs. We quantify (C1) internal density via activation sparsity, (C2) domain-keyword attention, and (C3) output stability via predictive entropy and semantic variation. On a RouterEval subset with three instruction-tuned models (Qwen3-8B, Llama-3.1-8B-Instruct, and Mistral-7B-Instruct-v0.2), meta prompts consistently densify early/middle-layer representations rather than increasing sparsity; natural-language expert instructions are often stronger than structured tags. Attention responses are heterogeneous: Qwen/Llama reduce keyword attention, while Mistral reinforces it. Finally, the densification–stability link is weak and appears only in Qwen, with near-zero correlations in Llama and Mistral. We present RIDE as a diagnostic probe for calibrating routing design and uncertainty estimation.
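Two of the paper's probes, activation sparsity (C1) and predictive entropy (C3), are standard quantities that can be sketched in a few lines. The function names, the near-zero threshold, and the toy numbers below are illustrative assumptions, not the paper's implementation:

```python
import math

def activation_sparsity(activations, eps=1e-3):
    """C1-style probe: fraction of near-zero activations (higher = sparser)."""
    return sum(1 for a in activations if abs(a) < eps) / len(activations)

def predictive_entropy(probs):
    """C3-style probe: Shannon entropy (nats) of an output distribution
    (lower = more certain prediction)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Under the Sparsity-Certainty Hypothesis, a routed "expert" prompt should
# raise sparsity and lower entropy; the paper finds densification instead.
print(activation_sparsity([0.0, 0.5, 0.0004, -2.1]))   # 0.5
print(predictive_entropy([0.9, 0.05, 0.03, 0.02]))     # peaked: low entropy
print(predictive_entropy([0.25, 0.25, 0.25, 0.25]))    # flat: log(4) ≈ 1.386
```

Comparing these probes before and after prepending a meta prompt is the basic shape of the controlled intervention the abstract describes.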
[348] Xuanwu: Evolving General Multimodal Models into an Industrial-Grade Foundation for Content Ecosystems
Zhiqian Zhang, Xu Zhao, Xiaoqing Xu, Guangdong Liang, Weijia Wang, Xiaolei Lv, Bo Li, Jun Gao
Main category: cs.AI
TL;DR: Xuanwu VL-2B is a 2B-parameter multimodal model optimized for industrial content moderation, balancing fine-grained visual perception, language alignment, and deployment cost through a three-stage training pipeline.
Details
Motivation: Current multimodal models struggle with degraded generalization and catastrophic forgetting in real-world content moderation and adversarial settings due to limited fine-grained visual perception and insufficient modeling of long-tail noise.
Method: Uses a compact InternViT-300M + MLP + Qwen3 1.7B architecture; employs a data iteration/curation mechanism and a three-stage progressive training pipeline (pre-training, mid-training, post-training) to balance business specialization with general capability retention.
Result: Achieves 67.90 average score on OpenCompass multimodal metrics (vs. 64.27 for InternVL 3.5 2B), 94.38% average recall on business moderation tasks, and 82.82% weighted recall on adversarial OCR scenarios (outperforming Gemini-2.5-Pro’s 76.72%).
Conclusion: Xuanwu VL-2B demonstrates practical balance among business alignment, visual perception, general capability retention, and deployment cost under limited parameter budget, making it suitable for industrial content ecosystem applications.
Abstract: In recent years, multimodal large models have continued to improve on general benchmarks. However, in real-world content moderation and adversarial settings, mainstream models still suffer from degraded generalization and catastrophic forgetting because of limited fine-grained visual perception and insufficient modeling of long-tail noise. In this paper, we present Xuanwu VL-2B as a case study of how general multimodal models can be developed into an industrial-grade foundation model for content ecosystems. The model adopts a compact InternViT-300M + MLP + Qwen3 1.7B architecture, balancing fine-grained visual perception, language-semantic alignment, and deployment cost within an approximately 2B-parameter budget. To balance business specialization with the retention of general capabilities, we developed a data iteration and curation mechanism and trained the model through a progressive three-stage pipeline: pre-training, mid-training, and post-training. Ablation studies and offline business evaluations show that Xuanwu VL-2B achieves an average score of 67.90 across seven OpenCompass multimodal metrics (vs. 64.27 for InternVL 3.5 2B), an average recall of 94.38% over seven independent business moderation tasks, and a weighted overall recall of 82.82% on policy-violating text in challenging adversarial OCR scenarios, outperforming Gemini-2.5-Pro (76.72%). These results show that, under a limited parameter budget, Xuanwu VL-2B achieves a practical balance among business alignment, visual perception, general capability retention, and deployment cost.
[349] Beyond pass@1: A Reliability Science Framework for Long-Horizon LLM Agents
Aaditya Khanal, Yangyang Tao, Junxiu Zhou
Main category: cs.AI
TL;DR: Paper introduces reliability science framework for long-horizon LLM agents with four metrics to measure consistent performance across repeated attempts on tasks of varying duration, showing systematic divergence from capability metrics as task duration grows.
Details
Motivation: Existing benchmarks measure capability (single-attempt success) but production deployments require reliability (consistent success across repeated attempts). The paper shows these properties diverge systematically as task duration grows, and current metrics like pass@1 on short tasks are structurally blind to this divergence.
Method: Introduces a reliability science framework with four metrics: Reliability Decay Curve (RDC), Variance Amplification Factor (VAF), Graceful Degradation Score (GDS), and Meltdown Onset Point (MOP). Evaluates 10 models across 23,392 episodes on a 396-task benchmark spanning four duration buckets and three domains.
Result: Key findings: (1) reliability decay is domain-stratified; (2) VAF bifurcates by capability tier; (3) capability and reliability rankings diverge substantially; (4) frontier models have highest meltdown rates; (5) memory scaffolds universally hurt long-horizon performance across all models.
Conclusion: Reliability should be a first-class evaluation dimension alongside capability, as the systematic divergence between capability and reliability metrics reveals important limitations in current evaluation approaches for long-horizon tasks.
Abstract: Existing benchmarks measure capability – whether a model succeeds on a single attempt – but production deployments require reliability – consistent success across repeated attempts on tasks of varying duration. We show these properties diverge systematically as task duration grows, and that pass@1 on short tasks is structurally blind to this divergence. We introduce a reliability science framework for long-horizon LLM agents with four metrics: Reliability Decay Curve (RDC), Variance Amplification Factor (VAF), Graceful Degradation Score (GDS), and Meltdown Onset Point (MOP). We evaluate 10 models across 23,392 episodes on a 396-task benchmark spanning four duration buckets and three domains. Key findings: (1) reliability decay is domain-stratified – SE GDS drops from 0.90 to 0.44 while document processing is nearly flat (0.74 to 0.71); (2) VAF bifurcates by capability tier – high VAF is a capability signature, not an instability signal; (3) capability and reliability rankings diverge substantially, with multi-rank inversions at long horizons; (4) frontier models have the highest meltdown rates (up to 19%) because they attempt ambitious multi-step strategies that sometimes spiral; and (5) memory scaffolds universally hurt long-horizon performance across all 10 models. These results motivate reliability as a first-class evaluation dimension alongside capability.
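The capability/reliability divergence the abstract describes can be illustrated numerically. The abstract does not give closed-form definitions of RDC or the other metrics, so the sketch below uses hedged stand-ins: capability as pass@1 (mean single-attempt success) versus reliability as the fraction of tasks where every repeated attempt succeeds:

```python
import numpy as np

def pass_at_1(successes: np.ndarray) -> float:
    """Capability: mean success rate over all (task, attempt) pairs."""
    return float(successes.mean())

def all_attempts_pass(successes: np.ndarray) -> float:
    """Reliability proxy: fraction of tasks where every attempt succeeds."""
    return float(successes.all(axis=1).mean())

# Simulate attempts whose per-attempt success probability decays with task
# duration; the gap between the two metrics widens as duration grows.
rng = np.random.default_rng(0)
for bucket, p in [("short", 0.95), ("medium", 0.85), ("long", 0.70)]:
    s = rng.random((1000, 5)) < p  # 1000 tasks x 5 repeated attempts
    print(f"{bucket:6s}  pass@1={pass_at_1(s):.2f}  all-pass={all_attempts_pass(s):.2f}")
```

For independent attempts the all-pass rate is roughly p^5 (0.95^5 ≈ 0.77 versus 0.70^5 ≈ 0.17), which is one way to see why pass@1 on short tasks is "structurally blind" to long-horizon reliability.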
[350] Grokking From Abstraction to Intelligence
Junjie Zhang, Zhen Shen, Gang Xiong, Xisong Dong
Main category: cs.AI
TL;DR: The paper investigates grokking in modular arithmetic as a key experimental domain for understanding model generalization mechanisms, proposing that grokking arises from spontaneous structural simplification driven by parsimony principles.
Details
Motivation: Grokking in modular arithmetic serves as a critical testbed for studying model generalization, but existing research focuses too narrowly on local circuits or optimization tuning, missing the global structural evolution that drives this phenomenon.
Method: The authors integrate causal, spectral, and algorithmic complexity measures with Singular Learning Theory to analyze how grokking emerges from structural simplification and information compression.
Result: The transition from memorization to generalization corresponds to the physical collapse of redundant manifolds and deep information compression, revealing the structural evolution underlying grokking.
Conclusion: Grokking originates from spontaneous simplification of internal model structures governed by parsimony principles, offering a novel perspective for understanding model overfitting and generalization mechanisms.
Abstract: Grokking in modular arithmetic has established itself as the quintessential fruit fly experiment, serving as a critical domain for investigating the mechanistic origins of model generalization. Despite its significance, existing research remains narrowly focused on specific local circuits or optimization tuning, largely overlooking the global structural evolution that fundamentally drives this phenomenon. We propose that grokking originates from a spontaneous simplification of internal model structures governed by the principle of parsimony. We integrate causal, spectral, and algorithmic complexity measures alongside Singular Learning Theory to reveal that the transition from memorization to generalization corresponds to the physical collapse of redundant manifolds and deep information compression, offering a novel perspective for understanding the mechanisms of model overfitting and generalization.
[351] PSPA-Bench: A Personalized Benchmark for Smartphone GUI Agent
Hongyi Nie, Xunyuan Liu, Yudong Bai, Yaqing Wang, Yang Liu, Quanming Yao, Zhen Wang
Main category: cs.AI
TL;DR: PSPA-Bench is a benchmark for evaluating personalization in smartphone GUI agents, featuring 12,855 personalized instructions across 10 scenarios and 22 apps, with structure-aware evaluation metrics.
Details
Motivation: Real-world smartphone use is highly personalized with diverse user workflows and preferences, but existing GUI agent benchmarks lack user-specific data and fine-grained evaluation metrics to capture personalization.
Method: Created PSPA-Bench with personalized instructions aligned with real-world user behaviors, and introduced a structure-aware process evaluation method for fine-grained measurement of personalized capabilities.
Result: Benchmarked 11 state-of-the-art GUI agents, revealing poor performance in personalized settings with limited success even for strongest agents. Found reasoning-oriented models outperform general LLMs, perception is critical, and reflection/long-term memory are key for adaptation.
Conclusion: PSPA-Bench establishes foundation for systematic study of personalized GUI agents, highlighting need for improved personalization capabilities and identifying key research directions for advancement.
Abstract: Smartphone GUI agents execute tasks by operating directly on app interfaces, offering a path to broad capability without deep system integration. However, real-world smartphone use is highly personalized: users adopt diverse workflows and preferences, challenging agents to deliver customized assistance rather than generic solutions. Existing GUI agent benchmarks cannot adequately capture this personalization dimension due to sparse user-specific data and the lack of fine-grained evaluation metrics. To address this gap, we present PSPA-Bench, a benchmark dedicated to evaluating personalization in smartphone GUI agents. PSPA-Bench comprises over 12,855 personalized instructions aligned with real-world user behaviors across 10 representative daily-use scenarios and 22 mobile apps, and introduces a structure-aware process evaluation method that measures agents’ personalized capabilities at a fine-grained level. Through PSPA-Bench, we benchmark 11 state-of-the-art GUI agents. Results reveal that current methods perform poorly under personalized settings, with even the strongest agent achieving limited success. Our analysis further highlights three directions for advancing personalized GUI agents: (1) reasoning-oriented models consistently outperform general LLMs, (2) perception remains a simple yet critical capability, and (3) reflection and long-term memory mechanisms are key to improving adaptation. Together, these findings establish PSPA-Bench as a foundation for systematic study and future progress in personalized GUI agents.
[352] Nomad: Autonomous Exploration and Discovery
Bokang Jia, Samta Kamboj, Satheesh Katipomu, Seung Hun Han, Neha Sengupta, Andrew Jackson
Main category: cs.AI
TL;DR: Nomad is an autonomous system for data exploration and insight discovery that builds exploration maps over data domains and systematically traverses them to generate diverse insights beyond human-framed questions.
Details
Motivation: Current query-driven and prompt-driven systems are limited by human framing, failing to explore the broader insight space. Users often don't know all possible questions, hypotheses, or connections that could be explored in a dataset.
Method: Nomad uses an exploration-first architecture that constructs explicit Exploration Maps over domains and systematically traverses them. It generates and selects hypotheses, investigates them using document search, web search, and database tools, verifies insights with an independent verifier, and produces cited reports and meta-reports.
Result: Evaluation using UN and WHO reports shows Nomad produces more trustworthy, higher-quality reports than baselines, while generating more diverse insights across multiple runs.
Conclusion: Nomad represents progress toward autonomous systems that can discover which questions and insights are worth exploring, moving beyond just answering user questions to autonomous discovery.
Abstract: We introduce Nomad, a system for autonomous data exploration and insight discovery. Given a corpus of documents, databases, or other data sources, users rarely know the full set of questions, hypotheses, or connections that could be explored. As a result, query-driven question answering and prompt-driven deep-research systems remain limited by human framing and often fail to cover the broader insight space. Nomad addresses this problem with an exploration-first architecture. It constructs an explicit Exploration Map over the domain and systematically traverses it to balance breadth and depth. It generates and selects hypotheses and investigates them with an explorer agent that can use document search, web search, and database tools. Candidate insights are then checked by an independent verifier before entering a reporting pipeline that produces cited reports and higher-level meta-reports. We also present a comprehensive evaluation framework for autonomous discovery systems that measures trustworthiness, report quality, and diversity. Using a corpus of selected UN and WHO reports, we show that Nomad produces more trustworthy and higher-quality reports than baselines, while also producing more diverse insights over several runs. Nomad is a step toward autonomous systems that not only answer user questions or conduct directed research, but also discover which questions, research directions, and insights are worth surfacing in the first place.
[353] BenchScope: How Many Independent Signals Does Your Benchmark Provide?
Tommy Sha, Stella Zhao
Main category: cs.AI
TL;DR: Effective Dimensionality (ED) is introduced as a diagnostic metric to measure the independent information content in AI benchmark scores, revealing substantial redundancy across evaluation suites.
Details
Motivation: Current AI evaluation suites often report many scores without verifying whether those scores carry independent information, leading to potential redundancy and inefficient measurement of model capabilities.
Method: Introduces Effective Dimensionality (ED) as the participation ratio of a centered benchmark-score spectrum, applied to 22 benchmarks across 8 domains with over 8,400 model evaluations to analyze measurement breadth and redundancy.
Result: ED reveals significant redundancy: the six-score Open LLM Leaderboard behaves like roughly two effective measurement axes (ED = 1.7), BBH and MMLU-Pro are near-interchangeable (rho = 0.96), and measurement breadth varies more than 20x across current benchmarks.
Conclusion: ED serves as a screening statistic for benchmark redundancy, with stable relative rankings under matched-dimension controls, and can flag redundant suite components, monitor performance-conditional compression, and guide benchmark maintenance.
Abstract: AI evaluation suites often report many scores without checking whether those scores carry independent information. We introduce Effective Dimensionality (ED), the participation ratio of a centered benchmark-score spectrum, as a fast, population-conditional upper-bound diagnostic of measurement breadth. Applied at per-instance granularity to 22 benchmarks across 8 domains and more than 8,400 model evaluations, ED reveals substantial redundancy: the six-score Open LLM Leaderboard behaves like roughly two effective measurement axes (ED = 1.7), BBH and MMLU-Pro are near-interchangeable (rho = 0.96, stable across seven subpopulations), and measurement breadth varies more than 20x across current benchmarks. We show that relative ED rankings are stable under matched-dimension controls and that ED can flag redundant suite components, monitor performance-conditional compression, and guide benchmark maintenance. Because binary spectra overestimate absolute latent dimensionality, we interpret ED as a screening statistic rather than a literal factor count and complement it with null, reliability, and saturation analyses. We provide a 22-benchmark reference atlas and a four-step diagnostic workflow that benchmark maintainers can run with a score matrix and a few lines of code.
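The participation ratio the abstract defines is a standard spectral quantity and can be computed directly from a score matrix. A minimal sketch follows; the synthetic scores are an illustrative assumption, not data from the paper:

```python
import numpy as np

def effective_dimensionality(scores: np.ndarray) -> float:
    """Participation ratio of the centered benchmark-score spectrum:
    ED = (sum of eigenvalues)^2 / (sum of squared eigenvalues),
    computed on the covariance of a (models x benchmarks) score matrix."""
    centered = scores - scores.mean(axis=0)            # center each benchmark
    eigvals = np.linalg.eigvalsh(np.cov(centered, rowvar=False))
    eigvals = np.clip(eigvals, 0.0, None)              # guard tiny round-off negatives
    return float(eigvals.sum() ** 2 / (eigvals ** 2).sum())

# Three reported scores, but one is an exact copy of another:
# ED falls well below 3, flagging the redundancy.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a.copy()              # fully redundant benchmark
c = rng.normal(size=500)  # independent benchmark
print(effective_dimensionality(np.stack([a, b, c], axis=1)))  # ~1.8, not 3
```

This is the sense in which the six-score Open LLM Leaderboard can "behave like roughly two effective measurement axes": correlated columns collapse the spectrum onto a few dominant eigenvalues.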
[354] Rigorous Explanations for Tree Ensembles
Yacine Izza, Alexey Ignatiev, Xuanxiang Huang, Peter J. Stuckey, Joao Marques-Silva
Main category: cs.AI
TL;DR: The paper investigates computation of rigorous, logically-sound explanations for tree ensembles (random forests and boosted trees) to build trust in their predictions.
Details
Motivation: Tree ensembles are highly accurate but inscrutable to human decision makers, creating a need for trustworthy explanations that truly reflect the underlying predictor's properties.
Method: The paper investigates computation methods for rigorously-defined, logically-sound explanations, specifically for random forests and boosted trees.
Result: Not specified in the abstract, but the paper presumably presents methods for computing rigorous explanations for tree ensemble predictions.
Conclusion: Rigorous explanations are essential for building trust in tree ensemble predictions, and the paper addresses this need for random forests and boosted trees.
Abstract: Tree ensembles (TEs) find a multitude of practical applications. They represent one of the most general and accurate classes of machine learning methods. While they are typically quite concise in representation, their operation remains inscrutable to human decision makers. One solution to build trust in the operation of TEs is to automatically identify explanations for the predictions made. Evidently, we can only achieve trust using explanations if those explanations are rigorous, that is, if they truly reflect properties of the underlying predictor they explain. This paper investigates the computation of rigorously-defined, logically-sound explanations for the concrete case of two well-known examples of tree ensembles, namely random forests and boosted trees.
[355] AI-Generated Prior Authorization Letters: Strong Clinical Content, Weak Administrative Scaffolding
Moiz Sadiq Awan, Maryam Raza
Main category: cs.AI
TL;DR: LLMs can generate clinically adequate prior authorization letters but lack administrative precision needed for payer workflows.
Details
Motivation: Prior authorization is a burdensome administrative process in healthcare, and while LLMs show promise for clinical text tasks, their ability to produce submission-ready prior authorization letters hasn't been systematically evaluated beyond single-case demonstrations.
Method: Evaluated three commercial LLMs (GPT-4o, Claude Sonnet 4.5, Gemini 2.5 Pro) across 45 physician-validated synthetic scenarios spanning multiple medical specialties, with secondary analysis of real-world administrative requirements.
Result: All three models generated letters with strong clinical content (accurate diagnoses, well-structured medical necessity arguments, thorough step therapy documentation), but consistently missed administrative requirements like billing codes, authorization duration requests, and follow-up plans.
Conclusion: The main challenge for clinical deployment isn’t whether LLMs can write clinically adequate letters, but whether surrounding systems can provide the administrative precision required by payer workflows.
Abstract: Prior authorization remains one of the most burdensome administrative processes in U.S. healthcare, consuming billions of dollars and thousands of physician hours each year. While large language models have shown promise across clinical text tasks, their ability to produce submission-ready prior authorization letters has received only limited attention, with existing work confined to single-case demonstrations rather than structured multi-scenario evaluation. We assessed three commercially available LLMs (GPT-4o, Claude Sonnet 4.5, and Gemini 2.5 Pro) across 45 physician-validated synthetic scenarios spanning rheumatology, psychiatry, oncology, cardiology, and orthopedics. All three models generated letters with strong clinical content: accurate diagnoses, well-structured medical necessity arguments, and thorough step therapy documentation. However, a secondary analysis of real-world administrative requirements revealed consistent gaps that clinical scoring alone did not capture, including absent billing codes, missing authorization duration requests, and inadequate follow-up plans. These findings reframe the question: the challenge for clinical deployment is not whether LLMs can write clinically adequate letters, but whether the systems built around them can supply the administrative precision that payer workflows require.
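The administrative gaps the study reports (missing billing codes, duration requests, follow-up plans) are the kind of deterministic omissions a post-generation check could flag before submission. A minimal sketch; the checklist items and regex patterns are illustrative assumptions, not the paper's scoring rubric:

```python
import re

# Hypothetical administrative checklist mirroring the gaps the study
# found; patterns are illustrative, not the paper's evaluation criteria.
ADMIN_CHECKS = {
    "billing_code": re.compile(r"\b(?:CPT|HCPCS|ICD-10)[:\s]*[A-Z]?\d{2,5}"),
    "auth_duration": re.compile(r"\b\d+\s*(?:months?|weeks?)\b", re.IGNORECASE),
    "follow_up": re.compile(r"\bfollow[- ]up\b", re.IGNORECASE),
}

def audit_letter(text: str) -> list:
    """Return the names of administrative elements missing from a letter."""
    return [name for name, pat in ADMIN_CHECKS.items() if not pat.search(text)]
```

This illustrates the paper's framing: the clinical prose can be left to the LLM while a surrounding system supplies (or at least audits for) the administrative precision payers require.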
[356] ELT-Bench-Verified: Benchmark Quality Issues Underestimate AI Agent Capabilities
Christopher Zanoli, Andrea Giovannini, Tengjun Jin, Ana Klimovic, Yotam Perlitz
Main category: cs.AI
TL;DR: AI agents' capabilities for ELT pipeline construction were initially underestimated due to weaker models and benchmark errors; after upgrading the models and fixing the benchmark, performance improves significantly.
Details
Motivation: ELT pipeline construction is labor-intensive and high-impact for AI automation, but initial benchmarks showed low success rates, suggesting AI agents lacked practical utility.
Method: Re-evaluated ELT-Bench with upgraded LLMs, developed Auditor-Corrector methodology combining LLM-driven root-cause analysis with human validation, and created ELT-Bench-Verified with refined evaluation logic and corrected ground truth.
Result: Extraction and loading stages are largely solved, transformation performance improves significantly, and benchmark correction alone yields substantial performance gains (most failed tasks contained benchmark-attributable errors).
Conclusion: Both rapid model improvement and benchmark quality issues contributed to underestimating agent capabilities; systematic quality auditing should be standard for complex agentic tasks.
Abstract: Constructing Extract-Load-Transform (ELT) pipelines is a labor-intensive data engineering task and a high-impact target for AI automation. On ELT-Bench, the first benchmark for end-to-end ELT pipeline construction, AI agents initially showed low success rates, suggesting they lacked practical utility. We revisit these results and identify two factors causing a substantial underestimation of agent capabilities. First, re-evaluating ELT-Bench with upgraded large language models reveals that the extraction and loading stage is largely solved, while transformation performance improves significantly. Second, we develop an Auditor-Corrector methodology that combines scalable LLM-driven root-cause analysis with rigorous human validation (inter-annotator agreement Fleiss’ kappa = 0.85) to audit benchmark quality. Applying this to ELT-Bench uncovers that most failed transformation tasks contain benchmark-attributable errors – including rigid evaluation scripts, ambiguous specifications, and incorrect ground truth – that penalize correct agent outputs. Based on these findings, we construct ELT-Bench-Verified, a revised benchmark with refined evaluation logic and corrected ground truth. Re-evaluating on this version yields significant improvement attributable entirely to benchmark correction. Our results show that both rapid model improvement and benchmark quality issues contributed to underestimating agent capabilities. More broadly, our findings echo observations of pervasive annotation errors in text-to-SQL benchmarks, suggesting quality issues are systemic in data engineering evaluation. Systematic quality auditing should be standard practice for complex agentic tasks. We release ELT-Bench-Verified to provide a more reliable foundation for progress in AI-driven data engineering automation.
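The human-validation figure quoted above (Fleiss' kappa = 0.85) comes from a standard agreement statistic that is easy to reproduce. A self-contained sketch of the textbook formula (not the paper's code):

```python
def fleiss_kappa(ratings) -> float:
    """Fleiss' kappa for multi-rater categorical agreement.

    ratings[i][j] = number of raters assigning item i to category j.
    Every row must sum to the same number of raters r.
    """
    n_items = len(ratings)
    r = sum(ratings[0])
    k = len(ratings[0])
    # Mean per-item agreement P_bar.
    p_bar = sum(
        (sum(c * c for c in row) - r) / (r * (r - 1)) for row in ratings
    ) / n_items
    # Chance agreement P_e from the marginal category proportions.
    totals = [sum(row[j] for row in ratings) for j in range(k)]
    p_e = sum((t / (n_items * r)) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)
```

Perfect agreement yields kappa = 1; values around 0.85, as reported, indicate near-perfect agreement under the usual Landis-Koch interpretation.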
[357] Structural Compactness as a Complementary Criterion for Explanation Quality
Mohammad Mahdi Mesgari, Jackie Ma, Wojciech Samek, Sebastian Lapuschkin, Leander Weber
Main category: cs.AI
TL;DR: MST-C is a graph-based metric for evaluating attribution quality by measuring compactness through spread and cohesion of salient points.
Details
Motivation: Current quantitative assessment of explanation legibility is difficult because simple statistics don't capture varying shapes and internal organization of attributions. There's a need for metrics that capture higher-order geometric properties.
Method: Introduces Minimum Spanning Tree Compactness (MST-C), a graph-based structural metric that captures spread and cohesion of attributions. It combines these components into a single score favoring attributions with salient points spread across small areas organized into few cohesive clusters.
Result: MST-C reliably distinguishes between explanation methods, exposes fundamental structural differences between models, and provides a robust, self-contained diagnostic for explanation compactness.
Conclusion: MST-C complements existing notions of attribution complexity by providing a structural metric for evaluating explanation compactness based on geometric properties.
Abstract: In the evaluation of attribution quality, the quantitative assessment of explanation legibility is particularly difficult, as it is influenced by varying shapes and internal organization of attributions not captured by simple statistics. To address this issue, we introduce Minimum Spanning Tree Compactness (MST-C), a graph-based structural metric that captures higher-order geometric properties of attributions, such as spread and cohesion. These components are combined into a single score that evaluates compactness, favoring attributions with salient points spread across a small area and spatially organized into few but cohesive clusters. We show that MST-C reliably distinguishes between explanation methods, exposes fundamental structural differences between models, and provides a robust, self-contained diagnostic for explanation compactness that complements existing notions of attribution complexity.
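The abstract does not give the exact MST-C formula, but the ingredients (an MST over salient points, a cohesion term from edge lengths, a spread term from occupied area) can be sketched as follows. This is a toy stand-in under stated assumptions, not the paper's metric:

```python
import math

def mst_compactness(points, image_area: float) -> float:
    """Toy compactness score over salient attribution points.

    A sketch in the spirit of MST-C: build the Euclidean MST with
    Prim's algorithm, then combine cohesion (short MST edges) with
    spread (small bounding-box footprint). Higher = more compact.
    The exact combination is an assumption, not the paper's formula.
    """
    n = len(points)
    if n < 2:
        return 1.0
    # Prim's algorithm over the complete Euclidean graph.
    best = {i: math.dist(points[0], points[i]) for i in range(1, n)}
    mst_edges = []
    while best:
        nxt = min(best, key=best.get)
        mst_edges.append(best.pop(nxt))
        for i in best:
            best[i] = min(best[i], math.dist(points[nxt], points[i]))
    cohesion = 1.0 / (1.0 + sum(mst_edges) / len(mst_edges))  # short edges -> high
    xs, ys = zip(*points)
    bbox = (max(xs) - min(xs)) * (max(ys) - min(ys))
    spread = 1.0 - min(bbox / image_area, 1.0)  # small footprint -> high
    return cohesion * spread
```

A tight cluster of salient pixels scores higher than the same number of points scattered across the image, matching the metric's stated preference for few, cohesive clusters.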
[358] Metriplector: From Field Theory to Neural Architecture
Dan Oprisa, Peter Toth
Main category: cs.AI
TL;DR: Metriplector is a novel neural architecture primitive that configures abstract physical systems (fields, sources, operators) where the system’s dynamics perform computation, using coupled metriplectic dynamics and stress-energy tensor readouts.
Details
Motivation: To create a unified neural architecture primitive that leverages physical system dynamics for computation, bridging the gap between physical simulations and neural networks, enabling more efficient and generalizable models.
Method: Metriplector configures abstract physical systems with fields, sources, and operators. It uses coupled metriplectic dynamics (combining dissipative and conservative components) and derives readouts from the stress-energy tensor via Noether’s theorem. The architecture can be instantiated at different complexity levels: from simple screened Poisson equations to full field dynamics.
Result: Achieved strong performance across four domains: perfect pathfinding (F1=1.0) with generalization to larger grids, 97.2% Sudoku solve rate without structural injection, 81.03% on CIFAR-100 with only 2.26M parameters, and efficient language modeling (1.182 bits/byte) with 3.6x fewer training tokens than GPT baselines.
Conclusion: Metriplector provides a versatile neural architecture primitive that successfully applies physical system dynamics to diverse computational tasks, demonstrating strong generalization, parameter efficiency, and performance across multiple domains including vision and language.
Abstract: We present Metriplector, a neural architecture primitive in which the input configures an abstract physical system – fields, sources, and operators – and the dynamics of that system is the computation. Multiple fields evolve via coupled metriplectic dynamics, and the stress-energy tensor $T^{\mu\nu}$, derived from Noether’s theorem, provides the readout. The metriplectic formulation admits a natural spectrum of instantiations: the dissipative branch alone yields a screened Poisson equation solved exactly via conjugate gradient; activating the full structure – including the antisymmetric Poisson bracket – gives field dynamics for image recognition and language modeling. We evaluate Metriplector across four domains, each using a task-specific architecture built from this shared primitive with progressively richer physics: F1=1.0 on maze pathfinding, generalizing from 15x15 training grids to unseen 39x39 grids; 97.2% exact Sudoku solve rate with zero structural injection; 81.03% on CIFAR-100 with 2.26M parameters; and 1.182 bits/byte on language modeling with 3.6x fewer training tokens than a GPT baseline.
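The dissipative branch's "screened Poisson equation solved exactly via conjugate gradient" can be sketched on a 1D periodic grid; the paper's actual operator, dimensionality, and boundary conditions are not specified in the abstract, so this is illustrative only:

```python
import numpy as np

def screened_poisson_1d(f, kappa=1.0, h=1.0, tol=1e-10):
    """Solve (kappa^2 - Laplacian) u = f with conjugate gradient on a
    1D periodic grid. The shifted operator is symmetric positive
    definite for kappa > 0, so CG converges; this sketches the
    dissipative branch described in the abstract.
    """
    def apply_A(u):
        lap = (np.roll(u, 1) - 2 * u + np.roll(u, -1)) / h ** 2
        return kappa ** 2 * u - lap

    u = np.zeros_like(f)
    r = f - apply_A(u)
    p = r.copy()
    rs = r @ r
    for _ in range(len(f)):
        Ap = apply_A(p)
        alpha = rs / (p @ Ap)
        u += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol:  # squared residual norm below tolerance
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return u
```

In exact arithmetic CG converges in at most n steps, which is presumably what the abstract means by "solved exactly."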
[359] Learning to Generate Formally Verifiable Step-by-Step Logic Reasoning via Structured Formal Intermediaries
Luoxin Chen, Yichi Zhou, Huishuai Zhang
Main category: cs.AI
TL;DR: PRoSFI is a reinforcement learning method that uses formal verification of structured intermediate reasoning steps to train more reliable language models, addressing the problem of flawed reasoning steps that produce correct final answers.
Details
Motivation: Current LLMs trained with outcome-rewarded RL often produce correct final answers but with flawed intermediate reasoning steps, leading to unreliable reasoning processes. There's a need for methods that ensure both correct answers and reliable reasoning chains.
Method: PRoSFI trains models to output structured intermediate steps aligned with natural language reasoning. Each step is verified by a formal prover, and only fully validated reasoning chains receive high rewards. This guides models toward generating step-by-step machine-checkable proofs.
Result: The method enhances reasoning reliability without compromising accuracy, producing more credible final answers through formally verified reasoning chains.
Conclusion: PRoSFI offers a simple and effective approach to training trustworthy reasoning models by integrating formal verification into the reward structure, addressing the reliability gap in current outcome-rewarded RL methods.
Abstract: Large language models (LLMs) have recently demonstrated impressive performance on complex, multi-step reasoning tasks, especially when post-trained with outcome-rewarded reinforcement learning (Guo et al., 2025). However, it has been observed that outcome rewards often overlook flawed intermediate steps, leading to unreliable reasoning steps even when final answers are correct. To address this unreliable reasoning, we propose PRoSFI (Process Reward over Structured Formal Intermediates), a novel reward method that enhances reasoning reliability without compromising accuracy. Instead of generating formal proofs directly, which is rarely accomplishable for a modest-sized (7B) model, the model outputs structured intermediate steps aligned with its natural language reasoning. Each step is then verified by a formal prover. Only fully validated reasoning chains receive high rewards. The integration of formal verification guides the model towards generating step-by-step machine-checkable proofs, thereby yielding more credible final answers. PRoSFI offers a simple and effective approach to training trustworthy reasoning models.
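The all-or-nothing reward structure ("only fully validated reasoning chains receive high rewards") can be written down directly. A sketch; `prover` stands in for the formal prover, and the optional partial-credit term is an assumption about reward shaping, not the paper's specification:

```python
def prosfi_reward(steps, prover, answer_correct, step_bonus=0.0):
    """All-or-nothing process reward in the spirit of PRoSFI.

    steps:          structured intermediate steps to verify
    prover:         callable returning True iff a step formally checks
                    (a stand-in for the formal prover)
    answer_correct: outcome reward signal for the final answer
    step_bonus:     optional partial credit per verified step (an
                    assumption; the paper may use a different shaping)
    """
    verified = [prover(s) for s in steps]
    if answer_correct and all(verified):
        return 1.0  # full reward only for a fully validated chain
    return step_bonus * sum(verified) / max(len(steps), 1)
```

The key property is that a correct final answer alone never earns the full reward: every intermediate step must also pass the prover.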
[360] FlowPIE: Test-Time Scientific Idea Evolution with Flow-Guided Literature Exploration
Qiyao Wang, Hongbo Wang, Longze Chen, Zhihao Yang, Guhong Chen, Hamid Alinejad-Rokny, Hui Li, Yuan Lin, Min Yang
Main category: cs.AI
TL;DR: FlowPIE: A co-evolving retrieval-generation framework for scientific idea generation using flow-guided MCTS and evolutionary algorithms to produce diverse, high-quality ideas.
Details
Motivation: Existing scientific idea generation approaches are limited by static retrieval-then-generation paradigms that produce homogeneous ideas with insufficient divergence, failing to effectively explore literature and incorporate cross-domain knowledge.
Method: FlowPIE uses flow-guided Monte Carlo Tree Search (MCTS) inspired by GFlowNets to expand literature trajectories, guided by LLM-based generative reward model (GRM) quality assessments. It then applies evolutionary algorithms (selection, crossover, mutation) with isolation island paradigm and GRM-based fitness computation for test-time idea evolution.
Result: Extensive evaluations show FlowPIE consistently produces ideas with higher novelty, feasibility, and diversity compared to strong LLM-based and agent-based frameworks, while enabling reward scaling during test time.
Conclusion: FlowPIE effectively mitigates information cocoons from over-reliance on parametric knowledge and static literature by treating literature exploration and idea generation as a co-evolving process.
Abstract: Scientific idea generation (SIG) is critical to AI-driven autonomous research, yet existing approaches are often constrained by a static retrieval-then-generation paradigm, leading to homogeneous and insufficiently divergent ideas. In this work, we propose FlowPIE, a tightly coupled retrieval-generation framework that treats literature exploration and idea generation as a co-evolving process. FlowPIE expands literature trajectories via a flow-guided Monte Carlo Tree Search (MCTS) inspired by GFlowNets, using the quality of current ideas assessed by an LLM-based generative reward model (GRM) as a supervised signal to guide adaptive retrieval and construct a diverse, high-quality initial population. Based on this population, FlowPIE models idea generation as a test-time idea evolution process, applying selection, crossover, and mutation with the isolation island paradigm and GRM-based fitness computation to incorporate cross-domain knowledge. It effectively mitigates the information cocoons arising from over-reliance on parametric knowledge and static literature. Extensive evaluations demonstrate that FlowPIE consistently produces ideas with higher novelty, feasibility and diversity compared to strong LLM-based and agent-based frameworks, while enabling reward scaling during test time.
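The selection/crossover/mutation loop at the core of the test-time evolution stage can be sketched generically; `fitness` stands in for the LLM-based generative reward model (GRM), and the operators and elitist schedule here are illustrative assumptions rather than FlowPIE's actual configuration (which also uses an isolation-island scheme):

```python
import random

def evolve_ideas(population, fitness, crossover, mutate,
                 generations=10, elite=2, seed=0):
    """Minimal elitist evolutionary loop over candidate ideas.

    fitness:   scores a candidate (stand-in for the GRM)
    crossover: combines two parents into a child
    mutate:    perturbs a child, given an RNG
    """
    rng = random.Random(seed)
    pop = list(population)
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        next_pop = pop[:elite]  # elitism: carry the best forward
        while len(next_pop) < len(pop):
            a, b = rng.sample(pop[: max(elite * 2, 2)], 2)  # parent selection
            next_pop.append(mutate(crossover(a, b), rng))
        pop = next_pop
    return max(pop, key=fitness)
```

With bitstrings as a stand-in for ideas and `sum` as a stand-in fitness, the loop steadily accumulates high-fitness traits, which is the behavior FlowPIE relies on at a much richer representation level.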
[361] ASI-Evolve: AI Accelerates AI
Weixian Xu, Tiantian Mi, Yixiu Liu, Yang Nan, Zhimeng Zhou, Lyumanshan Ye, Lin Zhang, Yu Qiao, Pengfei Liu
Main category: cs.AI
TL;DR: ASI-Evolve is an AI agentic framework that automates AI research through a learn-design-experiment-analyze cycle, demonstrating AI-driven discovery across data curation, architecture design, and learning algorithms.
Details
Motivation: Current AI agentic systems excel at well-scoped tasks with rapid feedback but struggle with costly, long-horizon, weakly supervised research loops that drive real AI progress. The paper aims to close this gap by creating a framework where AI can accelerate AI development itself.
Method: ASI-Evolve uses an agentic framework with a learn-design-experiment-analyze cycle. It augments evolutionary agents with two key components: a cognition base that injects accumulated human priors into exploration, and a dedicated analyzer that distills experimental outcomes into reusable insights for future iterations.
Result: The framework achieved significant improvements across three AI development areas: discovered 105 SOTA linear attention architectures (best model surpassing DeltaNet by +0.97 points), improved pretraining data curation by +3.96 average benchmark points (up to +18 points on MMLU), and discovered RL algorithms outperforming GRPO by up to +12.5 points on AMC32. Also showed transferability to mathematics and biomedicine.
Conclusion: ASI-Evolve represents a promising step toward enabling AI to accelerate AI development across foundational stages, offering early evidence for the feasibility of closed-loop AI research that can potentially transfer beyond the AI stack.
Abstract: Can AI accelerate the development of AI itself? While recent agentic systems have shown strong performance on well-scoped tasks with rapid feedback, it remains unclear whether they can tackle the costly, long-horizon, and weakly supervised research loops that drive real AI progress. We present ASI-Evolve, an agentic framework for AI-for-AI research that closes this loop through a learn-design-experiment-analyze cycle. ASI-Evolve augments standard evolutionary agents with two key components: a cognition base that injects accumulated human priors into each round of exploration, and a dedicated analyzer that distills complex experimental outcomes into reusable insights for future iterations. To our knowledge, ASI-Evolve is the first unified framework to demonstrate AI-driven discovery across three central components of AI development: data, architectures, and learning algorithms. In neural architecture design, it discovered 105 SOTA linear attention architectures, with the best discovered model surpassing DeltaNet by +0.97 points, nearly 3x the gain of recent human-designed improvements. In pretraining data curation, the evolved pipeline improves average benchmark performance by +3.96 points, with gains exceeding 18 points on MMLU. In reinforcement learning algorithm design, discovered algorithms outperform GRPO by up to +12.5 points on AMC32, +11.67 points on AIME24, and +5.04 points on OlympiadBench. We further provide initial evidence that this AI-for-AI paradigm can transfer beyond the AI stack through experiments in mathematics and biomedicine. Together, these results suggest that ASI-Evolve represents a promising step toward enabling AI to accelerate AI across the foundational stages of development, offering early evidence for the feasibility of closed-loop AI research.
[362] Optimizing Donor Outreach for Blood Collection Sessions: A Scalable Decision Support Framework
André Carneiro, Pedro T. Monteiro, Rui Henriques
Main category: cs.AI
TL;DR: Optimization framework for scheduling blood donor invitations across multi-site networks, balancing donor eligibility, travel convenience, blood-type demand targets, and operational constraints.
Details
Motivation: Blood donation centers struggle with supply-demand matching while avoiding donor fatigue. Current approaches don't address the operational problem of assigning donors to sessions across multi-site networks considering eligibility, capacity, blood-type targets, geographic convenience, and donor safety.
Method: Two strategies: (1) binary integer linear programming (BILP) formulation, and (2) efficient greedy heuristic. Integrated with organic attendance forecasting, quantile-based demand targets, and residual capacity estimation for forward-looking invitation plans.
Result: Greedy heuristic achieves comparable results to BILP with 188x less peak memory and 115x faster runtime, though with 3.9 percentage points lower demand fulfillment (86.1% vs 90.0%). Framework effectively closes supply-demand gaps by mobilizing eligible inactive/lapsing donors.
Conclusion: The optimization framework plays a key role in closing supply-demand gaps in blood donation systems. The greedy heuristic offers practical efficiency advantages with acceptable trade-offs, making it suitable for real-world implementation.
Abstract: Blood donation centers face challenges in matching supply with demand while managing donor availability. Although targeted outreach is important, it can cause donor fatigue via over-solicitation. Effective recruitment requires targeting the right donors at the right time, balancing constraints with donor convenience and eligibility. Despite extensive work on blood supply chain optimization and growing interest in algorithmic donor recruitment, the operational problem of assigning donors to sessions across a multi-site network, taking into account eligibility, capacity, blood-type demand targets, geographic convenience, and donor safety, remains unaddressed. We address this gap with an optimization framework for donor invitation scheduling incorporating donor eligibility, travel convenience, blood-type demand targets, and penalties. We evaluate two strategies: (i) a binary integer linear programming (BILP) formulation and (ii) an efficient greedy heuristic. Evaluation uses the registry from Instituto Português do Sangue e da Transplantação (IPST) for invite planning in the Lisbon operational region using 4-month windows. A prospective pipeline integrates organic attendance forecasting, quantile-based demand targets, and residual capacity estimation for forward-looking invitation plans. Results reveal its key role in closing the supply-demand gap in the Lisbon operational region. A controlled comparison shows that the greedy heuristic achieves results comparable to the BILP, with 188x less peak memory and 115x faster runtime; trade-offs include 3.9 pp lower demand fulfillment (86.1% vs. 90.0%), larger donor-session distance, higher adverse-reaction donor exposure, and greater invitation burden per non-high-frequency donor, reflecting local versus global optimization. Experiments assess how constraint-aware scheduling can close gaps by mobilizing eligible inactive/lapsing donors.
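The flavor of the greedy heuristic can be sketched with a simplified scoring rule; the real system mixes eligibility, travel convenience, blood-type targets, and safety penalties, which are collapsed here into remaining demand and distance (an illustrative assumption, not the paper's algorithm):

```python
def greedy_invite_plan(donors, sessions, capacity, demand, dist):
    """Greedy donor-to-session assignment sketch.

    donors:   {donor_id: blood_type}
    sessions: iterable of session ids
    capacity: {session_id: remaining seats}   (mutated in place)
    demand:   {blood_type: units still needed} (mutated in place)
    dist:     {(donor_id, session_id): travel distance}
    """
    # Score every pair once: needier blood types first, then nearer donors.
    pairs = sorted(
        ((d, s) for d in donors for s in sessions),
        key=lambda p: (-demand.get(donors[p[0]], 0), dist[p]),
    )
    plan, invited = {}, set()
    for d, s in pairs:
        if d in invited or capacity[s] <= 0 or demand.get(donors[d], 0) <= 0:
            continue  # at most one invite per donor; respect seats and targets
        plan[d] = s
        invited.add(d)
        capacity[s] -= 1
        demand[donors[d]] -= 1
    return plan
```

One pass over sorted pairs keeps memory and runtime low, which is consistent with the paper's report that the heuristic trades a few points of demand fulfillment for orders-of-magnitude savings over the BILP.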
[363] View-oriented Conversation Compiler for Agent Trace Analysis
Lvmin Zhang, Maneesh Agrawala
Main category: cs.AI
TL;DR: VCC is a compiler that transforms raw agent conversation logs into structured views to improve analysis quality and context learning efficiency.
Details
Motivation: Modern agent conversations contain complex structured content (nested tool calls, reasoning blocks, sub-agent invocations) that exceeds simple user-assistant exchanges. Current methods of analyzing these traces in plain text or JSON formats degrade analysis quality and efficiency.
Method: VCC is a compiler with lex, parse, IR, lower, and emit stages that transforms raw agent JSONL logs into three structured views: full view (lossless transcript), user-interface view (user-perceived interaction), and adaptive view (structure-preserving projection with relevance predicate).
Result: In a context-learning experiment on AppWorld, using VCC-compiled views instead of raw JSONL led to higher pass rates across all three model configurations tested, reduced reflector token consumption by 50-67%, and produced more concise learned memory.
Conclusion: Message format functions as infrastructure for context learning, not as an incidental implementation choice. Structured conversation compilation significantly improves agent analysis quality and efficiency.
Abstract: Agent traces carry increasing analytical value in the era of context learning and harness-driven agentic cognition, yet most prior work treats conversation format as a trivial engineering detail. Modern agent conversations contain deeply structured content, including nested tool calls and results, chain-of-thought reasoning blocks, sub-agent invocations, context-window compaction boundaries, and harness-injected system directives, whose complexity far exceeds that of simple user-assistant exchanges. Feeding such traces to a reflector or other analytical mechanism in plain text, JSON, YAML, or via grep can materially degrade analysis quality. This paper presents VCC (View-oriented Conversation Compiler), a compiler (lex, parse, IR, lower, emit) that transforms raw agent JSONL logs into a family of structured views: a full view (lossless transcript serving as the canonical line-number coordinate system), a user-interface view (reconstructing the interaction as the user actually perceived it), and an adaptive view (a structure-preserving projection governed by a relevance predicate). In a context-learning experiment on AppWorld, replacing only the reflector’s input format, from raw JSONL to VCC-compiled views, leads to higher pass rates across all three model configurations tested, while cutting reflector token consumption by half to two-thirds and producing more concise learned memory. These results suggest that message format functions as infrastructure for context learning, not as an incidental implementation choice.
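The user-interface view ("the interaction as the user actually perceived it") can be approximated by filtering a JSONL trace. A sketch only: the real VCC is a multi-stage compiler with richer message kinds, and the field names below are assumptions about the log schema:

```python
import json

def user_interface_view(jsonl_lines):
    """Project a raw agent trace onto what the user actually saw.

    Drops tool calls, tool results, and reasoning blocks, keeping only
    user/assistant messages with visible content. The 'role'/'kind'
    field names are illustrative, not VCC's actual schema.
    """
    out = []
    for line in jsonl_lines:
        msg = json.loads(line)
        if (msg.get("role") in ("user", "assistant")
                and msg.get("content")
                and msg.get("kind") not in ("tool_call", "tool_result", "thinking")):
            out.append(f'{msg["role"]}: {msg["content"]}')
    return "\n".join(out)
```

Even this trivial projection shows why format matters: the reflector sees a short, coherent dialogue instead of nested tool payloads it must mentally skip over.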
[364] Beyond the Steeper Curve: AI-Mediated Metacognitive Decoupling and the Limits of the Dunning-Kruger Metaphor
Christopher Koch
Main category: cs.AI
TL;DR: AI use improves task output but degrades metacognitive accuracy, creating a gap between performance and self-assessment
Details
Motivation: To challenge the oversimplified claim that generative AI uniformly amplifies the Dunning-Kruger effect and provide a more nuanced understanding of how AI affects human performance and self-assessment.
Method: Synthesizes evidence from human-AI interaction studies, learning research, and model evaluation to develop a four-variable model of AI-mediated metacognitive decoupling.
Result: AI use improves observable output and short-term task performance while degrading metacognitive accuracy and flattening the competence-confidence gradient across skill groups
Conclusion: Proposes AI-mediated metacognitive decoupling model that better explains overconfidence, reliance issues, and weak transfer than simple Dunning-Kruger amplification, with implications for tool design and assessment
Abstract: The common claim that generative AI simply amplifies the Dunning-Kruger effect is too coarse to capture the available evidence. The clearest findings instead suggest that large language model (LLM) use can improve observable output and short-term task performance while degrading metacognitive accuracy and flattening the classic competence-confidence gradient across skill groups. This paper synthesizes evidence from human-AI interaction, learning research, and model evaluation, and proposes the working model of AI-mediated metacognitive decoupling: a widening gap among produced output, underlying understanding, calibration accuracy, and self-assessed ability. This four-variable account better explains overconfidence, over- and under-reliance, crutch effects, and weak transfer than the simpler metaphor of a uniformly steeper Dunning-Kruger curve. The paper concludes with implications for tool design, assessment, and knowledge work.
[365] A First Step Towards Even More Sparse Encodings of Probability Distributions
Florian Andreas Marwitz, Tanya Braun, Ralf Möller
Main category: cs.AI
TL;DR: Method for extracting first-order logical formulas from probability distributions to reduce exponential encoding requirements, increasing sparsity while preserving core information.
Details
Motivation: Real-world probability distributions are typically encoded in tables/lists requiring exponential values, making them inefficient for representation and computation.
Method: Reduce number of values in distribution, extract logical formulas for each value, then minimize formulas to increase sparsity while generalizing the distribution.
Result: Evaluation shows immense increase in sparsity by extracting small sets of short formulas while preserving core distribution information.
Conclusion: Logical formula extraction enables efficient representation of probability distributions with significantly reduced encoding requirements.
Abstract: Real-world scenarios can be captured with lifted probability distributions. However, distributions are usually encoded in a table or list, requiring an exponential number of values. Hence, we propose a method for extracting first-order formulas from probability distributions that require significantly fewer values by reducing the number of values in a distribution and then extracting, for each value, a logical formula to be further minimized. This reduction and minimization allows for increasing the sparsity in the encoding while also generalizing a given distribution. Our evaluation shows that sparsity can increase immensely by extracting a small set of short formulas while preserving core information.
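The extract-then-minimize idea can be sketched propositionally: group assignments by their probability value, then merge pairs of terms that differ in exactly one variable (the first step of Quine-McCluskey-style minimization). The paper works with first-order formulas and a more sophisticated pipeline; this is a toy illustration:

```python
from itertools import combinations

def extract_formulas(table, variables):
    """Map each probability value to a small DNF-style formula.

    table:     {assignment (tuple of bools): probability value}
    variables: variable names, aligned with assignment positions
    Each formula is a list of terms; a term is {var: required_bool},
    and a dropped variable means "don't care".
    """
    by_value = {}
    for assignment, v in table.items():
        by_value.setdefault(v, []).append(dict(zip(variables, assignment)))
    for v, terms in by_value.items():
        changed = True
        while changed:  # repeat single-variable merges to a fixpoint
            changed = False
            for a, b in combinations(terms, 2):
                if a.keys() == b.keys():
                    diff = [k for k in a if a[k] != b[k]]
                    if len(diff) == 1:  # terms differ in one variable: drop it
                        merged = {k: a[k] for k in a if k != diff[0]}
                        terms = [t for t in terms if t not in (a, b)] + [merged]
                        changed = True
                        break
        by_value[v] = terms
    return by_value
```

When a value's assignments depend on only a subset of variables, the encoding shrinks from exponentially many table rows to a handful of short terms, which is the sparsity gain the paper reports at larger scale.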
[366] Measuring the metacognition of AI
Richard Servajean, Philippe Servajean
Main category: cs.AI
TL;DR: Proposes using meta-d’ framework and signal detection theory to measure metacognitive abilities of AI systems, particularly LLMs, for uncertainty assessment and risk-based decision regulation.
Details
Motivation: As AI systems become integrated into decision-making workflows, it's crucial to measure their metacognitive abilities (assessing reliability of their own decisions) to manage uncertainty and risk effectively.
Method: Proposes meta-d’ framework and signal detection theory (SDT) as gold standards for assessing metacognitive sensitivity. Conducts experiments on three LLMs (GPT-5, DeepSeek-V3.2-Exp, Mistral-Medium-2508) with two series: 1) primary judgment + confidence rating, 2) primary judgment with manipulated risk levels.
Result: Meta-d’ framework enables comparisons: LLM vs optimality, different LLMs on same task, same LLM across different tasks. SDT shows whether LLMs become more conservative when risks are high.
Conclusion: Psychophysical frameworks (meta-d’ and SDT) provide robust methods to measure metacognitive abilities of AI systems, enabling assessment of uncertainty awareness and risk-based decision regulation in LLMs.
Abstract: A robust decision-making process must take into account uncertainty, especially when the choice involves inherent risks. Because artificial intelligence (AI) systems are increasingly integrated into decision-making workflows, managing uncertainty relies more and more on the metacognitive capabilities of these systems; i.e., their ability to assess the reliability of and regulate their own decisions. Hence, it is crucial to employ robust methods to measure the metacognitive abilities of AI. This paper is primarily a methodological contribution arguing for the adoption of the meta-d’ framework, or its model-free alternatives, as the gold standard for assessing the metacognitive sensitivity of AIs–the ability to generate confidence ratings that distinguish correct from incorrect responses. Moreover, we propose to leverage signal detection theory (SDT) to measure the ability of AIs to spontaneously regulate their decisions based on uncertainty and risk. To demonstrate the practical utility of these psychophysical frameworks, we conduct two series of experiments on three large language models (LLMs)–GPT-5, DeepSeek-V3.2-Exp, and Mistral-Medium-2508. In the first experiments, LLMs performed a primary judgment followed by a confidence rating. In the second, LLMs only performed the primary judgment, while we manipulated the risk associated with either response. On the one hand, applying the meta-d’ framework allows us to conduct comparisons along three axes: comparing an LLM to optimality, comparing different LLMs on a given task, and comparing the same LLM across different tasks. On the other hand, SDT allows us to assess whether LLMs become more conservative when risks are high.
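The type-1 SDT quantities that meta-d' builds on (d' for sensitivity, criterion c for conservatism) follow a standard textbook computation; a self-contained sketch using the common log-linear correction for extreme hit/false-alarm rates:

```python
from statistics import NormalDist

def sdt_measures(hits, misses, false_alarms, correct_rejections):
    """Type-1 signal detection measures: sensitivity d' and criterion c.

    d' = z(hit rate) - z(false-alarm rate); c = -0.5 * (z(H) + z(F)).
    The +0.5/+1 log-linear correction avoids infinite z-scores when a
    rate is exactly 0 or 1.
    """
    z = NormalDist().inv_cdf
    h = (hits + 0.5) / (hits + misses + 1)
    f = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    return z(h) - z(f), -0.5 * (z(h) + z(f))
```

In the paper's second experiment series, a shift of c toward more positive values under high risk would be the SDT signature of the conservative regulation the authors test for.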
[367] Symphony for Medical Coding: A Next-Generation Agentic System for Scalable and Explainable Medical Coding
Joakim Edin, Andreas Motzfeldt, Simon Flachs, Lars Maaløe
Main category: cs.AI
TL;DR: Symphony for Medical Coding is an AI system that automates medical coding by reasoning over clinical narratives with access to coding guidelines, enabling cross-system operation and providing span-level evidence for predictions.
Details
Motivation: Medical coding is manual, slow, error-prone, and critical for billing, research, and reporting. Existing automated approaches can't adapt to new codes or coding systems without retraining, and lack explanations for predictions, limiting trust in safety-critical clinical settings.
Method: Symphony approaches medical coding like expert human coders by reasoning over clinical narratives with direct access to coding guidelines. This design enables operation across any coding system and provides span-level evidence linking each predicted code to supporting text.
Result: Symphony achieves state-of-the-art results across two public benchmarks and three real-world datasets spanning inpatient, outpatient, emergency, and subspecialty settings in the US and UK.
Conclusion: Symphony establishes itself as a flexible, deployment-ready foundation for automated clinical coding that can operate across coding systems and provide explainable predictions.
Abstract: Medical coding translates free-text clinical documentation into standardized codes drawn from classification systems that contain tens of thousands of entries and are updated annually. It is central to billing, clinical research, and quality reporting, yet remains largely manual, slow, and error-prone. Existing automated approaches learn to predict a fixed set of codes from labeled data, thereby preventing adaptation to new codes or different coding systems without retraining on different data. They also provide no explanation for their predictions, limiting trust in safety-critical settings. We introduce Symphony for Medical Coding, a system that approaches the task the way expert human coders do: by reasoning over the clinical narrative with direct access to the coding guidelines. This design allows Symphony to operate across any coding system and to provide span-level evidence linking each predicted code to the text that supports it. We evaluate on two public benchmarks and three real-world datasets spanning inpatient, outpatient, emergency, and subspecialty settings across the United States and the United Kingdom. Symphony achieves state-of-the-art results across all settings, establishing itself as a flexible, deployment-ready foundation for automated clinical coding.
[368] Reinforced Reasoning for End-to-End Retrosynthetic Planning
Chenyang Zuo, Siqi Fan, Yizhen Luo, Zaiqing Nie
Main category: cs.AI
TL;DR: ReTriP: End-to-end generative framework for retrosynthetic planning using Chain-of-Thought reasoning with path-coherent molecular representation and progressive training curriculum.
Details
Motivation: Retrosynthetic planning in organic chemistry is challenging due to combinatorial complexity. Conventional hybrid frameworks fracture logical coherence between local molecular transformations and global planning objectives. Need to embed strategic foresight directly into model's chemical reasoning.
Method: Reformulates retrosynthesis as direct Chain-of-Thought reasoning task. Establishes path-coherent molecular representation. Uses progressive training curriculum transitioning from reasoning distillation to reinforcement learning with verifiable rewards to align stepwise generation with practical route utility.
Result: Achieves state-of-the-art performance on RetroBench. Exhibits superior robustness in long-horizon planning compared to hybrid baselines.
Conclusion: ReTriP successfully bridges the gap between local molecular transformations and global planning objectives through end-to-end generative framework with integrated strategic foresight.
Abstract: Retrosynthetic planning is a fundamental task in organic chemistry, yet remains challenging due to its combinatorial complexity. To address this, conventional approaches typically rely on hybrid frameworks that combine single-step predictions with external search heuristics, inevitably fracturing the logical coherence between local molecular transformations and global planning objectives. To bridge this gap and embed sophisticated strategic foresight directly into the model’s chemical reasoning, we introduce ReTriP, an end-to-end generative framework that reformulates retrosynthesis as a direct Chain-of-Thought reasoning task. We establish a path-coherent molecular representation and employ a progressive training curriculum that transitions from reasoning distillation to reinforcement learning with verifiable rewards, effectively aligning stepwise generation with practical route utility. Empirical evaluation on RetroBench demonstrates that ReTriP achieves state-of-the-art performance, exhibiting superior robustness in long-horizon planning compared to hybrid baselines.
[369] Spontaneous Functional Differentiation in Large Language Models: A Brain-Like Intelligence Economy
Junjie Zhang, Zhen Shen, Gang Xiong, Xisong Dong
Main category: cs.AI
TL;DR: Large language models spontaneously develop synergistic information processing cores similar to human brains, with middle layers showing synergistic integration while early/late layers use redundancy, emerging as a phase transition with task difficulty.
Details
Motivation: To identify universal computational principles in artificial intelligence by examining how large language models develop information processing structures, and to bridge understanding between artificial and biological intelligence.Method: Used Integrated Information Decomposition across multiple LLM architectures to analyze information processing patterns, examined layer-wise synergistic vs redundant processing, studied dynamic organization as task difficulty increases, and performed ablation studies on synergistic components.
Result: Found that middle layers exhibit synergistic processing where information integration exceeds individual parts, while early and late layers rely on redundancy. This organization emerges as a physical phase transition with increasing task difficulty. Ablating synergistic components causes catastrophic performance loss.
Conclusion: Synergistic cores in LLMs represent the physical entity of abstract reasoning, showing remarkable similarity to human brain organization, providing evidence for universal computational principles bridging artificial and biological intelligence.
Abstract: The evolution of intelligence in artificial systems provides a unique opportunity to identify universal computational principles. Here we show that large language models spontaneously develop synergistic cores, where information integration exceeds the sum of individual parts, remarkably similar to the human brain. Using Integrated Information Decomposition across multiple architectures, we find that middle layers exhibit synergistic processing while early and late layers rely on redundancy. This organization is dynamic and emerges as a physical phase transition as task difficulty increases. Crucially, ablating synergistic components causes catastrophic performance loss, confirming their role as the physical entity of abstract reasoning and bridging artificial and biological intelligence.
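Integrated Information Decomposition is more involved than this, but the synergy-versus-redundancy contrast it measures can be illustrated with plain mutual information on two toy channels (a hypothetical sketch, not the paper's method):

```python
from collections import defaultdict
from math import log2

def mutual_information(pairs):
    """I(A;B) in bits, estimated from a list of equally likely (a, b) outcomes."""
    n = len(pairs)
    pa, pb, pab = defaultdict(float), defaultdict(float), defaultdict(float)
    for a, b in pairs:
        pa[a] += 1 / n
        pb[b] += 1 / n
        pab[(a, b)] += 1 / n
    return sum(p * log2(p / (pa[a] * pb[b])) for (a, b), p in pab.items())

inputs = [(x, y) for x in (0, 1) for y in (0, 1)]

# Synergistic target (XOR): each input alone carries 0 bits about z,
# yet the pair carries 1 bit: integration exceeds the individual parts.
xor = [((x, y), x ^ y) for x, y in inputs]
i_x = mutual_information([(x, z) for (x, _), z in xor])
i_y = mutual_information([(y, z) for (_, y), z in xor])
i_joint = mutual_information(xor)

# Redundant target (copy): either input alone already carries the full
# bit, so the pair adds nothing beyond one part.
copy = [((x, x), x) for x in (0, 1)]
i_one = mutual_information([(x, z) for (x, _), z in copy])
i_both = mutual_information(copy)
```

For XOR, `i_joint` exceeds `i_x + i_y` (synergy); for the copy channel, `i_both` equals `i_one` (redundancy). The paper's claim is that middle transformer layers behave like the first case and early/late layers like the second.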
[370] CausalPulse: An Industrial-Grade Neurosymbolic Multi-Agent Copilot for Causal Diagnostics in Smart Manufacturing
Chathurangi Shyalika, Utkarshani Jaimini, Cory Henson, Amit Sheth
Main category: cs.AI
TL;DR: CausalPulse is an industry-grade multi-agent copilot for automated causal diagnostics in smart manufacturing, unifying anomaly detection, causal discovery, and reasoning through neurosymbolic architecture.
Details
Motivation: Modern manufacturing needs real-time, trustworthy, and interpretable root-cause analysis, but traditional analytics treat anomaly detection, causal inference, and root-cause analysis as isolated stages, limiting scalability and explainability.
Method: CausalPulse uses a neurosymbolic architecture built on standardized agentic protocols to unify anomaly detection, causal discovery, and reasoning in a multi-agent copilot system.
Result: Achieved 98.0% and 98.73% overall success rates on public and proprietary datasets, with 50-60s end-to-end latency per diagnostic workflow and near-linear scalability (R^2=0.97). Success rates: 98.75% planning/tool use, 97.3% self-reflection, 99.2% collaboration.
Conclusion: CausalPulse’s modular, human-in-the-loop design enables reliable, interpretable, and production-ready automation for next-generation manufacturing, demonstrating advantages in modularity, extensibility, and deployment maturity over existing industrial copilots.
Abstract: Modern manufacturing environments demand real-time, trustworthy, and interpretable root-cause insights to sustain productivity and quality. Traditional analytics pipelines often treat anomaly detection, causal inference, and root-cause analysis as isolated stages, limiting scalability and explainability. In this work, we present CausalPulse, an industry-grade multi-agent copilot that automates causal diagnostics in smart manufacturing. It unifies anomaly detection, causal discovery, and reasoning through a neurosymbolic architecture built on standardized agentic protocols. CausalPulse is being deployed in a Robert Bosch manufacturing plant, integrating seamlessly with existing monitoring workflows and supporting real-time operation at production scale. Evaluations on both public (Future Factories) and proprietary (Planar Sensor Element) datasets show high reliability, achieving overall success rates of 98.0% and 98.73%. Per-criterion success rates reached 98.75% for planning and tool use, 97.3% for self-reflection, and 99.2% for collaboration. Runtime experiments report end-to-end latency of 50-60s per diagnostic workflow with near-linear scalability (R^2=0.97), confirming real-time readiness. Comparison with existing industrial copilots highlights distinct advantages in modularity, extensibility, and deployment maturity. These results demonstrate how CausalPulse’s modular, human-in-the-loop design enables reliable, interpretable, and production-ready automation for next-generation manufacturing.
[371] Tracking vs. Deciding: The Dual-Capability Bottleneck in Searchless Chess Transformers
Quanhao Li, Wei Jiang
Main category: cs.AI
TL;DR: Training chess models from move sequences reveals a dual-capability bottleneck: state tracking (reconstructing board from moves) requires diverse low-rated games, while decision quality requires high-rated games. Scaling model size improves tracking, Elo-weighted training improves decisions, with superadditive benefits.
Details
Motivation: To create human-like chess engines that mimic human style, errors, and consistency rather than maximizing playing strength, by understanding how models learn from move sequences alone.
Method: Train models on move sequences to learn state tracking (reconstructing board) and decision quality (selecting good moves). Use scaling (28M to 120M parameters) to improve tracking, and Elo-weighted training to improve decisions while preserving diversity. Introduce coverage-decay formula for reliability horizon.
Result: 120M-parameter model without search reached Lichess bullet 2570 over 253 games. Achieved 55.2% Top-1 accuracy on human move prediction, exceeding Maia-2 rapid and blitz. Sequence input enables history-dependent decisions that single-position models cannot exhibit.
Conclusion: Training from move sequences reveals fundamental dual-capability bottleneck. Both scaling and Elo-weighted training are necessary, with superadditive benefits. Sequence-based models can capture human-like play by encoding full game history.
Abstract: A human-like chess engine should mimic the style, errors, and consistency of a strong human player rather than maximize playing strength. We show that training from move sequences alone forces a model to learn two capabilities: state tracking, which reconstructs the board from move history, and decision quality, which selects good moves from that reconstructed state. These impose contradictory data requirements: low-rated games provide the diversity needed for tracking, while high-rated games provide the quality signal for decision learning. Removing low-rated data degrades performance. We formalize this tension as a dual-capability bottleneck, P <= min(T,Q), where overall performance is limited by the weaker capability. Guided by this view, we scale the model from 28M to 120M parameters to improve tracking, then introduce Elo-weighted training to improve decisions while preserving diversity. A 2 x 2 factorial ablation shows that scaling improves tracking, weighting improves decisions, and their combination is superadditive. Linear weighting works best, while overly aggressive weighting harms tracking despite lower validation loss. We also introduce a coverage-decay formula, t* = log(N/k_crit)/log b, as a reliability horizon for intra-game degeneration risk. Our final 120M-parameter model, without search, reached Lichess bullet 2570 over 253 rated games. On human move prediction it achieves 55.2% Top-1 accuracy, exceeding Maia-2 rapid and Maia-2 blitz. Unlike position-based methods, sequence input naturally encodes full game history, enabling history-dependent decisions that single-position models cannot exhibit.
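The two formulas in the abstract are simple enough to sketch directly; the parameter values below are invented, and the "coverage" reading of t* is our gloss on the abstract's wording:

```python
from math import log

def overall_performance(tracking, decision_quality):
    """Dual-capability bottleneck P <= min(T, Q): overall play is
    capped by the weaker of the two capabilities."""
    return min(tracking, decision_quality)

def reliability_horizon(n_games, k_crit, branching):
    """Coverage-decay horizon t* = log(N / k_crit) / log(b): the ply at
    which coverage of a line (N games thinned by branching factor b)
    falls below the critical count k_crit."""
    return log(n_games / k_crit) / log(branching)

# Hypothetical numbers: two orders of magnitude more games push the
# reliability horizon outward, but only logarithmically.
t_small = reliability_horizon(n_games=1e6, k_crit=10, branching=30)
t_large = reliability_horizon(n_games=1e8, k_crit=10, branching=30)
```

The logarithmic dependence is the sobering part: multiplying the training corpus by 100 buys only a constant number of additional reliable plies.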
[372] Reasoning-Driven Synthetic Data Generation and Evaluation
Tim R. Davidson, Benoit Seguin, Enrico Bacis, Cesar Ilharco, Hamza Harkous
Main category: cs.AI
TL;DR: Simula: A reasoning-driven framework for generating synthetic multimodal datasets without seed data, using agentic approaches for scalable, explainable, and controllable data creation.
Details
Motivation: Specialized multimodal models require diverse training data, but real data is often scarce, expensive to collect, or privacy-sensitive. Existing synthetic data generation methods have limitations in scalability, explainability, and control.
Method: Seedless, agentic approach using reasoning-driven framework where users define desired dataset characteristics through explainable and controllable processes, enabling fine-grained resource allocation without requiring extensive seed data from target distributions.
Result: Demonstrated efficacy on various datasets with rigorous testing of both intrinsic and downstream properties. The framework provides guidelines for synthetic data mechanism design and insights for scalable generation and evaluation.
Conclusion: Simula unlocks new opportunities for AI development in data-scarce domains, offering a scalable alternative to human annotation with better explainability and control than existing synthetic data methods.
Abstract: Although many AI applications of interest require specialized multi-modal models, relevant data to train such models is inherently scarce or inaccessible. Filling these gaps with human annotators is prohibitively expensive, error-prone, and time-consuming, leading model builders to increasingly consider synthetic data as a scalable alternative. However, existing synthetic data generation methods often rely on manual prompts, evolutionary algorithms, or extensive seed data from the target distribution - limiting their scalability, explainability, and control. In this paper, we introduce Simula: a novel reasoning-driven framework for data generation and evaluation. It employs a seedless, agentic approach to generate synthetic datasets at scale, allowing users to define desired dataset characteristics through an explainable and controllable process that enables fine-grained resource allocation. We show the efficacy of our approach on a variety of datasets, rigorously testing both intrinsic and downstream properties. Our work (1) offers guidelines for synthetic data mechanism design, (2) provides insights into generating and evaluating synthetic data at scale, and (3) unlocks new opportunities for developing and deploying AI in domains where data scarcity or privacy concerns are paramount.
[373] Owl-AuraID 1.0: An Intelligent System for Autonomous Scientific Instrumentation and Scientific Data Analysis
Han Deng, Anqi Zou, Hanling Zhang, Ben Fei, Chengyu Zhang, Haobo Wang, Xinru Guo, Zhenyu Li, Xuzhu Wang, Peng Yang, Fujian Zhang, Weiyu Guo, Xiaohong Shao, Zhaoyang Liu, Shixiang Tang, Zhihui Wang, Wanli Ouyang
Main category: cs.AI
TL;DR: Owl-AuraID is a GUI-native embodied agent system that operates scientific instruments through their existing interfaces, enabling automated high-throughput characterization without requiring API access.
Details
Motivation: Scientific discovery requires high-throughput characterization, but automation is hindered by proprietary GUIs and limited generalizability of API-based systems. Current approaches can't handle diverse instrument interfaces.
Method: Uses a GUI-native paradigm where the agent operates instruments through the same interfaces as human experts. Features a skill-centric framework with Type-1 (GUI operation) and Type-2 (data analysis) skills integrated into end-to-end workflows.
Result: Demonstrates broad coverage across ten categories of precision instruments and diverse workflows including multimodal spectral analysis, microscopic imaging, and crystallographic analysis, supporting modalities like FTIR, NMR, AFM, and TGA.
Conclusion: Owl-AuraID provides a practical, extensible foundation for autonomous laboratories and illustrates a path toward evolving laboratory intelligence through reusable operational and analytical skills.
Abstract: Scientific discovery increasingly depends on high-throughput characterization, yet automation is hindered by proprietary GUIs and the limited generalizability of existing API-based systems. We present Owl-AuraID, a software-hardware collaborative embodied agent system that adopts a GUI-native paradigm to operate instruments through the same interfaces as human experts. Its skill-centric framework integrates Type-1 (GUI operation) and Type-2 (data analysis) skills into end-to-end workflows, connecting physical sample handling with scientific interpretation. Owl-AuraID demonstrates broad coverage across ten categories of precision instruments and diverse workflows, including multimodal spectral analysis, microscopic imaging, and crystallographic analysis, supporting modalities such as FTIR, NMR, AFM, and TGA. Overall, Owl-AuraID provides a practical, extensible foundation for autonomous laboratories and illustrates a path toward evolving laboratory intelligence through reusable operational and analytical skills. The code is available at https://github.com/OpenOwlab/AuraID.
[374] Spatiotemporal Robustness of Temporal Logic Tasks using Multi-Objective Reasoning
Oliver Schön, Lars Lindemann
Main category: cs.AI
TL;DR: Proposes spatiotemporal robustness (STR) for temporal logic specifications, capturing both spatial and temporal perturbations via multi-objective reasoning, with monitoring algorithms for autonomous systems.
Details
Motivation: Autonomous systems need robustness to meet objectives under uncertainty. Existing robustness semantics only capture spatial perturbations (geometric distance from unsatisfiability), but real-world systems like multi-agent robotics require joint consideration of spatial AND temporal perturbations.
Method: Defines STR as multi-objective reasoning problem via partial order over spatial/temporal perturbations. Proposes robust semantics that under-approximate STR while being computationally tractable. Develops monitoring algorithms using these semantics with multi-objective optimization tools.
Result: First work to handle robustness across multiple dimensions via multi-objective reasoning. STR provides Pareto-optimal set characterizing all admissible spatiotemporal perturbations, offering more informative robustness measures for interacting systems.
Conclusion: STR framework enables comprehensive robustness analysis for autonomous systems by jointly considering spatial and temporal uncertainties, with practical algorithms for monitoring temporal logic specifications in applications like multi-agent robotics and smart cities.
Abstract: The reliability of autonomous systems depends on their robustness, i.e., their ability to meet their objectives under uncertainty. In this paper, we study spatiotemporal robustness of temporal logic specifications evaluated over discrete-time signals. Existing work has proposed robust semantics that capture not only Boolean satisfiability, but also the geometric distance from unsatisfiability, corresponding to admissible spatial perturbations of a given signal. In contrast, we propose spatiotemporal robustness (STR), which captures admissible spatial and temporal perturbations jointly. This notion is particularly informative for interacting systems, such as multi-agent robotics, smart cities, and air traffic control. We define STR as a multi-objective reasoning problem, formalized via a partial order over spatial and temporal perturbations. This perspective has two key advantages: (1) STR can be interpreted as a Pareto-optimal set that characterizes all admissible spatiotemporal perturbations, and (2) STR can be computed using tools from multi-objective optimization. To navigate computational challenges, we propose robust semantics for STR that are sound in the sense of suitably under-approximating STR while being computationally tractable. Finally, we present monitoring algorithms for STR using these robust semantics. To the best of our knowledge, this is the first work to deal with robustness across multiple dimensions via multi-objective reasoning.
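The Pareto-set view of STR can be sketched with a toy non-dominated filter over candidate (spatial, temporal) perturbation budgets; the points and the "larger admissible budgets are better" convention are illustrative assumptions, not the paper's formalization:

```python
def pareto_front(points):
    """Non-dominated set under the componentwise partial order:
    q dominates p if q is at least as large in both coordinates and
    differs from p. Surviving points form the Pareto front."""
    return sorted(
        p for p in points
        if not any(q != p and q[0] >= p[0] and q[1] >= p[1] for q in points)
    )

# Hypothetical admissible (spatial, temporal) perturbation budgets under
# which a temporal logic specification remains satisfied.
admissible = [(0.1, 3), (0.2, 2), (0.3, 1), (0.1, 1), (0.2, 3)]
front = pareto_front(admissible)
```

The surviving points record a trade-off: tolerating a larger temporal shift costs spatial slack and vice versa, which is exactly the information a single scalar robustness value collapses.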
[375] ShapE-GRPO: Shapley-Enhanced Reward Allocation for Multi-Candidate LLM Training
Rui Ai, Yu Pan, David Simchi-Levi, Chonghuan Wang
Main category: cs.AI
TL;DR: ShapE-GRPO improves set-level reinforcement learning for LLMs by using Shapley values to decompose set rewards into individual candidate signals, preventing poor candidates from free-riding on strong peers.
Details
Motivation: Current RL post-training methods like GRPO assign the same set-level reward to all candidates in recommendation sets, causing poor candidates to free-ride on strong peers and resulting in suboptimal exploration.
Method: Proposes Shapley-Enhanced GRPO (ShapE-GRPO) that leverages cooperative game theory’s Shapley values to decompose set-level rewards into granular, candidate-specific signals while preserving Shapley axioms and maintaining polynomial-time complexity.
Result: ShapE-GRPO consistently outperforms standard GRPO across diverse datasets with accelerated convergence during training.
Conclusion: The Shapley-enhanced formulation provides more precise training signals for set-level optimization in LLM applications, improving recommendation quality and training efficiency.
Abstract: In user-agent interaction scenarios such as recommendation, brainstorming, and code suggestion, Large Language Models (LLMs) often generate sets of candidate recommendations where the objective is to maximize the collective utility of the entire set rather than individual candidates independently. However, existing reinforcement learning post-training paradigms, such as Group Relative Policy Optimization (GRPO), typically assign the same set-level scalar reward to every candidate in the set. This leads to noisy training signals where poor candidates free-ride on the high reward produced by a single strong peer, resulting in suboptimal exploration. To address this, we propose Shapley-Enhanced GRPO (ShapE-GRPO). By leveraging the permutation-invariant nature of set-level utility, we derive a Shapley-enhanced formulation from cooperative game theory to decompose set-level rewards into granular, candidate-specific signals. We show that our formulation preserves the fundamental axioms of the Shapley value while remaining computationally efficient with polynomial-time complexity. Empirically, ShapE-GRPO consistently outperforms standard GRPO across diverse datasets with accelerated convergence during training.
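The Shapley decomposition at the heart of the method can be illustrated by brute force on a tiny candidate set (the coverage utility, candidate names, and the free-riding duplicate are invented for illustration; ShapE-GRPO itself exploits permutation invariance to stay polynomial-time rather than enumerating orders):

```python
from itertools import permutations

def shapley_values(candidates, utility):
    """Exact Shapley values by enumerating arrival orders: each
    candidate is credited its average marginal contribution to the
    set utility. Values sum to utility(full set) (efficiency)."""
    values = {c: 0.0 for c in candidates}
    orders = list(permutations(candidates))
    for order in orders:
        seen = set()
        for c in order:
            values[c] += utility(seen | {c}) - utility(seen)
            seen.add(c)
    return {c: v / len(orders) for c, v in values.items()}

# Toy set utility: number of distinct items covered. "a1" and "a2"
# duplicate the same item, so under a shared set-level scalar reward
# one of them would free-ride; Shapley splits their credit instead.
item = {"a1": "a", "a2": "a", "b": "b"}
def coverage(selected):
    return len({item[c] for c in selected})

phi = shapley_values(["a1", "a2", "b"], coverage)
```

Here `phi` gives "b" a full unit of credit while the two duplicates split one unit between them, the kind of granular, candidate-specific signal that GRPO's shared set-level reward cannot provide.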
[376] A Rational Account of Categorization Based on Information Theory
Christophe J. MacLellan, Karthik Singaravadivelan, Xin Lian, Zekun Wang, Pat Langley
Main category: cs.AI
TL;DR: Information-theoretic rational analysis of categorization outperforms existing models in explaining classic categorization experiments.
Details
Motivation: To develop a new theoretical framework for categorization based on information-theoretic principles and evaluate its explanatory power against established models.
Method: Information-theoretic rational analysis approach applied to re-analyze data from three classic categorization experiments (Hayes-Roth & Hayes-Roth 1977, Medin & Schaffer 1978, Smith & Minda 1998).
Result: The new theory explains human categorization behavior at least as well or better than independent cue/context models, Anderson’s rational model, and hierarchical Dirichlet process models.
Conclusion: Information-theoretic rational analysis provides a powerful framework for understanding categorization that outperforms existing models.
Abstract: We present a new theory of categorization based on an information-theoretic rational analysis. To evaluate this theory, we investigate how well it can account for key findings from classic categorization experiments conducted by Hayes-Roth and Hayes-Roth (1977), Medin and Schaffer (1978), and Smith and Minda (1998). We find that it explains the human categorization behavior at least as well (or better) than the independent cue and context models (Medin & Schaffer, 1978), the rational model of categorization (Anderson, 1991), and a hierarchical Dirichlet process model (Griffiths et al., 2007).
[377] ATP-Bench: Towards Agentic Tool Planning for MLLM Interleaved Generation
Yinuo Liu, Zi Qian, Heng Zhou, Jiahao Zhang, Yajie Zhang, Zhihang Li, Mengyu Zhou, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang
Main category: cs.AI
TL;DR: ATP-Bench introduces a benchmark for evaluating agentic tool planning in multimodal LLMs for interleaved text-and-image generation, proposing a multi-agent evaluation system to assess tool-use behavior without ground-truth references.
Details
Motivation: Current MLLMs for interleaved text-and-image generation treat image generation and retrieval as mutually exclusive, failing to unify factuality with creativity. The authors argue that agentic tool planning (where models autonomously decide when and which tools to use) is the next milestone for visual-critical queries.
Method: 1) Created ATP-Bench: 7,702 QA pairs (including 1,592 VQA pairs) across 8 categories and 25 visual-critical intents with human-verified queries. 2) Proposed Multi-Agent MLLM-as-a-Judge (MAM) system to evaluate tool-call precision, identify missed tool opportunities, and assess response quality without ground-truth references.
Result: Experiments on 10 state-of-the-art MLLMs show models struggle with coherent interleaved planning and exhibit significant variations in tool-use behavior, highlighting substantial room for improvement in agentic tool planning.
Conclusion: ATP-Bench provides a systematic evaluation framework for agentic tool planning in MLLMs, revealing current limitations and offering actionable guidance for advancing interleaved text-and-image generation capabilities.
Abstract: Interleaved text-and-image generation represents a significant frontier for Multimodal Large Language Models (MLLMs), offering a more intuitive way to convey complex information. Current paradigms rely on either image generation or retrieval augmentation, yet they typically treat the two as mutually exclusive paths, failing to unify factuality with creativity. We argue that the next milestone in this field is Agentic Tool Planning, where the model serves as a central controller that autonomously determines when, where, and which tools to invoke to produce interleaved responses for visual-critical queries. To systematically evaluate this paradigm, we introduce ATP-Bench, a novel benchmark comprising 7,702 QA pairs (including 1,592 VQA pairs) across eight categories and 25 visual-critical intents, featuring human-verified queries and ground truths. Furthermore, to evaluate agentic planning independent of end-to-end execution and changing tool backends, we propose a Multi-Agent MLLM-as-a-Judge (MAM) system. MAM evaluates tool-call precision, identifies missed opportunities for tool use, and assesses overall response quality without requiring ground-truth references. Our extensive experiments on 10 state-of-the-art MLLMs reveal that models struggle with coherent interleaved planning and exhibit significant variations in tool-use behavior, highlighting substantial room for improvement and providing actionable guidance for advancing interleaved generation. Dataset and code are available at https://github.com/Qwen-Applications/ATP-Bench.
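MAM judges responses without ground-truth references; as a reference-based toy stand-in, the two quantities it targets, tool-call precision and missed opportunities, reduce to set arithmetic (the tool names below are invented):

```python
def tool_call_metrics(predicted_calls, required_calls):
    """Precision of emitted tool calls plus the 'missed opportunity'
    set: tools the query required but the model never invoked."""
    predicted, required = set(predicted_calls), set(required_calls)
    precision = len(predicted & required) / len(predicted) if predicted else 1.0
    return precision, required - predicted

precision, missed = tool_call_metrics(
    predicted_calls=["image_gen", "web_search"],
    required_calls=["image_gen", "retrieval"],
)
```

A model can thus score poorly in two distinct ways, invoking tools it should not ("web_search" here) or skipping ones it should ("retrieval"), which is why the benchmark reports both measures rather than a single accuracy.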
[378] C-TRAIL: A Commonsense World Framework for Trajectory Planning in Autonomous Driving
Zhihong Cui, Haoran Tang, Tianyi Li, Yushuai Li, Peiyuan Guan, Amir Taherkordi, Tor Skeie
Main category: cs.AI
TL;DR: C-TRAIL is a framework that integrates LLM-derived commonsense reasoning with trust mechanisms for safer autonomous driving trajectory planning, using a closed-loop system to quantify and adaptively weight LLM reliability.
Details
Motivation: LLMs are increasingly used for commonsense reasoning in autonomous driving trajectory planning, but their outputs are inherently unreliable, creating safety risks in critical applications. There's a need to make LLM-based planning more trustworthy and safe.
Method: C-TRAIL uses a Commonsense World framework with a closed-loop Recall, Plan, Update cycle: Recall queries LLM for semantic relations with dual-trust reliability quantification; Plan injects trust-weighted commonsense into Monte Carlo Tree Search via Dirichlet trust policy; Update adaptively refines trust scores and policy parameters from environmental feedback.
Result: Experiments on four simulated Highway-env scenarios and two real-world datasets (highD, rounD) show C-TRAIL outperforms state-of-the-art baselines, reducing ADE by 40.2%, FDE by 51.7%, and improving SR by 16.9 percentage points on average.
Conclusion: C-TRAIL successfully integrates LLM commonsense reasoning with trust mechanisms for safer autonomous driving trajectory planning, demonstrating significant performance improvements over existing methods while addressing LLM reliability concerns.
Abstract: Trajectory planning for autonomous driving increasingly leverages large language models (LLMs) for commonsense reasoning, yet LLM outputs are inherently unreliable, posing risks in safety-critical applications. We propose C-TRAIL, a framework built on a Commonsense World that couples LLM-derived commonsense with a trust mechanism to guide trajectory planning. C-TRAIL operates through a closed-loop Recall, Plan, and Update cycle: the Recall module queries an LLM for semantic relations and quantifies their reliability via a dual-trust mechanism; the Plan module injects trust-weighted commonsense into Monte Carlo Tree Search (MCTS) through a Dirichlet trust policy; and the Update module adaptively refines trust scores and policy parameters from environmental feedback. Experiments on four simulated scenarios in Highway-env and two real-world levelXData datasets (highD, rounD) show that C-TRAIL consistently outperforms state-of-the-art baselines, reducing ADE by 40.2%, FDE by 51.7%, and improving SR by 16.9 percentage points on average. The source code is available at https://github.com/ZhihongCui/CTRAIL.
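The summary does not spell out the dual-trust update, but the general pattern it describes, a trust score refined by environmental feedback, can be sketched as a Beta-posterior mean (a generic assumption, not C-TRAIL's actual mechanism; the class name and uniform prior are invented):

```python
class TrustScore:
    """Beta-posterior trust in an LLM-derived relation: the score is
    the expected success probability given observed feedback.
    Generic sketch only, not the paper's dual-trust mechanism."""

    def __init__(self, prior_success=1.0, prior_failure=1.0):
        self.s = prior_success  # pseudo-count of confirmations
        self.f = prior_failure  # pseudo-count of contradictions

    def update(self, outcome_ok: bool):
        """Refine trust from one round of environmental feedback."""
        if outcome_ok:
            self.s += 1
        else:
            self.f += 1

    @property
    def value(self):
        return self.s / (self.s + self.f)

trust = TrustScore()
for ok in [True, True, False, True]:
    trust.update(ok)
```

After three confirmations and one contradiction on a uniform prior, the score settles at 4/6; weighting the commonsense relation by this value is one simple way to down-weight unreliable LLM output inside a planner.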
[379] Uncertainty Gating for Cost-Aware Explainable Artificial Intelligence
Georgii Mikriukov, Grégoire Montavon, Marina M.-C. Höhne
Main category: cs.AI
TL;DR: Epistemic uncertainty serves as a low-cost proxy for explanation reliability in black-box models, enabling selective routing of samples to appropriate XAI methods based on expected explanation quality.
Details
Motivation: Post-hoc explanation methods for black-box predictions are computationally expensive and their reliability is not guaranteed, creating a need for efficient ways to assess explanation quality without generating full explanations.
Method: Proposes using epistemic uncertainty as a proxy for explanation reliability, enabling two use cases: 1) improving worst-case explanations by routing samples to cheap or expensive XAI methods based on expected reliability, and 2) recalling high-quality explanations by deferring explanation generation for uncertain samples under constrained budgets.
Result: Across four tabular datasets, five diverse architectures, and four XAI methods, strong negative correlation observed between epistemic uncertainty and explanation stability. Epistemic uncertainty distinguishes not only stable from unstable explanations but also faithful from unfaithful ones. Experiments on image classification confirm findings generalize beyond tabular data.
Conclusion: Epistemic uncertainty provides an effective low-cost method to assess explanation reliability, enabling more efficient and reliable use of XAI methods through selective routing and budget-aware explanation generation.
Abstract: Post-hoc explanation methods are widely used to interpret black-box predictions, but their generation is often computationally expensive and their reliability is not guaranteed. We propose epistemic uncertainty as a low-cost proxy for explanation reliability: high epistemic uncertainty identifies regions where the decision boundary is poorly defined and where explanations become unstable and unfaithful. This insight enables two complementary use cases: improving worst-case explanations (routing samples to cheap or expensive XAI methods based on expected explanation reliability), and recalling high-quality explanations (deferring explanation generation for uncertain samples under a constrained budget). Across four tabular datasets, five diverse architectures, and four XAI methods, we observe a strong negative correlation between epistemic uncertainty and explanation stability. Further analysis shows that epistemic uncertainty distinguishes not only stable from unstable explanations, but also faithful from unfaithful ones. Experiments on image classification confirm that our findings generalize beyond tabular data.
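The routing use case reduces to computing a cheap epistemic-uncertainty score and thresholding it. A hedged sketch using ensemble disagreement (mutual-information style) as the uncertainty estimate; the threshold, method names, and the choice of an ensemble estimator are illustrative assumptions, not the paper's setup:

```python
import numpy as np

def epistemic_uncertainty(ensemble_probs):
    """Disagreement across an ensemble: entropy of the mean
    prediction minus the mean member entropy (mutual information)."""
    p = np.asarray(ensemble_probs)          # shape: (members, classes)
    mean = p.mean(axis=0)
    entropy = lambda q: -(q * np.log(q + 1e-12)).sum()
    return entropy(mean) - np.mean([entropy(m) for m in p])

def route_explainer(sample_probs, threshold, cheap_xai, expensive_xai):
    """Send high-uncertainty samples to the more robust (expensive)
    XAI method; confident samples get the cheap one."""
    return expensive_xai if epistemic_uncertainty(sample_probs) > threshold else cheap_xai
```

An ensemble that agrees scores near zero and is routed cheaply; one that disagrees scores high and is deferred to the expensive explainer.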
[380] ScoringBench: A Benchmark for Evaluating Tabular Foundation Models with Proper Scoring Rules
Jonas Landsgesell, Pascal Knoll
Main category: cs.AI
TL;DR: ScoringBench introduces a comprehensive benchmark for evaluating probabilistic forecasting models using proper scoring rules beyond standard point metrics, showing that model rankings depend on chosen scoring rules and no single pretraining objective is universally optimal.
Details
Motivation: Current regression benchmarks evaluate tabular foundation models almost exclusively via point-estimate metrics (RMSE, R²), which obscure performance in distribution tails. This is critical for high-stakes domains like finance and clinical research, where asymmetric risk profiles are the norm.
Method: Introduces ScoringBench, an open benchmark that computes a comprehensive suite of proper scoring rules (CRPS, CRLS, Interval Score, Energy Score, weighted CRPS, Brier Score) alongside standard point metrics. Evaluates realTabPFNv2.5 fine-tuned with different scoring-rule objectives, and TabICL, relative to untuned realTabPFNv2.5 across regression benchmarks.
Result: Model rankings depend on the chosen scoring rule, and no single pretraining objective is universally optimal. This demonstrates that for applications sensitive to extreme events, the choice of evaluation metric is as much a domain-specific requirement as the data itself.
Conclusion: ScoringBench provides a richer picture of probabilistic forecast quality, especially important for high-stakes decision-making where tail performance matters. The benchmark is open-source and maintained via git pull requests for transparency and reproducibility.
Abstract: Tabular foundation models such as TabPFN and TabICL already produce full predictive distributions, yet prevailing regression benchmarks evaluate them almost exclusively via point-estimate metrics (RMSE, R²). These aggregate measures often obscure model performance in the tails of the distribution, a critical deficit for high-stakes decision making in domains like finance and clinical research, where asymmetric risk profiles are the norm. We introduce ScoringBench, an open benchmark that computes a comprehensive suite of proper scoring rules, like CRPS, CRLS, Interval Score, Energy Score, weighted CRPS, and Brier Score, alongside standard point metrics, providing a richer picture of probabilistic forecast quality. We evaluate realTabPFNv2.5 fine-tuned with different scoring-rule objectives, and TabICL, relative to untuned realTabPFNv2.5 across a suite of regression benchmarks. Our results confirm that model rankings depend on the chosen scoring rule and that no single pretraining objective is universally optimal. This demonstrates that for applications sensitive to extreme events, the choice of evaluation metric is as much a domain-specific requirement as the data itself. ScoringBench is available at https://github.com/jonaslandsgesell/ScoringBench. A live preview of the current leaderboard is available at https://scoringbench.bolt.host. The leaderboard is maintained via git pull requests to ensure transparency, traceability, agility, and reproducibility.
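CRPS, the first scoring rule in the suite, has a well-known closed form for Gaussian forecasts, which illustrates how proper scoring rules reward the whole predictive distribution rather than a point estimate. A small sketch (the benchmark's own implementation may differ):

```python
import math

def crps_gaussian(mu, sigma, y):
    """Closed-form CRPS for a Gaussian predictive distribution
    N(mu, sigma^2) against observation y; lower is better."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return sigma * (z * (2 * cdf - 1) + 2 * pdf - 1 / math.sqrt(math.pi))
```

Unlike RMSE on the mean, CRPS penalizes an over- or under-confident sigma even when the point estimate is identical, which is exactly the tail behavior the benchmark aims to surface.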
[381] Physiological and Semantic Patterns in Medical Teams Using an Intelligent Tutoring System
Xiaoshan Huang, Conrad Borchers, Jiayi Zhang, Susanne P. Lajoie
Main category: cs.AI
TL;DR: Study examines physiological synchrony and conversational dynamics in medical dyads during problem-solving, finding that high physiological synchrony correlates with lower semantic similarity in dialogue, indicating exploratory language use during critical moments.
Details
Motivation: To understand how physiological synchrony (alignment in physiological signals) combined with conversational analysis can reveal critical moments in collaborative problem-solving, particularly in Socially Shared Regulation of Learning (SSRL) contexts.
Method: Analyzed four medical dyads diagnosing virtual patient cases using an intelligent tutoring system. Measured physiological synchrony and correlated it with semantic shifts in dialogue. Coded utterances for SSRL and derived cosine similarity using sentence embeddings. Conducted qualitative analysis of synchrony peaks.
Result: Semantic shifts correlated with transient physiological synchrony peaks. Activating prior knowledge showed significantly lower semantic similarity than task execution. High physiological synchrony associated with lower semantic similarity. Qualitative analysis revealed synchrony peaks as “pivotal moments” - successful teams synchronized during shared discovery, unsuccessful teams during shared uncertainty.
Conclusion: Research demonstrates how biological signals can be fused with dialogues to understand critical moments in problem solving, advancing human-centered AI by showing physiological synchrony indicates exploratory language use during pivotal collaborative moments.
Abstract: Effective collaboration requires teams to manage complex cognitive and emotional states through Socially Shared Regulation of Learning (SSRL). Physiological synchrony (i.e., longitudinal alignment in physiological signals) can indicate these states, but is hard to interpret on its own. We investigate the physiological and conversational dynamics of four medical dyads diagnosing a virtual patient case using an intelligent tutoring system. Semantic shifts in dialogue were correlated with transient physiological synchrony peaks. We also coded utterance segments for SSRL and derived cosine similarity using sentence embeddings. The results showed that activating prior knowledge featured significantly lower semantic similarity than simpler task execution. High physiological synchrony was associated with lower semantic similarity, suggesting that such moments involve exploratory and varied language use. Qualitative analysis triangulated these synchrony peaks as "pivotal moments": successful teams synchronized during shared discovery, while unsuccessful teams peaked during shared uncertainty. This research advances human-centered AI by demonstrating how biological signals can be fused with dialogues to understand critical moments in problem solving.
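The semantic-similarity measure in the method, cosine similarity between consecutive utterances, can be sketched as follows. Bag-of-words counts stand in for the sentence embeddings the authors actually use, so the numbers are illustrative only:

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def semantic_shift(utterances):
    """Similarity between consecutive utterances; low values mark
    semantic shifts (bag-of-words stand-in for sentence embeddings)."""
    vecs = [Counter(u.lower().split()) for u in utterances]
    return [cosine(a, b) for a, b in zip(vecs, vecs[1:])]
```

A dialogue that suddenly pivots topic (e.g., from vitals to a new hypothesis) produces a near-zero entry in the resulting series, which is the kind of shift the study correlates with synchrony peaks.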
[382] Structured Intent as a Protocol-Like Communication Layer: Cross-Model Robustness, Framework Comparison, and the Weak-Model Compensation Effect
Peng Gang
Main category: cs.AI
TL;DR: Structured intent frameworks (5W3H, CO-STAR, RISEN) significantly improve cross-language and cross-model robustness in AI goal alignment, reducing variance and improving user satisfaction in real-world applications.
Details
Motivation: To investigate how reliably structured intent representations can preserve user goals across different AI models, languages, and prompting frameworks, building on prior work showing PPS (Prompt Protocol Specification) improves goal alignment.
Method: Three-pronged approach: 1) cross-model robustness testing across Claude, GPT-4o, and Gemini 2.5 Pro; 2) controlled comparison with CO-STAR and RISEN frameworks; 3) user study (N=50) of AI-assisted intent expansion in ecologically valid settings. Evaluated 3,240 model outputs across 3 languages, 6 conditions, 3 models, 3 domains, and 20 tasks using an independent judge (DeepSeek-V3).
Result: Structured prompting substantially reduces cross-language score variance (sigma reduced from 0.470 to ~0.020). Weak-model compensation observed: lowest-baseline model (Gemini) showed much larger gains (+1.006) than strongest model (Claude, +0.217). 5W3H, CO-STAR, and RISEN achieved similarly high goal-alignment scores. User study showed AI-expanded 5W3H prompts reduced interaction rounds by 60% and increased user satisfaction from 3.16 to 4.04.
Conclusion: Structured intent representation serves as a robust, protocol-like communication layer for human-AI interaction, with dimensional decomposition being the key active ingredient for improving cross-model and cross-language reliability.
Abstract: How reliably can structured intent representations preserve user goals across different AI models, languages, and prompting frameworks? Prior work showed that PPS (Prompt Protocol Specification), a 5W3H-based structured intent framework, improves goal alignment in Chinese and generalizes to English and Japanese. This paper extends that line of inquiry in three directions: cross-model robustness across Claude, GPT-4o, and Gemini 2.5 Pro; controlled comparison with CO-STAR and RISEN; and a user study (N=50) of AI-assisted intent expansion in ecologically valid settings. Across 3,240 model outputs (3 languages x 6 conditions x 3 models x 3 domains x 20 tasks), evaluated by an independent judge (DeepSeek-V3), we find that structured prompting substantially reduces cross-language score variance relative to unstructured baselines. The strongest structured conditions reduce cross-language sigma from 0.470 to about 0.020. We also observe a weak-model compensation pattern: the lowest-baseline model (Gemini) shows a much larger D-A gain (+1.006) than the strongest model (Claude, +0.217). Under the current evaluation resolution, 5W3H, CO-STAR, and RISEN achieve similarly high goal-alignment scores, suggesting that dimensional decomposition itself is an important active ingredient. In the user study, AI-expanded 5W3H prompts reduce interaction rounds by 60 percent and increase user satisfaction from 3.16 to 4.04. These findings support the practical value of structured intent representation as a robust, protocol-like communication layer for human-AI interaction.
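The core of a 5W3H-style structured intent is dimensional decomposition: a user goal is expressed along eight fixed dimensions before being handed to a model. A toy rendering; the field names and output format are my assumptions, not the PPS specification:

```python
# The eight 5W3H dimensions (who/what/when/where/why/how/how much/how long).
FIVE_W_THREE_H = ["who", "what", "when", "where", "why",
                  "how", "how_much", "how_long"]

def build_5w3h_prompt(intent: dict) -> str:
    """Render a 5W3H-decomposed intent as a protocol-like prompt block,
    flagging unfilled dimensions so the model (or an AI expansion step)
    can ask about them instead of guessing."""
    lines = [f"{d.upper()}: {intent[d]}" for d in FIVE_W_THREE_H if d in intent]
    missing = [d for d in FIVE_W_THREE_H if d not in intent]
    if missing:
        lines.append("UNSPECIFIED: " + ", ".join(missing))
    return "\n".join(lines)
```

Making the unspecified dimensions explicit is one plausible mechanism behind the reported drop in interaction rounds: the model can resolve gaps in one pass rather than over several clarification turns.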
[383] Extending MONA in Camera Dropbox: Reproduction, Learned Approval, and Design Implications for Reward-Hacking Mitigation
Nathan Heath
Main category: cs.AI
TL;DR: Reproduction and extension of MONA safety framework for RL agents, testing different approval mechanisms to prevent reward hacking while maintaining intended behavior.
Details
Motivation: The original MONA paper left open how approval construction affects safety guarantees. This work aims to operationalize the approval-spectrum conjecture through reproducible experiments.
Method: Reproduced the MONA Camera Dropbox environment as a standardized Python project with PPO training. Introduced a modular learned-approval suite with oracle, noisy, misspecified, learned, and calibrated approval mechanisms. Conducted pilot sweeps across approval methods, horizons, dataset sizes, and calibration strategies.
Result: Confirmed original findings: ordinary RL has 91.5% reward-hacking rate vs. oracle MONA’s 0.0%. Best calibrated learned-overseer achieved zero reward hacking but much lower intended-behavior rates (11.9% vs. 99.9% for oracle MONA), showing under-optimization rather than re-emergent hacking.
Conclusion: The central engineering challenge shifts from proving MONA’s concept to building learned approval models that preserve sufficient foresight without reopening reward-hacking channels. The work provides a runnable experimental object for further research.
Abstract: Myopic Optimization with Non-myopic Approval (MONA) mitigates multi-step reward hacking by restricting the agent's planning horizon while supplying far-sighted approval as a training signal (Farquhar et al., 2025). The original paper identifies a critical open question: how the method of constructing approval, particularly the degree to which approval depends on achieved outcomes, affects whether MONA's safety guarantees hold. We present a reproduction-first extension of the public MONA Camera Dropbox environment that (i) repackages the released codebase as a standard Python project with scripted PPO training, (ii) confirms the published contrast between ordinary RL (91.5% reward-hacking rate) and oracle MONA (0.0% hacking rate) using the released reference arrays, and (iii) introduces a modular learned-approval suite spanning oracle, noisy, misspecified, learned, and calibrated approval mechanisms. In reduced-budget pilot sweeps across approval methods, horizons, dataset sizes, and calibration strategies, the best calibrated learned-overseer run achieves zero observed reward hacking but substantially lower intended-behavior rates than oracle MONA (11.9% vs. 99.9%), consistent with under-optimization rather than re-emergent hacking. These results operationalize the MONA paper's approval-spectrum conjecture as a runnable experimental object and suggest that the central engineering challenge shifts from proving MONA's concept to building learned approval models that preserve sufficient foresight without reopening reward-hacking channels. Code, configurations, and reproduction commands are publicly available at https://github.com/codernate92/mona-camera-dropbox-repro
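MONA's core idea, a short optimization horizon plus a far-sighted approval signal, can be sketched as a per-step training signal: the agent gets credit only for near-term environment reward and the overseer's approval of the current step, so multi-step hacks earn nothing. Illustrative only; the horizon, discount, and reward shapes are assumptions, not the released code:

```python
def mona_return(env_rewards, approvals, horizon=1, gamma=0.9):
    """Per-step MONA-style training signal: a truncated (myopic)
    discounted sum of env rewards plus the non-myopic approval score.
    Rewards reachable only beyond `horizon` contribute nothing."""
    signals = []
    for t in range(len(env_rewards)):
        window = env_rewards[t:t + horizon]
        myopic = sum(gamma ** k * r for k, r in enumerate(window))
        signals.append(myopic + approvals[t])
    return signals
```

With horizon=1, a large delayed payoff at step 3 never propagates back to the step that set up the hack; only the overseer's approval at that step can reward it, which is the safety mechanism the reproduction stress-tests with learned approval.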
[384] The Triadic Cognitive Architecture: Bounding Autonomous Action via Spatio-Temporal and Epistemic Friction
Davide Di Gioia
Main category: cs.AI
TL;DR: Triadic Cognitive Architecture (TCA) grounds AI agent reasoning in continuous-time physics to address cognitive weightlessness in LLM-driven agents, using stochastic control and Riemannian geometry to define Cognitive Friction and improve decision-making under constraints.
Details
Motivation: Current LLM-driven AI agents operate in "cognitive weightlessness": they lack an intrinsic understanding of network topology, temporal pacing, or epistemic limits, leading to failure modes like excessive tool use, prolonged deliberation, and brittle behavior in interactive environments.
Method: Proposes the Triadic Cognitive Architecture (TCA), which synthesizes nonlinear filtering theory, Riemannian routing geometry, and optimal control to map agent deliberation to a coupled stochastic control problem. Uses an HJB-motivated stopping boundary and a rollout-based approximation of belief-dependent value-of-information with a net-utility halting condition.
Result: Empirical validation in simulated Emergency Medical Diagnostic Grid shows triadic policy reduces time-to-action while improving patient viability without degrading diagnostic accuracy, outperforming greedy baselines that over-deliberate under latency and congestion costs.
Conclusion: TCA provides a unified mathematical framework to ground machine reasoning in continuous-time physics, addressing cognitive weightlessness through formal definition of Cognitive Friction and physically-constrained deliberation processes.
Abstract: Current autonomous AI agents, driven primarily by Large Language Models (LLMs), operate in a state of cognitive weightlessness: they process information without an intrinsic sense of network topology, temporal pacing, or epistemic limits. Consequently, heuristic agentic loops (e.g., ReAct) can exhibit failure modes in interactive environments, including excessive tool use under congestion, prolonged deliberation under time decay, and brittle behavior under ambiguous evidence. In this paper, we propose the Triadic Cognitive Architecture (TCA), a unified mathematical framework that grounds machine reasoning in continuous-time physics. By synthesizing nonlinear filtering theory, Riemannian routing geometry, and optimal control, we formally define the concept of Cognitive Friction. We map the agent’s deliberation process to a coupled stochastic control problem where information acquisition is path-dependent and physically constrained. Rather than relying on arbitrary heuristic stop-tokens, the TCA uses an HJB-motivated stopping boundary and instantiates a rollout-based approximation of belief-dependent value-of-information with a net-utility halting condition. Through empirical validation in a simulated Emergency Medical Diagnostic Grid (EMDG), we demonstrate that while greedy baselines over-deliberate under latency and congestion costs, the triadic policy reduces time-to-action while improving patient viability without degrading diagnostic accuracy in this environment.
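The net-utility halting condition amounts to deliberating only while the estimated value of further information exceeds its physical cost (latency, congestion). A toy sketch under assumed scalar costs, not the paper's HJB formulation:

```python
def should_continue(voi_estimate, query_cost, time_penalty):
    """Net-utility halting: keep acquiring information only while its
    estimated value exceeds the cost of obtaining it."""
    return voi_estimate > query_cost + time_penalty

def deliberate(voi_rollout, query_cost, time_penalty, max_steps=10):
    """Run rollout-based value-of-information estimates until the
    halting condition fires or the step budget runs out; returns the
    step at which the agent commits to acting (illustrative sketch)."""
    for step in range(max_steps):
        if not should_continue(voi_rollout(step), query_cost, time_penalty):
            return step
    return max_steps
```

With a diminishing VOI curve, raising the congestion cost moves the stopping boundary earlier, mirroring how the triadic policy acts sooner than greedy baselines under latency pressure.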
[385] When Only the Final Text Survives: Implicit Execution Tracing for Multi-Agent Attribution
Yi Nian, Haosen Cao, Shenzhe Zhu, Henry Peng Zou, Qingqing Luan, Yue Zhao
Main category: cs.AI
TL;DR: IET embeds agent-specific statistical signals into token generation to enable trace recovery from final text without execution logs, providing accountability for multi-agent systems.
Details
Motivation: When multi-agent systems produce incorrect or harmful content, accountability is difficult without execution logs, which are often unavailable due to privacy or system boundaries. Existing attribution methods fail when this metadata is missing.
Method: Implicit Execution Tracing (IET) embeds agent-specific, key-conditioned statistical signals directly into token generation, transforming the output text into a self-verifying execution record. A linearized trace is then recovered via transition-aware statistical scoring.
Result: IET achieves accurate segment-level attribution and reliable transition recovery under identity removal, boundary corruption, and privacy-preserving redaction while maintaining generation quality.
Conclusion: Embedding provenance into generation provides practical, robust accountability for multi-agent language systems when execution metadata is unavailable.
Abstract: When a multi-agent system produces an incorrect or harmful answer, who is accountable if execution logs and agent identifiers are unavailable? In practice, generated content is often detached from its execution environment due to privacy or system boundaries, leaving the final text as the only auditable artifact. Existing attribution methods rely on full execution traces and thus become ineffective in such metadata-deprived settings. We propose Implicit Execution Tracing (IET), a provenance-by-design framework that shifts attribution from post-hoc inference to built-in instrumentation. Instead of reconstructing hidden trajectories, IET embeds agent-specific, key-conditioned statistical signals directly into the token generation process, transforming the output text into a self-verifying execution record. At inference time, we recover a linearized execution trace from the final text via transition-aware statistical scoring. Experiments across diverse multi-agent coordination settings demonstrate that IET achieves accurate segment-level attribution and reliable transition recovery under identity removal, boundary corruption, and privacy-preserving redaction, while maintaining generation quality. These results show that embedding provenance into generation provides a practical and robust foundation for accountability in multi-agent language systems when execution metadata is unavailable.
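A key-conditioned statistical signal can be illustrated with a watermark-style construction: each agent's key deterministically biases which tokens it favors, and attribution scores a text segment against each key's token subset. This is an illustrative analogue of the idea, not IET's actual scheme:

```python
import hashlib

def green_set(agent_key: str, vocab, fraction=0.5):
    """Key-conditioned token subset: the agent's key seeds a hash that
    deterministically picks the tokens this agent favors during
    generation (watermark-style sketch)."""
    rank = lambda tok: int(hashlib.sha256((agent_key + tok).encode()).hexdigest(), 16)
    ranked = sorted(vocab, key=rank)
    return set(ranked[: int(len(ranked) * fraction)])

def attribute_segment(tokens, agent_keys, vocab):
    """Attribute a segment to the agent whose key-conditioned subset
    covers the most of its tokens (the statistical-scoring step)."""
    coverage = lambda key: sum(t in green_set(key, vocab) for t in tokens)
    return max(agent_keys, key=coverage)
```

Because the bias is purely statistical, attribution survives the removal of explicit agent identifiers: the verifier only needs the candidate keys and the final text.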
[386] Architecting Secure AI Agents: Perspectives on System-Level Defenses Against Indirect Prompt Injection Attacks
Chong Xiang, Drew Zagieboylo, Shaona Ghosh, Sanjay Kariyappa, Kai Greshake, Hanshen Xiao, Chaowei Xiao, G. Edward Suh
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2603.30016 returned HTTP 429 (rate limited).
[387] Improving Plan Execution Flexibility using Block-Substitution
Sabah Binte Noor, Fazlul Hasan Siddiqui
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2406.03091 returned HTTP 429 (rate limited).
[388] Improving Execution Concurrency in Partial-Order Plans via Block-Substitution
Sabah Binte Noor, Fazlul Hasan Siddiqui
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2406.18615 returned HTTP 429 (rate limited).
[389] Online design of dynamic networks
Duo Wang, Andrea Araldo, Mounim El Yacoubi
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2410.08875 returned HTTP 429 (rate limited).
[390] TeamMedAgents: Pareto-Efficient Multi-Agent Medical Reasoning Through Teamwork Theory
Pranav Pushkar Mishra, Mohammad Arvan, Mohan Zalake
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2508.08115 returned HTTP 429 (rate limited).
[391] FA-INR: Adaptive Implicit Neural Representations for Interpretable Exploration of Simulation Ensembles
Ziwei Li, Yuhan Duan, Tianyu Xiong, Yi-Tang Chen, Wei-Lun Chao, Han-Wei Shen
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2506.06858 returned HTTP 429 (rate limited).
[392] FERA: A Pose-Based Framework for Rule-Grounded Multimedia Decision Support with a Foil Fencing Case Study
Ziwen Chen, Zhong Wang
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2509.18527 returned HTTP 429 (rate limited).
[393] Denoising the Future: Top-p Distributions for Moving Through Time
Florian Andreas Marwitz, Ralf Möller, Magnus Bender, Marcel Gehrke
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2506.07578 returned HTTP 429 (rate limited).
[394] AI and Consciousness
Eric Schwitzgebel
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2510.09858 returned HTTP 429 (rate limited).
[395] Symbol Grounding in Neuro-Symbolic AI: A Gentle Introduction to Reasoning Shortcuts
Emanuele Marconato, Samuele Bortolotti, Emile van Krieken, Paolo Morettin, Elena Umili, Antonio Vergari, Efthymia Tsamoura, Andrea Passerini, Stefano Teso
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2510.14538 returned HTTP 429 (rate limited).
[396] MedBayes-Lite: Bayesian Uncertainty Quantification for Safe Clinical Decision Support
Elias Hossain, Md Mehedi Hasan Nipu, Maleeha Sheikh, Rajib Rana, Subash Neupane, Niloofar Yousefi
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2511.16625 returned HTTP 429 (rate limited).
[397] When AI Bends Metal: AI-Assisted Optimization of Design Parameters in Sheet Metal Forming
Ahmad Tarraf, Koutaiba Kassem-Manthey, Seyed Ali Mohammadi, Philipp Martin, Lukas Moj, Semih Burak, Enju Park, Christian Terboven, Felix Wolf
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2511.22302 returned HTTP 429 (rate limited).
[398] The Geometry of Thought: How Scale Restructures Reasoning In Large Language Models
Samuel Cyrenius Anderson
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2601.13358 returned HTTP 429 (rate limited).
[399] Learning Inter-Atomic Potentials without Explicit Equivariance
Ahmed A. Elhag, Arun Raja, Alex Morehead, Samuel M. Blau, Hongtao Zhao, Christian Tyrchan, Eva Nittinger, Garrett M. Morris, Michael M. Bronstein
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2510.00027 returned HTTP 429 (rate limited).
[400] Distilling LLM Reasoning into Graph of Concept Predictors
Ziyang Yu, Liang Zhao
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2602.03006 returned HTTP 429 (rate limited).
[401] Expressive Power of Implicit Models: Rich Equilibria and Test-Time Scaling
Jialin Liu, Lisang Ding, Stanley Osher, Wotao Yin
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion: unavailable (abstract could not be fetched).
Abstract: Failed to fetch summary for 2510.03638: HTTP 429 (arXiv API rate limit).
[402] Generative Data Transformation: From Mixed to Unified Data
Jiaqing Zhang, Mingjia Yin, Hao Wang, Yuxin Tian, Yuyang Ye, Yawen Li, Wei Guo, Yong Liu, Enhong Chen
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion: unavailable (abstract could not be fetched).
Abstract: Failed to fetch summary for 2602.22743: HTTP 429 (arXiv API rate limit).
[403] Man and machine: artificial intelligence and judicial decision making
Arthur Dyevre, Ahmad Shahvaroughi
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion: unavailable (abstract could not be fetched).
Abstract: Failed to fetch summary for 2603.19042: HTTP 429 (arXiv API rate limit).
[404] Empirical Comparison of Agent Communication Protocols for Task Orchestration
Ivan Dobrovolskyi
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion: unavailable (abstract could not be fetched).
Abstract: Failed to fetch summary for 2603.22823: HTTP 429 (arXiv API rate limit).
[405] ARROW: An Adaptive Rollout and Routing Method for Global Weather Forecasting
Jindong Tian, Yifei Ding, Ronghui Xu, Hao Miao, Chenjuan Guo, Bin Yang
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion: unavailable (abstract could not be fetched).
Abstract: Failed to fetch summary for 2510.09734: HTTP 429 (arXiv API rate limit).
[406] CoMaTrack: Competitive Multi-Agent Game-Theoretic Tracking with Vision-Language-Action Models
Youzhi Liu, Li Gao, Liu Liu, Mingyang Lv, Yang Cai
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion: unavailable (abstract could not be fetched).
Abstract: Failed to fetch summary for 2603.22846: HTTP 429 (arXiv API rate limit).
[407] UniAI-GraphRAG: Synergizing Ontology-Guided Extraction, Multi-Dimensional Clustering, and Dual-Channel Fusion for Robust Multi-Hop Reasoning
Jie Wang, Honghua Huang, Xi Ge, Jianhui Su, Wen Liu, Shiguo Lian
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion: unavailable (abstract could not be fetched).
Abstract: Failed to fetch summary for 2603.25152: HTTP 429 (arXiv API rate limit).
[408] A Semi-amortized Lifted Learning-to-Optimize Masked (SALLO-M) Transformer Model for Scalable and Generalizable Beamforming
Yubo Zhang, Xiao-Yang Liu, Xiaodong Wang
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion: unavailable (abstract could not be fetched).
Abstract: Failed to fetch summary for 2510.13077: HTTP 429 (arXiv API rate limit).
[409] Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills
Jingwei Ni, Yihao Liu, Xinpeng Liu, Yutao Sun, Mengyu Zhou, Pengyu Cheng, Dexin Wang, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion: unavailable (abstract could not be fetched).
Abstract: Failed to fetch summary for 2603.25158: HTTP 429 (arXiv API rate limit).
[410] Automated Algorithm Design for Auto-Tuning Optimizers
Floris-Jan Willemsen, Niki van Stein, Ben van Werkhoven
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion: unavailable (abstract could not be fetched).
Abstract: Failed to fetch summary for 2510.17899: HTTP 429 (arXiv API rate limit).
[411] ZeroFlood: Flood Hazard Mapping from Single-Modality SAR Using Geo-Foundation Models
Hyeongkyun Kim, Orestis Oikonomou
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion: unavailable (abstract could not be fetched).
Abstract: Failed to fetch summary for 2510.23364: HTTP 429 (arXiv API rate limit).
[412] Multiverse: Language-Conditioned Multi-Game Level Blending via Shared Representation
In-Chang Baek, Jiyun Jung, Geum-Hwan Hwang, Sung-Hyun Kim, Kyung-Joong Kim
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion: unavailable (abstract could not be fetched).
Abstract: Failed to fetch summary for 2603.26782: HTTP 429 (arXiv API rate limit).
[413] Compliance-Aware Predictive Process Monitoring: A Neuro-Symbolic Approach
Fabrizio De Santis, Gyunam Park, Wil M.P. van der Aalst, Francesco Zanichelli
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion: unavailable (abstract could not be fetched).
Abstract: Failed to fetch summary for 2603.26948: HTTP 429 (arXiv API rate limit).
[414] Let the Agent Steer: Closed-Loop Ranking Optimization via Influence Exchange
Yin Cheng, Liao Zhou, Xiyu Liang, Dihao Luo, Tewei Lee, Kailun Zheng, Weiwei Zhang, Mingchen Cai, Jian Dong, Andy Zhang
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion: unavailable (abstract could not be fetched).
Abstract: Failed to fetch summary for 2603.27765: HTTP 429 (arXiv API rate limit).
[415] A Multi-Agent Rhizomatic Pipeline for Non-Linear Literature Analysis
Julio C. Serrano, Joonas Kevari, Rumy Narayan
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion: unavailable (abstract could not be fetched).
Abstract: Failed to fetch summary for 2603.28336: HTTP 429 (arXiv API rate limit).
[416] Provably Extracting the Features from a General Superposition
Allen Liu
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion: unavailable (abstract could not be fetched).
Abstract: Failed to fetch summary for 2512.15987: HTTP 429 (arXiv API rate limit).
[417] LLM-Meta-SR: In-Context Learning for Evolving Selection Operators in Symbolic Regression
Hengzhe Zhang, Qi Chen, Bing Xue, Wolfgang Banzhaf, Mengjie Zhang
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion: unavailable (abstract could not be fetched).
Abstract: Failed to fetch summary for 2505.18602: HTTP 429 (arXiv API rate limit).
[418] Balancing Efficiency and Empathy: Healthcare Providers’ Perspectives on AI-Supported Workflows for Serious Illness Conversations in the Emergency Department
Menglin Zhao, Zhuorui Yong, Ruijia Guan, Kai-Wei Chang, Adrian Haimovich, Kei Ouchi, Timothy Bickmore, Zhan Zhang, Bingsheng Yao, Dakuo Wang, Smit Desai
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion: unavailable (abstract could not be fetched).
Abstract: Failed to fetch summary for 2506.00241: HTTP 429 (arXiv API rate limit).
[419] LaSM: Layer-wise Scaling Mechanism for Defending Pop-up Attack on GUI Agents
Zihe Yan, Zhuosheng Zhang, Jiaping Gui, Gongshen Liu
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion: unavailable (abstract could not be fetched).
Abstract: Failed to fetch summary for 2507.10610: HTTP 429 (arXiv API rate limit).
[420] Hellinger Multimodal Variational Autoencoders
Huyen Vo, Isabel Valera
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion: unavailable (abstract could not be fetched).
Abstract: Failed to fetch summary for 2601.06572: HTTP 429 (arXiv API rate limit).
[421] Generative Logic: A New Computer Architecture for Deterministic Reasoning and Knowledge Generation
Nikolai Sergeev
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion: unavailable (abstract could not be fetched).
Abstract: Failed to fetch summary for 2508.00017: HTTP 429 (arXiv API rate limit).
[422] Temporal Sepsis Modeling: a Relational and Explainable-by-Design Framework
Vincent Lemaire, Nédra Meloulli, Pierre Jaquet
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion: unavailable (abstract could not be fetched).
Abstract: Failed to fetch summary for 2601.21747: HTTP 429 (arXiv API rate limit).
[423] Generative AI on Wall Street – Opportunities and Risk Controls
Jackie Shen
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion: unavailable (abstract could not be fetched).
Abstract: Failed to fetch summary for 2509.05841: HTTP 429 (arXiv API rate limit).
[424] PAIR-Former: Budgeted Relational MIL for miRNA Target Prediction
Jiaqi Yin, Baiming Chen, Jia Fei, Mingjun Yang
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion: unavailable (abstract could not be fetched).
Abstract: Failed to fetch summary for 2602.00465: HTTP 429 (arXiv API rate limit).
[425] Incorporating LLM Embeddings for Variation Across the Human Genome
Hongqian Niu, Jordan Bryan, Jacob Williams, Hufeng Zhou, Haoyu Zhang, Xihao Li, Didong Li
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion: unavailable (abstract could not be fetched).
Abstract: Failed to fetch summary for 2509.20702: HTTP 429 (arXiv API rate limit).
[426] MSG: Multi-Stream Generative Policies for Sample-Efficient Robotic Manipulation
Jan Ole von Hartz, Lukas Schweizer, Joschka Boedecker, Abhinav Valada
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion: unavailable (abstract could not be fetched).
Abstract: Failed to fetch summary for 2509.24956: HTTP 429 (arXiv API rate limit).
[427] Past, Present, and Future of Bug Tracking in the Generative AI Era
Utku Boran Torun, Mehmet Taha Demircan, Mahmut Furkan Gön, Eray Tüzün
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion: unavailable (abstract could not be fetched).
Abstract: Failed to fetch summary for 2510.08005: HTTP 429 (arXiv API rate limit).
[428] Local Causal Discovery for Statistically Efficient Causal Inference
Mátyás Schubert, Tom Claassen, Sara Magliacane
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion: unavailable (abstract could not be fetched).
Abstract: Failed to fetch summary for 2510.14582: HTTP 429 (arXiv API rate limit).
[429] Towards High-Consistency Embodied World Model with Multi-View Trajectory Videos
Taiyi Su, Jian Zhu, Yaxuan Li, Chong Ma, Jianjun Zhang, Zitai Huang, Hanli Wang, Yi Xu
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion: unavailable (abstract could not be fetched).
Abstract: Failed to fetch summary for 2511.12882: HTTP 429 (arXiv API rate limit).
[430] Masked IRL: LLM-Guided Reward Disambiguation from Demonstrations and Language
Minyoung Hwang, Alexandra Forsey-Smerek, Nathaniel Dennler, Andreea Bobu
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion: unavailable (abstract could not be fetched).
Abstract: Failed to fetch summary for 2511.14565: HTTP 429 (arXiv API rate limit).
[431] DGPO: RL-Steered Graph Diffusion for Neural Architecture Generation
Aleksei Liuliakov, Luca Hermes, Barbara Hammer
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion: unavailable (abstract could not be fetched).
Abstract: Failed to fetch summary for 2602.19261: HTTP 429 (arXiv API rate limit).
[432] VLA Models Are More Generalizable Than You Think: Revisiting Physical and Spatial Modeling
Weiqi Li, Quande Zhang, Ruifeng Zhai, Liang Lin, Guangrun Wang
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion: unavailable (abstract could not be fetched).
Abstract: Failed to fetch summary for 2512.02902: HTTP 429 (arXiv API rate limit).
[433] Not All News Is Equal: Topic- and Event-Conditional Sentiment from Finetuned LLMs for Aluminum Price Forecasting
Alvaro Paredes Amorin, Andre Python, Christoph Weisser
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion: unavailable (abstract could not be fetched).
Abstract: Failed to fetch summary for 2603.09085: HTTP 429 (arXiv API rate limit).
[434] A Systematic Framework for Enterprise Knowledge Retrieval: Leveraging LLM-Generated Metadata to Enhance RAG Systems
Pranav Pushkar Mishra, Kranti Prakash Yeole, Ramyashree Keshavamurthy, Mokshit Bharat Surana, Fatemeh Sarayloo
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion: unavailable (abstract could not be fetched).
Abstract: Failed to fetch summary for 2512.05411: HTTP 429 (arXiv API rate limit).
[435] LeLaR: The First In-Orbit Demonstration of an AI-Based Satellite Attitude Controller
Kirill Djebko, Tom Baumann, Erik Dilger, Frank Puppe, Sergio Montenegro
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion: unavailable (abstract could not be fetched).
Abstract: Failed to fetch summary for 2512.19576: HTTP 429 (arXiv API rate limit).
[436] Real-Time Driver Safety Scoring Through Inverse Crash Probability Modeling
Joyjit Roy, Samaresh Kumar Singh, Sushanta Das
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion: unavailable (abstract could not be fetched).
Abstract: Failed to fetch summary for 2603.14841: HTTP 429 (arXiv API rate limit).
[437] Dynamic Cogeneration of Bug Reproduction Test in Agentic Program Repair
Runxiang Cheng, Michele Tufano, José Cambronero, Renyao Wei, Sherry Shi, Grant Uy, Pat Rondon, Franjo Ivančić
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion: unavailable (abstract could not be fetched).
Abstract: Failed to fetch summary for 2601.19066: HTTP 429 (arXiv API rate limit).
[438] Semantic Labeling for Third-Party Cybersecurity Risk Assessment: A Semi-Supervised Approach to Intent-Aware Question Retrieval
Ali Nour Eldin, Mohamed Sellami, Mehdi Acheli, Walid Gaaloul, Julien Steunou
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion: unavailable (abstract could not be fetched).
Abstract: Failed to fetch summary for 2602.10149: HTTP 429 (arXiv API rate limit).
[439] FedRG: Unleashing the Representation Geometry for Federated Learning with Noisy Clients
Tian Wen, Zhiqin Yang, Yonggang Zhang, Xuefeng Jiang, Hao Peng, Yuwei Wang, Bo Han
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion: unavailable (abstract could not be fetched).
Abstract: Failed to fetch summary for 2603.19722: HTTP 429 (arXiv API rate limit).
[440] Training for Technology: Adoption and Productive Use of Generative AI in Legal Analysis
Benjamin M. Chen, Hong Bao
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion: unavailable (abstract could not be fetched).
Abstract: Failed to fetch summary for 2603.04982: HTTP 429 (arXiv API rate limit).
[441] Beyond Hard Constraints: Budget-Conditioned Reachability For Safe Offline Reinforcement Learning
Janaka Chathuranga Brahmanage, Akshat Kumar
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request was rate limited (HTTP 429).
Details
Motivation, method, results, and conclusion: unavailable (abstract could not be fetched).
Abstract: Failed to fetch summary for 2603.22292: HTTP 429 (arXiv API rate limit).
[442] 100x Cost & Latency Reduction: Performance Analysis of AI Query Approximation using Lightweight Proxy Models
Yeounoh Chung, Rushabh Desai, Jian He, Yu Xiao, Thibaud Hottelier, Yves-Laurent Kom Samo, Pushkar Khadilkar, Xianshun Chen, Sam Idicula, Fatma Özcan, Alon Halevy, Yannis Papakonstantinou
Main category: cs.AI
Summary unavailable: the arXiv API request for 2603.15970 returned HTTP 429 (rate limited).
[443] Cost-Sensitive Neighborhood Aggregation for Heterophilous Graphs: When Does Per-Edge Routing Help?
Eyal Weiss
Main category: cs.AI
Summary unavailable: the arXiv API request for 2603.24291 returned HTTP 429 (rate limited).
[444] InCoder-32B: Code Foundation Model for Industrial Scenarios
Jian Yang, Wei Zhang, Jiajun Wu, Junhang Cheng, Shawn Guo, Haowen Wang, Weicheng Gu, Yaxin Du, Joseph Li, Fanglin Xu, Yizhi Li, Lin Jing, Yuanbo Wang, Yuhan Gao, Ruihao Gong, Chuan Hao, Ran Tao, Aishan Liu, Tuney Zheng, Ganqu Cui, Zhoujun Li, Mingjie Tang, Chenghua Lin, Wayne Xin Zhao, Xianglong Liu, Ming Zhou, Bryan Dai, Weifeng Lv
Main category: cs.AI
Summary unavailable: the arXiv API request for 2603.16790 returned HTTP 429 (rate limited).
[445] Cross-attentive Cohesive Subgraph Embedding to Mitigate Oversquashing in GNNs
Tanvir Hossain, Muhammad Ifte Khairul Islam, Lilia Chebbah, Charles Fanning, Esra Akbas
Main category: cs.AI
Summary unavailable: the arXiv API request for 2603.27529 returned HTTP 429 (rate limited).
[446] ITQ3_S: High-Fidelity 3-bit LLM Inference via Interleaved Ternary Quantization with Rotation-Domain Smoothing
Edward J. Yoon
Main category: cs.AI
Summary unavailable: the arXiv API request for 2603.27914 returned HTTP 429 (rate limited).
[447] ContractSkill: Repairable Contract-Based Skills for Multimodal Web Agents
Zijian Lu, Yiping Zuo, Yupeng Nie, Xin He, Weibei Fan, Lianyong Qi, Shi Jin
Main category: cs.AI
Summary unavailable: the arXiv API request for 2603.20340 returned HTTP 429 (rate limited).
[448] LLMON: An LLM-native Markup Language to Leverage Structure and Semantics at the LLM Interface
Michael Hind, Basel Shbita, Bo Wu, Farhan Ahmed, Chad DeLuca, Nathan Fulton, David Cox, Dan Gutfreund
Main category: cs.AI
Summary unavailable: the arXiv API request for 2603.22519 returned HTTP 429 (rate limited).
[449] KARMA: Knowledge-Action Regularized Multimodal Alignment for Personalized Search at Taobao
Zhi Sun, Wenming Zhang, Yi Wei, Liren Yu, Zhixuan Zhang, Dan Ou, Haihong Tang
Main category: cs.AI
Summary unavailable: the arXiv API request for 2603.22779 returned HTTP 429 (rate limited).
[450] Robust Safety Monitoring of Language Models via Activation Watermarking
Toluwani Aremu, Daniil Ognev, Samuele Poppi, Nils Lukas
Main category: cs.AI
Summary unavailable: the arXiv API request for 2603.23171 returned HTTP 429 (rate limited).
[451] Enes Causal Discovery
Alexis Kafantaris
Main category: cs.AI
Summary unavailable: the arXiv API request for 2603.24436 returned HTTP 429 (rate limited).
[452] Pseudo Label NCF for Sparse OHC Recommendation: Dual Representation Learning and the Separability Accuracy Trade off
Pronob Kumar Barman, Tera L. Reynolds, James Foulds
Main category: cs.AI
Summary unavailable: the arXiv API request for 2603.24750 returned HTTP 429 (rate limited).
[453] Heracles: Bridging Precise Tracking and Generative Synthesis for General Humanoid Control
Zelin Tao, Zeran Su, Peiran Liu, Jingkai Sun, Wenqiang Que, Jiahao Ma, Jialin Yu, Jiahang Cao, Pihai Sun, Hao Liang, Gang Han, Wen Zhao, Zhiyuan Xu, Jian Tang, Qiang Zhang, Yijie Guo
Main category: cs.AI
Summary unavailable: the arXiv API request for 2603.27756 returned HTTP 429 (rate limited).
[454] Building evidence-based knowledge graphs from full-text literature for disease-specific biomedical reasoning
Chang Zong, Sicheng Lv, Si-tu Xue, Huilin Zheng, Jian Wan, Lei Zhang
Main category: cs.AI
Summary unavailable: the arXiv API request for 2603.28325 returned HTTP 429 (rate limited).
[455] A Convex Route to Thermomechanics: Learning Internal Energy and Dissipation
Hagen Holthusen, Paul Steinmann, Ellen Kuhl
Main category: cs.AI
Summary unavailable: the arXiv API request for 2603.28707 returned HTTP 429 (rate limited).
cs.SD
[456] IQRA 2026: Interspeech Challenge on Automatic Assessment Pronunciation for Modern Standard Arabic (MSA)
Yassine El Kheir, Amit Meghanani, Mostafa Shahin, Omnia Ibrahim, Shammur Absar Chowdhury, Nada AlMarwani, Youssef Elshahawy, Ahmed Ali
Main category: cs.SD
TL;DR: The IQRA Interspeech Challenge on Arabic mispronunciation detection introduces new authentic mispronunciation data and shows significant performance improvements through diverse approaches including self-supervised learning and large audio-language models.
Details
Motivation: To advance automatic Mispronunciation Detection and Diagnosis (MDD) for Modern Standard Arabic by providing new authentic mispronunciation data and benchmarking diverse approaches to improve Arabic pronunciation assessment.
Method: Introduced the Iqra_Extra_IS26 dataset of authentic human mispronounced speech. Participants employed diverse approaches, including CTC-based self-supervised learning models, two-stage fine-tuning strategies, and large audio-language models.
Result: Substantial improvement of 0.28 in F1-score compared to first edition, demonstrating effectiveness of novel architectures and additional authentic mispronunciation data.
Conclusion: The results show growing maturity of Arabic MDD research and establish stronger foundation for future work in Arabic pronunciation assessment.
Abstract: We present the findings of the second edition of the IQRA Interspeech Challenge, a challenge on automatic Mispronunciation Detection and Diagnosis (MDD) for Modern Standard Arabic (MSA). Building on the previous edition, this iteration introduces Iqra_Extra_IS26, a new dataset of authentic human mispronounced speech, complementing the existing training and evaluation resources. Submitted systems employed a diverse range of approaches, spanning CTC-based self-supervised learning models, two-stage fine-tuning strategies, and using large audio-language models. Compared to the first edition, we observe a substantial jump of 0.28 in F1-score, attributable both to novel architectures and modeling strategies proposed by participants and to the additional authentic mispronunciation data made available. These results demonstrate the growing maturity of Arabic MDD research and establish a stronger foundation for future work in Arabic pronunciation assessment.
[457] Audio Hallucination Attacks: Probing the Reliability of Large Audio Language Models
Ashish Seth, Sonal Kumar, Ramaneswaran Selvakumar, Nishit Anand, Utkarsh Tyagi, Prem Seetharaman, Ramani Duraiswami, Dinesh Manocha
Main category: cs.SD
TL;DR: Audio Hallucination Attacks (AHA) framework exposes reliability gaps in Large Audio Language Models by testing if they genuinely ground responses in audio input, achieving high attack success rates on state-of-the-art models.
Details
Motivation: While Large Audio Language Models perform well on standard benchmarks, their reliability in real-world settings remains underexplored, particularly whether they genuinely ground responses in audio input or hallucinate based on question patterns.
Method: Introduces the AHA-Eval attack suite with 6.5K QA pairs targeting two attack surfaces: 1) query-based attacks exploiting question structure to induce hallucinations about absent sounds, and 2) audio-based attacks injecting synthetic speech describing non-existent events. Evaluates state-of-the-art LALMs and proposes AHA-Guard, a 120K QA post-alignment dataset for mitigation.
Result: State-of-the-art LALMs like Audio Flamingo 3 and Gemini 3 Pro show high attack success rates of 95.35% and 79.65% respectively, revealing reliability gaps hidden by standard benchmarks. AHA-Guard reduces attack success rates by up to 49%.
Conclusion: Current LALMs have significant reliability issues where they hallucinate rather than ground responses in audio input. The proposed AHA framework exposes these vulnerabilities, and AHA-Guard demonstrates effective mitigation through targeted post-alignment training.
Abstract: Large Audio Language Models (LALMs) achieve strong performance on audio-language tasks; however, their reliability in real-world settings remains underexplored. We introduce Audio Hallucination Attacks (AHA), an attack suite called AHA-Eval, comprising 6.5K QA pairs designed to test whether LALMs genuinely ground their responses in the audio input. AHA targets two attack surfaces: (i) query-based attacks, which exploit question structure to induce hallucinations about absent sounds, and (ii) audio-based attacks, which inject synthetic speech describing non-existent events into the audio stream. Evaluating state-of-the-art LALMs, including Audio Flamingo 3 and Gemini 3 Pro, we observe high attack success rates of 95.35% and 79.65%, respectively, revealing a reliability gap that is hidden by standard benchmark performance. To mitigate this, we propose a 120K QA post-alignment dataset, AHA-Guard, which successfully reduces attack success rates by up to 49%.
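As a point of reference, the attack success rate reported above is simply the fraction of adversarial probes on which the model hallucinates; a minimal sketch follows (the function name and boolean encoding are illustrative, not from the paper):

```python
def attack_success_rate(hallucinated):
    """Percentage of adversarial QA probes on which the model answered
    as if the absent or injected audio event were real."""
    return 100.0 * sum(hallucinated) / len(hallucinated)

# 3 of 4 probes induce a hallucination -> 75% attack success rate
print(attack_success_rate([True, True, False, True]))  # 75.0
```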
[458] Real-Time Band-Grouped Vocal Denoising Using Sigmoid-Driven Ideal Ratio Masking
Daniel Williams
Main category: cs.SD
TL;DR: A low-latency deep learning model for real-time vocal denoising using sigmoid-driven ideal ratio mask with spectral loss and band-grouped encoder-decoder architecture with frequency attention.
Details
Motivation: Existing deep learning approaches for vocal denoising have high latency and require long context frames, making them unsuitable for live applications. There is a need for real-time solutions that preserve voice naturalness while improving SNR.
Method: Proposes a sigmoid-driven ideal ratio mask trained with a spectral loss to increase SNR and maximize perceptual quality. Uses a band-grouped encoder-decoder architecture with frequency attention to achieve low latency (<10 ms).
Result: Achieves total latency of less than 10ms with PESQ-WB improvements of 0.21 on stationary noise and 0.12 on nonstationary noise.
Conclusion: The proposed model successfully addresses latency challenges in real-time vocal denoising while maintaining good perceptual quality and SNR improvements.
Abstract: Real-time, deep learning-based vocal denoising has seen significant progress over the past few years, demonstrating the capability of artificial intelligence in preserving the naturalness of the voice while increasing the signal-to-noise ratio (SNR). However, many deep learning approaches have high amounts of latency and require long frames of context, making them difficult to configure for live applications. To address these challenges, we propose a sigmoid-driven ideal ratio mask trained with a spectral loss to encourage an increased SNR and maximized perceptual quality of the voice. The proposed model uses a band-grouped encoder-decoder architecture with frequency attention and achieves a total latency of less than 10 ms, with PESQ-WB improvements of 0.21 on stationary noise and 0.12 on nonstationary noise.
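For readers unfamiliar with ratio masking, the general idea behind a sigmoid-driven ratio mask can be sketched in NumPy as follows; this is a generic illustration under assumed shapes, not the paper's band-grouped model:

```python
import numpy as np

def apply_ratio_mask(noisy_stft, mask_logits):
    """Element-wise ratio masking: a sigmoid bounds each time-frequency
    gain to (0, 1), so the mask attenuates but never amplifies a bin."""
    mask = 1.0 / (1.0 + np.exp(-mask_logits))
    return mask * noisy_stft

# Toy STFT: 4 frequency bins x 3 frames, with random mask logits
rng = np.random.default_rng(0)
noisy = rng.standard_normal((4, 3)) + 1j * rng.standard_normal((4, 3))
denoised = apply_ratio_mask(noisy, rng.standard_normal((4, 3)))
assert np.all(np.abs(denoised) <= np.abs(noisy))
```

In a real pipeline the logits would come from the network and the masked STFT would be inverted back to a waveform.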
[459] LongCat-AudioDiT: High-Fidelity Diffusion Text-to-Speech in the Waveform Latent Space
Detai Xin, Shujie Hu, Chengzuo Yang, Chen Huang, Guoqiao Yu, Guanglu Wan, Xunliang Cai
Main category: cs.SD
TL;DR: LongCat-AudioDiT is a non-autoregressive diffusion TTS model operating directly in waveform latent space, achieving SOTA zero-shot voice cloning with simplified pipeline and novel inference improvements.
Details
Motivation: To simplify TTS pipelines by eliminating intermediate acoustic representations such as mel-spectrograms, reduce compounding errors, and improve zero-shot voice cloning performance through direct waveform latent space modeling.
Method: Uses a waveform variational autoencoder (Wav-VAE) for latent representation and a diffusion backbone for generation, with two key inference improvements: fixing a training-inference mismatch and replacing classifier-free guidance with adaptive projection guidance.
Result: Achieves SOTA zero-shot voice cloning on Seed benchmark (SIM scores: 0.818 on Seed-ZH vs 0.809 previous, 0.797 on Seed-Hard vs 0.776 previous), with competitive intelligibility despite simpler pipeline.
Conclusion: Direct waveform latent space modeling with simplified architecture can achieve superior TTS performance, and counterintuitively, better Wav-VAE reconstruction doesn’t necessarily improve overall TTS quality.
Abstract: We present LongCat-AudioDiT, a novel, non-autoregressive diffusion-based text-to-speech (TTS) model that achieves state-of-the-art (SOTA) performance. Unlike previous methods that rely on intermediate acoustic representations such as mel-spectrograms, the core innovation of LongCat-AudioDiT lies in operating directly within the waveform latent space. This approach effectively mitigates compounding errors and drastically simplifies the TTS pipeline, requiring only a waveform variational autoencoder (Wav-VAE) and a diffusion backbone. Furthermore, we introduce two critical improvements to the inference process: first, we identify and rectify a long-standing training-inference mismatch; second, we replace traditional classifier-free guidance with adaptive projection guidance to elevate generation quality. Experimental results demonstrate that, despite the absence of complex multi-stage training pipelines or high-quality human-annotated datasets, LongCat-AudioDiT achieves SOTA zero-shot voice cloning performance on the Seed benchmark while maintaining competitive intelligibility. Specifically, our largest variant, LongCat-AudioDiT-3.5B, outperforms the previous SOTA model (Seed-TTS), improving the speaker similarity (SIM) scores from 0.809 to 0.818 on Seed-ZH, and from 0.776 to 0.797 on Seed-Hard. Finally, through comprehensive ablation studies and systematic analysis, we validate the effectiveness of our proposed modules. Notably, we investigate the interplay between the Wav-VAE and the TTS backbone, revealing the counterintuitive finding that superior reconstruction fidelity in the Wav-VAE does not necessarily lead to better overall TTS performance. Code and model weights are released to foster further research within the speech community.
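For context, the classifier-free guidance that the paper replaces is a one-line extrapolation between unconditional and conditional model predictions; a minimal sketch of that standard baseline (not the paper's adaptive projection guidance):

```python
import numpy as np

def classifier_free_guidance(eps_uncond, eps_cond, w):
    """Standard CFG: extrapolate from the unconditional prediction toward
    the conditional one; w = 1 recovers the plain conditional output."""
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_u = np.array([0.0, 1.0])  # unconditional noise prediction
eps_c = np.array([1.0, 1.0])  # text-conditioned noise prediction
print(classifier_free_guidance(eps_u, eps_c, w=2.0))  # [2. 1.]
```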
[460] A Comprehensive Corpus of Biomechanically Constrained Piano Chords: Generation, Analysis, and Implications for Voicing and Psychoacoustics
Mahesh Ramani
Main category: cs.SD
Summary unavailable: the arXiv API request for 2603.29710 returned HTTP 429 (rate limited).
[461] SIREN: Spatially-Informed Reconstruction of Binaural Audio with Vision
Mingyeong Song, Seoyeon Ko, Junhyug Noh
Main category: cs.SD
TL;DR: SIREN is a visually-guided framework that converts monaural audio to binaural audio using visual cues, with explicit left/right channel prediction and novel attention mechanisms.
Details
Motivation: Most consumer videos have monaural audio due to capture constraints, lacking the spatial immersion of binaural audio. There is a need to convert existing monaural content to binaural format using visual guidance.
Method: Uses a ViT-based encoder with dual-head self-attention to produce a shared scene map and L/R attention, replacing hand-crafted masks. Incorporates a soft, annealed spatial prior for early L/R grounding, and a two-stage, confidence-weighted waveform-domain fusion guided by mono reconstruction and interaural phase consistency.
Result: Consistent gains on time-frequency and phase-sensitive metrics with competitive SNR on FAIR-Play and MUSIC-Stereo datasets. Modular design requires no task-specific annotations and integrates with standard audio-visual pipelines.
Conclusion: SIREN provides an effective framework for visually-guided mono-to-binaural conversion with explicit channel prediction and attention mechanisms, offering improved spatial audio quality without specialized annotations.
Abstract: Binaural audio delivers spatial cues essential for immersion, yet most consumer videos are monaural due to capture constraints. We introduce SIREN, a visually guided mono to binaural framework that explicitly predicts left and right channels. A ViT-based encoder learns dual-head self-attention to produce a shared scene map and end-to-end L/R attention, replacing hand-crafted masks. A soft, annealed spatial prior gently biases early L/R grounding, and a two-stage, confidence-weighted waveform-domain fusion (guided by mono reconstruction and interaural phase consistency) suppresses crosstalk when aggregating multi-crop and overlapping windows. Evaluated on FAIR-Play and MUSIC-Stereo, SIREN yields consistent gains on time-frequency and phase-sensitive metrics with competitive SNR. The design is modular and generic, requires no task-specific annotations, and integrates with standard audio-visual pipelines.
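The interaural phase consistency mentioned above compares the phase difference between the two channels against a reference pair; a hypothetical toy version of such a penalty (illustrative only, not the paper's loss) might look like:

```python
import numpy as np

def interaural_phase_error(pred_l, pred_r, ref_l, ref_r):
    """Mean absolute interaural phase difference (IPD) error between a
    predicted L/R STFT pair and a reference pair, wrapped to (-pi, pi]."""
    ipd_pred = np.angle(pred_l * np.conj(pred_r))
    ipd_ref = np.angle(ref_l * np.conj(ref_r))
    err = np.angle(np.exp(1j * (ipd_pred - ipd_ref)))  # wrap the difference
    return float(np.mean(np.abs(err)))

rng = np.random.default_rng(1)
l = rng.standard_normal((4, 3)) + 1j * rng.standard_normal((4, 3))
r = rng.standard_normal((4, 3)) + 1j * rng.standard_normal((4, 3))
assert interaural_phase_error(l, r, l, r) < 1e-12  # identical pair -> zero error
```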
[462] Audio Language Model for Deepfake Detection Grounded in Acoustic Chain-of-Thought
Runkun Chen, Yixiong Fang, Pengyu Chang, Yuante Li, Massa Baali, Bhiksha Raj
Main category: cs.SD
Summary unavailable: the arXiv API request for 2603.28021 returned HTTP 429 (rate limited).
cs.LG
[463] OneComp: One-Line Revolution for Generative AI Model Compression
Yuma Ichikawa, Keiji Kimura, Akihiro Yoshida, Yudai Fujimoto, Hiroki Tokura, Yamato Arai, Yoshiyuki Ishii, Yusei Kawakami, Genki Shikada, Achille Jacquemond, Yoshihiko Fujisawa, Katsuki Fujisawa, Takumi Honda, Akira Sakai
Main category: cs.LG
TL;DR: OneComp is an open-source compression framework that automates post-training quantization for foundation models, providing hardware-aware mixed-precision assignments and progressive refinement stages to reduce memory footprint and latency while maintaining performance.
Details
Motivation: Deploying foundation models faces practical constraints from memory footprint, latency, and hardware costs. Post-training compression can help, but implementation is challenging due to fragmented quantization algorithms, precision budgets, calibration strategies, and hardware dependencies.
Method: OneComp automatically inspects models, plans mixed-precision assignments, and executes progressive quantization stages (layer-wise compression, block-wise refinement, global refinement). It treats the first quantized checkpoint as a deployable pivot, ensuring each subsequent stage improves the same model.
Result: OneComp bridges the gap between algorithmic innovation and production-grade model deployment by converting state-of-the-art compression research into an extensible, open-source, hardware-aware pipeline.
Conclusion: OneComp transforms the expert compression workflow into a reproducible, resource-adaptive pipeline that enables practical deployment of foundation models by automating quantization while ensuring quality improves with additional compute investment.
Abstract: Deploying foundation models is increasingly constrained by memory footprint, latency, and hardware costs. Post-training compression can mitigate these bottlenecks by reducing the precision of model parameters without significantly degrading performance; however, its practical implementation remains challenging as practitioners navigate a fragmented landscape of quantization algorithms, precision budgets, data-driven calibration strategies, and hardware-dependent execution regimes. We present OneComp, an open-source compression framework that transforms this expert workflow into a reproducible, resource-adaptive pipeline. Given a model identifier and available hardware, OneComp automatically inspects the model, plans mixed-precision assignments, and executes progressive quantization stages, ranging from layer-wise compression to block-wise refinement and global refinement. A key architectural choice is treating the first quantized checkpoint as a deployable pivot, ensuring that each subsequent stage improves the same model and that quality increases as more compute is invested. By converting state-of-the-art compression research into an extensible, open-source, hardware-aware pipeline, OneComp bridges the gap between algorithmic innovation and production-grade model deployment.
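To make the layer-wise compression stage concrete, here is a minimal sketch of symmetric per-channel int8 weight quantization, one of the basic building blocks such pipelines automate; the function names are illustrative and not part of OneComp's API:

```python
import numpy as np

def quantize_per_channel_int8(w):
    """Symmetric int8 quantization with one scale per output channel,
    so a single outlier row does not degrade every other row."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0.0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([[0.5, -1.0, 0.25], [10.0, -20.0, 5.0]], dtype=np.float32)
q, s = quantize_per_channel_int8(w)
# Round-trip error is at most half a quantization step per channel
assert np.all(np.abs(w - dequantize(q, s)) <= s / 2 + 1e-6)
```

A production pipeline adds calibration data, mixed-precision planning, and refinement passes on top of this primitive.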
[464] Structural Pass Analysis in Football: Learning Pass Archetypes and Tactical Impact from Spatio-Temporal Tracking Data
Oktay Karakuş, Hasan Arkadaş
Main category: cs.LG
TL;DR: A framework for analyzing football passes based on their interaction with defensive structure using tracking data, with metrics quantifying how passes alter defender configurations and influence tactical progression.
Details
Motivation: Existing approaches evaluate passes primarily through outcome-based metrics such as scoring probability or possession value, providing limited insight into how passes influence the opponent's defensive organization. There is a need for structural analysis of how passes disrupt defensive configurations.
Method: Uses synchronized tracking/event data to derive three structural metrics: Line Bypass Score, Space Gain Metric, and Structural Disruption Index. These are combined into a composite Tactical Impact Value (TIV). Applied to 2022 FIFA World Cup data with unsupervised clustering to identify pass archetypes.
Result: Identified four interpretable pass archetypes: circulatory, destabilizing, line-breaking, and space-expanding passes. Passes with higher TIV are significantly more likely to lead to territorial progression. Revealed distinctive structural passing styles across teams and identified build-up defenders as key drivers of structural progression.
Conclusion: The proposed framework demonstrates how structural representations from tracking data can reveal interpretable tactical patterns in football, moving beyond outcome-based metrics to understand how passes influence defensive organization.
Abstract: The increasing availability of spatio-temporal tracking data has created new opportunities for analysing tactical behaviour in football. However, many existing approaches evaluate passes primarily through outcome-based metrics such as scoring probability or possession value, providing limited insight into how passes influence the defensive organisation of the opponent. This paper introduces a structural framework for analysing football passes based on their interaction with defensive structure. Using synchronised tracking/event data, we derive three complementary structural metrics, Line Bypass Score, Space Gain Metric, and Structural Disruption Index, that quantify how passes alter the spatial configuration of defenders. These metrics are combined into a composite measure termed Tactical Impact Value (TIV), which captures the structural influence of individual passes. Using tracking and event data from the 2022 FIFA World Cup, we analyse structural passing behaviour across multiple tactical levels. Unsupervised clustering of structural features reveals four interpretable pass archetypes: circulatory, destabilising, line-breaking, and space-expanding passes. Empirical results show that passes with higher TIV are significantly more likely to lead to territorial progression, particularly entries into the final third and penalty box. Spatial, team-level analyses further reveal distinctive structural passing styles across teams, while player-level analysis highlights the role of build-up defenders as key drivers of structural progression. In addition, analysing passer-receiver interactions identifies structurally impactful passing partnerships that amplify tactical progression within teams. Overall, the proposed framework demonstrates how structural representations derived from tracking data can reveal interpretable tactical patterns in football.
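As an illustration of the kind of quantity a Line Bypass Score captures, the toy function below counts defenders a forward pass travels past along the attacking axis; this simplified proxy is hypothetical, not the paper's definition:

```python
def defenders_bypassed(pass_start_x, pass_end_x, defender_xs):
    """Count defenders whose pitch x-coordinate lies strictly between the
    pass origin and destination, for a forward (left-to-right) pass."""
    if pass_end_x <= pass_start_x:  # backward or lateral pass bypasses no one
        return 0
    return sum(pass_start_x < x < pass_end_x for x in defender_xs)

# A pass from x=30 to x=60 travels past the two midfielders at x=45 and x=46
print(defenders_bypassed(30, 60, [45, 46, 70, 20]))  # 2
```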
[465] Beta-Scheduling: Momentum from Critical Damping as a Diagnostic and Correction Tool for Neural Network Training
Ivan Pasichnyk
Main category: cs.LG
TL;DR: Proposes a physics-inspired time-varying momentum schedule derived from critically damped harmonic oscillator, achieving faster convergence and enabling cross-optimizer invariant gradient attribution for identifying problematic network layers.
Details
Motivation: Standard neural network training uses constant momentum (typically 0.9) with limited theoretical justification. The paper aims to develop a principled, parameter-free momentum schedule that improves convergence and provides diagnostic capabilities for identifying problematic network layers.
Method: Derives a time-varying momentum schedule from the critically damped harmonic oscillator: μ(t) = 1 - 2√α(t), where α(t) is the current learning rate. This beta-schedule requires no additional parameters beyond the existing learning rate schedule. Also proposes a hybrid approach combining physics-derived momentum for early convergence and constant momentum for final refinement.
Result: On ResNet-18/CIFAR-10: 1.9x faster convergence to 90% accuracy; gradient attribution produces cross-optimizer invariant diagnostic identifying same three problem layers regardless of optimizer (100% overlap); surgical correction of only these layers fixes 62 misclassifications while retraining only 18% of parameters; hybrid schedule reaches 95% accuracy fastest among five methods tested.
Conclusion: Main contribution is not accuracy improvement but a principled, parameter-free tool for localizing and correcting specific failure modes in trained networks. The physics-inspired momentum schedule provides both faster convergence and valuable diagnostic capabilities for network analysis.
Abstract: Standard neural network training uses constant momentum (typically 0.9), a convention dating to 1964 with limited theoretical justification for its optimality. We derive a time-varying momentum schedule from the critically damped harmonic oscillator: $\mu(t) = 1 - 2\sqrt{\alpha(t)}$, where $\alpha(t)$ is the current learning rate. This beta-schedule requires zero free parameters beyond the existing learning rate schedule. On ResNet-18/CIFAR-10, beta-scheduling delivers 1.9x faster convergence to 90% accuracy compared to constant momentum. More importantly, the per-layer gradient attribution under this schedule produces a cross-optimizer invariant diagnostic: the same three problem layers are identified regardless of whether the model was trained with SGD or Adam (100% overlap). Surgical correction of only these layers fixes 62 misclassifications while retraining only 18% of parameters. A hybrid schedule – physics momentum for fast early convergence, then constant momentum for the final refinement – reaches 95% accuracy fastest among five methods tested. The main contribution is not an accuracy improvement but a principled, parameter-free tool for localizing and correcting specific failure modes in trained networks.
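The schedule is fully specified by the abstract's formula, so it can be sketched directly; the cosine learning-rate decay below is an assumed example schedule, not something the paper prescribes.

```python
import math

def beta_momentum(alpha):
    """Critically damped momentum schedule from the abstract:
    mu(t) = 1 - 2*sqrt(alpha(t)), where alpha(t) is the current
    learning rate. Nonnegative momentum requires alpha <= 0.25.
    """
    if alpha > 0.25:
        raise ValueError("schedule needs alpha <= 0.25 for mu >= 0")
    return 1.0 - 2.0 * math.sqrt(alpha)

def cosine_lr(t, total, lr_max=0.1, lr_min=1e-4):
    """Assumed example: a standard cosine learning-rate decay, which
    under beta_momentum induces a momentum that rises over training."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / total))
```

Note the coupling: as the learning rate decays, momentum increases toward 1, so the schedule needs no parameters of its own.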
[466] A Neural Tension Operator for Curve Subdivision across Constant Curvature Geometries
Hassan Ugail, Newton Howard
Main category: cs.LG
TL;DR: Learned interpolatory subdivision scheme with per-edge tension prediction using a single neural network that works across Euclidean, spherical, and hyperbolic geometries without architectural changes.
Details
Motivation: Classical subdivision schemes use a single global tension parameter and require separate formulations for different geometries (Euclidean, spherical, hyperbolic). The authors aim to create a unified approach with adaptive per-edge tension prediction that works across all three spaces.
Method: A 140K-parameter neural network predicts per-edge insertion angles using local intrinsic features and trainable geometry embeddings. The network drives geometry-specific insertion operators across all three spaces without architectural modification. A constrained sigmoid output enforces structural safety bounds.
Result: On 240 validation curves, the learned predictor achieves lower bending energy and angular roughness than fixed-tension and manifold-lift baselines. On out-of-distribution ISS orbital ground-track example, bending energy reduced by 41% and angular roughness by 68% with modest Hausdorff distance increase.
Conclusion: The learned per-edge tension predictor provides a unified, adaptive subdivision approach across multiple geometries, offering improved smoothness while maintaining fidelity, with theoretical guarantees for safety and convergence.
Abstract: Interpolatory subdivision schemes generate smooth curves from piecewise-linear control polygons by repeatedly inserting new vertices. Classical schemes rely on a single global tension parameter and typically require separate formulations in Euclidean, spherical, and hyperbolic geometries. We introduce a shared learned tension predictor that replaces the global parameter with per-edge insertion angles predicted by a single 140K-parameter network. The network takes local intrinsic features and a trainable geometry embedding as input, and the predicted angles drive geometry-specific insertion operators across all three spaces without architectural modification. A constrained sigmoid output head enforces a structural safety bound, guaranteeing that every inserted vertex lies within a valid angular range for any finite weight configuration. Three theoretical results accompany the method: a structural guarantee of tangent-safe insertions; a heuristic motivation for per-edge adaptivity; and a conditional convergence certificate for continuously differentiable limit curves, subject to an explicit Lipschitz constraint verified post hoc. On 240 held-out validation curves, the learned predictor occupies a distinct position on the fidelity–smoothness Pareto frontier, achieving markedly lower bending energy and angular roughness than all fixed-tension and manifold-lift baselines. Riemannian manifold lifts retain a pointwise-fidelity advantage, which this study quantifies directly. On the out-of-distribution ISS orbital ground-track example, bending energy falls by 41% and angular roughness by 68% with only a modest increase in Hausdorff distance, suggesting that the predictor generalises beyond its synthetic training distribution.
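For context, the kind of fixed-tension baseline the learned predictor replaces can be sketched with the classical Dyn-Levin-Gregory four-point rule, a standard interpolatory scheme with a single global tension w. This is the classical baseline family, not the paper's method.

```python
def four_point_subdivide(points, w=1/16, steps=1):
    """Classical interpolatory subdivision with one global tension w
    (Dyn-Levin-Gregory four-point rule): the new vertex between p1
    and p2 is (1/2 + w)(p1 + p2) - w(p0 + p3), with w = 1/16 the
    standard choice. Endpoints are clamped for open polygons. This
    is the fixed-tension baseline the paper's learned per-edge
    scheme generalises, not the paper's method.
    """
    for _ in range(steps):
        refined = [points[0]]
        for i in range(len(points) - 1):
            p0 = points[max(i - 1, 0)]
            p1, p2 = points[i], points[i + 1]
            p3 = points[min(i + 2, len(points) - 1)]
            q = tuple((0.5 + w) * (a + b) - w * (c + d)
                      for a, b, c, d in zip(p1, p2, p0, p3))
            refined += [q, points[i + 1]]
        points = refined
    return points
```

On a straight control polygon the inserted vertices stay on the line, which is the kind of structural sanity property the paper's tangent-safety bound generalises.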
[467] Foundations of Polar Linear Algebra
Giovanni Guasti
Main category: cs.LG
TL;DR: Polar Linear Algebra framework for operator learning using spectral decomposition with polar geometry, combining linear radial and periodic angular components for improved stability, interpretability, and parallelization.
Details
Motivation: To provide a new spectral perspective on operator learning that offers better interpretability, stability, and parallelization capabilities compared to traditional spatial approaches.
Method: Introduces Polar Linear Algebra framework based on polar geometry with linear radial and periodic angular components. Defines associated operators and analyzes their spectral properties, imposing self-adjoint-inspired spectral constraints for stability.
Result: Demonstrates reliable training of polar and fully spectral operators on MNIST benchmark. Shows improved stability and convergence with spectral constraints, reduced parameter count and computational complexity, and more interpretable decoupled spectral modes.
Conclusion: The polar spectral framework offers a novel conceptual approach to operator learning that enables better interpretability, stability, and parallelization, particularly suited for problems where spectral structure and parallel execution are important.
Abstract: This work revisits operator learning from a spectral perspective by introducing Polar Linear Algebra, a structured framework based on polar geometry that combines a linear radial component with a periodic angular component. Starting from this formulation, we define the associated operators and analyze their spectral properties. As a proof of feasibility, the framework is evaluated on a canonical benchmark (MNIST). Despite the simplicity of the task, the results demonstrate that polar and fully spectral operators can be trained reliably, and that imposing self-adjoint-inspired spectral constraints improves stability and convergence. Beyond accuracy, the proposed formulation leads to a reduction in parameter count and computational complexity, while providing a more interpretable representation in terms of decoupled spectral modes. By moving from a spatial to a spectral domain, the problem decomposes into orthogonal eigenmodes that can be treated as independent computational pipelines. This structure naturally exposes an additional dimension of model parallelization, complementing existing parallel strategies without relying on ad-hoc partitioning. Overall, the work offers a different conceptual lens for operator learning, particularly suited to problems where spectral structure and parallel execution are central.
[468] \texttt{ReproMIA}: A Comprehensive Analysis of Model Reprogramming for Proactive Membership Inference Attacks
Chihan Huang, Huaijin Wang, Shuai Wang
Main category: cs.LG
TL;DR: ReproMIA is a novel membership inference attack framework that uses model reprogramming to amplify privacy leakage signals, achieving state-of-the-art performance across LLMs, diffusion models, and classification models.
Details
Motivation: Deep learning models' data memorization raises privacy concerns, but traditional membership inference attacks (MIAs) face computational costs from shadow model training and performance degradation under low false positive rate constraints.
Method: Leverages model reprogramming as an active signal amplifier for privacy leakage, creating a unified proactive framework that induces and magnifies latent privacy footprints in model representations across diverse architectures.
Result: Outperforms state-of-the-art baselines across 10+ benchmarks, achieving significant improvements in low-FPR regimes: 5.25% AUC and 10.68% TPR@1%FPR increase for LLMs, and 3.70% and 12.40% respectively for diffusion models.
Conclusion: ReproMIA provides an efficient, transformative approach to membership inference that overcomes computational and performance limitations of traditional methods, demonstrating strong privacy auditing capabilities across diverse model types.
Abstract: The pervasive deployment of deep learning models across critical domains has concurrently intensified privacy concerns due to their inherent propensity for data memorization. While Membership Inference Attacks (MIAs) serve as the gold standard for auditing these privacy vulnerabilities, conventional MIA paradigms are increasingly constrained by the prohibitive computational costs of shadow model training and a precipitous performance degradation under low False Positive Rate constraints. To overcome these challenges, we introduce a novel perspective by leveraging the principles of model reprogramming as an active signal amplifier for privacy leakage. Building upon this insight, we present \texttt{ReproMIA}, a unified and efficient proactive framework for membership inference. We rigorously substantiate, both theoretically and empirically, how our methodology proactively induces and magnifies latent privacy footprints embedded within the model’s representations. We provide specialized instantiations of \texttt{ReproMIA} across diverse architectural paradigms, including LLMs, Diffusion Models, and Classification Models. Comprehensive experimental evaluations across more than ten benchmarks and a variety of model architectures demonstrate that \texttt{ReproMIA} consistently and substantially outperforms existing state-of-the-art baselines, achieving a transformative leap in performance specifically within low-FPR regimes, such as an average of 5.25% AUC and 10.68% TPR@1%FPR increase over the runner-up for LLMs, as well as 3.70% and 12.40% respectively for Diffusion Models.
[469] Differentiable Initialization-Accelerated CPU-GPU Hybrid Combinatorial Scheduling
Mingju Liu, Jiaqi Yin, Alvaro Velasquez, Cunxi Yu
Main category: cs.LG
TL;DR: Hybrid CPU-GPU framework combining differentiable optimization with classical ILP solvers for combinatorial scheduling problems, achieving 10× speedup and <0.1% optimality gap.
Details
Motivation: Combinatorial scheduling problems are NP-hard and challenging to solve optimally at scale. Current ILP solvers struggle with large-scale scheduling problems, creating a need for more efficient approaches that can leverage modern hardware and machine learning techniques.
Method: Uses differentiable presolving on GPU to generate high-quality partial solutions, which serve as warm-starts for commercial ILP solvers (CPLEX, Gurobi) and open-source solver HiGHS. Combines differentiable optimization with classical ILP solving in a hybrid CPU-GPU framework.
Result: Achieves up to 10× performance gain over baseline solvers, narrowing optimality gap to <0.1% across industry-scale benchmarks. Enables significantly improved early pruning compared to state-of-the-art standalone solvers.
Conclusion: First demonstration of using differentiable optimization to initialize exact ILP solvers for combinatorial scheduling, opening opportunities to integrate machine learning infrastructure with classical exact optimization methods across broader domains.
Abstract: This paper presents a hybrid CPU-GPU framework for solving combinatorial scheduling problems formulated as Integer Linear Programming (ILP). While scheduling underpins many optimization tasks in computing systems, solving these problems optimally at scale remains a long-standing challenge due to their NP-hard nature. We introduce a novel approach that combines differentiable optimization with classical ILP solving. Specifically, we utilize differentiable presolving to rapidly generate high-quality partial solutions, which serve as warm-starts for commercial ILP solvers (CPLEX, Gurobi) and the emerging open-source solver HiGHS. This method enables significantly improved early pruning compared to state-of-the-art standalone solvers. Empirical results across industry-scale benchmarks demonstrate up to a $10\times$ performance gain over baselines, narrowing the optimality gap to $<0.1\%$. This work represents the first demonstration of utilizing differentiable optimization to initialize exact ILP solvers for combinatorial scheduling, opening new opportunities to integrate machine learning infrastructure with classical exact optimization methods across broader domains.
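The warm-start idea can be illustrated on a toy problem: relax a binary program to the unit box, descend a penalized objective, then round to a feasible 0/1 vector that could seed an exact solver (e.g. as a MIP start vector). Everything here, including the problem, the penalty, and the function name, is an illustrative assumption, not the paper's presolve.

```python
def warm_start_select_k(costs, k, lr=0.05, rho=5.0, iters=300):
    """Toy 'differentiable presolve': relax a pick-exactly-k ILP
    (min c.x s.t. sum x = k, x binary) to x in [0,1]^n, run
    projected gradient descent on c.x + rho*(sum x - k)^2, then
    round to a feasible 0/1 warm start. Illustrative only."""
    n = len(costs)
    x = [0.5] * n
    for _ in range(iters):
        viol = sum(x) - k                        # constraint violation
        for i in range(n):
            g = costs[i] + 2.0 * rho * viol      # d/dx_i of penalized objective
            x[i] = min(1.0, max(0.0, x[i] - lr * g))
    # Round: keep the k coordinates with the largest relaxed values.
    order = sorted(range(n), key=lambda i: -x[i])
    start = [0] * n
    for i in order[:k]:
        start[i] = 1
    return start
```

The rounded vector is always feasible by construction, so an exact solver can use it immediately as an incumbent to prune the branch-and-bound tree, which is the mechanism behind the early-pruning gains the abstract reports.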
[470] Explainable histomorphology-based survival prediction of glioblastoma, IDH-wildtype
Jan-Philipp Redlich, Friedrich Feuerhake, Stefan Nikolin, Nadine Sarah Schaadt, Sarah Teuber-Hanselmann, Joachim Weis, Sabine Luttmann, Andrea Eberle, Christoph Buck, Timm Intemann, Pascal Birnstill, Klaus Kraywinkel, Jonas Ort, Peter Boor, André Homeyer
Main category: cs.LG
TL;DR: Summary unavailable for this entry: the arXiv API request for 2601.11691 was rate-limited (HTTP 429).
[471] Multi-Agent LLMs for Adaptive Acquisition in Bayesian Optimization
Andrea Carbonati, Mohammadsina Almasi, Hadis Anahideh
Main category: cs.LG
TL;DR: LLMs struggle with exploration-exploitation trade-offs in optimization; multi-agent framework separates strategy control from candidate generation to improve search effectiveness.
Details
Motivation: Understanding how LLMs reason about exploration-exploitation trade-offs in optimization tasks, as current LLM-based optimization relies on implicit prompt-based reasoning that makes search behavior difficult to analyze or control compared to Bayesian Optimization's explicit acquisition functions.
Method: Proposes a multi-agent framework that decomposes exploration-exploitation control: a strategy agent assigns interpretable weights to multiple search criteria (informativeness, diversity, representativeness), while a generation agent produces candidates conditioned on the resulting search policy weights.
Result: Single-agent LLM approaches suffer from cognitive overload leading to unstable search dynamics and premature convergence; the multi-agent framework with separated strategic control substantially improves LLM-mediated search effectiveness across continuous optimization benchmarks.
Conclusion: Decomposing exploration-exploitation control into strategic policy mediation and tactical candidate generation makes search decisions explicit, observable, and adjustable, addressing limitations of single-agent LLM approaches in optimization tasks.
Abstract: The exploration-exploitation trade-off is central to sequential decision-making and black-box optimization, yet how Large Language Models (LLMs) reason about and manage this trade-off remains poorly understood. Unlike Bayesian Optimization, where exploration and exploitation are explicitly encoded through acquisition functions, LLM-based optimization relies on implicit, prompt-based reasoning over historical evaluations, making search behavior difficult to analyze or control. In this work, we present a metric-level study of LLM-mediated search policy learning, studying how LLMs construct and adapt exploration-exploitation strategies under multiple operational definitions of exploration, including informativeness, diversity, and representativeness. We show that single-agent LLM approaches, which jointly perform strategy selection and candidate generation within a single prompt, suffer from cognitive overload, leading to unstable search dynamics and premature convergence. To address this limitation, we propose a multi-agent framework that decomposes exploration-exploitation control into strategic policy mediation and tactical candidate generation. A strategy agent assigns interpretable weights to multiple search criteria, while a generation agent produces candidates conditioned on the resulting search policy defined as weights. This decomposition renders exploration-exploitation decisions explicit, observable, and adjustable. Empirical results across various continuous optimization benchmarks indicate that separating strategic control from candidate generation substantially improves the effectiveness of LLM-mediated search.
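The tactical half of the decomposition reduces to scoring candidates under strategy-agent weights; a minimal non-LLM sketch of that scoring step follows. The criterion scores and weight vectors are placeholder numbers of my own; in the paper both agents are LLMs, not fixed functions.

```python
def select_candidate(candidates, weights):
    """Pick the candidate maximising a weighted sum over named
    search criteria. The 'strategy agent' role is played here by a
    fixed weight dict; the 'generation agent' role by a fixed
    candidate list. Both are placeholders for the paper's LLMs."""
    def score(c):
        return sum(weights[k] * c[k] for k in weights)
    return max(candidates, key=score)

candidates = [
    {"x": 0.2, "informativeness": 0.9, "diversity": 0.1, "representativeness": 0.4},
    {"x": 0.7, "informativeness": 0.3, "diversity": 0.8, "representativeness": 0.6},
]
explore = {"informativeness": 0.2, "diversity": 0.7, "representativeness": 0.1}
exploit = {"informativeness": 0.8, "diversity": 0.1, "representativeness": 0.1}
```

Shifting the weights flips which candidate wins, which is exactly the "explicit, observable, and adjustable" property the decomposition is after.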
[472] The Spectral Edge Thesis: A Mathematical Framework for Intra-Signal Phase Transitions in Neural Network Training
Yongzhong Xu
Main category: cs.LG
TL;DR: Spectral edge thesis: Phase transitions in neural network training (grokking, capability gains, loss plateaus) are controlled by the spectral gap of rolling-window Gram matrix of parameter updates, with gap dynamics preceding grokking events.
Details
Motivation: To understand the fundamental mechanisms behind phase transitions in neural network training (grokking, capability gains, loss plateaus) by analyzing the spectral properties of parameter updates during training.
Method: Developed spectral edge thesis based on three axioms: (1) gap dynamics governed by Dyson-type ODE with curvature asymmetry, damping, and gradient driving; (2) spectral loss decomposition linking each mode’s learning contribution to Davis-Kahan stability coefficient; (3) Gap Maximality Principle showing k* is unique dynamically privileged position. Tested across six model families (150K-124M parameters).
Result: Gap dynamics precede every grokking event (24/24 with weight decay, 0/24 without), gap position is optimizer-dependent (Muon: k*=1, AdamW: k*=2 on same model), and 19/20 quantitative predictions confirmed. Adiabatic parameter controls circuit stability.
Conclusion: The spectral gap of rolling-window Gram matrix controls neural network training phase transitions, providing a unified framework consistent with edge of stability, Tensor Programs, Dyson Brownian motion, Lottery Ticket Hypothesis, and neural scaling laws.
Abstract: We develop the spectral edge thesis: phase transitions in neural network training – grokking, capability gains, loss plateaus – are controlled by the spectral gap of the rolling-window Gram matrix of parameter updates. In the extreme aspect ratio regime (parameters $P \sim 10^8$, window $W \sim 10$), the classical BBP detection threshold is vacuous; the operative structure is the intra-signal gap separating dominant from subdominant modes at position $k^* = \operatorname{argmax}_j\, \sigma_j/\sigma_{j+1}$. From three axioms we derive: (i) gap dynamics governed by a Dyson-type ODE with curvature asymmetry, damping, and gradient driving; (ii) a spectral loss decomposition linking each mode’s learning contribution to its Davis–Kahan stability coefficient; (iii) the Gap Maximality Principle, showing that $k^*$ is the unique dynamically privileged position – its collapse is the only one that disrupts learning, and it sustains itself through an $\alpha$-feedback loop requiring no assumption on the optimizer. The adiabatic parameter $\mathcal{A} = \|\Delta G\|_F / (\eta\, g^2)$ controls circuit stability: $\mathcal{A} \ll 1$ (plateau), $\mathcal{A} \sim 1$ (phase transition), $\mathcal{A} \gg 1$ (forgetting). Tested across six model families (150K–124M parameters): gap dynamics precede every grokking event (24/24 with weight decay, 0/24 without), the gap position is optimizer-dependent (Muon: $k^*=1$, AdamW: $k^*=2$ on the same model), and 19/20 quantitative predictions are confirmed. The framework is consistent with the edge of stability, Tensor Programs, Dyson Brownian motion, the Lottery Ticket Hypothesis, and neural scaling laws.
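Both key quantities are given explicitly in the abstract and are easy to compute from a spectrum; the helper names below are mine.

```python
def gap_position(singular_values):
    """k* = argmax_j sigma_j / sigma_{j+1} over a descending
    spectrum: the split between dominant and subdominant modes of
    the rolling-window Gram matrix, per the abstract. Returns a
    1-indexed position."""
    s = sorted(singular_values, reverse=True)
    ratios = [s[j] / s[j + 1] for j in range(len(s) - 1)]
    return 1 + max(range(len(ratios)), key=ratios.__getitem__)

def adiabatic(delta_g_fro, lr, grad_norm):
    """A = ||Delta G||_F / (eta * g^2). Per the abstract:
    A << 1 plateau, A ~ 1 phase transition, A >> 1 forgetting."""
    return delta_g_fro / (lr * grad_norm ** 2)
```

For example, a spectrum like [10, 9, 1, 0.5] has its largest ratio between the second and third values, so k* = 2: two dominant modes.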
[473] An Explicit Surrogate for Gaussian Mixture Flow Matching with Wasserstein Gap Bounds
Elham Rostami, Taous-Meriem Laleg-Kirati, Hamidou Tembine
Main category: cs.LG
TL;DR: Training-free flow matching between Gaussian mixture models using explicit velocity fields with closed-form surrogate for kinetic transport cost, analyzing approximation accuracy with local/nonlocal regimes and practical regime map.
Details
Motivation: To develop efficient methods for transporting between Gaussian mixture models without training, providing analytic expressions for transport costs that avoid computationally expensive matrix square-root operations required by exact Gaussian Wasserstein distances.
Method: Constructs component-wise Gaussian paths with affine velocity fields satisfying continuity equation, yielding closed-form surrogate for pairwise kinetic transport cost. Analyzes approximation accuracy with second-order agreement in local commuting regime and explicit cubic error bound. Introduces path-splitting strategy for nonlocal regimes to localize covariance evolution.
Result: Derives simple analytic expression for surrogate cost from kinetic energy of induced flow, proves second-order agreement in local commuting regime, provides explicit cubic error bound, and develops practical regime map showing when surrogate is accurate versus when exact Gaussian Wasserstein geodesic method is preferable.
Conclusion: The paper presents a training-free flow matching approach for GMMs with analytic surrogate costs that approximate exact transport costs well in certain regimes, providing practical guidance on when to use the surrogate versus exact methods through a regime map.
Abstract: We study training-free flow matching between two Gaussian mixture models (GMMs) using explicit velocity fields that transport one mixture into the other over time. Our baseline approach constructs component-wise Gaussian paths with affine velocity fields satisfying the continuity equation, which yields a closed-form surrogate for the pairwise kinetic transport cost. In contrast to the exact Gaussian Wasserstein cost, which relies on matrix square-root computations, the surrogate admits a simple analytic expression derived from the kinetic energy of the induced flow. We then analyze how closely this surrogate approximates the exact cost. We prove second-order agreement in a local commuting regime and derive an explicit cubic error bound in that regime. To handle nonlocal regimes, we introduce a path-splitting strategy that localizes the covariance evolution and enables piecewise application of the bound. We finally compare the surrogate with an exact construction based on the Gaussian Wasserstein geodesic and summarize the results in a practical regime map showing when the surrogate is accurate and the exact method is preferable.
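In one dimension (and more generally whenever covariances commute) the exact Gaussian 2-Wasserstein distance has a closed form with no matrix square roots, which makes the commuting-regime agreement concrete. A minimal sketch follows; the pairwise-cost helper is an assumed illustration of how component costs feed a coupling step, and the paper's actual coupling and surrogate are not reproduced here.

```python
import math

def gaussian_w2_1d(m0, s0, m1, s1):
    """Exact 2-Wasserstein distance between N(m0, s0^2) and
    N(m1, s1^2). In 1-D the Bures term of the Gaussian W2 formula
    reduces to (s0 - s1)^2, with no matrix square root needed."""
    return math.sqrt((m0 - m1) ** 2 + (s0 - s1) ** 2)

def pairwise_costs(comp0, comp1):
    """Cost matrix between mixture components, each a (mean, std)
    pair; such a matrix would feed the component-coupling step of
    mixture flow matching (coupling method not shown here)."""
    return [[gaussian_w2_1d(m0, s0, m1, s1) for (m1, s1) in comp1]
            for (m0, s0) in comp0]
```

In higher dimensions the cross term requires a matrix square root of the covariances, which is exactly the computation the paper's kinetic surrogate avoids.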
[474] Improving Efficiency of GPU Kernel Optimization Agents using a Domain-Specific Language and Speed-of-Light Guidance
Siva Kumar Sastry Hari, Vignesh Balaji, Sana Damani, Qijing Huang, Christos Kozyrakis
Main category: cs.LG
TL;DR: LLM agents optimize GPU kernels using a domain-specific language and speed-of-light guidance to improve search efficiency and avoid diminishing returns.
Details
Motivation: Current LLM-based GPU kernel optimization suffers from inefficient search: low-level reasoning wastes time on minor details, while high-level reasoning misses optimization opportunities. Agents also can't identify diminishing returns, leading to wasted resources.
Method: Two key principles: (1) μCUTLASS - a compact domain-specific language that enables high-level reasoning while preserving optimization levers, and (2) Speed-of-Light (SOL) guidance - using first-principles performance bounds to steer search, budget trials, and detect benchmark-gaming.
Result: On 59 KernelBench problems: DSL code with GPT-5-mini achieved 1.27x speedup over PyTorch (vs 0.40x regression with low-level code). Adding SOL guidance raised this to 1.56x. SOL-guided budgeting saved 19-43% tokens while retaining ≥95% speedup, with best policy reaching 1.68x efficiency gain.
Conclusion: The combination of domain-specific language and performance-bound guidance significantly improves LLM-based GPU kernel optimization efficiency, enabling weaker models to outperform stronger baselines at lower cost while detecting problematic benchmark-gaming cases.
Abstract: Optimizing GPU kernels with LLM agents is an iterative process over a large design space. Every candidate must be generated, compiled, validated, and profiled, so fewer trials will save both runtime and cost. We make two key observations. First, the abstraction level that agents operate at is important. If it is too low, the LLM wastes reasoning on low-impact details. If it is too high, it may miss important optimization choices. Second, agents cannot easily tell when they reach the point of diminishing returns, wasting resources as they continue searching. These observations motivate two design principles to improve efficiency: (1) a compact domain-specific language (DSL) that can be learned in context and lets the model reason at a higher level while preserving important optimization levers, and (2) Speed-of-Light (SOL) guidance that uses first-principles performance bounds to steer and budget search. We implement these principles in $\mu$CUTLASS, a DSL with a compiler for CUTLASS-backed GPU kernels that covers kernel configuration, epilogue fusion, and multi-stage pipelines. We use SOL guidance to estimate headroom and guide optimization trials, deprioritize problems that are near SOL, and flag kernels that game the benchmark. On 59 KernelBench problems with the same iteration budgets, switching from generating low-level code to DSL code using GPT-5-mini turns a 0.40x geomean regression into a 1.27x speedup over PyTorch. Adding SOL-guided steering raises this to 1.56x. Across model tiers, $\mu$CUTLASS + SOL-guidance lets weaker models outperform stronger baseline agents at lower token cost. SOL-guided budgeting saves 19-43% of tokens while retaining at least 95% of geomean speedup, with the best policy reaching a 1.68x efficiency gain. Lastly, SOL analysis helps detect benchmark-gaming cases, where kernels may appear fast while failing to perform the intended computation.
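First-principles performance bounds of this kind are usually roofline-style estimates; a minimal sketch under that assumption (the exact bound model the paper uses is not specified in the summary):

```python
def sol_time(flops, bytes_moved, peak_flops, mem_bw):
    """Speed-of-light estimate in the roofline sense: a kernel can
    run no faster than its compute bound or its memory bound,
    whichever is larger. Units: FLOPs, bytes, FLOP/s, bytes/s."""
    return max(flops / peak_flops, bytes_moved / mem_bw)

def headroom(measured_s, flops, bytes_moved, peak_flops, mem_bw):
    """measured / SOL: near 1 means the kernel is close to speed of
    light (deprioritize further search); much greater than 1 means
    headroom remains. A measured time *below* SOL flags a suspect,
    possibly benchmark-gaming kernel."""
    return measured_s / sol_time(flops, bytes_moved, peak_flops, mem_bw)
```

This captures the two uses the abstract describes: budgeting trials by remaining headroom, and flagging "impossibly fast" kernels whose measured time undercuts the physical bound.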
[475] From Astronomy to Astrology: Testing the Illusion of Zodiac-Based Personality Prediction with Machine Learning
Abhinna Sundar Samantaray, Finnja Annika Fluhrer, Dhruv Saini, Omkar Charaple, Anish Kumar Singh, Dhruv Vansraj Rathore
Main category: cs.LG
TL;DR: Paper examines zodiac-based personality prediction using machine learning, finds no predictive power beyond random chance, attributes astrology’s apparent success to cognitive biases and cultural narratives.
Details
Motivation: Despite astrology's cultural influence in many societies for personality interpretation and decision-making, it lacks physical plausibility and statistical reliability. The paper aims to scientifically test zodiac-based personality prediction using controlled machine learning methods.
Method: Constructed synthetic dataset with zodiac signs and personality labels from 100 human traits, each sign associated with 10 overlapping descriptors. Trained Logistic Regression, Random Forest, and neural network classifiers to predict personality from zodiac features and nuisance covariates.
Result: Predictive performance remained at or near random expectation across all experiments. Shuffled-label controls yielded comparable accuracies, confirming no meaningful predictive structure in zodiac-based systems.
Conclusion: Astrology’s apparent success stems from trait universality, category overlap, cognitive biases (Barnum effect, confirmation bias), and interpretive flexibility, not from measurable predictive power. Zodiac systems function as culturally durable narrative frameworks rather than reliable behavioral predictors.
Abstract: Astrology has long been used to interpret human personality, estimate compatibility, and guide social decision-making. Zodiac-based systems in particular remain culturally influential across much of the world, including in South Asian societies where astrological reasoning can shape marriage matching, naming conventions, ritual timing, and broader life planning. Despite this persistence, astrology has never established either a physically plausible mechanism or a statistically reliable predictive foundation. In this work, we examine zodiac-based personality prediction using a controlled machine-learning framework. We construct a synthetic dataset in which individuals are assigned zodiac signs and personality labels drawn from a shared pool of 100 broadly human traits. Each sign is associated with a subset of 10 common descriptors, intentionally overlapping with those assigned to other signs, thereby reproducing the ambiguity characteristic of practical astrological systems. We then train Logistic Regression, Random Forest, and neural-network classifiers to infer personality labels from zodiac-based features and nuisance covariates. Across all experiments, predictive performance remains at or near random expectation, while shuffled-label controls yield comparable accuracies. We argue that the apparent success of astrology arises not from measurable predictive structure, but from trait universality, category overlap, cognitive biases such as the Barnum effect and confirmation bias, and the interpretive flexibility of astrologers and pundits. We conclude that zodiac-based systems do not provide reliable information for predicting human behavior and instead function as culturally durable narrative frameworks. This paper is intended as a humorous academic exercise.
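The shuffled-label control is the methodological core here: when features carry no signal, true and shuffled labels yield the same near-chance accuracy. A self-contained simulation of that effect, with synthetic sizes and label counts of my own choosing rather than the paper's dataset:

```python
import random
from collections import Counter, defaultdict

def majority_per_sign_accuracy(signs, labels):
    """Accuracy of the best sign-only classifier: predict the most
    common label within each sign. An upper bound for any model
    that sees only the zodiac sign."""
    per = defaultdict(Counter)
    for s, l in zip(signs, labels):
        per[s][l] += 1
    correct = sum(c.most_common(1)[0][1] for c in per.values())
    return correct / len(labels)

random.seed(0)
n = 60_000
signs = [random.randrange(12) for _ in range(n)]
labels = [random.randrange(10) for _ in range(n)]   # drawn independently of sign
shuffled = random.sample(labels, n)                 # shuffled-label control
```

Because labels are independent of signs, both the "true" and the shuffled run land near the 1/10 chance rate, which is exactly the comparability the paper reports for its classifiers.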
[476] A Latent Risk-Aware Machine Learning Approach for Predicting Operational Success in Clinical Trials based on TrialsBank
Iness Halimi, Emmanuel Piffo, Oumnia Boudersa, Yvan Marcel Carre Vilmorin, Melissa Ait-ikhlef, Karima Kone, Andy Tan, Augustin Medina, Juliette Hernando, Sheila Ernest, Vatche Bartekian, Karine Lalonde, Mireille E Schnitzer, Gianolli Dorcelus
Main category: cs.LG
TL;DR: Hierarchical ML framework predicts clinical trial operational success using latent risk factors from pre-initiation data, achieving high accuracy across trial phases.
Details
Motivation: Clinical trials are expensive, time-consuming, and risky, but reliable methods for predicting trial success before initiation are limited. Existing AI approaches often use unavailable data or focus on isolated metrics, limiting real-world applicability.
Method: Two-stage hierarchical framework: first predicts intermediate latent operational risk factors from 180+ drug/trial features available pre-initiation, then integrates these predicted risks into downstream model to estimate operational success probability. Uses curated TrialsBank database (13,700 trials) with staged data-splitting to prevent leakage, benchmarked with XGBoost, CatBoost, and Explainable Boosting Machines.
Result: Strong out-of-sample performance across Phase I-III trials with F1-scores of 0.93, 0.92, and 0.91 respectively. Incorporating latent risk drivers improves discrimination of operational failures, and performance remains robust under independent inference evaluation.
Conclusion: Clinical trial operational success can be prospectively forecasted using a latent risk-aware AI framework, enabling early risk assessment and supporting data-driven clinical development decision-making.
Abstract: Clinical trials are characterized by high costs, extended timelines, and substantial operational risk, yet reliable prospective methods for predicting trial success before initiation remain limited. Existing artificial intelligence approaches often focus on isolated metrics or specific development stages and frequently rely on variables unavailable at the trial design phase, limiting real-world applicability. We present a hierarchical latent risk-aware machine learning framework for prospective prediction of clinical trial operational success using a curated subset of TrialsBank, a proprietary AI-ready database developed by Sorintellis, comprising 13,700 trials. Operational success was defined as the ability to initiate, conduct, and complete a clinical trial according to planned timelines, recruitment targets, and protocol specifications through database lock. This approach decomposes operational success prediction into two modeling stages. First, intermediate latent operational risk factors are predicted using more than 180 drug- and trial-level features available before trial initiation. These predicted latent risks are then integrated into a downstream model to estimate the probability of operational success. A staged data-splitting strategy was employed to prevent information leakage, and models were benchmarked using XGBoost, CatBoost, and Explainable Boosting Machines. Across Phase I-III, the framework achieves strong out-of-sample performance, with F1-scores of 0.93, 0.92, and 0.91, respectively. Incorporating latent risk drivers improves discrimination of operational failures, and performance remains robust under independent inference evaluation. These results demonstrate that clinical trial operational success can be prospectively forecasted using a latent risk-aware AI framework, enabling early risk assessment and supporting data-driven clinical development decision-making.
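The two-stage idea is simple enough to sketch: first regress latent risks on pre-initiation features, then feed the *predicted* risks into the success model, so inference never touches post-initiation data. A minimal numpy sketch on synthetic data (the paper itself uses 180+ features and gradient-boosted models on TrialsBank; everything below is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for pre-initiation features (the paper uses 180+;
# 8 here for illustration).
n, d = 500, 8
X = rng.normal(size=(n, d))

# Hypothetical latent operational risks (e.g. recruitment risk, timeline
# risk), generated from the features plus noise.
W_true = rng.normal(size=(d, 2))
risks = X @ W_true + 0.1 * rng.normal(size=(n, 2))

# Success depends on the latent risks, not directly on the raw features.
success = (risks.sum(axis=1) < 0).astype(float)

# Stage 1: predict latent risks from pre-initiation features (ridge).
lam = 1e-2
W_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ risks)
risks_hat = X @ W_hat

# Stage 2: predict operational success from the *predicted* risks,
# so inference needs only data available before trial initiation.
Z = np.column_stack([risks_hat, np.ones(n)])
beta = np.linalg.solve(Z.T @ Z + lam * np.eye(3), Z.T @ success)
pred = (Z @ beta > 0.5).astype(float)

accuracy = (pred == success).mean()
print(f"hierarchical accuracy: {accuracy:.2f}")
```

Because success is driven by the latent risks in this toy setup, the hierarchy recovers it almost perfectly from pre-initiation data alone.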
[477] ARCS: Autoregressive Circuit Synthesis with Topology-Aware Graph Attention and Spec Conditioning
Tushar Dhananjay Pathak
Main category: cs.LG
TL;DR: ARCS is an amortized analog circuit generation system that produces SPICE-simulatable designs in milliseconds using learned generators and SPICE-based ranking, achieving high simulation validity with dramatically fewer evaluations than search-based methods.
Details
Motivation: Current analog circuit design methods using search-based optimization (like genetic algorithms) require minutes and thousands of SPICE simulations per design, which is computationally expensive and slow for rapid prototyping and design-space exploration.
Method: Hybrid pipeline combining two learned generators: a graph VAE for topology generation and a flow-matching model for component values, with SPICE-based ranking. Uses Group Relative Policy Optimization (GRPO) to address cross-topology reward distribution mismatch in RL, and grammar-constrained decoding for structural validity.
Result: Achieves 99.9% simulation validity across 32 topologies using only 8 SPICE evaluations (40x fewer than genetic algorithms). Single-model inference reaches 85% simulation validity in 97ms (600x faster than random search). GRPO improves simulation validity by +9.6pp over REINFORCE in only 500 RL steps.
Conclusion: ARCS provides >1000x speed advantage enabling rapid prototyping, design-space exploration, and warm-starting search methods, though it doesn’t yet match per-design quality of search-based optimization. The system demonstrates practical amortized circuit generation with learned models.
Abstract: I present ARCS, a system for amortized analog circuit generation that produces complete, SPICE-simulatable designs (topology and component values) in milliseconds rather than the minutes required by search-based methods. A hybrid pipeline combining two learned generators (a graph VAE and a flow-matching model) with SPICE-based ranking achieves 99.9% simulation validity (reward 6.43/8.0) across 32 topologies using only 8 SPICE evaluations, 40x fewer than genetic algorithms. For single-model inference, a topology-aware Graph Transformer with Best-of-3 candidate selection reaches 85% simulation validity in 97ms, over 600x faster than random search. The key technical contribution is Group Relative Policy Optimization (GRPO): I identify a critical failure mode of REINFORCE (cross-topology reward distribution mismatch) and resolve it with per-topology advantage normalization, improving simulation validity by +9.6pp over REINFORCE in only 500 RL steps (10x fewer). Grammar-constrained decoding additionally guarantees 100% structural validity by construction via topology-aware token masking. ARCS does not yet match the per-design quality of search-based optimization (5.48 vs. 7.48 reward), but its >1000x speed advantage enables rapid prototyping, design-space exploration, and warm-starting search methods (recovering 96.6% of GA quality with 49% fewer simulations).
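The per-topology advantage normalization that the abstract credits for the +9.6pp gain can be sketched in a few lines (an illustrative reconstruction, not the author's code; topology names are made up):

```python
import numpy as np

def grouped_advantages(rewards, topology_ids, eps=1e-8):
    """Per-group advantage normalization in the spirit of GRPO: each
    reward is standardized against the other samples sharing its
    topology, so topologies with different reward scales do not drown
    each other out in the policy gradient."""
    rewards = np.asarray(rewards, dtype=float)
    topology_ids = np.asarray(topology_ids)
    adv = np.empty_like(rewards)
    for t in np.unique(topology_ids):
        mask = topology_ids == t
        group = rewards[mask]
        adv[mask] = (group - group.mean()) / (group.std() + eps)
    return adv

# Two topologies with very different reward scales: under global
# normalization the low-reward topology would get uniformly negative
# advantages; per-topology normalization keeps both informative.
rewards = [7.0, 7.5, 6.5, 0.9, 1.1, 1.0]
topologies = ["amp", "amp", "amp", "filt", "filt", "filt"]
adv = grouped_advantages(rewards, topologies)
print(np.round(adv, 3))
```

With a plain REINFORCE baseline, every "filt" sample would look bad next to "amp" samples; after grouping, each topology contributes zero-mean, unit-scale advantages.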
[478] On the Mirage of Long-Range Dependency, with an Application to Integer Multiplication
Zichao Wei
Main category: cs.LG
TL;DR: Neural networks can perfectly generalize on integer multiplication using a 2D grid representation with local operations, challenging the belief that multiplication requires long-range dependencies due to carry chains.
Details
Motivation: The paper challenges the conventional wisdom that integer multiplication is inherently difficult for neural networks due to long-range dependencies from carry chains. The authors argue this is not an intrinsic property but rather an artifact of the computational representation used.
Method: The authors propose representing two n-bit binary integers as a 2D outer-product grid, where each step of long multiplication becomes a 3×3 local neighborhood operation. They implement this using a neural cellular automaton with only 321 learnable parameters and compare it against five alternative architectures including Transformer, Transformer+RoPE, and Mamba.
Result: The neural cellular automaton achieved perfect length generalization up to 683× the training range, while all five alternative architectures (including Transformer with 6,625 parameters and Mamba) failed under the same representation.
Conclusion: Long-range dependency in multiplication is not intrinsic but induced by computational spacetime choices. The paper suggests re-examining tasks diagnosed as requiring long-range dependencies to determine if the dependency is truly inherent or a representational artifact.
Abstract: Integer multiplication has long been considered a hard problem for neural networks, with the difficulty widely attributed to the O(n) long-range dependency induced by carry chains. We argue that this diagnosis is wrong: long-range dependency is not an intrinsic property of multiplication, but a mirage produced by the choice of computational spacetime. We formalize the notion of mirage and provide a constructive proof: when two n-bit binary integers are laid out as a 2D outer-product grid, every step of long multiplication collapses into a 3×3 local neighborhood operation. Under this representation, a neural cellular automaton with only 321 learnable parameters achieves perfect length generalization up to 683× the training range. Five alternative architectures – including Transformer (6,625 params), Transformer+RoPE, and Mamba – all fail under the same representation. We further analyze how partial successes locked the community into an incorrect diagnosis, and argue that any task diagnosed as requiring long-range dependency should first be examined for whether the dependency is intrinsic to the task or induced by the computational spacetime.
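The representational claim is checkable with ordinary arithmetic: lay the bits of a and b out on an outer-product grid, and the product is recovered by summing anti-diagonals with a local carry. A plain-Python check of the grid arithmetic (not the paper's 3×3 NCA update rule):

```python
def grid_multiply(a: int, b: int, n_bits: int = 8) -> int:
    """Multiply via the 2D outer-product grid: cell (i, j) holds
    a_i * b_j, and long multiplication reduces to summing the
    anti-diagonals (i + j = k) with a carry into the next column."""
    a_bits = [(a >> i) & 1 for i in range(n_bits)]
    b_bits = [(b >> j) & 1 for j in range(n_bits)]
    # Outer-product grid of 1-bit partial products.
    grid = [[a_bits[i] * b_bits[j] for j in range(n_bits)]
            for i in range(n_bits)]
    result, carry = 0, 0
    for k in range(2 * n_bits - 1):
        s = carry + sum(grid[i][k - i]
                        for i in range(n_bits) if 0 <= k - i < n_bits)
        result |= (s & 1) << k   # emit this column's bit
        carry = s >> 1           # pass the rest to the next column
    result |= carry << (2 * n_bits - 1)
    return result

assert all(grid_multiply(a, b) == a * b
           for a in range(64) for b in range(64))
print("grid multiplication matches integer multiply")
```

Every step touches only one column and its carry, which is the intuition behind collapsing the "long-range" carry chain into local updates on the grid.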
[479] Realistic Market Impact Modeling for Reinforcement Learning Trading Environments
Lucas Riera Abbade, Anna Helena Reali Costa
Main category: cs.LG
TL;DR: Three Gymnasium-compatible trading environments with realistic transaction cost models based on Almgren-Chriss framework and square-root impact law, showing that cost models significantly affect RL algorithm performance and trading behavior.
Details
Motivation: Most RL trading environments use unrealistic fixed transaction costs, leading to agents learning behaviors that fail under realistic execution with market impact. Need for environments with realistic nonlinear transaction cost models.
Method: Developed three trading environments (stock trading, margin trading, portfolio optimization) with pluggable cost models based on Almgren-Chriss framework and square-root impact law. Evaluated five DRL algorithms (A2C, PPO, DDPG, SAC, TD3) on NASDAQ-100 with both fixed cost baseline and AC model using Optuna hyperparameter tuning.
Result: Cost models significantly change algorithm performance and rankings across all environments. AC model produces dramatically different trading behavior (daily costs drop from $200k to $8k, turnover from 19% to 1%). Hyperparameter optimization reduces costs up to 82%. Algorithm-cost model interactions are environment-specific.
Conclusion: Realistic transaction cost modeling is crucial for RL-based trading systems, as it fundamentally changes algorithm performance and trading behavior. The released open-source environments address this gap in existing literature.
Abstract: Reinforcement learning (RL) has shown promise for trading, yet most open-source backtesting environments assume negligible or fixed transaction costs, causing agents to learn trading behaviors that fail under realistic execution. We introduce three Gymnasium-compatible trading environments – MACE (Market-Adjusted Cost Execution) stock trading, margin trading, and portfolio optimization – that integrate nonlinear market impact models grounded in the Almgren-Chriss framework and the empirically validated square-root impact law. Each environment provides pluggable cost models, permanent impact tracking with exponential decay, and comprehensive trade-level logging. We evaluate five DRL algorithms (A2C, PPO, DDPG, SAC, TD3) on the NASDAQ-100, comparing a fixed 10 bps baseline against the AC model with Optuna-tuned hyperparameters. Our results show that (i) the cost model materially changes both absolute performance and the relative ranking of algorithms across all three environments; (ii) the AC model produces dramatically different trading behavior, e.g., daily costs dropping from $200k to $8k with turnover falling from 19% to 1%; (iii) hyperparameter optimization is essential for constraining pathological trading, with costs dropping up to 82%; and (iv) algorithm-cost model interactions are strongly environment-specific, e.g., DDPG’s OOS Sharpe jumps from -2.1 to 0.3 under AC in margin trading while SAC’s drops from -0.5 to -1.2. We release the full suite as an open-source extension to FinRL-Meta.
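For readers unfamiliar with the square-root impact law, the contrast with a fixed-bps cost is easy to sketch: flat costs scale linearly with order size, while square-root impact grows superlinearly in the fraction of daily volume traded. Constants below are illustrative, not the environments' defaults:

```python
import math

def fixed_cost(notional: float, bps: float = 10.0) -> float:
    """Flat transaction cost: a constant fraction of traded notional."""
    return notional * bps / 1e4

def sqrt_impact_cost(shares: float, price: float, adv_shares: float,
                     sigma_daily: float, y: float = 0.5,
                     half_spread_bps: float = 2.0) -> float:
    """Square-root impact law (illustrative constants): the price moves
    by roughly y * sigma * sqrt(Q / ADV), paid on the whole order, plus
    half the bid-ask spread."""
    notional = shares * price
    impact = y * sigma_daily * math.sqrt(shares / adv_shares)
    return notional * (impact + half_spread_bps / 1e4)

# A small order vs a 100x larger order in the same stock: fixed costs
# scale linearly with size, square-root impact grows superlinearly.
price, adv, sigma = 100.0, 5e6, 0.02
small = sqrt_impact_cost(1e4, price, adv, sigma)
large = sqrt_impact_cost(1e6, price, adv, sigma)
print(f"small order: ${small:,.0f}, large order: ${large:,.0f}")
```

This superlinearity is exactly why the paper's agents learn to cut turnover drastically once the AC cost model replaces the fixed 10 bps baseline.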
[480] HCLSM: Hierarchical Causal Latent State Machines for Object-Centric World Modeling
Jaber Jaber, Osama Jaber
Main category: cs.LG
TL;DR: HCLSM is a hierarchical causal latent state model that improves world modeling through object-centric decomposition, hierarchical temporal dynamics, and causal structure learning for better video prediction and understanding.
Details
Motivation: Current world models for video prediction have limitations: flat latent representations that entangle objects, ignore causal structure, and collapse temporal dynamics into a single scale. There's a need for models that can properly decompose scenes, handle multi-scale temporal dynamics, and learn causal relationships.
Method: Three interconnected principles: 1) Object-centric decomposition via slot attention with spatial broadcast decoding, 2) Hierarchical temporal dynamics using three-level engine (selective state space models for continuous physics, sparse transformers for discrete events, compressed transformers for abstract goals), 3) Causal structure learning through graph neural network interaction patterns. Two-stage training protocol with spatial reconstruction first, then dynamics prediction.
Result: 68M-parameter model trained on PushT robotic manipulation benchmark achieved 0.008 MSE next-state prediction loss with emerging spatial decomposition (SBD loss: 0.0075) and learned event boundaries. Custom Triton kernel for SSM scan delivered 38x speedup over sequential PyTorch.
Conclusion: HCLSM demonstrates improved world modeling capabilities through hierarchical causal structure, achieving better video prediction with learned object decomposition and multi-scale temporal dynamics.
Abstract: World models that predict future states from video remain limited by flat latent representations that entangle objects, ignore causal structure, and collapse temporal dynamics into a single scale. We present HCLSM, a world model architecture that operates on three interconnected principles: object-centric decomposition via slot attention with spatial broadcast decoding, hierarchical temporal dynamics through a three-level engine combining selective state space models for continuous physics, sparse transformers for discrete events, and compressed transformers for abstract goals, and causal structure learning through graph neural network interaction patterns. HCLSM introduces a two-stage training protocol where spatial reconstruction forces slot specialization before dynamics prediction begins. We train a 68M-parameter model on the PushT robotic manipulation benchmark from the Open X-Embodiment dataset, achieving 0.008 MSE next-state prediction loss with emerging spatial decomposition (SBD loss: 0.0075) and learned event boundaries. A custom Triton kernel for the SSM scan delivers 38x speedup over sequential PyTorch. The full system spans 8,478 lines of Python across 51 modules with 171 unit tests. Code: https://github.com/rightnow-ai/hclsm
[481] Sample-Efficient Hypergradient Estimation for Decentralized Bi-Level Reinforcement Learning
Mikoto Kudo, Takumi Tanabe, Akifumi Wachi, Youhei Akimoto
Main category: cs.LG
TL;DR: A method for bi-level reinforcement learning in decentralized settings using Boltzmann covariance trick for efficient hypergradient estimation from interaction samples.
Details
Motivation: Addresses decentralized bi-level RL problems where a leader agent can only observe follower's optimization outcomes but cannot intervene, requiring efficient hypergradient estimation without extensive data or complex gradient estimators.
Method: Leverages Boltzmann covariance trick to derive alternative hypergradient formulation that enables efficient estimation solely from interaction samples, even with high-dimensional leader decision spaces. First method enabling hypergradient-based optimization for 2-player Markov games in decentralized settings.
Result: Demonstrates effectiveness in both discrete and continuous state tasks, highlighting impact of hypergradient updates.
Conclusion: Provides efficient solution for decentralized bi-level RL problems with practical applications in strategic decision-making like warehouse robot environment design.
Abstract: Many strategic decision-making problems, such as environment design for warehouse robots, can be naturally formulated as bi-level reinforcement learning (RL), where a leader agent optimizes its objective while a follower solves a Markov decision process (MDP) conditioned on the leader’s decisions. In many situations, a fundamental challenge arises when the leader cannot intervene in the follower’s optimization process; it can only observe the optimization outcome. We address this decentralized setting by deriving the hypergradient of the leader’s objective, i.e., the gradient of the leader’s strategy that accounts for changes in the follower’s optimal policy. Unlike prior hypergradient-based methods that require extensive data for repeated state visits or rely on gradient estimators whose complexity can increase substantially with the high-dimensional leader’s decision space, we leverage the Boltzmann covariance trick to derive an alternative hypergradient formulation. This enables efficient hypergradient estimation solely from interaction samples, even when the leader’s decision space is high-dimensional. Additionally, to our knowledge, this is the first method that enables hypergradient-based optimization for 2-player Markov games in decentralized settings. Experiments highlight the impact of hypergradient updates and demonstrate our method’s effectiveness in both discrete and continuous state tasks.
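The "Boltzmann covariance trick" rests on a standard exponential-family identity: for p_θ(x) ∝ exp(θ·φ(x)), the gradient of an expectation is a covariance, ∇_θ E[f] = Cov_p(f, φ), which is estimable from samples alone. A numerical check on a small discrete space (illustrative; the paper's estimator for Markov games is more involved):

```python
import numpy as np

def boltzmann(theta, phi):
    """p(x) proportional to exp(theta . phi(x)) over a finite state set."""
    logits = phi @ theta
    w = np.exp(logits - logits.max())
    return w / w.sum()

rng = np.random.default_rng(1)
n_states, d = 6, 3
phi = rng.normal(size=(n_states, d))   # features of each state
f = rng.normal(size=n_states)          # objective value per state
theta = rng.normal(size=d)

p = boltzmann(theta, phi)
# Covariance identity: grad_theta E_p[f] = E[f*phi] - E[f]*E[phi]
mean_phi = p @ phi
grad_cov = (p * f) @ phi - (p @ f) * mean_phi

# Check against central finite differences.
eps = 1e-6
grad_fd = np.empty(d)
for k in range(d):
    tp, tm = theta.copy(), theta.copy()
    tp[k] += eps
    tm[k] -= eps
    grad_fd[k] = (boltzmann(tp, phi) @ f - boltzmann(tm, phi) @ f) / (2 * eps)

print(np.max(np.abs(grad_cov - grad_fd)))
```

Because the right-hand side is an expectation under p, it can be estimated from interaction samples without differentiating through the follower's optimization, which is the point in the decentralized setting.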
[482] Efficient Bilevel Optimization with KFAC-Based Hypergradients
Disen Liao, Felix Dangel, Yaoliang Yu
Main category: cs.LG
TL;DR: KFAC-based bilevel optimization with curvature-aware hypergradients outperforms existing methods like conjugate gradient and Neumann expansions for large-scale models.
Details
Motivation: Bilevel optimization is widely used in ML but scaling requires computing hypergradients via inverse Hessian-vector products. Current approximations like one-step gradient unrolling or identity/short Neumann expansions discard curvature information, limiting performance.
Method: Proposes incorporating Kronecker-factored approximate curvature (KFAC) into implicit function theorem-based algorithms to compute curvature-aware hypergradients with better performance-efficiency trade-off than Conjugate Gradient or Neumann methods.
Result: The approach consistently outperforms unrolling and shows that curvature information is valuable at scale. On models up to BERT size, KFAC provides curvature information with modest memory and runtime overhead.
Conclusion: KFAC enables efficient curvature-aware hypergradient computation for bilevel optimization at scale, demonstrating better performance than existing approximation methods across diverse tasks including meta-learning and AI safety.
Abstract: Bilevel optimization (BO) is widely applicable to many machine learning problems. Scaling BO, however, requires repeatedly computing hypergradients, which involves solving inverse Hessian-vector products (IHVPs). In practice, these operations are often approximated using crude surrogates such as one-step gradient unrolling or identity/short Neumann expansions, which discard curvature information. We build on implicit function theorem-based algorithms and propose to incorporate Kronecker-factored approximate curvature (KFAC), yielding curvature-aware hypergradients with a better performance efficiency trade-off than Conjugate Gradient (CG) or Neumann methods and consistently outperforming unrolling. We evaluate this approach across diverse tasks, including meta-learning and AI safety problems. On models up to BERT, we show that curvature information is valuable at scale, and KFAC can provide it with only modest memory and runtime overhead. Our implementation is available at https://github.com/liaodisen/NeuralBo.
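The efficiency of KFAC-based hypergradients comes from a Kronecker identity: if a layer's curvature is approximated as H ≈ A ⊗ G, the inverse-Hessian-vector product reduces to two small solves instead of one huge one. A numpy sketch under that assumption (factor names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

def spd(n):
    """Random symmetric positive-definite matrix (stand-in for a KFAC
    factor such as the input-activation covariance)."""
    M = rng.normal(size=(n, n))
    return M @ M.T + n * np.eye(n)

m, n = 5, 4
A = spd(m)                    # e.g. activation second-moment factor
G = spd(n)                    # e.g. pre-activation-gradient factor
V = rng.normal(size=(m, n))   # gradient to be preconditioned

# KFAC inverse-Hessian-vector product: with H ~ A kron G we never form
# H; the IHVP is two small solves (m x m and n x n) instead of one
# (m*n) x (m*n) solve: X = A^{-1} V G^{-1}.
X = np.linalg.solve(A, np.linalg.solve(G, V.T).T)

# Check against the dense Kronecker inverse.
H = np.kron(A, G)
x_dense = np.linalg.solve(H, V.reshape(-1))
print(np.max(np.abs(X.reshape(-1) - x_dense)))
```

For a BERT-sized layer the dense solve is hopeless while the two factor solves are cheap, which is the trade-off the paper exploits.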
[483] Quality-Controlled Active Learning via Gaussian Processes for Robust Structure-Property Learning in Autonomous Microscopy
Jawad Chowdhury, Ganesh Narasimha, Jan-Chi Yang, Yongtao Liu, Rama Vasudevan
Main category: cs.LG
TL;DR: A gated active learning framework combining curiosity-driven sampling with physics-informed quality control to handle noisy data in materials science experiments, improving Image-to-Spectrum and Spectrum-to-Image translations.
Details
Motivation: Autonomous experimental systems in materials research face limitations from low-quality, noisy data, especially in structure-property learning tasks like Im2Spec and Spec2Im translations. Standard active learning strategies often mistakenly prioritize poor-quality measurements by misinterpreting noise as uncertainty.
Method: Proposes a gated active learning framework that combines curiosity-driven sampling with a physics-informed quality control filter based on Simple Harmonic Oscillator model fits. This allows automatic exclusion of low-fidelity data during acquisition while maintaining active learning benefits.
Result: The method outperforms random sampling, standard active learning, and multitask learning strategies on BEPS data from PbTiO3 thin films. It enhances both Im2Spec and Spec2Im by handling noise during training and acquisition, leading to more reliable forward and inverse predictions. Successfully deployed in real-time experiments on BiFeO3 thin films.
Conclusion: Supports a shift toward hybrid autonomy in self-driving labs where physics-informed quality assessment and active decision-making work together for more reliable scientific discovery. The framework effectively handles noisy data that typically misleads standard active learning approaches.
Abstract: Autonomous experimental systems are increasingly used in materials research to accelerate scientific discovery, but their performance is often limited by low-quality, noisy data. This issue is especially problematic in data-intensive structure-property learning tasks such as Image-to-Spectrum (Im2Spec) and Spectrum-to-Image (Spec2Im) translations, where standard active learning strategies can mistakenly prioritize poor-quality measurements. We introduce a gated active learning framework that combines curiosity-driven sampling with a physics-informed quality control filter based on the Simple Harmonic Oscillator model fits, allowing the system to automatically exclude low-fidelity data during acquisition. Evaluations on a pre-acquired dataset of band-excitation piezoresponse spectroscopy (BEPS) data from PbTiO3 thin films with spatially localized noise show that the proposed method outperforms random sampling, standard active learning, and multitask learning strategies. The gated approach enhances both Im2Spec and Spec2Im by handling noise during training and acquisition, leading to more reliable forward and inverse predictions. In contrast, standard active learners often misinterpret noise as uncertainty and end up acquiring bad samples that hurt performance. Given its promising applicability, we further deployed the framework in real-time experiments on BiFeO3 thin films, demonstrating its effectiveness in real autonomous microscopy experiments. Overall, this work supports a shift toward hybrid autonomy in self-driving labs, where physics-informed quality assessment and active decision-making work hand-in-hand for more reliable discovery.
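The gating idea generalizes beyond piezoresponse spectroscopy: fit the physics model to each candidate measurement, score the fit, and drop poor fits from the acquisition pool before the active learner sees them. In the sketch below a quadratic fit stands in for the paper's Simple Harmonic Oscillator fit, and the R² threshold is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)

def r_squared(y, y_fit):
    ss_res = np.sum((y - y_fit) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def gated_pool(spectra, x, r2_threshold=0.9, deg=2):
    """Quality gate: keep only measurement locations whose spectrum is
    well described by the physics model. A quadratic fit stands in here
    for the Simple Harmonic Oscillator fit used in the paper."""
    keep = []
    for i, y in enumerate(spectra):
        coeffs = np.polyfit(x, y, deg)
        if r_squared(y, np.polyval(coeffs, x)) >= r2_threshold:
            keep.append(i)
    return keep

x = np.linspace(-1, 1, 50)
clean = 1.0 - x**2 + 0.02 * rng.normal(size=(8, 50))  # model-like responses
noisy = rng.normal(size=(4, 50))                      # pure noise
spectra = np.vstack([clean, noisy])

pool = gated_pool(spectra, x)
print(pool)
```

An ungated uncertainty-based learner would chase exactly the high-variance noisy rows that the gate discards, which is the failure mode the paper documents.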
[484] Efficient and Scalable Granular-ball Graph Coarsening Method for Large-scale Graph Node Classification
Guan Wang, Shuyin Xia, Lei Qian, Guoyin Wang, Yi Liu, Yi Wang, Wei Wang
Main category: cs.LG
TL;DR: A granular-ball graph coarsening method that reduces computational overhead for large-scale GCN training through multi-granularity subgraph sampling.
Details
Motivation: GCNs face high computational overhead on large-scale graphs, especially with deep architectures. Existing sampling and coarsening methods either ignore multi-granularity information or have high time complexity.
Method: Proposes a two-stage framework: 1) Multi-granularity granular-ball graph coarsening algorithm with linear time complexity to coarsen original graph into subgraphs, 2) Random sampling of these granular-ball subgraphs to form minibatches for GCN training.
Result: Experimental results on multiple datasets demonstrate superior performance in node classification while significantly reducing graph scale and enhancing training efficiency and scalability.
Conclusion: The proposed method effectively addresses computational challenges in large-scale GCN training through efficient graph coarsening and sampling, achieving better performance than existing approaches.
Abstract: Graph Convolutional Network (GCN) is a model that can effectively handle graph data tasks and has been successfully applied. However, for large-scale graph datasets, GCN still faces the challenge of high computational overhead, especially when the number of convolutional layers in the graph is large. Currently, there are many advanced methods that use various sampling techniques or graph coarsening techniques to alleviate the inconvenience caused during training. However, among these methods, some ignore the multi-granularity information in the graph structure, and the time complexity of some coarsening methods is still relatively high. In response to these issues, building on our previous work, we propose a new framework called Efficient and Scalable Granular-ball Graph Coarsening Method for Large-scale Graph Node Classification. Specifically, this method first uses a multi-granularity granular-ball graph coarsening algorithm to coarsen the original graph to obtain many subgraphs. The time complexity of this stage is linear and much lower than that of existing graph coarsening methods. Then, subgraphs composed of these granular-balls are randomly sampled to form minibatches for training GCN. Our algorithm can adaptively and significantly reduce the scale of the original graph, thereby enhancing the training efficiency and scalability of GCN. Ultimately, the experimental results of node classification on multiple datasets demonstrate that the method proposed in this paper exhibits superior performance. The code is available at https://anonymous.4open.science/r/1-141D/.
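Granular-ball construction is the paper's own algorithm, but the coarsening step it feeds can be illustrated generically: given a node-to-ball assignment, the coarse adjacency is P^T A P for the one-hot partition matrix P. A toy sketch (not the paper's method, just the standard coarsening algebra):

```python
import numpy as np

def coarsen(adj, assign):
    """Coarsen a graph given a node-to-cluster assignment: with one-hot
    partition matrix P, the coarse adjacency is P.T @ A @ P (diagonal
    entries count intra-cluster edge weight)."""
    n_clusters = assign.max() + 1
    P = np.zeros((adj.shape[0], n_clusters))
    P[np.arange(adj.shape[0]), assign] = 1.0
    return P.T @ adj @ P

# Toy graph: two triangles joined by one bridge edge.
A = np.zeros((6, 6))
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

assign = np.array([0, 0, 0, 1, 1, 1])   # one "ball" per triangle
Ac = coarsen(A, assign)
print(Ac)
```

The coarse 2x2 graph keeps the bridge edge between the two balls, which is the structure a GCN minibatch built from coarsened subgraphs can still exploit.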
[485] Dummy-Aware Weighted Attack (DAWA): Breaking the Safe Sink in Dummy Class Defenses
Yunrui Yu, Xuxiang Feng, Pengda Qin, Pengyang Wang, Kafeng Wang, Cheng-zhong Xu, Hang Su, Jun Zhu
Main category: cs.LG
TL;DR: Dummy Classes-based defenses exploit limitations in existing adversarial robustness evaluation methods by using a “dummy” class as safety sink, achieving overestimated robustness. DAWA attack targets both true and dummy labels to properly evaluate these defenses.
Details
Motivation: Existing adversarial robustness evaluation methods like AutoAttack fail to properly assess Dummy Classes-based defenses, which introduce an additional dummy class to capture adversarial examples. This leads to significantly overestimated robustness scores, creating a need for more reliable evaluation benchmarks.
Method: Proposes Dummy-Aware Weighted Attack (DAWA), a novel evaluation method that simultaneously targets both the true label and dummy label with adaptive weighting during adversarial example synthesis. This approach breaks the defense mechanism by considering both classification targets.
Result: DAWA effectively breaks Dummy Classes-based defenses, reducing measured robustness from 58.61% to 29.52% on CIFAR-10 under l_infty perturbation (epsilon=8/255). The method provides more reliable evaluation benchmarks for this emerging defense paradigm.
Conclusion: Current adversarial robustness evaluation methods have limitations when assessing Dummy Classes-based defenses. DAWA addresses this gap and highlights the need for continuous evolution of robustness assessment methodologies as new defense paradigms emerge.
Abstract: Adversarial robustness evaluation faces a critical challenge as new defense paradigms emerge that can exploit limitations in existing assessment methods. This paper reveals that Dummy Classes-based defenses, which introduce an additional “dummy” class as a safety sink for adversarial examples, achieve significantly overestimated robustness under conventional evaluation strategies like AutoAttack. The fundamental limitation stems from these attacks’ singular focus on misleading the true class label, which aligns perfectly with the defense mechanism–successful attacks are simply captured by the dummy class. To address this gap, we propose Dummy-Aware Weighted Attack (DAWA), a novel evaluation method that simultaneously targets both the true label and dummy label with adaptive weighting during adversarial example synthesis. Extensive experiments demonstrate that DAWA effectively breaks this defense paradigm, reducing the measured robustness of a leading Dummy Classes-based defense from 58.61% to 29.52% on CIFAR-10 under l_infty perturbation (epsilon=8/255). Our work provides a more reliable benchmark for evaluating this emerging class of defenses and highlights the need for continuous evolution of robustness assessment methodologies.
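The core of the attack objective is easy to state: ascend a loss that pushes probability away from both the true class and the dummy class, so the adversarial example cannot escape into the safe sink. A toy sketch on a linear model with a fixed weight (the paper's weighting is adaptive; all names below are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def dual_target_objective(x, W, y_true, y_dummy, w=1.0):
    """Attack objective in the spirit of DAWA: drive probability away
    from BOTH the true class and the dummy 'safe sink' class
    (J = -log p_true - w * log p_dummy, to be maximized)."""
    p = softmax(W @ x)
    return -np.log(p[y_true]) - w * np.log(p[y_dummy]), p

rng = np.random.default_rng(4)
d, n_classes = 10, 5            # class 4 plays the dummy class
W = rng.normal(size=(n_classes, d))
x = rng.normal(size=d)
y_true, y_dummy = 0, 4

def grad(x, w=1.0):
    """Input gradient of J for a linear model:
    dJ/dx = ((p - e_true) + w * (p - e_dummy)) @ W."""
    p = softmax(W @ x)
    e_t = np.eye(n_classes)[y_true]
    e_d = np.eye(n_classes)[y_dummy]
    return ((p - e_t) + w * (p - e_d)) @ W

j0, p0 = dual_target_objective(x, W, y_true, y_dummy)
x_adv = x + 0.1 * np.sign(grad(x))      # one FGSM-style ascent step
j1, p1 = dual_target_objective(x_adv, W, y_true, y_dummy)
print(f"objective {j0:.3f} -> {j1:.3f}")
```

A conventional attack ascends only the first term, so its perturbations happily land in the dummy class; the second term is what closes the sink.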
[486] IMPACT: Influence Modeling for Open-Set Time Series Anomaly Detection
Xiaohui Zhou, Yijie Wang, Hongzuo Xu, Weixuan Liang, Xiaoli Li, Guansong Pang
Main category: cs.LG
TL;DR: IMPACT is a novel framework for open-set time series anomaly detection that uses influence modeling to generate realistic unseen anomalies and decontaminate training data.
Details
Motivation: Current open-set anomaly detection methods work well for images but fail for time series data because they don't preserve sequential nature, generating trivial/unrealistic anomalies. They also struggle when training data is contaminated with unlabeled anomalies.
Method: IMPACT learns an influence function to estimate individual training samples’ impact on modeling, then uses influence scores to: 1) generate semantically divergent yet realistic unseen anomalies for time series, and 2) repurpose high-influential samples as supervised anomalies for anomaly decontamination.
Result: Extensive experiments show IMPACT significantly outperforms existing state-of-the-art methods, demonstrating superior accuracy under varying OSAD settings and contamination rates.
Conclusion: IMPACT effectively addresses the limitations of current OSAD methods for time series data by leveraging influence modeling for realistic anomaly generation and data decontamination.
Abstract: Open-set anomaly detection (OSAD) is an emerging paradigm designed to utilize limited labeled data from anomaly classes seen in training to identify both seen and unseen anomalies during testing. Current approaches rely on simple augmentation methods to generate pseudo anomalies that replicate unseen anomalies. Despite being promising in image data, these methods are found to be ineffective in time series data due to the failure to preserve its sequential nature, resulting in trivial or unrealistic anomaly patterns. They are further plagued when the training data is contaminated with unlabeled anomalies. This work introduces IMPACT, a novel framework that leverages influence modeling for open-set time series anomaly detection, to tackle these challenges. The key insight is to (i) learn an influence function that can accurately estimate the impact of individual training samples on the modeling, and then (ii) leverage these influence scores to generate semantically divergent yet realistic unseen anomalies for time series while repurposing high-influential samples as supervised anomalies for anomaly decontamination. Extensive experiments show that IMPACT significantly outperforms existing state-of-the-art methods, showing superior accuracy under varying OSAD settings and contamination rates.
[487] Biomimetic PINNs for Cell-Induced Phase Transitions: UQ-R3 Sampling with Causal Gating
Anci Lin, Xiaohong Liu, Zhiwen Zhang, Weidong Zhao, Wenju Zhao
Main category: cs.LG
TL;DR: Bio-PINNs: A biomimetic physics-informed neural network framework that addresses challenges in modeling cell-induced phase transitions with nonconvex multi-well energies, using progressive distance gating and deformation-uncertainty proxies to capture sharp interfaces and fine-scale microstructures.
Details
Motivation: Nonconvex multi-well energies in cell-induced phase transitions create sharp interfaces, fine-scale microstructures, and distance-dependent inter-cell coupling that challenge physics-informed learning. Existing methods often over-smooth near-field patterns, failing to capture the complex physics accurately.
Method: Proposes Bio-PINNs with: 1) A variational framework encoding temporal causality into explicit spatial causality via progressive distance gate; 2) Deformation-uncertainty proxy for interfacial length scale to target microstructure-prone regions; 3) Uncertainty-driven “retain-resample-release” adaptive collocation strategy with theoretical guarantees.
Result: Bio-PINNs consistently recover sharp transition layers and tether morphologies across single- and multi-cell benchmarks, diverse separations, and various regularization regimes, significantly outperforming state-of-the-art adaptive and ungated baselines.
Conclusion: Bio-PINNs provide an effective framework for modeling complex cell-induced phase transitions with nonconvex energies, offering computational efficiency and theoretical guarantees while addressing the over-smoothing limitations of existing physics-informed learning methods.
Abstract: Nonconvex multi-well energies in cell-induced phase transitions give rise to sharp interfaces, fine-scale microstructures, and distance-dependent inter-cell coupling, all of which pose significant challenges for physics-informed learning. Existing methods often suffer from over-smoothing in near-field patterns. To address this, we propose biomimetic physics-informed neural networks (Bio-PINNs), a variational framework that encodes temporal causality into explicit spatial causality via a progressive distance gate. Furthermore, Bio-PINNs leverage a deformation-uncertainty proxy for the interfacial length scale to target microstructure-prone regions, providing a computationally efficient alternative to explicit second-derivative regularization. We provide theoretical guarantees for the resulting uncertainty-driven ``retain-resample-release'' adaptive collocation strategy, which ensures persistent coverage under gating and establishes a quantitative near-to-far growth bound. Across single- and multi-cell benchmarks, diverse separations, and various regularization regimes, Bio-PINNs consistently recover sharp transition layers and tether morphologies, significantly outperforming state-of-the-art adaptive and ungated baselines.
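The progressive distance gate can be pictured as a soft residual weight that expands outward from the cells over training (a minimal sketch; the sigmoid shape, the radius schedule, and the sharpness `tau` are illustrative assumptions, not the paper's exact parameterization):

```python
import numpy as np

def distance_gate(d, radius, tau=0.05):
    """Soft gate in [0, 1]: ~1 for collocation points within `radius` of the
    nearest cell, decaying smoothly beyond it."""
    return 1.0 / (1.0 + np.exp((d - radius) / tau))

# distances of collocation points to the nearest cell
d = np.linspace(0.0, 1.0, 5)

# early training: only near-field residuals carry weight
early = distance_gate(d, radius=0.2)
# late training: the gate has expanded to cover the far field
late = distance_gate(d, radius=0.9)

# gated PDE loss = sum of gate * squared residual
residuals = np.ones_like(d)
loss_early = np.sum(distance_gate(d, 0.2) * residuals**2)
loss_late = np.sum(distance_gate(d, 0.9) * residuals**2)
```

Near-field points are fit first; as the radius grows, far-field residuals are progressively switched on, which is the "temporal causality encoded as spatial causality" idea in gate form.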
[488] Improving Ensemble Forecasts of Abnormally Deflecting Tropical Cyclones with Fused Atmosphere-Ocean-Terrain Data
Qixiang Li, Shuwei Huo, Chong Wang, Xiaofeng Li, Yuan Zhou
Main category: cs.LG
TL;DR: A multimodal deep learning framework for tropical cyclone forecasting that integrates atmosphere, ocean, and terrain data to handle both normal and abnormal deflected cyclones.
Details
Motivation: Existing deep learning methods for TC forecasting are limited to single data types and fail to accurately predict abnormal deflected cyclones, creating a need for multimodal approaches that capture complex cross-domain interactions.
Method: Created AOT-TCs dataset integrating heterogeneous atmospheric, oceanic, and terrain variables, then developed a forecasting model with explicit atmosphere-ocean-terrain coupling architecture to capture cross-domain interactions.
Result: Achieved state-of-the-art performance on Northwest Pacific TC cases (2017-2024), significantly improving accuracy for normal TCs and breaking through technical bottlenecks for abnormally deflected TCs.
Conclusion: The multimodal approach with explicit cross-domain coupling enables comprehensive TC forecasting that handles both normal and abnormal cases, representing a significant advancement over single-modality methods.
Abstract: Deep learning-based tropical cyclone (TC) forecasting methods have demonstrated significant potential and application advantages, as they feature much lower computational cost and faster operation speed than numerical weather prediction models. However, existing deep learning methods still have key limitations: they can only process a single type of sequential trajectory data or homogeneous meteorological variables, and fail to achieve accurate forecasting of abnormal deflected TCs. To address these challenges, we present two groundbreaking contributions. First, we have constructed a multimodal and multi-source dataset named AOT-TCs for TC forecasting in the Northwest Pacific basin. As the first dataset of its kind, it innovatively integrates heterogeneous variables from the atmosphere, ocean, and land, thus obtaining a comprehensive and information-rich meteorological dataset. Second, based on the AOT-TCs dataset, we propose a forecasting model that can handle both normal and abnormally deflected TCs. This is the first TC forecasting model to adopt an explicit atmosphere-ocean-terrain coupling architecture, enabling it to effectively capture complex interactions across physical domains. Extensive experiments on all TC cases in the Northwest Pacific from 2017 to 2024 show that our model achieves state-of-the-art performance in TC forecasting: it not only significantly improves the forecasting accuracy of normal TCs but also breaks through the technical bottleneck in forecasting abnormally deflected TCs.
[489] Derived Fields Preserve Fine-Scale Detail in Budgeted Neural Simulators
Wenshuo Wang, Fan Zhang
Main category: cs.LG
TL;DR: DerivOpt framework optimizes which physical fields to carry and how to allocate storage budget in neural PDE simulations to improve fine-scale fidelity under fixed storage constraints.
Details
Motivation: Fine-scale-faithful neural simulation under fixed storage budgets is challenging because existing methods focus on architectures, training objectives, or rollout strategies, but fine detail can already be lost when constructing the carried state in budgeted coarsen-quantize-decode pipelines.
Method: Derived-Field Optimization (DerivOpt) is a state-design framework that chooses which physical fields are carried and how storage budget is allocated across them under a calibrated channel model, based on the observation that primitive and derived fields undergo systematically different retained-band distortions under the same operator.
Result: Across PDEBench, DerivOpt improves pooled mean rollout nRMSE and delivers decisive advantage in fine-scale fidelity over strong baselines, with gains visible at input time before rollout learning begins, indicating carried state is often the dominant bottleneck under tight storage budgets.
Conclusion: Carried-state design should be treated as a first-class design axis alongside architecture, loss, and rollout strategy in budgeted neural simulation.
Abstract: Fine-scale-faithful neural simulation under fixed storage budgets remains challenging. Many existing methods reduce high-frequency error by improving architectures, training objectives, or rollout strategies. However, under budgeted coarsen-quantize-decode pipelines, fine detail can already be lost when the carried state is constructed. In the canonical periodic incompressible Navier-Stokes setting, we show that primitive and derived fields undergo systematically different retained-band distortions under the same operator. Motivated by this observation, we formulate Derived-Field Optimization (DerivOpt), a general state-design framework that chooses which physical fields are carried and how storage budget is allocated across them under a calibrated channel model. Across the full time-dependent forward subset of PDEBench, DerivOpt not only improves pooled mean rollout nRMSE, but also delivers a decisive advantage in fine-scale fidelity over a broad set of strong baselines. More importantly, the gains are already visible at input time, before rollout learning begins. This indicates that the carried state is often the dominant bottleneck under tight storage budgets. These results suggest a broader conclusion: in budgeted neural simulation, carried-state design should be treated as a first-class design axis alongside architecture, loss, and rollout strategy.
[490] Robust and Consistent Ski Rental with Distributional Advice
Jihwan Kim, Chenglin Fan
Main category: cs.LG
TL;DR: A framework for ski rental problem using distributional predictions instead of point estimates, with algorithms for both deterministic and randomized settings that balance consistency and robustness.
Details
Motivation: Classical ski rental algorithms focus on worst-case competitive ratios, while recent learning-augmented methods use point-estimate predictions. Neither fully exploits distributional predictions while maintaining robustness guarantees.
Method: Established systematic framework integrating distributional advice of unknown quality. For deterministic setting: formalized problem under perfect distributional prediction, derived optimal threshold-buy day algorithm, proposed Clamp Policy for imperfect predictions. For randomized setting: characterized stopping distribution via Water-Filling Algorithm optimizing expected cost with robustness constraints.
Result: Experimental results across diverse distributions (Gaussian, geometric, bi-modal) show framework improves consistency significantly over point-prediction baselines while maintaining comparable robustness.
Conclusion: The framework successfully integrates distributional predictions into ski rental algorithms, achieving better consistency than point-prediction methods while preserving robustness guarantees.
Abstract: The ski rental problem is a canonical model for online decision-making under uncertainty, capturing the fundamental trade-off between repeated rental costs and a one-time purchase. While classical algorithms focus on worst-case competitive ratios and recent “learning-augmented” methods leverage point-estimate predictions, neither approach fully exploits the richness of full distributional predictions while maintaining rigorous robustness guarantees. We address this gap by establishing a systematic framework that integrates distributional advice of unknown quality into both deterministic and randomized algorithms. For the deterministic setting, we formalize the problem under perfect distributional prediction and derive an efficient algorithm to compute the optimal threshold-buy day. We provide a rigorous performance analysis, identifying sufficient conditions on the predicted distribution under which the expected competitive ratio (ECR) matches the classic optimal randomized bound. To handle imperfect predictions, we propose the Clamp Policy, which restricts the buying threshold to a safe range controlled by a tunable parameter. We show that this policy is both robust, maintaining good performance even with large prediction errors, and consistent, approaching the optimal performance as predictions become accurate. For the randomized setting, we characterize the stopping distribution via a Water-Filling Algorithm, which optimizes expected cost while strictly satisfying robustness constraints. Experimental results across diverse distributions (Gaussian, geometric, and bi-modal) demonstrate that our framework improves consistency significantly over existing point-prediction baselines while maintaining comparable robustness.
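For concreteness, the deterministic side reduces to a small computation: with rent cost 1 per day, buy cost B, and a predicted distribution over the number of ski days, the expected cost of "buy on day t" has a closed form, and the Clamp Policy restricts the minimizer to a safe interval (a sketch; the interval [B/lam, lam*B] below is an illustrative choice, not the paper's exact bounds):

```python
import numpy as np

def expected_cost(t, pmf, B):
    """Expected cost of renting on days 1..t-1, then buying on day t,
    where pmf[n-1] = P(season lasts exactly n days)."""
    return sum(p * (n if n < t else (t - 1) + B)
               for n, p in enumerate(pmf, start=1))

def clamp_policy(pmf, B, lam=2.0):
    """Threshold minimizing expected cost under the predicted distribution,
    then clamped to a safe range so bad advice cannot wreck the worst case."""
    t_star = min(range(1, len(pmf) + 2), key=lambda t: expected_cost(t, pmf, B))
    t_safe = int(np.clip(t_star, B / lam, lam * B))
    return t_star, t_safe

B = 10
long_season = np.zeros(100); long_season[-1] = 1.0   # advice: exactly 100 ski days
print(clamp_policy(long_season, B))                  # -> (1, 5)
```

Trusting the long-season advice fully means buying on day 1 (consistency); the clamp moves that to day 5, sacrificing a little consistency so that a wildly wrong prediction still leaves the worst-case cost bounded (robustness).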
[491] Stochastic Dimension Implicit Functional Projections for Exact Integral Conservation in High-Dimensional PINNs
Zhangyong Liang
Main category: cs.LG
TL;DR: SDIFP framework enforces macroscopic conservation laws in neural PDE solvers using stochastic dimension implicit functional projection with Monte Carlo quadrature and doubly-stochastic gradient estimation.
Details
Motivation: Traditional methods for enforcing conservation laws in neural PDE solvers face computational challenges: discrete projections scale poorly in high dimensions, require spatial grids, incur heavy memory overhead for high-order operators, and lack convergence guarantees for non-convex conservation manifolds.
Method: Proposes Stochastic Dimension Implicit Functional Projection (SDIFP) framework that applies global affine transformation to continuous network output instead of projecting discrete vectors. Uses detached Monte Carlo quadrature for integral constraints, bypassing spatial grid dependencies. Introduces doubly-stochastic unbiased gradient estimator (DS-UGE) that decouples spatial sampling from differential operator subsampling to reduce memory complexity.
Result: SDIFP provides closed-form solutions for integral constraints, mitigates sampling variance, preserves solution regularity, maintains O(1) inference efficiency, and offers a scalable mesh-free approach for solving conservative high-dimensional PDEs.
Conclusion: SDIFP framework enables efficient enforcement of macroscopic conservation laws in neural PDE solvers for high-dimensional problems, overcoming limitations of traditional discrete projection methods through stochastic functional projection and efficient gradient estimation.
Abstract: Enforcing exact macroscopic conservation laws, such as mass and energy, in neural partial differential equation (PDE) solvers is computationally challenging in high dimensions. Traditional discrete projections rely on deterministic quadrature that scales poorly and restricts mesh-free formulations like PINNs. Furthermore, high-order operators incur heavy memory overhead, and generic optimization often lacks convergence guarantees for non-convex conservation manifolds. To address this, we propose the Stochastic Dimension Implicit Functional Projection (SDIFP) framework. Instead of projecting discrete vectors, SDIFP applies a global affine transformation to the continuous network output. This yields closed-form solutions for integral constraints via detached Monte Carlo (MC) quadrature, bypassing spatial grid dependencies. For scalable training, we introduce a doubly-stochastic unbiased gradient estimator (DS-UGE). By decoupling spatial sampling from differential operator subsampling, the DS-UGE reduces memory complexity from $\mathcal{O}(M \times N_{\mathcal{L}})$ to $\mathcal{O}(N \times |\mathcal{I}|)$. SDIFP mitigates sampling variance, preserves solution regularity, and maintains $\mathcal{O}(1)$ inference efficiency, providing a scalable, mesh-free approach for solving conservative high-dimensional PDEs.
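The closed-form projection idea admits a tiny sketch: an additive shift of the network output makes the Monte Carlo estimate of the integral of u hit a conserved value exactly (a one-constraint toy; SDIFP's general affine map and the DS-UGE estimator are beyond this sketch, and the stand-in function for the network output is an assumption):

```python
import numpy as np

rng = np.random.default_rng(1)
V = 2.0                                 # domain volume, e.g. x in [0, 2]
x = rng.uniform(0.0, V, size=4096)      # detached MC quadrature nodes
u = np.sin(3 * x) + 0.3                 # stand-in for the network output u_theta(x)
C = 1.0                                 # conserved quantity: target integral of u

mc_integral = V * u.mean()              # MC estimate of the integral of u
u_proj = u + (C - mc_integral) / V      # closed-form affine (shift) correction

print(V * u_proj.mean())                # equals C up to float error
```

No spatial grid is needed: the correction is computed from the same random sample set, which is the "mesh-free" part of the argument.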
[492] Monodense Deep Neural Model for Determining Item Price Elasticity
Lakshya Garg, Sai Yaswanth, Deep Narayan Mishra, Karthik Kumaran, Anupriya Sharma, Mayank Uniyal
Main category: cs.LG
TL;DR: Proposes a novel framework for estimating item-level price elasticity using large-scale transactional data and machine learning, with a new Monodense-DL neural network architecture that outperforms other methods.
Details
Motivation: Businesses need accurate price elasticity estimation to optimize pricing strategies and revenue management, but traditional methods struggle with large-scale transactional data and lack of treatment-control settings.
Method: Proposes a novel elasticity estimation framework that works without treatment-control settings, using three ML approaches: (1) Monodense-DL network (hybrid neural network with embedding, dense, and Monodense layers), (2) Double Machine Learning (DML) with regression models, and (3) Light Gradient Boosting Model (LGBM).
Result: The proposed Monodense-DL neural network model outperforms DML and LGBM methods in back-testing on multi-category retail data spanning millions of transactions.
Conclusion: The novel framework with Monodense-DL architecture provides superior item-level price elasticity estimation for large-scale transactional data, enabling better pricing strategies and revenue optimization.
Abstract: Item Price Elasticity is used to quantify the responsiveness of consumer demand to changes in item prices, enabling businesses to create pricing strategies and optimize revenue management. Sectors such as store retail, e-commerce, and consumer goods rely on elasticity information derived from historical sales and pricing data. This elasticity provides an understanding of purchasing behavior across different items, consumer discount sensitivity, and demand-elastic departments. This information is particularly valuable for decision making in competitive markets and resource-constrained businesses that aim to maximize profitability and market share. Price elasticity also uncovers historical shifts in consumer responsiveness over time. In this paper, we model item-level price elasticity on large-scale transactional datasets by proposing a novel elasticity estimation framework that can operate in the absence of a treatment-control setting. We test this framework using the machine-learning algorithms listed below, including our newly proposed Monodense deep neural network. (1) Monodense-DL network – hybrid neural network architecture combining embedding, dense, and Monodense layers (2) DML – double machine learning setting using regression models (3) LGBM – Light Gradient Boosting Model. We evaluate our model on multi-category retail data spanning millions of transactions using a back-testing framework. Experimental results demonstrate the superiority of our proposed neural network model within the framework compared to the other prevalent ML-based methods listed above.
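As background, price elasticity is the slope of log demand against log price; the classical log-log regression below recovers it on synthetic data (a baseline sketch only, with made-up numbers; the paper's Monodense-DL, DML, and LGBM models are what scale this to millions of transactions with confounders):

```python
import numpy as np

rng = np.random.default_rng(0)
true_elasticity = -1.8                      # a 1% price increase -> ~1.8% demand drop
price = rng.uniform(5.0, 50.0, size=2000)
demand = 500.0 * price ** true_elasticity * np.exp(0.05 * rng.normal(size=2000))

# elasticity = d ln Q / d ln P: slope of a log-log least-squares fit
slope, intercept = np.polyfit(np.log(price), np.log(demand), 1)
print(slope)                                # close to -1.8
```

Real transaction data lacks this clean constant-elasticity structure, which is why the paper turns to ML estimators; this just fixes the quantity being estimated.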
[493] Lie Generator Networks for Nonlinear Partial Differential Equations
Shafayeth Jamil, Rehan Kapadia
Main category: cs.LG
TL;DR: LGN-KM is a neural operator that learns continuous-time Koopman generators for nonlinear PDEs through a stable decomposition into conservative coupling and dissipative components, enabling interpretable spectral analysis and guaranteed stability.
Details
Motivation: Nonlinear systems governed by PDEs lack a spectral theory equivalent to that of linear dynamical systems, making it difficult to access interpretable spectral properties and ensure stability in learned models.
Method: LGN-KM lifts nonlinear dynamics into a linear latent space and learns the continuous-time Koopman generator through decomposition L_k = S - D_k, where S is skew-symmetric (conservative coupling) and D_k is positive-definite diagonal (modal dissipation).
Result: On 2D Navier-Stokes turbulence, the method recovers known dissipation scaling and complete multi-branch dispersion relations from trajectory data alone, with independently trained models at different flow regimes recovering matched gauge-invariant spectral structure.
Conclusion: The architectural decomposition enables interpretability through direct spectral access, guaranteed long-horizon stability, continuous-time evaluation at arbitrary times, and physics-informed cross-viscosity model transfer.
Abstract: Linear dynamical systems are fully characterized by their eigenspectra, accessible directly from the generator of the dynamics. For nonlinear systems governed by partial differential equations, no equivalent theory exists. We introduce Lie Generator Network–Koopman (LGN-KM), a neural operator that lifts nonlinear dynamics into a linear latent space and learns the continuous-time Koopman generator ($L_k$) through a decomposition $L_k = S - D_k$, where $S$ is skew-symmetric, representing conservative inter-modal coupling, and $D_k$ is a positive-definite diagonal matrix encoding modal dissipation. This architectural decomposition enforces stability and enables interpretability through direct spectral access to the learned dynamics. On two-dimensional Navier–Stokes turbulence, the generator recovers the known dissipation scaling and a complete multi-branch dispersion relation from trajectory data alone with no physics supervision. Independently trained models at different flow regimes recover matched gauge-invariant spectral structure, exposing a gauge freedom in the Koopman lifting. Because the generator is provably stable, it enables guaranteed long-horizon stability, continuous-time evaluation at arbitrary times, and physics-informed cross-viscosity model transfer.
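The stability guarantee is a direct consequence of the decomposition: with $S$ skew-symmetric and $D_k$ positive diagonal, every eigenvalue of $L_k = S - D_k$ has strictly negative real part (for an eigenpair $L_k v = \lambda v$, $\mathrm{Re}(\lambda)\|v\|^2 = -v^* D_k v < 0$ since $v^* S v$ is purely imaginary). A minimal numpy check on a random, untrained generator with this structure:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 8                                               # latent (Koopman) dimension
A = rng.normal(size=(k, k))
S = A - A.T                                         # skew-symmetric: conservative coupling
D = np.diag(np.log1p(np.exp(rng.normal(size=k))))   # softplus -> positive modal dissipation
L = S - D                                           # generator with the stated structure

eigs = np.linalg.eigvals(L)
print(eigs.real.max())                              # strictly negative: exp(tL) contracts
```

Because stability comes from the parameterization itself rather than from training, any learned $L_k$ of this form yields bounded long-horizon rollouts.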
[494] From Physics to Surrogate Intelligence: A Unified Electro-Thermo-Optimization Framework for TSV Networks
Mohamed Gharib, Leonid Popryho, Inna Partin-Vaisband
Main category: cs.LG
TL;DR: A scalable electro-thermal modeling and optimization framework for TSV arrays using physics-informed GNN surrogates and analytical modeling for rapid design-space exploration.
Details
Motivation: High-density TSVs enable 3D integration but introduce signal-integrity and thermal-reliability challenges. Conventional FEM simulations are computationally prohibitive for large design-space exploration.
Method: Combines physics-informed analytical modeling, graph neural network (GNN) surrogates trained on analytical data and fine-tuned with HFSS simulations, and multi-objective Pareto optimization framework.
Result: Achieves 5-10% RFE with analytical model, GNN surrogate generalizes with RFE below 2%, enables exploration of millions of TSV configurations in minutes, reduces evaluation time by >6 orders of magnitude.
Conclusion: The framework enables rapid electro-thermal co-design of TSV arrays, making exhaustive layout and geometric optimization feasible where FEM alone would be infeasible.
Abstract: High-density through-substrate vias (TSVs) enable 2.5D/3D heterogeneous integration but introduce significant signal-integrity and thermal-reliability challenges due to electrical coupling, insertion loss, and self-heating. Conventional full-wave finite-element method (FEM) simulations provide high accuracy but become computationally prohibitive for large design-space exploration. This work presents a scalable electro-thermal modeling and optimization framework that combines physics-informed analytical modeling, graph neural network (GNN) surrogates, and full-wave sign-off validation. A multi-conductor analytical model computes broadband S-parameters and effective anisotropic thermal conductivities of TSV arrays, achieving $5\%$–$10\%$ relative Frobenius error (RFE) across array sizes up to $15\times15$. A physics-informed GNN surrogate (TSV-PhGNN), trained on analytical data and fine-tuned with HFSS simulations, generalizes to larger arrays with RFE below $2\%$ and nearly constant variance. The surrogate is integrated into a multi-objective Pareto optimization framework targeting reflection coefficient, insertion loss, worst-case crosstalk (NEXT/FEXT), and effective thermal conductivity. Millions of TSV configurations can be explored within minutes, enabling exhaustive layout and geometric optimization that would be infeasible using FEM alone. Final designs are validated with Ansys HFSS and Mechanical, showing strong agreement. The proposed framework enables rapid electro-thermal co-design of TSV arrays while reducing per-design evaluation time by more than six orders of magnitude.
[495] LGFNet: Local-Global Fusion Network with Fidelity Gap Delta Learning for Multi-Source Aerodynamics
Qinye Zhu, Yu Xiang, Jun Zhang, Wenyong Wang
Main category: cs.LG
TL;DR: LGFNet is a neural network for aerodynamic data fusion that combines local spatial perception with global relational reasoning to capture both fine-grained flow features and long-range dependencies, using a fidelity gap delta learning strategy to bridge simulation and experimental data.
Details
Motivation: Current aerodynamic data fusion methods struggle to balance high-resolution local fidelity (e.g., shock waves) with wide-range global dependencies, often losing sharp discontinuities or failing to capture long-range topological correlations in flow structures.
Method: Proposes Local-Global Fusion Network (LGFNet) with spatial perception layer (sliding window mechanism) for local features and relational reasoning layer (self-attention) for global dependencies. Introduces fidelity gap delta learning (FGDL) strategy that treats CFD data as a “low-frequency carrier” to approximate nonlinear discrepancies between simulation and experimental data.
Result: LGFNet achieves state-of-the-art performance in both accuracy and uncertainty reduction across diverse aerodynamic scenarios, demonstrating effective fusion of CFD, wind tunnel, and flight test data.
Conclusion: The proposed approach successfully addresses the dual challenge of preserving local flow discontinuities while capturing global aerodynamic trends, providing a comprehensive framework for aerodynamic data fusion across the flight envelope.
Abstract: The precise fusion of computational fluid dynamics (CFD) data, wind tunnel test data, and flight test data in the aerodynamics domain is essential for obtaining comprehensive knowledge of both localized flow structures and global aerodynamic trends across the entire flight envelope. However, existing methodologies often struggle to balance high-resolution local fidelity with wide-range global dependency, leading to either a loss of sharp discontinuities or an inability to capture long-range topological correlations. We propose the Local-Global Fusion Network (LGFNet) for multi-scale feature decomposition to extract this dual-natured aerodynamic knowledge. To this end, LGFNet combines a spatial perception layer that integrates a sliding window mechanism with a relational reasoning layer based on self-attention, simultaneously reinforcing the continuity of fine-grained local features (e.g., shock waves) and capturing long-range flow information. Furthermore, the fidelity gap delta learning (FGDL) strategy is proposed to treat CFD data as a “low-frequency carrier” to explicitly approximate nonlinear discrepancies. This approach prevents unphysical smoothing while inheriting the foundational physical trends from the simulation baseline. Experiments demonstrate that LGFNet achieves state-of-the-art (SOTA) performance in both accuracy and uncertainty reduction across diverse aerodynamic scenarios.
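The two-branch design can be sketched in a few lines of numpy: a sliding-window branch for local continuity plus a single-head self-attention branch for long-range dependencies, fused by addition (the shapes, the mean-pooling window, and the additive fusion are illustrative assumptions; LGFNet's actual layers are richer):

```python
import numpy as np

def local_branch(x, window=5):
    """Sliding-window mean over positions: preserves nearby structure."""
    pad = window // 2
    xp = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
    return np.stack([xp[i:i + window].mean(axis=0) for i in range(len(x))])

def global_branch(x):
    """Single-head self-attention: every position attends to all others."""
    d = x.shape[1]
    scores = x @ x.T / np.sqrt(d)
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)     # rows are attention distributions
    return attn @ x, attn

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 16))                   # 32 flow-field positions, 16 features
g, attn = global_branch(x)
fused = local_branch(x) + g                     # local continuity + global context
```

The local branch keeps sharp features from being averaged away over long ranges, while the attention branch lets distant flow regions inform each other, which is the balance the paper targets.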
[496] Deep Learning-Based Anomaly Detection in Spacecraft Telemetry on Edge Devices
Christopher Goetze, Tim Schlippe, Daniel Lakey
Main category: cs.LG
TL;DR: Spacecraft anomaly detection optimized for edge deployment using neural architecture optimization, achieving high performance with minimal resource usage for CubeSat hardware constraints.
Details
Motivation: Spacecraft anomaly detection is critical for mission safety, but deploying sophisticated models on-board is challenging due to hardware constraints on spacecraft, particularly CubeSats with limited RAM and computational resources.
Method: Investigates three approaches: forecasting & threshold, direct classification, and image classification. Uses multi-objective neural architecture optimization on the European Space Agency Anomaly Dataset to optimize models for edge deployment, balancing detection performance with computational requirements.
Result: Forecasting & threshold achieved best performance (92.7% CEF0.5). Optimized model preserved 88.8% CEF0.5 while reducing RAM usage by 97.1% to 59 KB and operations by 99.4%. Optimized models require only 0.36-6.25% of CubeSat RAM.
Conclusion: Sophisticated anomaly detection can be successfully deployed within spacecraft edge computing constraints, enabling near-instantaneous detection without exceeding hardware limitations or compromising mission safety.
Abstract: Spacecraft anomaly detection is critical for mission safety, yet deploying sophisticated models on-board presents significant challenges due to hardware constraints. This paper investigates three approaches for spacecraft telemetry anomaly detection – forecasting & threshold, direct classification, and image classification – and optimizes them for edge deployment using multi-objective neural architecture optimization on the European Space Agency Anomaly Dataset. Our baseline experiments demonstrate that forecasting & threshold achieves superior detection performance (92.7% Corrected Event-wise F0.5-score (CEF0.5)) [1] compared to alternatives. Through Pareto-optimal architecture optimization, we dramatically reduced computational requirements while maintaining capabilities – the optimized forecasting & threshold model preserved 88.8% CEF0.5 while reducing RAM usage by 97.1% to just 59 KB and operations by 99.4%. Analysis of deployment viability shows our optimized models require just 0.36-6.25% of CubeSat RAM, making on-board anomaly detection practical even on highly constrained hardware. This research demonstrates that sophisticated anomaly detection capabilities can be successfully deployed within spacecraft edge computing constraints, providing near-instantaneous detection without exceeding hardware limitations or compromising mission safety.
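The winning forecasting & threshold approach boils down to: forecast each telemetry sample from its recent past and flag points whose residual exceeds a multiple of the residual spread. A toy sketch with a persistence forecaster (the paper's forecasters are learned and Pareto-optimized for the edge; the MAD-based threshold and the factor `k` are illustrative assumptions):

```python
import numpy as np

def forecast_threshold_anomalies(series, k=4.0):
    """Persistence forecast (predict the previous value); flag points whose
    forecast residual exceeds k robust standard deviations."""
    residuals = np.abs(np.diff(series))
    sigma = np.median(residuals) * 1.4826       # MAD-based robust spread
    flags = np.zeros(len(series), dtype=bool)
    flags[1:] = residuals > k * sigma
    return flags

rng = np.random.default_rng(0)
telemetry = np.sin(np.linspace(0, 8 * np.pi, 500)) + 0.02 * rng.normal(size=500)
telemetry[300] += 1.5                           # injected anomaly
flags = forecast_threshold_anomalies(telemetry)
print(np.flatnonzero(flags))                    # indices around the injected spike
```

The appeal for edge deployment is that the threshold logic itself is nearly free; only the forecaster consumes RAM and operations, which is what the architecture optimization shrinks.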
[497] Finite-time analysis of Multi-timescale Stochastic Optimization Algorithms
Kaustubh Kartikey, Shalabh Bhatnagar
Main category: cs.LG
TL;DR: Finite-time analysis of zeroth-order stochastic approximation algorithms for simulation-based optimization, including gradient-based and Newton-based methods with multiple time-scales.
Details
Motivation: Previous work established asymptotic convergence of two/three time-scale stochastic optimization algorithms, but finite-time guarantees for zeroth-order settings were lacking, especially for Newton-based methods estimating both gradient and Hessian.
Method: Two algorithms: 1) Two time-scale gradient-based method using zeroth-order gradient estimates, 2) Three time-scale Newton-based algorithm estimating both gradient and Hessian via zeroth-order methods. Analysis characterizes time-scale interactions and error propagation.
Result: Derived mean-squared error bounds for Hessian estimator and finite-time bound on gradient norm showing convergence to first-order stationary points. Identified step-size choices balancing error terms for near-optimal rates. Validated on Continuous Mountain Car environment.
Conclusion: Provides first finite-time guarantees for zeroth-order multi-time-scale stochastic optimization algorithms, with theoretical framework applicable to both gradient and Newton methods, validated empirically.
Abstract: We present a finite-time analysis of two smoothed functional stochastic approximation algorithms for simulation-based optimization. The first is a two time-scale gradient-based method, while the second is a three time-scale Newton-based algorithm that estimates both the gradient and the Hessian of the objective function $J$. Both algorithms involve zeroth order estimates for the gradient/Hessian. Although the asymptotic convergence of these algorithms has been established in prior work, finite-time guarantees of two-timescale stochastic optimization algorithms in zeroth order settings have not been provided previously. For our Newton algorithm, we derive mean-squared error bounds for the Hessian estimator and establish a finite-time bound on $\min_{0 \le m \le T} \mathbb{E}\|\nabla J(\theta(m))\|^2$, showing convergence to first-order stationary points. The analysis explicitly characterizes the interaction between multiple time-scales and the propagation of estimation errors. We further identify step-size choices that balance dominant error terms and achieve near-optimal convergence rates. We also provide corresponding finite-time guarantees for the gradient algorithm under the same framework. The theoretical results are further validated through experiments on the Continuous Mountain Car environment.
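The zeroth-order (smoothed functional) estimate at the core of both algorithms perturbs the parameter with Gaussian noise and uses only function evaluations (a single-timescale sketch on a deterministic quadratic; the paper couples such updates across two or three timescales and handles simulation noise):

```python
import numpy as np

def sf_gradient(J, theta, delta, rng):
    """Two-sided smoothed-functional gradient estimate: in expectation this
    equals the gradient of the delta-smoothed J, using only evaluations of J."""
    d = rng.normal(size=theta.shape)
    return d * (J(theta + delta * d) - J(theta - delta * d)) / (2 * delta)

J = lambda th: float(th @ th)               # toy objective, minimum at 0
rng = np.random.default_rng(0)
theta = np.array([2.0, -1.5, 1.0])          # J(theta) = 7.25 at the start
for n in range(3000):
    a_n = 1.0 / (n + 50)                    # diminishing step size
    theta = theta - a_n * sf_gradient(J, theta, delta=0.1, rng=rng)
print(J(theta))                             # far below the starting value 7.25
```

For this quadratic the estimator is exactly unbiased for the true gradient, so the iterate converges to the stationary point; the finite-time analysis in the paper quantifies how fast, and how gradient and Hessian estimation errors interact across timescales.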
[498] Causality-inspired Federated Learning for Dynamic Spatio-Temporal Graphs
Yuxuan Liu, Wenchao Xu, Haozhao Wang, Zhiming He, Zhaofeng Shi, Chongyang Xu, Peichao Wang, Boyuan Zhang
Main category: cs.LG
TL;DR: SC-FSGL is a causality-inspired federated learning framework for dynamic spatio-temporal graphs that decouples transferable causal knowledge from client-specific noise through representation-level interventions and causal codebooks.
Details
Motivation: Existing federated graph learning methods assume all features are equally transferable across clients, overlooking spatial/temporal heterogeneity and client-specific knowledge in real-world dynamic graphs, leading to spurious representation entanglement and negative transfer.
Method: Proposes SC-FSGL with: 1) Conditional Separation Module using client-conditioned masks for soft interventions to disentangle invariant spatio-temporal causal factors from spurious signals, and 2) Causal Codebook that clusters causal prototypes and aligns local representations via contrastive learning.
Result: Experiments on five diverse heterogeneous spatio-temporal graph datasets show SC-FSGL outperforms state-of-the-art methods.
Conclusion: The causality-inspired framework effectively addresses client heterogeneity in federated learning for dynamic spatio-temporal graphs by explicitly separating transferable causal knowledge from client-specific noise.
Abstract: Federated Graph Learning (FGL) has emerged as a powerful paradigm for decentralized training of graph neural networks while preserving data privacy. However, existing FGL methods are predominantly designed for static graphs and rely on parameter averaging or distribution alignment, which implicitly assume that all features are equally transferable across clients, overlooking both the spatial and temporal heterogeneity and the presence of client-specific knowledge in real-world graphs. In this work, we identify that such assumptions create a vicious cycle of spurious representation entanglement, client-specific interference, and negative transfer, degrading generalization performance in Federated Learning over Dynamic Spatio-Temporal Graphs (FSTG). To address this issue, we propose a novel causality-inspired framework named SC-FSGL, which explicitly decouples transferable causal knowledge from client-specific noise through representation-level interventions. Specifically, we introduce a Conditional Separation Module that simulates soft interventions through client-conditioned masks, enabling the disentanglement of invariant spatio-temporal causal factors from spurious signals and mitigating representation entanglement caused by client heterogeneity. In addition, we propose a Causal Codebook that clusters causal prototypes and aligns local representations via contrastive learning, promoting cross-client consistency and facilitating knowledge sharing across diverse spatio-temporal patterns. Experiments on five diverse heterogeneous Spatio-Temporal Graph (STG) datasets show that SC-FSGL outperforms state-of-the-art methods.
[499] PRISM: PRIor from corpus Statistics for topic Modeling
Tal Ishon, Yoav Goldberg, Uri Shaham
Main category: cs.LG
TL;DR: PRISM is a corpus-intrinsic method that improves LDA topic modeling by deriving Dirichlet parameters from word co-occurrence statistics, enhancing topic coherence without external knowledge.
Details
Motivation: Current topic modeling methods often rely on external knowledge like pre-trained embeddings, which limits their applicability in emerging or resource-constrained domains where such external resources may not be available.
Method: PRISM derives Dirichlet parameters directly from word co-occurrence statistics within the corpus itself, providing a corpus-intrinsic initialization for LDA without altering its generative process.
Result: Experiments on text and single cell RNA-seq data show PRISM improves topic coherence and interpretability, performing comparably to models that rely on external knowledge resources.
Conclusion: Corpus-driven initialization provides an effective alternative to external knowledge for topic modeling, making LDA more applicable in resource-constrained settings.
Abstract: Topic modeling seeks to uncover latent semantic structure in text, with LDA providing a foundational probabilistic framework. While recent methods often incorporate external knowledge (e.g., pre-trained embeddings), such reliance limits applicability in emerging or underexplored domains. We introduce \textbf{PRISM}, a corpus-intrinsic method that derives a Dirichlet parameter from word co-occurrence statistics to initialize LDA without altering its generative process. Experiments on text and single cell RNA-seq data show that PRISM improves topic coherence and interpretability, rivaling models that rely on external knowledge. These results underscore the value of corpus-driven initialization for topic modeling in resource-constrained settings. Code is available at: https://github.com/shaham-lab/PRISM.
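PRISM's exact parameterization is given in the paper; as a loose illustration only, here is a hypothetical way to turn within-corpus co-occurrence counts into an asymmetric Dirichlet concentration vector for LDA's topic-word prior. The `base` floor and the degree normalization are invented for this sketch and are not PRISM's formula.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sketch (NOT PRISM's formula): derive an asymmetric Dirichlet
# concentration for LDA's topic-word prior from within-document word
# co-occurrence counts, with no external embeddings.

docs = [
    ["cell", "gene", "expression"],
    ["gene", "expression", "protein"],
    ["market", "stock", "price"],
    ["stock", "price", "trade"],
]

# count unordered within-document co-occurrences
cooc = Counter()
for doc in docs:
    for w1, w2 in combinations(sorted(set(doc)), 2):
        cooc[(w1, w2)] += 1

vocab = sorted({w for doc in docs for w in doc})
# each word's concentration grows with how often it co-occurs with others
degree = {w: sum(c for pair, c in cooc.items() if w in pair) for w in vocab}
total = sum(degree.values())
base = 0.1  # symmetric floor, a common LDA default
eta = {w: base + degree[w] / total for w in vocab}
```

The resulting `eta` could then be passed as the (asymmetric) topic-word prior to a standard LDA implementation, leaving the generative process itself untouched, which is the property the abstract emphasizes.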
[500] Hybrid Quantum-Classical Spatiotemporal Forecasting for 3D Cloud Fields
Fu Wang, Qifeng Lu, Xinyu Long, Meng Zhang, Xiaofei Yang, Weijia Cao, Xiaowen Chu
Main category: cs.LG
TL;DR: QENO: A hybrid quantum-inspired spatiotemporal forecasting framework for 3D cloud fields that outperforms classical methods by modeling nonlocal couplings through quantum enhancement blocks.
Details
Motivation: Accurate 3D cloud forecasting is challenging due to cross-layer interactions, nonlocal dependencies, and multiscale dynamics. Existing models struggle with locality bias and preserving fine cloud structures in volumetric forecasting tasks.
Method: Four-component architecture: 1) classical spatiotemporal encoder for compact latent representation, 2) topology-aware quantum enhancement block for modeling nonlocal couplings in latent space, 3) dynamic fusion temporal unit for integrating quantum features with recurrent memory, and 4) decoder for reconstructing future cloud volumes.
Result: QENO outperforms ConvLSTM, PredRNN++, Earthformer, TAU, and SimVP variants on CMA-MESO 3D cloud fields, achieving MSE of 0.2038, RMSE of 0.4514, and SSIM of 0.6291 while maintaining compact parameter budget.
Conclusion: Topology-aware hybrid quantum-classical feature modeling is promising for 3D cloud structure forecasting and atmospheric Earth observation data analysis.
Abstract: Accurate forecasting of three-dimensional (3D) cloud fields is important for atmospheric analysis and short-range numerical weather prediction, yet it remains challenging because cloud evolution involves cross-layer interactions, nonlocal dependencies, and multiscale spatiotemporal dynamics. Existing spatiotemporal prediction models based on convolutions, recurrence, or attention often rely on locality-biased representations and therefore struggle to preserve fine cloud structures in volumetric forecasting tasks. To address this issue, we propose QENO, a hybrid quantum-inspired spatiotemporal forecasting framework for 3D cloud fields. The proposed architecture consists of four components: a classical spatiotemporal encoder for compact latent representation, a topology-aware quantum enhancement block for modeling nonlocal couplings in latent space, a dynamic fusion temporal unit for integrating measurement-derived quantum features with recurrent memory, and a decoder for reconstructing future cloud volumes. Experiments on CMA-MESO 3D cloud fields show that QENO consistently outperforms representative baselines, including ConvLSTM, PredRNN++, Earthformer, TAU, and SimVP variants, in terms of MSE, MAE, RMSE, SSIM, and threshold-based detection metrics. In particular, QENO achieves an MSE of 0.2038, an RMSE of 0.4514, and an SSIM of 0.6291, while also maintaining a compact parameter budget. These results indicate that topology-aware hybrid quantum-classical feature modeling is a promising direction for 3D cloud structure forecasting and atmospheric Earth observation data analysis.
[501] mtslearn: Machine Learning in Python for Medical Time Series
Zhongheng Jiang, Yuechao Zhao, Donglin Xie, Chenxi Sun, Rongchen Lu, Silu Luo, Zisheng Liang, Shenda Hong
Main category: cs.LG
TL;DR: mtslearn is an end-to-end toolkit for medical time-series data that provides a unified data interface, automated parsing of multiple formats, and a complete pipeline from data reading to visualization, lowering the barrier to clinical AI applications.
Details
Motivation: Real-world clinical time-series data is highly heterogeneous and inconsistently formatted, and existing ML tools have steep learning curves and fragmented workflows, creating a gap between AI technologies and clinical applications.
Method: Introduces mtslearn toolkit with unified data interface that automates parsing of wide, long, and flat data formats, provides complete pipeline from data reading and feature engineering to model training and visualization, with modular design and flexible interfaces for custom algorithms.
Result: The toolkit significantly reduces data cleaning overhead, simplifies complex data engineering tasks into few lines of code, lowers barrier for clinicians with limited programming experience, and accelerates translation of advanced algorithms into clinical practice.
Conclusion: mtslearn bridges the gap between cutting-edge AI and clinical application by providing an integrated toolkit that empowers clinicians to focus on medical hypotheses rather than technical complexities.
Abstract: Medical time-series data captures the dynamic progression of patient conditions, playing a vital role in modern clinical decision support systems. However, real-world clinical data is highly heterogeneous and inconsistently formatted. Furthermore, existing machine learning tools often have steep learning curves and fragmented workflows. Consequently, a significant gap remains between cutting-edge AI technologies and clinical application. To address this, we introduce mtslearn, an end-to-end integrated toolkit specifically designed for medical time-series data. First, the framework provides a unified data interface that automates the parsing and alignment of wide, long, and flat data formats. This design significantly reduces data cleaning overhead. Building on this, mtslearn provides a complete pipeline from data reading and feature engineering to model training and result visualization. Furthermore, it offers flexible interfaces for custom algorithms. Through a modular design, mtslearn simplifies complex data engineering tasks into a few lines of code. This significantly lowers the barrier to entry for clinicians with limited programming experience, empowering them to focus more on exploring medical hypotheses and accelerating the translation of advanced algorithms into real-world clinical practice. mtslearn is publicly available at https://github.com/PKUDigitalHealth/mtslearn.
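mtslearn's actual API is documented in its repository; the snippet below only illustrates, in plain Python, the long-to-wide alignment problem such a toolkit automates. The record layout here is a generic example, not mtslearn's interface.

```python
from collections import defaultdict

# Generic illustration of the long-to-wide alignment a medical time-series
# toolkit automates (this is NOT mtslearn's API).
# Long format: one row per (patient, time, variable, value).
long_rows = [
    ("p1", 0, "hr", 80), ("p1", 0, "spo2", 97),
    ("p1", 1, "hr", 85),
    ("p2", 0, "hr", 70), ("p2", 1, "spo2", 99),
]

def long_to_wide(rows):
    wide = defaultdict(dict)   # (patient, time) -> {variable: value}
    for patient, t, var, val in rows:
        wide[(patient, t)][var] = val
    return dict(wide)

wide = long_to_wide(long_rows)
# missing measurements simply stay absent and can be imputed downstream
```

Irregular sampling shows up naturally: `("p1", 1)` has a heart rate but no SpO2, which is exactly the kind of misalignment that makes hand-written clinical preprocessing error-prone.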
[502] An Isotropic Approach to Efficient Uncertainty Quantification with Gradient Norms
Nils Grünefeld, Jes Frellsen, Christian Hardmeier
Main category: cs.LG
TL;DR: A method for quantifying predictive uncertainty in large language models using gradient-based approximations without training data access.
Details
Motivation: Existing uncertainty quantification methods for neural networks are either computationally expensive for large language models or require training data access, which is often unavailable for pretrained models.
Method: Uses two approximations: 1) first-order Taylor expansion expressing uncertainty via gradient of prediction and parameter covariance, 2) isotropy assumption on parameter covariance. This yields epistemic uncertainty as squared gradient norm and aleatoric uncertainty as Bernoulli variance from a single forward-backward pass.
Result: Method shows strong correspondence with reference MCMC estimates on synthetic problems, improving with model size. On QA benchmarks, combined uncertainty achieves high AUROC on TruthfulQA but near chance on TriviaQA, revealing benchmark-dependent divergence in uncertainty signals.
Conclusion: The proposed lightweight method provides uncertainty estimates from pretrained models without modification, revealing that parameter-level uncertainty captures different signals than self-assessment methods, with benchmark-dependent utility.
Abstract: Existing methods for quantifying predictive uncertainty in neural networks are either computationally intractable for large language models or require access to training data that is typically unavailable. We derive a lightweight alternative through two approximations: a first-order Taylor expansion that expresses uncertainty in terms of the gradient of the prediction and the parameter covariance, and an isotropy assumption on the parameter covariance. Together, these yield epistemic uncertainty as the squared gradient norm and aleatoric uncertainty as the Bernoulli variance of the point prediction, from a single forward-backward pass through an unmodified pretrained model. We justify the isotropy assumption by showing that covariance estimates built from non-training data introduce structured distortions that isotropic covariance avoids, and that theoretical results on the spectral properties of large networks support the approximation at scale. Validation against reference Markov Chain Monte Carlo estimates on synthetic problems shows strong correspondence that improves with model size. We then use the estimates to investigate when each uncertainty type carries useful signal for predicting answer correctness in question answering with large language models, revealing a benchmark-dependent divergence: the combined estimate achieves the highest mean AUROC on TruthfulQA, where questions involve genuine conflict between plausible answers, but falls to near chance on TriviaQA’s factual recall, suggesting that parameter-level uncertainty captures a fundamentally different signal than self-assessment methods.
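The single-pass recipe is easy to make concrete for a one-layer logistic model (a minimal sketch of the two estimates, not the paper's LLM-scale implementation): epistemic uncertainty is the squared norm of the prediction's gradient with respect to the parameters, and aleatoric uncertainty is the Bernoulli variance $p(1-p)$.

```python
import math

# Minimal sketch of the two uncertainty estimates for a logistic model
# p = sigmoid(w . x): one "forward-backward" pass suffices.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def uncertainties(w, x):
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))  # forward pass
    grad = [p * (1.0 - p) * xi for xi in x]            # backward pass: dp/dw
    epistemic = sum(g * g for g in grad)               # squared gradient norm
    aleatoric = p * (1.0 - p)                          # Bernoulli variance
    return p, epistemic, aleatoric

p, epi, ale = uncertainties([0.5, -1.0], [2.0, 1.0])
# closed form here: epistemic = (p(1-p))^2 * ||x||^2
```

Note how the two signals decouple: aleatoric uncertainty peaks at p = 0.5 regardless of the input, while epistemic uncertainty also scales with the input norm through the gradient.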
[503] Survival In-Context: Prior-fitted In-context Learning Tabular Foundation Model for Survival Analysis
Dmitrii Seletkov, Paul Hager, Rickmer Braren, Daniel Rueckert, Raphael Rehms
Main category: cs.LG
TL;DR: SIC is a prior-fitted in-context learning model for survival analysis pretrained on synthetic data, achieving competitive performance on real-world datasets without task-specific training.
Details
Motivation: Survival analysis faces challenges with limited data, censoring, and heterogeneous tabular covariates. Prior-fitted paradigms have shown success for classification/regression but their suitability for time-to-event modeling remains unclear.
Method: Proposes a flexible survival data generation framework with explicit control over covariates and time-event distributions. Uses this prior to train Survival In-Context (SIC), a prior-fitted in-context learning model pretrained exclusively on synthetic data.
Result: SIC achieves competitive or superior performance compared to classical and deep survival models across real-world datasets, particularly in medium-sized data regimes.
Conclusion: Demonstrates the promise of prior-fitted foundation models for survival analysis, with SIC producing individualized survival predictions in a single forward pass without task-specific training.
Abstract: Survival analysis is crucial for many medical applications but remains challenging for modern machine learning due to limited data, censoring, and the heterogeneity of tabular covariates. While the prior-fitted paradigm, which relies on pretraining models on large collections of synthetic datasets, has recently facilitated tabular foundation models for classification and regression, its suitability for time-to-event modeling remains unclear. We propose a flexible survival data generation framework that defines a rich survival prior with explicit control over covariates and time-event distributions. Building on this prior, we introduce Survival In-Context (SIC), a prior-fitted in-context learning model for survival analysis that is pretrained exclusively on synthetic data. SIC produces individualized survival prediction in a single forward pass, requiring no task-specific training or hyperparameter tuning. Across a broad evaluation on real-world survival datasets, SIC achieves competitive or superior performance compared to classical and deep survival models, particularly in medium-sized data regimes, highlighting the promise of prior-fitted foundation models for survival analysis. The code will be made available upon publication.
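The flavor of such a synthetic survival prior can be sketched generically: covariates drive an exponential hazard, and an independent censoring time determines which events are observed. The paper's generator is far richer, and the distributions and `censor_rate` knob below are this sketch's own assumptions.

```python
import math
import random

# Generic sketch of a synthetic survival prior (NOT the paper's framework):
# exponential event times with covariate-dependent hazard, plus independent
# exponential censoring.

def generate_survival(n, beta, censor_rate=0.3, seed=0):
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in beta]
        hazard = math.exp(sum(b * xi for b, xi in zip(beta, x)))
        event_time = rng.expovariate(hazard)
        censor_time = rng.expovariate(censor_rate)
        observed = min(event_time, censor_time)
        event = 1 if event_time <= censor_time else 0  # 0 = censored
        data.append((x, observed, event))
    return data

data = generate_survival(1000, beta=[0.5, -0.5])
frac_events = sum(e for _, _, e in data) / len(data)
# a mix of observed events and right-censored rows, as real registries have
```

Sampling many such datasets with varying `beta`, hazard families, and censoring rates is the kind of prior a prior-fitted model is pretrained on before seeing any real data.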
[504] Why not to use Cosine Similarity between Label Representations
Beatrix M. G. Nielsen
Main category: cs.LG
TL;DR: Cosine similarity between label representations in softmax classifiers doesn’t predict model probabilities - models can be transformed to have identical probabilities but arbitrary cosine similarity between labels.
Details
Motivation: Researchers often use cosine similarity to analyze neural network representations, assuming it reveals something about model behavior. This paper investigates whether cosine similarity between label representations (unembeddings) in softmax classifiers actually correlates with the probabilities assigned by the model.
Method: The paper provides mathematical proofs showing that for any softmax classifier (image classifier or autoregressive language model), given two label representations, it’s possible to construct another model with identical probabilities for all labels and inputs but where the cosine similarity between representations is either 1 or -1. The authors demonstrate specific examples and transformations.
Result: The paper proves that cosine similarity between label representations provides no information about model probabilities. Even after centering representations or fixing their lengths, there’s no guarantee that high/low cosine similarity corresponds to high/low probability for the same inputs.
Conclusion: Cosine similarity values between label representations in softmax classifiers should not be used to explain model probabilities, as they don’t reliably indicate behavioral similarity between labels.
Abstract: Cosine similarity is often used to measure the similarity of vectors. These vectors might be the representations of neural network models. However, it is not guaranteed that cosine similarity of model representations will tell us anything about model behaviour. In this paper we show that when using a softmax classifier, be it an image classifier or an autoregressive language model, measuring the cosine similarity between label representations (called unembeddings in the paper) does not give any information on the probabilities assigned by the model. Specifically, we prove that for any softmax classifier model, given two label representations, it is possible to make another model which gives the same probabilities for all labels and inputs, but where the cosine similarity between the representations is now either 1 or -1. We give specific examples of models with very high or low cosine similarity between representations and show how we can make equivalent models where the cosine similarity is now -1 or 1. This translation ambiguity can be fixed by centering the label representations; however, labels whose representations have low cosine similarity can still have high probability for the same inputs. Fixing the length of the representations still does not give a guarantee that high (or low) cosine similarity will give high (or low) probability to the labels for the same inputs. This means that when working with softmax classifiers, cosine similarity values between label representations should not be used to explain model probabilities.
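The translation ambiguity described in the abstract can be reproduced in a few lines on a toy classifier: adding one common vector to every label unembedding shifts all logits by the same constant, leaving the softmax probabilities unchanged while moving the cosine similarity between labels.

```python
import math

# Toy demonstration of the paper's core observation: a common shift of all
# unembeddings preserves softmax probabilities but changes cosine similarity.

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

h = [2.0, 1.0]                    # input representation
U = [[1.0, 0.0], [0.0, 1.0]]      # label unembeddings, cosine similarity 0
c = [10.0, 10.0]                  # common shift added to every label
U_shift = [[u + ci for u, ci in zip(row, c)] for row in U]

probs = softmax([sum(u * x for u, x in zip(row, h)) for row in U])
probs_shift = softmax([sum(u * x for u, x in zip(row, h)) for row in U_shift])
# probs == probs_shift, yet the labels' cosine jumps from 0.0 to ~0.995
```

Every logit gains the same constant c . h, and softmax only depends on logit differences, so behaviour is identical while the geometry of the unembeddings is arbitrary.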
[505] Target-Aligned Reinforcement Learning
Leonard S. Pleiss, James Harrison, Maximilian Schiffer
Main category: cs.LG
TL;DR: TARL improves RL by focusing updates on transitions where target and online network estimates are well-aligned, addressing the stability-recency tradeoff in target networks.
Details
Motivation: Target networks in RL provide stability but create a fundamental tradeoff: slower updates improve stability but reduce learning signal recency, hindering convergence speed.
Method: Proposes Target-Aligned Reinforcement Learning (TARL) framework that emphasizes transitions where target and online network estimates are highly aligned, focusing updates on well-aligned targets to mitigate stale target effects while retaining stability benefits.
Result: Theoretical analysis shows target alignment correction accelerates convergence, and empirical results demonstrate consistent improvements over standard RL algorithms across various benchmark environments.
Conclusion: TARL effectively addresses the stability-recency tradeoff in target networks by focusing on aligned transitions, improving both stability and convergence speed in reinforcement learning.
Abstract: Many reinforcement learning algorithms rely on target networks - lagged copies of the online network - to stabilize training. While effective, this mechanism introduces a fundamental stability-recency tradeoff: slower target updates improve stability but reduce the recency of learning signals, hindering convergence speed. We propose Target-Aligned Reinforcement Learning (TARL), a framework that emphasizes transitions for which the target and online network estimates are highly aligned. By focusing updates on well-aligned targets, TARL mitigates the adverse effects of stale target estimates while retaining the stabilizing benefits of target networks. We provide a theoretical analysis demonstrating that target alignment correction accelerates convergence, and empirically demonstrate consistent improvements over standard reinforcement learning algorithms across various benchmark environments.
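The summary does not give TARL's exact weighting rule; as a purely hypothetical sketch of the idea, one could scale each TD update by the agreement between target-network and online-network value estimates, so transitions with stale targets contribute less. The exponential weight and `temperature` below are this sketch's own choices.

```python
import math

# Hypothetical sketch of alignment-weighted TD updates (NOT TARL's exact
# rule): transitions where target and online estimates agree get emphasized.

def alignment_weights(q_online, q_target, temperature=1.0):
    # weight decays with the disagreement between the two estimates
    return [math.exp(-abs(o - t) / temperature)
            for o, t in zip(q_online, q_target)]

def weighted_td_update(q_online, td_targets, q_target, lr=0.1):
    w = alignment_weights(q_online, q_target)
    return [q + lr * wi * (tgt - q)
            for q, wi, tgt in zip(q_online, w, td_targets)]

q_online = [1.0, 2.0, 3.0]
q_target = [1.0, 2.5, 10.0]   # last transition has a very stale target
td_targets = [1.5, 2.5, 4.0]
updated = weighted_td_update(q_online, td_targets, q_target)
# the well-aligned first transition moves at the full learning rate, the
# badly-misaligned last one barely moves
```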
[506] Variational Graph Neural Networks for Uncertainty Quantification in Inverse Problems
David Gonzalez, Alba Muixi, Beatriz Moya, Elias Cueto
Main category: cs.LG
TL;DR: A variational graph neural network (VGNN) architecture that integrates variational layers to model weight probability distributions for uncertainty quantification in computational mechanics problems.
Details
Motivation: Deep learning in computational mechanics needs reliable uncertainty quantification for critical applications like Digital Twins, especially for inverse problems where solutions may not be unique or data may be noisy.
Method: Proposes a variational graph neural network (VGNN) with variational layers strategically placed only in the decoder to model weight probability distributions, enabling estimation of cognitive and statistical uncertainty at lower computational cost than full Bayesian networks.
Result: Validated on two solid mechanics cases: elastic modulus identification in 2D elastic problems and load location/quantification in 3D hyperelastic beams. The model accurately recovers physical parameters and provides physically consistent confidence intervals.
Conclusion: The VGNN architecture successfully provides uncertainty quantification for computational mechanics problems while maintaining computational efficiency, making it suitable for critical applications requiring reliable predictions with confidence measures.
Abstract: The increasingly wide use of deep machine learning techniques in computational mechanics has significantly accelerated simulations of problems that were considered unapproachable just a few years ago. However, in critical applications such as Digital Twins for engineering or medicine, fast responses are not enough; reliable results must also be provided. In certain cases, traditional deterministic methods may not be optimal as they do not provide a measure of confidence in their predictions or results, especially in inverse problems where the solution may not be unique or the initial data may not be entirely reliable due to the presence of noise, for instance. Classic deep neural networks also lack a clear measure to quantify the uncertainty of their predictions. In this work, we present a variational graph neural network (VGNN) architecture that integrates variational layers into its architecture to model the probability distribution of weights. Unlike computationally expensive full Bayesian networks, our approach strategically introduces variational layers exclusively in the decoder, allowing us to estimate cognitive uncertainty and statistical uncertainty at a relatively lower cost. In this work, we validate the proposed methodology in two cases of solid mechanics: the identification of the value of the elastic modulus with nonlinear distribution in a 2D elastic problem and the location and quantification of the loads applied to a 3D hyperelastic beam, in both cases using only the displacement field of each test as input data. The results show that the model not only recovers the physical parameters with high precision, but also provides confidence intervals consistent with the physics of the problem, as well as being able to locate the position of the applied load and estimate its value, giving a confidence interval for that experiment.
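The decoder-only variational layers can be sketched with the standard reparameterization trick: each weight carries a mean and a standard deviation, and every forward pass samples concrete weights, so repeated passes yield a predictive distribution rather than a point estimate. This is a minimal scalar-output sketch, not the paper's graph architecture.

```python
import random

# Minimal sketch of a variational (Bayesian) linear layer: weights are
# distributions, sampled per forward pass via the reparameterization trick.

def variational_forward(x, w_mu, w_sigma, rng):
    w = [mu + sigma * rng.gauss(0.0, 1.0)
         for mu, sigma in zip(w_mu, w_sigma)]   # reparameterization
    return sum(wi * xi for wi, xi in zip(w, x))

rng = random.Random(0)
x = [1.0, 2.0]
w_mu, w_sigma = [0.5, -0.5], [0.1, 0.1]
samples = [variational_forward(x, w_mu, w_sigma, rng) for _ in range(2000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
# analytically: mean ~ w_mu . x = -0.5, var ~ sum(sigma_i^2 * x_i^2) = 0.05
```

The spread of `samples` is exactly the confidence interval the abstract describes; placing such layers only in the decoder keeps the sampling cost far below that of a fully Bayesian network.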
[507] Capturing Multivariate Dependencies of EV Charging Events: From Parametric Copulas to Neural Density Estimation
Martin Výboh, Gabriela Grmanová
Main category: cs.LG
TL;DR: First application of Vine copulas and CODINE framework for EV charging event modeling, outperforming traditional methods in capturing complex dependencies between arrival times, durations, and energy demand.
Details
Motivation: Traditional statistical methods for EV charging modeling fail to capture complex non-linear dependencies between key variables (arrival times, durations, energy demand), which is crucial for grid reliability and smart-charging design.
Method: Introduces Vine copulas and Copula Density Neural Estimation (CODINE) framework to model joint dependence structure of EV charging variables, evaluated across three diverse real-world datasets.
Result: Vine copulas and CODINE outperform established parametric families and remain competitive against state-of-the-art benchmarks like conditional Gaussian Mixture Model Networks, with superior preservation of tail behaviors and correlation structures.
Conclusion: These methods provide a robust framework for synthetic charging event generation in varied infrastructure contexts by explicitly focusing on modeling the joint dependence structure.
Abstract: Accurate event-based modeling of electric vehicle (EV) charging is essential for grid reliability and smart-charging design. While traditional statistical methods capture marginal distributions, they often fail to model the complex, non-linear dependencies between charging variables, specifically arrival times, durations, and energy demand. This paper addresses this gap by introducing the first application of Vine copulas and Copula Density Neural Estimation framework (CODINE) to the EV domain. We evaluate these high-capacity dependence models across three diverse real-world datasets. Our results demonstrate that by explicitly focusing on modeling the joint dependence structure, Vine copulas and CODINE outperform established parametric families and remain highly competitive against state-of-the-art benchmarks like conditional Gaussian Mixture Model Networks. We show that these methods offer superior preservation of tail behaviors and correlation structures, providing a robust framework for synthetic charging event generation in varied infrastructure contexts.
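A core preprocessing step shared by parametric and neural copula approaches is the rank transform to pseudo-observations: each margin is mapped to approximately Uniform(0,1) while the dependence structure, which the copula then models separately, is preserved. The arrival/energy data below is invented for illustration.

```python
import random

# Rank (pseudo-observation) transform underlying copula modeling: map each
# margin to (0,1) by ranks, keeping only the dependence structure.

def pseudo_observations(values):
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0] * n
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return [r / (n + 1) for r in ranks]   # strictly inside (0, 1)

rng = random.Random(0)
arrival = [rng.gauss(12.0, 3.0) for _ in range(500)]     # hypothetical data
energy = [a + rng.gauss(0.0, 1.0) for a in arrival]      # dependent margin

u = pseudo_observations(arrival)
v = pseudo_observations(energy)
# the joint distribution of (u, v) is the empirical copula the Vine or
# CODINE model is then fit to
```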
[508] Total Variation Guarantees for Sampling with Stochastic Localization
Jakob Kellermann
Main category: cs.LG
TL;DR: Theoretical convergence analysis of SLIPS, a diffusion-based sampling algorithm using Stochastic Localization, establishing first rigorous guarantees for sampling from probability measures with accessible unnormalized densities.
Details
Motivation: Despite strong empirical performance, SLIPS (a Stochastic Localization-based sampling algorithm inspired by the success of score-based generative models) has lacked a rigorous convergence analysis.
Method: Provides theoretical analysis of the SLIPS algorithm using techniques from the theory of score-based generative models, establishing convergence guarantees in total variation distance under minimal assumptions on the target distribution.
Result: First convergence guarantee for SLIPS showing number of steps scales linearly with dimension (up to logarithmic factors) to achieve ε-accuracy. Analysis provides theoretical insights into empirically observed optimal choice of discretization points.
Conclusion: The work closes theoretical gap for SLIPS algorithm, providing rigorous convergence analysis that explains its empirical success and offers insights into optimal parameter choices for diffusion-based sampling methods.
Abstract: Motivated by the success of score-based generative models, a number of diffusion-based algorithms have recently been proposed for the problem of sampling from a probability measure whose unnormalized density can be accessed. Among them, Grenioux et al. introduced SLIPS, a sampling algorithm based on Stochastic Localization. While SLIPS exhibits strong empirical performance, no rigorous convergence analysis has previously been provided. In this work, we close this gap by establishing the first guarantee for SLIPS in total variation distance. Under minimal assumptions on the target, our bound implies that the number of steps required to achieve an $\varepsilon$-guarantee scales linearly with the dimension, up to logarithmic factors. The analysis leverages techniques from the theory of score-based generative models and further provides theoretical insights into the empirically observed optimal choice of discretization points.
[509] The Geometry of Polynomial Group Convolutional Neural Networks
Yacoub Hendi, Daniel Persson, Magdalena Larfors
Main category: cs.LG
TL;DR: PGCNNs for finite groups using graded group algebras framework with two parametrizations (Hadamard/Kronecker), dimension analysis of neuromanifold, and fiber structure description.
Details
Motivation: To develop a mathematical framework for polynomial group convolutional neural networks for arbitrary finite groups, providing theoretical understanding of their architecture and parameter spaces.
Method: Introduces graded group algebras framework for PGCNNs, yielding two natural parametrizations (Hadamard and Kronecker products). Analyzes neuromanifold dimension and describes fiber structure of Kronecker parametrization.
Result: Dimension of neuromanifold depends only on number of layers and group size. General fiber of Kronecker parametrization described up to regular group action and rescaling. Conjecture made for Hadamard parametrization supported by small group computations.
Conclusion: Provides rigorous mathematical framework for PGCNNs with theoretical insights into parameter spaces and architecture properties for arbitrary finite groups.
Abstract: We study polynomial group convolutional neural networks (PGCNNs) for an arbitrary finite group $G$. In particular, we introduce a new mathematical framework for PGCNNs using the language of graded group algebras. This framework yields two natural parametrizations of the architecture, based on Hadamard and Kronecker products, related by a linear map. We compute the dimension of the associated neuromanifold, verifying that it depends only on the number of layers and the size of the group. We also describe the general fiber of the Kronecker parametrization up to the regular group action and rescaling, and conjecture the analogous description for the Hadamard parametrization. Our conjecture is supported by explicit computations for small groups and shallow networks.
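The relation between the two products behind the parametrizations can be seen concretely: for vectors, the Hadamard product is a coordinate selection (a linear map) applied to the Kronecker product. This illustrates the products themselves, not the paper's layer maps.

```python
# Hadamard vs. Kronecker product for vectors: the former is the "diagonal"
# slice of the latter, i.e. a linear map relates the two parametrizations.

def hadamard(a, b):
    return [x * y for x, y in zip(a, b)]

def kronecker(a, b):
    return [x * y for x in a for y in b]

a, b = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
n = len(a)
kron = kronecker(a, b)    # length n^2, entry i*n + j is a[i] * b[j]
had = hadamard(a, b)      # length n
# entry i of the Hadamard product sits at index i*n + i of the Kronecker
diag_slice = [kron[i * n + i] for i in range(n)]
```

The Kronecker form carries all n^2 cross terms while the Hadamard form keeps only the n diagonal ones, which is one intuition for why the two parametrizations of the same architecture can have differently structured fibers.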
[510] Disentangled Graph Prompting for Out-Of-Distribution Detection
Cheng Yang, Yu Hao, Qi Zhang, Chuan Shi
Main category: cs.LG
TL;DR: Disentangled Graph Prompting (DGP) improves graph out-of-distribution detection using pre-trained GNNs and prompt generators to capture fine-grained in-distribution patterns without OOD supervision.
Details
Motivation: Deep neural networks face safety risks when test data differs from training distributions. Existing graph OOD detection methods lack explicit OOD supervision during training, leading to suboptimal performance of end-to-end encoders.
Method: Proposes Disentangled Graph Prompting (DGP) using pre-trained GNN encoders with two prompt generators: class-specific and class-agnostic prompt graphs that modify edge weights. Includes effective losses to train prompt generators and prevent trivial solutions.
Result: Achieves 3.63% relative AUC improvement over best graph OOD detection baselines on ten datasets. Ablation studies confirm effectiveness of the approach.
Conclusion: DGP effectively addresses the lack of OOD supervision in graph OOD detection by leveraging pre-trained encoders and prompt generators to capture fine-grained ID patterns.
Abstract: When testing data and training data come from different distributions, deep neural networks (DNNs) will face significant safety risks in practical applications. Therefore, out-of-distribution (OOD) detection techniques, which can identify OOD samples at test time and alert the system, are urgently needed. Existing graph OOD detection methods usually characterize fine-grained in-distribution (ID) patterns from multiple perspectives, and train end-to-end graph neural networks (GNNs) for prediction. However, due to the unavailability of OOD data during training, the absence of explicit supervision signals could lead to sub-optimal performance of end-to-end encoders. To address this issue, we follow the pre-training+prompting paradigm to utilize pre-trained GNN encoders, and propose Disentangled Graph Prompting (DGP), to capture fine-grained ID patterns with the help of ID graph labels. Specifically, we design two prompt generators that respectively generate class-specific and class-agnostic prompt graphs by modifying the edge weights of an input graph. We also design several effective losses to train the prompt generators and prevent trivial solutions. We conduct extensive experiments on ten datasets to demonstrate the superiority of our proposed DGP, which achieves a relative AUC improvement of 3.63% over the best graph OOD detection baseline. Ablation studies and hyper-parameter experiments further show the effectiveness of DGP. Code is available at https://github.com/BUPT-GAMMA/DGP.
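The role of DGP's prompt generators, modifying edge weights while the pre-trained encoder stays frozen, can be illustrated with a toy mean-aggregation step. The prompt matrix here is fixed by hand, whereas DGP learns it; the aggregation stands in for an arbitrary pre-trained GNN.

```python
# Toy illustration of graph prompting via edge reweighting: the frozen
# "encoder" (one mean-aggregation step) is untouched; only the input
# graph's edge weights are rescaled by the prompt.

def propagate(adj, features):
    n = len(features)
    out = []
    for i in range(n):
        total, weight = 0.0, 0.0
        for j in range(n):
            total += adj[i][j] * features[j]
            weight += adj[i][j]
        out.append(total / weight if weight > 0 else 0.0)
    return out

adj = [[1.0, 1.0, 0.0],
       [1.0, 1.0, 1.0],
       [0.0, 1.0, 1.0]]
prompt = [[1.0, 0.1, 1.0],
          [0.1, 1.0, 1.0],
          [1.0, 1.0, 1.0]]   # per-edge rescaling (learned in DGP, fixed here)
prompted_adj = [[a * p for a, p in zip(ra, rp)]
                for ra, rp in zip(adj, prompt)]

base = propagate(adj, [1.0, 0.0, -1.0])
prompted = propagate(prompted_adj, [1.0, 0.0, -1.0])
# same frozen encoder, different representations driven only by the prompt
```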
[511] Concept frustration: Aligning human concepts and machine representations
Enrico Parisini, Christopher J. Soelistyo, Ahab Isaac, Alessandro Barp, Christopher R. S. Banerji
Main category: cs.LG
TL;DR: A geometric framework for comparing human concepts with unsupervised representations from foundation models, formalizing “concept frustration”: contradictions that arise when unobserved concepts induce inconsistent relationships between known concepts within existing ontologies.
Details
Motivation: Aligning human-interpretable concepts with the internal representations learned by modern ML systems remains a central challenge for interpretable AI. The paper aims to develop principled methods for diagnosing incomplete concept ontologies and aligning human and machine conceptual reasoning.
Method: Develops task-aligned similarity measures to detect concept frustration between supervised concept-based models and unsupervised representations from foundation models. Uses a geometric framework and linear-Gaussian generative model to derive closed-form expressions for Bayes-optimal concept-based classifier accuracy, decomposing predictive signal into known-known, known-unknown, and unknown-unknown contributions.
Result: Shows that concept frustration is detectable in foundation model representations using task-aligned geometry (while conventional Euclidean comparisons fail). Experiments on synthetic data and real language/vision tasks demonstrate that incorporating frustrating concepts reorganizes geometry of learned concept representations to better align human and machine reasoning.
Conclusion: Provides a principled framework for diagnosing incomplete concept ontologies and aligning human and machine conceptual reasoning, with implications for developing and validating safe interpretable AI for high-risk applications.
Abstract: Aligning human-interpretable concepts with the internal representations learned by modern machine learning systems remains a central challenge for interpretable AI. We introduce a geometric framework for comparing supervised human concepts with unsupervised intermediate representations extracted from foundation model embeddings. Motivated by the role of conceptual leaps in scientific discovery, we formalise the notion of concept frustration: a contradiction that arises when an unobserved concept induces relationships between known concepts that cannot be made consistent within an existing ontology. We develop task-aligned similarity measures that detect concept frustration between supervised concept-based models and unsupervised representations derived from foundation models, and show that the phenomenon is detectable in task-aligned geometry while conventional Euclidean comparisons fail. Under a linear-Gaussian generative model we derive a closed-form expression for Bayes-optimal concept-based classifier accuracy, decomposing predictive signal into known-known, known-unknown and unknown-unknown contributions and identifying analytically where frustration affects performance. Experiments on synthetic data and real language and vision tasks demonstrate that frustration can be detected in foundation model representations and that incorporating a frustrating concept into an interpretable model reorganises the geometry of learned concept representations, to better align human and machine reasoning. These results suggest a principled framework for diagnosing incomplete concept ontologies and aligning human and machine conceptual reasoning, with implications for the development and validation of safe interpretable AI for high-risk applications.
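The closed-form Bayes-optimal accuracy mentioned above can be illustrated in the simplest linear-Gaussian case: two equiprobable classes with means $\mu_0,\mu_1$ and shared variance $\sigma^2$, where the optimal classifier achieves accuracy $\Phi(|\mu_1-\mu_0|/(2\sigma))$. A sketch of that one-dimensional special case (the paper's multivariate decomposition is more general):

```python
import math

def gaussian_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def bayes_accuracy(mu0, mu1, sigma):
    """Bayes-optimal accuracy for two equiprobable Gaussian classes
    N(mu0, sigma^2) vs N(mu1, sigma^2): Phi(|mu1 - mu0| / (2 * sigma))."""
    return gaussian_cdf(abs(mu1 - mu0) / (2.0 * sigma))
```

With identical means the classes are indistinguishable (accuracy 0.5); as the separation grows relative to the noise, accuracy approaches 1, which is where an unobserved "frustrating" concept can hide predictive signal.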
[512] A Comprehensive Information-Decomposition Analysis of Large Vision-Language Models
Lixin Xiu, Xufang Luo, Hideki Nakayama
Main category: cs.LG
TL;DR: A framework using partial information decomposition (PID) to quantitatively measure the information spectrum of large vision-language models, revealing their internal decision-making processes and distinguishing between true multimodal fusion vs. unimodal priors.
Details
Motivation: Large vision-language models achieve impressive performance, but their internal decision-making processes remain opaque. It's difficult to determine if success comes from true multimodal fusion or reliance on unimodal priors, creating an attribution gap that needs to be addressed.
Method: Introduces a novel framework using partial information decomposition (PID) to quantitatively measure the “information spectrum” of LVLMs. Adapts a scalable estimator to modern LVLM outputs, creating a model-agnostic pipeline that profiles 26 LVLMs on four datasets across three dimensions: breadth (cross-model & cross-task), depth (layer-wise information dynamics), and time (learning dynamics across training).
Result: Analysis reveals two key results: (i) two task regimes (synergy-driven vs. knowledge-driven) and (ii) two stable, contrasting family-level strategies (fusion-centric vs. language-centric). Also uncovers a consistent three-phase pattern in layer-wise processing and identifies visual instruction tuning as the key stage where fusion is learned.
Conclusion: The framework provides a quantitative lens beyond accuracy-only evaluation and offers insights for analyzing and designing the next generation of LVLMs, addressing the attribution gap in understanding multimodal fusion.
Abstract: Large vision-language models (LVLMs) achieve impressive performance, yet their internal decision-making processes remain opaque, making it difficult to determine if the success stems from true multimodal fusion or from reliance on unimodal priors. To address this attribution gap, we introduce a novel framework using partial information decomposition (PID) to quantitatively measure the “information spectrum” of LVLMs – decomposing a model’s decision-relevant information into redundant, unique, and synergistic components. By adapting a scalable estimator to modern LVLM outputs, our model-agnostic pipeline profiles 26 LVLMs on four datasets across three dimensions – breadth (cross-model & cross-task), depth (layer-wise information dynamics), and time (learning dynamics across training). Our analysis reveals two key results: (i) two task regimes (synergy-driven vs. knowledge-driven) and (ii) two stable, contrasting family-level strategies (fusion-centric vs. language-centric). We also uncover a consistent three-phase pattern in layer-wise processing and identify visual instruction tuning as the key stage where fusion is learned. Together, these contributions provide a quantitative lens beyond accuracy-only evaluation and offer insights for analyzing and designing the next generation of LVLMs. Code and data are available at https://github.com/RiiShin/pid-lvlm-analysis .
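The redundant/unique/synergistic split that PID formalizes can be seen in a toy discrete example: an XOR target carries no information in either input alone, yet one full bit jointly, i.e. purely synergistic information. A small sketch using empirical mutual information (not the paper's scalable estimator):

```python
import math
from collections import Counter

def mutual_information(pairs):
    """I(A;B) in bits from a list of (a, b) samples (empirical distribution)."""
    n = len(pairs)
    pab = Counter(pairs)
    pa = Counter(a for a, _ in pairs)
    pb = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), c in pab.items():
        p = c / n
        mi += p * math.log2(p / ((pa[a] / n) * (pb[b] / n)))
    return mi

# Uniform binary inputs, target = XOR: all information about T is synergistic.
samples = [(x1, x2, x1 ^ x2) for x1 in (0, 1) for x2 in (0, 1)]
i_x1 = mutual_information([(x1, t) for x1, _, t in samples])
i_x2 = mutual_information([(x2, t) for _, x2, t in samples])
i_joint = mutual_information([((x1, x2), t) for x1, x2, t in samples])
```

Here `i_x1` and `i_x2` are zero while `i_joint` is one bit: the analogue of a "synergy-driven" task regime, where neither modality alone predicts the answer.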
[513] Mind the Gap: A Framework for Assessing Pitfalls in Multimodal Active Learning
Dustin Eisenhardt, Yunhee Jeong, Florian Buettner
Main category: cs.LG
TL;DR: Multimodal active learning faces unique challenges like missing modalities and modality difficulty differences, leading to imbalanced representations where models rely on one modality while neglecting others.
Details
Motivation: Active learning in multimodal settings faces distinct challenges not present in unimodal cases, including missing modalities, differences in modality difficulty, and varying interaction structures. Current understanding of how active learning strategies behave under these multimodal conditions is limited.
Method: Introduces a new benchmarking framework for multimodal active learning using synthetic datasets to isolate specific pitfalls without confounding noise. Compares unimodal and multimodal query strategies and validates findings on two real-world datasets.
Result: Models consistently develop imbalanced representations, relying primarily on one modality while neglecting others. Existing query methods do not mitigate this effect, and multimodal strategies do not consistently outperform unimodal ones.
Conclusion: Highlights limitations of current active learning methods and underlines the need for modality-aware query strategies that explicitly address multimodal-specific pitfalls like missing modalities and modality difficulty differences.
Abstract: Multimodal learning enables neural networks to integrate information from heterogeneous sources, but active learning in this setting faces distinct challenges. These include missing modalities, differences in modality difficulty, and varying interaction structures. These are issues absent in the unimodal case. While the behavior of active learning strategies in unimodal settings is well characterized, their behavior under such multimodal conditions remains poorly understood. We introduce a new framework for benchmarking multimodal active learning that isolates these pitfalls using synthetic datasets, allowing systematic evaluation without confounding noise. Using this framework, we compare unimodal and multimodal query strategies and validate our findings on two real-world datasets. Our results show that models consistently develop imbalanced representations, relying primarily on one modality while neglecting others. Existing query methods do not mitigate this effect, and multimodal strategies do not consistently outperform unimodal ones. These findings highlight limitations of current active learning methods and underline the need for modality-aware query strategies that explicitly address these pitfalls. Code and benchmark resources will be made publicly available.
[514] Nonnegative Matrix Factorization in the Component-Wise L1 Norm for Sparse Data
Giovanni Seraghiti, Kévin Dubrulle, Arnaud Vandaele, Nicolas Gillis
Main category: cs.LG
TL;DR: L1-NMF with weighted median algorithm for sparse data handling, showing NP-hardness, sparsity enforcement, and efficient scaling with nonzeros.
Details
Motivation: Standard NMF using least squares is sensitive to heavy-tailed noise and outliers. L1-NMF is more robust but NP-hard and may produce overly sparse solutions for data with false zeros.
Method: Proposes weighted L1-NMF (wL1-NMF) with sparsity control via penalization on entries associated with zeros in data. Develops sparse coordinate descent (sCD) algorithm using weighted median for subproblems.
Result: L1-NMF is NP-hard even for r=1. wL1-NMF effectively controls sparsity. sCD scales with number of nonzero entries, making it efficient for large sparse datasets.
Conclusion: wL1-NMF with sCD algorithm provides robust factorization for sparse data with heavy-tailed noise, balancing sparsity control and computational efficiency.
Abstract: Nonnegative matrix factorization (NMF) approximates a nonnegative matrix, $X$, by the product of two nonnegative factors, $WH$, where $W$ has $r$ columns and $H$ has $r$ rows. In this paper, we consider NMF using the component-wise L1 norm as the error measure (L1-NMF), which is suited for data corrupted by heavy-tailed noise, such as Laplace noise or salt and pepper noise, or in the presence of outliers. Our first contribution is an NP-hardness proof for L1-NMF, even when $r=1$, in contrast to the standard NMF that uses least squares. Our second contribution is to show that L1-NMF strongly enforces sparsity in the factors for sparse input matrices, thereby favoring interpretability. However, if the data is affected by false zeros, too sparse solutions might degrade the model. Our third contribution is a new, more general, L1-NMF model for sparse data, dubbed weighted L1-NMF (wL1-NMF), where the sparsity of the factorization is controlled by adding a penalization parameter to the entries of $WH$ associated with zeros in the data. The fourth contribution is a new coordinate descent (CD) approach for wL1-NMF, denoted as sparse CD (sCD), where each subproblem is solved by a weighted median algorithm. To the best of our knowledge, sCD is the first algorithm for L1-NMF whose complexity scales with the number of nonzero entries in the data, making it efficient in handling large-scale, sparse data. We perform extensive numerical experiments on synthetic and real-world data to show the effectiveness of our new proposed model (wL1-NMF) and algorithm (sCD).
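The weighted-median subproblem at the heart of sCD is easy to state in the rank-one case: for fixed $w > 0$, $\min_h \sum_i |x_i - w_i h|$ is solved by the weighted median of the ratios $x_i / w_i$ with weights $w_i$. A minimal sketch of that single update (simplified to one coordinate; the paper's algorithm additionally exploits sparsity and the wL1 penalty):

```python
def weighted_median(values, weights):
    """Return a minimizer of sum_i weights[i] * |values[i] - m|."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    total = sum(weights)
    cum = 0.0
    for i in order:
        cum += weights[i]
        if cum >= total / 2.0:
            return values[i]

def l1_update_h(x, w):
    """Coordinate update: argmin_h sum_i |x[i] - w[i]*h| for w[i] > 0,
    i.e. the weighted median of x[i]/w[i] with weights w[i]."""
    ratios = [xi / wi for xi, wi in zip(x, w)]
    return weighted_median(ratios, w)
```

Note the robustness this buys: in `l1_update_h([2, 4, 100], [1, 1, 1])` the outlier 100 leaves the estimate at the median 4, whereas a least-squares update would be dragged far upward.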
[515] One-for-All: A Lightweight Stabilized and Parameter-Efficient Pre-trained LLM for Time Series Forecasting
Prasanjit Dey, Soumyabrata Dev, Bianca Schoen-Phelan
Main category: cs.LG
TL;DR: One-for-All introduces Gaussian Rank-Stabilized Low-Rank Adapters (rsLoRA) for parameter-efficient fine-tuning of frozen LLMs for multivariate time-series analysis, achieving state-of-the-art efficiency-accuracy trade-offs.
Details
Motivation: Pre-trained LLMs face prohibitive computational and memory demands when adapted for multivariate time-series analysis, limiting their deployment in practical applications like healthcare, finance, and environmental monitoring.
Method: Introduces rsLoRA with mathematically grounded rank-stabilization mechanism for provable gradient stability at low ranks. Injects trainable rank decomposition matrices (rank 16) into positional embeddings and output layers while keeping self-attention weights fixed.
Result: Achieves 6.8-21× parameter reduction vs. SOTA models, 168-1,776× smaller memory footprint (2.2MiB vs. 340MiB-4.18GiB), 5.5× higher parameter efficiency than TimesNet while matching forecasting accuracy (MSE=0.33).
Conclusion: One-for-All enables efficient deployment of LLMs for time-series analysis on edge devices without compromising performance, validated across diverse datasets and horizons.
Abstract: We address the challenge of adapting pre-trained Large Language Models (LLMs) for multivariate time-series analysis, where their deployment is often hindered by prohibitive computational and memory demands. Our solution, One-for-All, introduces Gaussian Rank-Stabilized Low-Rank Adapters (rsLoRA) to enable parameter-efficient fine-tuning of frozen LLMs. While inspired by LoRA, rsLoRA introduces a mathematically grounded rank-stabilization mechanism that enables provable gradient stability at low ranks, a novel contribution absent in prior PEFT methods. Our framework injects trainable rank decomposition matrices (rank 16) into positional embeddings and output layers, while keeping self-attention weights fixed. This design reduces trainable parameters by 6.8$\times$ (vs. TimesNet), 21$\times$ (vs. GPT4TS), and 11.8$\times$ (vs. TIME-LLM), while achieving a 168-1,776$\times$ smaller memory footprint (2.2MiB vs. 340MiB-4.18GiB in SOTA models). Rigorous evaluation across six time-series tasks demonstrates that One-for-All achieves state-of-the-art efficiency-accuracy trade-offs: 5.5$\times$ higher parameter efficiency than TimesNet and 21$\times$ better than GPT4TS, while matching their forecasting accuracy (MSE=0.33). The framework’s stability is validated through consistent performance across diverse horizons (96-720 steps) and datasets (ETT, Weather, M3, M4), with 98.3% fewer parameters than conventional transformers. These advances enable deployment on edge devices for healthcare, finance, and environmental monitoring without compromising performance.
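The rank-stabilization idea follows the rsLoRA line of work, which replaces LoRA's $\alpha/r$ scaling with $\alpha/\sqrt{r}$ so that update magnitudes and gradients do not collapse as the rank grows; the Gaussian variant's exact formulation is not given in the abstract, so the sketch below covers only the standard scaling rule:

```python
import math

def lora_scale(alpha, r):
    """Standard LoRA scaling: update = (alpha / r) * B @ A @ x."""
    return alpha / r

def rslora_scale(alpha, r):
    """Rank-stabilized scaling: alpha / sqrt(r). The slower decay in r
    keeps the adapter's contribution stable as the rank grows."""
    return alpha / math.sqrt(r)
```

At the paper's rank 16 with alpha = 16, standard LoRA scales the update by 1.0 while rsLoRA scales it by 4.0; going from rank 4 to rank 64 shrinks the LoRA factor 16-fold but the rsLoRA factor only 4-fold.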
[516] Training-Free Dynamic Upcycling of Expert Language Models
Eros Fanì, Oğuzhan Ersoy
Main category: cs.LG
TL;DR: DUME (Dynamic Upcycling MoE) reuses pre-trained dense domain experts to create a unified Mixture of Experts model without additional training, preserving original capabilities while enabling dynamic expert addition.
Details
Motivation: Training LLMs is expensive, and they lack domain expertise. Expertise finetuning causes overspecialization, while multitask training suffers from interference and forgetting. Existing MoE approaches still require multitask finetuning.
Method: DUME reuses dense experts trained on different domains to construct a unified MoE model using ridge regression’s closed-form solution, eliminating optimization needs and enabling dynamic expert addition.
Result: DUME outperforms baselines in causal language modeling and reasoning, retaining 97.6% of dense expert performance in one domain and achieving 102.1% in reasoning settings.
Conclusion: DUME provides a cost-efficient, scalable approach to create multitask models that preserve domain expertise without additional training, with potential for further fine-tuning improvements.
Abstract: Large Language Models (LLMs) have achieved remarkable performance on a wide range of specialized tasks, exhibiting strong problem-solving capabilities. However, training these models is prohibitively expensive, and they often lack domain-specific expertise because they rely on general knowledge datasets. Expertise finetuning can address this issue; however, it often leads to overspecialization, and developing a single multi-domain expert remains difficult due to diverging objectives. Furthermore, multitask training is challenging due to interference and catastrophic forgetting. Existing work proposes combining the expertise of dense models within a Mixture of Experts (MoE) architecture, although this approach still requires multitask finetuning. To address these issues, we introduce Dynamic Upcycling MoE (DUME), a novel approach that reuses dense experts trained on different domains to construct a unified MoE model. Our method builds a single multitask model that preserves the capabilities of the original dense experts without requiring additional training. DUME is both cost-efficient and scalable: by leveraging the closed-form solution of ridge regression, it eliminates the need for further optimization and enables experts to be added dynamically while maintaining the model’s original performance. We demonstrate that DUME consistently outperforms baseline approaches in both causal language modeling and reasoning settings. Finally, we also show that the DUME model can be fine-tuned to further improve performance. We show that, in the causal language modeling setting, DUME can retain up to 97.6% of a dense expert model specialized in one particular domain, and that it can also surpass it in the reasoning setting, where it can achieve 102.1% of the dense expert performance. Our code is available at: github.com/gensyn-ai/dume.
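The training-free construction hinges on ridge regression having a closed-form solution; how DUME applies it to expert merging is not spelled out in the abstract, so the sketch below only illustrates the closed form itself, in the single-coefficient case:

```python
def ridge_1d(xs, ys, lam):
    """Closed-form ridge regression for a single coefficient:
    argmin_w sum_i (ys[i] - w*xs[i])^2 + lam*w^2  =  <x,y> / (<x,x> + lam).
    No iterative optimization is needed, which is what makes a
    ridge-based fit "training-free"."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)
```

With `lam = 0` this recovers ordinary least squares; increasing `lam` shrinks the coefficient toward zero. In the multivariate case the same formula becomes $(X^\top X + \lambda I)^{-1} X^\top y$, again solvable in one linear-algebra step as new experts are added.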
[517] Big2Small: A Unifying Neural Network Framework for Model Compression
Jing-Xiao Liao, Haoran Wang, Tao Li, Daoming Lyu, Yi Zhang, Chengjun Cai, Feng-Lei Fan
Main category: cs.LG
TL;DR: Big2Small: A data-free model compression framework that uses Implicit Neural Representations (INRs) to encode large model weights into compact INRs, achieving competitive compression with mathematical unification.
Details
Motivation: To elevate model compression from fragmented heuristics to a principled discipline by establishing a unifying mathematical framework based on measure theory, and to develop a data-free compression method that doesn't require original training data.
Method: Proposes Big2Small framework that trains compact Implicit Neural Representations (INRs) to encode weights of larger models. Uses Outlier-Aware Preprocessing to handle extreme weight values and Frequency-Aware Loss to preserve high-frequency details. Based on mathematical equivalence showing each compression technique is equivalent to neural network regularization.
Result: Experiments on image classification and segmentation show competitive accuracy and compression ratios compared to state-of-the-art baselines. The framework successfully compresses models without requiring original training data.
Conclusion: Big2Small provides a principled, data-free approach to model compression using INRs, with mathematical unification of various compression techniques through measure theory. The method demonstrates practical effectiveness while advancing theoretical understanding.
Abstract: With the development of foundational models, model compression has become a critical requirement. Various model compression approaches have been proposed such as low-rank decomposition, pruning, quantization, ergodic dynamic systems, and knowledge distillation, which are based on different heuristics. To elevate the field from fragmentation to a principled discipline, we construct a unifying mathematical framework for model compression grounded in measure theory. We further demonstrate that each model compression technique is mathematically equivalent to a neural network subject to a regularization. Building upon this mathematical and structural equivalence, we propose an experimentally-verified data-free model compression framework, termed \textit{Big2Small}, which translates Implicit Neural Representations (INRs) from data domain to the domain of network parameters. \textit{Big2Small} trains compact INRs to encode the weights of larger models and reconstruct the weights during inference. To enhance reconstruction fidelity, we introduce Outlier-Aware Preprocessing to handle extreme weight values and a Frequency-Aware Loss function to preserve high-frequency details. Experiments on image classification and segmentation demonstrate that \textit{Big2Small} achieves competitive accuracy and compression ratios compared to state-of-the-art baselines.
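The core INR idea is that weights become a function of their coordinates: instead of storing an m-by-n matrix, one stores a small network queried at each index. A toy sketch with a closed-form function standing in for the trained INR (`decode_weights` and `toy_inr` are illustrative names, not the paper's API):

```python
import math

def decode_weights(inr, shape):
    """Reconstruct a weight matrix by querying a coordinate network
    inr(u, v) at each normalized index; storage cost is the INR's
    own parameters rather than the full matrix."""
    rows, cols = shape
    return [[inr(i / rows, j / cols) for j in range(cols)] for i in range(rows)]

# Toy "INR": a tiny closed-form function standing in for a trained MLP.
toy_inr = lambda u, v: math.sin(3 * u) * math.cos(2 * v)
W = decode_weights(toy_inr, (4, 4))
```

In Big2Small the INR would be fit to the large model's actual weights (after outlier-aware preprocessing), and reconstruction happens at inference time; the compression ratio is the INR's parameter count relative to the number of weight entries.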
[518] Multimodal Machine Learning for Early Prediction of Metastasis in a Swedish Multi-Cancer Cohort
Franco Rugolon, Korbinian Randl, Braslav Jovanovic, Ioanna Miliou, Panagiotis Papapetrou
Main category: cs.LG
TL;DR: A multimodal machine learning framework for predicting cancer metastasis risk using EHR data, comparing traditional vs deep learning models and various fusion strategies across four cancer types.
Details
Motivation: To develop a holistic patient status assessment by integrating structured and unstructured EHR data for predicting metastasis risk one month prior to diagnosis, addressing the need for early cancer progression detection.
Method: Used EHR data from four cancer cohorts (breast, colon, lung, prostate) including demographics, comorbidities, lab results, medications, and clinical text. Compared traditional and deep learning classifiers across single modalities and multimodal combinations with various fusion strategies (early, intermediate, late). Employed TRIPOD 2a design with 80-20 development-validation split and used SHAP for interpretability analysis.
Result: Intermediate fusion achieved highest F1 scores for breast (0.845), colon (0.786), and prostate (0.845) cancers. For lung cancer, text-only model performed best (0.829 F1). Deep learning consistently outperformed traditional models. Colon cancer (smallest cohort) had lowest performance, highlighting data quantity importance. SHAP showed modality importance varied by cancer type.
Conclusion: Multimodal approaches with intermediate fusion generally perform best for metastasis prediction, but optimal strategy depends on data characteristics and clinical needs. Deep learning outperforms traditional methods, and sufficient training data is crucial for model performance.
Abstract: Multimodal Machine Learning offers a holistic view of a patient’s status, integrating structured and unstructured data from electronic health records (EHR). We propose a framework to predict metastasis risk one month prior to diagnosis, using six months of clinical history from EHR data. Data from four cancer cohorts collected at Karolinska University Hospital (Stockholm, Sweden) were analyzed: breast (n = 743), colon (n = 387), lung (n = 870), and prostate (n = 1890). The dataset included demographics, comorbidities, laboratory results, medications, and clinical text. We compared traditional and deep learning classifiers across single modalities and multimodal combinations, using various fusion strategies and a Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) 2a design, with an 80-20 development-validation split to ensure a rigorous, repeatable evaluation. Performance was evaluated using AUROC, AUPRC, F1 score, sensitivity, and specificity. We then employed a multimodal adaptation of SHAP to analyze the classifiers’ reasoning. Intermediate fusion achieved the highest F1 scores on breast (0.845), colon (0.786), and prostate cancer (0.845), demonstrating strong predictive performance. For lung cancer, the intermediate fusion achieved an F1 score of 0.819, while the text-only model achieved the highest, with an F1 score of 0.829. Deep learning classifiers consistently outperformed traditional models. Colon cancer, the smallest cohort, had the lowest performance, highlighting the importance of sufficient training data. SHAP analysis showed that the relative importance of modalities varied across cancer types. Fusion strategies offer distinct strengths and weaknesses. Intermediate fusion consistently delivered the best results, but strategy choices should align with data characteristics and organizational needs.
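The fusion strategies compared above differ in where modalities are combined: early fusion concatenates raw features before a single model, late fusion averages independent per-modality predictions, and intermediate fusion merges learned representations in between. A minimal sketch of the two extremes with toy linear scorers (illustrative only, not the study's classifiers):

```python
def early_fusion(features_a, features_b, weights):
    """Concatenate modality features, then apply one linear scorer."""
    x = features_a + features_b  # list concatenation = feature concatenation
    return sum(w * v for w, v in zip(weights, x))

def late_fusion(score_a, score_b):
    """Average independent per-modality predictions."""
    return 0.5 * (score_a + score_b)
```

Intermediate fusion, the study's overall winner, would sit between these: each modality is encoded separately and the encodings (rather than raw features or final scores) are combined before the prediction head.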
[519] From Density Matrices to Phase Transitions in Deep Learning: Spectral Early Warnings and Interpretability
Max Hennick, Guillaume Corlouer
Main category: cs.LG
TL;DR: The paper introduces the “2-datapoint reduced density matrix” (2RDM) as a tool to study emergent capabilities and phase transitions during AI model training, inspired by quantum chemistry methods.
Details
Motivation: To develop a computationally efficient, unified observable for predicting and understanding emergent capabilities and phase transitions during AI model training, addressing a key problem in modern AI research.
Method: Proposes the 2-datapoint reduced density matrix (2RDM) inspired by quantum chemistry. Tracks eigenvalue statistics over sliding windows to derive two signals: spectral heat capacity (early warning of second-order phase transitions) and participation ratio (reveals dimensionality of reorganization).
Result: Validated across four distinct settings: deep linear networks, induction head formation, grokking, and emergent misalignment. The top eigenvectors of 2RDM are directly interpretable, making transitions easy to study.
Conclusion: The 2RDM provides a powerful, interpretable tool for studying emergent capabilities and phase transitions during training, with promising directions for future work.
Abstract: A key problem in the modern study of AI is predicting and understanding emergent capabilities in models during training. Inspired by methods for studying reactions in quantum chemistry, we present the ``2-datapoint reduced density matrix'' (2RDM). We show that this object provides a computationally efficient, unified observable of phase transitions during training. By tracking the eigenvalue statistics of the 2RDM over a sliding window, we derive two complementary signals: the spectral heat capacity, which we prove provides early warning of second-order phase transitions via critical slowing down, and the participation ratio, which reveals the dimensionality of the underlying reorganization. Remarkably, the top eigenvectors of the 2RDM are directly interpretable, making it straightforward to study the nature of the transitions. We validate across four distinct settings: deep linear networks, induction head formation, grokking, and emergent misalignment. We then discuss directions for future work using the 2RDM.
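The participation ratio used as the second signal is a standard spectral quantity, PR = (Σλ)² / Σλ²: it equals n for a perfectly uniform n-mode spectrum and 1 when a single mode dominates. A sketch of that computation (the paper's sliding-window estimator may differ in detail):

```python
def participation_ratio(eigenvalues):
    """PR = (sum λ)^2 / sum λ^2: the effective number of active modes
    in a (nonnegative) spectrum."""
    s1 = sum(eigenvalues)
    s2 = sum(l * l for l in eigenvalues)
    return s1 * s1 / s2
```

A sudden rise in PR during training would indicate that the reorganization of representations is spread over many directions rather than one dominant mode, which is the dimensionality signal the paper tracks.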
[520] AMShortcut: An Inference- and Training-Efficient Inverse Design Model for Amorphous Materials
Yan Lin, Jonas A. Finkler, Tao Du, Jilin Hu, Morten M. Smedskjaer
Main category: cs.LG
TL;DR: AMShortcut is an efficient probabilistic generative model for inverse design of amorphous materials that enables fast inference with few sampling steps and flexible property conditioning.
Details
Motivation: Amorphous materials lack long-range atomic order but have complex short/medium-range structures, requiring large simulation cells. Inverse design with generative models can facilitate applications in energy storage and thermal management, but existing methods suffer from inefficient inference requiring many sampling steps and inflexible property conditioning.
Method: AMShortcut is a probabilistic generative model that enables accurate inference of diverse amorphous structures with only a few sampling steps. It can be trained once with all relevant properties and perform inference conditioned on arbitrary combinations of desired properties, eliminating the need for training separate models for each property combination.
Result: Experiments on three amorphous materials datasets with diverse structures and properties demonstrate that AMShortcut achieves its design goals of inference efficiency and flexible property conditioning.
Conclusion: AMShortcut provides an efficient approach for inverse design of amorphous materials, overcoming limitations of existing methods in inference efficiency and property flexibility, which could accelerate applications in energy storage and thermal management domains.
Abstract: Amorphous materials are solids that lack long-range atomic order but possess complex short- and medium-range order. Unlike crystalline materials that can be described by unit cells containing few up to hundreds of atoms, amorphous materials require larger simulation cells with at least hundreds or often thousands of atoms. Inverse design of amorphous materials with probabilistic generative models aims to generate the atomic positions and elements of amorphous materials given a set of desired properties. It has emerged as a promising approach for facilitating the application of amorphous materials in domains such as energy storage and thermal management. In this paper, we introduce AMShortcut, an inference- and training-efficient probabilistic generative model for amorphous materials. AMShortcut enables accurate inference of diverse short- and medium-range structures in amorphous materials with only a few sampling steps, mitigating the need for an excessive number of sampling steps that hinders inference efficiency. AMShortcut can be trained once with all relevant properties and perform inference conditioned on arbitrary combinations of desired properties, mitigating the need for training one model for each combination. Experiments on three amorphous materials datasets with diverse structures and properties demonstrate that AMShortcut achieves its design goals.
[521] Loss Gap Parity for Fairness in Heterogeneous Federated Learning
Brahim Erraji, Michaël Perrot, Aurélien Bellet
Main category: cs.LG
TL;DR: EAGLE is a federated learning algorithm that ensures fairness by minimizing disparities in loss gaps across clients, rather than just achieving loss parity.
Details
Motivation: Clients in federated learning are self-interested and want the global model to perform well on their own data, but existing methods may degrade performance for many clients when pursuing loss parity.
Method: EAGLE explicitly regularizes the global model to minimize disparities in loss gaps across clients, prioritizing clients furthest from their local optimal loss.
Result: EAGLE reduces disparity in loss gaps among clients while maintaining competitive utility compared to strong baselines in both convex and non-convex cases.
Conclusion: EAGLE provides a novel approach to fairness in federated learning by targeting fairness in relative improvements rather than absolute loss parity.
Abstract: While clients may join federated learning to improve performance on data they rarely observe locally, they often remain self-interested, expecting the global model to perform well on their own data. This motivates an objective that ensures all clients achieve a similar loss gap (the difference in performance between the global model and the best model they could train using only their local data). To this end, we propose EAGLE, a novel federated learning algorithm that explicitly regularizes the global model to minimize disparities in loss gaps across clients. Our approach is particularly effective in heterogeneous settings, where the optimal local models of the clients may be misaligned. Unlike existing methods that encourage loss parity, potentially degrading performance for many clients, EAGLE targets fairness in relative improvements. We provide theoretical convergence guarantees for EAGLE under non-convex loss functions, and characterize how its iterates perform relative to the standard federated learning objective using a novel heterogeneity measure. Empirically, we demonstrate that EAGLE reduces the disparity in loss gaps among clients by prioritizing those furthest from their local optimal loss, while maintaining competitive utility in both convex and non-convex cases compared to strong baselines.
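One natural surrogate for loss-gap parity is to penalize the spread of the gaps g_k = L_k(global) - L_k(local optimum) on top of the mean loss. The sketch below uses a variance penalty, which is a hypothetical stand-in for EAGLE's actual regularizer (the abstract does not give its exact form):

```python
def loss_gap_parity_objective(global_losses, local_opt_losses, lam):
    """Mean client loss plus a penalty on the spread of loss gaps
    g_k = L_k(global) - L_k(local optimum). Hypothetical surrogate for
    EAGLE's regularizer; the paper's exact formulation may differ."""
    gaps = [g - l for g, l in zip(global_losses, local_opt_losses)]
    mean_loss = sum(global_losses) / len(global_losses)
    mean_gap = sum(gaps) / len(gaps)
    gap_var = sum((g - mean_gap) ** 2 for g in gaps) / len(gaps)
    return mean_loss + lam * gap_var
```

Note how this differs from loss parity: two clients with losses 1.0 and 2.0 incur no penalty as long as their gaps are equal, because fairness is measured in relative improvement over each client's local optimum, not in absolute loss.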
[522] Curvature-Guided LoRA: Steering in the pretrained NTK subspace
Frédéric Zheng, Alexandre Proutière
Main category: cs.LG
TL;DR: CG-LoRA: A curvature-guided LoRA variant that uses second-order information to better align predictions with full fine-tuning, improving performance and convergence speed.
Details
Motivation: Parameter-efficient fine-tuning methods like LoRA often underperform compared to full fine-tuning. Existing approaches focus on parameter alignment rather than directly controlling model predictions, leading to suboptimal results.
Method: Introduces prediction alignment problem to match PEFT outputs to full fine-tuning. Derives curvature-aware second-order formulation where optimal low-rank updates correspond to Newton-like, curvature-whitened gradients. Proposes CG-LoRA which selects and scales adaptation directions using local curvature information without explicit second-order matrix construction.
Result: Preliminary experiments on standard natural language understanding benchmarks show improved performance and faster convergence compared to existing LoRA variants.
Conclusion: Direct prediction alignment through curvature-guided adaptation improves PEFT performance, making it more competitive with full fine-tuning while maintaining efficiency.
Abstract: Parameter-efficient fine-tuning methods such as LoRA enable efficient adaptation of large pretrained models but often fall short of full fine-tuning performance. Existing approaches focus on aligning parameter updates, which only indirectly control model predictions. In this work, we introduce the prediction alignment problem, aiming to match the predictor obtained via PEFT to that of full fine-tuning at the level of outputs. We show that this objective naturally leads to a curvature-aware, second-order formulation, where optimal low-rank updates correspond to a Newton-like, curvature-whitened gradient. Based on this insight, we propose Curvature-Guided LoRA (CG-LoRA), which selects and scales adaptation directions using local curvature information. Our method is computationally efficient and avoids explicit second-order matrix construction. Preliminary experiments on standard natural language understanding benchmarks demonstrate improved performance and faster convergence compared to existing LoRA variants.
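A minimal sketch of the curvature-whitened, Newton-like low-rank update described above, assuming a diagonal curvature estimate and plain SVD truncation (both illustrative simplifications; the paper avoids explicit second-order matrices):

```python
import numpy as np

def curvature_whitened_lowrank_update(grad, curvature_diag, rank, eps=1e-6):
    """Whiten a weight gradient by a diagonal curvature estimate, then keep
    the top-`rank` directions as a LoRA-style factorized update B @ A."""
    whitened = grad / (curvature_diag + eps)        # Newton-like scaling
    U, S, Vt = np.linalg.svd(whitened, full_matrices=False)
    B = U[:, :rank] * S[:rank]                      # (d_out, rank)
    A = Vt[:rank]                                   # (rank, d_in)
    return B, A

rng = np.random.default_rng(0)
G = rng.normal(size=(8, 6))                  # toy weight gradient
H = np.abs(rng.normal(size=(8, 6))) + 0.5    # toy positive curvature estimate
B, A = curvature_whitened_lowrank_update(G, H, rank=2)
update = B @ A                               # rank-2 curvature-aware update
```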
[523] DiSGMM: A Method for Time-varying Microscopic Weight Completion on Road Networks
Yan Lin, Jilin Hu, Shengnan Guo, Christian S. Jensen, Youfang Lin, Huaiyu Wan
Main category: cs.LG
TL;DR: DiSGMM: A method for time-varying microscopic road-network weight completion using sparsity-aware embeddings and Gaussian mixture models to estimate travel speed distributions from sparse vehicle data.
Details
Motivation: Microscopic road-network weights (like travel speeds from individual vehicles) are crucial for traffic simulation and routing but suffer from two layers of sparsity: missing data at both network and segment levels, requiring flexible distribution estimation that can handle complex traffic patterns.
Method: DiSGMM combines sparsity-aware embeddings with spatiotemporal modeling to leverage sparse known weights, learned segment properties, and long-range correlations. It represents weight distributions as learnable Gaussian mixture models for closed-form, flexible distribution estimation.
Result: Experiments on two real-world datasets show DiSGMM outperforms state-of-the-art methods for microscopic weight completion.
Conclusion: DiSGMM effectively addresses the dual sparsity challenge in microscopic road-network weight completion by providing flexible, closed-form distribution estimation through Gaussian mixture models with spatiotemporal modeling.
Abstract: Microscopic road-network weights represent fine-grained, time-varying traffic conditions obtained from individual vehicles. An example is travel speeds associated with road segments as vehicles traverse them. These weights support tasks including traffic microsimulation and vehicle routing with reliability guarantees. We study the problem of time-varying microscopic weight completion. During a time slot, the available weights typically cover only some road segments. Weight completion recovers distributions for the weights of every road segment at the current time slot. This problem involves two challenges: (i) contending with two layers of sparsity, where weights are missing at both the network layer (many road segments lack weights) and the segment layer (a segment may have insufficient weights to enable accurate distribution estimation); and (ii) achieving a weight distribution representation that is closed-form and can capture complex conditions flexibly, including heavy tails and multiple clusters. To address these challenges, we propose DiSGMM that combines sparsity-aware embeddings with spatiotemporal modeling to leverage sparse known weights alongside learned segment properties and long-range correlations for distribution estimation. DiSGMM represents distributions of microscopic weights as learnable Gaussian mixture models, providing closed-form distributions capable of capturing complex conditions flexibly. Experiments on two real-world datasets show that DiSGMM can outperform state-of-the-art methods.
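A toy illustration (not the authors' model) of the distributional idea behind DiSGMM: representing a segment's speed distribution as a Gaussian mixture yields a closed form that can capture multiple clusters, e.g. congested vs. free-flow traffic. All parameter values below are made up:

```python
import numpy as np

def gmm_pdf(x, weights, means, stds):
    """Closed-form density of a 1-D Gaussian mixture, vectorized over x."""
    x = np.asarray(x)[..., None]
    comps = np.exp(-0.5 * ((x - means) / stds) ** 2) / (stds * np.sqrt(2 * np.pi))
    return comps @ weights

# Bimodal speed distribution: congested (~20 km/h) and free-flow (~60 km/h).
w = np.array([0.4, 0.6])
mu = np.array([20.0, 60.0])
sigma = np.array([5.0, 8.0])
density_congested = gmm_pdf(20.0, w, mu, sigma)   # near a mixture mode
density_between = gmm_pdf(40.0, w, mu, sigma)     # valley between modes
```

A single Gaussian could not represent the valley at 40 km/h; the mixture does, which is the flexibility the paper exploits.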
[524] Tracking Equivalent Mechanistic Interpretations Across Neural Networks
Alan Sun, Mariya Toneva
Main category: cs.LG
TL;DR: Proposes a framework for mechanistic interpretability based on interpretive equivalence: determining whether two models share a common algorithmic interpretation without explicitly describing it.
Details
Motivation: Mechanistic interpretability (MI) is difficult to scale and generalize due to lack of precise notion of valid interpretation and ad hoc interpretation generation processes.
Method: Defines interpretive equivalence principle: two interpretations are equivalent if all possible implementations are equivalent. Develops algorithm to estimate interpretive equivalence and analyzes Transformer-based models using representation similarity conditions.
Result: Provides necessary and sufficient conditions for interpretive equivalence based on representation similarity, with guarantees relating algorithmic interpretations, circuits, and representations.
Conclusion: Framework lays foundation for more rigorous MI evaluation methods and automated, generalizable interpretation discovery methods.
Abstract: Mechanistic interpretability (MI) is an emerging framework for interpreting neural networks. Given a task and model, MI aims to discover a succinct algorithmic process, an interpretation, that explains the model’s decision process on that task. However, MI is difficult to scale and generalize. This stems in part from two key challenges: there is no precise notion of a valid interpretation; and, generating interpretations is often an ad hoc process. In this paper, we address these challenges by defining and studying the problem of interpretive equivalence: determining whether two different models share a common interpretation, without requiring an explicit description of what that interpretation is. At the core of our approach, we propose and formalize the principle that two interpretations of a model are equivalent if all of their possible implementations are also equivalent. We develop an algorithm to estimate interpretive equivalence and case study its use on Transformer-based models. To analyze our algorithm, we introduce necessary and sufficient conditions for interpretive equivalence based on models’ representation similarity. We provide guarantees that simultaneously relate a model’s algorithmic interpretations, circuits, and representations. Our framework lays a foundation for the development of more rigorous evaluation methods of MI and automated, generalizable interpretation discovery methods.
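The paper's equivalence conditions are stated in terms of representation similarity; as an illustrative stand-in (the paper's own conditions may differ), linear CKA is a standard measure for comparing two models' representations of the same inputs:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between representations X, Y of shape (n_samples, dim)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))
R, _ = np.linalg.qr(rng.normal(size=(16, 16)))      # random rotation
same_up_to_rotation = linear_cka(X, X @ R)          # rotations preserve CKA
unrelated = linear_cka(X, rng.normal(size=(100, 16)))
```

Invariance to orthogonal transformations is what makes such measures candidates for detecting shared interpretations across differently parameterized models.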
[525] Task Scarcity and Label Leakage in Relational Transfer Learning
Francisco Galuppo Azevedo, Clarissa Lima Loures, Denis Oliveira Correa
Main category: cs.LG
TL;DR: The paper addresses label leakage in relational foundation models where limited task supervision causes representations to encode task-specific shortcuts, degrading transfer performance within the same schema.
Details
Motivation: Relational foundation models need to learn representations that transfer across tasks, but available supervision is typically limited to a small number of prediction targets per database. This task scarcity causes learned representations to encode task-specific shortcuts (label leakage), which degrades transfer performance even within the same schema.
Method: The authors propose K-Space, a modular architecture combining frozen pretrained tabular encoders with a lightweight message-passing core. To suppress label leakage, they introduce a gradient projection method that removes label-predictive directions from representation updates.
Result: On RelBench, the gradient projection method improves within-dataset transfer by +0.145 AUROC on average, often recovering near single-task performance. The results suggest that limited task diversity, not just limited data, constrains relational foundation models.
Conclusion: Label leakage is a significant problem in relational foundation models caused by limited task supervision. The proposed gradient projection method effectively suppresses leakage and improves transfer performance, highlighting that task diversity (not just data quantity) is a key constraint for these models.
Abstract: Training relational foundation models requires learning representations that transfer across tasks, yet available supervision is typically limited to a small number of prediction targets per database. This task scarcity causes learned representations to encode task-specific shortcuts that degrade transfer even within the same schema, a problem we call label leakage. We study this using K-Space, a modular architecture combining frozen pretrained tabular encoders with a lightweight message-passing core. To suppress leakage, we introduce a gradient projection method that removes label-predictive directions from representation updates. On RelBench, this improves within-dataset transfer by +0.145 AUROC on average, often recovering near single-task performance. Our results suggest that limited task diversity, not just limited data, constrains relational foundation models.
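A hedged sketch of the leakage-suppression idea: project a representation update onto the orthogonal complement of label-predictive directions. In the paper those directions would be estimated from the training signal; here `label_dirs` is an arbitrary placeholder:

```python
import numpy as np

def project_out(update, label_dirs):
    """Remove the components of `update` lying in span(label_dirs)."""
    Q, _ = np.linalg.qr(np.asarray(label_dirs).T)   # orthonormal basis
    return update - Q @ (Q.T @ update)

d = np.array([[1.0, 0.0, 0.0]])   # one (hypothetical) label-predictive direction
g = np.array([3.0, 2.0, -1.0])    # raw representation update
g_proj = project_out(g, d)        # update with the leaky component removed
```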
[526] Reward-Based Online LLM Routing via NeuralUCB
Ming-Hua Tsai, Phat Tran
Main category: cs.LG
TL;DR: NeuralUCB-based cost-aware LLM routing method outperforms baselines in utility reward while reducing inference costs compared to max-quality reference.
Details
Motivation: Existing LLM routing approaches have tradeoffs between efficiency and adaptivity: supervised methods lack adaptivity while partial-feedback methods may be inefficient. Need for cost-aware routing that balances quality and inference costs.
Method: Implemented NeuralUCB-based routing policy and evaluated on RouterBench under simulated online setting. Uses neural contextual bandits for adaptive routing decisions.
Result: Method consistently outperforms random and min-cost baselines in utility reward. Achieves substantially lower inference cost while maintaining competitive reward compared to max-quality reference.
Conclusion: NeuralUCB is promising for cost-aware LLM routing, but challenges remain in action discrimination and exploration.
Abstract: This study investigates the use of NeuralUCB for cost-aware large language model (LLM) routing. Existing routing approaches can be broadly grouped into supervised routing methods and partial-feedback methods, each with different tradeoffs in efficiency and adaptivity. We implement a NeuralUCB-based routing policy and evaluate it on RouterBench under a simulated online setting. Experimental results show that the proposed method consistently outperforms random and min-cost baselines in utility reward. Compared with the max-quality reference, our method achieves substantially lower inference cost while maintaining competitive reward. These findings suggest that NeuralUCB is a promising approach for cost-aware LLM routing, while also highlighting remaining challenges in action discrimination and exploration.
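A sketch of the UCB routing decision. NeuralUCB scores each arm (candidate LLM) with a neural network and adds an exploration bonus; for brevity this sketch uses the linear special case (LinUCB-style), which exposes the same "estimated utility plus uncertainty bonus" selection rule:

```python
import numpy as np

def ucb_select(context, thetas, A_invs, alpha=1.0):
    """Pick the model whose estimated utility + exploration bonus is largest."""
    scores = []
    for theta, A_inv in zip(thetas, A_invs):
        mean = context @ theta                           # estimated utility
        bonus = alpha * np.sqrt(context @ A_inv @ context)  # uncertainty bonus
        scores.append(mean + bonus)
    return int(np.argmax(scores))

x = np.array([1.0, 0.5])                                 # query features
thetas = [np.array([0.2, 0.1]), np.array([0.1, 0.05])]   # per-model estimates
A_invs = [np.eye(2) * 0.01, np.eye(2)]                   # arm 1 barely explored
chosen = ucb_select(x, thetas, A_invs)
```

Here the under-explored arm wins despite a lower utility estimate, which is exactly the exploration behavior the paper relies on for online adaptation.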
[527] Real-Time Explanations for Tabular Foundation Models
Luan Borges Teodoro Reis Sena, Francisco Galuppo Azevedo
Main category: cs.LG
TL;DR: ShapPFN is a foundation model that integrates Shapley value regression into its architecture to provide both predictions and explanations in a single forward pass, achieving competitive performance with 1000× faster explanations than KernelSHAP.
Details
Motivation: Interpretability is crucial for scientific machine learning to understand why models make predictions for hypothesis generation and validation. While tabular foundation models show strong performance, existing explanation methods like SHAP are computationally expensive, limiting interactive exploration.
Method: ShapPFN integrates Shapley value regression directly into its architecture, enabling the model to produce both predictions and explanations in a single forward pass. This approach eliminates the need for separate post-hoc explanation methods.
Result: On standard benchmarks, ShapPFN achieves competitive performance while producing high-fidelity explanations (R²=0.96, cosine=0.99) over 1000× faster than KernelSHAP (0.06s vs 610s).
Conclusion: ShapPFN provides an efficient solution for interpretable machine learning by combining prediction and explanation capabilities in a single model, enabling interactive exploration of tabular data with fast, high-quality explanations.
Abstract: Interpretability is central for scientific machine learning, as understanding why models make predictions enables hypothesis generation and validation. While tabular foundation models show strong performance, existing explanation methods like SHAP are computationally expensive, limiting interactive exploration. We introduce ShapPFN, a foundation model that integrates Shapley value regression directly into its architecture, producing both predictions and explanations in a single forward pass. On standard benchmarks, ShapPFN achieves competitive performance while producing high-fidelity explanations (R²=0.96, cosine=0.99) over 1000× faster than KernelSHAP (0.06s vs 610s). Our code is available at https://github.com/kunumi/ShapPFN
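For context on what ShapPFN's explanations approximate: exact Shapley values, shown here by brute force for a toy three-feature value function (illustrative only; the paper's contribution is producing such attributions in a single forward pass instead of this exponential enumeration):

```python
from itertools import combinations
from math import factorial

def shapley_values(n, value):
    """Exact Shapley values for an n-player game given a value function."""
    players = set(range(n))
    phis = []
    for i in range(n):
        phi = 0.0
        others = players - {i}
        for k in range(n):
            for S in combinations(others, k):
                # Weight of coalition S in the Shapley formula.
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                phi += w * (value(set(S) | {i}) - value(set(S)))
        phis.append(phi)
    return phis

# Additive toy model: features contribute 1, 2, 3 independently.
v = lambda S: sum(i + 1 for i in S)
phis = shapley_values(3, v)   # recovers the individual contributions
```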
[528] Meteorology-Driven GPT4AP: A Multi-Task Forecasting LLM for Atmospheric Air Pollution in Data-Scarce Settings
Prasanjit Dey, Soumyabrata Dev, Bianca Schoen-Phelan
Main category: cs.LG
TL;DR: GPT4AP: A parameter-efficient GPT-2-based framework for air pollution forecasting using meteorology-driven adaptation and few-shot/zero-shot learning.
Details
Motivation: Air pollution forecasting models often struggle with generalization in regions with sparse observations, requiring data-efficient approaches that can work with limited supervision and handle domain shifts.
Method: Uses pre-trained GPT-2 backbone with frozen self-attention/feed-forward layers, adapting lightweight positional and output modules via Gaussian rank-stabilized low-rank adaptation (rsLoRA) for parameter efficiency.
Result: Outperforms DLinear and ETSformer in few-shot (10% data) with MSE/MAE 0.686/0.442; achieves 0.529/0.403 in zero-shot cross-station transfer; remains competitive in long-term forecasting with full data (MAE 0.429).
Conclusion: GPT4AP provides a data-efficient forecasting approach that performs robustly under limited supervision and domain shift while maintaining competitive accuracy in data-rich settings.
Abstract: Accurate forecasting of air pollution is important for environmental monitoring and policy support, yet data-driven models often suffer from limited generalization in regions with sparse observations. This paper presents Meteorology-Driven GPT for Air Pollution (GPT4AP), a parameter-efficient multi-task forecasting framework based on a pre-trained GPT-2 backbone and Gaussian rank-stabilized low-rank adaptation (rsLoRA). The model freezes the self-attention and feed-forward layers and adapts lightweight positional and output modules, substantially reducing the number of trainable parameters. GPT4AP is evaluated on six real-world air quality monitoring datasets under few-shot, zero-shot, and long-term forecasting settings. In the few-shot regime using 10% of the training data, GPT4AP achieves an average MSE/MAE of 0.686/0.442, outperforming DLinear (0.728/0.530) and ETSformer (0.734/0.505). In zero-shot cross-station transfer, the proposed model attains an average MSE/MAE of 0.529/0.403, demonstrating improved generalization compared with existing baselines. In long-term forecasting with full training data, GPT4AP remains competitive, achieving an average MAE of 0.429, while specialized time-series models show slightly lower errors. These results indicate that GPT4AP provides a data-efficient forecasting approach that performs robustly under limited supervision and domain shift, while maintaining competitive accuracy in data-rich settings.
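A small sketch of the rank-stabilized scaling used by rsLoRA adapters (dimensions and factors below are illustrative): standard LoRA scales the low-rank update by alpha/r, which shrinks as the rank grows, whereas rsLoRA scales by alpha/sqrt(r) so the update magnitude stays stable across ranks:

```python
import numpy as np

def lora_update(A, B, alpha, rank_stabilized=True):
    """LoRA-style update B @ A with standard or rank-stabilized scaling."""
    r = A.shape[0]                     # A: (r, d_in), B: (d_out, r)
    scale = alpha / np.sqrt(r) if rank_stabilized else alpha / r
    return scale * (B @ A)

rng = np.random.default_rng(0)
r, d = 16, 32
A = rng.normal(size=(r, d)) / np.sqrt(d)
B = rng.normal(size=(d, r)) / np.sqrt(r)
std_norm = np.linalg.norm(lora_update(A, B, alpha=16, rank_stabilized=False))
rs_norm = np.linalg.norm(lora_update(A, B, alpha=16, rank_stabilized=True))
```

At rank 16 the rsLoRA update is sqrt(16) = 4 times larger than the standard one for the same factors, which is what keeps higher-rank adapters from being effectively frozen.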
[529] Quantifying Cross-Modal Interactions in Multimodal Glioma Survival Prediction via InterSHAP: Evidence for Additive Signal Integration
Iain Swift, JingHua Ye, Ruairi O’Reilly
Main category: cs.LG
TL;DR: The paper adapts InterSHAP to quantify cross-modal interactions in multimodal cancer prognosis models, finding that better-performing architectures show lower cross-modal interaction, suggesting performance gains come from complementary signal aggregation rather than learned synergy.
Details
Motivation: To test the common assumption that multimodal deep learning for cancer prognosis benefits from synergistic cross-modal interactions, which hasn't been directly tested in survival prediction settings.
Method: Adapts InterSHAP (Shapley interaction index-based metric) from classification to Cox proportional hazards models and applies it to quantify cross-modal interactions in glioma survival prediction using TCGA-GBM and TCGA-LGG data (n=575). Evaluates four fusion architectures combining whole-slide image (WSI) and RNA-seq features.
Result: Found inverse relationship between predictive performance and measured interaction: architectures with superior discrimination (C-index 0.64→0.82) exhibit equivalent or lower cross-modal interaction (4.8%→3.0%). Variance decomposition shows stable additive contributions (WSI≈40%, RNA≈55%, Interaction≈4%), indicating performance gains arise from complementary signal aggregation rather than learned synergy.
Conclusion: Provides a practical model auditing tool for comparing fusion strategies, reframes the role of architectural complexity in multimodal fusion, and has implications for privacy-preserving federated deployment.
Abstract: Multimodal deep learning for cancer prognosis is commonly assumed to benefit from synergistic cross-modal interactions, yet this assumption has not been directly tested in survival prediction settings. This work adapts InterSHAP, a Shapley interaction index-based metric, from classification to Cox proportional hazards models and applies it to quantify cross-modal interactions in glioma survival prediction. Using TCGA-GBM and TCGA-LGG data (n=575), we evaluate four fusion architectures combining whole-slide image (WSI) and RNA-seq features. Our central finding is an inverse relationship between predictive performance and measured interaction: architectures achieving superior discrimination (C-index 0.64→0.82) exhibit equivalent or lower cross-modal interaction (4.8%→3.0%). Variance decomposition reveals stable additive contributions across all architectures (WSI≈40%, RNA≈55%, Interaction≈4%), indicating that performance gains arise from complementary signal aggregation rather than learned synergy. These findings provide a practical model auditing tool for comparing fusion strategies, reframe the role of architectural complexity in multimodal fusion, and have implications for privacy-preserving federated deployment.
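A toy illustration of the quantity InterSHAP estimates: the share of a two-modality model's behavior that is not explained by adding each modality's solo contribution. The coalition values below are made up for illustration; the paper computes Shapley interaction indices over model components rather than this four-point decomposition:

```python
def interaction_share(v_empty, v_wsi, v_rna, v_both):
    """Fraction of total behavior attributable to cross-modal interaction."""
    interaction = v_both - v_wsi - v_rna + v_empty   # deviation from additivity
    total = abs(v_wsi - v_empty) + abs(v_rna - v_empty) + abs(interaction)
    return abs(interaction) / total

# Purely additive fusion: joint value equals the sum of solo gains.
additive = interaction_share(0.0, 0.40, 0.55, 0.95)
# Synergistic fusion: the two modalities together exceed the sum of parts.
synergistic = interaction_share(0.0, 0.40, 0.55, 1.15)
```

The paper's finding corresponds to the first case: strong fusion models behave almost additively, with the interaction share staying near a few percent.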
[530] Aligning Validation with Deployment: Target-Weighted Cross-Validation for Spatial Prediction
Alexander Brenning, Thomas Suesse
Main category: cs.LG
TL;DR: TWCV is a cross-validation method that accounts for distribution mismatch between validation and deployment tasks in spatial prediction, addressing covariate and task-difficulty shift through calibration weighting and buffered resampling.
Details
Motivation: Standard cross-validation assumes validation and deployment tasks come from the same distribution, which is often violated in spatial prediction and structured data settings, leading to biased risk estimates when there are covariate shifts or task-difficulty differences.
Method: Proposes Target-Weighted CV (TWCV) that uses calibration weighting to assign weights to validation losses so the weighted empirical distribution matches the target deployment distribution. Combines with spatially buffered resampling to ensure adequate coverage of deployment distribution support.
Result: Simulation studies show conventional and spatial estimators exhibit substantial bias depending on sampling, while buffered TWCV remains approximately unbiased across scenarios. Environmental pollution mapping case study confirms TWCV better reflects prediction tasks over target domain.
Conclusion: Task distribution mismatch is a primary source of CV bias in spatial prediction, and calibration weighting combined with suitable validation task generation provides viable approach for estimating predictive risk under dataset shift.
Abstract: Cross-validation (CV) is commonly used to estimate predictive risk when independent test data are unavailable. Its validity depends on the assumption that validation tasks are sampled from the same distribution as prediction tasks encountered during deployment. In spatial prediction and other settings with structured data, this assumption is frequently violated, leading to biased estimates of deployment risk. We propose Target-Weighted CV (TWCV), an estimator of deployment risk that accounts for discrepancies between validation and deployment task distributions, thus accounting for (1) covariate shift and (2) task-difficulty shift. We characterize prediction tasks by descriptors such as covariates and spatial configuration. TWCV assigns weights to validation losses such that the weighted empirical distribution of validation tasks matches the corresponding distribution over a target domain. The weights are obtained via calibration weighting, yielding an importance-weighted estimator that targets deployment risk. Since TWCV requires adequate coverage of the deployment distribution’s support, we combine it with spatially buffered resampling that diversifies the task difficulty distribution. In a simulation study, conventional as well as spatial estimators exhibit substantial bias depending on sampling, whereas buffered TWCV remains approximately unbiased across scenarios. A case study in environmental pollution mapping further confirms that discrepancies between validation and deployment task distributions can affect performance assessment, and that buffered TWCV better reflects the prediction task over the target domain. These results establish task distribution mismatch as a primary source of CV bias in spatial prediction and show that calibration weighting combined with a suitable validation task generator provides a viable approach to estimating predictive risk under dataset shift.
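A minimal sketch of the importance-weighting idea behind TWCV: reweight per-task validation losses so the weighted task distribution matches the deployment distribution. The weights here are supplied directly for illustration; the paper obtains them via calibration weighting on task descriptors:

```python
import numpy as np

def weighted_cv_risk(losses, weights):
    """Importance-weighted estimate of deployment risk."""
    losses, weights = np.asarray(losses), np.asarray(weights)
    return float(np.sum(weights * losses) / np.sum(weights))

# Validation tasks oversample an "easy" region; deployment is mostly "hard".
losses = np.array([0.1, 0.1, 0.1, 0.9])                 # three easy, one hard
uniform_risk = weighted_cv_risk(losses, np.ones(4))     # plain CV estimate
target_risk = weighted_cv_risk(losses, np.array([1, 1, 1, 9]))  # TWCV-style
```

Plain CV here reports a risk of 0.3, while the target-weighted estimate of 0.7 reflects the hard-task-dominated deployment distribution, which is the bias the paper documents for spatial prediction.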
[531] Refined Detection for Gumbel Watermarking
Tor Lattimore
Main category: cs.LG
TL;DR: Proposes a simple detection mechanism for Gumbel watermarking scheme that is proven near-optimal among model-agnostic watermarking schemes under i.i.d. next-token distribution assumption.
Details
Motivation: To develop an improved detection mechanism for the Gumbel watermarking scheme that achieves near-optimal performance in detecting watermarked text, particularly for model-agnostic watermarking approaches.
Method: Proposes a new detection mechanism for the Gumbel watermarking scheme, with theoretical analysis proving its near-optimality among all model-agnostic watermarking schemes under the assumption of i.i.d. next-token distributions.
Result: The proposed detection mechanism is proven to be near-optimal in a problem-dependent sense, providing theoretical guarantees for its performance in detecting watermarked text.
Conclusion: The paper presents a theoretically sound detection mechanism for Gumbel watermarking that achieves near-optimal performance, advancing the field of text watermarking for large language models.
Abstract: We propose a simple detection mechanism for the Gumbel watermarking scheme proposed by Aaronson (2022). The new mechanism is proven to be near-optimal in a problem-dependent sense among all model-agnostic watermarking schemes under the assumption that the next-token distribution is sampled i.i.d.
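For background, a sketch of the Gumbel watermarking scheme of Aaronson (2022) that the paper refines, not the paper's own detector: each step draws keyed pseudorandom uniforms u_v per vocabulary token and emits argmax_v u_v^(1/p_v); a detector that knows the key checks that the emitted tokens' u values are unusually large, e.g. via the classical score sum(-log(1 - u_token)). All constants below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)          # stands in for the shared secret key
vocab, steps = 50, 200
chosen_us = []
for _ in range(steps):
    p = rng.dirichlet(np.ones(vocab))    # toy next-token distribution
    u = rng.random(vocab)                # keyed pseudorandom uniforms
    token = int(np.argmax(np.log(u) / p))  # argmax of u^(1/p), in log space
    chosen_us.append(u[token])

# For unwatermarked text each u_token would be Uniform(0,1), so -log(1-u) is
# Exponential(1) and the score's mean equals `steps`; watermarked text,
# which favors tokens with large u, scores well above it.
score = float(np.sum(-np.log(1.0 - np.array(chosen_us))))
```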
[532] Tucker Attention: A generalization of approximate attention mechanisms
Timon Klein, Jonas Kusch, Sebastian Sager, Stefan Schnake, Steffen Schotthöfer
Main category: cs.LG
TL;DR: Tucker Attention: A generalized low-rank factorization method for self-attention that reduces parameters while maintaining performance, encompassing MHA, GQA, and MLA as special cases.
Details
Motivation: To understand and improve upon existing memory-efficient attention methods (GQA, MLA) by providing a unified low-rank approximation framework that reveals what these methods actually approximate and enables more parameter-efficient designs.
Method: Proposes Tucker Attention, a generalized factorization strategy for self-attention weights using Tucker decomposition principles. It provides a unified framework that encompasses MHA, GQA, and MLA as special cases, enabling parameter reduction while maintaining compatibility with existing optimizations like flash-attention and RoPE.
Result: Tucker Attention requires an order of magnitude fewer parameters for comparable validation metrics compared to GQA and MLA, as demonstrated in LLM and ViT test cases. The method also provides insights into the actual ranks achieved by existing attention methods and enables simplifications for MLA.
Conclusion: Tucker Attention offers a principled, parameter-efficient alternative to existing attention mechanisms with strong empirical performance and theoretical insights into low-rank approximations in self-attention.
Abstract: The pursuit of reducing the memory footprint of the self-attention mechanism in multi-headed self attention (MHA) spawned a rich portfolio of methods, e.g., group-query attention (GQA) and multi-head latent attention (MLA). The methods leverage specialized low-rank factorizations across embedding dimensions or attention heads. From the point of view of classical low-rank approximation, these methods are unconventional and raise questions of which objects they really approximate and how to interpret the low-rank behavior of the resulting representations. To answer these questions, this work proposes a generalized view on the weight objects in the self-attention layer and a factorization strategy, which allows us to construct a parameter efficient scheme, called Tucker Attention. Tucker Attention requires an order of magnitude fewer parameters for comparable validation metrics, compared to GQA and MLA, as evaluated in LLM and ViT test cases. Additionally, Tucker Attention encompasses GQA, MLA, MHA as special cases and is fully compatible with flash-attention and rotary position embeddings (RoPE). This generalization strategy yields insights of the actual ranks achieved by MHA, GQA, and MLA, and further enables simplifications for MLA.
[533] Aligned, Orthogonal or In-conflict: When can we safely optimize Chain-of-Thought?
Max Kaufmann, David Lindner, Roland S. Zimmermann, Rohin Shah
Main category: cs.LG
TL;DR: A framework for predicting when Chain-of-Thought (CoT) monitorability is affected by training, showing that conflicting reward terms reduce CoT transparency while aligned terms improve it.
Details
Motivation: Chain-of-Thought monitoring is promising for AI oversight, but training can affect CoT monitorability as models may learn to hide reasoning. Need to understand when and why this occurs.
Method: Model LLM post-training as RL with reward decomposing into output-dependent and CoT-dependent terms. Classify these terms as aligned, orthogonal, or in-conflict. Train LLMs in these environments and evaluate CoT monitorability changes.
Result: Training with in-conflict reward terms reduces CoT monitorability, and optimizing such conflicting terms is difficult. Orthogonal terms don’t affect monitorability, while aligned terms improve it.
Conclusion: The proposed framework successfully predicts how different reward structures affect CoT monitorability during training, with important implications for designing oversight mechanisms.
Abstract: Chain-of-Thought (CoT) monitoring, in which automated systems monitor the CoT of an LLM, is a promising approach for effectively overseeing AI systems. However, the extent to which a model’s CoT helps us oversee the model - the monitorability of the CoT - can be affected by training, for instance by the model learning to hide important features of its reasoning. We propose and empirically validate a conceptual framework for predicting when and why this occurs. We model LLM post-training as an RL environment where the reward decomposes into two terms: one term depending on final outputs and another term depending on the CoT. Our framework allows us to classify these two terms as “aligned”, “orthogonal”, or “in-conflict” before training. We predict that training with in-conflict terms will reduce monitorability, orthogonal terms will not affect it, and aligned terms will improve it. To validate our framework, we use it to classify a set of RL environments, train LLMs within those environments, and evaluate how training affects CoT monitorability. We find that (1) training with “in-conflict” reward terms reduces CoT monitorability and (2) optimizing in-conflict reward terms is difficult.
[534] Unbounded Density Ratio Estimation and Its Application to Covariate Shift Adaptation
Ren-Rui Liu, Jun Fan, Lei Shi, Zheng-Chu Guo
Main category: cs.LG
[535] DeepCoT: Deep Continual Transformers for Real-Time Inference on Data Streams
Ginés Carreto Picón, Peng Yuan Zhou, Qi Zhang, Alexandros Iosifidis
Main category: cs.LG
TL;DR: DeepCoT: A redundancy-free encoder attention mechanism for deep Transformer models that enables efficient stream data inference with linear computational cost.
Details
Motivation: There's a growing need for high-performance, low-latency inference on resource-constrained devices, especially for stream data where sliding window approaches cause redundant computations. Existing Continual Transformers only work well with shallow models, limiting their applicability.
Method: Proposes Deep Continual Transformer (DeepCoT), a redundancy-free encoder attention mechanism that can be applied to existing deep encoder architectures with minimal changes. It eliminates computational redundancy in stream data processing.
Result: DeepCoTs maintain comparable performance to non-continual baselines while offering linear computational cost across all Transformer layers. Achieves up to two orders of magnitude reduction in running time compared to previous efficient models, demonstrated on audio, video, and text streams.
Conclusion: DeepCoT provides an effective solution for efficient stream data inference with deep Transformer models, addressing the computational redundancy problem while maintaining performance across multimodal domains.
Abstract: Transformer-based models have dramatically increased their size and parameter count to tackle increasingly complex tasks. At the same time, there is a growing demand for high-performance, low-latency inference on devices with limited resources. In particular, stream data inference is typically performed over a sliding temporal window, leading to highly redundant computations. While the recent Continual Transformers started addressing this issue, they can be effectively used only in shallow models, which limits their scope and generalization power. In this paper, we propose the Deep Continual Transformer (DeepCoT), a redundancy-free encoder attention mechanism that can be applied over existing deep encoder architectures with minimal changes. In our experiments over audio, video, and text streams, we show that DeepCoTs retain performance comparable to their non-continual baselines while offering a linear computational cost for all Transformer layers, reducing running time by up to two orders of magnitude compared to previous efficient models.
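The redundancy the abstract describes comes from recomputing attention over an almost-unchanged window at every step. A minimal single-head sketch of the underlying caching idea (illustrative names and dimensions, not DeepCoT's actual mechanism): project only the newest token, cache its key/value, and evict the oldest entry.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class StreamingAttention:
    """Single-head attention over a sliding temporal window, updated one
    token at a time. Key/value projections for past tokens are cached,
    so each new token costs O(window) work instead of re-running
    attention over the entire window from scratch."""

    def __init__(self, d_model, window, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(d_model)
        self.Wq = rng.standard_normal((d_model, d_model)) * scale
        self.Wk = rng.standard_normal((d_model, d_model)) * scale
        self.Wv = rng.standard_normal((d_model, d_model)) * scale
        self.window = window
        self.keys, self.values = [], []

    def step(self, x):
        """Consume one d_model-dim token, return its attention output."""
        q = x @ self.Wq
        self.keys.append(x @ self.Wk)     # project only the newest token
        self.values.append(x @ self.Wv)
        if len(self.keys) > self.window:  # evict the oldest cached token
            self.keys.pop(0)
            self.values.pop(0)
        K = np.stack(self.keys)
        V = np.stack(self.values)
        weights = softmax(K @ q / np.sqrt(x.shape[0]))
        return weights @ V
```

For a single causal query this cache is exact; the hard part the paper targets is carrying such redundancy-free updates through deep encoder stacks rather than only shallow ones.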
[536] HyperKKL: Learning KKL Observers for Non-Autonomous Nonlinear Systems via Hypernetwork-Based Input Conditioning
Yahia Salaheldin Shaaban, Abdelrahman Sayed Sayed, M. Umar B. Niazi, Karl Henrik Johansson
Main category: cs.LG
Summary unavailable: the arXiv API request for 2603.29744 returned HTTP 429 (rate limited).
[537] Early Exiting Predictive Coding Neural Networks for Edge AI
Alaa Zniber, Mounir Ghogho, Ouassim Karrakchou, Mehdi Zakroum
Main category: cs.LG
TL;DR: A shallow bidirectional predictive coding network with early exiting for efficient edge AI on IoT devices, achieving comparable performance to deep networks with fewer parameters and lower computational complexity.
Details
Motivation: IoT devices have limited computational resources, making conventional deep learning models impractical, while privacy concerns and real-time processing needs favor local computation over cloud-based solutions. The brain's energy efficiency inspires more efficient architectures for edge AI.
Method: Proposes a shallow bidirectional predictive coding network with an early-exiting mechanism that dynamically halts computation once a performance threshold is met, reducing memory footprint and computational overhead.
Result: Validated on CIFAR-10 dataset, achieving performance comparable to deep networks with significantly fewer parameters and lower computational complexity.
Conclusion: Demonstrates the potential of biologically inspired architectures for efficient edge AI, addressing computational constraints of IoT devices while maintaining accuracy.
Abstract: The Internet of Things is transforming various fields, with sensors increasingly embedded in wearables, smart buildings, and connected equipment. While deep learning enables valuable insights from IoT data, conventional models are too computationally demanding for resource-limited edge devices. Moreover, privacy concerns and real-time processing needs make local computation a necessity over cloud-based solutions. Inspired by the brain’s energy efficiency, we propose a shallow bidirectional predictive coding network with early exiting, dynamically halting computations once a performance threshold is met. This reduces the memory footprint and computational overhead while maintaining high accuracy. We validate our approach using the CIFAR-10 dataset. Our model achieves performance comparable to deep networks with significantly fewer parameters and lower computational complexity, demonstrating the potential of biologically inspired architectures for efficient edge AI.
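The early-exit mechanism is easy to state generically: attach a small classifier head after each block and stop as soon as its confidence clears a threshold. A toy sketch, with dense layers standing in for the paper's predictive-coding blocks (names and transforms are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def early_exit_forward(x, layers, heads, threshold=0.9):
    """Run `layers` in sequence; after each one, a lightweight classifier
    head produces class probabilities. Return as soon as the top-class
    confidence clears `threshold`, skipping all remaining layers."""
    probs, depth = None, 0
    for depth, (W, head) in enumerate(zip(layers, heads), start=1):
        x = np.tanh(W @ x)               # toy layer transform
        probs = softmax(head @ x)        # intermediate prediction
        if probs.max() >= threshold:     # confident enough: exit here
            break
    return probs, depth
```

Lowering the threshold trades accuracy for compute: easy inputs exit after one or two blocks, and only ambiguous ones pay for the full depth.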
[538] GenOL: Generating Diverse Examples for Name-only Online Learning
Minhyuk Seo, Seongwon Cho, Minjae Lee, Diganta Misra, Hyeonbeom Choi, Seon Joo Kim, Jonghyun Choi
Main category: cs.LG
TL;DR: GenOL uses generative models for name-only continual learning, enhancing diversity through diverse prompt generation and ensemble strategies to overcome limitations of human and web supervision.
Details
Motivation: In continual learning with data distribution shifts, real-time manual annotation is impractical. The 'name-only' setup requires only concept names, but web-scraped images suffer from imbalance, noise, and copyright issues. Generative models offer an alternative, but naive application yields limited diversity.
Method: GenOL enhances diversity in two ways: (1) intra-diversity, through diverse prompt generation for text-to-image models, and (2) inter-diversity, through an ensemble strategy that selects minimally overlapping samples from multiple generative models.
Result: GenOL outperforms prior methods and even models trained with fully supervised data by large margins in image recognition and multi-modal visual reasoning tasks.
Conclusion: Generative models can effectively address name-only continual learning when enhanced with diversity-promoting techniques, offering a practical solution to annotation challenges in online learning scenarios.
Abstract: Online learning methods often rely on supervised data. However, under data distribution shifts, such as in continual learning (CL), where continuously arriving online data streams incorporate new concepts (e.g., classes), real-time manual annotation is impractical due to its cost and latency, which hinder real-time adaptation. To alleviate this, the 'name-only' setup has been proposed, requiring only the names of concepts, not supervised samples. A recent approach tackles this setup by supplementing data with web-scraped images, but such data often suffers from issues of data imbalance, noise, and copyright. To overcome the limitations of both human supervision and webly supervision, we propose GenOL, using generative models for name-only training. But naive application of generative models results in limited diversity of generated data. Here, we enhance (i) intra-diversity, the diversity of images generated by a single model, by proposing a diverse prompt generation method that generates diverse text prompts for text-to-image models, and (ii) inter-diversity, the diversity of images generated by multiple generative models, by introducing an ensemble strategy that selects minimally overlapping samples. We empirically validate that GenOL outperforms prior art, and even a model trained with fully supervised data, by large margins in various tasks, including image recognition and multi-modal visual reasoning.
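The inter-diversity step, selecting minimally overlapping samples from multiple generators, can be approximated with a greedy farthest-point rule over image embeddings. A sketch under that assumption (the paper's actual selection criterion may differ):

```python
import numpy as np

def select_min_overlap(embeddings, k, seed_index=0):
    """Greedy farthest-point selection over embeddings: repeatedly add
    the candidate whose *highest* cosine similarity to the current
    selection is lowest, approximating a minimally overlapping subset."""
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    chosen = [seed_index]
    while len(chosen) < k:
        sims = X @ X[chosen].T            # cosine sims to selected items
        overlap = sims.max(axis=1)        # overlap = closest selected item
        overlap[chosen] = np.inf          # never re-pick a selected item
        chosen.append(int(np.argmin(overlap)))
    return chosen
```

Given embeddings of samples pooled from several text-to-image models, the returned indices form a subset whose members are mutually dissimilar, which is the effect the ensemble strategy is after.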
[539] Benchmarking Physics-Informed Time-Series Models for Operational Global Station Weather Forecasting
Tao Han, Zhibin Wen, Zhenghao Chen, Dazhao Du, Song Guo, Lei Bai
Main category: cs.LG
TL;DR: PhysicsFormer: A physics-informed Transformer model for global weather forecasting using a new large-scale WEATHER-5K dataset, bridging the gap between academic time-series models and operational numerical weather prediction systems.
Details
Motivation: Existing time-series forecasting models for weather prediction are limited by small, sparse datasets and fail to match operational numerical weather prediction systems in capturing complex weather dynamics and extreme events.
Method: Proposes PhysicsFormer, which combines a dynamic core with Transformer residual layers; physical consistency is enforced via pressure-wind alignment and energy-aware smoothness losses while capturing complex temporal patterns.
Result: Benchmarks PhysicsFormer and other TSF models against operational systems across multiple weather variables, extreme event prediction, and model complexity, providing comprehensive assessment of the gap between academic and operational forecasting.
Conclusion: PhysicsFormer with WEATHER-5K dataset advances weather forecasting by incorporating physics constraints into deep learning models, helping bridge the performance gap between academic time-series models and operational numerical weather prediction systems.
Abstract: The development of Time-Series Forecasting (TSF) models is often constrained by the lack of comprehensive datasets, especially in Global Station Weather Forecasting (GSWF), where existing datasets are small, temporally short, and spatially sparse. To address this, we introduce WEATHER-5K, a large-scale observational weather dataset that better reflects real-world conditions, supporting improved model training and evaluation. While recent TSF methods perform well on benchmarks, they lag behind operational Numerical Weather Prediction systems in capturing complex weather dynamics and extreme events. We propose PhysicsFormer, a physics-informed forecasting model combining a dynamic core with a Transformer residual to predict future weather states. Physical consistency is enforced via pressure-wind alignment and energy-aware smoothness losses, ensuring plausible dynamics while capturing complex temporal patterns. We benchmark PhysicsFormer and other TSF models against operational systems across several weather variables, extreme event prediction, and model complexity, providing a comprehensive assessment of the gap between academic TSF models and operational forecasting. The dataset and benchmark implementation are available at: https://github.com/taohan10200/WEATHER-5K.
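A physics-informed objective of this shape adds soft constraint penalties to a data-fidelity term. The sketch below is a generic stand-in: the `align` and `smooth` terms are placeholder forms, not the paper's actual pressure-wind alignment or energy-aware smoothness losses.

```python
import numpy as np

def physics_informed_loss(pred, target, pressure, wind,
                          lam_align=0.1, lam_smooth=0.01):
    """Data-fidelity MSE plus two soft physics penalties, weighted by
    lam_align and lam_smooth. Penalty forms are illustrative only."""
    data = np.mean((pred - target) ** 2)
    # Alignment proxy: wind should oppose the local pressure gradient.
    dp = np.gradient(pressure)
    align = np.mean((wind + dp) ** 2)
    # Smoothness proxy: penalize large step-to-step jumps in the forecast.
    smooth = np.mean(np.diff(pred) ** 2)
    return data + lam_align * align + lam_smooth * smooth
```

Because the penalties are soft, the model can still fit the data when the proxies are imperfect; the weights control how strongly physically implausible forecasts are discouraged.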
[540] A General Control-Theoretic Approach for Reinforcement Learning: Theory and Algorithms
Weiqin Chen, Mark S. Squillante, Chai Wah Wu, Santiago Paternain
Main category: cs.LG
TL;DR: Control-theoretic reinforcement learning approach for direct optimal policy learning with theoretical guarantees and empirical improvements over SOTA methods.
Details
Motivation: To develop a more efficient reinforcement learning framework by integrating control-theoretic principles, learning the optimal policy directly rather than through value-function approximation.
Method: A control-theoretic reinforcement learning approach featuring an analog of the Bellman operator and Q-learning, a control-policy-variable gradient theorem, and a specific gradient ascent algorithm based on that theorem within a control-theoretic framework.
Result: Theoretical properties established (convergence, optimality), and empirical evaluation shows significant improvements in solution quality, sample complexity, and running time over state-of-the-art methods on classical RL tasks.
Conclusion: Control-theoretic approach provides effective framework for direct policy learning with strong theoretical foundations and practical performance advantages.
Abstract: We devise a control-theoretic reinforcement learning approach to support direct learning of the optimal policy. We establish various theoretical properties of our approach, such as convergence and optimality of our analog of the Bellman operator and Q-learning, a new control-policy-variable gradient theorem, and a specific gradient ascent algorithm based on this theorem within the context of a specific control-theoretic framework. We empirically evaluate the performance of our control theoretic approach on several classical reinforcement learning tasks, demonstrating significant improvements in solution quality, sample complexity, and running time of our approach over state-of-the-art methods.
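The digest does not reproduce the paper's control-policy-variable gradient theorem, but "direct learning of the optimal policy" by gradient ascent can be illustrated with plain REINFORCE on a Gaussian bandit. This is a generic baseline for comparison, not the paper's control-theoretic algorithm:

```python
import numpy as np

def policy_gradient_bandit(means, steps=5000, lr=0.1, seed=0):
    """Plain REINFORCE with a running baseline: ascend the policy
    parameters (softmax logits) directly on sampled rewards, with no
    value-function approximation."""
    rng = np.random.default_rng(seed)
    logits = np.zeros(len(means))
    baseline = 0.0
    for _ in range(steps):
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        a = rng.choice(len(means), p=probs)
        r = means[a] + rng.normal(0.0, 0.1)   # noisy reward for arm a
        baseline += 0.01 * (r - baseline)     # running reward baseline
        grad = -probs
        grad[a] += 1.0                        # d log pi(a) / d logits
        logits += lr * (r - baseline) * grad  # gradient ascent step
    return logits
```

After training, the logits concentrate on the best arm; the paper's contribution is a control-theoretic analog of this direct ascent with convergence and optimality guarantees.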
[541] An Information-Theoretic Approach to Understanding Transformers’ In-Context Learning of Variable-Order Markov Chains
Ruida Zhou, Chao Tian, Suhas Diggavi
Main category: cs.LG
TL;DR: Transformers struggle with in-context learning of variable-length Markov chains (VOMCs) compared to fixed-order ones, requiring deeper architectures (2+ layers) to implement optimal Bayesian algorithms like context-tree weighting.
Details
Motivation: Understanding transformers' ability to learn complex sequential patterns in context, particularly variable-length Markov chains, which are more challenging than fixed-order Markov chains due to the additional structural learning component.
Method: Empirical study of transformer architectures on VOMC learning tasks, comparison with fixed-order Markov chains, analysis of depth requirements, and explicit transformer constructions implementing the context-tree weighting algorithm.
Result: Single-layer transformers fail at VOMC learning, while 2+ layer transformers succeed with modest improvements from additional layers. Attention-only networks are insufficient. Explicit constructions show D+2 layers can exactly implement CTW, and 2-layer transformers can approximate it.
Conclusion: Transformers require depth (multiple layers) to learn variable-length Markov chains in-context, with explicit architectural designs needed to implement optimal Bayesian algorithms like context-tree weighting.
Abstract: We study transformers’ in-context learning of variable-length Markov chains (VOMCs), focusing on the finite-sample accuracy as the number of in-context examples increases. Compared to fixed-order Markov chains (FOMCs), learning VOMCs is substantially more challenging due to the additional structural learning component. The problem is naturally suited to a Bayesian formulation, where the context-tree weighting (CTW) algorithm, originally developed in the information theory community for universal data compression, provides an optimal solution. Empirically, we find that single-layer transformers fail to learn VOMCs in context, whereas transformers with two or more layers can succeed, with additional layers yielding modest but noticeable improvements. In contrast to prior results on FOMCs, attention-only networks appear insufficient for VOMCs. To explain these findings, we provide explicit transformer constructions: one with $D+2$ layers that can exactly implement CTW for VOMCs of maximum order $D$, and a simplified two-layer construction that uses partial information for approximate blending, shedding light on why two-layer transformers can perform well.
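The context-tree weighting algorithm that the paper uses as its Bayesian-optimal reference admits a compact recursive implementation for binary sequences. The sketch below is textbook CTW (conditioning on the first `depth` symbols), independent of the transformer constructions:

```python
from collections import defaultdict

def kt_block_prob(zeros, ones):
    """Krichevsky-Trofimov block probability; depends only on counts."""
    p, n = 1.0, 0
    for i in range(zeros):
        p *= (i + 0.5) / (n + 1)
        n += 1
    for i in range(ones):
        p *= (i + 0.5) / (n + 1)
        n += 1
    return p

def ctw_prob(seq, depth):
    """CTW probability of a binary sequence: at each tree node, mix the
    KT estimate with the product of the children's probabilities, each
    with weight 1/2; leaves at `depth` use the KT estimate alone."""
    counts = defaultdict(lambda: [0, 0])
    for t in range(depth, len(seq)):
        ctx = tuple(seq[t - depth:t])        # the `depth` preceding symbols
        counts[ctx][seq[t]] += 1

    def weighted(node):                      # node = context suffix
        z = sum(c[0] for k, c in counts.items()
                if k[len(k) - len(node):] == node)
        o = sum(c[1] for k, c in counts.items()
                if k[len(k) - len(node):] == node)
        pe = kt_block_prob(z, o)
        if len(node) == depth:               # leaf: KT estimate only
            return pe
        return 0.5 * pe + 0.5 * weighted((0,) + node) * weighted((1,) + node)

    return weighted(())
```

On a strictly alternating sequence, the depth-1 model assigns far higher probability than the memoryless depth-0 model, which is exactly the structural learning that makes VOMCs harder for transformers than fixed-order chains.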
[542] Value Gradient Sampler: Learning Invariant Value Functions for Equivariant Diffusion Sampling
Himchan Hwang, Hyeokju Jeong, Dong Kyu Shin, Che-Sang Park, Sehee Kweon, Sangwoong Yoon, Frank Chongwoo Park
Main category: cs.LG
Summary unavailable: the arXiv API request for 2502.13280 returned HTTP 429 (rate limited).
[543] SO(3)-Equivariant Neural Networks for Learning from Scalar and Vector Fields on Spheres
Francesco Ballerin, Nello Blaser, Erlend Grong
Main category: cs.LG
Summary unavailable: the arXiv API request for 2503.09456 returned HTTP 429 (rate limited).
[544] Revisiting Model Inversion Evaluation: From Misleading Standards to Reliable Privacy Assessment
Sy-Tuyen Ho, Koh Jun Hao, Ngoc-Bao Nguyen, Alexander Binder, Ngai-Man Cheung
Main category: cs.LG
Summary unavailable: the arXiv API request for 2505.03519 returned HTTP 429 (rate limited).
[545] Neural Graduated Assignment for Maximum Common Edge Subgraphs
Chaolong Ying, Yingqi Ruan, Xuemin Chen, Yaomin Wang, Tianshu Yu
Main category: cs.LG
Summary unavailable: the arXiv API request for 2505.12325 returned HTTP 429 (rate limited).
[546] PreMoE: Proactive Inference for Efficient Mixture-of-Experts
Zehua Pei, Ying Zhang, Hui-Ling Zhen, Tao Yuan, Xianzhi Yu, Zhenhua Dong, Sinno Jialin Pan, Mingxuan Yuan, Bei Yu
Main category: cs.LG
Summary unavailable: the arXiv API request for 2505.17639 returned HTTP 429 (rate limited).
[547] When fractional quasi p-norms concentrate
Ivan Y. Tyukin, Bogdan Grechuk, Evgeny M. Mirkes, Alexander N. Gorban
Main category: cs.LG
Summary unavailable: the arXiv API request for 2505.19635 returned HTTP 429 (rate limited).
[548] Epistemic Errors of Imperfect Multitask Learners When Distributions Shift
Sabina J. Sloman, Michele Caprio, Samuel Kaski
Main category: cs.LG
Summary unavailable: the arXiv API request for 2505.23496 returned HTTP 429 (rate limited).
[549] Deep Polynomial Chaos Expansion
Johannes Exenberger, Sascha Ranftl, Robert Peharz
Main category: cs.LG
Summary unavailable: the arXiv API request for 2507.21273 returned HTTP 429 (rate limited).
[550] Exploring the Relationship between Brain Hemisphere States and Frequency Bands through Classical Machine Learning and Deep Learning Optimization Techniques with Neurofeedback
Robiul Islam, Dmitry I. Ignatov, Karl Kaberg, Roman Nabatchikov
Main category: cs.LG
Summary unavailable: the arXiv API request for 2509.14078 returned HTTP 429 (rate limited).
[551] On the Convergence of Muon and Beyond
Da Chang, Yongxiang Liu, Ganzhao Yuan
Main category: cs.LG
Summary unavailable: the arXiv API request for 2509.15816 returned HTTP 429 (rate limited).
[552] Spatio-temporal, multi-field deep learning of shock propagation in meso-structured media
M. Giselle Fernández-Godino, Meir H. Shachar, Kevin Korner, Jonathan L. Belof, Mukul Kumar, Jonathan Lind, William J. Schill
Main category: cs.LG
Summary unavailable: the arXiv API request for 2509.16139 returned HTTP 429 (rate limited).
[553] Towards a Certificate of Trust: Task-Aware OOD Detection for Scientific AI
Bogdan Raonić, Siddhartha Mishra, Samuel Lanthaler
Main category: cs.LG
Summary unavailable: the arXiv API request for 2509.25080 returned HTTP 429 (rate limited).
[554] From Moments to Models: Graphon-Mixture Learning for Mixup and Contrastive Learning
Ali Azizpour, Reza Ramezanpour, Santiago Segarra
Main category: cs.LG
Summary unavailable: the arXiv API request for 2510.03690 returned HTTP 429 (rate limited).
[555] When Does Global Attention Help? A Unified Empirical Study on Atomistic Graph Learning
Arindam Chowdhury, Massimiliano Lupo Pasini
Main category: cs.LG
Summary unavailable: the arXiv API request for 2510.05583 returned HTTP 429 (rate limited).
[556] The Effect of Attention Head Count on Transformer Approximation
Penghao Yu, Haotian Jiang, Zeyu Bao, Ruoxi Yu, Qianxiao Li
Main category: cs.LG
Summary unavailable: the arXiv API request for 2510.06662 returned HTTP 429 (rate limited).
[557] Continuous SUN (Stable, Unique, and Novel) Metric for Generative Modeling of Inorganic Crystals
Masahiro Negishi, Hyunsoo Park, Kinga O. Mastej, Aron Walsh
Main category: cs.LG
Summary unavailable: the arXiv API request for 2510.12405 returned HTTP 429 (rate limited).
[558] Balancing Multi-modal Sensor Learning via Multi-objective Optimization
Heshan Fernando, Quan Xiao, Parikshit Ram, Yi Zhou, Horst Samulowitz, Nathalie Baracaldo, Tianyi Chen
Main category: cs.LG
Summary unavailable: the arXiv API request for 2511.06686 returned HTTP 429 (rate limited).
[559] Deep Unfolding: Recent Developments, Theory, and Design Guidelines
Nir Shlezinger, Santiago Segarra, Yi Zhang, Dvir Avrahami, Zohar Davidov, Tirza Routtenberg, Yonina C. Eldar
Main category: cs.LG
Summary unavailable: the arXiv API request for 2512.03768 returned HTTP 429 (rate limited).
[560] Electric Vehicle Charging Load Forecasting: An Experimental Comparison of Machine Learning Methods
Iason Kyriakopoulos, Yannis Theodoridis
Main category: cs.LG
Summary unavailable: the arXiv API request for 2512.17257 returned HTTP 429 (rate limited).
[561] Precision autotuning for linear solvers via contextual bandit-based RL
Erin Carson, Xinye Chen
Main category: cs.LG
Summary unavailable: the arXiv API request for 2601.00728 returned HTTP 429 (rate limited).
[562] HOSL: Hybrid-Order Split Learning for Memory-Constrained Edge Training
Aakriti Lnu, Zhe Li, Dandan Liang, Chao Huang, Rui Li, Haibo Yang
Main category: cs.LG
Summary unavailable: the arXiv API request for 2601.10940 returned HTTP 429 (rate limited).
[563] Sparsity-Aware Unlearning for Large Language Models
Yuze Wang, Yujia Tong, Xuan Liu, Junhao Dong
Main category: cs.LG
Summary unavailable: the arXiv API request for 2602.00577 returned HTTP 429 (rate limited).
[564] ECHO-2: A Large-Scale Distributed Rollout Framework for Cost-Efficient Reinforcement Learning
Jie Xiao, Meng Chen, Qingnan Ren, Jingwei Song, Jiaqi Huang, Yangshen Deng, Chris Tong, Wanyi Chen, Suli Wang, Ziqian Bi, Shuo Lu, Yiqun Duan, Xu Wang, Rymon Yu, Ween Yang, Lynn Ai, Eric Yang, Bill Shi
Main category: cs.LG
Summary unavailable: the arXiv API request for 2602.02192 returned HTTP 429 (rate limited).
[565] Joint Embedding Variational Bayes
Amin Oji, Paul Fieguth
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2602.05639 returned HTTP 429 (rate limited).
[566] A Fast and Generalizable Fourier Neural Operator-Based Surrogate for Melt-Pool Prediction in Laser Processing
Alix Benoit, Toni Ivas, Mateusz Papierz, Asel Sagingalieva, Alexey Melnikov, Elia Iseli
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2602.06241 returned HTTP 429 (rate limited).
[567] Variational inference via radial transport
Luca Ghafourpour, Sinho Chewi, Alessio Figalli, Aram-Alexandre Pooladian
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2602.17525 returned HTTP 429 (rate limited).
[568] Privacy-Preserving Machine Learning for IoT: A Cross-Paradigm Survey and Future Roadmap
Zakia Zaman, Praveen Gauravaram, Mahbub Hassan, Sanjay Jha, Wen Hu
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.13570 returned HTTP 429 (rate limited).
[569] MDM-Prime-v2: Binary Encoding and Index Shuffling Enable Compute-optimal Scaling of Diffusion Language Models
Chen-Hao Chao, Wei-Fang Sun, Junwei Quan, Chun-Yi Lee, Rahul G. Krishnan
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.16077 returned HTTP 429 (rate limited).
[570] AdaMuS: Adaptive Multi-view Sparsity Learning for Dimensionally Unbalanced Data
Cai Xu, Changhao Sun, Ziyu Guan, Wei Zhao
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.17610 returned HTTP 429 (rate limited).
[571] Operator-Theoretic Foundations and Policy Gradient Methods for General MDPs with Unbounded Costs
Abhishek Gupta, Aditya Mahajan
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.17875 returned HTTP 429 (rate limited).
[572] FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization
Chiyu Ma, Shuo Yang, Kexin Huang, Jinda Lu, Haoming Meng, Shangshang Wang, Bolin Ding, Soroush Vosoughi, Guoyin Wang, Jingren Zhou
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.19835 returned HTTP 429 (rate limited).
[573] SkillRouter: Skill Routing for LLM Agents at Scale
YanZhao Zheng, ZhenTao Zhang, Chao Ma, YuanQiang Yu, JiHuai Zhu, Yong Wu, Tianze Xu, Baohua Dong, Hangcheng Zhu, Ruohui Huang, Gang Yu
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.22455 returned HTTP 429 (rate limited).
[574] Optimal High-Probability Regret for Online Convex Optimization with Two-Point Bandit Feedback
Haishan Ye
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.25029 returned HTTP 429 (rate limited).
[575] Principal Prototype Analysis on Manifold for Interpretable Reinforcement Learning
Bodla Krishna Vamshi, Haizhao Yang
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.27971 returned HTTP 429 (rate limited).
[576] Neural Federated Learning for Livestock Growth Prediction
Shoujin Wang, Mingze Ni, Wei Liu, Victor W. Chu, Bryan Zheng, Ayush Kanwal, Roy Jing Yang, Kenny Sabir, Fang Chen
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.28117 returned HTTP 429 (rate limited).
[577] Quadratic Gradient: A Unified Framework Bridging Gradient Descent and Newton-Type Methods by Synthesizing Hessians and Gradients
John Chiang
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2209.03282 returned HTTP 429 (rate limited).
[578] Cheap Bootstrap for Fast Uncertainty Quantification of Stochastic Gradient Descent
Henry Lam, Zitong Wang
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2310.11065 returned HTTP 429 (rate limited).
[579] Exploring Prime Number Classification: Achieving High Recall Rate and Rapid Convergence with Sparse Encoding
Serin Lee, S. Kim
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2402.03363 returned HTTP 429 (rate limited).
[580] Measuring the Predictability of Recommender Systems using Structural Complexity Metrics
Andrés Abeliuk, Alfonso Valderrama, Simón Campos, Marcelo Mendoza
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2404.08829 returned HTTP 429 (rate limited).
[581] JKO for Landau: a variational particle method for homogeneous Landau equation
Yan Huang, Li Wang
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2409.12296 returned HTTP 429 (rate limited).
[582] Tightening convex relaxations of trained neural networks: a unified approach for convex and S-shaped activations
Pablo Carrasco, Gonzalo Muñoz
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2410.23362 returned HTTP 429 (rate limited).
[583] Real-Time Operator Takeover for Visuomotor Diffusion Policy Training
Marco Moletta, Michael C. Welle, Nils Ingelhag, Jesper Munkeby, Danica Kragic
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2502.02308 returned HTTP 429 (rate limited).
[584] DeepRV: Accelerating Spatiotemporal Inference with Pre-trained Neural Priors
Jhonathan Navott, Daniel Jenson, Seth Flaxman, Elizaveta Semenova
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2503.21473 returned HTTP 429 (rate limited).
[585] Extracting Interpretable Models from Tree Ensembles: Computational and Statistical Perspectives
Brian Liu, Rahul Mazumder, Peter Radchenko
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2506.20114 returned HTTP 429 (rate limited).
[586] Inference on Optimal Policy Values and Other Irregular Functionals via Softmax Smoothing
Justin Whitehouse, Qizhao Chen, Morgane Austern, Vasilis Syrgkanis
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2507.11780 returned HTTP 429 (rate limited).
[587] Bayesian Modeling and Estimation of Linear Time-Varying Systems using Neural Networks and Gaussian Processes
Yaniv Shulman
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2507.12878 returned HTTP 429 (rate limited).
[588] NES: An Instruction-Free, Low-Latency Next Edit Suggestion Framework Powered by Learned Historical Editing Trajectories
Xinfang Chen, Siyang Xiao, Xianying Zhu, Junhong Xie, Ming Liang, Dajun Chen, Wei Jiang, Yong Li, Peng Di
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2508.02473 returned HTTP 429 (rate limited).
[589] Bayesian Additive Regression Trees for functional ANOVA model
Seokhun Park, Insung Kong, Yongdai Kim
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2509.03317 returned HTTP 429 (rate limited).
[590] A Machine Learning Based Explainability Framework for Interpreting Swarm Intelligence
Nitin Gupta, Bapi Dutta, Anupam Yadav
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2509.06272 returned HTTP 429 (rate limited).
[591] Joint Cooperative and Non-Cooperative Localization in WSNs with Distributed Scaled Proximal ADMM Algorithms
Qiaojia Zhu, Xiaojing Shen, Haiqi Liu, Pramod K. Varshney
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2509.18213 returned HTTP 429 (rate limited).
[592] The Pareto Frontier of Resilient Jet Tagging
Rikab Gambhir, Matt LeBlanc, Yuanchen Zhou
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2509.19431 returned HTTP 429 (rate limited).
[593] Smooth Quasar-Convex Optimization with Constraints
David Martínez-Rubio
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2510.01943 returned HTTP 429 (rate limited).
[594] MCbiF: Measuring Topological Autocorrelation in Multiscale Clusterings via 2-Parameter Persistent Homology
Juni Schindler, Mauricio Barahona
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2510.14710 returned HTTP 429 (rate limited).
[595] Which Similarity-Sensitive Entropy (Sentropy)?
Phuc Nguyen, Josiah Couch, Rahul Bansal, Alexandra Morgan, Chris Tam, Miao Li, Rima Arnaout, Ramy Arnaout
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2511.03849 returned HTTP 429 (rate limited).
[596] Drift Estimation for Diffusion Processes Using Neural Networks Based on Discretely Observed Independent Paths
Yuzhen Zhao, Yating Liu, Marc Hoffmann
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2511.11161 returned HTTP 429 (rate limited).
[597] Faster Molecular Dynamics with Neural Network Potentials via Distilled Multiple Time-Stepping and Non-Conservative Forces
Nicolaï Gouraud, Côme Cattin, Thomas Plé, Olivier Adjoua, Louis Lagardère, Jean-Philip Piquemal
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2602.14975 returned HTTP 429 (rate limited).
[598] Entropy-Aware Task Offloading in Mobile Edge Computing
Mohsen Sahraei Ardakani, Hong Wan, Rui Song
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.16949 returned HTTP 429 (rate limited).
cs.MA
[599] The impact of multi-agent debate protocols on debate quality: a controlled case study
Ramtin Zargari Marandi
Main category: cs.MA
TL;DR: Multi-agent debate protocols (Within-Round, Cross-Round, Rank-Adaptive Cross-Round) are compared to isolate protocol effects from model effects, revealing trade-offs between peer-referencing and consensus formation.
Details
Motivation: Previous multi-agent debate (MAD) systems report performance gains but conflate protocol effects with model effects, since debate protocols are typically held fixed while model factors vary. The authors aim to isolate protocol effects to understand how different debate structures influence outcomes.
Method: Compares three debate protocols: Within-Round (WR), where agents see only current-round contributions; Cross-Round (CR), where agents see the full prior-round context; and the novel Rank-Adaptive Cross-Round (RA-CR), which dynamically reorders agents and silences one per round via an external judge model. These are tested against a No-Interaction (NI) baseline in which agents respond independently without peer visibility. Evaluation uses a controlled macroeconomic case study with 20 diverse events, five random seeds, and matched prompts/decoding.
Result: RA-CR achieves faster convergence than CR, WR shows higher peer-referencing, and NI maximizes Argument Diversity (which remains unaffected across the main protocols). Results reveal a trade-off between interaction (peer-referencing rate) and convergence (consensus formation). When consensus is prioritized, RA-CR outperforms the other protocols.
Conclusion: Protocol design significantly matters in multi-agent debate systems, with different protocols offering different trade-offs between interaction and convergence. The novel RA-CR protocol is particularly effective when consensus formation is prioritized, demonstrating that carefully designed debate protocols can enhance system performance independent of model improvements.
Abstract: In multi-agent debate (MAD) systems, performance gains are often reported; however, because the debate protocol (e.g., number of agents, rounds, and aggregation rule) is typically held fixed while model-related factors vary, it is difficult to disentangle protocol effects from model effects. To isolate these effects, we compare three main protocols, Within-Round (WR; agents see only current-round contributions), Cross-Round (CR; full prior-round context), and novel Rank-Adaptive Cross-Round (RA-CR; dynamically reorders agents and silences one per round via an external judge model), against a No-Interaction baseline (NI; independent responses without peer visibility). In a controlled macroeconomic case study (20 diverse events, five random seeds, matched prompts/decoding), RA-CR achieves faster convergence than CR, WR shows higher peer-referencing, and NI maximizes Argument Diversity (unaffected across the main protocols). These results reveal a trade-off between interaction (peer-referencing rate) and convergence (consensus formation), confirming protocol design matters. When consensus is prioritized, RA-CR outperforms the others.
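The protocol differences reduce to a context-assembly rule: what each agent is allowed to see when it speaks. The following is a minimal, hypothetical sketch of those visibility rules; the function and data-structure names are illustrative and not taken from the paper, and the paper's judge-driven reordering in RA-CR is abstracted to a single `silenced` argument.

```python
# Hypothetical sketch of the peer context each debate protocol exposes.
# transcript: list of rounds; each round is a dict {agent_name: message}.

def build_context(protocol, transcript, agent, round_no, silenced=None):
    """Return the peer messages visible to `agent` speaking in `round_no`."""
    if protocol == "NI":
        # No-Interaction baseline: agents respond independently.
        return []
    if protocol == "WR":
        # Within-Round: only contributions already made in the current round.
        current = transcript[round_no] if round_no < len(transcript) else {}
        return [m for a, m in current.items() if a != agent]
    if protocol in ("CR", "RA-CR"):
        # Cross-Round: full prior-round context is visible.
        visible = []
        for past in transcript[:round_no]:
            for a, m in past.items():
                # RA-CR: an external judge silences one agent per round,
                # withholding its contribution from peers.
                if protocol == "RA-CR" and a == silenced:
                    continue
                if a != agent:
                    visible.append(m)
        return visible
    raise ValueError(f"unknown protocol: {protocol}")
```

Under this framing, the reported trade-off is intuitive: WR maximizes within-round peer-referencing, while CR/RA-CR carry forward prior rounds and so converge toward consensus faster.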
[600] An Empirical Study of Multi-Agent Collaboration for Automated Research
Yang Shen, Zhenyi Yi, Ziyi Zhao, Lijun Sun, Dongyang Li, Chin-Teng Lin, Yuhui Shi
Main category: cs.MA
TL;DR: Systematic empirical study comparing multi-agent coordination frameworks for automated machine learning optimization, revealing trade-offs between operational stability and theoretical deliberation.
Details
Motivation: As AI agents evolve from single LLMs to Multi-Agent Systems (MAS), the optimal coordination framework for autonomous agents remains unexplored, particularly for automated research tasks such as machine learning optimization.
Method: Uses a controlled, execution-based testbed with Git worktree isolation and explicit global memory to benchmark a single-agent baseline against two multi-agent paradigms: a subagent architecture (parallel exploration with post-hoc consolidation) and an agent team architecture (experts with pre-execution handoffs), evaluated under fixed computational time budgets.
Result: Found fundamental trade-off: subagent mode is highly resilient and optimal for broad, shallow optimizations under strict time constraints, while agent team topology has higher operational fragility but achieves deep theoretical alignment for complex architectural refactoring with extended compute budgets.
Conclusion: Provides actionable guidelines for designing future autoresearch systems, advocating for dynamically routed architectures that adapt collaborative structures to real-time task complexity.
Abstract: As AI agents evolve, the community is rapidly shifting from single Large Language Models (LLMs) to Multi-Agent Systems (MAS) to overcome cognitive bottlenecks in automated research. However, the optimal multi-agent coordination framework for these autonomous agents remains largely unexplored. In this paper, we present a systematic empirical study investigating the comparative efficacy of distinct multi-agent structures for automated machine learning optimization. Utilizing a rigorously controlled, execution-based testbed equipped with Git worktree isolation and explicit global memory, we benchmark a single-agent baseline against two multi-agent paradigms: a subagent architecture (parallel exploration with post-hoc consolidation) and an agent team architecture (experts with pre-execution handoffs). By evaluating these systems under strictly fixed computational time budgets, our findings reveal a fundamental trade-off between operational stability and theoretical deliberation. The subagent mode functions as a highly resilient, high-throughput search engine optimal for broad, shallow optimizations under strict time constraints. Conversely, the agent team topology exhibits higher operational fragility due to multi-author code generation but achieves the deep theoretical alignment necessary for complex architectural refactoring given extended compute budgets. These empirical insights provide actionable guidelines for designing future autoresearch systems, advocating for dynamically routed architectures that adapt their collaborative structures to real-time task complexity.
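The two paradigms the study compares can be sketched as control-flow skeletons: subagents run in parallel and a consolidation step picks a winner, while an agent team hands a draft from expert to expert before execution. This is a toy sketch under assumed interfaces (each `agent` is a callable returning an `artifact` and a `score`); the paper's actual testbed uses LLM agents with Git worktree isolation and is far richer than this skeleton.

```python
# Illustrative sketch of the two coordination paradigms; all names are
# hypothetical, not the paper's implementation.
from concurrent.futures import ThreadPoolExecutor

def subagent_mode(task, agents):
    """Parallel exploration with post-hoc consolidation: every agent
    attempts the task independently; the best-scoring result wins."""
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        candidates = list(pool.map(lambda a: a(task), agents))
    return max(candidates, key=lambda c: c["score"])  # consolidation

def agent_team_mode(task, agents):
    """Experts with pre-execution handoffs: each agent refines the
    previous agent's artifact in sequence before anything is executed."""
    draft = task
    for agent in agents:
        draft = agent(draft)["artifact"]  # handoff to the next expert
    return {"artifact": draft, "score": None}
```

The study's trade-off maps directly onto these skeletons: the parallel branch is resilient (one failed candidate is simply outscored) but shallow, whereas the sequential handoff chain is fragile (any broken handoff corrupts the draft) but allows each expert to build on accumulated reasoning.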
[601] IndoorR2X: Indoor Robot-to-Everything Coordination with LLM-Driven Planning
Fan Yang, Soumya Teotia, Shaunak A. Mehta, Prajit KrisshnaKumar, Quanting Xie, Jun Liu, Yueqi Song, Wenkai Li, Atsunori Moteki, Kanji Uchino, Yonatan Bisk
Main category: cs.MA
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limiting) for this entry.
Details
Abstract: Unavailable. Fetching 2603.20182 failed with HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.20182&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
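The HTTP 429 responses above are arXiv's rate limiting; a digest fetcher can usually avoid them by backing off and retrying. A minimal stdlib-only sketch (the `fetch_with_backoff` helper and its injectable `opener` parameter are hypothetical, not part of any digest pipeline described here):

```python
import time
import urllib.error
import urllib.request

def fetch_with_backoff(url, max_retries=5, base_delay=3.0,
                       opener=urllib.request.urlopen):
    """Fetch `url`, retrying on HTTP 429 with exponential backoff.

    `opener` is injectable purely so the retry logic can be exercised
    without touching the network.
    """
    for attempt in range(max_retries):
        try:
            return opener(url)
        except urllib.error.HTTPError as err:
            # Re-raise anything that is not rate limiting, and give up
            # after the final attempt.
            if err.code != 429 or attempt == max_retries - 1:
                raise
            # Back off: base_delay, 2x, 4x, ... before the next attempt.
            time.sleep(base_delay * (2 ** attempt))
```

With the default `base_delay=3.0` this waits 3 s, 6 s, 12 s, ... between attempts, which is typically enough to clear a burst-rate limit.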
[602] “What Did It Actually Do?”: Understanding Risk Awareness and Traceability for Computer-Use Agents
Zifan Peng, Mingchen Li
Main category: cs.MA
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limiting) for this entry.
Details
Abstract: Unavailable. Fetching 2603.28551 failed with HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.28551&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
cs.MM
[603] From Natural Alignment to Conditional Controllability in Multimodal Dialogue
Zeyu Jin, Songtao Zhou, Haoyu Wang, Minghao Tian, Kaifeng Yun, Zhuo Chen, Xiaoyu Qin, Jia Jia
Main category: cs.MM
TL;DR: Introduces MM-Dia dataset for multimodal dialogue generation with fine-grained annotations from movies/TV, enabling style-controllable speech synthesis and cross-modal consistency evaluation.
Details
Motivation: Current multimodal dialogue generation methods lack controllability and expressive diversity. Existing datasets insufficiently capture rich human interaction characteristics across speech, vision, and text modalities.
Method: Developed novel multimodal dialogue annotation pipeline to curate dialogues from movies/TV series with fine-grained annotations. Created MM-Dia dataset (360+ hours, 54,700 dialogues) for explicit control and MM-Dia-Bench (309 dialogues) for implicit cross-modal evaluation.
Result: Training on MM-Dia significantly enhances fine-grained controllability in dialogue generation. Current frameworks struggle to replicate nuanced human expressiveness, as revealed by MM-Dia-Bench evaluations.
Conclusion: The work provides new insights and challenges for multimodal conditional dialogue generation, highlighting limitations in current models’ ability to capture complex human interaction expressiveness across modalities.
Abstract: The recent advancement of Artificial Intelligence Generated Content (AIGC) has led to significant strides in modeling human interaction, particularly in the context of multimodal dialogue. While current methods impressively generate realistic dialogue in isolated modalities like speech or vision, challenges remain in controllable Multimodal Dialogue Generation (MDG). This paper focuses on the natural alignment between speech, vision, and text in human interaction, aiming for expressive dialogue generation through multimodal conditional control. To address the insufficient richness and diversity of dialogue expressiveness in existing datasets, we introduce a novel multimodal dialogue annotation pipeline to curate dialogues from movies and TV series with fine-grained annotations in interactional characteristics. The resulting MM-Dia dataset (360+ hours, 54,700 dialogues) facilitates explicitly controlled MDG, specifically through style-controllable dialogue speech synthesis. In parallel, MM-Dia-Bench (309 highly expressive dialogues with visible single-/dual-speaker scenes) serves as a rigorous testbed for implicit cross-modal MDG control, evaluating audio-visual style consistency across modalities. Extensive experiments demonstrate that training on MM-Dia significantly enhances fine-grained controllability, while evaluations on MM-Dia-Bench reveal limitations in current frameworks to replicate the nuanced expressiveness of human interaction. These findings provide new insights and challenges for multimodal conditional dialogue generation.
[604] Subjective Quality Assessment of Dynamic 3D Meshes in Virtual Reality Environment
Duc V. Nguyen, Nguyen Thi Quynh Ly, Truong Thu Huong
Main category: cs.MM
TL;DR: Subjective quality evaluation of dynamic 3D meshes in VR shows half of mesh faces can be removed without noticeable QoE degradation, leading to a new QoE prediction model and resource allocation framework.
Details
Motivation: Dynamic 3D meshes are crucial for VR but require heavy processing. Level-of-detail adjustments based on viewing distance can reduce processing while maintaining user experience, but systematic evaluation of how these factors affect perception was needed.
Method: Conducted extensive subjective quality evaluation in VR environment with 320 test stimuli from 8 dynamic 3D meshes. Analyzed effects of level-of-detail and viewing distance on user perception, evaluated objective quality metrics, and developed novel QoE prediction model.
Result: Half of mesh faces can be removed without noticeable QoE degradation. Both 2D and 3D objective metrics have low correlation with subjective scores. Developed accurate QoE prediction model and QoE-aware resource allocation framework that significantly improves total QoE under resource constraints.
Conclusion: Systematic subjective evaluation reveals significant potential for mesh simplification in VR without compromising user experience. The proposed QoE prediction model and resource allocation framework enable efficient processing of dynamic 3D meshes while maintaining high quality of experience.
Abstract: A dynamic 3D mesh is a key component in Virtual Reality applications. However, this type of content demands a significant processing resource for real-time rendering. To reduce processing requirements while preserving the user experience, adjusting the level of detail of 3D meshes based on viewing distance has been proposed. In this paper, we conduct an extensive subjective quality evaluation to investigate the effects of the level of detail and viewing distance on user perception of dynamic 3D meshes in a VR environment. Our evaluation results in a subjective dataset containing user ratings of 320 test stimuli generated from eight dynamic 3D meshes. Result analysis shows that it is possible to remove half of a mesh’s faces without causing noticeable degradation in user Quality of Experience (QoE). An evaluation of popular objective quality metrics reveals that both 2D-based and 3D-based metrics have low correlation with subjective scores. Based on the subjective dataset, we develop a novel QoE prediction model that can accurately predict the MOS of a dynamic 3D mesh at a given level of detail and viewing distance. In addition, a QoE-aware resource allocation framework is proposed and evaluated under different resource constraints, showing significant improvement in the total QoE compared to conventional methods.
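The QoE-aware resource allocation framework in the abstract is not specified in detail here; purely as an illustration, a greedy scheme that spends a shared rendering budget on the level-of-detail upgrades with the best marginal predicted QoE per unit cost could look like this (`allocate_lod`, the option lists, and the budget units are all hypothetical, not the authors' method):

```python
def allocate_lod(meshes, budget):
    """Greedy level-of-detail allocation under a shared budget (illustrative).

    `meshes` maps a mesh id to a list of (cost, predicted_qoe) options
    sorted by increasing cost. Every mesh starts at its cheapest option;
    we then repeatedly apply the upgrade with the best marginal QoE per
    unit cost that still fits the remaining budget.
    """
    choice = {m: 0 for m in meshes}          # index of the chosen option
    spent = sum(opts[0][0] for opts in meshes.values())
    if spent > budget:
        raise ValueError("budget below the cost of the minimum LODs")
    while True:
        best, best_ratio = None, 0.0
        for m, opts in meshes.items():
            i = choice[m]
            if i + 1 >= len(opts):
                continue                     # already at the highest LOD
            dcost = opts[i + 1][0] - opts[i][0]
            dqoe = opts[i + 1][1] - opts[i][1]
            if dcost > 0 and spent + dcost <= budget and dqoe / dcost > best_ratio:
                best, best_ratio = m, dqoe / dcost
        if best is None:
            return choice                    # no affordable upgrade left
        i = choice[best]
        spent += meshes[best][i + 1][0] - meshes[best][i][0]
        choice[best] = i + 1
```

The paper's finding that half of a mesh's faces can be dropped without noticeable QoE loss is what makes such trade-off curves non-trivial: early upgrades buy little perceived quality.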
[605] Editing on the Generative Manifold: A Theoretical and Empirical Study of General Diffusion-Based Image Editing Trade-offs
Yi Hu, Leying Yi, Emily Davis, Finn Carter
Main category: cs.MM
TL;DR: Theoretical and empirical analysis of diffusion-based image editing, formalizing editing as guided transport on learned image manifolds and evaluating core desiderata like controllability, faithfulness, and locality.
Details
Motivation: Despite rapid progress in diffusion-based editing tools, there's no unified framework to evaluate core usability requirements: controllability, faithfulness to user intent, semantic consistency, locality, and perceptual quality.
Method: Formalizes editing as operator induced by conditional reverse-time generative process; develops theory for edit dynamics under noise-injection/denoising transport, inversion-and-edit pipelines, and locality constraints; derives bounds connecting guidance strength and inversion error to non-target region deviations.
Result: Provides theoretical bounds under Lipschitz assumptions on learned score/flow fields, characterizes accumulation effects under iterative editing, and benchmarks representative paradigms with task-agnostic metrics.
Conclusion: Offers unified theoretical framework connecting diverse diffusion editing paradigms through common view of guided transport on image manifolds, enabling systematic evaluation of core editing desiderata.
Abstract: Diffusion-based editing has rapidly evolved from curated inpainting tools into general-purpose editors spanning text-guided instruction following, mask-localized edits, drag-based geometric manipulation, exemplar transfer, and training-free composition systems. Despite strong empirical progress, the field lacks a unified treatment of core desiderata that govern practical usability: controllability (how precisely and continuously the user can specify an edit), faithfulness to user intent (semantic alignment to instructions), semantic consistency (preservation of identity and non-target content), locality (containment of changes), and perceptual quality (artifact suppression and detail retention). This paper provides a theoretical and empirical analysis of general diffusion-based image editing, connecting diverse paradigms through a common view of editing as guided transport on a learned image manifold. We first formalize editing as an operator induced by a conditional reverse-time generative process and define task-agnostic metrics capturing instruction adherence, region preservation, semantic consistency, and stability under repeated edits. We then develop theory describing edit dynamics under (i) noise-injection and denoising transport, (ii) inversion-and-edit pipelines and the propagation of inversion errors, and (iii) locality constraints implemented via masked guidance or hard constraints. Under mild Lipschitz assumptions on the learned score or flow field, we derive bounds connecting guidance strength and inversion error to measurable deviations in non-target regions, and we characterize accumulation effects under iterative multi-turn editing. Empirically, we benchmark representative paradigms.
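The "accumulation effects under iterative multi-turn editing" typically take the form of a geometric-series composition bound. As an illustrative reconstruction only (not the paper's exact statement or constants): if each edit operator is $L$-Lipschitz and each pass introduces at most $\varepsilon$ of inversion error, then after $T$ edits the deviation from the error-free trajectory satisfies

```latex
\[
\|x_T - x_T^{\star}\| \;\le\; \varepsilon \sum_{k=0}^{T-1} L^{k}
\;=\; \varepsilon\,\frac{L^{T}-1}{L-1} \qquad (L \neq 1),
\]
```

which grows linearly as $T\varepsilon$ when $L = 1$ and geometrically when $L > 1$, matching the intuition that repeated inversion-and-edit cycles compound small per-step errors.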
eess.AS
[606] Asymmetric Encoder-Decoder Based on Time-Frequency Correlation for Speech Separation
Ui-Hyeop Shin, Hyung-Min Park
Main category: eess.AS
TL;DR: SR-CorrNet: A TF-domain speech separation model using separation-reconstruction strategy with correlation-based filter estimation for handling overlapping speakers, noise, and reverberation.
Details
Motivation: Speech separation in realistic acoustic environments is challenging due to overlapping speakers, background noise, and reverberation. Existing TF-domain models suffer from information bottlenecks in late-split architectures where speaker disentanglement is deferred to final stages.
Method: Asymmetric encoder-decoder framework with separation-reconstruction strategy in TF dual-path backbone. Encoder performs coarse separation, decoder reconstructs speaker-discriminative features. Formulates separation as structured correlation-to-filter problem using spatio-spectro-temporal correlations as input features. Includes attractor-based dynamic split module.
Result: Consistent improvements on WSJ0-2/3/4/5Mix, WHAMR!, and LibriCSS datasets across anechoic, noisy-reverberant, and real-recorded conditions in both single- and multi-channel settings.
Conclusion: TF-domain separation-reconstruction with correlation-based filter estimation is effective for speech separation in diverse acoustic conditions.
Abstract: Speech separation in realistic acoustic environments remains challenging because overlapping speakers, background noise, and reverberation must be resolved simultaneously. Although recent time-frequency (TF) domain models have shown strong performance, most still rely on late-split architectures, where speaker disentanglement is deferred to the final stage, creating an information bottleneck and weakening discriminability under adverse conditions. To address this issue, we propose SR-CorrNet, an asymmetric encoder-decoder framework that introduces the separation-reconstruction (SepRe) strategy into a TF dual-path backbone. The encoder performs coarse separation from mixture observations, while the weight-shared decoder progressively reconstructs speaker-discriminative features with cross-speaker interaction, enabling stage-wise refinement. To complement this architecture, we formulate speech separation as a structured correlation-to-filter problem: spatio-spectro-temporal correlations computed from the observations are used as input features, and the corresponding deep filters are estimated to recover target signals. We further incorporate an attractor-based dynamic split module to adapt the number of output streams to the actual speaker configuration. Experimental results on WSJ0-2/3/4/5Mix, WHAMR!, and LibriCSS demonstrate consistent improvements across anechoic, noisy-reverberant, and real-recorded conditions in both single- and multi-channel settings, highlighting the effectiveness of TF-domain SepRe with correlation-based filter estimation for speech separation.
[607] Advancing LLM-based phoneme-to-grapheme for multilingual speech recognition
Lukuang Dong, Ziwei Li, Saierdaer Yusuyin, Xianyu Zhao, Zhijian Ou
Main category: eess.AS
TL;DR: Multilingual phoneme-to-grapheme conversion using LLMs with robustness strategies for ASR, achieving significant WER reduction on CV-Lang10 benchmark.
Details
Motivation: Phoneme-based ASR separates speech-to-phoneme (acoustic) and phoneme-to-grapheme (orthographic) tasks, enabling cross-lingual acoustic sharing. While LLMs show promise for P2G, multilingual P2G faces challenges including language-aware generation and severe cross-language data imbalance.
Method: Study multilingual LLM-based P2G on CV-Lang10 benchmark (10 languages). Examine robustness strategies accounting for S2P uncertainty: DANP and Simplified SKM (S-SKM). S-SKM is a Monte Carlo approximation that avoids CTC-based S2P probability weighting in P2G training. Also employ robust training and low-resource oversampling.
Result: Robust training and low-resource oversampling reduce average WER from 10.56% to 7.66% on the CV-Lang10 benchmark.
Conclusion: The proposed methods effectively address multilingual P2G challenges, significantly improving ASR performance through robustness strategies and data balancing techniques.
Abstract: Phoneme-based ASR factorizes recognition into speech-to-phoneme (S2P) and phoneme-to-grapheme (P2G), enabling cross-lingual acoustic sharing while keeping language-specific orthography in a separate module. While large language models (LLMs) are promising for P2G, multilingual P2G remains challenging due to language-aware generation and severe cross-language data imbalance. We study multilingual LLM-based P2G on the ten-language CV-Lang10 benchmark. We examine robustness strategies that account for S2P uncertainty, including DANP and Simplified SKM (S-SKM). S-SKM is a Monte Carlo approximation that avoids CTC-based S2P probability weighting in P2G training. Robust training and low-resource oversampling reduce the average WER from 10.56% to 7.66%.
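The WER figures above (10.56% reduced to 7.66%) are the standard word error rate: edit distance between reference and hypothesis word sequences, normalized by reference length. A minimal textbook implementation for reference (not the paper's evaluation script):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    via the standard Levenshtein dynamic program over words."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                       # delete all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j                       # insert all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, `word_error_rate("a b c d", "a x c")` counts one substitution and one deletion against four reference words, giving 0.5.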
[608] VAANI: Capturing the language landscape for an inclusive digital India
Sujith Pulikodan, Abhayjeet Singh, Agneedh Basu, Nihar Desai, Pavan Kumar J, Pranav D Bhat, Raghu Dharmaraju, Ritika Gupta, Sathvik Udupa, Saurabh Kumar, Sumit Sharma, Vaibhav Vishwakarma, Visruth Sanka, Dinesh Tewari, Harsh Dhand, Amrita Kamat, Sukhwinder Singh, Shikhar Vashishth, Partha Talukdar, Raj Acharya, Prasanta Kumar Ghosh
Main category: eess.AS
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limiting) for this entry.
Details
Abstract: Unavailable. Fetching 2603.28714 failed with HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.28714&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
eess.IV
[609] End-to-end optimization of sparse ultrasound linear probes
Sergio Urrea, Adrian Basarab, Hervé Liebgott, Henry Arguello
Main category: eess.IV
TL;DR: End-to-end optimization framework jointly learns sparse ultrasound array configuration and image reconstruction using differentiable physics models and neural networks to reduce hardware complexity while maintaining image quality.
Details
Motivation: Ultrasound imaging faces trade-off between image quality and hardware complexity from dense transducer arrays. Sparse arrays can reduce complexity but need optimization to maintain image quality.
Method: Proposes end-to-end framework integrating differentiable Image Formation Model with hard Straight-Through Estimator selection mask, unrolled Iterative Soft-Thresholding Algorithm deconvolution, and residual CNN. Combines physical consistency (PSF, convolutional model) with structural fidelity metrics (contrast, SLR, entropy, diversity).
Result: Simulations with 3.5MHz probe show learned configuration preserves axial and lateral resolution with half the active elements, enabling compact, cost-efficient ultrasound probe design without sacrificing image quality.
Conclusion: Physics-guided, data-driven approach enables optimized sparse array design for ultrasound imaging, expandable to 3-D volumetric imaging, balancing hardware complexity and image quality.
Abstract: Ultrasound imaging faces a trade-off between image quality and hardware complexity caused by dense transducers. Sparse arrays are one popular solution to mitigate this challenge. This work proposes an end-to-end optimization framework that jointly learns sparse array configuration and image reconstruction. The framework integrates a differentiable Image Formation Model with a hard Straight-Through Estimator (STE) selection mask, unrolled Iterative Soft-Thresholding Algorithm (ISTA) deconvolution, and a residual Convolutional Neural Network (CNN). The objective combines physical consistency (Point Spread Function (PSF) and convolutional formation model) with structural fidelity (contrast, Side-Lobe-Ratio (SLR), entropy, and row diversity). Simulations using a 3.5 MHz probe show that the learned configuration preserves axial and lateral resolution with half of the active elements. This physics-guided, data-driven approach enables compact, cost-efficient ultrasound probe design without sacrificing image quality, and it is expandable to 3-D volumetric imaging.
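The STE selection mask above trains a binary (hence non-differentiable) choice of active elements by using the hard mask in the forward pass and the identity gradient in the backward pass. A framework-free sketch of that general idea (function names are illustrative, not from the paper, and a real implementation would live inside an autograd framework):

```python
def hard_topk_mask(scores, k):
    """Forward pass: binary mask keeping the k highest-scoring elements
    (here standing in for active transducer elements)."""
    keep = set(sorted(range(len(scores)), key=lambda i: scores[i])[-k:])
    return [1.0 if i in keep else 0.0 for i in range(len(scores))]

def ste_backward(upstream_grad):
    """Backward pass: the hard mask has zero gradient almost everywhere,
    so the straight-through estimator copies the upstream gradient to
    the underlying scores as if the masking were the identity."""
    return list(upstream_grad)
```

The learnable `scores` thus receive useful gradients even though the forward computation only ever sees a strict 0/1 mask, which is what lets the array configuration and the reconstruction network be optimized jointly.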
[610] Retinal Malady Classification using AI: A novel ViT-SVM combination architecture
Shashwat Jha, Vishvaditya Luhach, Raju Poddar
Main category: eess.IV
TL;DR: Vision Transformer-SVM hybrid architecture for automated classification of retinal defects in OCT scans
Details
Motivation: Early detection of macular holes, central serous retinopathy and diabetic retinopathy is crucial to prevent vision loss, requiring automated classification of OCT scans.
Method: Hybrid Vision Transformer and Support Vector Machine (ViT-SVM) architecture for classifying optical coherence tomography scans of retinal defects
Result: The study analyzes the performance of the ViT-SVM hybrid architecture for automated detection of retinal defects from OCT scans
Conclusion: ViT-SVM hybrid approach shows promise for automating early detection of retinal diseases through OCT scan classification
Abstract: Macular holes, central serous retinopathy and diabetic retinopathy are among the most widespread maladies of the eye, responsible for either partial or complete vision loss, making early detection of these defects critical for the well-being of the patient. This study introduces a hybrid Vision Transformer and Support Vector Machine architecture (ViT-SVM) and analyses its performance in classifying optical coherence tomography (OCT) scans, with the aim of automating the early detection of these retinal defects.
[611] Rich-U-Net: A medical image segmentation model for fusing spatial depth features and capturing minute structural details
Zhuoyi Fang, Kexuan Shi, Jiajia Liu, Qiang Han
Main category: eess.IV
TL;DR: Rich-U-Net improves medical image segmentation by integrating spatial and depth features through multi-level feature fusion for better fine structure detection.
Details
Motivation: Current medical image segmentation methods underperform in extracting spatial information and mining complex structures from medical images, limiting their ability to accurately segment fine anatomical details needed for clinical diagnosis.
Method: Proposes Rich-U-Net model that effectively integrates both spatial and depth features through multi-level and multi-dimensional feature fusion and optimization strategies to enhance fine structure detection in complex medical images.
Result: Experiments on ISIC2018, BUSI, GLAS, and CVC datasets show Rich-U-Net surpasses state-of-the-art models in Dice, IoU, and HD95 metrics, demonstrating superior segmentation performance.
Conclusion: Rich-U-Net’s feature fusion approach enables fine structure localization and accurate segmentation in medical images, addressing limitations of existing methods in spatial information extraction.
Abstract: Medical image segmentation is of great significance in analysis of illness. The use of deep neural networks in medical image segmentation can help doctors extract regions of interest from complex medical images, thereby improving diagnostic accuracy and enabling better assessment of the condition to formulate treatment plans. However, most current medical image segmentation methods underperform in accurately extracting spatial information from medical images and mining potential complex structures and variations. In this article, we introduce the Rich-U-Net model, which effectively integrates both spatial and depth features. This fusion enhances the model’s capability to detect fine structures and intricate details within complex medical images. Our multi-level and multi-dimensional feature fusion and optimization strategies enable our model to achieve fine structure localization and accurate segmentation results in medical image segmentation. Experiments on the ISIC2018, BUSI, GLAS, and CVC datasets show that Rich-U-Net surpasses other state-of-the-art models in Dice, IoU, and HD95 metrics.
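Dice, IoU, and HD95 are the metrics cited above; the first two reduce to simple overlap counts on binary masks. A minimal reference implementation (a generic formulation, not the paper's evaluation code):

```python
def dice_iou(pred, target):
    """Dice and IoU for two binary masks given as flat 0/1 sequences.

    Dice = 2|P∩T| / (|P|+|T|), IoU = |P∩T| / |P∪T|; both conventionally
    equal 1.0 when prediction and target are both empty.
    """
    inter = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    union = total - inter
    dice = 2.0 * inter / total if total else 1.0
    iou = inter / union if union else 1.0
    return dice, iou
```

For example, masks `[1, 1, 0, 0]` and `[1, 0, 1, 0]` overlap in one pixel out of two each, giving Dice 0.5 and IoU 1/3 (Dice is always at least as large as IoU on non-empty masks).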
[612] Polyhedral Unmixing: Bridging Semantic Segmentation with Hyperspectral Unmixing via Polyhedral-Cone Partitioning
Antoine Bottenmuller, Etienne Decencière, Petr Dokládal
Main category: eess.IV
TL;DR: A novel pipeline connecting semantic segmentation and hyperspectral unmixing by showing pixel classification induces polyhedral-cone regions, enabling blind unmixing from any segmentation.
Details
Motivation: Semantic segmentation and hyperspectral unmixing are complementary but typically addressed independently. The paper aims to bridge these two problems to improve interpretability and give users explicit control over the unmixing process.
Method: Under linear mixing model, shows pixel classification by dominant materials induces polyhedral-cone regions. Proposes segmentation-to-unmixing pipeline: uses any semantic segmentation to construct polyhedral-cone partition, computes signed distances, transforms via basis change, projects to probability simplex for initial abundance estimate, then extracts endmembers via matrix pseudo-inversion.
Result: Experiments on three real datasets show effectiveness when associated with appropriate clustering algorithms, with consistent improvements over recent deep and non-deep state-of-the-art methods.
Conclusion: The proposed approach successfully bridges segmentation and unmixing, offering interpretability, user control, and competitive performance while remaining lightweight and deterministic after segmentation.
Abstract: Semantic segmentation and hyperspectral unmixing are two central problems in spectral image analysis. The former assigns each pixel a discrete label corresponding to its material class, whereas the latter estimates pure material spectra, called endmembers, and, for each pixel, a vector representing material abundances in the observed scene. Despite their complementarity, these two problems are usually addressed independently. This paper aims to bridge these two lines of work by formally showing that, under the linear mixing model, pixel classification by dominant materials induces polyhedral-cone regions in the spectral space. We leverage this fundamental property to propose a direct segmentation-to-unmixing pipeline that performs blind hyperspectral unmixing from any semantic segmentation by constructing a polyhedral-cone partition of the space that best fits the labeled pixels. Signed distances from pixels to the estimated regions are then computed, linearly transformed via a change of basis in the distance space, and projected onto the probability simplex, yielding an initial abundance estimate. This estimate is used to extract endmembers and recover final abundances via matrix pseudo-inversion. Because the segmentation method can be freely chosen, the user gains explicit control over the unmixing process, while the rest of the pipeline remains essentially deterministic and lightweight. Beyond improving interpretability, experiments on three real datasets demonstrate the effectiveness of the proposed approach when associated with appropriate clustering algorithms, and show consistent improvements over recent deep and non-deep state-of-the-art methods. The code is available at: https://github.com/antoine-bottenmuller/polyhedral-unmixing
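The projection onto the probability simplex used for the initial abundance estimate is a standard operation; the common sort-based algorithm finds a threshold to subtract so that the clipped result sums to 1. A generic sketch (not necessarily the authors' implementation):

```python
def project_to_simplex(v):
    """Euclidean projection of a real vector onto the probability simplex
    (non-negative entries summing to 1), via the sort-based algorithm:
    find the threshold theta such that clipping v - theta at zero yields
    a vector summing to exactly 1."""
    u = sorted(v, reverse=True)
    cumsum, theta = 0.0, 0.0
    for i, ui in enumerate(u, start=1):
        cumsum += ui
        t = (cumsum - 1.0) / i
        if ui - t > 0:
            theta = t       # threshold from the largest feasible prefix
    return [max(x - theta, 0.0) for x in v]
```

Vectors already on the simplex are left unchanged, which makes the projection a safe post-processing step for the signed-distance features described above.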
[613] Time-resolved aortic 3D shape reconstruction from a limited number of cine 2D MRI slices
Gloria Wolkerstorfer, Stefano Buoso, Rabea Schlenker, Jochen von Spiczak, Robert Manka, Sebastian Kozerke
Main category: eess.IV
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limiting) for this entry.
Details
Abstract: Unavailable. Fetching 2602.11873 failed with HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.11873&sortBy=relevance&sortOrder=descending&start=0&max_results=100)